From: lexical@crl.nmsu.edu (Consortium for Lexical Research)
Date: Fri, 4 Mar 94 15:14:16 MST
Subject: CLR Newsletter No. 11
To: clrlist@crl.nmsu.edu
Reply-To: lexical@crl.nmsu.edu

********************************************************************

Greetings;

Enclosed is the Consortium for Lexical Research Newsletter No. 11.
This special edition is being mailed to a large compiled list of
researchers in linguistics and computational linguistics. If you do
not want to receive additional mailings, no response is necessary.
This is a one-time mailing and you will not be sent any future issues.

If you would like to be put on our mailing list and receive a CLR
newsletter every two months, please send your request to:
lexical@crl.nmsu.edu.

Hoping you enjoy the newsletter,

Katherine Mitchell
Consortium for Lexical Research

*********************************************************************

*************************************************
Consortium for Lexical Research
Newsletter 11
February 28, 1994
*************************************************
From the Computing Research Laboratory
New Mexico State University

Edited by: Jim Cowie and Katherine Mitchell
Contributions and inquiries to: lexical@nmsu.edu
FTP address for accessing materials: clr.nmsu.edu [128.123.1.12].
*************************************************

Introduction

This newsletter discusses machine readable dictionaries made available
through CLR. In addition, some recently acquired parsers are
described. The next newsletter will describe wordlists stored in our
archives, and a new service to CLR members, called Resources, which
will centralize information on ftp sites, organizations, projects,
publications, etc. of interest to the natural language processing
research community.

For more information on the Consortium, please ftp to our site and get
a copy of our catalog. It is available in plain ascii as `catalog' or
in a postscript version `catalog.ps'. Answers to any questions about
the archives or about becoming a member of CLR can be obtained by
emailing lexical@nmsu.edu.

This newsletter is distributed in plain ASCII text and in postscript
format. To obtain a copy please ftp to clr.nmsu.edu. The directory is
CLR/newsletter and the files to get are news11.txt and news11.ps.

*************************************************
Contents

1. Using FTP and Changes in our FTP site
2. Machine Readable Dictionaries
3. Recent Acquisitions
4. CLR Membership
*************************************************

Using Anonymous FTP

Materials stored in the CLR are constantly being updated, and new
acquisitions are made available. If you are interested in learning
what these items are, you are welcome to ftp our catalog.
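
For readers who would rather script the transfer than type it, the
anonymous FTP session that this section walks through could be
sketched in Python roughly as follows. This is only an illustration
under stated assumptions: the host name, the CLR directory, and the
file README.clr.site come from this newsletter, while the helper names
fetch_binary and fetch_readme are hypothetical, and the server may not
be reachable from your site. Python's ftplib retrbinary call switches
the connection to binary mode itself, which corresponds to typing
`binary' at the ftp> prompt.

```python
from ftplib import FTP

# Host name taken from the newsletter; reachability is an assumption.
HOST = "clr.nmsu.edu"


def fetch_binary(ftp, directory, filename, dest_path):
    """Change to `directory` and download `filename` in binary mode.

    `ftp` is any object offering cwd() and retrbinary() the way
    ftplib.FTP does. (Hypothetical helper, not part of the CLR site.)
    """
    ftp.cwd(directory)
    with open(dest_path, "wb") as out:
        # retrbinary sends TYPE I before RETR, i.e. a binary transfer.
        ftp.retrbinary("RETR " + filename, out.write)
    return dest_path


def fetch_readme(email, dest_path="README.clr.site"):
    """Log in anonymously (email address as password) and get the README."""
    with FTP(HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        return fetch_binary(ftp, "CLR", "README.clr.site", dest_path)
```

Calling fetch_readme("rose@ed.ac.uk") would perform the login and the
download. fetch_binary is kept separate so that it can be reused for
`catalog' or exercised against a stand-in connection object.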
Anonymous ftp allows non-members access to the catalogs and some
unrestricted data files. Here are the steps for using Anonymous FTP.
It is recommended that you get the file README.clr.site for an
introduction to using our archives.

>ftp clr.nmsu.edu          (or ftp 128.123.1.12)
login: anonymous
password:                  type in your email address;
                           for ex: rose@ed.ac.uk
ftp> cd CLR
ftp> binary                (it is very important to set the binary
                           mode when you are downloading software
                           programs. Failure to do this can corrupt
                           the transfer and cause problems with the
                           software when you use it)
ftp> get README.clr.site   (or get catalog)
ftp> quit

Members of CLR use a login name and a special password that they are
assigned. Members can access certain directories that non-members are
unable to use.

Changes at our FTP site

The archives have been reorganized and all CLR materials are now under
the directory CLR. Within the CLR area there are reserved materials
filed in the directories named members-only and MUC5, and freely
available materials under the directories multiling, lexica, and
tools. There are new README files, and the file README.clr.site will
answer basic questions about the Consortium and accessing its
materials through ftp.

*************************************************
MACHINE READABLE DICTIONARIES
************************************************

Large dictionaries with full semantic information are not freely
available. The first section below describes the MRDs which CLR
distributes, along with their costs. Sample electronic files are
available for some of these dictionaries; please email us if you would
like to see electronic samples. The Consortium has very good working
relationships with these publishers, and will facilitate the paperwork
and expedite the materials.

There are a variety of freely available dictionaries which have
pronunciation information, some syntactic information, or some
features information, but none with full definitions of headwords.
As a service to CLR members we are gathering these dictionaries and
trying to build a complete centralized archive of what is available.

****Dictionaries with Semantic Information: not freely available****

1) Collins English Dictionary

Collins English Dictionary, 3rd Edition, published in 1991. A revised
edition, the Revised 3rd Edition, will be issued later this year. The
CED3 contains 180,000 references, 190,000 numbered definitions, 14,000
new entries and updated entries from the previous edition, and 16,000
biographical and geographical entries; it has 3.5 million words of
text. It is a very extensive resource, in many ways an encyclopedia as
well as a dictionary. The vocabulary of science, technology, and other
specialist areas is well covered. Older versions of the CED are fairly
obsolete. The CED3 is primarily a British English dictionary, though
it contains many American English spellings and meanings.

Format: The machine readable dictionary is supplied on tape. It is in
ascii text format and contains the typesetter's codes.

Cost: 2,000 pounds sterling for academic research; more for corporate
research.

Instructions: Harper Collins Publishers has an application form that
is required; upon its approval a contract is drawn up; upon completion
of the contract and payment of the fees, the tape is provided.

Sample: an electronic sample is available from CLR.

2) Collins COBUILD English Language Dictionary

Collins COBUILD English Language Dictionary is a Learner's Dictionary,
designed for instruction in the English language. COBUILD is developed
from analysis of the Collins Bank of English, a corpus of more than
200 million words gathered from a wide range of spoken and written
sources. The dictionary concentrates on contemporary, everyday,
non-specialist English and uses example sentences taken from real
spoken or written language. It has over 70,000 references, and over
90,000 examples.

The Format, Cost, Instructions, and Samples are the same as above.

3) Collins Bilingual Dictionaries

Harper Collins publishes a line of bilingual dictionaries for several
languages which come in different sizes. The Gem series editions have
about 40,000 references, and the Concise series editions have over
100,000. The Large editions typically have over 200,000 references and
over 400,000 translations.

German-English bilingual:      Gem, Concise, Large
Italian-English bilingual:     Gem, Concise, Large
Spanish-English bilingual:     Gem, Concise, Large
French-English bilingual:      Gem,
Greek-English bilingual:       Gem
Hindi-English bilingual:       Gem
Malay-English bilingual:       Gem
Portuguese-English bilingual:  Gem
Russian-English bilingual:     Gem

Format: The machine readable dictionaries are supplied on tape. They
are in ascii text format and contain the typesetter's codes.

Cost: For academic research only; the Gem costs 1,000 pounds sterling,
the Concise is 1,250 pounds, and the Large is 2,000 pounds. Corporate
research pricing is higher.

Instructions: are the same as those listed above for Collins English
dictionaries.

Samples: electronic samples are available for the Spanish-English and
the German-English bilinguals in their Large editions. The samples are
for the complete letter "N" in both languages.

4) Longman Dictionary of Contemporary English

LDOCE is available electronically as the second edition published in
1987 or as the first edition from 1978. The first edition is available
as a typesetting tape or as a LISP version, and has semantic
information which was not included in the second edition. For example,
Box Codes specify a hierarchy: abstract or concrete, with concrete
branching to animate or inanimate, animate branching to plant, human,
or animal, and so on. Also marked are Subject Field Codes which
indicate domains, such as Economics, Entertainment, or Basketball.
LDOCE is a Learner's Dictionary; the first edition has approximately
45,000 entries, and the second edition about 56,000 (including
phrasals).

Format: 1978 and 1987 editions: typesetting tape. 1978 edition: lisp
version on tape, or typesetting tape.

Cost: 1,000 pounds sterling for academic research; more for corporate
research.

Instructions: Longmans Publishers has an application form that is
required; upon its approval a contract is drawn up; upon completion of
the contract and payment of the fees, the tape is provided.

5) Roget's Thesaurus

The original 1911 Roget's Thesaurus is freely available. A 1991
American English edition of the thesaurus, published by Harper
Collins, is also available. For academic research purposes it costs
750 pounds sterling. Please write to inquire about the Collins
version. You can ftp to the directory below to pick up the freely
available 1911 version.

Ftp Directory: lexica/roget_1911/

****Dictionaries in English: freely available ****

The following is a list of freely available dictionaries, none of
which contains definitions or full semantic information. All have some
accompanying documentation that helps explain the codes used and the
syntactic information available. Each of these has strict copyright
privileges reserving them for academic research and excluding them
from incorporation into any commercial applications.

1) Collins English Dictionary Prolog FactBase

Developed by Dr. Ed Fox and Dr. Robert Vance at Virginia Tech. Using
the original 1974 Collins English Dictionary, a set of Prolog facts
was derived and a set of relations files created, one file for each
"relation to the headword" identified in the structure of the
dictionary. Examples of these files are: HEADWORD, headword entry;
ALSO_CALLED, headword is also commonly called this; CATEGORY, semantic
label of headword; POS, part of speech; PAST, past form of headword.
The files use Edinburgh standard Prolog syntax.

Ftp Directory: CLR/members-only/lexica/CED.prolog

2) The MRC Psycholinguistic Database

Originally prepared by Max Coltheart for a Medical Research Council
grant as a database for psycholinguistic use.
The file has 150,837 headwords with 26 linguistic properties, although
for many words information is not available on every property.
Properties include number of phonemes and syllables, measures of
frequency, pronunciation, and part of speech, to name a few.

Ftp Directory: CLR/members-only/lexica/MRC.psycholing/

3) The On-Line Dictionary of Computing

A glossary of programming languages, architecture, networks, domain
theory, mathematics, etc. Copyright Dennis Howe 1993, freely available
for research use.

Ftp Directory: CLR/members-only/lexica/OLDC/

4) The Oxford Advanced Learner's Dictionary of Contemporary English:
   Mitton's version

This is a version of the 3rd edition (1974) of the OALDCE, prepared by
Roger Mitton of the University of London specifically for use in
computer applications. The dictionary contains no definitions; the
spelling, pronunciation, and syntactic information from the original
are retained. It has 35,000 headwords, about 2,500 added proper names,
and an added section created by Dr. Mitton which has over 68,000
derived inflected forms.

Ftp Directory: CLR/members-only/lexica/OALDCE/

5) WordNet 1.4

Developed by Professor George Miller and his group at Princeton,
WordNet is an on-line lexical reference system designed as a semantic
network. English nouns, verbs, and adjectives are organized into
synonym sets, each representing one underlying lexical concept.
Different relations link the synonym sets: synonymy, antonymy,
meronymy, and hyponymy. WordNet has brief definitions, and has the
advantage of having been conceived and built explicitly for use in
computer applications. CLR houses versions 1.2, 1.3, and 1.4.

Ftp Directory: CLR/lexica/wordnet/

****Pronunciation Dictionaries ****

The Carnegie Mellon University Dictionary contains about 100,000 words
and their phonetic transcriptions. The phoneset lists 26 phones that
were used.
Robert Weide and Peter Jansen from CMU generated the dictionary from a
variety of sources including the UCLA Shoup dictionary.

A second source for pronunciation is Chuck Wooster's (ICSI Berkeley)
TIMIT database of 6100 words from TIMIT with their most common
pronunciations.

Homophones is not a pronunciation dictionary, but rather a list of
words that sound the same but are spelled differently. The list of
homophones was provided by Evan Antworth from the Summer Institute of
Linguistics.

Ftp Directory: CLR/members-only/lexica/CMU-Dict.0.1/
Ftp Directory: CLR/lexica/TIMIT/
Ftp Directory: CLR/lexica/homophones/

****Dictionaries: not English****

1) EDICT

This is a public domain Japanese/English dictionary intended
originally for use with MOKE (Mark's Own Kanji Editor) but is used
today in a large number of packages. It uses EUC code for Kanji and
Kana. EDICT has over 30,000 entries, and entries carry markers such as
transitive or intransitive verb, idiomatic expression, person name,
etc. EDICT was started by Mark Edwards, but has been developed by
James Breen.

Ftp Directory: CLR/lexica/edictj

2) JDDICT

A Japanese to German dictionary entered by Helmut Goldstein. The
dictionary has over 11,000 Japanese words and 22,000 German
translations.

Ftp Directory: CLR/lexica/jddict/

3) The Japanese Morphological Dictionary

This was made freely available by ICOT, and comes with both the
dictionary and a search program to access it. The documentation is
extensive, but it is all in Japanese.

Ftp Directory: CLR/lexica/jmorphdict/

4) Russian - English On-Line Dictionary

This is an on-line dictionary for MS-DOS developed by Leon Ungier.

Ftp Directory: CLR/multiling/russian/

****************************************************
RECENT ACQUISITIONS
****************************************************

Below are some new additions to the CLR archives. The acronym
dictionary is mentioned because it is in keeping with this
newsletter's theme.
The other materials are parsers and grammar systems.

---------------------------------------------------
ACRONYM DICTIONARY

Ftp Directory: members-only/lexica/wordlists/acronyms/

An ascii text file of a very comprehensive list of acronyms; over 3300
entries. A wide variety of domains are covered, including business,
science, medicine, government, and more. A brief sample from the
letter `N':

NAS    National Academy of Sciences
NAS    National Advanced Systems
NASA   National (US) Aeronautics and Space Administration [Space]
NASDA  NAtional (Japan) Space Development Agency [Space]
NASM   National (US) Air and Space Museum [Space]
NASP   National (US) AeroSpace Plane [Space]
NATO   North Atlantic Treaty Organization

---------------------------------------------------
AV Parser

Ftp Directory: members-only/tools/ling-analysis/syntax/AVparser/

The Attribute Value Parser, developed by Mark Johnson, is a general
tool for investigating unification-based theories of grammar. It runs
on Apple Macintosh computers. It works with a user-defined grammar,
specified in a file or constructed using the editor included, and
constructs parse trees and feature structures from input sentences.
Clicking on the nodes in the parse tree causes their associated
feature structures to be displayed. There are two versions of the
parser, corresponding to the two versions of Apple's Common Lisp
environment that were used to create them. The 1.32 version was
created with MACL version 1.32, and the 2.0p2 version was created with
MCL 2.0p2.

---------------------------------------------------
FUF and SURGE

Ftp Directory: /members-only/tools/ling-analysis/syntax/

FUF 5.2 and SURGE 1.2 were developed by Michael Elhadad, currently at
Ben Gurion University of the Negev. FUF is an extended implementation
of the formalism of functional unification grammars (FUGs) introduced
by Martin Kay, specialized to the task of natural language generation.
SURGE is a large syntactic realization grammar of English, written in
FUF. SURGE was developed to serve as a "black box" syntactic
generation component, encapsulating a rich knowledge of English
syntax, within a larger generation system. SURGE can also be used as a
platform for exploration of grammar writing with a generation
perspective.

---------------------------------------------------
LHIP PARSER

Ftp Directory: members-only/tools/ling-analysis/syntax/

The LHIP parser (Left-Head corner Island Parser) was developed by
Afzal Ballim, at ISSCO, the University of Geneva. LHIP is a system for
incremental grammar development using an extended DCG formalism. The
system uses a robust island-based parsing method, controlled by
user-defined performance thresholds, which allows it to analyze what
it can from the input, thus presenting the grammar developer with
results at an early stage. The rules themselves are an extended
version of the DCG rules, allowing optional constituents, negation,
disjunction, the specification of adjacency, and the ability to mark
multiple heads in a rule body. The latest version is 1.1. The LHIP
system requires an Edinburgh style Prolog.

***************************************************
CLR MEMBERSHIP
***************************************************

The members-only area of the CLR archives is growing rapidly, with
valuable materials and software available to lexical researchers who
are members of the consortium. If your interests lie in lexicology,
lexicography and lexical research, we encourage your organization to
become a member, promoting the use of these valuable resources for
lexical research and ensuring that they can be maintained.

Welcome to new CLR members:

Edwin R. Addison, President, along with the staff of Conquest
Software, Inc. in Columbia, Maryland.

Dr. Jose M. Castano at the Departamento de Computacion, Universidad de
Buenos Aires, Buenos Aires, Argentina.

Dr. Jane Edwards of the Institute for Cognitive Studies, and Dr.
Daniel Jurafsky of the International Computer Science Institute, at
the University of California at Berkeley, Berkeley, California.

Dr. Kemal Oflazer of the Computer Engineering Department, Bilkent
University, Ankara, Turkey.