How 'proof' Works Craig Latta Experimental Computing Facility (XCF) 20 September 1991 (about tea-time) The program is based on the concept of left-associative parsing (Hausser, 1986), which states that humans comprehend speech causally. The essence of the parser is a set of rule packages, which are in turn made of up rules. Rules take two "categorizations" as input, and output a third. Each lexical entry consists of a surface form and one or more categorizations, which correspond to the different readings of the word. Categorizations consist of "segments", which are abbreviations for various possible linguistic attributes (e.g.: "w*-interrogative", "finite verb", etc.) The parser checks sequentially by sentence, assuming that all the words in the input text have lexical entries. Given a sentence, the lexical entry of the first word is designated the "sentence start". If the categorizations of the lexical entries for the sentence start and the next word are compatible (in the eyes of one of the rules in the inital rule package), then the next word is grouped into the surface form of the sentence start, and the output of the active rule becomes the category of the new sentence start. The name of the next rule package to operate on the sentence start is derived from the name of the rule which produced the sentence start. Succeeding words are combined into the sentence start, adding and cancelling grammatical "valencies", which are encoded in the category of the sentence start. [For example, the sentence start "The girl is reading a" has a valency for a noun phrase, which would be introduced by the rule which combined "a" into the sentence start.] This process continues until the end of the sentence is reached. Then, a final rule, depending on the tone of the sentence (declarative, interrogative, etc.) checks to see that there are only ignorable category segments left in the sentence start's category (i.e. no "dangling" valencies). If this is the case, the sentence is considered grammatical, and the next sentence is analyzed. If the analysis of the sentence has failed at some intermediate point, then the user is shown that point in the text, and is told what types of continuations to the sentence start are valid, and allowed to fix the sentence, or to proceed with the analysis from the beginning of the next sentence. If a sentence is grammatically incorrect, the earliest offending in the sentence is highlit, and the user is presented with several options. (S)he may either skip the word, skip the sentence, or choose a replacement word from a list suggested by 'proof'. 'proof' works with input from the command line, files, and with interactive input. It has both terminal and graphic (using X) user interfaces. It was not as as hard to write as it sounds. It should be rewritten in Smalltalk, as should all other programs which haven't been already. --- complexity.tex \documentstyle{cmu-art} \begin{document} \renewcommand{\topmargin}{2.0cm} \renewcommand{\evensidemargin}{2.0cm} \renewcommand{\oddsidemargin}{2.0cm} \begin{large} \noindent COMPLEXITY IN LEFT-ASSOCIATIVE GRAMMAR \end{large} \vspace{10pt} \noindent {\it ROLAND HAUSSER}\\ Friedrich-Alexander Universit\"{a}t\\ D-8520 Erlangen, Bismarckstr. 12, Germany \vspace{18pt} \begin{small} \begin{quote} This paper presents a mathematical definition of Left-Associative Grammar, and describes its formal properties.\footnote{Some of the results reported in this paper are published in Hausser (1989). Discussions with David Applegate, Gary Miller, Daniel Sleator, and Shanghua Teng at Carnegie Mellon University in October of 1989 resulted in a revised complexity result for ambiguous C-LAGs. I am especially indebted to David Applegate, who provided the crucial grammars for L$_{no}$, $a^{i!}$, Subset Sum, and SAT.} Conceptually, LA-grammar is based on the notion of possible {\it continuations}, in contrast to more traditional systems such as Phrase Structure Grammar and Categorial Grammar, which are linguistically motivated in terms of possible {\it substitutions}. It is shown that LA-grammar generates all and only the recursive languages. The Chomsky hierarchy of regular, context-free, and context-sensitive languages is reconstructed in LA-grammar by simulating finite state automata, push-down automata, and linearly bounded automata, respectively. Using alternative restrictions on LA-grammars, the new language hierarchy of A-LAGs, B-LAGs, C-LAGs is proposed. The class of C-LAGs is divided into three subclasses representing different degrees of ambiguity and associated computational complexity. The class of C-LAGs without recursive ambiguities (called the C1-LAGs) parses in linear time, and includes all deterministic CF-languages, plus CF-languages with non-recursive ambiguities, e.g., $a^{n} b^{n} c^{m} d^{m} \cup a^{n} b^{m} c^{m} d^{n}$, plus many context-sensitive languages, such as $a^{n} b^{n} c^{n}$, $a^{n} b^{n} c^{n} d^{n} e^{n}$, \{$a^{n} b^{n} c^{n}$\}$^{*}$, $a^{2^{i}}$, $a^{k}b^{m}c^{k \cdot m}$ and $a^{i!}$. The class of C-LAGs with recursive ``single return'' ambiguities (called C2-LAGs) parses in $n^{2}$, and includes certain non-deterministic CF-languages such as WW$^{R}$, plus context-sensitive languages like WW, WWW, WWWWW, and \{WWW\}$^{*}$. Finally, the class of unrestricted C-LAGs (called C3-LAGs) parses in exponential time and contains CF-languages like L$_{no}$ and the ``hardest context-free language'' HCFL, plus context-sensitive languages like $\cal NP$-complete Subset Sum and SAT. \end{quote} \end{small} \vspace{0.2cm} \begin{large} \noindent {\bf 1. Formal Rule Schemata of Generative Grammar } \end{large} \vspace{0.1cm} \noindent Left-associative grammar (LA-grammar) is a comparatively new formalism. By way of introduction, let us compare it with more widely known systems, namely phrase structure grammar (PS-grammar) and categorial grammar (C-grammar). The formalism of PS-grammar is based on the rewriting system of Post (1936). Rewriting rules have the following form: \vspace{0.1cm} \noindent {\bf (1.1) The Schema of Phrase-Structure Rewriting Rules} \begin{quote}\vspace{-0.2cm} A $\rightarrow$ B C \end{quote}\vspace{-0.2cm} \noindent By replacing (rewriting) the symbol A with B and C, this rule generates a tree structure with A dominating B and C. Conceptually, the derivation order of rewriting rules is top-down. The formalism of C-grammar is based on the categorial-canceling rules of Le\'{s}ni\-eswki (1929) and Ajdukiewicz (1935). Categorial-canceling rules have the following form: \vspace{0.1cm} \noindent {\bf (1.2) The Schema of Categorial Canceling Rules} \begin{quote}\vspace{-0.2cm} $\alpha_{\rm (Y \mid X)} \bullet \beta_{\rm (Y)} \Rightarrow \alpha \beta_{\rm (X)}$ \end{quote}\vspace{-0.2cm} \vspace{0.1cm} %\pagebreak \noindent This rule schema combines $\alpha$ and $\beta$ into $\alpha \beta$ by canceling the Y in the category of $\alpha$ with the corresponding category of $\beta$. The result is a tree structure with $\alpha \beta$ of category X dominating $\alpha$ and $\beta$. Conceptually, the derivation order of categorial-canceling rules is bottom-up. Finally consider the rule schema of LA-grammar. %\vspace{0.1cm} \noindent {\bf (1.3) The Schema of Left-Associative Concatenation Rules} \begin{quote}\vspace{-0.2cm} r$_{i}$: [CAT-1 CAT-2] $\Rightarrow$ [rp$_{i}$ CAT-3] \end{quote}\vspace{-0.2cm} \noindent A left-associative rule r$_{i}$ maps a sentence start (represented by its category CAT-1) and a next word (represented by its category CAT-2) into the rule package rp$_{i}$ and a new sentence start (represented by its category CAT-3). A {\it state} in LA-grammar is defined as an ordered pair, consisting of a rule package and a sentence start. In the next composition, the rules in the rule package are applied to the sentence start resulting from the last composition and a new next word. The three different rule schemata represent three different conceptual derivation orders: \vspace{0.1cm} \noindent {\bf (1.4) Three Grammatical Derivation Orders} \setlength{\unitlength}{1in} \begin{picture}(5,1.6) \put(.1,.3){\vector(1,1){.15}} \put(.5,.3){\vector(-1,1){.15}} \put(.3,.55){\vector(1,1){.15}} \put(.7,.55){\vector(-1,1){.15}} \put(.5,.8){\vector(1,1){.15}} \put(.9,.8){\vector(-1,1){.15}} \put(.7,1.05){\vector(1,1){.15}} \put(1.1,1.05){\vector(-1,1){.15}} \put(.25,1.4){LA-grammar} \put(2.6,.3){\vector(1,1){.15}} \put(3,.3){\vector(-1,1){.15}} \put(1.9,.55){\vector(1,1){.15}} \put(2.3,.55){\vector(-1,1){.15}} \put(2.4,.55){\vector(1,1){.15}} \put(2.8,.55){\vector(-1,1){.15}} \put(2.15,.8){\vector(1,1){.15}} \put(2.55,.8){\vector(-1,1){.15}} \put(2,1.4){C-grammar} \put(4.6,.5){\vector(1,-1){.15}} \put(4.5,.5){\vector(-1,-1){.15}} \put(4.4,.75){\vector(1,-1){.15}} \put(4.3,.75){\vector(-1,-1){.15}} \put(3.9,.75){\vector(1,-1){.15}} \put(3.8,.75){\vector(-1,-1){.15}} \put(4.15,1){\vector(1,-1){.15}} \put(4.05,1){\vector(-1,-1){.15}} \put(3.7,1.4){PS-grammar} \put(0,0){bottom-up left-assoc.} \put(1.6,0){bottom-up amalgamating} \put(3.5,0){top-down expanding} \end{picture} \vspace{0.1cm} \vspace{0.2cm} LA-grammars are input-output equivalent to their parsers and generators in that (i) they take the same input (i.e., an unanalyzed surface string), (ii) generate the same output (a left-associative syntactic analysis), and (iii) use the same rules in the same derivation order. In other words, LA-grammar achieves ``absolute type transparency'' [Berwick \& Weinberg (1984), p. 41] because it is strongly input-output equivalent to its parsers and generators. PS-grammar and C-grammar, on the other hand, are unsuitable for direct parsing. Parsers for context-free PS-grammars, for example, cannot possibly apply the rules of the grammar directly because the rules rewrite an initial start symbol, while the parser takes sentences as input. The standard solution to this dilemma consists in computational routines which reconstruct the grammatical analysis in an indirect way by building large intermediate structures (e.g., ``state sets'', ``charts'', ``tables'') which are not part of the grammar. \vspace{0.2cm} \begin{large} \noindent {\bf 2. Syntax and Semantics} \end{large} \vspace{0.1cm} \noindent The surface structures of natural language are linear. When we utter or understand a sentence, we process it word by word, starting at the beginning. This fundamental structural property is formalized in LA-grammar, which computes possible continuations. The left-assocative derivation order of LA-grammar reflects the time-linear nature of language. The linear syntactic derivations in LA-grammar may be extended to generate homomorphic semantic hierarchies, defined as minimal databases which are suitable for pragmatic interpretation. As an example of a left-associative parse in natural language consider (2.1). \vspace{0.1cm} %\pagebreak \noindent {\bf (2.1) A Sample Derivation} \vspace{-0.3cm} \begin{footnotesize} \begin{verbatim} NEWCAT> (z Fido found a bone.) Elapsed real time = 779 milliseconds User cpu time = 660 milliseconds System cpu time = 20 milliseconds Total cpu time = 680 milliseconds Linear Analysis: *START_0 1 (NA) FIDO (N SC V) FOUND *NOM+FVERB_3 2 (SC V) FIDO FOUND (SQ) A *FVERB+MAIN_4 3 (SQ V) FIDO FOUND A (SN) BONE *DET+NOUN_2 4 (V) FIDO FOUND A BONE (V DECL) . *CMPLT_13 5 (DECL) FIDO FOUND A BONE . Hierarchical Analysis: (PROPOSITION-5_6_13 (MOOD (DECLARATIVE-5_6_13)) (PROP-CONTENT ((SENT-2_6_13 (SUBJ ((NP-1_6_13 (NAME (FIDO-1_6_13))))) (VERB (FIND-2_6_13)) (DIR-OBJ ((NP-3_6_13 (REF (A-3_6_13 SG-4_6_13)) (NOUN ((BONE-4_6_13)))))))))) PROPOSITION | ----------------- | | MOOD PROP-CONTENT | | | | | | DECLARATIVE SENT | -------------- | | | SUBJ VERB DIR-OBJ | | | | | | | | | NP FIND NP | | | ------ | | | NAME REF NOUN | | | | ---- | | | | | FIDO A SG BONE \end{verbatim} \end{footnotesize} \pagebreak \noindent The linear syntactic analysis in 2.1 is followed by a semantic hierarchy which is constructed incrementally and simultaneously with the syntactic derivation. The semantic hierarchy expresses many of the intuitions central to constituent structure.\footnote{Constituent structures are a special kind of semantic hierarchy. They are based methodologically on {\it substitution} and {\it movement tests} which are intended to reveal which parts of the sentence belong most closely together. The rules of PS-grammar and C-grammar, which generate constituent structures, impose a partial derivation order which reflects the structure of the hierarchy, and not the time-linear nature of language. LA-grammar differs from constituent structure grammars in that it generates semantic hierarchies in a time-linear order.} It is displayed as a structured list and (optionally) as the equivalent tree structure. The following discussion of LA-grammar is limited to the formal properties of the linear syntax. \vspace{0.2cm} \begin{large} \noindent {\bf 3. The Mathematical Definition} \end{large} \vspace{0.1cm} \noindent Let us recall some notation from set theory needed in the following definition. If X is a set, then X$^{+}$ is the ``positive closure,'' i.e., the set of all concatenations of elements of X. X$^{*}$ is the Kleene closure of X, defined as X$^{+} \cup \epsilon$, where $\epsilon$ is the ``empty sequence.'' The power set of X is denoted by 2$^{X}$ . If X and Y are sets, then (X $\times$ Y) is the Cartesian product of X and Y, i.e., the set of ordered pairs consisting of an element of X and an element of Y. For convenience, we also identify integers with sets, i.e., {\it n} = \{i $\mid$ 0 $\leq$ i $<$ n\}. \noindent {\bf (3.1) Formal Definition of Left-Associative Grammar} \vspace{0.1cm} \noindent An LA-grammar is defined as a 7-tuple $<$W, C, LX, CO, RP, ST$_{S}$, ST$_{F}>$, where \vspace{-0.2cm}\begin{enumerate} \item W is a finite set of {\it word surfaces}. \vspace{-0.2cm} \item C is a finite set of {\it category segments}. \vspace{-0.2cm} \item LX $\subset$ (W $\times$ C$^{+}$) is a finite set comprising the {\it lexicon}. \vspace{-0.2cm} \item CO = (co$_{0}$ ... co$_{n-1}$) is a finite sequence of total recursive functions from (C$^{*}$ $\times$ C$^{+}$) into C$^{*}$ $\cup$ \{$\perp$\}, called {\it categorial operations}. \vspace{-0.2cm} \item RP = (rp$_{0}$ ... rp$_{n-1}$) is an equally long sequence of subsets of {\it n} called {\it rule packages}. \vspace{-0.2cm} \item ST$_{S}$ = \{(rp$_{s}$ cat$_{s}$), ...\} is a finite set of {\it initial states}, where each rp$_{s}$ is a subset of $n$ called a start rule package and each cat$_{s}$ $\varepsilon$ C$^{+}$. \vspace{-0.2cm} \item ST$_{F}$ = \{(rp$_{f}$ cat$_{f}$), ...\} is a finite set of {\it final states}, where each rp$_{f}$ $\varepsilon$ RP and each cat$_{f}$ $\varepsilon$ C$^{*}$. \end{enumerate} For theoretical reasons, the categorial operations are defined as total functions. In practice, the categorial operations are defined on easily-recognizable subsets of (C$^{*}$ $\times$ C$^{+}$), where anything outside these subsets is mapped into the arbitrary ``don't care'' value \{$\perp$\}, making the categorial operations total. An LA-grammar is specified by (i) a lexicon LX, (ii) a set of start states ST$_{S}$, (iii) a sequence of rules, each defined as an ordered pair (co$_{i}$ rp$_{i}$), and (iv) a set of final states ST$_{F}$. A left-associative rule r$_{i}$ takes a sentence start ss and a next word nw as input, and applies the categorial operation co$_{i}$ to the sentence category cat-1 and the next word category cat-2. If the input condition of the categorial operation is satisfied by (cat-1 cat-2), the application of r$_{i}$ is successful and an output is derived. The output consists of the pair [rp$_{i}$ ss'], where rp$_{i}$ is a {\it rule package} and ss' is a {\it resulting sentence start}. If the input condition of the categorial operation is not satisfied by (cat-1 cat-2), the application of rule r$_{i}$ is not successful and no output is derived. The rule package rp$_{i}$ provided by the rule r$_{i}$ contains all rules which may apply after rule r$_{i}$ was successful. A rule package is defined as a set of rule names, where the name of a rule r$_{g}$ is the place number g of its categorial operation co$_{g}$ in the sequence CO. The general format of LA-grammars is illustrated below with the context-sensitive language $a^{n} b^{n} c^{n}$. %\vspace{0.1cm} \pagebreak \noindent {\bf (3.2) The Definition of $a^{n} b^{n} c^{n}$} \vspace{-0.2cm} \begin{tabbing} xx\=LX \= =$_{def}$ \= \{a, b, c\}\kill \>LX\> =$_{def}$ \>\{[a (bc)], [b (b)], [c (c)]\}\\ \>ST$_{S}$ =$_{def}$ \{[\{r-1, r-2\} (bc)]\}\\ xx\= r-1: \= [(bXc) (bc)]\= $\Rightarrow$ \= [\{r-1, r-2\} (bbXcc)],\kill \> r-1: \>[(X) (bc)]\> $\Rightarrow$ \> [\{r-1, r-2\} (bXc)],\\ \> r-2: \>[(bXc) (b)]\> $\Rightarrow$ \> [\{r-2, r-3\} (Xc)],\\ \> r-3: \>[(cX) (c)]\> $\Rightarrow$ \> [\{r-3\} (X)]\\ xx\=\kill \>ST$_{F}$ =$_{def}$ \{[rp-3 $\epsilon$]\}. \end{tabbing} \vspace{-0.2cm} The lexicon LX defines three words, `a', `b', and `c'. Each word is an ordered pair consisting of a surface, e.g., `a', and a category defined as a list of category segments, e.g., (bc). The start state in ST$_{S}$ specifies that the first word must be of category (bc), i.e., an `a'. Furthermore, the only rules to be tried at the beginning are r-1 and r-2. The rules specify the categorial operations in terms of the sequence variable X, standing for zero or more category segments, and the category segments $b$ and $c$. Rule r-1 takes a sentence start of any category (represented by `(X)'), and a next word of category (bc)---that is, an `a'---and derives the output category by adding a $b$ to the beginning and a $c$ to the end of the sentence start category. Rule r-2 takes a sentence start category which starts with an $b$ and ends with a $c$, a next word of category (b)---that is, a `b'---and derives an output category by deleting the first segment. And similarly for r-3. Note that the three rules of $a^{n} b^{n} c^{n}$ have {\it incompatible} input conditions (they all take different next words). Thus the grammar is an example of an {\it unambiguous} C-LAG and parses in linear time.\footnote{See Section 8 for further discussion.} In LA-grammar the parsing of a language consists in applying the grammar rules directly to the input string.\footnote{In contrast to PS-grammar which requires the construction of state sets, charts, or tables to mediate between the parser input, i.e., a string of words, and the grammar rules.} Thus, the declarative linguistic analysis and the computation are merely different aspects of the same left-associative structure. The grammatical analysis provided by LA-parsers and LA-generators is simply a trace of the computation. \vspace{0.1cm} \noindent {\bf (3.3) Parsing {\bf aaabbbccc} with Active Rule Counter} \vspace{-0.1cm} \begin{scriptsize} \begin{verbatim} NEWCAT> (z a a a b b b c c c) ; 1: Applying rules (RULE-1 RULE-2) ; 2: Applying rules (RULE-1 RULE-2) ; 3: Applying rules (RULE-1 RULE-2) ; 4: Applying rules (RULE-2 RULE-3) ; 5: Applying rules (RULE-2 RULE-3) ; 6: Applying rules (RULE-2 RULE-3) ; 7: Applying rules (RULE-3) ; 8: Applying rules (RULE-3) ; Number of rule applications: 14. *START-0 1 (B C) A (B C) A *RULE-1 2 (B B C C) A A (B C) A *RULE-1 3 (B B B C C C) A A A (B) B *RULE-2 4 (B B C C C) A A A B (B) B *RULE-2 5 (B C C C) A A A B B (B) B *RULE-2 6 (C C C) A A A B B B (C) C *RULE-3 7 (C C) A A A B B B C (C) C *RULE-3 8 (C) A A A B B B C C (C) C *RULE-3 9 (NIL) A A A B B B C C C \end{verbatim} \end{scriptsize} %\pagebreak \noindent Each left-associative composition is characterized by a word number (e.g., 4), a sentence start consisting of a category\footnote{Note that categories precede the surfaces in (3.3). In the current implementation, the parse is called with the function ``(z ...)''.} and a surface (e.g., (B B C C) A A A B ), a next word (e.g., (B) B ), and a rule (e.g., *RULE-2). The result of the composition is shown in the first line of the next ``history section'' (e.g., (B C C C) A A A B B ). LA-grammar is as suitable for direct generation as it is for direct parsing. The only difference is that in parsing, the next word is provided by the input string, while in generation the next word is chosen from the lexicon. As an illustration of the relation between an LA-grammar and its generator, consider the following derivation of well-formed expressions up to length 12 using the grammar for a$^{n}$b$^{n}$c$^{n}$ defined in (3.2). \vspace{0.1cm} \noindent {\bf (3.4) Generating the Representative Sample of a$^{n}$b$^{n}$c$^{n}$} \vspace{-0.1cm} \begin{scriptsize} \begin{verbatim} NEWCAT> (gram-gen 3 '(a b c)) Parses of length 2: A B 2 (C) A A 1 (B B C C) Parses of length 3: A B C 2 3 (NIL) A A B 1 2 (B C C) A A A 1 1 (B B B C C C) Parses of length 4: A A B B 1 2 2 (C C) A A A B 1 1 2 (B B C C C) A A A A 1 1 1 (B B B B C C C C) Parses of length 5: A A B B C 1 2 2 3 (C) A A A B B 1 1 2 2 (B C C C) A A A A B 1 1 1 2 (B B B C C C C) Parses of length 6: A A B B C C 1 2 2 3 3 (NIL) A A A B B B 1 1 2 2 2 (C C C) A A A A B B 1 1 1 2 2 (B B C C C C) Parses of length 7: A A A B B B C 1 1 2 2 2 3 (C C) A A A A B B B 1 1 1 2 2 2 (B C C C C) Parses of length 8: A A A B B B C C 1 1 2 2 2 3 3 (C) A A A A B B B B 1 1 1 2 2 2 2 (C C C C) Parses of length 9: A A A B B B C C C 1 1 2 2 2 3 3 3 (NIL) A A A A B B B B C 1 1 1 2 2 2 2 3 (C C C) Parses of length 10: A A A A B B B B C C 1 1 1 2 2 2 2 3 3 (C C) Parses of length 11: A A A A B B B B C C C 1 1 1 2 2 2 2 3 3 3 (C) Parses of length 12: A A A A B B B B C C C C 1 1 1 2 2 2 2 3 3 3 3 (NIL) \end{verbatim} \end{scriptsize} After loading the same grammar as is used for parsing, the function `gram-gen' is called with two arguments: the ``recursion factor'' of the grammar [Hausser (1989), pp. 193 ff], and a list of the words to be used.\footnote{In another version, `gram-gen' is called with the maximal surface length rather than the recursion factor.} The output is a systematic generation, starting with well-formed expressions of length 2. Each derivation consists of a surface, a sequence of rules, and a result category. As an example of a single derivation, consider (3.5): \vspace{0.1cm} \noindent {\bf (3.5) A Complete Well-Formed Expression in a$^{n}$b$^{n}$c$^{n}$} \vspace{-0.2cm} \begin{footnotesize} \begin{verbatim} A A A B B B C C C 1 1 2 2 2 3 3 3 (NIL) \end{verbatim} \end{footnotesize} \vspace{-0.2cm} \noindent In (3.5) the surface and the rule sequence are lined up so that it is apparent which word was added by which rule. Derivation (3.5) characterizes a complete well-formed expression because it represents the rule state (rp-3 $\epsilon$), which is an element of the set of complete well-formed expressions of the LA-grammar for a$^{n}$b$^{n}$c$^{n}$ defined in (3.2). \vspace{0.2cm} %\pagebreak \begin{large} \noindent {\bf 4. The Hierarchy of A-LAGs, B-LAGs, and C-Lags} \end{large} \vspace{0.1cm} \noindent Intuitively, LA-grammar may be viewed as a mathematical generalization of (one-way) pushdown automata. Consider a PDA move $\delta$, defined as a function from \begin{quote} Q $\times$ ($\sum$ $\cup$ {$\epsilon$}) $\times$ $\Gamma$ into Q $\times$ $\Gamma^{*}$ \end{quote} \noindent where Q is the set of states, $\sum$ is the surface alphabet of the language, and $\Gamma$ is the stack alphabet. Corresponding to the mapping from Q into Q, a rule in LA-grammar has the specification of a rule package rp-i for each rule r-i: \begin{quote} r$_{i}$: [CAT-1 CAT-2] $\Rightarrow$ [rp$_{i}$ CAT-3] \end{quote} \noindent And corresponding to the mapping from ($\sum$ $\cup$ $\epsilon$) $\times$ $\Gamma$ into $\Gamma^{*}$, a rule in LA-grammar has a categorial operation, defined as a mapping from (CAT-1 CAT-2) into CAT-3. It follows that for any PDA there exists an equivalent LA-grammar. The converse does not hold, however, because LA-grammars are much more general than PDAs. LA-grammar distinguishes systematically between the surface and the category of expressions. Thus a single alphabet is used to specify both the sentence start category (corresponding to the stack $\Gamma$) and the next word category (corresponding to the surface alphabet $\sum$). Furthermore, the categorial operations of LA-grammar are defined as total recursive functions from (C$^{*}$ $\times$ C$^{+}$) into C$^{*}$ $\cup$ \{$\perp$\}. It follows that LA-grammar recognizes and generates {\bf all} recursive languages [Hausser (1989), Theorem 2, p. 135]. Furthermore, due to the linear structure of the derivation LA-grammar generates {\bf only} the recursive languages [Hausser (1989), Theorem 1, p. 134]. Thus all possible analyses are derived in finitely many steps for any given finite input, and there is no halting problem in LA-grammar and associated parsers. Given the powerful categorial operations of LA-grammar, we are interested in languages where the categorial operations may be specified in terms of simple patterns. Consider the following three rule schemas, whereby CAT-3 contains at most one sequence variable, e.g., X. \begin{quote}\vspace{-0.2cm} r$_{i}$: [(seg-1...seg-k X) CAT-2] $\Rightarrow$ [rp$_{i}$ CAT-3] r$_{i}$: [(X seg-1...seg-k) CAT-2] $\Rightarrow$ [rp$_{i}$ CAT-3] r$_{i}$: [(seg-1...seg-i X seg-i+1...seg-k) CAT-2] $\Rightarrow$ [rp$_{i}$ CAT-3] \end{quote}\vspace{-0.2cm} %\pagebreak \noindent These schemas have in common that the pattern matching has to check exactly $k$ segments in the sentence start category. LA-grammars conforming to this simple type of pattern matching are called constant LA-grammars or C-LAGs. On the other hand, if an LA-grammer has rules of the form \begin{quote}\vspace{-0.2cm} r$_{i}$: [(X seg-1...seg-k Y) CAT-2] $\Rightarrow$ [rp$_{i}$ CAT-3] \end{quote}\vspace{-0.2cm} \noindent the pattern matching has to search through an indefinite number of category segments (represented by X or Y, depending on where the search begins). In such non-constant LA-grammars CAT-3 may contain more than one sequence variable (e.g., X and Y). \vspace{0.1cm} \noindent {\bf (4.1) Definition of the Class of C-LAGs} \begin{quote}\vspace{-0.2cm} The class of {\it constant} LA-grammars, or C-LAGs, consists of grammars where no categorial operation co$_{i}$ looks at more than $k$ segments in the sentence-start categories, for a finite constant $k$.\footnote{This finite constant will vary between different grammars.} A language is called a C-language iff it is recognized by a C-LAG. \end{quote}\vspace{-0.2cm} \vspace{0.1cm} The C-languages contain many context-sensitive languages (e.g., 3.2) and all context-free languages. Context-free languages in LA-grammar are characterized by categorial operations which may only look at a finite number of segments at the beginning of the sentence-start category [Hausser (1989), Theorem 4, p. 138]. The regular languages, furthermore, are generated by C-LAGs where the length of the sentence start category is restricted by a finite constant [Hausser (1989), Theorem 4, p. 138, and Section 8.2]. Because of their restricted categorial operations, C-LAGs resemble PDAs most closely. This similarity between C-LAGs and PDAs explains why C-LAGs are much easier to write than corresponding PS-grammars. \begin{small} \begin{quote} Whenever one is faced with some new context-free set, it is generally much easier to describe a pushdown automaton to accept it than to try to produce a grammar. \begin{flushright} M.A. Harrison (1978), p. 135. \end{flushright} \end{quote} \end{small} \noindent C-LAGs differ from PDAs, however, in that they are more general, accepting a much larger class of languages. Furthermore, C-LAGs provide a simpler, more transparent, mathematical notation. The use of {\it pattern matching} in specifying the categorial operations of C-LAGs constitutes a major contrast to PS-grammars, which use the {\it rewriting} of variables (non-terminals) instead. PDAs make very limited use of pattern matching by looking only at the first element of the stack. LA-grammars which are not C-LAGs are called non-constant LA-grammars. In non-constant LA-grammars a categorial operation is viewed as a sequence of Turing machine moves, using the category as the tape. Non-constant LA-grammars are divided into the {\it B-LAGs} and {\it A-LAGs}. \vspace{0.1cm} %\pagebreak \noindent {\bf (4.2) Definition of the Class of B-LAGs} \begin{quote}\vspace{-0.2cm} The class of {\it bounded} LA-grammars or B-LAGs consists of grammars where for any complete well-formed expression E the length of intermediate sentence-start categories is bounded by $C \cdot n$, where $n$ is the length of E and $C$ is a constant. A language is called a B-language if it is recognized by a B-LAG, but not by a C-LAG. \end{quote}\vspace{-0.2cm} The class of languages generated by B-LAGs is the class of the context-sensitive languages. The proof [Hausser (1989), Theorem 5, p. 142] is based on the corresponding restrictions on linearly bounded automata. \vspace{0.1cm} \noindent {\bf (4.3) Definition of the Class of A-LAGs} \begin{quote}\vspace{-0.2cm} The class of A-LAGs consists of {\it all} LA-grammars because there is no limit on the length of the categories, or on the number of category segments read by the categorial operations. A language is called an A-language if it is recognized by an A-LAG, but not by a B-LAG. \end{quote}\vspace{-0.2cm} \vspace{0.1cm} The three classes of LA-grammars defined above are related in the following hierarchy: \vspace{0.1cm} \noindent {\bf (4.4) The Hierarchy of A-LAGs, B-LAGs, and C-LAGs} \begin{quote}\vspace{-0.2cm} The class of A-LAGs recognizes all recursive languages, the class of B-LAGs recognizes all context-sensitive languages, and the class of C-LAGs recognizes many context-sensitive languages, all context-free languages, and all regular languages. \end{quote}\vspace{-0.2cm} \noindent The remainder of this paper will deal mostly with the formal properties of C-LAGs. \vspace{0.2cm} %\pagebreak \begin{large} \noindent {\bf 5. Complexity and Ambiguity} \end{large} \vspace{0.1cm} \noindent Because the categorial operations of C-LAGs look at no more than $k$ sentence-start category segments, for some constant $k$, the application of a rule may be taken as the ``primitive operation'' for purposes of complexity analysis. Thus, the complexity of a C-LAG corresponds to the maximal number of rule applications per input. For example, a C-LAG parses in linear time if the maximal number of rule applications per input of length $n$ is $C \cdot n$, for some constant $C$. A C-LAG parses in square time if the maximal number of rule applications per input of length $n$ is $C \cdot n^{2}$, etc. The class of C-LAGs is divided into three subclasses according to (i) types of ambiguity, and (ii) corresponding degrees of computational efficiency.\footnote{A similar situation with regard to ambiguity arises also in PS-grammar. For example, the Earley algorithm (Earley 1970) recognizes {\bf unambiguous} context-free PS-grammars in $|G|^{2} \cdot n^{2}$, but {\bf ambiguous} context-free PS-grammars in $|G|^{2} \cdot n^{3}$ (where $|G|$ is the size of the grammar and {\it n} the length of the input string).} These subclasses are the C1-LAGs, which parse in $n$ (linear time), the C2-LAGs, which parse in $n^{2}$ (square time), and the C3-LAGs, which parse in $2^{n}$ (exponential time). In order to describe the subclassses of C-LAGs in more detail, we must first explain the different types of ambiguity in LA-grammar. There are (i) unambiguous LA-grammars, (ii) syntactically-ambiguous LA-grammars, and (iii) lexically-ambiguous LA-grammars. Syntactic ambiguity is defined in terms of the {\it input-compatibility} of rules. \vspace{0.1cm} \noindent {\bf (5.1) Three Types of Input Conditions} \begin{enumerate} \vspace{-0.1cm}\item {\bf Incompatible} input conditions: Two rules have incompatible input conditions if there exist no input pairs which are accepted by both rules. \vspace{-0.1cm}\item {\bf Compatible} input conditions: Two rules have compatible input conditions if there exists at least one input pair accepted by both rules, and there exists at least one input pair accepted by one rule, but not the other. \vspace{-0.1cm}\item {\bf Identical} input conditions: Two rules have identical input conditions if it holds for all input pairs that they are either accepted by both rules, or rejected by both rules. \end{enumerate} \vspace{0.1cm} \noindent {\bf (5.2) Definition of Unambiguous LA-Grammars} \begin{quote}\vspace{-0.2cm} An LA-grammar is unambiguous if and only if (i) it holds for all rule packages that their rules have {\it incompatible} input conditions, and (ii) there are no lexical ambiguities. \end{quote}\vspace{-0.2cm} \noindent Examples of incompatible input conditions are [(a X)(b)] and [(c X)(b)], as well as [(a X)(b)] and [(a X)(c)]. Since unambiguous LA-grammars permit--in any given state--at most one continuation, they represent the class of deterministic LA-grammars. Examples of unambiguous C-LAGs are 3.2 for $a^{n} b^{n} c^{n}$ and 6.1 for $a^{2^{i}}$. Unambiguous C-LAGs parse in linear time. \vspace{0.1cm} \noindent {\bf (5.3) Definition of Syntactically-Ambiguous LA-Grammars} \begin{quote}\vspace{-0.2cm} {\rm An LA-grammar is syntactically ambiguous if and only if (i) it has at least one rule package containing at least two rules with {\it compatible} input conditions, and (ii) there are no lexical ambiguities. } \end{quote}\vspace{-0.2cm} \vspace{0.1cm} \noindent For example, [(a X)(b)] and [(X a)(b)] represent compatible input conditions. The computational complexity of a syntactically ambiguous LA-grammar, and the number of possible readings in the associated language, depend on four types of syntactic ambiguity, based on the binary features $\pm local$ and $\pm recursive$. In a local ambiguity only one branch (resulting from an ambiguity split at a certain state) reaches a final state, whereas in a non-local ambiguity two or more branches reach a final state (for some given input). An ambiguity is called recursive if it arises inside a recursive loop. That is, a certain state (rp-i, CAT) has more than one continuation (for a given next word), and one or more of these continuations lead back into this state at a later point in the derivation, allowing for a repetition of the ambiguity split. An ambiguity is called non-recursive, on the other hand, if none of the branches resulting from the ambiguity can ever feed back into the state which caused the ambiguous continuation. C-LAGs with non-recursive ambiguities parse in linear time, just like unambiguous C-LAGs. An example of an LA-grammar with a non-local, non-recursive ambiguity is 6.2 for a$^{n}$b$^{n}$c$^{m}$d$^{m}$ $\cup$ a$^{n}$b$^{m}$c$^{m}$d$^{n}$. C-LAGs with recursive ambiguities can be of exponential complexity: there may be, e.g., a doubling of readings each time the derivation returns to the ambiguous state (examples are the C-LAGs 8.4 for L$_{no}$ and 8.5 for Subset Sum). However, if only a single branch of a recursive ambiguity split may return to the ambiguous state, then the complexity of the grammar is $n^{2}$. An example of a C-LAG with a recursive ``single return'' ambiguity is 7.1 for WW$^{R}$.\footnote{By definition 5.3, LA-grammars like 7.1 for WW$^{R}$ are ambiguous even though all complete expressions correspond to unique final states. In other words, if the LA-grammar for a language does not produce ambiguously analyzed sentence starts, then the grammar is deterministic in the sense that for any state there is at most one continuation--because all rule packages contain only incompatible rules. Conversely, if the LA-grammar for a language produces ambiguous sentence starts, then the grammar is non-deterministic in the sense that there are states with more than one possible continuation for a given next word. Thus the notions `unambiguous' and `deterministic,' as well as `ambiguous' and `non-deterministic,' coincide in LA-grammar.} %\vspace{0.1cm} \pagebreak \noindent {\bf (5.4) Definition of Lexically-Ambiguous LA-Grammars} \begin{quote}\vspace{-0.2cm} An LA-grammar is lexically ambiguous if its lexicon contains at least two analyzed words with identical surfaces. \end{quote}\vspace{-0.2cm} \noindent A non-linguistic example of lexical ambiguity is propositional calculus, e.g., (x v y v z) \& (..., where the propositional variable x is analyzed as (x (1)) and (x (0)), y is analyzed as (y (1)) and (y (0)), etc. \vspace{0.2cm} %\pagebreak \begin{large} \noindent {\bf 6. The C1-LAGs: Parsing in Linear Time} \end{large} \vspace{0.1cm} \noindent The most efficient type of constant LA-grammars are the C1-LAGs, which parse in linear time. A grammar $G$ is a C1-LAG if $G$ is a C-LAG that does not generate any recursive ambiguities. The class of C1-LAGs analyzes all deterministic context-free languages\footnote{Deterministic CF-languages are defined in terms of deterministic PDAs [Harrison 1978, p. 129].} and all context-free languages with non-recursive ambiguities. In addition the C1-LAGs analyze context-sensitive languages which are unambiguous or non-recursively ambiguous. Examples of unambiguous context-sensitive C-LAGs are $a^{n}b^{n}c^{n}$ defined in 3.2 and $a^{2^{i}}$ =$_{def}$ \{a$^{i}$ $\mid$ i is a positive power of 2\} defined in 6.1. \vspace{0.1cm} \noindent {\bf (6.1) The Unambiguous C1-LAG for Context-Sensitive $a^{2^{i}}$} \vspace{-0.2cm} \begin{tabbing} xxx\=LX \= =$_{def}$ \= \{a, b, c\}\kill \>LX\> =$_{def}$ \>\{[a (a)]\}\\ \>ST$_{S}$ =$_{def}$ \{(\{r-1\} (a))\}\\ xxx\= r-1: \= [(bXc) (bc)]\= $\Rightarrow$ \= [\{r-1, r-2\} (bbXcc)],\kill \> r-1: \>[(a) (a)]\> $\Rightarrow$ \> [\{r-2 \} (aa)],\\ \> r-2: \>[(aX) (a)]\> $\Rightarrow$ \> [\{r-2, r-3\} (Xbb)],\\ \> r-3: \>[(bX) (a)]\> $\Rightarrow$ \> [\{r-2, r-3\} (Xaa)]\\ xxx\=\kill \>ST$_{F}$ =$_{def}$ \{[rp-1 (aa)], [rp-2 (bXb], [rp-3 (aXa)]\}. \end{tabbing} \noindent 6.1 is a C1-LAG because r-2 and r-3 have incompatible input conditions. A comparison of 6.1 with corresponding phrase structure grammars [Hopcroft \& Ullman (1979)]\footnote{See op. cit., p. 224 for the canonical context-sensitive PS-grammar of $a^{2^{i}}$, and p. 220 for a version as an unrestricted PS-grammar.} for $a^{2^{i}}$ illustrates the formal and conceptual simplicity of LA-grammar. A C1-LAG with a non-recursive ambiguity is given in 6.2 for the language a$^{n}$b$^{n}$c$^{m}$d$^{m}$ $\cup$ a$^{n}$b$^{m}$c$^{m}$d$^{n}$. This language has been called an {\it inherently ambiguous language} [cf. Hopcroft and Ullman (1979), pp. 99 -- 103] because there exist no unambiguous PS-grammars for it. \vspace{0.1cm} \noindent {\bf (6.2) The Ambiguous C1-LAG for Context-Free a$^{n}$b$^{n}$c$^{m}$d$^{m}$ $\cup$ a$^{n}$b$^{m}$c$^{m}$d$^{n}$} \begin{tabbing} xx\=xxxr[(a X) (a)] \=$\Rightarrow$ [\{r-1, r-2, r-5\}(a X)]\kill \>LX =$_{def}$ \{(a (a)), (b (b)), (c (c)), (d (d))\}\\ \>ST$_{S}$ =$_{def}$ \{[\{r-1, r-2, r-5\} (a)]\}\\ \>r-1: [(X) (a)] \> $\Rightarrow$ [\{r-1, r-2, r-5\} (a X)]\\ \>r-2: [(a X) (b)] \> $\Rightarrow$ [\{r-2, r-3\} (X)]\\ \>r-3: [(X) (c)] \> $\Rightarrow$ [\{r-3, r-4\} (c X)]\\ \>r-4: [(c X) (d)] \> $\Rightarrow$ [\{r-4\}(X)]\\ \>r-5: [(X) (b)] \> $\Rightarrow$ [\{r-5, r-6\} (b X)]\\ \>r-6: [(b X) (c)] \> $\Rightarrow$ [\{r-6, r-7\} (X)]\\ \>r-7: [(a X) (d)] \> $\Rightarrow$ [\{r-7\} (X)]\\ \>ST$_{F}$ =$_{def}$ \{[rp-4 $\epsilon$], [rp-7 $\epsilon$]\} \end{tabbing} \noindent The grammar defined in 6.2 exhibits a syntactic ambiguity in the sense of definition 5.4: the rule package rp-1 calls the input-compatible rules r-2 and r-5, and the rule package rp-3 calls the input-compatible rules r-3 and r-4. Nevertheless, the grammar parses in linear time because the ambiguity splits are not part of a recursion: rp-2 and rp-5 do not call r-1, and rp-4 does not call r-3. In the worst case, the grammar generates two analyses, based on two parallel linear branches [Hausser (1989), pp. 154 ff]. The class of C1-LAGs cuts across other language classes which have been proposed as extensions of the context-free languages. These are the tree adjoining languages, or TALs, accepted by tree adjoining grammars, or TAGs [cf. Joshi (1987)], and the indexed languages [cf. Hopcroft and Ullman (1979), pp. 389ff]. The C1-LAGs accept TALs such as $a^{n} b^{n} c^{n}$, indexed languages which are not TALs such as $a^{n} b^{n} c^{n} d^{n} e^{n}$, $\{a^{n} b^{n} c^{n}\}^{*}$, $a^{k}b^{m}c^{k \cdot m}$,\footnote{Hausser 1989 presented $a^{k}b^{m}c^{k \cdot m}$ as a B-LAG. A C1-LAG for context-sensitive $a^{k}b^{m}c^{k \cdot m}$ is presented below: \vspace{0.1cm} LX =$_{def}$ \{(a (a)), (b (b)), (c (c))\} ST$_{S}$ =$_{def}$ \{[\{r-1\} (a)]\} r-1:[(X)(a)] $\Rightarrow$ [\{r-1, r-2\}(Xa)] r-2:[(X)(b)] $\Rightarrow$ [\{r-2, r-3\}(Xb)] r-3:[(aaXb)(c)] $\Rightarrow$ [\{r-4\}(baX)] r-4:[(Xb)(c)] $\Rightarrow$ [\{r-4, r-5, r-7\} (bX)] r-5:[(bXaa)(c)] $\Rightarrow$ [\{r-6\}(Xab)] r-6:[(bX)(c)] $\Rightarrow$ [\{r-6, r-3, r-9\}(Xb)] r-7:[(Xba)(c)] $\Rightarrow$ [\{r-9\}(X)] r-8:[(abX)(c)] $\Rightarrow$ [\{r-9\}(X)] r-9:[(bX)(c)] $\Rightarrow$ [\{r-9\}(X)] ST$_{F}$ =$_{def}$ \{[rp-9 $\epsilon$]\} } and $a^{2^{i}}$, as well as $a^{i!}$ which is not an indexed language [Hayashi (1973)].\footnote{Applegate has constructed a C1-LAG for $a^{i!}$ as follows: ``The category after reading a$^{x!}$ consists of: 1. \# 2. x written as an x bit binary number 3. \# 4. x! written as an x$^{2}$ bit binary number 5. \# Then, while reading the next (x+1)!-x! a's, we adjust the category to be correct after reading a$^{(x+1)!}$. The basic idea is that we expand x and x! to be x+1 and (x+1)$^{2}$ bit numbers, increment x, and multiply x! by the result to get (x+1)!. Then, we compute how many extra a's are needed to get from where we are to a$^{(x+1)!}$, and then match those a's.''} We may regard the C1-LAGs as an extension of the deterministic CF-grammars (or LR($k$)-grammars),\footnote{Any deterministic CF-language has an LR($k$)-grammar. Every LR($k$) language is a deterministic context-free language. Cf. Harrison (1978), p. 501, and 554.} providing linear parsing for a much larger class of languages in a simple, unified framework. \vspace{0.2cm} \begin{large} \noindent {\bf 7. The C2-LAGs: Parsing in $n^{2}$} \end{large} \vspace{0.1cm} \noindent The second most efficient type of C-LAGs are the C2-LAGs, which parse in $n^{2}$. A grammar $G$ is a C2-LAG if $G$ generates recursive ambiguities which are restricted by the ``Single Return Principle'' (SRP): \begin{quote}\vspace{-0.2cm} {\rm A syntactic ambiguity arising inside a recursion constitutes a ``single return'' if exactly one of the branches resulting from the ambiguity may feed back into the recursion. } \end{quote}\vspace{-0.2cm} \noindent An LA-grammar satisfies the SRP if all its recursive ambiguities are ``single return.'' As a consequence of the SRP, C2-LAGs have---at most---($C \cdot n$) readings.\footnote{$C$ is some finite, grammar dependent constant reflecting the number of rules introducing recursive ambiguities and $n$ is the length of the input.} Consequently, any C2-LAG can be parsed in $n^{2}$. As a case in point consider the non-deterministic context-free language WW$^{R}$, defined in 7.1. \vspace{0.1cm} %\pagebreak \noindent {\bf (7.1) The Nondeterministic Context-Free Language WW$^{R}$} \vspace{0.1cm} LX =$_{def}$ \{(a (a)), (b (b)), (c (c)), (d (d)), ...\} ST$_{S}$ =$_{def}$ \{[\{r-1, r-2\}(seg1)]\} r-1: [(X)(seg1)] $\Rightarrow$ [\{r-1, r-2\}(seg1 X)] r-2: [(seg1 X)(seg1)] $\Rightarrow$ [\{r-2\}(X)] ST$_{F}$ =$_{def}$ \{[rp-2 $\epsilon$]\} \vspace{0.1cm} \noindent This language consists of an arbitrarily long sequence of arbitrary letters (called W) followed by an inverse sequence. The worst case for parsing WW$^{R}$ is input strings consisting of even numbers of the same letter. Consider 7.2: \vspace{0.1cm} \noindent {\bf (7.2) The Derivational Structure of the Worst Case in WW$^{R}$ (for Input $aaaaaa$)} \begin{verbatim} rules: analyses: 2 a $ a 1 2 2 a a $ a a 1 1 2 2 2 a a a $ a a a 1 1 1 2 2 a a a a $ a a 1 1 1 1 2 a a a a a $ a 1 1 1 1 1 a a a a a a $ \end{verbatim} \vspace{0.2cm} \noindent 7.2 exhibits a recursive ambiguity: each time r-1 has applied, the system tries both r-1 and r-2 at the next composition. But once a derivation has entered r-2 it cannot go back to r-1. Thus, only one branch of the ambiguity (represented by the application of r-1) can reenter the recursion. In other words, the LA-grammar 7.1 satisfies the SRP. An example of a C2-LAG for a context-sensitive language is the grammar for WW. This grammar is just like 7.1, except for the output category of r-1: in context-sensitive WW, r-1 produces (X seg1), while in context-free WW$^{R}$, r-1 produces (seg1 X). Furthermore, the C2-LAGs for context-free WW$^{R}$ and context-sensitive WW use exactly the same number of rules for corresponding input (i.e., they have the same complexity). In summary, the class of C2-LAGs, defined in terms of recursive ``single return'' ambiguities, parses a subset of the nondeterministic context-free languages in $n^{2}$, e.g., WW$^{R}$. In addition, the C2-LAGs include context-sensitive languages like WW (a TAL), plus WWW, WWWWW, and \{WWW\}$^{*}$, which are Index Languages but not TALs. \vspace{0.2cm} \begin{large} \noindent {\bf 8. The C3-LAGs: Parsing in Exponential Time} \end{large} \vspace{0.1cm} \noindent The C3-LAGs contain C-LAGs with unrestricted recursive ambiguities and parse in $2^{n}$ (exponential time). An example of a non-deterministic context-free language accepted by a C3-LAG is the language L$_{no}$ (or ``noise''-language). L$_{no}$ has a simple context-free PS-grammar. \vspace{0.1cm} \noindent {\bf 8.1 PS-Definition of L$_{no}$} S $\rightarrow$ 1S1 S $\rightarrow$ 0S0 S $\rightarrow$ 1S S $\rightarrow$ 0S S $\rightarrow$ \# \noindent L$_{no}$ generates expressions with the structure W'\#W$^{R}$. The symbol \# separates W' and W$^{R}$. W$^{R}$ is the inverse of W, and W' differs from W in that it may contain an arbitrary number of additional 0's and 1's. These additional symbols are interspersed with and indistinguishable from those which have a counterpart in W$^{R}$. In other words, the additional symbols in W' function as noise. %\pagebreak Traditional parsers based on context-free PS-grammars like Earley and CYK can handle L$_{no}$ in n$^{3}$ because the PS-grammar rules reflect the basic structure of CF-languages, which is ``inverse pairs,'' e.g., `abc...cba'.\footnote{Any context-free language is homomorphic with the intersection of a regular set and a semi Dyck set (Chomsky-Sch\"{u}tzenberger theorem). Cf. Harrison (1978), pp. 317ff.} Example 8.2 presents a sample derivation within the PS-grammar 8.1 as well as the corresponding states created by the Earley algorithm. \vspace{0.1cm} \noindent {\bf 8.2 A PS-Grammar Derivation of 10010\#101 in L$_{no}$} \begin{verbatim} derivation tree: derived strings: states: S / | \ 1.S1 1S1. 1 S 1 1S1 1.S / | \ 0.S0 0S0. 0 S 0 10S01 0.S / | 0.S0 0 S 100S01 0.S 0S. / | \ 1.S1 1S1. 1 S 1 1001S101 1.S / | 0.S0 0 S 10010S101 0.S 0S. | # 10010#101 #. \end{verbatim} \noindent In L$_{no}$ an Earley parser creates only two states for each terminal preceding \#, e.g., `1.S1' and `1.S'. Thus, if \# is preceded by $n$ terminals, the number of states is 2$n$ upon reaching \#. C-LAGs, on the other hand, use the different structure of a double ended queue. This structure is well suited for repetitions of any number, whereby the repetitions may be modified, e.g., inverted, doubled, halved, etc. C-LAGs are highly efficient for deterministic languages of any kind (not just context-free ones). They are also efficient for languages that have nonrecursive ambiguities and languages that have recursive ``single return'' ambiguities. Parsers in general, and C-LAGs in particular, are computationally highly complex, however, if the string to be repeated contains an unspecified number of symbols with alternative interpretations such that these interpretations are relevant in later repetitions. This is the characteristic property of $\cal NP$-hard languages, i.e., languages which take $\cal N$ondeterministic $\cal P$olynomial time for verification, and exponential time for recognition. An example of an $\cal NP$-hard language is 3SAT, an instance of Boolean satisfiability [cf. Hopcroft \& Ullman (1979), pp. 324ff]. Consider an arbitrary Boolean formula like 8.3: \vspace{0.1cm} \noindent {\bf 8.3 A Formula in 3SAT} \vspace{0.1cm} {\rm (x v \={y} v \={z}) \& (y v z v u) \& (x v z v \={u}) \& (\={x} v y v u) } \vspace{0.1cm} \noindent The sign `v' stands for logical $or$ (disjunction), the sign `\&' stands for logical $and$ (conjunction), and the bar above some of the variables, e.g., \={z}, stands for negation. 3SAT is a specialized Boolean language, restricted to conjunctions such that each conjunct consists of a disjunction with three disjuncts. The problem of {\it satisfying} formulas like 8.3 consists in finding a value assignment which makes the formula true (if such an assignment exists). This problem is computationally complex because the analysis has to remember potentially 2$^{n}$ many variable assignments. For example, at the first variable x, two hypotheses must be pursued, namely that x is 1 (true) and that x is 0 (false). At the next variable y, the analysis has to remember four possible value assignments, namely (x=1 y=0), (x=1 y=1), (x=0 y=1), and (x=0 y=0). Thus, each time a new variable is encountered, the number of possible assignments is doubled. The point is that in LA-grammar nondeterministic context-free L$_{no}$ and $\cal NP$-complete 3SAT are alike. Because C-LAGs are not based on the ``inverse pair'' structure of context-free languages, the C-LAG complexity of L$_{no}$ is the same as that of, e.g., context-sensitive L$^{3}_{no}$, defined as W'\#W\#W'' (where W' and W'' are noisy versions of W). The only way a C-LAG can handle L$_{no}$ is by giving two interpretations to each symbol preceding \#, one as a genuine symbol and one as a noise symbol. This results in an exponential number of readings (for the input preceding \#), each characterized by a different category. Assuming input 10010\#..., for example, one reading has the category (10010), representing the hypothesis that the pre-\# symbols are all `genuine'. Another reading has the category (1001), representing the hypothesis that the last pre-\# symbol is a noise symbol, etc. The syntactically ambiguous C3-LAG in 8.4 generates these different hypotheses systematically from left to right. \vspace{0.1cm} \noindent {\bf 8.4 Syntactically Ambiguous C3-LAG for L$_{no}$} LX =$_{def}$ \{(0(0)) (1(1)) (\#(\#))\} ST$_{S}$ =$_{def}$ \{[\{r-1, r-2, r-3, r-4\}(0)] [\{r-1, r-2, r-3, r-4, r-5\}(1)]\} \hspace{0.5cm} {\it Let} seg1 $\varepsilon$ \{0, 1\} {\it and} seg2 $\epsilon$ \{0, 1\}. r-1: [(seg1)(seg2)] $\Rightarrow$ [\{r-1, r-2, r-3, r-4, r-5\} $\epsilon$] r-2: [(seg1)(seg2)] $\Rightarrow$ [\{r-1, r-2, r-3, r-4, r-5\} (seg2)] r-3: [(X)(seg1)] $\Rightarrow$ [\{r-1 r-2, r-3, r-4, r-5\}(X)] r-4: [(X)(seg1)] $\Rightarrow$ [\{r-1 r-2, r-3, r-4, r-5\}(seg1 X)] r-5: [(X)(\#)] $\Rightarrow$ [\{r-4\}(X)] r-6: [(seg1 X)(seg1)] $\Rightarrow$ [\{r-4\}(X)] ST$_{F}$ =$_{def}$ \{[rp-6 $\epsilon$]\} \vspace{0.1cm} \noindent The grammar 8.4 is clearly a context-free C-LAG: no input pattern looks at more than one initial ss-category segment. It is not a C2-LAG, however, because it doesn't satisfy the single return principle. Consider the rules r-3 and r-4, which have identical input conditions. Rule r-3 ignores the category of the next word in the output category (i.e., treats the next word as a noise symbol), while r-4 adds the next word category at the beginning of the output category. Rule r-3 causes an ambiguity split by calling input compatible r-3 and r-4. Rule r-4 causes a similar ambiguity split. Furthermore, these ambiguities are recursive because both branches of the respective splits feed back into the rules causing the ambiguity. The nature of the language L$_{no}$ is such that the multiple recursive ambiguity cannot be eliminated without changing the generative properties of the grammar. Thus, L$_{no}$ is a counterexample to the claim [Hausser (1989)] that for any C-LAG there exists an equivalent C-LAG which parses in $n^{2}$. Another nondeterministic context-free C3-LAG is HCFL, the ``hardest context-free language'' [Greibach (1973)].\footnote{See also Harrison (1978), pp. 326ff.} HCFL is like L$_{no}$ in that it provides for ``noise'' symbols which may precede and follow the ``genuine'' symbol sequences. It remains to show that the class of C3-LAGs is $\cal NP$-complete. C3-LAGs clearly are within $\cal NP$ because C-LAGs are defined to verify in linear time---and {\it a fortiori} in polynomial time. To show that C3-LAGs are $\cal NP$-complete we define a C3-LAG for $\cal NP$-complete Subset Sum. Subset Sum is defined as the language y\#a$_{1}$\#a$_{2}$\#a$_{3}$\#...\#a$_{n}$\# such that y, a$_{1}$, a$_{2}$, ..., a$_{n}$ are all binary strings containing exactly the same number of digits. Furthermore, when viewed as binary numbers presented least significant digit first, y is equal to the sum of a subset of the a$_{i}$. \vspace{0.1cm} \noindent {\bf 8.5 Syntactically Ambiguous C3-LAG for Subset Sum} \vspace{0.1cm} LX =$_{def}$ \{(0(0)), (1(1)), (\#(\#))\} ST$_{S}$ =$_{def}$ \{[\{r-1 r-2\}(0)], [\{r-1 r-2\}(1)]\} \hspace{0.5cm} {\it Let} seg1 $\varepsilon$ \{0, 1\} r-1: [(X)(seg1)] $\Rightarrow$ [\{r-1, r-2\}(seg1 X)] r-2: [(X)(\#)] $\Rightarrow$ [\{r-3, r-4, r-6, r-7, r-12, r-14\}(\# X)] r-3: [(X seg1)(seg1)] $\Rightarrow$ [\{r-3, r-4, r-6, r-7\}(0 X)] r-4: [(X \#)(\#)] $\Rightarrow$ [\{r-3, r-4, r-6, r-7, r-12, r-14\}(\# X)] r-5: [(X seg1)(seg1)] $\Rightarrow$ [\{r-5, r-6, r-7, r-11\}(0 X)] r-6: [(X 1)(0)] $\Rightarrow$ [\{r-5, r-6, r-7, r-11\}(1 X)] r-7: [(X 0)(1)] $\Rightarrow$ [\{r-8, r-9, r-10\}(1 X)] r-8: [(X seg1)(seg1)] $\Rightarrow$ [\{r-8, r-9, r-10\}(1 X)] r-9: [(X 1)(0)] $\Rightarrow$ [\{r-5, r-6, r-7, r-11\}(0 X)] r-10: [(X 0)(1)] $\Rightarrow$ [\{r-8, r-9, r-10\}(0 X)] r-11: [(X \#)(\#)] $\Rightarrow$ [\{r-3, r-4, r-6, r-7, r-12, r-14\}(\# X)] r-12: [(X 0)(seg1)] $\Rightarrow$ [\{r-4, r-12, r-14\}(0 X)] r-13: [(X 0)(seg1)] $\Rightarrow$ [\{r-11, r-13, r-14\}(0 X)] r-14: [(X 1)(seg1)] $\Rightarrow$ [\{r-11, r-13 r-14\}(1 X)] ST$_{F}$ =$_{def}$ \{[rp-4 (X)]\} \vspace{0.1cm} \noindent The above C3-LAG copies y into the category, and then non-deterministically either does or does not subtract each $a_i$ from y. If the result of the subtraction is zero, it enters an accepting state, otherwise not.\footnote{Like all other LA-grammars defined in this paper, 8.5 has been tested as a parser. The rule system may be further simplified by collapsing rules.} The C3-LAG for Subset Sum resembles that for L$_{no}$ in that each $a_{i}$ may function either as noise or as a genuine subset. Note, however, that 8.5 is a context-sensitive C-LAG: rule r-4, for example, works both at the end and the beginning of the sentence start category. The C-LAG 8.4, on the other hand, is context-free because the categorial operations affect only the first segment of sentence start categories.\footnote{The C3-LAG for SAT is based on input strings where each disjunction is encoded as a vector, indicating for each variable if it is absent (a), present (p), or present negated (n). For example, if the input contains the variable w, x, y, and z, then (w v x) \& (z v \={y} v x) is represented as (ppaa)(apnp). These vectors are preceded by $k$ letters, where $k$ represents the number of distinct variables in the formula. The C3-LAG begins by assigning all possible value assignments to the initial $k$-letter sequence, resulting in $2^{k}$ different readings, e.g., 1111, 1110, 1101, 1100, etc. Then each assignment is tested left to right against each vector. For example, the assignment 1010 makes the first clause (disjunction) in (ppaa)(apnp) true, but the second false. As the grammar tests the variable values against the vector values, the vector values are discarded while the variable values are recycled and applied to the next disjunction.} \vspace{0.2cm} \begin{large} \noindent {\bf 9. Open Questions} \end{large} \vspace{0.1cm} \noindent At present C3-LAGs raise the following open question:. \begin{itemize} \item C3-LAGs generate $\cal NP$-hard context-sensitive languages such as Subset Sum and SAT, and are therefore $\cal NP$-complete. CS-recognition is known to be PSPACE complete [Hopcroft and Ullman (1979), p. 346,7]. Because a PSPACE-complete problem is not likely to be in $\cal NP$,\footnote{``Not only is a PSPACE-complete problem not likely to be in $\cal P$, it is also not likely to be in $\cal NP$. Hence the property whose existence is PSPACE-complete probably cannot even be {\it verified} in polynomial time using a polynomial length `guess'.'' (Gary and Johnson 1979:171).} the set of C-languages is likely to be a proper subset of the B-languages (CS-languages). What is an example of a context-sensitive language which is not a C-language? \item LA-grammars are closed under union, concatenation, and Kleene Closure [Hausser (1989), Theorem 6, p. 145]. It remains to be shown whether different types of LA-grammars constitute {\it abstract families of languages} in the sense of Ginsberg and Greibach (1969).\footnote{See also Hopcroft \& Ullman (1979), pp. 270ff.} \item There are syntactically ambiguous C-LAGs which can be translated into lexically ambiguous C-LAGs, and there are lexical ambiguities which can be simulated syntactically. Are syntactic and lexical ambiguities always intertranslatable? Do C-LAGs formally reflect the conceptual difference between lexical and syntactic ambiguity?\footnote{If the ambiguity of a language is intuitively lexical, then a syntactically ambiguous C-LAG will often require more rules than the corresponding lexically ambiguous C-LAG. On the other hand, if the ambiguity of a language is syntactic, then a lexically ambiguous C-LAG will often require more rules than the corresponding syntactically ambiguous C-LAG (as witnessed by 7.2 and 9.2).} \end{itemize} Regarding the last point, consider the following lexically ambiguous variant of the syntactically ambiguous C3-LAG 8.4 for L$_{no}$. \vspace{0.1cm} \noindent {\bf 9.1 A Lexically Ambiguous C3-LAG for L$_{no}$} \vspace{0.1cm} LX =$_{def}$ \{(0(0)), (1(1)), (0(3)), (1(3)), (\#(\#))\} ST$_{S}$ =$_{def}$ \{[\{r-1, r-2\}(0)] [\{r-1, r-2\}(1)]\} \hspace{0.5cm} {\it Let} seg1 $\varepsilon$ \{0,1\}. X' = nil $if$ X = 3, $and$ X' = X $otherwise$. r-1: [(X)(seg1)] $\Rightarrow$ [\{r-1, r-2, r-3\}(seg1 X')], r-2: [(X)(3)] $\Rightarrow$ [\{r-1, r-2, r-3\}(X')], r-3: [(X)(\#)] $\Rightarrow$ [\{r-4\}(X)] r-4: [(seg1 X)(seg1)] $\Rightarrow$ [\{r-4\}(X)] ST$_{F}$ =$_{def}$ \{[rp-4 $\epsilon$]\} \vspace{0.1cm} \noindent Note that 9.1 is not syntactically ambiguous because r-1, r-2, and r-3 have incompatible input conditions. Given the view of L$_{no}$ as a noise language, where certain words create confusion because of their homonymy with ``genuine'' words, the lexically ambiguous version 9.1 seems intuitively more natural than the syntactically ambiguous C3-LAG presented in 8.4. On the other hand, the nondeterministic context-free language WW$^{R}$ represents an instance of syntactic ambiguity (see the C2-LAG defined in 7.2), though 9.2 demonstrates the possibility of a lexically ambiguous C-LAG for this language:\footnote{The grammar 9.2 was also provided by David Applegate.} \vspace{0.1cm} \noindent {\bf 9.2 A Lexically Ambiguous C-LAG for WW$^{R}$, W $\varepsilon$ \{a, b\}$^{+}$ } \vspace{0.1cm} LX =$_{def}$ \{(a(0)), (b(1)), (a(a)), (b(b))\} ST$_{S}$ =$_{def}$ \{[\{r-1, r-2\} (0)], [\{r-1, r-2\} (1)]\} r-1 [(X)(0)] $\Rightarrow$ [\{r-1, r-2, r-3, r-4\}(0 X)] r-2 [(X)(1)] $\Rightarrow$ [\{r-1, r-2, r-3, r-4\}(1 X)] r-3 [(0 X)(a)] $\Rightarrow$ [\{r-3, r-4\}(X)] r-4 [(1 X)(b)] $\Rightarrow$ [\{r-3, r-4\}(X)] ST$_{F}$ =$_{def}$ \{[rp-3 $\epsilon$], [rp-4 $\epsilon$]\} \vspace{0.1cm} \noindent 9.2 uses the rules r-1, r-2 and the category segments in \{0, 1\} for W, and rules r-3, r-4 and the category segments in \{a, b\} for W$^{R}$. As soon as r-3 or r-4 fire in a derivation, r-1 and r-2 cannot be reapplied. Furthermore, r-3 and r-4 ensure the proper derivation of W$^{R}$ by cancelling `0' with `a' and `1' with `b'. The C-LAG 9.2 is not syntactically ambiguous because all rules have incompatible input conditions.\footnote{In the analysis of natural language, the distinction between syntactic and lexical ambiguities seems to be fairly clear intuitively. But within the formalism of C-LAGs the relation between syntactic and lexical ambiguities is more complicated than previously thought. Reliance on linguistically motivated concepts resulted in the claim that for any C-LAG there exists an equivalent one which parses in $n^{2}$ (Hausser 1989). The language L$_{no}$ demonstrated the necessity to distinguish between C-LAGs in general and C-LAGs which parse efficiently. At the same time it was shown by David Applegate that the class of C-LAGs is much larger than previously thought: the C1-LAGs include the indexed language $a^{i!}$ and the C3-LAGs are $\cal NP$-complete, including Subset Sum and 3SAT.} \vspace{0.2cm} \begin{large} \noindent {\bf 10. Comparing the CFGs and the C-LAGs} \end{large} \vspace{0.1cm} \noindent In computer science, the class of context-free languages has been thoroughly studied in the last 30 years, yielding many interesting formal and practical results. However, ``It is no secret that context-free grammars are only a first order approximation to the various mechanisms used for specifying the syntax of modern programming languages'' [Ginsberg 1980, p.7].\footnotemark \footnotetext{See also Harrison 1978, pp. 219ff.} Similar considerations hold for applications to natural languages analysis. Because the generative power of context-free grammars is too weak for the description of natural languages, context-free ``base''-grammars have been augmented with powerful mechanisms which increase the generative capacity far beyond that of the context-free grammars. For example, standard transformational grammar is recursively enumerable [Peters \& Ritchie (1973)] and LFG and GPSG are $\cal NP$-complete [Barton et al. (1987)]. For these reasons an alternative grammar formalism better suited for specifying the syntax of programming as well as natural languages is of considerable theoretical and practical interest. However, up to now alternative formalisms have either turned out to be weakly equivalent to the context-free phrase structure grammars (e.g., categorial grammar), or the new systems have been defined as {\it extensions} of context-free phrase structure grammars, based on the use of certain additional formal devices (e.g., the tree adjoining grammars and the index grammars). Thereby the class of confree-languages constitutes a proper subset of the class of tree adjoing languages (TALs), and the class of TALs constitutes a proper subset of the class of index languages.\footnote{Also, the parsing efficiency of these extensions of context-free PS-grammar is too low to be of practical interest.} This raises the question, how ``natural'' is the class of context-free languages? In formal language theory, classes of languages are defined in terms of formal grammar types, and formal grammar types are distinguished in terms of different properties of the underlying mathematical formalism. Within phrase structure grammar, the class of context-free languages is defined in terms of restrictions (``left-hand sides of the rewriting rules must consist of single non-terminals'') which are natural only within this particular formalism. If we change the formalism, we get other types of natural restrictions which in turn yield other classes of languages. In LA-grammar, for example, the language classes defined by the C-1, C-2, and C-3 LAGs are defined in terms of whether or not ambiguities are recursive, and whether recursive ambiguities are ``single return'' or not. The resulting classification of languages is orthogonal to the context-free languages and their extensions, e.g., the tree-adjoining languages and the index languages. This alternative classification of languages provides a new perspective on formal language theory. At the same time LA-grammar provides for a straightforward formal reconstruction of the PS-grammar hierarchy (as shown in Section 4). Thus, the many results proved on the basis of PS-grammars remain directly available in LA-grammar. How does the LA-grammar hierarchy compare intuitively and computationally with the Chomsky hierarchy? To answer this question we must look at the languages which are being classified. Formal languages, like natural languages, exist independently of particular formal grammars, or grammar formalisms. For example, $a^{n}b^{n}$, $a^{n}b^{n}c^{n}$, WW$^{R}$, and WW are names for formal languages. The notation of these names provides an informal description of the languages which is independent of the PS-grammar or the LA-grammar formalism. >From a pretheoretical point of view, one would be inclined to class $a^{n}b^{n}$ with $a^{n}b^{n}c^{n}$ (as well as $a^{n}b^{n}c^{n}d^{n}$, etc.), and WW$^{R}$ with WW (as well as WWW, etc.). It is always surprising to the beginning student that the Chomsky hierarchy puts $a^{n}b^{n}$ with WW$^{R}$ into one class (context-free) that parses relatively efficiently, but puts $a^{n}b^{n}c^{n}$ with WW into another class (context-sensitive) that is PSPACE-complete. The LA-grammar hierarchy is intuitively more natural because it classifies $a^{n}b^{n}$ with $a^{n}b^{n}c^{n}$ (and $a^{n}b^{n}c^{n}d^{n}$, etc.) as C1-languages, and WW$^{R}$ with WW (and WWW, etc.) as C2-languages. Furthermore, we have argued that L$_{no}$ and HCFL share structural properties with SAT and Subset Sum. From the viewpoint of LA-grammar, the parsing of L$_{no}$ and HCFL in $n^{3}$---using a conventional PS-grammar based parser like Earley's or CYK---is just as much an artifact of the PS-grammar formalism as the exclusion of $a^{n}b^{n}c^{n}$, $a^{2^{i}}$, $a^{k}b^{m}c^{k \cdot m}$, $a^{i!}$, etc., from linear parsing, and the exclusion of WW, WWW, etc., from $n^{2}$ parsing. The correlation of the context-free languages and the C-1, C-2, and C-3 languages is summarized in the following diagram. %\pagebreak \noindent {\bf 10.1 Relating the Context-Free Languages to the C-Languages} \vspace{-1.5cm} \setlength{\unitlength}{1in} \begin{picture}(6,7) \put(2.85,3.1){\oval(4.5,4.6)} \put(2.65,2.95){\oval(3.1,3.1)} \put(2.5,2.9){\oval(1.6,1.9)} \put(1.95,3.6){$a^{n} b^{n}$} \put(1.95,3.25){$a^{n} b^{n} c^{n}$} \put(1.95,2.9){$a^{n} b^{n} c^{n} d^{n} e^{n}$} \put(1.95,2.55){$a^{2^{n}}$} \put(1.95,2.2){$a^{n!}$} \put(2.8,2.15){\large\bf C-1} \put(3.7,1.65){\large\bf C-2} \put(4.6,1.05){\large\bf C-3} \put(2.1,4.35){\oval(.7,1.7)} \put(1.9,4.15){WW$^{R}$} \put(2.8,4.15){WW} \put(3.5,4.15){WWW} \put(1.9,4.75){HCFL} \put(2,4.95){L$_{no}$} \put(2.9,4.9){3-SAT} \put(3.8,4.9){SUBSET-SUM} \put(1.9,5.9){Context-Free Languages} \multiput(2.1,5.87)(0,-.14){5}{\line(0,-1){.1}} \end{picture} \vspace{-1cm} \noindent The formal system of C-LAGs greatly expands the set of deterministic CF-languages (C1-LAGs, linear time), and the set of ``reasonable'' non-deterministic CF-Languages (C2-LAGs, square time). It also results in grouping context-free L$_{no}$ and HCFL with SAT and Subset Sum in the class of C3-LAGs, which is $\cal NP$-complete and parses in exponential time. \vspace{0.1cm} \begin{large} \noindent {\bf Concluding Remark} \end{large} \vspace{0.1cm} \noindent Analyzing the basic mathematical and computational properties of LA-grammar and comparing it with PS-grammar is of theoretical interest within formal language theory. The formal simplicity, computational efficiency, and mathematical power of LA-grammar will be relevant in a much wider context, however, if the formalism in question is well-suited for practical applications. In the long run, the interest of the formal properties of a grammar type is proportional to the usefulness of the grammar in a wide variety of different tasks, such as linguistic description, computational processing, and psychological explanation. The suitability of LA-grammar for the description of natural language has been shown in extensive analyses of English and German syntax [Hausser (1986)], as well as in smaller systems of French, classical Latin, Polish, and Japanese. These applications demonstrate that LA-grammar handles the different types of structures characteristic of natural language (word order, agreement, long distance dependencies, center embedding, etc.) in a simple and linguistically well-motivated manner. Regarding computational processing, LA-grammars are easy to parse because the rules of the grammar are used directly (``absolute type transparency''). Consequently, analysis of the processing provides heuristics for simplifying the grammar, and improvements in the grammar translate into faster processing. LA-grammars for natural (as well as formal) languages are easy to write, to extend, and to debug because the derivation of sentences is based on the principle of possible continuations. >From a psychological viewpoint, finally, the most basic property of LA-grammar is its strictly time-linear approach to language, providing a formalism that is input-output equivalent with the speaker-hearer. Given that the computation of possible continuations is linguistically and psychologically well-motivated, the nice computational properties of LA-grammar provide additional incentive to further explore this new approach to natural and formal languages. \newpage \begin{large} \noindent {\bf References} \end{large} \def \beginref {\par \begingroup \ifdim \lastskip < \medskipamount \medskip \fi \leftskip =\parindent \parindent =-\parindent \parskip =\medskipamount } \def \endref{\par \endgroup } \beginref \begin{small} Ajdukiewicz, K. \ (1935) ``Die syntaktische Konnexit\"{a}t,'' {\it Studia Philosophica}, 1:1-27. Barton, G.E., R.C. Berwick, and E.S. Ristad \ (1987) {\it Computational Complexity and Natural Language}. The MIT-Press, Cambridge, Massachusetts. Berwick, R.C., and A.S. Weinberg \ (1984) {\it The Grammatical Basis of Linguistic Performance: Language Use and Acquisition}. The MIT-Press, Cambridge, Massachusetts. Book, R.V. (ed.) \ (1980) {\it Formal Language Theory: Perspectives and Open Problems}. Academic Press, New York. Earley, J. \ (1970) ``An Efficient Context-Free Parsing Algorithm,'' {\it CACM} 13(2):94-102. Gary, M. and D. Johnson \ (1979) {\it Computers and Intractibility}. San Francisco: W.H. Freeman and Co. Ginsberg, S. \ (1980) ``Formal Language Theory: Methods for Specifying Families of Formal Languages---Past-Present-Future,'' in Book (1980), pp. 1-22.. Ginsberg, S. and S. A. Greibach \ (1969) ``Abstract Families of Languages,'' {\it Memoirs of the American Mathematical Society}, 87. Greibach, S.A. \ (1973) ``The Hardest Context-Free Language,'' {\it SIAM Journal of Computing}, 2, pp. 304-310. Harrison, M.A. \ (1978) {\it Introduction to Formal Language Theory}. Addison-Wesley Publishing Company, Reading, Massachusetts. Hayashi, T. \ (1973) ``On derivation trees of indexed grammars---an extension of the {\rm uvwxy} theorem,'' {\it Publications of the Research Institute for Mathematical Sciences 9:1}, pp. 61-92. Hausser, R. \ (1986) {\it NEWCAT: Parsing Natural Language Using Left-Associative Grammar}. Springer-Verlag Berlin--New York (Lecture Notes in Computer Science 231). Hausser, R. \ (1989) {\it Computation of Language}. Springer-Verlag Berlin--New York (Symbolic Computation -- {\it Artificial Intelligence}). Hopcroft, J.E., and Ullman, J.D. \ (1969) {\it Formal Languages and their Relation to Automata}. Addison-Wesley Publishing Company, Reading, Massachusetts. Hopcroft, J.E., and Ullman, J.D. \ (1979) {\it Introduction to Automata Theory, Languages, and Computation}. Addison-Wesley Publishing Company, Reading, Massachusetts. Joshi, A. \ (1987) ``Intro to TAG'' in A. Manaster-Ramer (ed.) {\it Mathematics of Language}. 1987, John Benjamins, Amsterdam. Le\'{s}niewski, S. \ (1929) ``Grundz\"{u}ge eines neuen Systems der Grundlage der Ma\-thematik,'' {\it Fundamenta Mathematicae}, Warsaw. Peters, S., and Ritchie, R. \ (1973) ``On the Generative Power of Transformational Grammar,'' {\it Information and Control}, 18:483-501. Post, E. \ (1936) ``Finite Combinatory Processes---Formulation I,'' {\it Journal of Symbolic Logic}, I. \end{small} \endref \end{document} morph.tex: --- \documentstyle[11pt]{cmu-art} \begin{document} \begin{large} \noindent PRINCIPLES OF \vspace{0.1cm} \noindent COMPUTATIONAL MORPHOLOGY \end{large} \vspace{10pt} \noindent {\em ROLAND HAUSSER} \vspace{18pt} \begin{small} \begin{quote} This paper describes three basic methods of lexical lookup during syntactic parsing, and presents a new approach to morphological analysis, called LA-MORPH. It is based on the algorithm of Left-Associative Grammar, which has been applied previously only to the {\em syntactico-semantic} analysis of natural (and formal) languages. Apart from the left-associative algorithm, LA-MORPH differs from other systems of word recognition because it separates the generation of allomorphs, performed automatically during the loading of the lexicon, from the concatenation of morphemes, performed during run-time lookup. This design results not only in optimized space- and time-efficiency, but relates to a fundamental question of traditional morphology, namely, which aspects of word formation are based on stored forms and which are based on derivational procedures. \end{quote} \end{small} \noindent Morphology, or the theory of word structure, has been approached from many different viewpoints: diachronically, typologically, and as part of particular formal approaches to language (e.g., transformational grammar). {\em Computational morphology} analyzes word forms for the purpose of natural language processing. The goals are a space-efficient storage of words, a time-efficient lookup of word forms, ease of adding new words to the lexicon, and ease of writing grammars for morphological analysis. We will begin by describing the basic computational notions and design alternatives (as opposed to going from various models of traditional generative morphology to corresponding systems in computational morphology). In the context of explaining the operations of LA-MORPH in some detail, theoretical alternatives of traditional generative morphology will be discussed insofar as they relate to the computational issues at hand. The background of these discussions are the formal methods of traditional generative morphology, the general system of LA-grammar, and its relation to other systems of linguistic analysis, most of which will have to remain implicit. Section 16, however, presents a direct, high-level comparison of different models of computational and traditional generative morphology, and discusses whether computational efficiency should be admissible as evidence for or against a particular model of generative morphology. The theoretical and methodological aspects of the present system, LA-MORPH, evolved in the context of building a comprehensive English lexicon for parsing large databases.\footnote{The LA-MORPH system was designed between January 19, and June 8, 1989, while the author was working as a research scientist at the Laboratory for Computational Linguistics, Carnegie Mellon University, Pittsburgh, Pennsylvania. The Common Lisp implementation was written with the expert help of Todd Kaufmann, research programmer at the Center for Machine Translation of CMU. A preliminary draft of this paper benefited from comments by Brian MacWhinney, Mitzi Morris, Virginia Swisher, and David Weber. The Common Lisp implementation of LA-MORPH was translated into C by Carolyn Ellis. In April of 1990, Ms. Ellis and the author developed an abstract notation for LA-MORPH, based on regular expressions. Subsequently, Ms. Ellis wrote a C-program which automatically interprets the grammar rules in their abstract notation. In September of 1990 the paper was revised to accommodate suggestions by the reviewers and to include a LA-MORPH grammar for English written in the notation of regular expressions. The author is especially indebted to Brian MacWhinney and his team at CMU for providing the support that made the continuing evolution of LA-MORPH possible.} For this application it was most practical to base the morphological analysis of English on {\em orthographic} rather than the customary {\em phonetic} representations. However, the method of LA-MORPH is equally applicable to spoken language. Furthermore, the issues discussed in this paper do not rest on the assumption of a specific language medium. Beyond the practical task of achieving efficient, large-scale lexical lookup, LA-MORPH was motivated by the linguistic question of how well the general approach of LA-grammar\footnote{See Hausser (1989) for detailed analyses of the formal, linguistic, psychological, and philosophical aspects of LA-grammar.} would apply to morphology. Virtually all models of generative morphology have been based on the formalism of phrase structure grammar. Phrase structure grammars are linguistically motivated in terms of {\em possible substitutions}. In contrast, LA-grammar is based on the notion of {\em possible continuations}, and combines the principle of surface compositionality (Hausser 1984) with a strictly time-linear derivation order. The initial design of a left-associative morphology for English turned out to be straightforward. Furthermore, applying the system to the morphology of other languages proved to be very easy. In addition to English, LA-MORPH has so far been applied to Latin, German, French, Spanish, and Japanese. While the lexical lookup provided by LA-MORPH is compatible with syntactic and semantic applications of LA-Grammar (illustrated in 1.1 below), it can be used also with other systems of natural language parsing. For example, LA-MORPH systems are currently used at Carnegie Mellon University\footnote{The ALICE project, which deals with computer aided foreign language tutoring, is headed by David Evans, Laboratory for Computational Linguistics, and Lori Levin, Center for Machine Translation.} in conjunction with a Tomita parser.\footnote{Tomita (1986). This application of the Tomita parser is based on an LFG-style unification grammar.} \vspace{0.2cm} \begin{large} \noindent {\bf 1. Lexical Lookup During Parsing} \end{large} \vspace{0.1cm} \noindent One basic application of computational morphology is {\em lexical lookup.} As the parser reads in the sentence word by word, each word has to be assigned a syntactic {\em category}, and---in semantically interpreted systems---a {\em meaning}. The categories and meanings of the word forms are used by the parser to build a syntactic and a semantic analysis of the sentence. As an example, consider the left-associative derivation in 1.1, which shows the syntactico-semantic processing of a sentence (without displaying the details of lexical lookup). \vspace{0.2cm} \noindent {\bf 1.1 } {\em Example of a Syntactico-Semantic Parse in LA-Grammar} \begin{footnotesize} \begin{verbatim} (z Fido found a bone.) Linear Analysis: *START_0 1 (NA) FIDO (N A V) FOUND *NOM+FVERB_3 2 (A V) FIDO FOUND (SQ) A *FVERB+MAIN_4 3 (SQ V) FIDO FOUND A (SN) BONE *DET+NOUN_2 4 (V) FIDO FOUND A BONE (V DECL) . *CMPLT_13 5 (DECL) FIDO FOUND A BONE . Hierarchical Analysis: (PROPOSITION-5_6_3 (MOOD (DECLARATIVE-5_6_3)) (PROP-CONTENT ((SENT-2_6_3 (SUBJ ((NP-1_6_3 (NAME (FIDO-1_6_3))))) (VERB (FIND-2_6_3)) (DIR-OBJ ((NP-3_6_3 (REF (A-3_6_3 SG-4_6_3)) (NOUN ((BONE-4_6_3)))))))))) PROPOSITION | ----------------- | | MOOD PROP-CONTENT | | | | | | DECLARATIVE SENT | -------------- | | | SUBJ VERB DIR-OBJ | | | | | | | | | NP FIND NP | | | ------ | | | NAME REF NOUN | | | | ---- | | | | | FIDO A SG BONE \end{verbatim} \end{footnotesize} \noindent In 1.1 lexical lookup characterizes, e.g., {\em Fido} and {\em found} as ``{\tt (NA) FIDO}'' and ``{\tt (N A V) FOUND},'' respectively. The category {\tt (NA)} indicates that {\em Fido} is lexically analyzed as a `proper NAme', while the category {\tt (N A V)} characterizes {\em found} as a verb (V) which takes a nominative (N) and an accusative (A) as its valencies. The categories of the word forms are retrieved from the on-line lexicon of the parser. Prior to lexical lookup words like {\em Fido}, {\em found}, etc., are simply letter sequences without a syntactic (or semantic) characterization. The parsing algorithm presupposes lexical lookup because it is based on grammatical rules which refer to the categories of the input expressions. The ``Linear Analysis'' in the above parse illustrates syntactic analysis within the framework of LA-grammar. LA-grammar computes {\em possible continuations}, always combining a {\em sentence start} and a {\em next word} into a new sentence start. Schematically, a left-assocative rule has the following form: \noindent {\bf 1.2 } {\em Schema of a Syntactic Rule in LA-grammar} \begin{quote} r$_{i}$: [CAT-1 CAT-2] $\Rightarrow$ [rp$_{i}$ CAT-3] \end{quote} %\pagebreak \noindent The mapping of CAT-1 and CAT-2 into CAT-3 is called the categorial operation of an LA-rule. The categorial operation is specified in terms of patterns which match the input expressions. The rule package rp$_{i}$ contains the rules which apply to the next ordered pair, consisting of the result of the categorial operation of r$_{i}$ (treated as a new sentence start) and a new ``next word''. In 1.1, for example, the sentence start {\tt FIDO FOUND A} of category {\tt (SQ V)} and the next word {\tt BONE} of category {\tt (SN)} are combined by the rule DET+NOUN into the new sentence start {\tt FIDO FOUND A BONE} of category {\tt (V)}. At this point, the rule package of DET+NOUN is applied to the sentence start {\tt FIDO FOUND A BONE} and the next word (which happens to be ``.''). One of the rules in the rule package of DET+NOUN is CMPLT, which happens to fire in the next left-associative composition (cf. 1.1). A left-associative parse terminates if there is no ``next word'' (at the end of the input), or if at a certain composition no categorial operation accepts the input (failure of the parse). The left-associative rule DET+NOUN instantiates the schema (1.2) as follows: \vspace{0.2cm} \noindent {\bf 1.3 } {\em Instantiation of an LA-rule} \begin{quote} DET+NOUN: [(SQ V) (SN)] $\Rightarrow$ [\{... CMPLT ...\} (V)] \end{quote} Simultaneously with the linear syntax, LA-grammar constructs a (semantic) hierarchy which characterizes the grammatical relations among elements. Many different forms of semantic interpretation are possible. In the current system (cf. ``Hierarchical Analysis'' in 1.1), the semantic hierarchy is generated as a tree resembling a constituent structure.\footnote{As shown in (1.1), the semantic hierarchy may be displayed as a structured list, or---equivalently---as a tree.} This tree is homomorphic to the linear syntax in the following sense: Each word of the surface is mapped into a semantic subtree, and for each left-associative composition there is a semantic operation combining the subtree of the sentence start with the subtree of the next word. Thus a linear syntax is combined with a homomorphic hierarchical semantics.\footnote{Note that the semantic hierarchy in 1.1 uses the form {\tt FIND} rather than {\tt FOUND}. Again, this information is provided by the lexicon of the parser.} \vspace{0.2cm} \begin{large} \noindent {\bf 2. The Data Structure of the Lexicon} \end{large} \vspace{0.1cm} \noindent The lexical component of the parser {\em recognizes} word forms by retrieving the full entry from an on-line lexicon. Let us write each lexical entry as an ordered triple, consisting of the surface, the category, and the stem of the word form. \vspace{0.1cm} \noindent {\bf 2.1} {\em Schema of an Analyzed Word Form} \begin{quote} (``SURFACE'' (CATEGORY) STEM) \end{quote} \noindent The surface (in double quotes)\footnote{In a programming language like LISP, the double quotes are needed in the recognition of certain word forms, such as English genitive nouns like {\em book's} and {\em books'}. Without the double quotes, the apostrophe is mistakenly interpreted as the LISP function ``quote''. Francis and Ku\v{c}era (1982) define a ``graphic word'' as ``a string of continuous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks.''} is the concrete representation of the word form in a sentence. The category specifies the combinatorial properties of the word form.\footnote{Schema 2.1 imposes no restriction on the kind of category system. The illustration of LA-morph will be based on the categories characteristic of left-associative grammar.} The stem relates different word forms of the same paradigm and is the key to the semantic analysis of the word. Different word forms may have the same surface. An example is {\em bears}, which can be a noun denoting animals and a verb with a meaning similar to {\em carries}: %\pagebreak %\vspace{0.1cm} \noindent {\bf 2.2} {\em Different Word Forms with the Same Surface} \begin{verbatim} ("bears" (PN) BEAR1) ("bears" (S3 A V) BEAR2) \end{verbatim} \noindent The category {\tt (PN)} stands for ``plural noun'', while {\tt (S3 A V)} represents a verb (V) which takes a third person singular nominative (S3) and an accusative (A). The LA-grammar categories used here are defined as lists of category segments. The category {\tt (PN)}, for example, consists of a single segment, while {\tt (S3 A V)} consists of three segments. The stems BEAR1 and BEAR2 are derived from the ``representative forms'' of the respective paradigms; since these representative forms happen to be identical, they are distinguished by numbers. We assume that the two word forms are distinguished semantically, e.g., in terms of a semantic hierarchy or network such that the semantic node (frame) BEAR1 is dominated by the node ANIMAL, while the frame BEAR2 is dominated by the node ACTIVITY. Word forms with different surfaces may have the same stem. Consider for example the paradigm of the verb {\em go}: \vspace{0.1cm} \noindent {\bf 2.3} {\em Different Word Forms with the Same Stem} \begin{verbatim} ["go" (NOM V) GO] ["goes" (S3 V) GO] ["going" (B) GO] ["went" (N V) GO] ["gone" (HV) GO] \end{verbatim} \noindent Here the stems provide the common denominator of the different forms of a paradigm. For example, in a suppletive form like {\em went}, the stem is the only direct link to the other forms in the paradigm. The tense differences are not indicated in the respective stems of 2.3 because they can be inferred from the syntactic categories. The analysis of word forms illustrated in 2.2 and 2.3 is designed to support two kinds of lexical lookup, called surface-lookup and stem-lookup. Surface-lookup is used during the analysis (interpretation) of a sentence, and provides all the different readings of a given surface. Stem-lookup is used during the generation of a sentence, and provides all the different analyzed surfaces associated with a given stem. \vspace{0.1cm} \noindent {\bf 2.4} {\em Surface Lookup versus Stem Lookup} \begin{verbatim} bears bear bear's bears bears' go goes going went gone / \ | | | | | | | | | / \ | | | | | | | | | (PN) (S3 A V) (SN) (GN) (PN) (GN) (NOM V) (S3 V) (B) (N V) (HV) | | \ / \/ BEAR1 BEAR2 BEAR1 GO \end{verbatim} \noindent Consider analysis of a sentence containing the word form {\em bears}. Surface lookup relates this form to two different lexical readings, with different syntactic categories and semantic stems. Which reading is the correct one depends on the syntactic, semantic, and pragmatic context of the word form. During generation, on the other hand, stem lookup relates a (noun) stem like {\tt BEAR1} to all related surface forms. Which surface form (and associated category) is appropriate depends on the syntactic context, and certain semantic aspects to be expressed (e.g., number). For a stem like {\tt GO}, stem lookup provides even suppletive surface forms like $went$. In addition to its function as the common denominator of a word paradigm, the stem is used as the key to access the semantic analysis of a word.\footnote{Please note that our use of the terms {\em stem} and {\em paradigm} does not imply a restriction to {\em inflectional} morphology only.} \vspace{0.1cm} \noindent {\bf 2.5} {\em Schematic Correlation of Syntax and Semantics} \begin{verbatim} .......... : /\ : : / \ : : / \ : : ------ : ..................:....... : : (surface (cat) : stem): : :.................:......: : syntax : /\ : : / \ : : / \ : : ------ : :........: semantics \end{verbatim} \noindent Because the semantic analysis of the stem is orthogonal to its grammatical function relating to lexical lookup, the semantic analysis of a stem can be developed independently of the grammatical aspect of word analysis. For example, the common meaning of the different forms of {\em go} may be characterized by the location of their common stem in a semantic hierarchy or network (e.g., {\em is-a} hierarchy, {\em is-part-of} hierarchy, etc.). The two different readings of the form {\em bear}, on the other hand, are differentiated in terms of their distinct stems, which relate to different concepts in the semantic hierarchy. \vspace{0.2cm} \begin{large} \noindent {\bf 3. A Core Lexicon versus a Full-Form Lexicon} \end{large} \vspace{0.1cm} \noindent The word form analysis provided by lexical lookup is constructed (at least in part) from the information contained in the lexical entries of an on-line lexicon. Regarding these lexical entries, we distinguish two basic lexical formats: the {\em core} format and the {\em full-form} format. These two formats coincide in the case of closed-class items, e.g., function words like {\em the}, {\em and}, or {\em above}, but differ in the case of open class items which exhibit paradigmatic variation in terms of inflection and/or derivation. In a core lexicon the different word forms of a paradigm are represented by a single lexical entry, constituting the {\em representative form} of the paradigm. As an example consider 3.1: \vspace{0.1cm} \noindent {\bf 3.1} {\em The Core Lexicon for} {\bf boy}, {\bf learn}{\em , and} {\bf happy} \begin{verbatim} ("boy" (SN) BOY) ("learn" (NOM SC V) LEARN) ("happy" (ADJ) HAPPY) \end{verbatim} \noindent Following traditional grammar, the representative form of a noun is given in 3.1 as the (non-genitive) singular, the representative form of a verb is given as the infinitive (which in English coincides with non-third-person singular present tense), and the representative form of an adjective-adverb is given as the adjectival standard.\footnote{The category {\tt (SN)} stands for `singular noun'. The category {\tt (NOM SC V)} represents a verb (V) which takes a non-third-person-singular nominative (NOM) and a subclause (SC) or (accusative) noun phrase as valencies.} The full-form lexicon corresponding to 3.1 is given in 3.2. \vspace{0.1cm} \noindent {\bf 3.2} {\em The Full-Form Lexicon for {\em boy}, {\em learn}, and {\em happy} in 3.1} \begin{verbatim} ("boy" (SN) BOY) ("boy's" (GN) BOY) ("boys" (PN) BOY) ("boys'" (GN) BOY) ("learn" (NOM SC V) LEARN) ("learns" (S3 SC V) LEARN) ("learned" (N SC V) LEARN (HV SC) LEARN) ("learning" (B SC) LEARN) ("learner" (SN) LEARN) ... ("happy" (ADJ) HAPPY) ("happily" (ADV) HAPPY) ("happier" (CAD) HAPPY) ("happiest" (SAD) HAPPY) \end{verbatim} The core format has the advantage of brevity. This allows not only the loading of a much larger\footnote{Larger than a corresponding full-form lexicon in the sense of characterizing more core entries or words.} lexicon into a given run-time memory, but also facilitates the work of the lexico\-grapher, who has to enter only one form per paradigm. Of course, a core lexicon requires a morphology grammar to derive all the other forms of the paradigm from the representative form. But from a theoretical viewpoint this is desirable because the required morphology grammar constitutes a rule system which states all the relevant linguistic generalizations explicitly. The full-form format, on the other hand, presents the lexical information in the form which is required by the parser. But a full-form lexicon uses considerably more space than a corresponding core lexicon. Furthermore, because there is no room for a morphology grammar, there is no explicit statement of the morphological regularities and irregularities of the language in question.\footnote{In theory, the lookup relative to a full-form lexicon should be fast because there is no run-time computation of different word forms. But in practice the large size of a full-form lexicon may result in frequent swapping, resulting in a considerable slow-down of real-time lookup.} \vspace{0.2cm} %\vspace{0.2cm} \vspace{0.2cm} \begin{large} \noindent {\bf 4. The Structure of Lexical Lexical Lookup} \end{large} \vspace{0.1cm} \noindent Given the notions of a core lexicon and a corresponding full-form lexicon, the task of computational morphology may be described as mediating efficiently between these two formats. What are the options in designing a time- and space-efficient lexical lookup? The lookup of a word form during parsing is dependent on three different levels of representation and two different mappings between them. \vspace{0.1cm} \noindent {\bf 4.1} {\em Three Different Levels of a Lexical Entry} \begin{tabbing} xx\=xxxxxxxxxxxxxxx\=\kill \>{\bf Level III}: \>The lexical entry returned by the run-time lookup\\ \\ xx\=xxxxxxxxxxxx\=\kill \> $\Uparrow$ {\em Mapping B}: \>Retrieving the lexical entry from the run-time buffer\\ \\ xx\=xxxxxxxxxxxxxxx\=\kill \>{\bf Level II}: \>The lexical entry in the run-time buffer\\ \\ xx\=xxxxxxxxxxxx\=\kill \> $\Uparrow$ {\em Mapping A}: \>Loading the file (or an entry) into run-time memory\\ \\ xx\=xxxxxxxxxxxxxxx\=\kill \>{\bf Level I}: \>The lexical entry in the off-line file \end{tabbing} \noindent These three levels follow from the fact that information is written into an off-line memory (e.g., a disc or a tape) for long term storage (level I), but must be loaded (mapping A) into run-time memory (i.e., RAM) for actual use (level II). Analyzed word forms (level III) are retrieved from the run-time memory (level II) by means of an index function (mapping B). Given the distinction between off-line storage and run-time storage, and between loading into run-time memory and retrieval from run-time memory, we are especially interested in the efficiency of the components which are used during parsing. At run-time, only level II is crucial for space efficiency and only mapping B is crucial for time efficiency.\footnote{New, fast disc access makes it possible to retrieve single lexical entries from the disc during run-time. In such systems mapping B and mapping A are performed at the same time. But disc access is still considerably slower than RAM access (i.e., lookup from level II, based on loading the whole lexicon file prior to parsing).} On the other hand, off-line storage efficiency (level I) and the time it takes to load the file into the run-time buffer (mapping A) do not negatively affect run-time efficiency. Because lexical lookup of a word form involves two different mappings, there are two opportunities to modify the lexical entry in a linguistically and computationally useful way. As a simple illustration of possible modifications consider the following example. \pagebreak %\vspace{0.1cm} \noindent {\bf 4.2} {\em Example of Possible Modifications During Lookup of an Entry} \begin{tabbing} xx\=xxxxxxxxxxxxxxx\=lexical entry written byxxxx \=\kill \>{\bf Level III}: \> (``house'' (SN) HOUSE) \> {\em Stem added by mapping B}\footnotemark\\ \\ xx\=xxxxxxxxxxxx\=\kill \> $\Uparrow$ {\em Mapping B}: \> Retrieving the lexical entry from the run-time buffer\\ \\ xx\=xxxxxxxxxxxxxxx\=\kill \>{\bf Level II}: \> (``house'' (SN)) \> {\em Category changed to upper case by mapping A}\\ \\ xx\=xxxxxxxxxxxx\=\kill \> $\Uparrow$ {\em Mapping A}: \> Loading the entry into run-time memory\\ \\ xx\=xxxxxxxxxxxxxxx\=\kill \>{\bf Level I}: \> (``house'' (sn)) \> {\em Lexical entry written by lexicographer} \end{tabbing} \footnotetext{The stem supplied by mapping B is derived from the surface by deletion of quotes and change into upper case.} \noindent In this simple example, the lexical entry (``house'' (sn)) is retrieved in the form (``house'' (SN) HOUSE). Depending on whether or not mapping A and/or B are defined as the identity mapping,\footnote{A mapping is called the identity mapping if it does not modify the input in any way.} we may distinguish three basic types of lexical lookup. These are called (i) core lookup, (ii) allomorph lookup, and (iii) full-form lookup, and are described in the following section. %\vspace{0.2cm} \vspace{0.2cm} \begin{large} \noindent {\bf 5. Three Basic Methods of Lexical Lookup} \end{large} \vspace{0.1cm} \noindent The first possibility, core lookup, uses a core lexicon at levels I and II, and computes the different word forms during run-time. Core lookup is characterized by the fact that morpheme composition (as in $learn+ed$) and allomorph derivation (as in /wa{\bf I}f/ $\rightarrow$ /wa{\bf I}v/) are not systematically separated, but handled together during run-time. \vspace{0.1cm} \noindent {\bf 5.1} {\em Method 1:} {\bf Core Lookup} \begin{tabbing} xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level III}: \>full form format \>(``happi+ly" (ADV) HAPPY)\\ \\ xx\=xxxxxxxxxxxx\=\kill \> $\Uparrow$ {\em Mapping B}: \>run-time computation of word forms\\ \\ xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level II}: \>core format \>(``happy" (ADJ) HAPPY)\\ \\ xx\=xxxxxxxxxxxx\=\kill \>$\Uparrow$ {\em Mapping A}: \>identity mapping\\ \\ xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level I}: \>core format \>(``happy" (ADJ) HAPPY) \end{tabbing} \noindent In example 5.1, mapping B not only attaches the ending $ly$, but also transforms the surface of the core entry $happy$ into $happi$. $Happi$ is an (orthograpic) allomorph of $happy$. The Greek root of the term {\em allomorph} translates into English as {\em alternative form}. Thus, the notion does not distinguish between spoken and written language. If such a distinction is desired, the terms {\em allophone} versus {\em allograph} are available. We use the term {\em allomorph} in the present context because the basic issues under discussion are not dependent on a specific medium of language. For simplicity, the derivation of allomorphs is illustrated on the basis of orthographic rather than phonetic alternations. The method of core lookup is currently the most widely used, and is exemplified by Koskenniemi (1983), and Evans (1987, pp. 79-101).\footnote{The two systems resemble each other only in that they use the same type of lookup function. In all other respects the theoretical and practical details of Koskenniemi (1983) and Evans (1987) differ considerably.} It is space-efficient at levels I and II, and requires a morphology grammar to compute full forms during run-time, thus accounting explicitly for the morphological structures of the language. The disadvantage is that---due to the run-time computation of word forms---it is comparatively slow. Furthermore, the morphology grammars in question turn out to be highly complicated because they have to handle (i) concatenation, (ii) assimilation and other regular alternations, and (iii) exceptions such as suppletion all at the same time. The second possibility, allomorph lookup, is the method used by LA-MORPH. It is novel in that it utilizes both mapping A and mapping B to modify the lexical entry. The idea is to precompute the different allomorphs of the language during the loading of the lexicon. The computation of allomorphs from core entries is based on rules, called allo-rules. \vspace{0.1cm} \noindent {\bf 5.2} {\em Method 2:} {\bf Allomorph Lookup} \begin{tabbing} xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level III}: \>full form format \>(``happi+ly" (ADV) HAPPY)\\ \\ xx\=xxxxxxxxxxxx\=\kill \> $\Uparrow$ {\em Mapping B}: \>run-time concatenation of morphemes\\ \\ xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level II}: \>extended core format \>(``happy" (ADJ) HAPPY)\\ \> \> \>(``happi" (SR ADJ) HAPPY)\\ \\ xx\=xxxxxxxxxxxx\=\kill \> $\Uparrow${\em Mapping A}: \>derivation of allomorphs\\ \\ xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level I}: \>core format \>(``happy" (ADJ) HAPPY) \end{tabbing} \noindent By taking care of the allomorphs during loading, mapping B can be limited to that operation which can be performed most efficiently during run-time: simple left-to-right concatenation of morphemes, based on the concatenation rules of LA-MORPH.\footnote{The AMPLE system (Weber, Black, McConnel 1988) resembles LA-MORPH in that it is based on left-to-right concatenation of allomorphs during run-time. AMPLE does not, however, provide for the {\em automatic} derivation of allomorph from core-format entries. Rather, the lexicographer has to enter the allomorphs at level I, either by hand or through semi-automatic processing using, e.g., UNIX utilities such as AWK. Another difference between AMPLE and LA-MORPH is that AMPLE uses several different constraint representations, whereas LA-MORPH uses the highly uniform and more efficient left-associative algorithm. The constraints of AMPLE are divided into two classes, successor tests which apply to two adjacent morphemes (rather than word starts and next morphemes, as in LA-MORPH), and final tests, which check whether the morpheme pattern is legal. Thus, AMPLE often discards illegal morpheme patterns only at the final test. LA-MORPH, on the other hand, computes possible continuations on the basis of {\em categorized} morphemes; illegal continuations are discarded as soon as they occur. Finally, AMPLE is implemented as a depth-first, backtracking system, while LA-MORPH proceeds in a breadth-first fashion (which is conceptually interpreted as applying the rules in a rule package in parallel).} The choice of the proper allomorphs during concatenation is guided by their different categories\footnote{For example, the category of the allomorph {\tt HAPPI} contains the marker {\tt SR} to ensure concatenation with other morphemes, as in happi+ly, happi+er and happi+est. The category {\tt (ADJ)} of the allomorph {\tt HAPPY}, on the other hand, is not accepted by any of the concatenation rules.} and/or surface properties. Morphological analysis based on allomorph lookup (method 2) is not only faster than one based on core lookup (method 1), but also much more space efficient than full-form lookup (method 3). This is because complex word forms are derived on the basis of concatenation. Because the allomorphs are generated automatically from the core lexicon, and the different word forms are derived during run-time, the lexicographer usually has to write only one (core format) entry per paradigm. The third possibility, full-form lookup, uses a full-form lexicon at level I and level II. \vspace{0.1cm} \noindent {\bf 5.3} {\em Method 3:} {\bf Full-Form Lookup} \begin{tabbing} xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level III}: \>full form format \>(``happily" (ADV) HAPPY)\\ \\ xx\=xxxxxxxxxxxx\=\kill \>$\Uparrow$ {\em Mapping B}: \>identity mapping\\ \\ xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level II}: \>full form format \>(``happily" (ADV) HAPPY)\\ \\ xx\=xxxxxxxxxxxx\=\kill \>$\Uparrow$ {\em Mapping A}:\>identity mapping\\ \\ xx\=xxxxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxx\=\kill \>{\bf Level I}: \>full form format \>(``happily" (ADV) HAPPY) \end{tabbing} \noindent >From the viewpoint of linguistic analysis, morphological analysis based on a full-form lexicon is trivial: the word forms are simply listed and the morphological regularities (and irregularities) of the language in question remain implicit. However, if the main interest is in syntactic parsing on the basis of a small, experimental lexicon, full-form lookup provides a ``quick and dirty'' method.\footnote{This method was used in the systems described in Hausser (1986). It is easy to implement as a hash table and fast as long as the lexicon is small (e.g., 1000 different forms).} But for serious lexical applications the space requirements of full-form lookup are prohibitive. \vspace{0.2cm} \begin{large} \noindent {\bf 6. Recognizing Possible Morphemes} \end{large} \vspace{0.1cm} \noindent An important component of lexical lookup is the task of recognizing words or morphemes by relating their surface to corresponding lexical entries in the run-time buffer (level II). There are basically three computational methods, namely (i) linear search, (ii) a hash table index, and (iii) a trie structure index. The choice of the retrieval method depends in part on the method of lexical lookup. The most primitive method of recognizing word forms is {\bf linear search}: given a surface, the algorithm searches for a corresponding entry by looking through the list of lexical entries starting at the beginning. In the worst case, the time of the search corresponds to the length of the lexicon. A {\bf hash table} achieves a more efficient lookup by dividing the set of entries into a finite number of classes, or ``buckets''. Furthermore, the entries are distributed evenly into the different buckets by using abstract properties of their surface. This is accomplished by means of a hashing function.\footnote{While there are many different ways to define a hashing function, they all work roughly as follows. Each character of the surface is defined as an integer. Assuming that B is the number of buckets in the hash table, the proper bucket for each surface may be determined, e.g., by summing the integers for its characters and dividing the result by B, taking the remainder which is an integer from 0 to B-1 (see Aho et al. (1983), p. 123). This integer, called the {\bf key} of the hashing function, is used to determine the bucket for the initial storage as well as for subsequent retrieval.} By deducing the proper bucket from the surface, the search for the corresponding entry is greatly reduced. Instead of searching through the whole set of entries, only the associated bucket has to be searched. The hash table method is most suitable for full-form lookup, where each surface may be treated as an unanalyzed atom. This is because hashing functions create an index without a transparent correlation between the surface of the word and the associated search. While properties of the surface are used to compute the key, the resulting index refers to the surface as a whole. Hash tables are neither practical nor efficient, however, if lookup is based on a core lexicon in connection with morphological analysis. To dissemble a complex surface for morphological analysis based on a hash table requires that the surface be ``unpacked''. For example, the surface {\em learned} is unpacked into the list of atoms (L E A R N E D). >From this list, new atoms may be created by ``packing'', e.g., $le$, $lea$, $lear$, etc. For each of these new atoms we may check whether it has a hash table entry. In this manner, $learned$ may eventually be analyzed as $learn+ed$. This process of unpacking in conjunction with subsequent multiple packing into new atoms is very costly computationally. An alternative method of lexical lookup are {\bf trie structures}.\footnote{ Tries were first proposed by Fredkin (1960). The name {\em trie} is ``derived from the middle letters of the word `retrieval' ... Trie was originally intended to be a homonym of `tree' but to distinguish these two terms many people prefer to pronounce trie as though it rhymes with `pie'.'' Aho et al. (1983), p. 163.} Instead of constructing an abstract relation between the surface and the analyzed word, a trie structure creates a transparent index consisting of a {\bf letter tree}. Each path from the root to a leaf corresponds to one word (or morpheme) in the set. Intuitively, we might say that tries treat words or morphemes as letter sequences such that the lexical entries in the run-time buffer are stored at the last letter of the surface. As an example, consider (6.1). \vspace{0.1cm} \noindent {\bf 6.1} {\em The Storage of Morphemes in a Letter Tree (Trie Structure)} \begin{verbatim} u d l \ \ \ n (un (PX) NEG) u (du (SR ADJ-J) DUE) y (ly (SX ADV) LY) / / / d e (due (ADJ) DUE) e (lye (SN) LYE) / \ \ e u_____________________ l (duel (SN) DUEL) \ \ r (under (PREP) UNDER) l / a \ t (undulat (SR A V) UNDULATE) / e (undulate (SR NOM A V) UNDULATE) \end{verbatim} \noindent Trie structures are ideally suited for the recognition of different morphemes in a complex word---as required by morphological analysis based on a core lexicon. The different possible morphemes in a word form are retrieved in a breadth-first manner, going through the surface letter by letter from left to right. The corresponding path through the trie structure is defined in terms of three basic operations, called push, pop, and jump. A {\bf push} follows the structure of the tree, going downward from a node to one of its daughters. For example, when reading the surface $undulate$, the algorithm finds the top level $u$ in the trie structure, pushes from there to the $n$ one level deeper, from there to the $d$, etc. A {\bf pop} is triggered by the presence of (i) a buffer entry indicating the end of a morpheme, e.g., (un (PX) NEG), or (ii) an escape marker, e.g., {\tt \^{}}, indicating a discontinous morpheme. A pop terminates the continuous downward travel along the tree, starting a new morpheme at the top of the trie structure, e.g., $du$. A {\bf jump} is triggered by an escape marker. An escape marker triggers a jump only if there has been a previous escape, otherwise it triggers a pop. A previous escape is marked by the presence of a flag which is activated when an incomplete morpheme is exited. When the algorithm reverts to a previous escape, its flag is deactivated. Jumps allow for the intercalation of two morphemes in discontinuous morphologies, such as Hebrew and Arabic. For example, the morphemes {\em k{\tt \^{}}t{\tt \^{}}b} and {\em a{\tt \^{}}a} may be intercalated into $katab$. The functioning of jumps is explained in detail in Section 15. In LA-MORPH the implementation of morpheme recognition in the trie structure is based on the following general principles: \vspace{0.1cm} \noindent {\bf 6.2} {\em Principles of Incremental Morpheme Recognition} \begin{enumerate} \item {\bf Beginning of a morpheme:} The beginning of a morpheme is a letter at the top level of the trie structure. \item {\bf Continuation of a morpheme:} The next letter in a morpheme is a daughter of the current letter in the trie structure. \item {\bf End of a morpheme:} The end of a morpheme is indicated by the presence of a buffer entry. \item {\bf Recognition of a morpheme:} A morpheme is recognized in terms of the buffer entry at its last letter. \item {\bf Starting a next morpheme:} If (i) a buffer entry or an escape marker is encountered, (ii) the surface of the input continues, and (iii) there is {\bf no} active flag from a previous escape marker, a next morpheme is started by popping back to the top level of the trie structure. %5 \item {\bf Jumping back to an escape marker:} If (i) a buffer entry or an escape marker is encountered, (ii) the surface of the input continues, and (iii) there is an active flag from previous escape marker, the path returns to the previous escape marker (instead of starting a new morpheme), deactivates the flag, and pushes to the next letter. \item {\bf Pursuing alternative analyses:} If a letter with a buffer entry or an escape marker dominates a continuation letter corresponding to the surface, this path is continued as well. %7 \item {\bf Concatenating morphemes:} The rules for concatenating morphemes (cat-rules, c.f. Section 9) apply only when recognition of a next morpheme is completed. %8 \end{enumerate} \noindent If a certain letter sequence, e.g., {\em bet}, can be interpreted as both a complete morpheme and the beginning of an incomplete morpheme, the two readings are pursued in parallel (clause 5 and 7 in 6.2 apply simultaneously). For example, {\em better} may be analyzed as $better$, i.e., the suppletive comparative of {\em good}, and as $bet+t+er$, i.e., the deverbal noun meaning {\em someone who bets}. These different analyses are obtained by pursuing all possible morpheme sequences in a word form, starting at the beginning. The breadth-first analysis of all possible morphemes in LA-MORPH may result in several alternative analyses which constitute different readings. Some of these readings are eliminated because subsequent letters in the surface do not provide for a grammatical continuation.\footnote{In the current system, the trie structure contains only those letter sequences which have been read in from the lexicon. If the input exhibits a non-occurring continuation letter, such as UL+Q in English, the system aborts the analysis and characterizes the input as ill-formed. This feature was implemented in the current system for reasons of efficiency. However, the parsing of word forms with unknown stems may be desirable, e.g., for reasons of robustness, and for modelling the learning of new words on the basis of analogy. In these cases, the parsing of letters which do not previously exist in the trie structure may be made legal.} Other readings may represent genuine ambiguities of a word form. For example, while $better$ is ambiguous as a whole, $undul...$ is ambiguous up to the fifth letter and may then be disambiguated into either $un+du+ly$ or $undulate$ (cf. 10.1 and 10.2). For further discussion see Section 15, which deals with the intercalation of discontinuous morphemes in such languages as Hebrew and Arabic. \vspace{0.2cm} \begin{large} \noindent {\bf 7. The Space Requirements of Approaches (1), (2), and (3)} \end{large} \vspace{0.1cm} \noindent The different methods of lexical lookup (core lookup, allomorph lookup, full-form lookup) and the different methods of run time retrieval (linear search, hash table method, trie structure index) represent clearly defined conceptual alternatives of word form recognition. These alternatives have far reaching consequences not only for linguistic theory, but also for computational efficiency. As a case in point let us compare the space requirements of different methods of lexical lookup. From a computational viewpoint, such a comparison has two aspects: (i) the number of lexical entries in the run-time buffer (relative to a given core lexicon), and (ii) the size of the index. The number of lexical entries depends on the type of entries in the run-time buffer (core entries vs. allomorphs vs. full-form entries). For a given set of paradigms, the number of allomorphs is always greater than the number of core-lexicon entries and smaller than the number of full-form entries. Furthermore, the number of buffer entries depends on the language under consideration. In a language like English, for example, with its comparatively simple morphology, the full-form lexicon is roughly three times as large as the corresponding core-lexicon. In an inflectional language like German, on the other hand, the full-form lexicon is about eight times as large as the core-lexicon. The size of the allomorph lexicon relative to corresponding core and full-form lexica depends on the degree of regularity of the language in qestion and the method of morphological analysis (see Section 15 for an example). Regarding the retrieval of entries from an on-line lexicon, Section 6 discussed linear search, use of a hash table index, and use of a trie structure index. The average lookup time of linear search is too slow.\footnote{On average, linear search takes half the time of a search through the whole lexicon.} An index such as a hash table or a trie structure is much faster, but it requires a certain amount of additional space. This amount of space can be measured abstractly, e.g., in terms of the number of nodes in a trie structure (rather than the number of bytes in a certain implementation on a certain machine). A trie structure index is equally suited for retrieving whole word forms and parts of words (morphemes). Therefore, the comparison of core lookup, allomorph lookup, and full-form lookup in table (7.1) below is based on a trie structure index. The lexicon serving as the standard of comparison consists of English nouns, verbs, and adjectives in representative proportions and comprises 8000 entries in core format. The corresponding allomorph lexicon was generated from the core lexicon by means of the allo-rules (see following section) of LA-MORPH during loading. The corresponding full-form version was created from the core lexicon by means of a morphological generator with subsequent post-editing.\footnote{The software providing the measurements in 7.1 was written by Ben Fine, former research programmer in the Laboratory of Computational Linguistics at CMU.} \vspace{0.1cm} \noindent {\bf 7.1} {\em The Space Requirements of Approaches (1), (2), and (3)} \begin{small} \begin{verbatim} entries av. length characters trie nodes (1) core lexicon 8,001 6.781540 54,152 20,586 (2) allomorph lexicon 9,527 6.740842 64,220 21,211 (3) full-form lexicon 31,244 7.685317 240,120 44,899 \end{verbatim} \end{small} \noindent The above table presents (a) the number of words or word forms ({\em entries}), (b) the average length of the surface of words or word forms ({\em av. length}), (c) the total number of surface {\em characters} used in the lexicon, and (d) the number of {\em trie nodes} used by the trie structure index. We see that the automatic precomputation of allomorphs in English increases the content of the buffer (RAM) from 8,001 to 9,527 entries, and the size of the index from 20,586 to 21,211 trie nodes. Thus, the additional space required for the precomputation of allomorphs in English is surprisingly small.\footnote{Loading the core lexicon in LA-MORPH (which includes building the trie structure index and deriving the expanded buffer of 9,527 entries) takes 78 seconds on an HP-9000 in the Common Lisp implementation. The corresponding C-implementation takes 14 seconds on a DECstation 3100.} In languages with a strong inflectional morphology, such as German or Latin, the space advantage of an automatically derived allomorph lexicon versus the corresponding full-form lexicon is even more striking than in English. Regarding time efficiency, an objective comparison of the three methods of lexical lookup is more difficult than comparing the required space.\footnote{For one thing, the timing depends on the machine and the software. For example, Koskenniemi (1983) quotes an average lookup speed of 0.1 seconds per word form, given an implementation in PASCAL on a Burroughs B7800 computer. A meaningful comparison with LA-MORPH would require running equally comprehensive systems of the two approaches in the same software on the same machines. Furthermore, given a particular machine, comparing method (3) with that of (1) and (2) is affected by the great size of the full-form lookup, which results in frequent swapping.} For all practical purposes, however, the present LA-MORPH system based on approach (2) is very fast: on an HP-9000 work station the average analysis of a word form takes 0.011 seconds in the Common Lisp implementation. This corresponds to an average of 90 word forms a second,\footnote{This timing refers to a straightforward Common Lisp implementation on a standard machine, without any optimization such as byte encoding.} and requires 980K for an expanded core lexicon of 8000 words. The corresponding C-implementation is even faster, and analyzes 270 words per second on a DECstation 3100. Tagging performance may be improved further by using an efficient method of caching. \vspace{0.2cm} \begin{large} \noindent {\bf 8. Generating Allomorphs} \end{large} \vspace{0.1cm} \noindent After this general discussion of different methods of word form recognition let us turn to a more detailed description of LA-MORPH. LA-MORPH instantiates the new method of allomorph lookup, characterized by the strict separation of (precomputed) allomorph derivation and (run-time) concatenation. LA-MORPH generates allomorphs from core entries with rules, called ``allo-rules'' during the loading of the lexicon. %\pagebreak \noindent {\bf 8.1} {\em Illustrating the Application of an Allo-rule} \begin{tabbing} core entry \=xx$\Rightarrow$xx \=allomorphxxx-1 \=allomorphxxx-2 \=allomorph-3 \= allomorph-4\kill core entry \> \>allomorph-1 \>allomorph-2 \\ $derive$ \> $\Rightarrow$\> $derive$ \> $deriv$ \end{tabbing} The allomorphs $derive$ and $deriv$ generated in 8.1 appear in the following (semi-regular) paradigm: \vspace{0.1cm} \noindent {\bf 8.2} {\em Different Allographs in the Paradigm of} {\bf derive} \begin{verbatim} derive derive+s deriv+ed deriv+ing \end{verbatim} \noindent This pattern arises in verbs ending in $e$ --- the `silent' $e$ is orthographically omitted before another vowel. Other (regular) examples are {\em abridge/abridg-}, {\em accelerate/\-accele\-rat-,} {\em accommodate/accom\-mo\-dat-}. An irregular paradigm exhibiting this pattern is $make$, $make+s$, $made$, $mak+ing$. In LA-MORPH, each allo-rule is defined as a table, consisting of three or more columns, called LEX, ALLO1, ALLO2, ....\footnote{Additional columns for ALLO3, etc., may be added if needed.} The rows are labeled $surface$, $cat$, and $stem$. The input and the output of each row is specified in terms of {\bf regular expressions}. Regular expressions constitute a simple formal language for specifying patterns.\footnote{See Aho et al. (1988), p. 28 f, for a description. Regular expressions are widely known, especially among UNIX users. Most editors, e.g., $vi$ or $emacs$, use regular expressions for searching.} As an example of an allo-rule using regular expressions consider 8.3, which restates 8.1 in more general terms. \vspace{0.1cm} \noindent {\bf 8.3} {\em Allo-rule generating} {\bf -e/\#} {\em Allographs in Verbs} (Appendix A.2.1) \begin{verbatim} LEX ALLO1 ALLO2 X = .* Z = .*[^aeo] surface: $Ze & $Z cat: (ir)? nom $X v sr nom $X v sr $X v stem: $X & & \end{verbatim} \noindent The C-implementation of LA-MORPH automatically interprets ASCII files with allo-rules like (8.3), specified in the tabular format and in terms of regular expressions. For a complete statement of the allo-rules of the LA-MORPH application to English, called EMORPH, see the Appendix. The surface of the input is specified in terms of the variable {\tt Z}, defined as any sequence of letters not ending in $a$, $e$, or $o$,\footnote{For example, $shoe+ing$ retains the final $e$.} plus the final letter $e$. ALLO1 leaves the input surface unchanged (as indicated by {\tt \&}), while ALLO2 drops the final $e$. The category of the input begins with an optional {\tt ir},\footnote{The category of irregular lexical entries, e.g. $make$, is marked with an initial {\tt ir} segment.} followed by {\tt nom}, followed by zero or more unspecified segments, and concluding with a {\tt v}. ALLO1 adds the segment {\tt sr} at the beginning (and deletes the {\tt ir} segment if it is present). The category of ALLO2 is like that of ALLO1, except that {\tt nom} is deleted. The presence of {\tt sr nom} in the category of ALLO1 will ensure that the concatenation rules add only $s$ (as in $derive+s$). The presence of {\tt sr} without {\tt nom} in ALLO2 will ensure that the concatenation rules add only $ed$ and $ing$ (as in $deriv+ed$ and $deriv+ing$). The stem can be anything and is not changed in the allomorphs. This is because no semantics has been implemented in the current system. Note that variables like {\tt X} and {\tt Z} must be preceded by {\tt \$} when used in a pattern. Counting the default allo-rule (i.e., the identity mapping), LA-MORPH distinguishes altogether four different cases of allomorph generation, which correspond to four different types