Step 2 - Filtering

From start to end, perform the following filtering and lexing tasks,
using the given order of precedence in case of conflict:

a. If the Lojban word "zoi" (selma'o ZOI) is identified, take the
   following Lojban word (which should be end-delimited with a pause for
   separation from the following non-Lojban text) as an opening
   delimiter. Treat all text following that delimiter, until that
   delimiter recurs *after a pause*, as grammatically a single token
   (labelled 'anything_699' in this grammar). There is no need for
   processing within this text except as necessary to find the closing
   delimiter.

b. If the Lojban word "zo" (selma'o ZO) is identified, treat the
   following Lojban word as a token labelled 'any_word_698', instead of
   lexing it by its normal grammatical function.

c. If the Lojban word "lo'u" (selma'o LOhU) is identified, search for
   the closing delimiter "le'u" (selma'o LEhU), ignoring any such
   closing delimiters absorbed by the previous two steps. The text
   between the delimiters should be treated as the single token
   'any_words_697'.

d. Categorize all remaining words into their Lojban selma'o category,
   including the various delimiters mentioned in the previous steps. In
   all steps after step 2, only the selma'o token type is significant
   for each word.

e. If the word "si" (selma'o SI) is identified, erase it and the
   previous word (or token, if the previous text has been condensed into
   a single token by one of the above rules).

f. If the word "sa" (selma'o SA) is identified, erase it and all
   preceding text as far back as necessary to make what follows attach
   to what precedes. (This rule is hard to formalize and may receive
   further definition later.)

> > [2]: Actually, without an extension to PEGs, 'zoi' cannot be
> > handled without a pre-processor, and without a re-definition
> > that is at least marginally sane, 'sa' doesn't even have a
> > working definition to try to handle.
>
> The definition of "sa" is straightforward: check the selma'o of
> the next token rightward, and remove tokens leftward until a token
> of the same selma'o has been removed (in the usual extended sense
> of "selma'o").

g. If the word "su" (selma'o SU) is identified, erase it and all
   preceding text back to and including the first preceding token word
   which is in one of the selma'o: NIhO, LU, TUhE, and TO. However, if
   speaker identification is available, a SU shall only erase to the
   beginning of a speaker's discourse, unless it occurs at the beginning
   of a speaker's discourse. (Thus, if the speaker has said something,
   two "su"'s are required to erase the entire conversation.)

Step 3 - Termination

If the text contains a FAhO, treat that as the end-of-text and ignore
everything that follows it.

Step 4 - Absorption of Grammar-Free Tokens

In a new pass, perform the following absorptions (absorption means that
the token is removed from the grammar for processing in following steps,
and optionally reinserted, grouped with the absorbing token, after
parsing is completed).

a. Token sequences of the form any - (ZEI - any) ..., where there may be
   any number of ZEIs, are merged into a single token of selma'o BRIVLA.

b. Absorb all selma'o BAhE tokens into the following token. If they
   occur at the end of text, leave them alone (they are errors).

c. Absorb all selma'o BU tokens into the previous token. Relabel the
   previous token as selma'o BY.

d. If selma'o NAI occurs immediately following a UI or CAI token,
   absorb the NAI into the previous token.

e. Absorb all members of selma'o DAhO, FUhO, FUhE, UI, Y, and CAI into
   the previous token. All of these null grammar tokens are permitted
   following any word of the grammar, without interfering with that
   word's grammatical function, or causing any effect on the grammatical
   interpretation of any other token in the text.
   Indicators at the beginning of text are explicitly handled by the
   grammar.

Step 5 - Insertion of Lexer Lexemes

Lojban is not in itself LALR1. There are words whose grammatical
function is determined by following tokens. As a result, parsing of the
YACC grammar must take place in two steps. In the first step, certain
strings of tokens with defined grammars are identified, and either

a. are replaced by a single specified 'lexer token' for step 6, or

b. the lexer token is inserted in front of the token string to identify
   it uniquely.

The YACC grammar included herein is written to make YACC generation of a
step 6 parser easy regardless of whether a. or b. is used. The strings
of tokens to be labelled with lexer tokens are found in rule terminals
labelled with numbers between 900 and 1099. These rules are defined
with the lexer tokens inserted, with the result that it can be verified
that the language is LALR1 under option b. after steps 1 through 4 have
been performed. Alternatively, if option a. is to be used, these rules
are commented out, and the rule terminals labelled from 800 to 900 refer
to the lexer tokens *without* the strings of defining tokens. Two sets
of lexer tokens are defined in the token set so as to be compatible with
either option.

In this step, the strings must be labelled with the appropriate lexer
tokens. Order of inserting lexer tokens *IS* significant, since some
shorter strings that would be marked with a lexer token may be found
inside longer strings. If the tokens are inserted before or in place of
the shorter strings, the longer strings cannot be identified.

If option a. is chosen, the following order of insertion works correctly
(it is not the only possible order): A, C, D, B, U, E, H, I, J, K, M, N,
G, O, V, W, F, P, R, T, S, Y, L, Q.
This ensures that the longest rules will be processed first; a PA+MAI
will not be seen as a PA with a dangling MAI at the end, for example.
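
The ordering concern in Step 5 can be illustrated with a small Python
sketch of option a. (replacement). It is only a toy under stated
assumptions: the lexer-token names LEX_PA_MAI and LEX_PA, the two matcher
functions, and the sample token stream are hypothetical and are not the
numbered labels used in the actual grammar. Each rule gets its own pass
over the token stream, in the order given.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (selma'o, text)
Matcher = Callable[[List[Token], int], int]


def match_pa_mai(tokens: List[Token], i: int) -> int:
    """Length of a PA+ MAI string starting at i, or 0 if none starts there."""
    j = i
    while j < len(tokens) and tokens[j][0] == "PA":
        j += 1
    if j > i and j < len(tokens) and tokens[j][0] == "MAI":
        return j + 1 - i
    return 0


def match_pa(tokens: List[Token], i: int) -> int:
    """Length of a PA+ string starting at i, or 0 if none starts there."""
    j = i
    while j < len(tokens) and tokens[j][0] == "PA":
        j += 1
    return j - i


# Hypothetical lexer-token names, listed longest string first.
LONGEST_FIRST: List[Tuple[str, Matcher]] = [
    ("LEX_PA_MAI", match_pa_mai),
    ("LEX_PA", match_pa),
]


def replace_compounds(tokens: List[Token],
                      rules: List[Tuple[str, Matcher]]) -> List[Token]:
    """Option a.: one pass per rule, replacing each match by a lexer token."""
    for name, matcher in rules:
        out: List[Token] = []
        i = 0
        while i < len(tokens):
            n = matcher(tokens, i)
            if n:
                # Collapse the matched string into a single lexer token.
                out.append((name, " ".join(w for _, w in tokens[i:i + n])))
                i += n
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


if __name__ == "__main__":
    # Illustrative stream only; it is not claimed to be a valid utterance.
    stream = [("PA", "pa"), ("PA", "re"), ("MAI", "mai"),
              ("BRIVLA", "klama"), ("PA", "ci")]
    print(replace_compounds(stream, LONGEST_FIRST))
    print(replace_compounds(stream, list(reversed(LONGEST_FIRST))))
```

With the longest rule first, "pa re mai" collapses into one LEX_PA_MAI
token. With the order reversed, the PA pass consumes "pa re" first and the
MAI is left dangling, which is the failure the insertion order above is
meant to avoid.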