In typical use when we refer to an LR parser we mean a particular parser capable of recognising a particular language specified by a context free grammar. It is not uncommon, however, to refer to an LR parser meaning a driver program that can be supplied with a suitable table to produce a wide number of different particular LR parsers.
LR parsing has many benefits:
Table of contents |
2 Constructing LR(0) parsing tables |
|
Figure 1. Architecture of a table-based bottom-up parser |
The two LR(0) parsing tables for this grammar look as follows:
action |
|
The action table is indexed by a state of the parser and a terminal (including a special terminal $ that indicates the end of the input stream) and contains three types of actions: a shift that is written as 'sn ' and indicates that the next state is n, a reduce that is written as 'rm ' and indicates that a reduction with grammar rule m should be performed and an accept that is written as 'acc' and inidcates that the parser accepts the string in the input stream.
The goto table is indexed by a state of the parser and a nonterminal and simply indicates what the next state of the parser will be if it has recognized a certain nonterminal.
The LR parsing algorithm now works as follows:
To explain why this algorithm works we now proceed with showing how a string like "1 + 1" would be parsed by such a parser. When the parser starts it always starts with the initial state 0 and the following stack:
In state 2 the action table says that whatever terminal we see on the input stream we should do a reduction with grammar rule 5. If the table is correct then this means that the parser has just recognized the right-hand side of rule 5, which is indeed the case.
So in this case we write 5 to the output stream, pop one state from the stack, and push on the stack the state from the cell in the goto table for state 0 and B, i.e., state 4, onto the stack. The resulting stack is:
The next terminal is now '1' and this means that we perform a shift and go to state 2:
The construction of these parsing tables is based on the notion of LR(0) item (simply called item here) which are grammar rules with a special dot added somewhere in the right-hand side. For example the rule E → E + B has the following four corresponding items:
It is usually not possible to characterize the state of the parser with a single item because it may not know in advance which rule it is going to use for reduction. For example if there is also a rule E → E * B then the items E → E · + B and E → E · * B will both apply after a string corresponding with E has been read. Therefore we will characterize the state of the parser by a set of items, in this case the set { E → E · + B, E → E · * B }.
An item with a dot in front of a nonterminal, such as E → E + · B, indicates that the parser expects to parse the nonterminal B next. To ensure the item set contains all possible rules the parser may be in the midst of parsing, it must include all items describing how B itself will be parsed. This means that if there are rules such as B → 1 and B → 0 then the item set must also include the items B → · 1 and B → · 0. In general this can be formulated as follows:
The Parsing Algorithm
In practice the parser will usually try to recover from a syntax error by ignoring unexpected symbols and/or inserting missing symbols, and continue parsing, but this is outside the scope of this article.An Example
The first terminal that the parser sees is the '1' and according to the action table it should then go to state 2 resulting in the following stack:
The top of the stack is shown on the right-hand side. For the sake of our explanation we also show the symbol that caused the transition to the next state although strictly speaking it is not part of the stack.
However, also in state 4 the action table says we also do a reduction with rule 3. So we write 3 to the output stream, pop one state from the stack, and find the new state in the goto table for state 0 and E, which is state 3. The resulting stack:
The next terminal that the parser sees is a '+' and according to the action table it should then go to state 6:
Note that the resulting stack can be interpreted as the history of a finite state automaton that has just read a nonterminal E followed by a terminal '+'. The transition table of this automaton is defined by the shift actions in the action table and the goto actions in the goto table.
Just as the previous '1' this one is reduced to B giving the following stack:
Again note that the stack corresponds with a list of states of a finite automaton that has read a nonterminal E, followed by a '+' and then a nonterminal B. In state 8 we always perform a reduce with rule 2. Note that the top 3 states on the stack correspond with the 3 symbols in the right-hand side of rule 2.
Finally, we read a '$' from the input stream which means that according to the action table (the current state is 3) the parser accepts the input string.
The rule numbers that will then have been written to the output stream will be [ 5, 3, 5, 2 ] which is indeed a rightmost derivation of the string "1 + 1" in reverse.Constructing LR(0) parsing tables
Items
An exception are rules of the form A → ε with which only the item A → · corresponds. These rules will be used to denote that state of the parser. The item E → E · + B, for example, indicates that the parser has recognized a string corresponding with E on the input stream and now expects to read a '+' followed another string corresponding with E.Item sets
Closure of item sets
Any set of items can be extended such that it satisfies this rule: simply continue to add the appropriate items until all nonterminals preceeded by dots are accounted for. The minimal extension is called the closure of an item set and written as clos(I) where I is an item set. It are these closed item sets that we will take as the states of the parser, although only the ones that are actually reachable from the begin state will be included in the tables.
item set | * | + | 0 | 1 | E | B | |
0 | 1 | 2 | 3 | 4 | |||
1 | |||||||
2 | |||||||
3 | 5 | 6 | |||||
4 | |||||||
5 | 1 | 2 | 7 | ||||
6 | 1 | 2 | 8 | ||||
7 | |||||||
8 |
From this table and the found item sets we construct the action and goto table as follows:
Note that the reduce actions produced by the above procedure always occupy an entire table row, causing the reduction to occur regardless of the next symbol in the input stream. This is why these are LR(0) parse tables: they are only suitable for grammars that do not require lookahead. A grammar that needs lookahead to disambiguate reductions would need a table row containing different reduce actions in different columns, and the above procedure is not capable of creating such rows.
Refinements to the LR(0) table construction procedure (namely SLR, LALR) are capable of constructing reduce actions that do not occupy entire rows. Therefore, they are capable performing different reductions (or even shifts) from the same parser state for different input symbols, allowing them to parse grammars not parseable by an LR(0) table.
The automaton that is constructed is by the way it is constructed always a deterministic automaton. However, when reduce actions are added to the action table it can happen that the same cell is filled with a reduce action and a shift action (a shift-reduce conflict) or with two different reduce actions (a reduce-reduce conflict). However, it can be shown that when this happens the grammar is not an LL(0) grammar.
A small example of a non-LL(0) grammar with a shift-reduce conflict is:
A small example of a non-LL(0) grammar with a reduce-reduce conflict is:
Both examples above can be solved by letting the parser use the follow set (see LL parser) of a nonterminal A to decide if it is going to use one of As rules for a reduction; it will only use the rule A → w for a reduction if the next symbol on the input stream is in the follow set of A. This solution results in so-called Simple LR parsers.Constructing the action and goto table
The reader may verify that this results indeed in the action and goto table that were presented earlier on.A note about LR(0) versus SLR and LALR parsing
Conflicts in the constructed tables
One of the item sets we then find is:
There is a shift-reduce conflict in this item set because in the cell in the action table for this item set and the terminal '1' there will be both a shift action to state 1 and a reduce action with rule 2.
In this case we obtain the following item set:
There is a reduce-reduce conflict in this item set because in the cells in the action table for this item set there will be both a reduce action for rule 3 and one for rule 4.