Example: Parsing Scheme

Here we prepare a small parser for a very simple Scheme-like language.

The main features of the parser include:

handling whitespace
parsing number-containing identifiers and numbers while avoiding ambiguities using not_followed_by
error recovery by manually traversing the memo table

We choose not to implement any of the Scheme data types except numbers and identifiers; also all top-level expressions must be parenthesized "command" S-expressions.

Implementing the grammar

import PikaParser as P

rules = Dict(
    :letter => P.satisfy(isletter),
    :digit => P.satisfy(isdigit),
    :ident => P.tie(
        P.seq(
            P.seq(:letter),
            P.many(:inIdent => P.first(:letter, :digit, P.token('-'))),
            P.not_followed_by(:inIdent),
        ),
    ),
    :number => P.seq(P.some(:digit), P.not_followed_by(:inIdent)),
    :ws => P.many(P.satisfy(isspace)),
    :popen => P.seq(P.token('('), :ws),
    :pclose => P.seq(P.token(')'), :ws),
    :sexpr => P.seq(:popen, :insexpr => P.many(:scheme), :pclose),
    :scheme => P.seq(:basic => P.first(:number, :ident, :sexpr), :ws),
    :top => P.seq(:ws, :sexpr), #support leading blanks
);

Notice that the rules "clean" the space characters after each sensible token is matched, except for :top that is able to clean up the leading spaces. This way prevents unnecessary checking (and redundant matching) of the tokens, and buildup of uninteresting entries in the memo table.

Parsing input

Let's test the grammar on a piece of source code that contains lots of whitespace and some errors.

p = P.parse(
    P.make_grammar([:top], P.flatten(rules, Char)),
    """
(plus 1 2 3)
(minus 1 2(plus 3 2)  ) woohoo extra parenthesis here )
(complex
  id3nt1f13r5 αβγδ भरत kůň)
(invalid 1d3n7)
(something
  1
  2
  valid)
(straight (out (missing(parenthesis error))
(apply (make-function) (make-data))
""",
);

Prepare a folding function:

fold_scheme(m, p, s) =
    m.rule == :number ? parse(Int, m.view) :
    m.rule == :ident ? Symbol(m.view) :
    m.rule == :insexpr ? Expr(:call, :S, s...) :
    m.rule == :sexpr ? s[2] : m.rule == :top ? s[2] : length(s) > 0 ? s[1] : nothing;

Recovering from errors and showing partial parses

We can run through all top matches, tracking the position where we would expect the next match:

next_pos = 1
while next_pos <= lastindex(p.input)
    global next_pos
    pos = next_pos
    mid = 0
    while pos <= lastindex(p.input) # try to find a match
        mid = P.find_match_at!(p, :top, pos)
        mid != 0 && break
        pos = nextind(p.input, pos)
    end
    pos > next_pos && # if we skipped something, report it
        println(
            "Got problems understanding this: $(p.input[next_pos:prevind(p.input, pos)])",
        )
    if mid == 0
        break # if we skipped all the way to the end, quit
    else # we have an actual match, print it.
        value = P.traverse_match(p, mid, fold = fold_scheme)
        println("Got a good value: $value")
        m = p.matches[mid] # skip the whole match and continue
        next_pos = nextind(p.input, m.last)
    end
end

Got a good value: S(plus, 1, 2, 3)
Got a good value: S(minus, 1, 2, S(plus, 3, 2))
Got problems understanding this: woohoo extra parenthesis here )
Got a good value: S(complex, id3nt1f13r5, αβγδ, भरत, kůň)
Got problems understanding this: (invalid 1d3n7)
Got a good value: S(something, 1, 2, valid)
Got problems understanding this: (straight (out
Got a good value: S(missing, S(parenthesis, error))
Got a good value: S(apply, S(var"make-function"), S(var"make-data"))

We can see that the unparseable parts of input were correctly skipped, while the sensible parts were interpreted as expressions. The chosen error recovery method might not be optimal in the case of missing parentheses – as an improvement, one might choose to utilize another grammar rule to find a good part of input to discard (e.g., everything to the end of the line).

This page was generated using Literate.jl.