Canopy – Building parse trees

By default, a Canopy parser generates a parse tree without you needing to tell it how. Every node has a text, an offset and a (possibly empty) list of elements. You can also tell Canopy to call functions that you define, so that you build the tree yourself.

Say we have a grammar that matches strings that represent a mapping from a name to a list of numbers, like {'ints':[1,2,3]}:

maps.peg

grammar Maps
  map     <-  "{" string ":" value "}"
  string  <-  "'" [^']* "'"
  value   <-  list / number
  list    <-  "[" value ("," value)* "]"
  number  <-  [0-9]+

To change the kinds of values the parser generates each time it matches a rule, we can annotate expressions with the names of functions to call, each prefixed with a % sign:

maps.peg

grammar Maps
  map     <-  "{" string ":" value "}" %make_map
  string  <-  "'" [^']* "'" %make_string
  value   <-  list / number
  list    <-  "[" value ("," value)* "]" %make_list
  number  <-  [0-9]+ %make_number

These function names are called actions. Once you’ve compiled this parser, you can use it by passing in an object that implements the named actions. Each function is passed four arguments:

input: the complete text of the input document
start: the start offset of the text that matches the rule
end: the end offset of the text that matches the rule (exclusive, so the match is the text from start up to but not including end)
elements: an array of the values generated by the rule’s sub-rules
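
To make the argument list concrete, the sketch below calls an action by hand, roughly the way a parser might when the number rule matches the 1 in {'ints':[1,2,3]}. The describe_match action and the hand-built element object are illustrative assumptions, not part of Canopy; a real parser supplies these values itself.

```javascript
// Hypothetical action that simply reports what it was given.
function describe_match (input, start, end, elements) {
  return {
    matched: input.substring(start, end), // text covered by the rule
    length: end - start,                  // number of characters matched
    children: elements.length             // values produced by sub-expressions
  }
}

const input = "{'ints':[1,2,3]}"

// Hand-built stand-in for the single node that [0-9]+ yields for "1".
const elements = [{ text: '1', offset: 9, elements: [] }]

const info = describe_match(input, 9, 10, elements)
console.log(info)
```

The first digit sits at offset 9 of the input, so the call uses start 9 and end 10.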

For example, let’s implement actions for the above parser that translate the input text into a JavaScript value representing the same structure:

const maps = require('./maps')

const actions = {
  make_map (input, start, end, elements) {
    let map = {}
    map[elements[1]] = elements[3]
    return map
  },

  make_string (input, start, end, elements) {
    return elements[1].text
  },

  make_list (input, start, end, elements) {
    let list = [elements[1]]
    elements[2].forEach((el) => list.push(el.value))
    return list
  },

  make_number (input, start, end, elements) {
    return parseInt(input.substring(start, end), 10)
  }
}

let result = maps.parse("{'ints':[1,2,3]}", { actions })
console.log(result)

This program prints

{ ints: [ 1, 2, 3 ] }

The parser calls these actions instead of building nodes itself. It passes the (input, start, end) arguments rather than just the text of the match so that it can skip spending time and memory on creating substrings that are never needed. Notice how most of the actions above don’t use these arguments.
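
As a sketch of why offsets are useful, an action can keep only the positions and build the substring later, if at all. The make_span name and the shape of the object it returns are assumptions made for illustration; the function is called by hand here rather than by a generated parser.

```javascript
// Hypothetical action: remembers where the match was, and only slices the
// input when the text is actually requested.
function make_span (input, start, end, elements) {
  return {
    start,
    end,
    get text () {
      return input.substring(start, end)
    }
  }
}

const source = "{'ints':[1,2,3]}"

// Offsets 2 to 6 cover the characters between the quotes.
const span = make_span(source, 2, 6, [])

console.log(span.end - span.start) // length is known without slicing: 4
console.log(span.text)             // slicing happens only here: ints
```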

The % operator binds to sequence expressions, so it binds tighter than the choice operator. In the following grammar, the input abc will invoke make_alpha while the input 123 will invoke make_numeric:

actions.peg

grammar Actions
  root  <-  "a" "b" "c" %make_alpha / "1" "2" "3" %make_numeric
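
A minimal sketch of an actions object for this grammar might look like the following. The return values are made up for illustration, and the functions are called by hand with the arguments a parser would be expected to pass for each branch. With a parser compiled from actions.peg, you would instead pass the object to parse() via the actions option.

```javascript
const actions = {
  // Invoked for the first branch: "a" "b" "c"
  make_alpha (input, start, end, elements) {
    return { kind: 'alpha', text: input.substring(start, end) }
  },

  // Invoked for the second branch: "1" "2" "3"
  make_numeric (input, start, end, elements) {
    return { kind: 'numeric', value: parseInt(input.substring(start, end), 10) }
  }
}

// Simulated calls. These actions ignore `elements`, so empty arrays are passed.
const alpha = actions.make_alpha('abc', 0, 3, [])
const numeric = actions.make_numeric('123', 0, 3, [])

console.log(alpha)
console.log(numeric)
```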

An action can only be used with expressions that create new nodes. It cannot be used with expressions that simply pass through a node created by another expression, such as the ?, /, & and ! operators, and cross-references. It can be attached to a sequence of two or more expressions that contains such an expression, but not to one of those expressions on its own.

Action functions are called as the parser is running, so they let you execute code while the input is still being processed.

Adding methods to nodes

Instead of telling the parser how to build nodes, you can have it augment the nodes it builds by default with your own methods. This is done by annotating parsing expressions with types. A type is any valid JavaScript object name, like Foo.Bar, surrounded by angle brackets. When the input matches the annotated expression, the generated syntax node gains the methods from the named type.

Let’s take a simple example, matching a string literal:

strings.peg

grammar Strings
  root  <-  "hello" <HelloNode>

const strings = require('./strings')

const types = {
  HelloNode: {
    upcase () {
      return this.text.toUpperCase()
    }
  }
}

let tree = strings.parse('hello', { types })
console.log(tree.upcase())

The grammar says that a node matching hello is of type HelloNode. Then in our JavaScript code, we pass in an object that contains the named types via the types option, and use the parser to process a string.

Because the string matches our typed rule, the resulting node gains the methods from the HelloNode object, and we can invoke those methods on the node.
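
The effect on the node can be pictured as copying the type’s methods onto an otherwise ordinary node. The sketch below imitates that with Object.assign on a hand-built node. It is only a model of the behavior described above, not Canopy’s actual implementation, and the augment helper is hypothetical.

```javascript
// Hand-built stand-in for the node a parser would create for "hello".
const node = { text: 'hello', offset: 0, elements: [] }

const types = {
  HelloNode: {
    upcase () {
      return this.text.toUpperCase()
    }
  }
}

// Hypothetical helper: copy every method of the type onto the node.
function augment (target, type) {
  return Object.assign(target, type)
}

const typed = augment(node, types.HelloNode)

console.log(typed.upcase()) // HELLO
console.log(typed.offset)   // the default fields are still there: 0
```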

Let’s run this script:

$ node strings_test.js
HELLO

In the grammar syntax, type annotations bind to sequences. That is, a type annotation may only appear at the end of a sequence expression, and binds tighter than choice expressions. Unlike action annotations, type annotations can be used on any kind of expression, not just those that produce new nodes.

For example, the following means that a node matching the sequence "foo" "bar" will be augmented with the Extension methods:

words.peg

grammar Words
  root  <-  first:"foo" second:"bar" <Extension>

The extension methods have access to the labelled nodes from the sequence.

const words = require('./words')

const types = {
  Extension: {
    convert () {
      return this.first.text + this.second.text.toUpperCase()
    }
  }
}

words.parse('foobar', { types }).convert()
// == 'fooBAR'

Because type annotations bind to sequences rather than to choices, the following matches either the string "abc", which gains the Foo type, or the string "123", which gains the Bar type:

sequences.peg

grammar Choice
  root  <-  "a" "b" "c" <Foo> / "1" "2" "3" <Bar>

If you want all the branches of a choice to be augmented with the same type, you need to parenthesize the choice and place the type afterward.

choices.peg

grammar Choices
  root    <-  (alpha / beta) <Extension>
  alpha   <-  first:"a" second:"z"
  beta    <-  first:"j" second:"c"

const choices = require('./choices')

const types = {
  Extension: {
    convert () {
      return this.first.text + this.second.text.toUpperCase()
    }
  }
}

choices.parse('az', { types }).convert()
// == 'aZ'

choices.parse('jc', { types }).convert()
// == 'jC'