Introducing Possum, a Tiny Text Parsing Language

June 03, 2024 - possum parsing programming languages ~~~(##)'>

Possum is a domain-specific language designed for parsing text, inspired by classic Unix utilities like AWK and sed. You can use Possum for tasks ranging from single-line scripts for data extraction to quickly prototyping a new programming language syntax. The language aims to make parsing friendly and fun, and uses a minimal feature set to write declarative programs that are both compact and readable.

This guide teaches the core features of Possum using interactive examples, and should give you enough context to handle a wide range of parsing situations. If you're checking out Possum for the first time and want to learn more about the language at a higher level, I'm planning on writing separate articles covering the design philosophy behind Possum and examples of larger Possum programs.

⚠️ Work In Progress ⚠️

Possum is still in development. Most of the core functionality is in place, but there are a number of rough edges. The one you'll likely notice is that error messages are mostly placeholders, and will be pretty unhelpful. Rest assured we've got a team of marsupials working around the clock to correct this issue.

The Basics

A Possum program consists of parsers, functions that both validate text inputs and transform them into structured data. The Possum runtime takes a program and an input string and either successfully parses the input into a JSON encoded value, or fails if the input was malformed.

This section covers parsers that match specific strings or numbers in the input text and return the matched value unchanged. Later on we'll introduce ways to compose these basic parsers together to make compound parsers that can validate more complex inputs and produce any JSON value as output.

Literal Parsers

String literals are parsers which match the exact text of the string and return the string value on success.

Here's our first interactive example! Typically Possum is run from the command line, but in the browser the Input field is the text we're going to parse, while the Parser field is the Possum program. Try running the program once to see it succeed, and then change either the input or parser to experiment with the string matching behavior.

String literals can use double or single quotes. JSON strings are encoded with double quotes, so the output will always use double quotes.

Number literals are parsers which match the exact digits of a number and return the number value on success. Possum supports the same number format as JSON, which includes positive and negative numbers, integers, and numbers with fraction and/or exponent parts.

Range Parsers

Character ranges are parsers that match a single Unicode code point that falls within an inclusive range.

Code points are, broadly speaking, how Unicode defines units of text. This means we can use character range parsers for more than just ASCII characters. The emoji "😄" is code point U+1F604 and "🤠" is U+1F920, so "😅" (U+1F605) is in the range. It's worth noting that some units of text are made up of multiple code points stuck together, so character ranges won't work for absolutely everything that looks like a single character. This limitation shouldn't be an issue in the majority of parsing use cases.

Integer ranges use the same .. syntax, but match all integers that fall within an inclusive range.

Greed and Failure

Parsers always start matching from the beginning of the input, do not skip over any input, and return the longest possible match.

After parsing, any extra input is thrown out. This means that the empty string "" is a parser that always succeeds, no matter the input.

If the parser fails to find a match, Possum returns an error.

The Standard Library

Possum has a standard library with parsers covering many common parsing situations. Here's a quick overview of some standard library parsers we'll be using in our examples.

Parsing Strings

Use char to parse exactly one character, returning the value as a string.

Parse and return an upper- or lower-case letter from the English alphabet with alpha. To parse multiple letters try changing alpha to alphas.

Parse and return one or more alphanumeric characters with word. This parser also accepts _ and -.

Parse and return one or more non-whitespace characters with token.

Some parsers are parametrized by other parsers. The parser many(p) tries to run the parser p repeatedly until it no longer succeeds, and returns the concatenation of all of the parsed values.

Parsing Whitespace

The space parser matches a single blank non-line-breaking character. This usually means an ASCII space or tab. By convention spaces will instead parse multiple blank characters at once.

The newline parser matches and returns a single line-breaking character. To parse multiple line breaks use newlines. These parsers are aliased to the abbreviations nl and nls, respectively.

To parse all contiguous whitespace use whitespace or ws.

Parsing Numbers

The digit parser matches a single Arabic numeral between 0 and 9, and returns the numeral as an integer.

Parse any valid JSON integer with integer, or the alias int.

Parse any valid JSON number with number or num. This includes numbers with fraction and/or exponent parts.

Parsing Constants

The parsers true(t), false(f), bool(t, f), and null(n) return the appropriate constant values when the provided parser matches.

Parsing Collections

Finally, array(elem) and object(key, value) return ordered list collections (arrays) and key/value pair collections (objects).

Collections often have separator characters between elements. You can use array_sep(elem, sep) and object_sep(key, pair_sep, value, sep) to handle these cases, parsing the separators but excluding them from the result.

Composing Parsers

We've now covered both basic parsers for strings and numbers, and some of the high-level parser functions from Possum's standard library.

The last major feature we need is the ability to combine parsers to handle more complex inputs. In Possum we do this with infix operators, symbols that go between two parsers to control how and when the parsers run.

Or

The "or" operator p1 | p2 tries to match the parser p1 and then if that fails tries to match p2 instead.

If both parsers fail then the compound parser fails.

Take Right

The "take right" operator p1 > p2 matches p1 and then matches and returns p2.

If either parser fails then the compound parser fails.

Take Left

Similarly the "take left" operator p1 < p2 matches p1, keeps the result, then matches p2. If both succeed then the value parsed by p1 is returned.

If either parser fails then the compound parser fails.

Merge

The "merge" operator p1 + p2 matches p1 and then p2 and combines the two values.

Merging will concatenate strings:

Concatenate arrays:

Combine objects, overwriting existing values:

Sum numbers:

And apply logical "or" to booleans:

If the two parsed values have different types then the operation will throw a runtime error.

The one exception to this rule is the value null, which can be merged with any other value, acting as the identity value for that data type:

Return

The "return" operator p $ V matches p, and then on success returns the value V.

The value on the right-side of $ can be any valid JSON data, including arrays, objects, true, false, and null.

Destructure

The "destructure" operator p -> P matches p, and then compares the result to the pattern P. If the parsed value has the same structure as the pattern then the parser matches and the whole value is returned. The pattern can be any value, including arrays and objects.

If the parsed value does not match the pattern then the parser fails.

Patterns can also include UpperCamelCase variables which match and capture any value. Variables can be used later in the same parser.

Sequence

The "sequence" operator p1 & p2 matches p1 and then matches and returns p2. This behavior is similar to >, but & has a more general precedence, grouping parts of a parser together in a similar way to parentheses. Because of this > is best suited for parsing and then ignoring a value within a parsing step, while & is more useful in stringing together a list of steps. Instead of grouping like this:

The same parser can be written more clearly as a sequence:

Putting it all together

Using the return, destructure, and sequence operators together we can implement a very common pattern in Possum — matching a sequence of parsers, destructuring to assign values to variables, and then building a return value using the variables.

Defining Parsers

A Possum program must have one main parser, and can optionally declare any number of named parsers. Parsers must be separated either by newlines or semicolons. Named parsers are declared with the syntax name = parser. At runtime Possum executes the main parser, which can reference named parsers declared in the program in the same way we reference named parsers from the standard library.

Named Parsers can be parameterized with both parsers and values. Parser params are always snake_case while value params are always UpperCamelCase.

When passing values as parser arguments, any value that could be confused with a parser must be prefixed with $. This includes strings, numbers, and the constants true, false, and null. Arrays, objects, and UpperCamelCase variables are always values, so there's no need to disambiguate.

Named parsers can be recursive and referenced before they are declared. The main parser can come before, after, or in between named parser declarations.

A Few More Standard Library Parsers

At this point you should be well equipped to browse the standard library, but here are a few more parsers that you might find particularly useful.

The parser maybe(p) runs p and either returns the parsed value if p succeeds, or returns null if p fails. This means maybe(p) will never fail, and can be merged with any other value in a concatenated output.

Similarly, skip(p) runs p, but on success always returns null. Since null can merge with any value this allows parts of the input to be ignored in a concatenated output.

Once you're happy with a parser, you may want to ensure that it always parses the whole input by using end_of_input or end to specify that the end of the input has been reached.

If end finds unparsed input then it fails.

Alternatively, input(p) wraps a parser to both strip surrounding whitespace and make sure the whole input is parsed.

Use find(p) to skip characters until the provided parser matches.

While array_sep(elem, sep) handles one-dimensional data with separators, table_sep(array, sep, row_sep) handles two-dimensional data with both column and row separators.

~~~(##)'> Conclusion

We've made it — that's just about everything you need to know to be productive with Possum. In the very first example we matched and returned a string input exactly, but with just a few changes we can extend that parser to handle any number of variations or requirements.

Possum aims to make parsing friendly and fun by letting you compose complex parsers from simple building blocks. These kinds of parsers are frequently called parser combinators. Many modern languages have parser combinator libraries, but they're all implemented slightly differently from one another. Possum provides a consistent set of parsing utilities that integrate with any programming language through JSON. On top of that, Possum avoids much of the complexity that comes from trying to integrate parser combinators into an existing language by focusing solely on parsing.

To install Possum check out the Github repo. To learn more about using Possum read through the standard library or some larger examples. Good luck and happy parsing!