Parsing markup part 01 (Recursion)
Let's have a look at how to parse a multi-lined string of markup text
written by a user and convert it to the Document
type we defined
in the previous chapter.
Our strategy is to take the string of markup text and:
- Split it into a list where each element represents a separate line, and
- Go over the list line by line and process it, remembering information from previous lines if necessary
So the first thing we want to do is to process the string line by line.
We can do that by converting the string to a list of string.
Fortunately, the Haskell
Prelude
module from the Haskell standard library
base
exposes the function
lines
that does exactly what we want. The Prelude
module is exposed in every
Haskell file by default, so we don't need to import it.
For the line processing part, let's start by ignoring all of the markup syntax and just group lines together into paragraphs (paragraphs are separated by an empty line), and iteratively add new features later in the chapter.
A common solution in imperative programs would be to iterate over the lines using some loop construct and accumulate lines that should be grouped together into some intermediate mutable variable. When we reach an empty line, we insert the content of that variable into another mutable variable that accumulates the results.
Our approach in Haskell isn't so different, except that we do not use loops or mutable variables. Instead, we use recursion.
Recursion and accumulating information
Instead of loops, in Haskell, we use recursion to model iteration.
Consider the following contrived example: let's say that
we want to write an algorithm for adding two natural numbers together,
and we don't have a standard operation to do that (+), but we do
have two operations we could use on each number: increment
and decrement
.
A solution we could come up with is to slowly "pass" one number to the other number iteratively by incrementing one and decrementing the other. And we do that until the number we decrement reaches 0.
For example, for 3
and 2
:
- We start with
3
and2
, and we increment3
and decrement2
- In the next step, we now have
4
and1
, we increment4
and decrement1
- In the next step, we now have
5
and0
, since the second number is0
we declare5
as the result.
This can be written imperatively using a loop:
function add(n, m) {
while (m /= 0) {
n = increment(n);
m = decrement(m);
}
return n;
}
We can write the same algorithm in Haskell without mutation using recursion:
add n m =
if m /= 0
then add (increment n) (decrement m)
else n
In Haskell, to emulate iteration with a mutable state, we call the function again with the values we want the variables to have in the next iteration.
Evaluation of recursion
Recursion commonly has a bad reputation for being slow and possibly unsafe compared to loops. This is because, in imperative languages, calling a function often requires creating a new call stack.
However, functional languages (and Haskell in particular) play by different
rules and implement a feature called tail call elimination - when the result of a function call
is the result of the function (this is called tail position), we can just drop the current
stack frame and then allocate one for the function we call, so we don't require N
stack frames
for N
iterations.
This is, of course, only one way to do tail call elimination and other
strategies exist, such as translating code like our recursive add
above to the iteration version.
Laziness
Haskell plays by slightly different rules because it uses a lazy evaluation strategy instead of the much more common strict evaluation strategy. An evaluation strategy refers to "when do we evaluate a computation". In a strict language, the answer is simple: we evaluate the arguments of a function before entering a function.
So, for example, the evaluation of add (increment 3) (decrement 2)
using strict evaluation
will look like this:
- Evaluate
increment 3
to4
- Evaluate
decrement 2
to1
- Evaluate
add 4 1
Or, alternatively (depending on the language), we reverse (1) and (2) and evaluate the arguments from right-to-left instead of left-to-right.
On the other hand, with lazy evaluation, we only evaluate computation when we need it, which is when it is part of a computation that will have some effect on the outside world, for example, when writing a computation to standard output or sending it over the network.
So unless this computation is required, it won't be evaluated. For example:
main =
if add (increment 2) (decrement 3) == 5
then putStrLn "Yes."
else putStrLn "No."
In the case above, we need the result of add (increment 2) (decrement 3)
in order to know which message to write,
so it will be evaluated. But:
main =
let
five = add (increment 2) (decrement 3)
in
putStrLn "Not required"
In the case above, we don't actually need five
, so we don't evaluate it!
But then, if we know we need add (increment 2) (decrement 3)
,
do we use strict evaluation now? The answer is no - because we might not need
to evaluate the arguments to complete the computation. For example, in this case:
const a b = a
main =
if const (increment 2) (decrement 3) == 3
then putStrLn "Yes."
else putStrLn "No."
const
ignores the second argument and returns the first, so we don't actually need
to calculate decrement 3
to provide an answer to the computation and in
turn output an answer to the screen.
With the lazy evaluation strategy, we will evaluate expressions when we need to (when they are required
in order to do something for the user), and we evaluate from the outside in - first
we enter functions, and then we evaluate the arguments when we need to (usually when the thing
we want to evaluate appears in some control flow such as the condition of an if
expression
or a pattern in pattern matching).
I've written a more in-depth blog post about how this works in Haskell: Substitution and Equational Reasoning.
Please read it and try to evaluate the following program by hand:
import Prelude hiding (const) -- feel free to ignore this line
increment n = n + 1
decrement n = n - 1
const a b = a
add n m =
if m /= 0
then add (increment n) (decrement m)
else n
main =
if const (add 3 2) (decrement 3) == 5
then putStrLn "Yes."
else putStrLn "No."
Remember that evaluation always begins from main
.
Solution
evaluating main
if const (add 3 2) (decrement 3) == 5
then putStrLn "Yes."
else putStrLn "No."
expanding const
if add 3 2 == 5
then putStrLn "Yes."
else putStrLn "No."
expanding add
if (if 2 /= 0 then add (increment 3) (decrement 2) else 3) == 5
then putStrLn "Yes."
else putStrLn "No."
evaluating the control flow 2 /= 0
if (if True then add (increment 3) (decrement 2) else 3) == 5
then putStrLn "Yes."
else putStrLn "No."
Choosing the then
branch
if (add (increment 3) (decrement 2)) == 5
then putStrLn "Yes."
else putStrLn "No."
expanding add
if
( if decrement 2 /= 0
then add
(increment (increment 3))
(decrement (decrement 2))
else (increment 3)
) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluating decrement 2
in the control flow (notice how both places change!)
if
( if 1 /= 0
then add
(increment (increment 3))
(decrement 1)
else (increment 3)
) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluating the control flow 1 /= 0
if
( if True
then add
(increment (increment 3))
(decrement 1)
else (increment 3)
) == 5
then putStrLn "Yes."
else putStrLn "No."
Choosing the then
branch
if
( add
(increment (increment 3))
(decrement 1)
) == 5
then putStrLn "Yes."
else putStrLn "No."
Expanding add
if
( if decrement 1 /= 0
then add
(increment (increment (increment 3)))
(decrement (decrement 1))
else increment (increment 3)
) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluating control flow decrement 1
if
( if 0 /= 0
then add
(increment (increment (increment 3)))
(decrement 0)
else increment (increment 3)
) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluating control flow 0 /= 0
if
( if False
then add
(increment (increment (increment 3)))
(decrement 0)
else increment (increment 3)
) == 5
then putStrLn "Yes."
else putStrLn "No."
Choosing the else
branch
if
(increment (increment 3)) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluate control flow increment (increment 3)
if
(increment 3 + 1) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluate in control flow increment 3
if
(3 + 1 + 1) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluate in control flow 3 + 1
if
(4 + 1) == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluate in control flow 4 + 1
if
5 == 5
then putStrLn "Yes."
else putStrLn "No."
Evaluate in control flow 5 == 5
if
True
then putStrLn "Yes."
else putStrLn "No."
Choosing the then
branch
putStrLn "Yes."
Which when run will print Yes.
to the screen.
General recursion
In general, when trying to solve problems recursively, it is useful to think about the problem in three parts:
- Finding the base case (the most simple cases - the ones we already know how to answer)
- Figuring out how to reduce the problem to something simpler (so it gets closer to the base case)
- Mitigating the difference between the reduced version and the solution we need to provide
The reduce and mitigate steps together are usually called the recursive step.
Let's take a look at another example problem: generating a list of a particular size with a specific value in place of every element.
In Haskell, this function would have the following signature:
replicate :: Int -> a -> [a]
Here are a few usage examples of replicate
:
ghci> replicate 4 True
[True,True,True,True]
ghci> replicate 0 True
[]
ghci> replicate (-13) True
[]
How would we implement this function recursively? How would we describe it in the three steps above?
- Base case: the cases we already know how to generate are the cases where the length of the list is zero (or less) - we just return an empty list.
- Reduce: while we might not know how to generate a list of size
N
(whereN
is positive), if we knew the solution forN-1
, we could: - Mitigate: Add another element to the solution for
N-1
using the:
(cons) operator.
Try to write this in Haskell!
Solution
replicate :: Int -> a -> [a]
replicate n x =
if n <= 0 -- recognizing the base case
then
[] -- the solution for the base case
else
x : replicate (n - 1) x
-- --- -------------------
-- ^ ^
-- | |
-- | +-------- reduction
-- |
-- +--- mitigation
Mutual recursion
When solving functions recursively, we usually call the same function again, but that doesn't have to be the case. It is possible to reduce our problem to something simpler that requires an answer from a different function. If, in turn, that function will (or another function in that call chain) call our function again; we have a mutual recursive solution.
For example, let's write two functions, one that checks whether a natural number is even or not, and one that checks whether a number is odd or not only by decrementing it.
even :: Int -> Bool
odd :: Int -> Bool
Let's start with even
; how should we solve this recursively?
- Base case: We know the answer for
0
- it isTrue
. - Reduction: We might not know the answer for a general
N
, but we could check whetherN - 1
is odd, - Mitigation: if
N - 1
is odd, thenN
is even! if it isn't odd, thenN
isn't even.
What about odd
?
- Base case: We know the answer for
0
- it isFalse
. - Reduction: We might not know the answer for a general
N
, but we could check whetherN - 1
is even, - Mitigation: if
N - 1
is even, thenN
is odd! if it isn't even, thenN
isn't odd.
Try writing this in Haskell!
Solution
even :: Int -> Bool
even n =
if n == 0
then
True
else
odd (n - 1)
odd :: Int -> Bool
odd n =
if n == 0
then
False
else
even (n - 1)
Partial functions
Because we didn't handle the negative numbers cases in the example above, our functions will loop forever when a negative value is passed as input. A function that does not return a result for some value (either by not terminating or by throwing an error) is called a partial function (because it only returns a result for a part of the possible inputs).
Partial functions are generally considered bad practice because they can have undesired behaviour at runtime (a runtime exception or an infinite loop), so we want to avoid using partial functions as well as avoid writing partial functions.
The best way to avoid writing partial functions is by covering all inputs!
In the situation above, it is definitely possible to handle negative numbers
as well, so we should do that! Or, instead, we could require that our functions
accept a Natural
instead of an Int
, and then the type system would've stopped
us from using these functions with values we did not handle.
There are cases where we can't possibly cover all inputs; in these cases, it is important to re-examine the code and see if we could further restrict the inputs using types to mitigate these issues.
For example, the head :: [a] -> a
function from Prelude
promises
to return the first element (the head) of a list, but we know that lists
could possibly be empty, so how can this function deliver on its promise?
Unfortunately, it can't. But there exists a different function that can:
head :: NonEmpty a -> a
from the
Data.List.NonEmpty
module! The trick here is that this other head
does not take a general list
as input, it takes a different type entirely, one that promises to have
at least one element and, therefore, can deliver on its promise!
We could also potentially use smart constructors with newtype
and enforce some sort
of restrictions in the type system, as we saw in earlier chapters,
But this solution can sometimes be less ergonomic to use.
An alternative approach is to use data
types to encode the absence of a proper result,
for example, using Maybe
, as we'll see in a future chapter.
Make sure the functions you write return a result for every input, either by constraining the input using types or by encoding the absence of a result using types.
Parsing markup?
Let's get back to the task at hand.
As stated previously, our strategy for parsing the markup text is:
- Split the string into a list where each element is a separate line
(which we can do with
lines
), and - Go over the list line by line and process it, remembering information from previous lines if necessary
Remember that we want to start by ignoring all of the markup syntax and just group lines together into paragraphs (paragraphs are separated by an empty line), and iteratively add new features later in the chapter:
parse :: String -> Document
parse = parseLines [] . lines -- (1)
parseLines :: [String] -> [String] -> Document
parseLines currentParagraph txts =
let
paragraph = Paragraph (unlines (reverse currentParagraph)) -- (2), (3)
in
case txts of -- (4)
[] -> [paragraph]
currentLine : rest ->
if trim currentLine == ""
then
paragraph : parseLines [] rest -- (5)
else
parseLines (currentLine : currentParagraph) rest -- (6)
trim :: String -> String
trim = unwords . words
Things to note:
-
We pass a list that contains the currently grouped paragraph (paragraphs are separated by an empty line)
-
Because of laziness,
paragraph
is not computed until it's needed, so we don't have to worry about the performance implications in the case that we are still grouping lines -
Why do we reverse
currentParagraph
? (See point (6)) -
We saw case expressions used to deconstruct
newtype
s andChar
s, but we can also pattern match on lists and other ADTs as well! In this case, we match against two patterns, an empty list ([]
), and a "cons cell" - a list with at least one element (currentLine : rest
). In the body of the "cons" pattern, we bind the first element to the namecurrentLine
, and the rest of the elements to the namerest
.We will talk about how all of this works really soon!
-
When we run into an empty line, we add the accumulated paragraph to the resulting list (A
Document
is a list of structures) and start the function again with the rest of the input. -
We pass the new lines to be grouped in a paragraph in reverse order because of performance characteristics - because of the nature of singly-linked lists, prepending an element is fast, and appending is slow. Prepending only requires us to create a new cons (
:
) cell to hold a pointer to the value and a pointer to the list, but appending requires us to traverse the list to its end and rebuild the cons cells - the last one will contain the last value of the list and a pointer to the list to append, the next will contain the value before the last value of the list and a pointer to the list, which contains the last element and the appended list, and so on.
This code above will group together paragraphs in a structure, but how do we view our result? In the next chapter, we will take a short detour and talk about type classes, and how they can help us in this scenario.