
Insurance is hard – From SweetXML to Saxy

In this article, we discuss XML parsing and out-of-memory errors in Elixir services. We will show how we solved our memory problems by migrating from DOM to SAX parsing, swapping SweetXML for Saxy. This led to a 90% reduction in memory usage and a 20% parsing speedup.

Quick heads up: We work in the insurance industry. We try to keep insurance talk to a minimum. But if you don’t want to hear about insurance, we recommend you skip to the next section. We won’t be offended. 🥲

A big part of working in insurance is deciding who to insure. We must consider many factors. No doubt one of the most important is whether an entity – be it a person, vehicle, company or whatever – has been sanctioned by a particular government.

Each government keeps an updated sanctions list, which is made freely available online. For instance, the UK government publishes the UK Sanctions List.

These lists come in different formats but are usually available as XML documents, so they can easily be read by machines. The documents range from a few hundred kilobytes to tens of megabytes and contain anywhere from a few thousand to tens of thousands of entries (the largest holds around 20k).

Since the lists can be updated frequently, and are not always backward compatible, we use a daily cron job that downloads, parses and writes all entries to the database.

We managed to nail the downloading and the writing of ~40k entries to the database. Unfortunately, we had huge memory spikes during the actual parsing of the XML files, which went as high as 9GiB and averaged 1GiB per run. This meant we had to either request a lot of memory from Kubernetes just to be safe or risk getting Out Of Memory (OOM) killed, which started happening too frequently. The situation wasn’t sustainable.

Here’s a graph of our memory usage from the latter half of February to the first of April. Memory spikes are visible throughout the period, although they somewhat lessened going into April.

The same can be seen for CPU usage.

A brief detour into XML parsing

To understand why this was happening, let’s take a detour into the wonderful world of XML parsing.

There are at least two ways to parse an XML file: load it all up in memory or read it as a stream of data. These two modes are known as DOM (Document Object Model) and SAX (the Simple API for XML) respectively.

They differ in cost: DOM parsing uses more memory, since it has to read the whole file to build the DOM tree. SAX parsing, on the other hand, reads the file and generates a stream of events, with each event corresponding to an XML tag being opened or closed. DOM parsing is a lot more ergonomic, since you can run queries using XPath; SAX parsing, however, scales much better.
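
To make the difference concrete, here is (roughly) the event sequence a SAX parser emits for a tiny document; the exact event names vary from library to library:

# For a document like "<h1>Some title</h1>", a SAX parser emits a flat
# sequence of events instead of building a tree. Conceptually:
#
#   :start_document
#   :start_element  "h1" (with its attributes)
#   :characters     "Some title"
#   :end_element    "h1"
#   :end_document
#
# A handler folds over these events, carrying whatever state it needs.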

In the Erlang ecosystem, the main library for XML parsing is xmerl, which offers both DOM and SAX modes. Many libraries are based on it, and SweetXML, the library we used, is a thin Elixir wrapper around it.

SweetXML, as the name suggests, is a really sweet library. It offers a handy xpath function for interfacing with the DOM: given an XML document and an XPath query, represented by an ~x sigil expression, it runs the given query against the document and returns its result.

Here’s a quick example (more can be found in its documentation):

import SweetXml
doc = "<h1>Some title</h1>"
xpath(doc, ~x"/h1/text()") # 'Some title'

Unfortunately, SweetXML also inherits the faults of xmerl (namely, a high memory footprint, especially when not using the streaming API). SweetXML does offer options to reduce memory usage, through element streaming and a ‘discard’ option that avoids keeping the whole tree around. But since SweetXML is ultimately a DOM parser, these didn’t fully solve our problem, which led us to search for solutions elsewhere.
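
For reference, the streaming API looks roughly like this (a sketch only; the file and tag names are illustrative, so check the SweetXML docs for the details):

import SweetXml

# Emit each <team> element as soon as it is fully parsed, and discard it
# from the accumulated document so the whole tree is never held in memory.
"teams.xml"
|> File.stream!()
|> SweetXml.stream_tags(:team, discard: [:team])
|> Stream.map(fn {_tag, team} -> xpath(team, ~x"./name/text()") end)
|> Enum.to_list()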

We had various ideas on how to fix this problem. There are external services that offer solutions for querying these sanctions lists, but that didn’t make sense for our use case. We also thought about rewriting the import service in Rust, the other main language at Prima, to get that juicy blazingly™️ fast speed. In the end, since we already had a lot of machinery written in Elixir, we decided to try out a proper SAX parser.

There are quite a few SAX parsers for Erlang/Elixir, like erlsom, fast_xml and yaccety_sax, all of which would probably speed up parsing. Unfortunately, none of those were appealing, since (a) the code you end up with is often quite unreadable and (b) their interfaces aren’t that user-friendly. Among them, though, we found Saxy, which proved to be both ergonomic and fast.

From DOM to SAX

The translation from a DOM implementation using XPath to a SAX one is not straightforward. This is because, with XPath, you can just query a document. With SAX, you have to build your internal representation from sequential events.

Let’s suppose we have a simple XML as input:

<?xml version="1.0" encoding="UTF-8"?>
<teams>
  <team>
      <id>1</id>
      <name>Team One</name>
  </team>
  <team>
      <id>2</id>
      <name>Team Two</name>
  </team>
</teams>

Here are a few queries we might perform on this document using SweetXml:

doc |> xpath(~x"//teams/team/name/text()") # 'Team One'
doc |> xpath(~x"//teams/team/id/text()"l) # ['1', '2']

As you can see, queries are quite straightforward.
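
SweetXml’s sigil also supports cast modifiers (s for string, i for integer, l for list), which keep call sites tidy. A couple of examples on the same document (a sketch):

doc |> xpath(~x"//teams/team/name/text()"s)  # "Team One" (cast to a string)
doc |> xpath(~x"//teams/team/id/text()"il)   # [1, 2] (a list of integers)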

The Saxy implementation will be wildly different from the XPath one. Instead of issuing multiple queries against the document to fetch the data we need, we have to build the output tag by tag, all the while keeping track of which tag we are reading, which ones we have already read, and so on.

While the code might be less appealing than its DOM counterpart, Saxy offers a very simple behaviour that we can implement to create a self-contained parser. What’s more, readability improves greatly if the state is handled by a separate module that hides all its internal details.

Here’s what a Saxy implementation might look like:

defmodule SaxyParser do
  @behaviour Saxy.Handler

  # The state is a tuple: a stack of currently open elements,
  # and the list of teams parsed so far.

  @impl true
  def handle_event(:start_document, _, state) do
    {:ok, state}
  end

  def handle_event(:end_document, _, state) do
    {:ok, state}
  end

  # When a <team> opens, start building a new team as an empty map.
  # Every opening tag is also pushed onto the element stack.
  def handle_event(:start_element, {name, _}, {elements, teams}) do
    teams =
      case name do
        "team" ->
          [%{} | teams]

        _ ->
          teams
      end

    {:ok, {[name | elements], teams}}
  end

  # When a tag closes, pop it off the element stack.
  def handle_event(:end_element, _, {[_ | elements], teams}) do
    {:ok, {elements, teams}}
  end

  # Text content: if we’re inside a <name> or <id> tag, store it on the
  # team currently being built (the head of the teams list).
  def handle_event(:characters, chars, {[element | elements], teams}) do
    teams =
      cond do
        element in ["name", "id"] ->
          teams
          |> List.first(%{})
          |> Map.put(element, chars)
          |> then(&List.replace_at(teams, 0, &1))

        true ->
          teams
      end

    {:ok, {[element | elements], teams}}
  end
end

# Which can then be used like so
iex> stream = File.stream!("teams.xml")
iex> Saxy.parse_stream(stream, SaxyParser, {[], []})

Let’s break it down!

Entry point

The entry point is Saxy.parse_stream/4 (or its alternative, Saxy.parse_string/4). These functions take four inputs:

  • a stream/string
  • a module that implements the Saxy.Handler behaviour (i.e. the parser)
  • an initial state for the parser, which is completely arbitrary (it’s up to you to decide what your initial state is)
  • some options

In our example, the initial state is a pair of empty lists: the first is a stack of the currently open XML elements, the second is the list of teams parsed so far.
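
If the whole document is already in memory as a binary, Saxy.parse_string/4 works the same way with the same handler. A minimal sketch:

xml = File.read!("teams.xml")
# On success, parse_string returns {:ok, final_state}. Note that teams are
# accumulated head-first, so they come out in reverse document order.
{:ok, {_elements, teams}} = Saxy.parse_string(xml, SaxyParser, {[], []})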

Saxy.Handler

The parser itself has to implement the Saxy.Handler behaviour, which simply means writing a handle_event clause for various events:

  • :start_document and :end_document (pretty self-explanatory)
  • :start_element and :end_element are sent when an XML element starts and ends
  • :characters is sent when binary data inside an XML element is encountered

Each call to handle_event should return a tuple whose first element is one of :ok, :stop or :halt, together with some value. :stop and :halt let you stop the computation and return a result, which can be useful if you don’t want to parse the whole document. :ok lets you advance the parsing by returning a new state, which will be given as input to the next handle_event.
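
For instance, if we only cared about the first team, a hypothetical extra clause like this one could short-circuit the parse (it would have to come before the generic :end_element clause shown above):

# Stop as soon as the first <team> closes, returning just that team
# instead of parsing the rest of the document.
def handle_event(:end_element, "team", {_elements, [team | _]}) do
  {:stop, team}
end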

The idea, then, is to gradually parse the XML and build your output through each event. In our experience, the meat of the parsing happens on :start_element, :end_element and :characters, as can be seen in the example above.

To aid in parsing, handle_event also receives some event data, which usually includes info such as the element’s name or, in the case of the :characters event, its content.
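
As an example, :start_element also carries the element’s attributes as a list of {key, value} pairs. If teams had their id as an attribute instead (say <team id="1">), a clause might look like this (hypothetical, not part of our parser):

def handle_event(:start_element, {"team", attributes}, {elements, teams}) do
  # Pull the "id" attribute out of the attribute list, if present.
  team =
    case List.keyfind(attributes, "id", 0) do
      {"id", id} -> %{"id" => id}
      nil -> %{}
    end

  {:ok, {["team" | elements], [team | teams]}}
end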

Parsing a team

def handle_event(:start_element, {name, _}, {elements, teams}) do
  teams =
    case name do
      "team" ->
        [%{} | teams]

      _ ->
        teams
    end

  {:ok, {[name | elements], teams}}
end

def handle_event(:end_element, _, {[_ | elements], teams}) do
  {:ok, {elements, teams}}
end

When a new team is encountered (i.e. when a team element is opened), we simply add a new team, as an empty map, to our list of teams. The new state then contains the updated stack of open elements and the updated list of teams.

Parsing team names and ids

def handle_event(:characters, chars, {[element | elements], teams}) do
  teams =
    cond do
      element in ["name", "id"] ->
        teams
        |> List.first(%{})
        |> Map.put(element, chars)
        |> then(&List.replace_at(teams, 0, &1))

      true ->
        teams
    end

  {:ok, {[element | elements], teams}}
end

When we encounter a :characters event, that means we are parsing a leaf, which can either be a name or an id. Based on the last opened tag (element in this snippet), we can then set the correct key in the team map we are building.

As you can see, the code can become quite cumbersome to both read and maintain. To improve this, it’s highly recommended to create a separate module to handle the state. For instance, in this case we could have a Teams module exposing functions like add_new_team, set_team_name and so on, as sketched below.
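
Such a module might look something like this (a sketch of the idea, not the exact code we shipped):

defmodule Teams do
  # Wraps the parser state: a stack of currently open elements plus the
  # list of teams parsed so far.
  defstruct elements: [], teams: []

  def new, do: %__MODULE__{}

  # Called when a <team> element opens.
  def add_new_team(%__MODULE__{teams: teams} = state),
    do: %{state | teams: [%{} | teams]}

  def set_team_name(state, name), do: put_field(state, "name", name)
  def set_team_id(state, id), do: put_field(state, "id", id)

  def push_element(%__MODULE__{elements: elements} = state, name),
    do: %{state | elements: [name | elements]}

  def pop_element(%__MODULE__{elements: [_ | rest]} = state),
    do: %{state | elements: rest}

  def current_element(%__MODULE__{elements: [element | _]}), do: element
  def current_element(_state), do: nil

  # Set a field on the team currently being built (the head of the list).
  defp put_field(%__MODULE__{teams: [team | rest]} = state, key, value),
    do: %{state | teams: [Map.put(team, key, value) | rest]}
end

With this, the handle_event clauses shrink to one-liners that delegate to Teams, and the tuple juggling disappears behind a single data structure.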

Results

As mentioned, Saxy is pretty fast and much lighter on memory. Here is a snapshot of CPU and memory usage around the release of the updated cron job, which happened on the 20th of May:

After releasing the updated cron job, we can see a sharp decline in memory usage and, surprisingly, a decrease in CPU usage too. To get a better idea of this refactoring’s impact, here is the CPU and memory usage during a longer time frame (February to July):

While CPU usage varies quite a bit, we can safely say that it has seen a 20% to 30% decrease. Memory usage on the other hand decreased by about 90% and is more or less stable at 100MiB.

Since we set out to avoid getting OOM Killed and reduce memory usage as much as possible, these results indicate that we can consider our refactor successful.

Finally, we also noticed one added benefit that we didn’t set out to achieve and didn’t expect: the cron job runtime was halved. Before this refactor, the cron job took around 2 minutes to complete. Now it takes about 1 minute. While performance was not our primary concern, it was a nice surprise indeed.
