An overview of serialization formats. Numbers and anecdotes.

There are lots of format specifications to serialize your data against. These days I have been looking for potential alternatives to YAML, which has been my go-to tool for a while, basically because Rails adopted YAML from its very beginnings, Ruby developers followed the leader, and it's now pretty widely used. Funnily enough, YAML was born as a markup language and developed into a data-oriented format. The main reason I'm writing this blog post is so that, the next time I have to choose a serialization format, I can analyze the problem and see which format fits it best.

If you are not familiar with serialization formats, just for the sake of making the article a little more engaging, here are some potential uses of serialized data:

  • Caches
  • Inter-process communication
  • Dummy objects for testing
  • Message brokers

Keeping these points in mind, let's go on and analyze a few of the most promising serialization formats these days. I'd argue this list includes (but is not limited to) HDF5, BSON, MessagePack, YAML and Protocol Buffers. I would love to write about Thrift and Avro, but I have no experience with them, nor do I know them very well; I might update the post with information about them in the future. Shoot me an email if you want to do it yourself! I don't want to get into things like separating statically and dynamically typed formats, mainly because these formats are different enough to show more significant differences in performance, space and other metrics than something that is (sort of) a preference. Now, the list:

HDF5

HDF5 is a hierarchical data format born more than 20 years ago at the NCSA (National Center for Supercomputing Applications). Born and bred in scientific environments, it's not surprising that its user base is largely scientific laboratories and research institutes. ROOT, CERN's data analysis framework, uses a variation of HDF5 for its own files and is largely compatible with it. It is not a human-readable format; to me its most interesting feature is that it looks like it was designed for parallelizing IO operations. It does this by separating the data into chunks of related information, in a sort of n×n table. Any HDF5 reader can then pick a chunk (a rectangle or a set of points) of this virtual table and start processing it, while another worker does the same in another part of the table, and so on. Neat, but datasets are large and they cannot be easily compressed because of this.
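
Here is a minimal sketch of that access pattern, using Python's h5py library (my choice purely for illustration; the file name and shapes are made up):

    import h5py
    import numpy as np

    # Create a 2D dataset stored as independent 100x100 chunks, so readers
    # can fetch any tile without touching the rest of the file.
    with h5py.File("example.h5", "w") as f:
        dset = f.create_dataset("grid", shape=(1000, 1000), dtype="f8",
                                chunks=(100, 100))
        dset[:100, :100] = np.random.rand(100, 100)

    # A worker can now open the file and read just its assigned region;
    # only the chunks covering that slice are read from disk.
    with h5py.File("example.h5", "r") as f:
        block = f["grid"][:100, :100]
        print(block.shape)  # (100, 100)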

BSON

BSON is an attempt to serialize JSON documents in binary. It shares parts of the JSON spec, but it also adds embeddable features, like data types that are not part of JSON (Date, BinData). Funnily enough, a BSON file is not always smaller than its JSON equivalent, but there is a good reason for this: traversability. The overhead BSON introduces to improve access times is minimal and actually pretty easy to explain.

Take a trivial JSON document like this one (the canonical example from the BSON spec):
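
    {"hello": "world"}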

Serializing this to BSON leaves length markers before the document and its strings (it does more than this, but I want to focus on the overhead):
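
    \x16\x00\x00\x00    total document length: 22 (0x16) bytes
    \x02                element type 0x02, a UTF-8 string
    hello\x00           field name, null-terminated
    \x06\x00\x00\x00    string length: 6 bytes, trailing \x00 included
    world\x00           the value itself
    \x00                end of document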

These markers are a very simplistic way of telling the BSON parser “hey, if you want to skip this segment, just advance your pointer X bytes to find something else”.
Therefore the efficiency of BSON lies in having a smart parser that understands the encoding properly. As an example of what the parser can optimize, think of the number 1 in JavaScript. Of course this needs to be stored as a number in BSON, but for certain numbers a smaller representation (an int32 instead of a 64-bit double) is enough and you don't need to waste 8 bytes per number. Your parser can figure things like this out, and people at 10gen and other places where MongoDB is used are probably constantly finding ways to improve the parser for a given language.
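
As a quick sanity check, here is a sketch using the bson module that ships with PyMongo (any BSON library exposes something similar):

    import struct
    import bson  # the module bundled with PyMongo

    data = bson.encode({"hello": "world"})
    print(len(data))                 # 22 bytes, matching the layout above

    # The first int32 is the total size; a parser can skip a whole
    # embedded document by advancing this many bytes.
    total = struct.unpack("<i", data[:4])[0]
    print(total)                     # 22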

MessagePack

MessagePack is not too different from BSON in the sense that both try to serialize JSON. Unlike BSON, MessagePack tries to keep a one-to-one correspondence between its spec and JSON's spec, so there is no loss of data in the conversion and the format is more transparent. This affects space heavily: a trivial document like {"a": 1, "b": 2} is 7 bytes in MessagePack (19 in BSON). On the other hand, something as simple as having extra metadata on the binary object (as BSON does) can help in situations where the data is constantly changing, as it lets you change values in place; MessagePack has to reserialize the whole object if a change needs to be made. This is just my personal opinion, but these differences are probably what makes MessagePack very suitable for networking, while BSON is more suited to storage scenarios, like MongoDB.
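
Those numbers are easy to reproduce. A minimal sketch, assuming the msgpack and PyMongo bson packages are installed:

    import msgpack
    import bson

    doc = {"a": 1, "b": 2}
    print(len(msgpack.packb(doc)))  # 7 bytes
    print(len(bson.encode(doc)))    # 19 bytes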

YAML

YAML appeared as a human-readable alternative to XML. It is barely usable as a language for object serialization, but it's worth mentioning why, as the same reasons ruled out other possible candidates.

In the event of a network failure, a YAML file might arrive truncated and there is no way to tell whether what reached the other peer is complete or not; most serialization formats simply break if you slice the file. There is also still no support for YAML schemas, so two peers cannot agree on a data exchange format. That renders it unusable for RPC, message brokers and the like.
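
A quick illustration of the truncation problem with PyYAML (the exact cut point matters, but plenty of cuts go completely unnoticed):

    import yaml

    doc = "name: test\nvalues:\n  - 1\n  - 2\n  - 3\n"
    print(yaml.safe_load(doc))       # {'name': 'test', 'values': [1, 2, 3]}

    # Chop the stream mid-document, as a flaky connection might.
    print(yaml.safe_load(doc[:24]))  # {'name': 'test', 'values': [1]} -- no error raised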

Protocol Buffers

A very smart way of defining messages (though it works for any structured data) designed by Google. Instead of being human-readable like some of the formats mentioned previously, Protocol Buffers uses source files (".proto" files) that have to be compiled into code that reads and writes the binary messages. It is mainly geared towards C++ programming, but there are implementations in many languages. From my experience, Clojure's library and other Lisps' libraries are pretty much abandoned, while Ruby's implementation is actively developed. I wouldn't recommend anything but the official ones (Java, C, C++ and Python).

An example of a .proto file (this one adapted from Google's documentation):
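
    // A minimal message definition (proto2 syntax, which was current at the time).
    message Person {
      required string name  = 1;
      required int32  id    = 2;
      optional string email = 3;
    }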

The way Protocol Buffers encodes data has two aims: consumers should be able to read data produced by newer producers by simply skipping unexpected fields, and consumers have to be able to find the end of a field without needing any metadata about it. The whole encoding, known as the binary wire format, revolves around solving this problem (varints, ZigZag coding). In short, it uses varints to encode the data: integers grouped in 7-bit sets, where the high bit (MSB) of each byte says whether more bytes follow, so a cleared MSB acts as the stop bit. Negative values are handled by ZigZag-coding the integer before varint-encoding it, with the following function (sketched here in Python):
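
    def zigzag(n):
        # Interleave negatives and positives: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3 ...
        # This is the 64-bit variant; use n >> 31 for 32-bit integers.
        return (n << 1) ^ (n >> 63)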

This basically renders -1 as 1, 1 as 2, -2 as 3, 2 as 4 and so forth, so values of small magnitude stay small on the wire.
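
Putting the two pieces together, a toy varint encoder fits in a few lines (a sketch, not the official implementation):

    def varint(n):
        # 7 bits per byte, least significant group first; the MSB is set on
        # every byte except the last one.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    print(varint(zigzag(-2)).hex())  # '03' -- a single byte on the wire
    print(varint(300).hex())         # 'ac02'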

Strings are UTF-8 encoded. The advantages of Protocol Buffers for RPC (as opposed to XML, which apparently was Google's itch for PB) are way faster encoding, smaller files, and being easier to use programmatically (aka "we got rid of the infamous XMLController").

Protocol Buffers encoding guide by Google


—————–

Other resources I found useful:

Binary Serialization Tourguide

Browser Physics – BSON, binary JSON, now for the web