Intro: Protobuf

Or Protocol Buffers, an open source solution by Google to manage the reading and writing of structured data.

Notes from the documentation.

Overview

Protocol Buffers are a flexible, efficient, and automated way to serialize and retrieve structured data. You write a .proto description of the data structure you wish to store. The protocol buffer compiler then creates a class in your chosen programming language that automatically encodes and parses the protocol buffer data in an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of serializing and deserializing the protocol buffer as a unit. The format can be extended over time in such a way that code can still read data encoded with the old format.

Benefits

  • supports multiple languages
  • supports multiple platforms
  • highly efficient binary encoding (fields are identified by numeric tags, and enums are stored as integers, C-style)
  • compact data storage
  • fast parsing
  • easily grokkable syntax
  • supports safe evolution of API contracts (backwards- and forwards-compatible)
  • supports reflection: you can iterate over the fields of a message and manipulate their values without writing your code against any specific message type. This is helpful for converting protocol messages to and from other encodings (e.g., XML or JSON), for finding differences between messages of the same type, and for developing a sort of regex for protocol messages that matches on certain message contents.

.proto files

.proto files have a syntax similar to C++ or Java. Each .proto file starts with a package declaration, which prevents naming conflicts between different projects. There may also be options specific to your chosen language. For example, Java has the java_multiple_files, java_package, and java_outer_classname options, which control whether to generate one file per class (rather than putting all classes in one file), the name of the generated Java package, and the name of the wrapper class.

Then come the message definitions, which are aggregates containing a set of typed fields. A field can be a primitive type, an enum, or another message type. Each field has a marker such as = 1 or = 2 that identifies the unique "tag" the field uses in the binary encoding. Tag numbers 1-15 take one less byte to encode than higher numbers, so as an optimization, use them for commonly used or repeated elements. Tag numbers 16 and above can go to less commonly used optional elements.
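To see where the 1-15 cutoff comes from, here is a minimal sketch of how a field key is written on the wire: the tag number is shifted left by three bits, combined with a 3-bit wire type, and stored as a base-128 varint. The function names are my own and not part of any protobuf library; only the key layout comes from the encoding documentation.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_field_key(field_number: int, wire_type: int) -> bytes:
    """Key = (field_number << 3) | wire_type, written as a varint."""
    return encode_varint((field_number << 3) | wire_type)


WIRE_VARINT = 0  # wire type used for ints, bools, and enums

# Print the tag number, the key length in bytes, and the key bytes in hex.
for number in (1, 15, 16, 2047, 2048):
    key = encode_field_key(number, WIRE_VARINT)
    print(number, len(key), key.hex())
```

Only four bits of the first byte are left for the tag number once the wire type and the continuation bit are accounted for, which is why 15 is the largest tag that fits in a single byte.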

Each field requires one of the following annotations:

optional

The field may or may not be set. If it is unset, it takes the default value specified in the .proto file or, failing that, the system default:

  • Numbers: zero
  • Strings: empty string
  • Bools: false
  • Embedded messages: the default instance or prototype of the message (which has no fields set)

repeated

The field may appear any number of times, including zero. The order of the repeated values is preserved in the protocol buffer. These are like dynamically sized arrays.

required

A message that is missing a required field is considered uninitialized. In Java, building an uninitialized message throws a RuntimeException, and parsing one throws an IOException.

Be careful with required. For backwards compatibility, any modifications to the protocol must not remove required fields. These are forever!

Generating code in your programming language

After writing your .proto file, you can compile it into your chosen programming language using the protocol buffer compiler, protoc. Here's an example command using Java as the output language:

protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/YOUR-PROTO-FILE.proto

The generated .java files use standard Java camel-case naming and come with builders and JavaBeans-style getters and setters for defining the messages. Output for other languages likewise follows that language's own conventions, as long as the .proto file follows the .proto style guide.

In addition to the messages, the protocol buffer classes should have methods for writing and reading (serializing and deserializing) the messages using the protocol buffer binary format.

Note: you should never add behavior to generated classes by inheriting from them. Doing so may break internal mechanisms, and it's not good practice anyway. Instead, add behavior by wrapping the generated protocol buffer class in an application-specific class. The wrapper acts as an interface between the external specification and the specific environment of your application.
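A rough sketch of the wrapping idea, in Python for brevity. PersonMessage is a hypothetical stand-in for a protoc-generated class (a real one would come from a generated module), and Contact with its methods is an invented application-side wrapper, not anything protobuf provides.

```python
class PersonMessage:
    """Hypothetical stand-in for a protoc-generated message class."""

    def __init__(self, name: str = "", email: str = ""):
        self.name = name
        self.email = email


class Contact:
    """Application-specific wrapper: holds a message instead of subclassing it."""

    def __init__(self, message: PersonMessage):
        self._message = message

    def has_email(self) -> bool:
        return bool(self._message.email)

    def display_label(self) -> str:
        # Behavior the application needs but the schema does not define.
        if self.has_email():
            return f"{self._message.name} <{self._message.email}>"
        return self._message.name

    def to_message(self) -> PersonMessage:
        # Hand the underlying message back when it is time to serialize.
        return self._message


contact = Contact(PersonMessage(name="Ada", email="ada@example.com"))
print(contact.display_label())
```

If the .proto definition changes, only the wrapper needs to adapt, and the rest of the application keeps talking to Contact.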

Extending your protocol buffer

To evolve or extend your protocol buffer, ensure it is backwards compatible:

  • do not change tag numbers of existing fields
  • do not add or delete required fields
  • you can delete optional or repeated fields
  • you can add new optional or repeated fields with fresh tag numbers. This means only using tag numbers that have never been used before, including those of deleted fields.
  • there are some exceptions to the above, but they are rarely used
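These rules make more sense with the wire format in mind: every value is stored next to its tag number, so a reader can step over tags it does not recognize. Below is a minimal sketch of that idea, assuming a toy decoder that handles only wire types 0 (varint) and 2 (length-delimited). The schema dictionaries and field names are invented for illustration, and real implementations do considerably more.

```python
def read_varint(data: bytes, pos: int):
    """Return (value, next_pos) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_known_fields(data: bytes, known: dict) -> dict:
    """Decode fields whose tag is in `known` and skip all others."""
    message = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint value
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited value (strings, bytes)
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known:
            message[known[tag]] = value
        # Unknown tag: its value has already been consumed, so it is skipped.
    return message


# Hypothetical schemas: version 1 knows tags 1 and 2, version 2 added tag 3.
OLD_SCHEMA = {1: "name", 2: "id"}
NEW_SCHEMA = {1: "name", 2: "id", 3: "email"}

# Bytes as a version 2 writer would produce them: name="Al", id=150, email="a@b"
payload = (
    bytes([0x0A, 0x02]) + b"Al"      # tag 1, length-delimited
    + bytes([0x10, 0x96, 0x01])      # tag 2, varint 150
    + bytes([0x1A, 0x03]) + b"a@b"   # tag 3, length-delimited
)

print(parse_known_fields(payload, OLD_SCHEMA))
print(parse_known_fields(payload, NEW_SCHEMA))
```

The old reader still recovers name and id from the newer payload. If tag 3 had instead reused the number of a deleted field, an old reader would misread the new value as the old field, which is why tag numbers must stay fresh.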