MessageSemantics

haberman edited this page Dec 25, 2011 · 6 revisions

The ownership and default semantics of protobuf messages have some subtle corner cases. The two key considerations to reconcile are:

  1. we want to be able to read deeply nested fields (e.g. foo.bar.baz) without having to first test for message presence at every level (e.g. if (foo.has_bar() && foo.bar().has_baz())).
  2. when serializing a message, we don't want to serialize empty submessages just because we read a default value out of that submessage.

Scalar fields

The semantics for scalar fields (numbers, bools, strings) are simple: if you just read a field's default value but never set it, the value is considered unset and will not be serialized.

  // C++ example:
  MyMessage msg;
  int32_t x = msg.myfield();  // Returns default.
  msg.has_myfield();          // Returns false; will not be serialized.

  msg.set_myfield(5);
  msg.has_myfield();          // Returns true; will be serialized.

  msg.clear_myfield();
  msg.has_myfield();          // Returns false; will not be serialized.

The semantics for a dynamic language like Python are almost identical:

  # Python example:
  msg = MyMessage()
  x = msg.myfield
  msg.HasField("myfield")   # Returns false; will not be serialized.

  msg.myfield = 5
  msg.HasField("myfield")   # Returns true; will be serialized.

  msg.ClearField("myfield")
  msg.HasField("myfield")   # Returns false; will not be serialized.
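
One way to picture these semantics is to store only the fields that were explicitly set and to fall back to the default on reads. The sketch below is a toy model, not the real protobuf or upb implementation; the class ToyMessage and its serialized_fields() helper are hypothetical names introduced only for illustration.

```python
class ToyMessage:
    """Toy message with one scalar field; presence is tracked separately from value."""

    DEFAULT_MYFIELD = 0  # Assumed default for the hypothetical "myfield".

    def __init__(self):
        # Only explicitly set fields are stored here.
        self._set_fields = {}

    @property
    def myfield(self):
        # Reading never records anything; it just falls back to the default.
        return self._set_fields.get("myfield", self.DEFAULT_MYFIELD)

    @myfield.setter
    def myfield(self, value):
        self._set_fields["myfield"] = value

    def HasField(self, name):
        return name in self._set_fields

    def ClearField(self, name):
        self._set_fields.pop(name, None)

    def serialized_fields(self):
        # Stand-in for serialization: only set fields would be written.
        return dict(self._set_fields)
```

Under this model, explicitly assigning the default value still marks the field as set, because presence depends on the write and not on the value.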

Submessage fields

Submessage fields are more complicated because we want to be able to inspect deep messages without causing any implicitly-created submessages to be serialized. There is also the issue of submessage ownership; languages without garbage collection like C++ often create an ownership model where submessages are owned by the parent message:

  // C++ example:
  MyMessage msg;
  msg.bar().baz();  // Returns default value; msg.bar() is const.
  msg.has_bar();    // Returns false; msg.bar will not be serialized.

  msg.mutable_bar()->set_baz(5);
  msg.has_bar();    // Returns true; msg.bar will be serialized.

  // C++ has direct ownership of submessages, so you can't assign
  // submessage instances.
  msg.set_bar(MyBarMessage());  // XXX does not exist

This ownership model is a poor fit for dynamic languages. The mutable_ accessor convention in C++ doesn't match dynamic language conventions, where "const" containers are generally not used. In Python, users expect reads and writes to look like this:

  x = foo.bar.baz
  foo.HasField("bar")   # Returns false; we only inspected it, so it won't be serialized.

  # Python users expect to be able to say this:
  foo.bar.baz = 5
  foo.HasField("bar")   # Returns true because we set a field of the submessage.

  # It would be non-idiomatic and annoying if the design were like C++.
  # This is *not* how the Python bindings actually work.
  foo.bar.baz = 5   # Raises an error (hypothetically); foo.bar is immutable.
  foo.mutable_bar.baz = 5
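
A possible way to get the behavior Python users expect is to create the submessage lazily on first read and have it notify its parent on the first write. The sketch below assumes that design; Foo, Bar, and the on_modified callback are hypothetical names, and this is not a description of how any real binding is implemented.

```python
class Bar:
    """Toy submessage with one scalar field, baz."""

    def __init__(self, on_modified=None):
        self._baz = None                 # None means "unset".
        self._on_modified = on_modified  # Hypothetical hook into the parent.

    @property
    def baz(self):
        return 0 if self._baz is None else self._baz

    @baz.setter
    def baz(self, value):
        self._baz = value
        if self._on_modified is not None:
            self._on_modified()          # Tell the parent we now hold data.


class Foo:
    """Toy parent message with one submessage field, bar."""

    def __init__(self):
        self._bar = None
        self._has_bar = False

    @property
    def bar(self):
        # Created implicitly on first access, but not yet marked present.
        if self._bar is None:
            self._bar = Bar(on_modified=self._mark_bar_present)
        return self._bar

    def _mark_bar_present(self):
        self._has_bar = True

    def HasField(self, name):
        if name != "bar":
            raise ValueError(name)
        return self._has_bar
```

With this design, a write through an alias (bar = foo.bar; bar.baz = 5) also marks foo.bar as present, since the alias is the same object that holds the callback.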

One other thing that dynamic language users expect is that they can "reparent" messages at will.

  bar = Bar()
  msg = MyMessage()
  msg.bar = bar   # Should we allow this?

Should we allow this kind of reparenting or not? There are pros and cons. The pros are convenience and efficiency, as well as composability:

  # If I'm composing a message, reparenting lets me compose the sub-parts in a
  # more functional style.
  msg.bar = MakeBar()

  # If I can't reparent, the above looks more like:
  FillInBar(msg.bar)

  # If I've obtained a Bar from some other data source, I can make it part of
  # another message without having to copy.
  msg.bar = ParseBar()

On the other hand, allowing reparenting opens some cans of worms:

  # If I can reparent, I can create cycles, which must be detected as an error
  # at serialization time (which would have a potentially significant cost).
  # It could be useful to create such cycles in some cases, but since they
  # aren't serializable it might be better to disallow them.
  msg.msg = msg

  x = msg.foo.bar  # Read only, won't serialize msg.foo.
  foo = msg.foo
  foo.bar = 5      # Write to foo; now msg.foo will be serialized. Is this unexpected?

  x = msg2.foo.bar  # Read only, won't serialize msg2.foo.
  msg3.foo = msg2.foo  # Should msg3.foo be serialized, since it was explicitly assigned?
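
To make the cost of cycle detection concrete, here is a minimal sketch of a check that could run before serialization. It assumes messages form a graph of objects with named submessage fields; Node and check_acyclic are hypothetical names, not part of any existing API.

```python
class Node:
    """Toy message: a name plus a mapping of field name -> submessage."""

    def __init__(self, name):
        self.name = name
        self.children = {}


def check_acyclic(root):
    """Raise ValueError if a message is reachable from itself."""
    on_path = set()  # ids of messages on the current root-to-node path.

    def visit(node):
        if id(node) in on_path:
            raise ValueError(f"cycle detected at message {node.name!r}")
        on_path.add(id(node))
        for child in node.children.values():
            visit(child)
        on_path.discard(id(node))

    visit(root)


# A submessage shared by two fields is not a cycle and passes the check.
shared = Node("shared")
parent = Node("parent")
parent.children["a"] = shared
parent.children["b"] = shared
check_acyclic(parent)
```

The check walks every message that would be serialized, so its cost grows with the size of the serialized tree, which is the overhead the comment above is concerned about. A self-reference such as msg.children["msg"] = msg would make check_acyclic(msg) raise ValueError.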

Another issue: if an implicitly created submessage has a field set, and that field (or the whole submessage) is later cleared, should the submessage still be serialized?

  msg = MyMessage()
  msg.foo.bar = 5            # The write will cause foo to be serialized.
  msg.foo.ClearField("bar")  # Now should foo be serialized?
  msg.foo.Clear()            # How about now?
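
The two obvious answers can be stated as two presence policies. The sketch below contrasts them; it is only an illustration of the design choice, and the names Sub, present_sticky, and present_derived are hypothetical.

```python
class Sub:
    """Toy submessage that records both its fields and whether it was ever written."""

    def __init__(self):
        self.fields = {}
        self.touched = False  # Set by any write; never reset by clear_field().

    def set(self, name, value):
        self.fields[name] = value
        self.touched = True

    def clear_field(self, name):
        self.fields.pop(name, None)


def present_sticky(sub):
    # Policy A: once written, the submessage stays present even if emptied.
    return sub.touched


def present_derived(sub):
    # Policy B: the submessage is present only while it holds a set field.
    return bool(sub.fields)
```

After foo.set("bar", 5) followed by foo.clear_field("bar"), policy A still reports the submessage as present while policy B does not. Policy A is cheaper, since presence is a single flag, but it can serialize empty submessages; policy B avoids that at the cost of inspecting the submessage's contents.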