Life of a Segment

Paul Masurel edited this page Jun 22, 2018 · 3 revisions

Specifications

While the life of a segment will be unchanged, some of the finer details of the segment state are implemented implicitly in tantivy 0.6. In that sense, this document describes the refactoring of #325.

Segment? Merge? Garbage collection?

A tantivy index consists of a set of smaller, immutable, independent indexes called segments. Adding new documents and committing does not modify the existing segments, but adds fresh segments to the segment list.

A merging mechanism then keeps the number of segments from exploding: tantivy can decide to initiate a merge of N small segments into a larger one. When the merge terminates successfully, the files of the N smaller segments are no longer useful. These files are guaranteed to be eventually deleted by a file garbage collector.
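The merge behaviour described above can be illustrated with a small toy model. This is a sketch under stated assumptions, not tantivy's implementation: `ToyIndex`, `Segment`, `add_segment` and `merge` are hypothetical names, and a segment is reduced to an id and a document count.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a segment: an id plus the number of documents it holds.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub id: u32,
    pub num_docs: u32,
}

/// Hypothetical index: the live segment list plus the ids of segments
/// whose files are waiting for garbage collection.
#[derive(Debug, Default)]
pub struct ToyIndex {
    pub segments: Vec<Segment>,
    pub garbage: BTreeSet<u32>,
    next_id: u32,
}

impl ToyIndex {
    /// Adding documents appends a fresh segment; existing segments are untouched.
    pub fn add_segment(&mut self, num_docs: u32) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.segments.push(Segment { id, num_docs });
        id
    }

    /// Merge the listed segments into one larger segment.
    /// The merged-away segments leave the segment list and become garbage.
    /// Returns the id of the new segment, or None if no listed id was found.
    pub fn merge(&mut self, ids: &[u32]) -> Option<u32> {
        let (merged, kept): (Vec<Segment>, Vec<Segment>) = self
            .segments
            .drain(..)
            .partition(|s| ids.contains(&s.id));
        self.segments = kept;
        if merged.is_empty() {
            return None;
        }
        let total: u32 = merged.iter().map(|s| s.num_docs).sum();
        self.garbage.extend(merged.iter().map(|s| s.id));
        Some(self.add_segment(total))
    }
}
```

In this model, merging two segments of 10 and 5 documents yields one segment of 15 documents, and the two source ids end up in `garbage` until a collector removes their files.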

Commit

Tantivy does not work like a regular database in the sense that documents are not available right after being added. It does not really enforce a notion of transaction either.

Instead, the user is in charge of batching their add and delete operations and explicitly calling .commit(). Once a commit is successful, tantivy guarantees that all the operations preceding the .commit() are reflected in search and persisted.

In case of a hardware or software failure, upon restart, the index is in the state of the last .commit().[^1]
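The commit semantics above can be sketched with a toy writer. Everything here is an assumption made for illustration: `ToyWriter`, `Op`, `restart` and `is_searchable` are hypothetical and are not tantivy's API, and documents are reduced to integer ids. The sketch only shows that pending operations become visible at `commit()` and are discarded when the process restarts.

```rust
use std::collections::BTreeSet;

/// Hypothetical operations a user can batch before a commit.
pub enum Op {
    Add(u64),
    Delete(u64),
}

#[derive(Default)]
pub struct ToyWriter {
    committed: BTreeSet<u64>, // what search sees, and what survives a failure
    committed_opstamp: u64,   // opstamp of the last successful commit
    pending: Vec<Op>,         // operations batched since the last commit
    opstamp: u64,             // opstamp of the most recent operation
}

impl ToyWriter {
    pub fn add(&mut self, doc: u64) -> u64 {
        self.push(Op::Add(doc))
    }

    pub fn delete(&mut self, doc: u64) -> u64 {
        self.push(Op::Delete(doc))
    }

    fn push(&mut self, op: Op) -> u64 {
        self.pending.push(op);
        self.opstamp += 1;
        self.opstamp
    }

    /// Apply every pending operation; afterwards they are searchable and "persisted".
    pub fn commit(&mut self) -> u64 {
        for op in self.pending.drain(..) {
            match op {
                Op::Add(doc) => {
                    self.committed.insert(doc);
                }
                Op::Delete(doc) => {
                    self.committed.remove(&doc);
                }
            }
        }
        self.committed_opstamp = self.opstamp;
        self.committed_opstamp
    }

    /// Simulate a failure followed by a restart: uncommitted work is lost.
    pub fn restart(&mut self) {
        self.pending.clear();
        self.opstamp = self.committed_opstamp;
    }

    pub fn is_searchable(&self, doc: u64) -> bool {
        self.committed.contains(&doc)
    }
}
```

The opstamp returned by `commit()` is what an external process could record to know how far indexing has progressed.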

Deletes

Deletes are a bit of an exception to the immutability of segments. Deleting documents in an existing segment works by creating a tombstone file that stores a bitset of the DocIds that have been deleted. None of the previous segment files are modified. The previous delete tombstone file is not modified either. Instead, a segment can have more than one tombstone associated with it, each tied to a specific commit opstamp.
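A minimal sketch of per-opstamp tombstones follows. It assumes that each new tombstone is cumulative (it starts from a copy of the previous one) and that opstamps only increase; neither point is stated above. `ToySegment`, `delete_at` and `is_deleted` are hypothetical names, and a `Vec<bool>` stands in for the bitset.

```rust
use std::collections::BTreeMap;

/// Hypothetical segment holding one delete bitset per commit opstamp.
pub struct ToySegment {
    max_doc: usize,
    tombstones: BTreeMap<u64, Vec<bool>>,
}

impl ToySegment {
    pub fn new(max_doc: usize) -> Self {
        ToySegment {
            max_doc,
            tombstones: BTreeMap::new(),
        }
    }

    /// Record deletes for a commit by writing a NEW tombstone.
    /// Earlier tombstones are left untouched.
    /// Assumption: `opstamp` is larger than every opstamp recorded so far.
    pub fn delete_at(&mut self, opstamp: u64, doc_ids: &[usize]) {
        // Assumption: start from a copy of the most recent tombstone.
        let mut bits = self
            .tombstones
            .values()
            .next_back()
            .cloned()
            .unwrap_or_else(|| vec![false; self.max_doc]);
        for &doc in doc_ids {
            if doc < self.max_doc {
                bits[doc] = true;
            }
        }
        self.tombstones.insert(opstamp, bits);
    }

    /// Is `doc` deleted as of the commit identified by `opstamp`?
    pub fn is_deleted(&self, doc: usize, opstamp: u64) -> bool {
        self.tombstones
            .range(..=opstamp)
            .next_back()
            .map_or(false, |(_, bits)| bits.get(doc).copied().unwrap_or(false))
    }

    pub fn num_tombstones(&self) -> usize {
        self.tombstones.len()
    }
}
```

Because older tombstones are kept, a reader pinned to an earlier commit still sees the set of deleted documents that was valid at that commit.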

Implementation details

File creation

tantivy needs to keep track of the files that it creates in order to remove them on garbage collection. For that, it relies on a wrapper around a Directory that records each file in the list of created files BEFORE creating it.

The .managed_files.json file lists the files that, if they exist[^2] and if tantivy identifies them as no longer needed, should be deleted upon garbage collection.
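The "register first, create second" rule can be sketched as follows. This is an illustration under assumptions, not tantivy's wrapper: `ToyManagedDirectory` and its methods are hypothetical, and two in-memory sets stand in for the filesystem and for the managed-files list.

```rust
use std::collections::BTreeSet;

/// Hypothetical directory wrapper.
#[derive(Default)]
pub struct ToyManagedDirectory {
    pub files: BTreeSet<String>,   // stands in for the filesystem
    pub managed: BTreeSet<String>, // stands in for the managed-files list
}

impl ToyManagedDirectory {
    /// Step 1: record the path in the managed list.
    pub fn register(&mut self, path: &str) {
        self.managed.insert(path.to_string());
    }

    /// Register the path BEFORE creating the file, so that a file on disk
    /// is always listed, even if a failure happens between the two steps.
    pub fn create(&mut self, path: &str) {
        self.register(path);
        self.files.insert(path.to_string());
    }

    /// Delete every managed file that is not in `living`.
    /// A listed file that does not exist is simply dropped from the list.
    /// Returns the paths that were actually deleted.
    pub fn garbage_collect(&mut self, living: &BTreeSet<String>) -> Vec<String> {
        let dead: Vec<String> = self
            .managed
            .iter()
            .filter(|path| !living.contains(*path))
            .cloned()
            .collect();
        let mut deleted = Vec::new();
        for path in dead {
            if self.files.remove(&path) {
                deleted.push(path.clone());
            }
            self.managed.remove(&path);
        }
        deleted
    }
}
```

Calling `register` without the matching file creation simulates a failure between the two steps: the entry is listed but the file never exists, and garbage collection handles that case without error.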

Segment lifetime

A living segment can be in one of the following states.

  • Construction
  • Uncommitted
  • UncommittedInMerge
  • Committed
  • CommittedInMerge
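The five states map naturally onto an enum. In the sketch below, only the variant names come from the list above. The two predicates and the `start_merge` and `commit` transitions are assumptions inferred from those names, not a description of what tantivy actually does.

```rust
/// The five states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SegmentState {
    Construction,
    Uncommitted,
    UncommittedInMerge,
    Committed,
    CommittedInMerge,
}

impl SegmentState {
    /// True for the two states whose name starts with "Committed".
    pub fn is_committed(self) -> bool {
        matches!(self, SegmentState::Committed | SegmentState::CommittedInMerge)
    }

    /// True for the two states whose name ends with "InMerge".
    pub fn is_in_merge(self) -> bool {
        matches!(
            self,
            SegmentState::UncommittedInMerge | SegmentState::CommittedInMerge
        )
    }

    /// Assumed transition: a finished segment that is not already being
    /// merged can be picked up by a merge.
    pub fn start_merge(self) -> Option<SegmentState> {
        match self {
            SegmentState::Uncommitted => Some(SegmentState::UncommittedInMerge),
            SegmentState::Committed => Some(SegmentState::CommittedInMerge),
            _ => None,
        }
    }

    /// Assumed transition: a commit promotes an uncommitted segment and
    /// keeps its "in merge" flag.
    pub fn commit(self) -> Option<SegmentState> {
        match self {
            SegmentState::Uncommitted => Some(SegmentState::Committed),
            SegmentState::UncommittedInMerge => Some(SegmentState::CommittedInMerge),
            _ => None,
        }
    }
}
```

Returning `Option` makes the transitions that this sketch does not allow explicit, for example starting a merge on a segment still under construction.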

Footnotes

[^1] Synchronisation with an external process (for instance when indexing logs) can be achieved by adding a payload to the .commit() or, conversely, by bookkeeping the notion of opstamp.

[^2] They are not guaranteed to exist. For instance, a failure right after an update of the managed.json file can leave the index in a state where a managed file has been deleted or has never been created. This is not a problem. What we want to guard against is the converse: a file that has been created by tantivy and still exists on the filesystem should always be listed in the managed.json file.