-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: secondary indexes #918
base: main
Are you sure you want to change the base?
Conversation
0ca9916
to
6d905c7
Compare
cc. @jhurliman |
@@ -179,6 +183,30 @@ The message encoding and schema must match that of the Channel record correspond | |||
| 8 | publish_time | Timestamp | Time at which the message was published. If not available, must be set to the log time. | | |||
| N | data | Bytes | Message data, to be decoded according to the schema of the channel. | | |||
|
|||
### Secondary Index Key (op=0x10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should come up with a more future-proof word than "secondary", in case we decide in a V2 to merge the primary indexes into this same record type.
> before reading into the Data section. This means that the Secondary Index Key in the Data section | ||
> is not normally used. However, if a MCAP is truncated and the summary section is lost, having the | ||
> Secondary Index Key appear before any Secondary Message Index records allows the MCAP to be fully | ||
> recovered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary to have this record type, vs the parser just inferring from the existence of the other two record types that the index is in use? I can see dropping this would require moving the "name" field into the secondary chunk index record but that doesn't seem like the biggest thing we stick in those records anyway.
Today the parser knows a file is indexed via the presence of the index records - we don't need a third record type for that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I should follow up with a concrete suggestion - what about
SecondaryMessageIndex
name
channel_id
records
SecondaryChunkIndex
name
chunk_start_offset
first_key
last_key
message_index_offsets
metadata (??? - will elaborate in another comment)
then in the summary offset section, we'd have a new group pointing at "SecondaryChunkIndex". The "name" key is stored in both locations to allow a partially-written file to still have index data recovered, which is a purpose your third record type also supplies. The cost is the duplication of the "name" field in the SecondaryMessageIndex records. It would be good to get some data on how much this costs us - my assumption is that the effect would generally be swamped by the size of "records", this it doesn't justify the extra record type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forget the "metadata" part - I think it would be better implemented with a "chunk info" record as described in another comment.
| ----- | ------------------ | --------------------------------- | -------------------------------------------------------------------------------------------------------------- | | ||
| 2 | channel_id | uint16 | Channel ID. | | ||
| 2 | secondary_index_id | uint16 | Secondary Index ID. | | ||
| 4 + N | records | `Array<Tuple<Timestamp, uint64>>` | Array of timestamp and offset for each record. Offset is relative to the start of the uncompressed chunk data. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be key, offset I think
I think rather than conceptualizing this as "secondary indexes", it would be better to think about it as "custom index support" with the intention of it eventually absorbing the existing chunk/message index scheme as well. I think once we have this "secondary index scheme", what we'd probably want to do is split a "chunk info" record out of the existing "chunk index" record, to contain stuff like compression stats and chunk start/end that aren't really relevant to message indexes. Then your "secondary chunk index" can I think be used as the only kind of chunk index. If we want to we could also add to "chunk info" some "metadata" or "typed statistics" kind of field to allow custom index authors to display custom chunk statistics from the info command. |
[Summary Offset 0x01] | ||
[Summary Offset 0x05] | ||
[Summary Offset 0x07] | ||
[Summary Offset 0x08] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these summary offsets seem wrong, 0x01 is the header and that's not in the summary section.
I think they should be 0x03, 0x04, 0x10, 0x08, 0x12, 0x0A, 0x0B
I think we'll also want to slot the integrations into whatever existing info & summarization subcommands would touch it (probably at least "mcap info" and "mcap list chunks", "mcap cat"). That would provide a good demonstration mechanism.
For this, would you be including two secondary indexes under the proposed scheme? You would get sort by start/end for free but additional work/first-class treatment would probably be required if supporting search by overlap (which seems more niche for this case but I'm not sure what else people may need).
I think even constraining to a uint64 rather than a "Timestamp" would probably be a win. If the user records a massively disordered file based on their desired ordering, it seems ok to me to let tooling fall over or handle as desired, rather than prohibiting via spec. That's the same approach we have taken with the existing timestamps. That would knock out the sequence ID feature as well. Operations like "get 50 messages ordered on $secondary starting at t" would still be quicker than if the file had no index.
+1
+1. I think the "record before chunk" idea is interesting but easily splittable from this. |
|
||
## Secondary index keys | ||
|
||
The Secondary Index Key `name` field may contain the following options: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "may" indicate I can put my own stuff in there and expect some tooling support? I think this would be good to shoot for. Rather than having studio or whatever hard code "header.stamp", "publish_time", etc, would it be viable to dynamically show a list of sort options based on the file's index section?
And likewise with the info command, CLI, reader support etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i feel like you should expect some tooling support for any key, but tooling can make extra assumptions about well-known keys.
| 2 | secondary_index_id | uint16 | Secondary Index ID. | | ||
| 8 | chunk_start_offset | uint64 | Offset to the chunk record from the start of the file. | | ||
| 8 | earliest_key | Timestamp | Earliest key in the chunk. Zero if the chunk contains no messages with this key. | | ||
| 8 | latest_key | Timestamp | Latest key in the chunk. Zero if the chunk contains no messages with this key. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using a zero timestamp as a sentinal is something we do elsewhere but not strictly correct, since 0 is a valid timestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about saying chunk indexes should be omitted when they would apply to no messages?
another thing to think about (I don't remember if it's part of this patch) is indexes that are chunk-level only. Since remote readers need to download the full chunk to make use of an index on it, and can probably decompress/scan the chunk faster than they can download it (not guaranteed but common) the benefits of the message level index can be pretty marginal, however the chunk-level index at the end of the file may still be very useful. Similar to postgres BRIN indexes https://www.postgresql.org/docs/15/brin-intro.html |
Adds new record types which facilitate fast message lookup by timestamps other than
log_time
. These closely mirror the existinglog_time
index structure, but can be based on any user-defined timestamp value.This PR is currently only for review. Before merging, we would:
mcap
CLIBackground
Roboticists need detailed time information in order to understand exactly what happened when for their robot. For a given frame or message, one or more of these may be recorded:
Since most robotics frameworks are designed to allow for distributed compute and recording, we assume that some messages are out-of-order for every timestamp ordering within an MCAP file.
Justification
Right now the MCAP format includes an optional index on
log_time
. However, a roboticist may want to use any of these other timestamp categories to view messages within a given time window, or in a particular time order. Some example use-cases are:Without index information in the file, the above use-cases can be served either by performing a windowed sort (which is how Studio currently supports ordering by header timestamp) or by reading the entire file. A windowed sort has a chance to produce incorrect results, if an out-of-order message appears outside the window.
Paths not taken
There were a few options I considered and rejected during development, listed here for discussion.
log_time
assumes that chunk key ranges are almost-disjoint. We do a top-level search through chunk indexes to find the few chunks that overlap with our time range, then a bottom-level search through message indices to find specific messages. If the indexed keys had a random distribution across the file, all chunk ranges would overlap completely, and the top-level search would lose effectiveness. Message timestamps of all kinds tend to be roughly in increasing order across a file, which provides the needed almost-disjoint chunk ranges. I may still add advice to the specification regarding index effectiveness for random timestamps.