Why not use UUIDs? #54
Hello! Theoretically, the chance of getting duplicate document IDs is only slightly higher than getting duplicate UUIDs using common algorithms. What do you think? |
I have no idea what the chance of getting duplicate document IDs is; I did not calculate it. I see this more as an issue of interoperability and compatibility. UUIDs should make it possible to synchronize, migrate and combine data between databases that support UUIDs without having to fear duplicated IDs. Think of "foreign keys" between different databases in different DBMSs. |
Thanks. My primary argument against using UUID as the compulsory (primary) document ID is that UUID generation and storage are a lot more expensive compared to long integer IDs. Thinking about compatibility - to simplify management of UUIDs in documents, how about having these interfaces:
|
The proposed interfaces would be fine for me. |
all right, shall do! |
these APIs will make their way into version 3.1 =) |
Great, thanks! |
I know I am a bit late to the discussion, but I just wanted to offer a word of warning about UUID's. If a client written in Java or .NET expects to parse them and create native UUID/GUID objects, they may have trouble if the way you serialize them to a string does not match their platform's native scheme. While UUID is a "standard", there are several types and some types have additional sub-types.
CouchDB's HTTP interface offers an option for clients to retrieve a batch list of server-generated UUID's in a JSON document to use for inserts. You may want to consider adding that functionality to tiedot's HTTP interface. (None of these concerns really apply to the embedded interface.)
MongoDB stores user-supplied UUID's as a particular type of byte array. It is a real mess when choosing how to display it in a web client. It can be displayed as a JSON sub-object with properties for type and a base64 representation of the UUID data, or an application (like Robomongo) can convert the value into any of three UUID string formats (Java, .NET, and Python).
Tiedot currently uses a 64-bit integer for ID's. (I believe that is the only type currently allowed.)
Some questions:
Side note: Instead of UUID Id's like: 6ec80f37-fe6a-429a-99d8-d0f4500ea7b6 or 6ec80f37fe6a429a99d8d0f4500ea7b6
This is not really standard either, but the encoding is more dense than base-16 or base-10 numbers as text and easier to read and re-type (depending on the font - 1 vs. I vs. l (L)).
Anyway, didn't mean to go this long - just wanted to give a heads-up to avoid some possible pitfalls with UUID's. |
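To make the two string forms quoted above concrete, here is a minimal Go sketch that generates a random (version 4) UUID and renders the same 16 bytes both with and without dashes. The function names are hypothetical and nothing here is part of tiedot; it only assumes the RFC 4122 layout for the version and variant bits.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newUUIDv4 returns 16 random bytes with the RFC 4122 version (4) and
// variant bits set. Hypothetical helper, not a tiedot API.
func newUUIDv4() ([16]byte, error) {
	var u [16]byte
	if _, err := rand.Read(u[:]); err != nil {
		return u, err
	}
	u[6] = (u[6] & 0x0f) | 0x40 // high nibble of byte 6 carries the version
	u[8] = (u[8] & 0x3f) | 0x80 // top two bits of byte 8 carry the variant
	return u, nil
}

// canonical renders the dashed 8-4-4-4-12 hex form.
func canonical(u [16]byte) string {
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

// compact renders the same bytes as 32 hex digits without dashes.
func compact(u [16]byte) string {
	return hex.EncodeToString(u[:])
}

func main() {
	u, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(canonical(u))
	fmt.Println(compact(u))
}
```

Both strings carry identical data, so a client that only accepts one of them can convert by adding or stripping the dashes; the platform-specific byte-order differences mentioned above are a separate problem that this sketch does not address.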
Oops, I forgot to mention that, in order to get several important bug fixes out ASAP, I marked the 3.1 release without the UUID feature =) That's definitely very helpful, shmakes. Document ID generation was a minor headache for quite some time. The original reasoning behind using a 64-bit ID was that, according to my rough calculation, the chance of an ID collision is small enough to be ignored given the current capability of tiedot, which can handle around 10 million documents. Well, I certainly look forward to the day when tiedot handles 1bn+ docs, so it's definitely worth considering alternative doc ID formats... |
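For reference, the rough calculation mentioned above can be reproduced with the birthday bound. The sketch below assumes IDs are drawn uniformly at random from the whole ID space, which may not match tiedot's generator exactly; `collisionProbability` is a hypothetical helper, and 122 is the number of random bits in a version 4 UUID.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// uniformly random IDs from a space of 2^bits values coincide, using the
// birthday bound p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)).
func collisionProbability(n float64, bits int) float64 {
	space := math.Pow(2, float64(bits))
	// -Expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
	return -math.Expm1(-n * (n - 1) / (2 * space))
}

func main() {
	for _, n := range []float64{1e7, 1e9} {
		fmt.Printf("%.0e docs: 64-bit %.3g, 122-bit %.3g\n",
			n, collisionProbability(n, 64), collisionProbability(n, 122))
	}
}
```

Under that assumption, 10 million documents with 64-bit IDs give a collision chance of roughly 3 in a million, while 1 billion documents push it to a little under 3 percent, which is where wider IDs start to look attractive.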
I would like to add that including a device and process identifier, similar to MongoDB, will make managing multiple instances[1] of tiedot more painless and safer. [1] whether in a horizontal scaling, client-master or any other setup that involves aggregation at some stage. |
First of all I'd like to say +1 for UUIDs (v4 preferably). That being said, I would like to ask if this could be modified/hacked a bit to allow inserting, reading, updating and removing using custom string IDs. I have some collections where the objects need to retain their original IDs, which come from an external system/API. The IDs are strings (SHA1 hex strings if anyone is keeping score) and must be managed via this. Is there any chance of being able to supply our own string IDs at some point, or any tips for adding this functionality without completely forking tiedot? :/
Thanks :) |
@geoah Thanks for your feedback!
|
@HouzuoGuo Thank you for the clarification and help on this. If I may hijack the thread for just one more second, I came up with the following simple implementation. Thank you very much in advance! |
@geoah
After all of them are taken care of, the custom-ID operations will be truly atomic. Oh and one more suggestion - for even better performance, consider interacting with the custom-ID hash table directly without calling the query processor. |
Since the internal I'm trying to wrap my head around this but since the wrapper methods call the internal ones, when Should we move this discussion somewhere else so we don't pollute this thread? |
I wouldn't mind having the discussion here.
Since the three operations do not all happen at once, consider the possibility of two Nodes having identical custom ID, being stored at the same time:
In this case, the custom ID is indexed with two documents. The index mechanism will return both documents upon query, and if the document is to be deleted later on, it will have to be deleted twice - unless your Delete implementation works around it ;) |
A very (very very) quick and dirty solution should solve most of these issues. I know that I'm shooting performance in the head (twice) as everything is being locked instead of just locking a single data partition as you are doing in tiedot. It doesn't seem possible to
ps. Out of curiosity... If I understand your implementation correctly, one of the reasons for the high performance penalty of using string IDs would be the data partitioning and the ability to only lock a single partition at a time. Couldn't this be solved by using a rainbow style partitioning? Or am I totally wrong about this being a main reason for the performance penalty? Thanks once more. |
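The coarse locking idea discussed above can be sketched in Go as follows. This is only an illustration of the technique, not the code posted in the thread: `customIDStore` and its methods are hypothetical, and plain maps stand in for the tiedot collection and its custom-ID index. One mutex is held across the existence check, the insert and the index update, so two writers with the same custom ID cannot both succeed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrDuplicateID is returned when a custom ID is already indexed.
var ErrDuplicateID = errors.New("custom ID already exists")

// customIDStore is a hypothetical wrapper; the maps stand in for the
// collection and its custom-ID index.
type customIDStore struct {
	mu     sync.Mutex
	nextID int
	docs   map[int]string
	byCID  map[string]int
}

func newCustomIDStore() *customIDStore {
	return &customIDStore{docs: map[int]string{}, byCID: map[string]int{}}
}

// Insert holds the single lock across check, insert and index update.
func (s *customIDStore) Insert(customID, body string) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.byCID[customID]; exists {
		return 0, ErrDuplicateID
	}
	s.nextID++
	s.docs[s.nextID] = body
	s.byCID[customID] = s.nextID
	return s.nextID, nil
}

// Delete removes the document and its index entry under the same lock.
func (s *customIDStore) Delete(customID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	id, ok := s.byCID[customID]
	if !ok {
		return false
	}
	delete(s.docs, id)
	delete(s.byCID, customID)
	return true
}

// concurrentInserts starts n goroutines that all insert the same custom ID
// and returns how many of them succeeded.
func concurrentInserts(s *customIDStore, customID string, n int) int {
	var wg sync.WaitGroup
	errs := make(chan error, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := s.Insert(customID, "node")
			errs <- err
		}()
	}
	wg.Wait()
	close(errs)
	succeeded := 0
	for err := range errs {
		if err == nil {
			succeeded++
		}
	}
	return succeeded
}

func main() {
	s := newCustomIDStore()
	fmt.Println("successful inserts:", concurrentInserts(s, "da39a3ee", 8))
}
```

The price is the one already acknowledged above: every writer serializes on one lock instead of a per-partition lock.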
pps. It's kinda late so I might not be making much sense... by "rainbow tables" I actually was referring to the way rainbow tables are sometimes (I think?) stored. The idea would be to have multiple partitions again but this time split it using leading characters instead of |
Thanks for the compliment. Your locking approach is definitely on the right track. And the hash table implementation certainly does not care whether the input key is a hashed result or not; it can store arbitrary integer -> integer associations. Therefore, looking up the currently used integer ID costs just one Get operation on the hash table. Also, knowing that the ID is supposed to be unique, the result of the hash table's Get operation (which is the document's physical location) can be used to fetch the document without having to check for a hash collision on the document ID. A string ID, by contrast, will have to be hashed before being put into the hash table. Next, to guard against hash collisions, every GetDocByCustomID will have to JSON-deserialise the document content and compare the stored ID. The hash table will also become less efficient, as the ID lookup must return all results (if there are multiple) for an ID hash because of potential collisions. |
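The extra work described above for string IDs can be sketched like this. All names are hypothetical and the structures are simplified stand-ins rather than tiedot's hash table: the index maps an ID hash to every candidate location, and the lookup must open each candidate and compare the stored ID. In tiedot that comparison would require deserialising the JSON document; here a struct field plays that role.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// storedDoc stands in for a JSON document that carries its own custom ID.
type storedDoc struct {
	CustomID string
	Body     string
}

// stringIDIndex maps an ID hash to all candidate document locations.
type stringIDIndex struct {
	hash    func(string) uint64
	buckets map[uint64][]int
	docs    []storedDoc // slice index plays the role of physical location
}

func newStringIDIndex(hash func(string) uint64) *stringIDIndex {
	return &stringIDIndex{hash: hash, buckets: map[uint64][]int{}}
}

// fnvHash hashes a string ID with 64-bit FNV-1a.
func fnvHash(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// Put stores the document and indexes its location under the ID hash.
func (ix *stringIDIndex) Put(d storedDoc) int {
	loc := len(ix.docs)
	ix.docs = append(ix.docs, d)
	h := ix.hash(d.CustomID)
	ix.buckets[h] = append(ix.buckets[h], loc)
	return loc
}

// Get walks every candidate for the hash and verifies the stored ID,
// because two different strings may share one hash value.
func (ix *stringIDIndex) Get(customID string) (storedDoc, bool) {
	for _, loc := range ix.buckets[ix.hash(customID)] {
		if ix.docs[loc].CustomID == customID {
			return ix.docs[loc], true
		}
	}
	return storedDoc{}, false
}

func main() {
	// A constant hash forces every ID into one bucket to show the check.
	ix := newStringIDIndex(func(string) uint64 { return 0 })
	ix.Put(storedDoc{CustomID: "a", Body: "first"})
	ix.Put(storedDoc{CustomID: "b", Body: "second"})
	d, ok := ix.Get("b")
	fmt.Println(d.Body, ok)
	fmt.Println(fnvHash("a") != fnvHash("b"))
}
```

With integer IDs neither the hashing step nor the verification loop is needed, which is the efficiency argument made above.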
I'd like to throw in my suggestion in case tiedot for some reason moves away from integer IDs to strings. We could use something with some more functionality in it. Being able to parse certain info right out of the ID could be useful IMO (such as created date). See ObjectId http://godoc.org/labix.org/v2/mgo/bson for how this can be accomplished. |
Is there any reason, apart from space requirements, that
the generated IDs are not UUIDs? That would make it easier to
synchronize different tiedot databases.