-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support maps and sets that preserve insertion order #11
Comments
For the Insertion-order Set, you can wrap a map. Here's an example of implementing ImSet in 46 lines of Java. Instead of wrapping another set like that example does, wrap a class InsertionOrderSet<E> implements ImSet<E> {
private final ImMap<E,Integer> inner = PersistentHashMap.empty();
private int index = 0;
InsertionOrderSet(ImMap<E,Integer> m, int i) { inner = m; index = i; }
@Override public ImSet<E> put(E e) {
int j = i + 1;
return new InsertionOrderSet(inner.put(e, j), j)
}
@Override public UnmodIterator<E> iterator() {
return inner.toImSortedSet((a, b) -> a.getValue() - b.getValue())
.map(entry -> entry.getKey())
.toImList()
.iterator()
} For the map, do something similar. Still wrapping a map, but the inner map should be of type If you don't need it immutable, there's always java.util.LinkedHashSet In my own work, if I want things sorted, I use a SortedMap or SortedSet. I seem to always want everything alphabetized or in another order that can be calculated after the fact. If I want to preserve whatever order things came in, I use a List. I'm not running off to implement this as part of Paguro (UncleJim) because:
I think a paper describing a great implementation of a highly efficient, immutable, insertion-order map would be the most likely thing to change my mind. Demonstrating a need could help convince me too. That's my first reaction, but I'll think about it. This Serialization thing took a long time to do. I'm also renaming the project, then I'd like to get back to the RRB-Tree, then get a Java 7 version out. Even if I felt strongly that this belonged in this project, it would probably start at priority 4 at the highest. I started this project 2-3 years ago with the idea of building a language on top of it, so I'm eager to move on to that at some point too. |
Just to be clear, the idea is to always use insertion-ordered maps instead of unsorted maps, in order to provide a more deterministic user experience. I believe this is why ES6 only has insertion-ordered maps (after all it comes at a cost). https://plus.google.com/+FrankieSardo/posts/1mHL85NgX4z provides some more rationale. Sorting on every iteration, or maintaining a backing sorted map, seems expensive. Some of the other Java persistent collection libraries already do (e.g. https://github.com/anjbur/java-immutable-collections/tree/master/src/main/java/org/javimmutable/collections/inorder) or plan to (e.g. usethesource/capsule#5) provide insertion-ordered maps. No worries if you don't consider this worth the effort. I just thought I'd voice my interest in this feature. |
Change Log: Here's the branch you can pull down and build to try out: IMPORTANT: It looks like javac 1.8.0_31 will NOT build this code, but 1.8.0.91 will. I'm still very ambivalent about whether to merge them or not. I agree they are useful and less surprising than the behavior of hash maps. But is that worth the performance hit? Worth a bigger jar file? I don't know. We need to actually measure the relative performance. I like to measure many sizes in multiples of 10, creation and iteration using JMH (Java Microbenchmarking Harness). But before that, we probably need Mutable versions for speed. For some operations it's fair to make comparisons against building PersistentHash(Map/Set)s and assume similar mutable ratios, but I need to think about that more. I'm thinking about splitting Paguro into Paguro-Core and Paguro-Extras. That way people who want an incredibly tiny jar, get an even smaller one, and those who want more features, get their features. It's a no-brainer to include this in Paguro-Extras. I'm not promising that today, but it's a thought for the future. Regarding your helpful links:NOTE: I redacted some of this in a later post PersistentHashSet is I think all immutable solutions use a PersistentTreeMap to track insertion order (WRONG). Everyone else's solution updates I like converting it at the last moment for a few reasons:
Footnotes:[1] Proof: if you're given elements in sorted order and you don't re-balance the binary tree, you'd end up with the internal structure of a linked list which is O(n) lookup instead of O(log2n). Each node of the ideal tree should have half the sub-nodes on either side of it. Therefore, ideal insertion order which yields the ideal tree without any re-balancing is the same as a breadth-first traversal of the completed balanced tree. One example of such ordering (for a tree of size n):
Random order would vary be between worst-case and ideal, but that's still better than guaranteed worst case! |
In Clojure, small HashMaps preserve insertion order (it uses an ArrayMap for up to 8 or 9 items): Iterating through 8 items in an array happens as quickly as any data structure on the JVM. Maybe that's all you really need? Or do you really need big sorted HashMaps? |
Small maps will be the common case, but I need to maintain order in any case. |
I re-read Frankie Sardo's post a few more times. I think the argument that the insertion order map has a deterministic iteration order and is therefore repeatable as it grows is a strong one. Though I will point out that for a given map, the insertion order does not change. It only changes when you add or remove items. His idea of storing a pointer to the key instead of the node is quite clever. I missed that detail the first few times through. So he's not creating a TreeSet. I still think calling it "Amortized Constant Time" is a stretch, but I am ready to agree that his solution leaves the Big O characteristics of the Clojure HashMap unchanged. That's probably better than my solution if you ever iterate through the HashMap. Preserving insertion order is a good idea. I'm glad you proposed it! Is it worth a 2.5x slow-down? That's where things get murky. My rule of thumb is that less-than-15% is a very reasonable price for simplicity, but 2x is too expensive. I'm not ready to replace Map and Set with insertion-order versions at this time. A more efficient algorithm would go a long way toward changing my mind. You are welcome to use the code I provided under the Eclipse 1.0 license. I mean, you can use it under the Apache 2.0 license with the understanding that it calls Eclipse 1.0 code (originally from Clojure), so unless you plan to rewrite that part, you'd be bound by the terms of both license agreements. You can even use it in proprietary apps, without giving anything back besides attribution. Did I mention 100% test coverage? My consideration of whether to include it permanently in this project should not be holding you up anymore. If you find bugs, I'll probably patch them the following weekend. I'm still leaning toward putting this in a Paguro-Extras package. Then I'd get a sense of how many people would choose to pay the performance price for it. That said, I started looking at RRB-Tree again this weekend. I think that's a higher priority for me for no other reason than wanting to finish what I started. Also, I'm really excited about it potentially replacing the PersistentVector. Not sure which is a higher priority for you... |
OK, I came up with a different implementation that uses java.util.TreeMap for speed inside the iterator. Iterators are inherently mutable and not thread safe, so I might as well use the fastest sorted collection. It also only uses the temporary treeset when the wrapped hashmap iterator doesn't return things in the order we want: In theory, this would only incur a portion of the O(log2 n) of java.util.TreeMap. I'm guessing if the first item could be anywhere in the wrapped hashmap, but it will average showing up after iterating about half-way through. The next few calls to next() will be equally likely to either grow or shrink the temporary treemap, so it probably remains at the size of half the un-iterated portion of the wrapped hashmap. But the temporary tree map shrinks linearly after that. So I think we only incur an average O(log2 (n/2)) at the beginning of the iteration, O(log2 (n/4)) half-way through the iteration, O(log2 (n/8)) three quarters of the way through, etc. I think it's still log2, but generally for small values of n. Iteration Benchmarks: Goes from roughly 3x slower than a PersistentHashMap at 10 items to 10x slower at 100K items. This is better than the first pass which went from 15x slower at 10 items to over 30x at 100K items. I need to time Sardo's code and look at this again when I've got more time. |
Thanks for all your work on this. I'll definitely give it a shot.
Given that I consider Paguro a very general-purpose collection library, I probably wouldn't make insertion-ordered sets/maps the default, but would offer them as an option. I think they warrant a (say) 15% increase in Jar size, but Paguro-Extras would also work for me.
Makes total sense. I'm happy to use either! |
RRB Tree is done, but converting to Kotlin and whatever I'm doing with Kategory has crept in at a higher priority. It still probably makes sense to add insertion-order maps at some point. I think about this request probably every week, but not ready to commit yet. I keep thinking that a linked or doubly-linked list is the right data structure to manage the ordering. Sardo's explanation link is now broken, so I can't go back and look at his theory. Just his implementation. If you have a copy of his theory, that could help. Anyway, I feel a little bad asking you to do any work when I'm not planning to implement this for some months. Since there are 2 data structures involved, I don't think it will affect the performance much to implement the slower one as a wrapper around the faster one. I've got to get through the Kotlin conversion first... I'll tentatively put this in for version 4.2. |
Would be great to have maps and sets that preserve insertion order, a la https://github.com/frankiesardo/linked or https://github.com/amalloy/ordered.
My main motivation for maintaining insertion order is to have/provide a more deterministic user experience (cf. ES6 maps).
The text was updated successfully, but these errors were encountered: