Looking for any pointers/advice/best practices for my use case:
Large Annoy index (100 GB+), high-frequency lookups, best or near-best accuracy required (wherever diminishing returns start to show, I guess, which right now seems to be around search_k=30_000 for 10M items, each with 3500 components).
Essentially, I need to look up every item in the index sequentially, non-stop, as fast as possible. At my desired search_k value, the performance hit is starting to hurt.
Side question: If I were to build another Annoy index with as many trees as I can fit in memory/on disk, would this significantly reduce the search_k I need to get similar results? Edit: answer: possibly yes, at least a bit.
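One way to put numbers on the trees vs. search_k trade-off is to measure recall on a sample of items against a reference query run with a much larger search_k. Below is a minimal sketch, not a definitive benchmark: `recall` and `sweep_search_k` are hypothetical helper names (not part of Annoy), and `index` is assumed to be a loaded `AnnoyIndex`, or anything else exposing `get_nns_by_item(i, n, search_k=...)`. For context, the Annoy README states that search_k defaults to n * n_trees when it is not given.

```python
def recall(approx, reference):
    """Fraction of the reference neighbours that also appear in approx."""
    reference = set(reference)
    if not reference:
        return 1.0
    return len(reference.intersection(approx)) / len(reference)


def sweep_search_k(index, sample_ids, n, search_ks, reference_search_k):
    """Mean recall for each search_k, relative to a high-search_k reference.

    The reference is itself approximate, so this shows where returns
    diminish rather than measuring true accuracy.
    """
    # Query the reference once per sampled item.
    references = {
        i: index.get_nns_by_item(i, n, search_k=reference_search_k)
        for i in sample_ids
    }
    results = {}
    for sk in search_ks:
        total = 0.0
        for i in sample_ids:
            approx = index.get_nns_by_item(i, n, search_k=sk)
            total += recall(approx, references[i])
        results[sk] = total / len(sample_ids)
    return results


# Hypothetical usage with a real index (path and metric are assumptions):
#   from annoy import AnnoyIndex
#   index = AnnoyIndex(3500, "angular")
#   index.load("items.ann")
#   print(sweep_search_k(index, range(0, 1000), 10,
#                        [1_000, 5_000, 30_000], 200_000))
```

Running the same sweep against indexes built with different tree counts, and comparing the smallest search_k that reaches the target recall, would show whether the extra trees pay off for this data.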
NOTE: currently, multiprocessing across two-thirds of my cores appears to be fastest, which I suspect is due to I/O wait times.
Tertiary question: Is a shared-memory approach, with one index in memory and many processes accessing it, achievable or useful?
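On the shared-memory question: Annoy's `load()` memory-maps the index file, so several processes that load the same file should share the same pages through the OS page cache instead of each holding a private copy. The sketch below assumes that setup. The path, metric, neighbour count, and the helper names (`chunk_ranges`, `_init_worker`, `_query_range`, `run`) are all hypothetical, and an index larger than RAM will still incur page-fault I/O.

```python
import multiprocessing as mp

INDEX_PATH = "items.ann"   # hypothetical path to the saved index
DIM = 3500                 # vector length from the post
METRIC = "angular"         # assumption: use whatever the index was built with
N_NEIGHBOURS = 10          # assumption: neighbours wanted per item
SEARCH_K = 30_000          # value from the post

_index = None  # one handle per worker process, set by _init_worker


def chunk_ranges(n_items, n_chunks):
    """Split range(n_items) into at most n_chunks contiguous (start, stop) pairs."""
    size, extra = divmod(n_items, n_chunks)
    ranges, start = [], 0
    for c in range(n_chunks):
        stop = start + size + (1 if c < extra else 0)
        if stop > start:  # skip empty chunks when n_items < n_chunks
            ranges.append((start, stop))
        start = stop
    return ranges


def _init_worker():
    """Runs once per worker: open the memory-mapped index."""
    global _index
    from annoy import AnnoyIndex  # imported lazily, only in the workers
    _index = AnnoyIndex(DIM, METRIC)
    _index.load(INDEX_PATH)


def _query_range(bounds):
    """Look up the neighbours of every item id in [start, stop)."""
    start, stop = bounds
    return [
        (i, _index.get_nns_by_item(i, N_NEIGHBOURS, search_k=SEARCH_K))
        for i in range(start, stop)
    ]


def run(n_items, n_workers):
    """Yield (item_id, neighbour_ids) pairs, in no particular order."""
    # More chunks than workers, so a slow chunk does not stall the pool.
    chunks = chunk_ranges(n_items, n_workers * 8)
    with mp.Pool(n_workers, initializer=_init_worker) as pool:
        for rows in pool.imap_unordered(_query_range, chunks):
            yield from rows
```

Contiguous id ranges are used only to keep the bookkeeping simple; whether they help locality in the file would need measuring.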
Quaternary question: Is there a fastest metric? Edit: answer: yes. In my case hamming turned out to be fastest and, counterintuitively, the most accurate by a keyword-based metric (although if I normalize my continuous vectors beforehand, which should break hamming, the build time itself seems to increase drastically).
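Hamming distance is defined on binary vectors, so with continuous data it may be safer to binarize explicitly before adding items to a hamming index, which keeps the thresholding under your control. A minimal sketch in plain Python: `binarize` and `hamming` are hypothetical helpers, and the threshold of 0.0 is an assumption that only makes sense if the components are roughly centred on zero.

```python
def binarize(vector, threshold=0.0):
    """Map each component to 1 if it exceeds the threshold, else 0."""
    return [1 if x > threshold else 0 for x in vector]


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

Comparing the neighbours returned for the binarized vectors against the ones returned for the raw vectors would show how much the binarization step, rather than the metric itself, accounts for the accuracy difference.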