Making it go fast for high volume queries #668

FlimFlamm · 2024-08-19T00:22:34Z

Looking for any pointers/advice/best practices for my use case:

Large Annoy index (100 GB+), high-frequency lookups, best or near-best accuracy required (wherever diminishing returns start to show, I guess, which right now seems to be around search_k=30_000 for 10M items, each with 3500 components).

Essentially, I need to look up every item in the tree sequentially, non-stop, as fast as possible. At my desired search_k value, the performance hit is starting to hurt.

Side question: If I were to build another Annoy index with as many trees as I can fit in memory/disk, would this significantly reduce the search_k I need to get similar results? Edit: answer: possibly yes; at least a bit.
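One rough way to check the trees-versus-search_k trade-off is to measure how much of a reference result each search_k setting recovers, on the same sample of items for both indexes. The sketch below is only an illustration: `recall_at_k`, `sweep`, `sample_ids`, and `reference_search_k` are names made up here, and the "reference" is just the same index queried with a much larger search_k, which is a proxy for the true neighbors rather than an exact ground truth. The only Annoy call it relies on is `get_nns_by_item(i, n, search_k=...)`.

```python
def recall_at_k(index, sample_ids, k, search_k, reference_search_k):
    """Mean fraction of reference neighbours recovered at a given search_k.

    `index` is assumed to be a loaded AnnoyIndex (anything exposing
    get_nns_by_item(i, n, search_k=...) works). The reference set is the
    same index queried with a much larger search_k, so it is a proxy,
    not exact ground truth.
    """
    total = 0.0
    for i in sample_ids:
        approx = set(index.get_nns_by_item(i, k, search_k=search_k))
        reference = set(index.get_nns_by_item(i, k, search_k=reference_search_k))
        total += len(approx & reference) / k
    return total / len(sample_ids)


def sweep(index, sample_ids, k, search_k_values, reference_search_k):
    """Recall for each candidate search_k, to see where returns diminish."""
    return {
        s: recall_at_k(index, sample_ids, k, s, reference_search_k)
        for s in search_k_values
    }
```

Running `sweep` on a few thousand random item IDs against each index (fewer trees versus more trees) should show whether the index with more trees reaches the same recall at a lower search_k.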

NOTE: currently, multiprocessing across two-thirds of my cores appears to be fastest, which I suspect is due to I/O wait times.

Tertiary question: Is a shared memory approach with one tree in memory and many processes accessing it achievable or useful?
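For context on the shared-memory question: Annoy memory-maps the index file on `load()`, so several processes that load the same file read the same pages through the OS page cache instead of each holding a private copy. A minimal sketch of that pattern follows, assuming an index file already saved to disk. The helper names (`make_ranges`, `query_all`, `_init_worker`, `_query_range`) and the choice of 8 chunks per worker are illustrative assumptions, not anything taken from Annoy itself.

```python
from multiprocessing import Pool

_index = None  # per-process index handle, set once by _init_worker


def _init_worker(path, dim, metric):
    """Load the index once per worker process."""
    global _index
    from annoy import AnnoyIndex  # imported lazily, inside each worker
    _index = AnnoyIndex(dim, metric)
    _index.load(path)  # mmaps the file; pages are shared via the page cache


def _query_range(task):
    """Look up neighbours for every item ID in [start, stop)."""
    start, stop, k, search_k = task
    return [
        (i, _index.get_nns_by_item(i, k, search_k=search_k))
        for i in range(start, stop)
    ]


def make_ranges(n_items, n_chunks):
    """Split range(n_items) into contiguous, near-equal (start, stop) pairs."""
    base, extra = divmod(n_items, n_chunks)
    ranges, start = [], 0
    for c in range(n_chunks):
        stop = start + base + (1 if c < extra else 0)
        if stop > start:  # skip empty chunks when n_chunks > n_items
            ranges.append((start, stop))
        start = stop
    return ranges


def query_all(path, dim, metric, n_items, k, search_k, n_workers):
    """Yield (item_id, neighbour_ids) for every item, in arbitrary order."""
    # Several chunks per worker (8 is an arbitrary choice) to even out load.
    tasks = [(a, b, k, search_k) for a, b in make_ranges(n_items, n_workers * 8)]
    with Pool(n_workers, initializer=_init_worker,
              initargs=(path, dim, metric)) as pool:
        for chunk in pool.imap_unordered(_query_range, tasks):
            yield from chunk
```

Contiguous ID ranges are used only to keep the task list small; whether they help or hurt page-cache locality for a 100 GB index is something to measure rather than assume.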

Quaternary question: Is there a fastest metric? Edit: answer: yes. In my case hamming turned out to be fastest and, counterintuitively, the most accurate by a keyword-based metric (although if I normalize my continuous vectors beforehand, which should break hamming, the build time itself seems to increase drastically).
