Optimizations in the case of a known, limited alphabet of characters? #345

@KGerhardt

Hello,

I've been applying vectorscan to a series of genomics tools that previously relied on regex searches run with the default engines of several languages. I'm curious whether vectorscan's performance could be increased even further if the alphabet of characters that can appear in a searched string is specified ahead of time. In genomic contexts, the alphabet is typically only 4 characters plus an unknown-sequence character for DNA and RNA, and 20 plus 1 for protein sequences, and we expect every string we search to contain only these characters.

Can additional performance gains be obtained when the search alphabet is this limited? If so, is this behavior reasonable to support, or is it outside the scope of vectorscan's development? Or is the benefit of a limited character set already implicit in the search graph built for patterns that contain only these characters?
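To make the last question concrete, here is a minimal Python sketch of what "already implicit" could mean. It is not vectorscan code and makes no claim about vectorscan's internals; the helper name `byte_classes` and the example character sets are hypothetical. It shows one general technique a matcher can use: grouping the 256 possible byte values into equivalence classes, so that every byte the patterns never mention collapses into a single "other" class.

```python
from collections import defaultdict


def byte_classes(char_sets):
    """Group the 256 byte values into equivalence classes.

    Two bytes fall into the same class when every set in `char_sets`
    contains either both of them or neither of them, so no pattern built
    from those sets can tell the two bytes apart.
    """
    groups = defaultdict(list)
    for b in range(256):
        # The membership signature identifies which sets accept this byte.
        signature = tuple(b in s for s in char_sets)
        groups[signature].append(b)
    return list(groups.values())


# Hypothetical character sets, as might come from a pattern like "ACG[TN]".
sets = [set(b"A"), set(b"C"), set(b"G"), set(b"TN")]
classes = byte_classes(sets)

# The effective alphabet is the number of classes, not 256.
print(len(classes))
```

Under this assumption, a pattern set written purely over a DNA alphabet would already yield a small effective alphabet without the user declaring one, which is the behavior the question asks about.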

Any additional understanding and support you can give me is appreciated,

Kenji

Labels: enhancement (New feature or request), wishlist (Something that would be nice to have but not a priority)
