Optimizations in the case of a known, limited alphabet of characters?

Hello,

I've been applying vectorscan to a series of genomics tools that previously ran their regex searches with the default engines of several languages. I'm curious whether vectorscan's performance can be increased even further if the alphabet of characters that can appear in a searched string is specified ahead of time. In genomic contexts, the alphabet is typically only 4 characters plus an unknown-sequence character for DNA and RNA, and 20 + 1 for protein sequences, and we expect every string we search to contain only these characters.
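To make the alphabets concrete, here is a minimal Python sketch of what I mean. The specific letters (`N` and `X` for unknowns, the one-letter amino acid codes) are assumptions based on the usual IUPAC conventions, and the helper names `uses_only` and `alphabet_byte_mask` are hypothetical; none of this is part of vectorscan's API.

```python
# Alphabets as described above. The exact letters are an assumption
# (common IUPAC one-letter codes); they are not defined by vectorscan.
DNA_ALPHABET = frozenset("ACGTN")                      # 4 bases + N (unknown)
RNA_ALPHABET = frozenset("ACGUN")                      # 4 bases + N (unknown)
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWYX")  # 20 residues + X (unknown)


def uses_only(sequence, alphabet):
    """Return True if every character of `sequence` is in `alphabet`."""
    return alphabet.issuperset(sequence)


def alphabet_byte_mask(alphabet):
    """Build a 256-entry table with 1 at each byte value in `alphabet`.

    This only illustrates how small the reachable part of the byte
    space is; it is not something vectorscan is known to consume.
    """
    table = bytearray(256)
    for ch in alphabet:
        table[ord(ch)] = 1
    return bytes(table)
```

For DNA, only 5 of the 256 possible byte values can ever occur in the input, which is the property I am hoping could be exploited.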

Can additional performance gains be obtained when the search alphabet is this limited? If so, is this behavior reasonable to support, or is it outside the scope of vectorscan's development? Or is the benefit of a limited character set already implicit when the search graph is built from patterns that contain only these characters?

Any additional understanding and support you can give me is appreciated,

Kenji

Optimizations in the case of a known, limited alphabet of characters? #345