Description
Hello,
I've been applying vectorscan to a series of genomics tools that previously relied on regex searches implemented with the default engines of several languages. I'm curious whether vectorscan's performance could be increased even further if the alphabet of characters that can appear in a searched string is specified ahead of time. In genomic contexts, the alphabet is typically only 4 characters plus an unknown-sequence character for DNA and RNA, and 20 plus 1 for protein sequences, and we anticipate that every string to be searched will contain only these characters.
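For concreteness, here is a minimal Python sketch of the alphabets described above. The specific letters are assumptions based on common IUPAC-style conventions (`N` as the unknown nucleotide, `X` as the unknown amino acid), and `within_alphabet` is a hypothetical helper for illustration only. None of this is part of vectorscan's API.

```python
# Assumed alphabets: 4 bases + 1 unknown symbol for DNA/RNA,
# 20 amino acids + 1 unknown symbol for proteins.
DNA_ALPHABET = frozenset("ACGTN")
RNA_ALPHABET = frozenset("ACGUN")
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWYX")


def within_alphabet(sequence: str, alphabet: frozenset) -> bool:
    """Return True if every character of `sequence` belongs to `alphabet`."""
    return alphabet.issuperset(sequence)
```

A check like this could confirm, before scanning, that the inputs really do stay within the 5- or 21-symbol alphabet that the question assumes.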
Can additional performance gains be obtained when the search alphabet is this limited? If so, is this behavior reasonable to support, or is it outside the scope of vectorscan's development? Or is the benefit of a limited character set already implicit when the search graph is built for patterns that contain only these characters?
Any additional understanding and support you can give me is appreciated,
Kenji