A Python implementation of the Winnowing (local algorithms for document fingerprinting)
The original research paper can be found at http://dl.acm.org/citation.cfm?id=872770.
You may install winnowing
package via pip
as follows:
pip install winnowing
Alternatively, you may also install the package by cloning this repository.
git clone https://github.com/suminb/winnowing.git cd winnowing && python setup.py install
>>> from winnowing import winnow
>>> winnow('A do run run run, a do run run')
set([(5, 23942), (14, 2887), (2, 1966), (9, 23942), (20, 1966)])
>>> winnow('run run')
set([(0, 23942)]) # match found!
Quite honestly, I did not know what hash function to use. The paper did not talk about it. So I decided to use a part of SHA-1; more precisely, the last 16 bits of the digest.
You may use your own hash function as demonstrated below.
def hash_md5(text):
import hashlib
hs = hashlib.md5(text)
hs = hs.hexdigest()
hs = int(hs, 16)
return hs
# Override the hash function
winnow.hash_function = hash_md5
winnow('The cake was a lie')
(TODO: Write this section)