diff --git a/README.md b/README.md
index 8a28f3b..e7623ef 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,68 @@ DOM-aware tokenization for Hugging Face language models.
+## What?
+
+Natural language tokenizers (tokenization schemes) are designed to
+a) group particles of meaning together, and
+b) discard or hide unimportant details,
+so that models consuming sequences of token IDs are presented with
+what they need in a form they can most easily derive meaning from.
+[In theory, models could consume raw streams of UTF-8 bytes, but then
+the model has to learn everything the tokenizer would otherwise do,
+consuming resources (layers/neurons/parameters) and potentially
+vastly extending training time.]
+
+For example, tokenizers aimed at languages that delimit words with
+whitespace generally have features to discard or embed whitespace in
+their output, so the model consuming the tokens does not need to
+care about it.
+
+This shiz aims to do the same, but for HTML, such that:
+
+> <html><head><meta http-equiv="utf-8"></meta>abcdefghijkl
+
+becomes:
+
+> <
+> html
+> >
+> <
+> head
+> >
+> <
+> meta
+> _
+> http
+> equiv
+> =
+> utf
+> 8
+> >
+> </
+> meta
+> >
+> a
+> b
+> c
+> d
+> e
+> f
+> g
+> h
+> i
+> j
+> k
+> l
+
+Tokenizers intended for generation need to be able to decode
+reversibly, but generation isn't a goal (for me, for now at least),
+so this tokenizer will discard some of its input in order to better
+distil the meaning of what it's looking at.
+
 ## Installation
 
 ### With PIP
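
For comparison with the whitespace example in the new "What?" section, here is a minimal sketch of how an off-the-shelf byte-level tokenizer folds whitespace into its tokens. It assumes the `transformers` package and the stock GPT-2 tokenizer, neither of which is part of this project; it is an illustration of existing behaviour, not this project's code.

```python
# Illustration only: a stock byte-level BPE tokenizer (GPT-2, via the
# `transformers` package) embeds whitespace into its tokens, so the model
# never has to track spaces as separate symbols.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# The space is folded into the following token as a leading "Ġ".
print(tok.tokenize("hello world"))  # ['hello', 'Ġworld']

# The same tokenizer has no awareness of markup structure: tags are split
# purely on byte-pair statistics, not on the DOM.
print(tok.tokenize('<html><head><meta http-equiv="utf-8">'))
```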
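And a rough conceptual sketch of the DOM-aware splitting described in the "What?" section, using only the standard library's `html.parser`. This is not this project's implementation, and its output differs from the example token stream above (for instance, attribute names are not split on `-`); it only shows the general idea of emitting tag structure as discrete tokens while discarding insignificant whitespace.

```python
# Conceptual sketch, not this project's tokenizer: emit DOM structure as
# discrete tokens and drop whitespace-only text, using html.parser.
from html.parser import HTMLParser


class SketchTokenizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens += ["<", tag]
        for name, value in attrs:
            # "_" marks the start of an attribute, loosely mirroring the
            # example token stream above.
            self.tokens += ["_", name]
            if value is not None:
                self.tokens += ["=", value]
        self.tokens.append(">")

    def handle_endtag(self, tag):
        self.tokens += ["</", tag, ">"]

    def handle_data(self, data):
        # Keep character data, discard whitespace-only runs.
        text = data.strip()
        if text:
            self.tokens.append(text)


parser = SketchTokenizer()
parser.feed('<html><head><meta http-equiv="utf-8"></head><body>abc</body></html>')
print(parser.tokens)
# ['<', 'html', '>', '<', 'head', '>', '<', 'meta', '_', 'http-equiv',
#  '=', 'utf-8', '>', '</', 'head', '>', '<', 'body', '>', 'abc',
#  '</', 'body', '>', '</', 'html', '>']
```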