From f6849a529efc422b1069e329bfb3bbd48c45b8e2 Mon Sep 17 00:00:00 2001 From: Gary Benson Date: Sat, 1 Jun 2024 17:40:18 +0100 Subject: [PATCH] *** docs --- README.md | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/README.md b/README.md index 8b3e939..1a41156 100644 --- a/README.md +++ b/README.md @@ -31,6 +31,51 @@ Output: ![<](https://gbenson.github.io/dt/ccbfee/3C.svg)![html](https://gbenson.github.io/dt/beedc6/html.svg)![>](https://gbenson.github.io/dt/f6d9ab/3E.svg)![<](https://gbenson.github.io/dt/f4aeb1/3C.svg)![head](https://gbenson.github.io/dt/a4dcf3/head.svg)![>](https://gbenson.github.io/dt/ccbfee/3E.svg)![<](https://gbenson.github.io/dt/beedc6/3C.svg)![meta](https://gbenson.github.io/dt/f6d9ab/meta.svg)![_](https://gbenson.github.io/dt/f4aeb1/5F.svg)![http](https://gbenson.github.io/dt/a4dcf3/http.svg)![equiv](https://gbenson.github.io/dt/ccbfee/equiv.svg)![=](https://gbenson.github.io/dt/beedc6/3D.svg)![content](https://gbenson.github.io/dt/f6d9ab/content.svg)![type](https://gbenson.github.io/dt/f4aeb1/type.svg)![_](https://gbenson.github.io/dt/a4dcf3/5F.svg)![content](https://gbenson.github.io/dt/ccbfee/content.svg)![=](https://gbenson.github.io/dt/beedc6/3D.svg)![text](https://gbenson.github.io/dt/f6d9ab/text.svg)![html](https://gbenson.github.io/dt/f4aeb1/html.svg)![charset](https://gbenson.github.io/dt/a4dcf3/charset.svg)![UTF](https://gbenson.github.io/dt/ccbfee/UTF.svg)![8](https://gbenson.github.io/dt/beedc6/8.svg)![>](https://gbenson.github.io/dt/f6d9ab/3E.svg)![<](https://gbenson.github.io/dt/f4aeb1/3C.svg)![meta](https://gbenson.github.io/dt/a4dcf3/meta.svg)![_](https://gbenson.github.io/dt/ccbfee/5F.svg)![name](https://gbenson.github.io/dt/beedc6/name.svg)![=](https://gbenson.github.io/dt/f6d9ab/3D.svg)![viewport](https://gbenson.github.io/dt/f4aeb1/viewport.svg)![_](https://gbenson.github.io/dt/a4dcf3/5F.svg)![content](https://gbenson.github.io/dt/ccbfee/content.svg)![=](https://gbenson.github
.io/dt/beedc6/3D.svg)![width](https://gbenson.github.io/dt/f6d9ab/width.svg)![device](https://gbenson.github.io/dt/f4aeb1/device.svg)![width](https://gbenson.github.io/dt/a4dcf3/width.svg)![>](https://gbenson.github.io/dt/ccbfee/3E.svg)![<](https://gbenson.github.io/dt/beedc6/3C.svg)![title](https://gbenson.github.io/dt/f6d9ab/title.svg)![>](https://gbenson.github.io/dt/f4aeb1/3E.svg)![hello](https://gbenson.github.io/dt/a4dcf3/hello.svg)![world](https://gbenson.github.io/dt/ccbfee/world.svg)![](https://gbenson.github.io/dt/f4aeb1/3E.svg)![<](https://gbenson.github.io/dt/a4dcf3/3C.svg)![script](https://gbenson.github.io/dt/ccbfee/script.svg)![>](https://gbenson.github.io/dt/beedc6/3E.svg)![document](https://gbenson.github.io/dt/f6d9ab/document.svg)![getElementById](https://gbenson.github.io/dt/f4aeb1/getElementById.svg)![demo](https://gbenson.github.io/dt/a4dcf3/demo.svg)![innerHTML](https://gbenson.github.io/dt/ccbfee/innerHTML.svg)![Hello](https://gbenson.github.io/dt/beedc6/Hello.svg)![JavaScript](https://gbenson.github.io/dt/f6d9ab/JavaScript.svg)![](https://gbenson.github.io/dt/ccbfee/3E.svg)![...](https://gbenson.github.io/dt/ffffff/dotdotdot.svg)
 
+## Why?
+
+Natural language tokenizers are designed to
+(a) group particles of meaning together, and
+(b) discard unimportant details,
+so that models consuming sequences of token IDs
+are presented with what they need in a form they can
+most easily derive meaning from.
+
+In theory, models could consume streams of UTF-8 directly,
+but the model would then have to learn everything the tokenizer
+does, consuming resources (layers, neurons, parameters) and
+potentially vastly extending training time.
+
+For example, tokenizers aimed at languages that delimit words
+with whitespace generally have features to discard or embed
+whitespace in their output, so the consuming model does not
+need to care about it.
+
+This tokenizer aims to do a similar thing, but for HTML:
+whitespace is discarded, and
+tag names, attribute names and attribute values are tokenized
+along with the textual content of the document.
+
+Special tokens are inserted to give context: for example,
+start and end tags are wrapped in `<`, `</` and `>`,
+attribute names are preceded by `_`,
+and attribute values are preceded by `=`.
+
+## Limitations
+
+Tokenizers are usually able to operate in either direction:
+*encoding* natural language into sequences of token IDs
+for the model's input,
+and *decoding* sequences of token IDs generated by the model
+into natural language text.
+
+Generation isn't a goal for me, for now at least:
+I'm interested in extracting meaning,
+so this tokenizer will discard some of its input
+in order to better distil
+the meaning of what it's looking at.
+As a result, the original HTML cannot be recovered exactly
+from the token IDs it produces.
+
 ## Installation
 
 ### With PIP