
v0.4.67

@georg-jung georg-jung released this 11 Dec 00:31
  • The Tokenize methods are now called Encode as that better expresses what they do.
    • The old methods still exist for now as redirects, marked as Obsolete.
  • Add new CreateBatchEnumerator and CreateAsyncBatchEnumerator APIs that support encoding inputs longer than what fits in a single model input by splitting them into overlapping windows (overlap/stride).
  • Make the Encode overload that returns ReadOnlyMemory values reuse its internal buffers.
  • Use FrozenDictionary on .NET 8.
  • Add support for reading configuration from tokenizer.json files.
  • Add a LoadFromHuggingFaceAsync method to ease getting started.
  • Decode produces cleaner outputs (e.g. "hello, world" instead of "hello , world").
  • Produce results closer to those of Hugging Face (don't Unicode-normalize token lookup keys).
  • Greatly improve test coverage and thus correctness verification.
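
To illustrate the overlap/stride idea behind the new batch enumerator APIs, here is a minimal, language-agnostic sketch in Python. It is not this library's implementation or API: the function name `sliding_windows` and its parameters `max_len` and `overlap` are hypothetical, and the sketch ignores special tokens such as [CLS] and [SEP] that a real BERT encoder adds to every window. It only shows how a token sequence that exceeds one model input can be cut into windows that share a fixed number of tokens.

```python
from typing import Iterator, List


def sliding_windows(token_ids: List[int], max_len: int, overlap: int) -> Iterator[List[int]]:
    """Yield windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens (hypothetical helper,
    not part of any library).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return  # nothing to encode, so no windows

    step = max_len - overlap  # how far each window start advances
    start = 0
    while True:
        yield token_ids[start:start + max_len]
        if start + max_len >= len(token_ids):
            break  # this window already reached the end of the input
        start += step


# Ten tokens, windows of four, neighbours sharing one token.
windows = list(sliding_windows(list(range(10)), max_len=4, overlap=1))
print(windows)
```

With these inputs each window starts three tokens after the previous one, so the last token of one window reappears as the first token of the next. A larger overlap gives the model more shared context at window boundaries at the cost of more windows to process.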