Releases: LLukas22/llm-rs-python
GGML quantization update
Huggingface Hub integration in AutoModel
`AutoModel` can now automatically download GGML-converted models as well as normal Transformer models from the Huggingface Hub.
AutoConverter, AutoQuantizer and AutoModel
Added the ability to automatically convert any supported model from the Huggingface Hub via the `AutoConverter`. Models converted this way can easily be quantized or loaded via the `AutoQuantizer` or `AutoModel` without the need to specify the architecture.
Added quantization support
The ability to quantize models is now available for every architecture via `quantize`.
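As a rough illustration of the idea behind GGML-style quantization (a sketch in the spirit of 4-bit block quantization, not the library's actual implementation), each block of float weights is stored as one float scale plus small integers:

```python
# Illustrative sketch of block-wise 4-bit quantization: a block of floats
# becomes one f32 scale plus signed 4-bit integers in [-8, 7]. This mirrors
# the general idea of GGML's quantized formats, not their exact layout.

def quantize_block(values):
    """Quantize one block of floats to (scale, list of ints in [-8, 7])."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # avoid division by zero
    return scale, [max(-8, min(7, round(v / scale))) for v in values]

def dequantize_block(scale, quants):
    """Recover approximate floats from the scale and quantized values."""
    return [q * scale for q in quants]

scale, quants = quantize_block([1.0, -1.0, 0.5, 0.25])
restored = dequantize_block(scale, quants)  # close to, not equal to, the input
```

The lossy round-trip is the price paid for shrinking each weight from 32 bits to roughly 4 bits plus a shared scale per block.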
LoRA & MPT Support
Tokenization & GIL free generation
Added the `tokenize` and `decode` functions to each model to enable access to the internal tokenizer.
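The shape of this interface can be illustrated with a toy tokenizer (a whitespace tokenizer for illustration only; the wrapped models use their own real tokenizers):

```python
# Toy tokenizer illustrating the tokenize/decode interface shape:
# tokenize maps text to token ids, decode maps ids back to text.
class ToyTokenizer:
    def __init__(self, vocab):
        self.id_of = {word: i for i, word in enumerate(vocab)}
        self.word_of = {i: word for i, word in enumerate(vocab)}

    def tokenize(self, text):
        """Return a list of token ids for the given text."""
        return [self.id_of[word] for word in text.split()]

    def decode(self, ids):
        """Map token ids back to a string."""
        return " ".join(self.word_of[i] for i in ids)

tok = ToyTokenizer(["hello", "world"])
ids = tok.tokenize("hello world")
text = tok.decode(ids)
```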
The generation of tokens is now GIL-free, meaning other background threads can run at the same time.
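The practical benefit can be sketched with a background generation thread. Here `fake_generate` is a stand-in for a model's generation loop (the `time.sleep` call releases the GIL, playing the role the Rust inference loop now plays in the library; none of the names below are the library's API):

```python
import queue
import threading
import time

# Stand-in for a model's token generator. In the real library the Rust
# generation loop runs without holding the GIL; here time.sleep simulates
# that, letting other Python threads proceed during "inference".
def fake_generate(n_tokens):
    for i in range(n_tokens):
        time.sleep(0.01)  # "inference" happening outside the GIL
        yield f"tok{i}"

token_queue = queue.Queue()

def generation_worker():
    for token in fake_generate(5):
        token_queue.put(token)
    token_queue.put(None)  # sentinel: generation finished

thread = threading.Thread(target=generation_worker)
thread.start()

# The main thread stays responsive and streams tokens as they arrive.
received = []
while True:
    token = token_queue.get()
    if token is None:
        break
    received.append(token)
thread.join()
```

This producer/consumer pattern is useful for streaming tokens to a UI or web response while generation runs in the background.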
Support Multiple Model Architectures
Since `llama-rs` was renamed to `llm` and now supports multiple model architectures, this wrapper was also expanded to support the new trait system and library structure.
Supported architectures for now:
- Llama
- GPT2
- GPTJ
- GPT-NeoX
- Bloom
The loader was also reworked and now supports the mmap-able `ggjt` format. To support this, the `SessionConfig` was expanded with the `prefer_mmap` field.
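A sketch of what such a session configuration might look like on the Python side (the field names besides `prefer_mmap` are illustrative assumptions, not the library's exact API):

```python
from dataclasses import dataclass

# Sketch of an expanded session configuration. Only prefer_mmap is taken
# from the release notes; the other fields are hypothetical examples of
# typical session options.
@dataclass
class SessionConfig:
    threads: int = 8              # hypothetical: CPU threads for inference
    context_length: int = 2048    # hypothetical: prompt + generation window
    prefer_mmap: bool = True      # memory-map mmap-able ggjt files when possible

config = SessionConfig(prefer_mmap=False)  # force a full read instead of mmap
```

Memory-mapping lets the OS page model weights in lazily, so large ggjt files load faster and can be shared between processes.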
Added SessionConfig options for Model
0.0.2: Added SessionConfig
Basic Functionality
0.0.1: Update Cargo.toml