Add the F8E4M3 dtype for CUDA and CPU #2546

* Offset it * Freeze * Offset it * Offset it * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Remove debugs * Polish it up * Polish it up * Clippy * Remove test file * Add config for if neox * Fix bug * Fix bug * Cast cache type on rust side * Cast types * To dtype * Drop temp * Update casting * Update casting * Update casting * Create dtype in bf16 * Check type * Debug * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Debug * Debug * Debug * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Reseting * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Remove debug * Debug * Debug * Remove debug * Remove debug * Debug * Remove debug * Debug * Remove debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug 
* Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Try to use 3dim rotemb fused * Try to use 3dim rotemb fused * Remove contig and debug * Check handling * Cleanup * Fix * Remove prints * Lower block dim * Use fused layernorm * Pass batch size * Simplify internal API * Simplify internal API * Try slow * Try candle layer norm * Try candle layer norm * Fix dep of candle layer norm * Reshape input for rank 2 * Reshape input for rank 2 * Fix ref * Code style * Make dep optional * Ensure contig * Ensure contig * Ensure contig * Debug contig dmmv error * Debug contig dmmv error * Debug contig dmmv error * Debug contig dmmv error * Try other method * Try other method * Try other method * Try other method * Try other method * Use typestate to optimize * Use typestate to optimize * Fixes * Fixes * Fixes * Fixes * Fixes * Debug via using slow rmsnorm * Debug via using slow rope * Remove debug * More debugging * Remove debug * Remove debug * Remove debug * Add better error enum * Fix diff marker * Fix some things * Fix some things * Fix some things * Fix dummy backends * Re add from storage noop * Fix removed kvconcat custom op * Fix erroneous feature gate * Complete metal backend refactoring * Check if calling * Check if calling * Update default for force dmmv * Load atomic * Debug * Use mmvq * Update * Add the empty functions * Add rope new_partial function * Make variant of qmatmul pub * Make variant of qmatmul pub * Add the varbuilder set_device function * Only link stdc++ if target has msvc * Only link stdc++ if target has msvc * Only link stdc++ if target has msvc * Only link stdc++ if target has msvc * Handle case of device mapping * Handle case of device mapping * Add getter * Fix * Fix * Support nvcc flags in flash attn * Support nvcc flags in flash attn * Support nvcc flags in flash attn * Support nvcc flags in flash 
attn * Support nvcc flags in flash attn * Fixes * Fixes * Fix the tests * Fix the tests

* Support flash-attn in quantized phi3. (huggingface#2194) * Use flash-attn in gemma. (huggingface#2195) * Use flash-attn in gemma. * Fix flash-attn for head dim 256. * Remove candle-layer-norm --------- Co-authored-by: Laurent Mazare <[email protected]>

* Add unfold * Format

* Add the quantize_onto api * Take ref * Clippy * Format * Add error checking

* Use flash-attn in gemma. * Fix for the fast bf16 cublas gemm. * Fix some clippy lints. * Fix another lint. * Proper clippy fix.

* define structs * construct ResidualConvUnit * forward() for ResidualConvUnit * implement FeatureFusionBlock * implement Scratch * implement DPTHead * add identity module * implement forward for DPTHead * add get_intermediate_layers to DinoVisionTransformer * implement DepthAnythingV2 * some minor tweaks * fix compile errors * fix var builder prefixes * setup initial example * use fixed patch size of 37 (518 / 14) * debugged until output * print min and max values * add some dynamism to the output location * scale input image * extract prep function * extract output path function * normalize image with magic mean and std * add spectral coloring * squeeze in the right place * make interpolation optional * use bail instead of panic * omit unnecessary Shape call * remove empty curly braces * use bail instead of assert * use vb and pp * remove closures * extract config object * Apply rustfmt. * Fix some clippy lints. * More lints. * Use the array methods. --------- Co-authored-by: laurent <[email protected]>

* feat(gemm): implement Gemm operator in candle-onnx * feat(onnx): Add support for ArgMax operator in candle-onnx * Apply rustfmt. * Remove argmax as it was already present. --------- Co-authored-by: Laurent <[email protected]>

* Add: DINOv2Reg4 with PlantCLEF2024 weights and example ( See https://arxiv.org/abs/2309.16588 and https://zenodo.org/records/10848263 ) * Remove extra files + update README to download them + remove extra lines * minor fix (README remove extra spaces) * minor fix (README: Fix image url) * Modif: Add back interpolate_pos_encoding() + fix when no interpolation + remove extra comments + Update README ( source image changed and so the predictions ) * Fix: Improve code readability with '$ cargo clippy' and '$ cargo fmt' * Another clippy fix. --------- Co-authored-by: x-VEspit <[email protected]> Co-authored-by: laurent <[email protected]>

…e#2299)

* Add i32 dtype for cpu and cuda, with kernels * Fix cuda i32 * Fix cpu i32 * Add cuda map impls for i32 * Start to add to metal * Add the kernels * Oops * Fix dtype cast in safetensors * Oops * Oops * Add bf16 to i32 and vice versa casts

* Add the flux autoencoder. * Add the encoder down-blocks. * Upsampling in the decoder. * Sketch the flow matching model. * More flux model. * Add some of the positional embeddings. * Add the rope embeddings. * Add the sampling functions. * Add the flux example. * Fix the T5 bits. * Proper T5 tokenizer. * Clip encoder path fix. * Get the clip embeddings. * No configurable weights in layer norm. * More weights related fixes. * Yet another shape fix. * DType fix. * Fix a couple more shape issues. * DType fixes. * Fix the latent dims. * Fix more shape issues. * Autoencoder fixes. * Get some generations out. * Bugfix. * T5 padding. * Clippy fix. * Add the decode only mode. * Fix. * More fixes. * Finally get some generations to work. * Add readme.

* add models support and example for THUDM/glm-4 * fix the ci report * fmt * fix * Update README.org * Update README.org * fmt * Update README.org * README.md add codegeex4 * README.md add glm4 * Typo. * change expect into ? --------- Co-authored-by: Laurent Mazare <[email protected]>

* add mmdit of stable diffusion 3 lint add comments * correct a misplaced comment * fix cargo fmt * fix clippy error * use bail! instead of assert! * use get_on_dim in splitting qkv

* chore: changes from formatting on save * fix: usage of `actions/checkout@v2`

Also squeeze the first dimension of the codes tensor in the example file to get the expected three dimensions.

* Soft NMS with thresholds * NMS Test * Soft nms w/ boxes removed below threshold * Soft nms test * No longer removing bounding boxes to fit Soft-NMS focus * Initialize confidence * Added comments * Refactored out updating based on IOU/sigma * Score_threshold -> confidence_threshold for clarity * Remove bboxes below confidence threshold * Softnms basic functionality test * Softnms confidence decay test * Softnms confidence threshold test * Softnms no overlapping bbox test * Testing confidence after no overlap test * Single bbox and no bbox tests * Signify test completion * Handling result of test functions * Checking all pairs of bboxes instead of a forward pass * Equal confidence overlap test * Clarified tests for implementation * No longer dropping boxes, just setting to 0.0 * Formatted w/ cargo
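The Soft-NMS commits above describe decaying confidences by IOU and sigma, checking pairs of boxes, and setting low-confidence boxes to 0.0 rather than dropping them. A minimal, dependency-free sketch of that idea follows. The `BBox` struct, the function names, the Gaussian decay `exp(-iou^2 / sigma)`, and the single up-front sort are illustrative assumptions, not the code added by this PR.

```rust
// Hypothetical bounding box type for illustration only.
#[derive(Clone, Copy, Debug)]
struct BBox {
    xmin: f32,
    ymin: f32,
    xmax: f32,
    ymax: f32,
    confidence: f32,
}

// Intersection over union of two axis-aligned boxes.
fn iou(a: &BBox, b: &BBox) -> f32 {
    let w = (a.xmax.min(b.xmax) - a.xmin.max(b.xmin)).max(0.0);
    let h = (a.ymax.min(b.ymax) - a.ymin.max(b.ymin)).max(0.0);
    let inter = w * h;
    let area_a = (a.xmax - a.xmin) * (a.ymax - a.ymin);
    let area_b = (b.xmax - b.xmin) * (b.ymax - b.ymin);
    let union = area_a + area_b - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

// Simplified Soft-NMS: sort once by confidence, then let every kept box
// decay the confidence of each later box with a Gaussian penalty.
// Boxes that fall below the threshold are zeroed, not removed.
fn soft_nms(boxes: &mut [BBox], sigma: f32, confidence_threshold: f32) {
    boxes.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
    for i in 0..boxes.len() {
        if boxes[i].confidence < confidence_threshold {
            boxes[i].confidence = 0.0;
            continue;
        }
        for j in (i + 1)..boxes.len() {
            let overlap = iou(&boxes[i], &boxes[j]);
            boxes[j].confidence *= (-(overlap * overlap) / sigma).exp();
        }
    }
}

fn main() {
    let mut boxes = vec![
        BBox { xmin: 0.0, ymin: 0.0, xmax: 1.0, ymax: 1.0, confidence: 0.9 },
        BBox { xmin: 0.0, ymin: 0.0, xmax: 1.0, ymax: 1.0, confidence: 0.8 },
        BBox { xmin: 5.0, ymin: 5.0, xmax: 6.0, ymax: 6.0, confidence: 0.7 },
    ];
    soft_nms(&mut boxes, 0.5, 0.2);
    for b in &boxes {
        println!("{:?}", b);
    }
}
```

In this sketch a duplicate of the top box is decayed below the threshold and zeroed, while a non-overlapping box keeps its confidence.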

…ds (huggingface#2308) * Add documentation examples for `Tensor` methods * Apply fmt. * Cosmetic tweaks. --------- Co-authored-by: Laurent <[email protected]>

* Clippy fixes. * Bump the web_sys required version.

* Add GGUF bf16 type support * Add non avx impl for vec_dot_bf16 * Fix from_u32 * Fix loading * Fix dequant of bf16

* Expose the softcap methods * Add some tests * Fix generics

* Update kernels for metal bf16 * Fix typo * Check if have bfloat

* onnx: workaround pow with negative base rather than fully defining pow in the cpu backend (as in huggingface#2318), this implements a much smaller change which is sufficient to evaluate silero-vad onnx models. Specifically, checking if pow is run with 2.0 exponent, and if so evaluate as simply `x*x` instead of the cpu backend of `e^(2.0 * ln(x))`. * PR: use Tensor::powf instead; powf correctly handles a negative base.
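To illustrate why the exponent-2.0 special case matters, here is a small scalar sketch in plain Rust. It is not the candle-onnx code: `pow_via_exp_ln` and `pow_with_square_shortcut` are hypothetical names that mirror the `e^(2.0 * ln(x))` and `x*x` forms mentioned in the commit message.

```rust
// Power computed as e^(e * ln(x)); ln of a negative number is NaN,
// so this form cannot handle a negative base.
fn pow_via_exp_ln(x: f32, e: f32) -> f32 {
    (e * x.ln()).exp()
}

// Workaround sketch: when the exponent is exactly 2.0, square directly.
fn pow_with_square_shortcut(x: f32, e: f32) -> f32 {
    if e == 2.0 {
        x * x
    } else {
        pow_via_exp_ln(x, e)
    }
}

fn main() {
    let x = -3.0_f32;
    println!("exp/ln form:    {}", pow_via_exp_ln(x, 2.0));
    println!("x * x shortcut: {}", pow_with_square_shortcut(x, 2.0));
    // The standard library's powf also accepts a negative base
    // when the exponent is an integer value.
    println!("f32::powf:      {}", x.powf(2.0));
}
```

The exp/ln form yields NaN for a negative base, whereas the shortcut and the standard `powf` both give the square.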

index_select does not support negative indexing, but this change adds just enough workarounds in onnx to allow evaluating silero-vad models (which make use of negative indices).
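One common way to support negative indices is to shift them by the length of the indexed dimension before selecting. The sketch below shows that on a plain slice; `normalize_index` and `gather_1d` are hypothetical helpers for illustration, and the actual workaround in candle-onnx may be structured differently.

```rust
// Map a possibly negative index into 0..dim_len, where -1 refers to the
// last element. Returns None when the index is out of range.
fn normalize_index(idx: i64, dim_len: usize) -> Option<usize> {
    let len = dim_len as i64;
    let shifted = if idx < 0 { idx + len } else { idx };
    if (0..len).contains(&shifted) {
        Some(shifted as usize)
    } else {
        None
    }
}

// One-dimensional gather that accepts negative indices.
fn gather_1d<T: Copy>(data: &[T], indices: &[i64]) -> Option<Vec<T>> {
    indices
        .iter()
        .map(|&i| normalize_index(i, data.len()).map(|j| data[j]))
        .collect()
}

fn main() {
    let data = [10, 20, 30];
    println!("{:?}", gather_1d(&data, &[-1, 0, -3]));
    println!("{:?}", gather_1d(&data, &[3]));
}
```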

* silero-vad v5 example This change adds an example of how to run silero-vad v5 * PR: rename 'vad' to 'silero-vad' * Update README.md --------- Co-authored-by: Laurent Mazare <[email protected]>

…gingface#2442) * Fix for parler-tts, do not add the last slice of padding tokens. * Support for the mini model.

Co-authored-by: Yi Xu <[email protected]>

* Update cudarc to 0.12. * Some cudnn tweaks.

* correct optional SE layer dimensions. * head_dim instead of num_heads is 32. * update test example output.

* Allow loading images with given std and mean * OpenCLIP text encoder component * Two MobileCLIP models * Clippy fixes. --------- Co-authored-by: Laurent <[email protected]>

* fix FLUX.1 weights * added flux1-dev.safetensors

* Clippy fixes for 1.81.0. * Another fix.

* Bump the version to 0.6.1. (huggingface#2438) * onnx: workaround pow with negative base (huggingface#2439) * onnx: workaround pow with negative base rather than fully defining pow in the cpu backend (as in huggingface#2318), this implements a much smaller change which is sufficient to evaluate silero-vad onnx models. Specifically, checking if pow is run with 2.0 exponent, and if so evaluate as simply `x*x` instead of the cpu backend of `e^(2.0 * ln(x))`. * PR: use Tensor::powf instead; powf correctly handles a negative base. * onnx: support negative index in Gather (huggingface#2440) index_select does not support negative indexing, but this change adds just enough workarounds in onnx to allow evaluating silero-vad models (which make use of negative indices). * silero-vad v5 example (huggingface#2321) * silero-vad v5 example This change adds an example of how to run silero-vad v5 * PR: rename 'vad' to 'silero-vad' * Update README.md --------- Co-authored-by: Laurent Mazare <[email protected]> * Fix for parler-tts, do not add the last slice of padding tokens. (huggingface#2442) * Fix for parler-tts, do not add the last slice of padding tokens. * Support for the mini model. * Add FastViT model. (huggingface#2444) * fix: qwen2 lm_head loading huggingface#2443 (huggingface#2445) Co-authored-by: Yi Xu <[email protected]> * Update cudarc to 0.12. (huggingface#2451) * Update cudarc to 0.12. * Some cudnn tweaks. * FastViT fixes. (huggingface#2452) * correct optional SE layer dimensions. * head_dim instead of num_heads is 32. * update test example output. * MobileCLIP models S1 and S2 (huggingface#2454) * Allow loading images with given std and mean * OpenCLIP text encoder component * Two MobileCLIP models * Clippy fixes. --------- Co-authored-by: Laurent <[email protected]> * Fix FLUX.1 weights (huggingface#2457) * fix FLUX.1 weights * added flux1-dev.safetensors * Clippy fixes for 1.81.0. (huggingface#2461) * Clippy fixes for 1.81.0. * Another fix.
* Make Error::msg more in line with anyhow::Error::msg * Add context trait * Even more flexible * Format --------- Co-authored-by: Laurent Mazare <[email protected]> Co-authored-by: shua <[email protected]> Co-authored-by: Jani Monoses <[email protected]> Co-authored-by: ilookee <[email protected]> Co-authored-by: Yi Xu <[email protected]> Co-authored-by: Eugene Hauptmann <[email protected]>

* Add api to get current seed * Remove cell for rwlock

* Add the i16 dtype * Added I16 and I32 to fix the missing arms issue (candle-onnx/eval) * Update rust-ci.yml * Update ci_cuda.yaml * fmt adjustment * Revert "Update rust-ci.yml" This reverts commit f659d36. * Revert "Update ci_cuda.yaml" This reverts commit 62a4b39.

Commits on May 27, 2024

Merge

EricLBuehler committed May 27, 2024

c10fc33

Commits on Jun 30, 2024

Patch metal function

EricLBuehler committed Jun 30, 2024

b7a3e34

Commits on Jul 15, 2024

Complete merge

EricLBuehler committed Jul 15, 2024

c967be9

Commits on Aug 7, 2024

Simplify things a bit

EricLBuehler committed Aug 7, 2024

27ca77e

Commits on Oct 3, 2024

Fix set_dtype

EricLBuehler committed Oct 3, 2024

20a57c4

Commits on Oct 6, 2024

Add initial f8 e4m3 type

EricLBuehler committed Oct 6, 2024

121bdfd
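For readers unfamiliar with the format named in this commit, the sketch below decodes an 8-bit E4M3 value into `f32`. It assumes the common finite-only E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, NaN when all exponent and mantissa bits are set). The excerpt here does not state which variant the PR uses, and `f8e4m3_to_f32` is a hypothetical helper, not an API from the PR.

```rust
// Decode one F8E4M3 byte, assuming the finite-only E4M3 encoding.
fn f8e4m3_to_f32(bits: u8) -> f32 {
    let sign: f32 = if bits & 0x80 != 0 { -1.0 } else { 1.0 };
    let exp = ((bits >> 3) & 0x0F) as i32;
    let man = (bits & 0x07) as f32;

    // All exponent and mantissa bits set encodes NaN in this variant.
    if exp == 0x0F && (bits & 0x07) == 0x07 {
        return f32::NAN;
    }

    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading one, fixed exponent of 1 - bias.
        (man / 8.0) * 2f32.powi(-6)
    } else {
        // Normal: implicit leading one, exponent bias of 7.
        (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    };
    sign * magnitude
}

fn main() {
    for bits in [0x00u8, 0x01, 0x38, 0xB8, 0x7E, 0x7F] {
        println!("0x{:02X} -> {}", bits, f8e4m3_to_f32(bits));
    }
}
```

Under these assumptions the largest finite value is 448 (byte `0x7E`) and the smallest positive subnormal is 2^-9 (byte `0x01`).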


Commits on (2024): May 15, 16, 18, 19, 27, 28, 29, 30; Jun 1, 3, 4, 9, 11, 29, 30; Jul 15, 26, 31; Aug 4, 5, 7, 9, 14, 21, 22; Sep 2, 5, 6, 11, 13, 15; Oct 2, 3, 6