fish-speech.rs

Initial setup

This repo requires a working Rust installation (see the official docs). Packaging for Homebrew and Linux is planned.

Save the Fish Speech checkpoints to ./checkpoints. I recommend using huggingface-cli:

# If it's not already installed on your system
brew install huggingface-cli

mkdir -p checkpoints/fish-speech-1.4
huggingface-cli download jkeisling/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4

Note that we don't support the official .pth weights.
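Before building anything, it can help to confirm the download actually landed where the binaries expect it. The sketch below only checks that the directory exists and is non-empty; `check_checkpoint` is a hypothetical helper, not part of this repo, and it does not validate the weight files themselves.

```shell
# check_checkpoint DIR: report whether a checkpoint directory exists and has files.
# Hypothetical helper for illustration; it does not inspect the weights.
check_checkpoint() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ -z "$(ls -A "$dir")" ]; then
    echo "empty: $dir"
    return 1
  fi
  echo "ok: $dir"
}

# Check the default Fish 1.4 location used in this README
check_checkpoint checkpoints/fish-speech-1.4 || echo "Run the huggingface-cli download step first"
```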

System requirements

An Nvidia GPU or Apple Silicon is highly recommended. CPU inference is supported as a fallback, but it's pretty slow. Please raise an issue if you want accelerated CPU inference.
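If you script your builds, here is a rough sketch of picking the Cargo feature flag for the machine at hand. It uses the `metal` and `cuda` features that appear in the commands in this README. `select_features` is a hypothetical helper, and treating the presence of `nvidia-smi` as "has an Nvidia GPU" is an assumption, not something the repo checks.

```shell
# select_features OS ARCH HAS_NVIDIA: print the Cargo feature flag to use.
# Hypothetical helper; OS and ARCH are the outputs of `uname -s` and `uname -m`.
select_features() {
  if [ "$1" = "Darwin" ] && [ "$2" = "arm64" ]; then
    echo "--features metal"   # Apple Silicon
  elif [ "$3" = "yes" ]; then
    echo "--features cuda"    # Nvidia GPU
  else
    echo ""                   # CPU fallback: no accelerator feature
  fi
}

# Assumption: nvidia-smi on PATH means a usable Nvidia GPU is present
has_nvidia=no
if command -v nvidia-smi >/dev/null 2>&1; then
  has_nvidia=yes
fi

features="$(select_features "$(uname -s)" "$(uname -m)" "$has_nvidia")"

# Only prints the command; copy it or wrap it in eval once you trust it
echo "cargo run --release $features --bin encoder -- -i ./tests/resources/sky.wav"
```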

Usage

For now, we're keeping compatibility with the official Fish Speech inference CLI scripts. (Inference server and Python bindings coming soon!)

Generate speaker conditioning tokens

# saves to fake.npy by default
cargo run --release --features metal --bin encoder -- -i ./tests/resources/sky.wav

For Fish 1.2, you'll need to specify the version and checkpoint manually:

cargo run --release --bin encoder -- --input ./tests/resources/sky.wav --output-path fake.npy --fish-version 1.2 --checkpoint ./checkpoints/fish-speech-1.2-sft

Generate semantic codebook tokens

For Fish 1.4 (default):

# Switch to --features cuda for Nvidia GPUs
cargo run --release --features metal --bin llama_generate -- \
  --text "That is not dead which can eternal lie, and with strange aeons even death may die." \
  --prompt-text "When I heard the release demo, I was shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference." \
  --prompt-tokens fake.npy

For Fish 1.2, you'll have to specify version and checkpoint explicitly:

cargo run --release --features metal --bin llama_generate -- --text "That is not dead which can eternal lie, and with strange aeons even death may die." --fish-version 1.2 --checkpoint ./checkpoints/fish-speech-1.2-sft

For additional speed, compile with Flash Attention support.

Warning

The candle-flash-attention dependency can take more than 10 minutes to compile even on a good CPU, and can require more than 16 GB of memory! You have been warned.

Also, as of October 2024, the bottleneck is actually elsewhere (in inefficient memory copies and kernel dispatch), so on already fast hardware (like an RTX 4090) Flash Attention currently has less of an impact.
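Given the memory warning above, a quick pre-flight check before kicking off the build may save you a failed compile. This is a Linux-only sketch that reads `MemTotal` from `/proc/meminfo`. `has_enough_memory` is a hypothetical helper, and the 16 GB threshold is just the figure from the warning, not a measured requirement.

```shell
# has_enough_memory TOTAL_KB [MIN_GB]: succeed if total RAM meets the minimum.
# Hypothetical helper; MIN_GB defaults to the 16 GB figure from the warning.
has_enough_memory() {
  total_kb="$1"
  min_gb="${2:-16}"
  [ "$total_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Linux only: /proc/meminfo reports MemTotal in kB
if [ -r /proc/meminfo ]; then
  total_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  if has_enough_memory "$total_kb"; then
    echo "Memory looks sufficient for the flash-attn build"
  else
    echo "Less than 16 GB of RAM: the flash-attn build may fail or swap heavily"
  fi
fi
```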

# Cache the Flash Attention build
# Leave your computer, have a cup of tea, go touch grass, etc.
mkdir -p ~/.candle
CANDLE_FLASH_ATTN_BUILD_DIR=$HOME/.candle cargo build --release --features flash-attn --bin llama_generate

# Then run with flash-attn flag
cargo run --release --features flash-attn --bin llama_generate -- \
  --text "That is not dead which can eternal lie, and with strange aeons even death may die." \
  --prompt-text "When I heard the release demo, I was shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference." \
  --prompt-tokens fake.npy

Decode tokens to WAV

For Fish 1.4 (default):

# Switch to --features cuda for Nvidia GPUs
cargo run --release --features metal --bin vocoder -- -i out.npy -o fake.wav

For Fish 1.2:

cargo run --release --bin vocoder -- --fish-version 1.2 --checkpoint ./checkpoints/fish-speech-1.2-sft
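Putting the three stages together for Fish 1.4, the full pipeline can be sketched as a dry run that prints each command in order. `pipeline_commands` is a hypothetical helper, not part of this repo. It assumes the encoder writes `fake.npy` (its stated default) and that `llama_generate` writes `out.npy` by default, which the vocoder example above implies but does not state outright.

```shell
# pipeline_commands FEATURE REF_WAV PROMPT_TEXT TEXT OUT_WAV
# Hypothetical helper: prints the three stages, it does not run them.
pipeline_commands() {
  features="$1"; ref_wav="$2"; prompt_text="$3"; text="$4"; out_wav="$5"
  # 1. Encode the reference clip into speaker conditioning tokens (fake.npy)
  printf 'cargo run --release --features %s --bin encoder -- -i %s\n' \
    "$features" "$ref_wav"
  # 2. Generate semantic tokens (assumed to land in out.npy by default)
  printf 'cargo run --release --features %s --bin llama_generate -- --text "%s" --prompt-text "%s" --prompt-tokens fake.npy\n' \
    "$features" "$text" "$prompt_text"
  # 3. Decode the semantic tokens to audio
  printf 'cargo run --release --features %s --bin vocoder -- -i out.npy -o %s\n' \
    "$features" "$out_wav"
}

pipeline_commands metal ./tests/resources/sky.wav \
  "Transcript of the reference clip." \
  "Text you want spoken." \
  fake.wav
```

Printing first rather than executing keeps the sketch safe to paste. Once the commands look right, run them one at a time so a failure in an early stage doesn't feed stale .npy files to the next.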

License

Warning

This codebase is licensed under the original CC-BY-NC-SA-4.0 license. For non-commercial use only!

Please support the original authors by using the official API for production.

This model is licensed under the CC-BY-NC-SA-4.0 license. The source code is released under the BSD-3-Clause license.

Massive thanks also go to:

All candle_examples maintainers for highly useful code snippets across the codebase
WaveyAI's mel spec for the STFT implementation

Original README below

Fish Speech V1.4 is a leading text-to-speech (TTS) model trained on 700k hours of audio data in multiple languages.

Supported languages:

English (en) ~300k hours
Chinese (zh) ~300k hours
German (de) ~20k hours
Japanese (ja) ~20k hours
French (fr) ~20k hours
Spanish (es) ~20k hours
Korean (ko) ~20k hours
Arabic (ar) ~20k hours

Please refer to the Fish Speech GitHub repository for more info.
Demo available at Fish Audio.

Citation

If you found this repository useful, please consider citing this work:

@misc{fish-speech-v1.4,
  author = {Shijia Liao and Tianyu Li and others},
  title = {Fish Speech V1.4},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/fishaudio/fish-speech}}
}
