
How to obtain moshi response using API #157

Open
treya-lin opened this issue Nov 21, 2024 · 2 comments
Labels
question Further information is requested

Comments

treya-lin commented Nov 21, 2024

Due diligence

  • I have done my due diligence in trying to find the answer myself.

Topic

The PyTorch implementation

Question

Hello, thanks for your great work. I am trying the Python API to see if I can use existing audio files to simulate streaming input and obtain Moshi's reply, but it didn't work as expected, so I assume I am not using it properly. Could you kindly take a look?

My main question:

  1. When I have an existing audio file and I want Moshi to listen and respond to it, it always responds with a greeting first, and then either remains silent or (sometimes) says something. Does this mean I need to add a very long pause to wait for it to reply? What is the best practice for making it reply to a given piece of speech?

Some other questions:
2. If I want to feed my earlier input, Moshi's reply, and then my new input to get a new round of replies, how should I form my input? (That is, how can I make Moshi aware of what it replied earlier?)
3. Can I control more of how it replies? Say I already have a script: can I make Moshi follow that script when conversing with me?

The code I used when trying to solve question 1:

  1. Mostly borrowed from Moshi's README.
  2. Modifications:
    (1) I padded my input audio file so that the number of samples is a multiple of 1920.
    (2) I put the models in a local dir, so I changed the default_repo.
    (3) I added 4 seconds of silence at the end of my audio. Initially I didn't add silence and Moshi didn't produce a reply, so I thought I might need to simulate a human pause, but it didn't work properly either way.
import os
import librosa
import numpy as np
import torch
from moshi.models import loaders, LMGen

loaders.DEFAULT_REPO = "/data/resources/models/kyutai/moshika-pytorch-bf16/"
device = "cuda"
mimi_weight = os.path.join(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device=device)
mimi.set_num_codebooks(8)  # up to 32 for mimi, but limited to 8 for moshi.
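
(For completeness, here is how lm_gen, used in the generation loop below, is set up; this follows the loader API from Moshi's README, and the sampling temperatures are the README's example values, not something I tuned:)

# Load the Moshi LM itself and wrap it in LMGen for step-by-step generation.
moshi_weight = os.path.join(loaders.DEFAULT_REPO, loaders.MOSHI_NAME)
moshi = loaders.get_moshi_lm(moshi_weight, device=device)
lm_gen = LMGen(moshi, temp=0.8, temp_text=0.7)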

def padding(wav: np.ndarray, multiple: int = 1920) -> np.ndarray:
    """
    Pads the audio signal so its length is a multiple of the specified value.

    Parameters:
    - wav (np.ndarray): The input audio signal, shape: [T].
    - multiple (int): The target multiple to pad the length to.

    Returns:
    - np.ndarray: The zero-padded audio signal.
    """
    if not isinstance(wav, np.ndarray):
        raise ValueError("Input wav must be a NumPy array.")

    if multiple <= 0:
        raise ValueError("Multiple must be a positive integer.")

    # Calculate the current length and the padding needed
    current_length = wav.shape[0]
    padding_length = (multiple - (current_length % multiple)) % multiple

    # Add zero-padding to the end of the audio
    if padding_length > 0:
        wav = np.pad(wav, (0, padding_length), mode='constant', constant_values=0)

    return wav
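
For example, a 1000-sample input gets (1920 - (1000 % 1920)) % 1920 = 920 zeros appended:

# Quick check of the padding arithmetic on a dummy signal.
dummy = padding(np.zeros(1000, dtype=np.float32))
assert dummy.shape[0] == 1920  # 1000 samples + 920 appended zeros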

# Create input data from a speech audio file and add 4 s of silence at the end.
wavpath = "input.wav"  # placeholder: path to my test recording
wav, sr = librosa.load(wavpath, sr=24000, mono=True)
silence_duration = 4  # seconds
silence = np.zeros(int(sr * silence_duration), dtype=np.float32)
wav = np.concatenate((wav, silence))
wav = padding(wav)
wav = torch.tensor(wav).unsqueeze(0).unsqueeze(0).to(device)  # Shape: [B=1, C=1, T]

# encode the input
with torch.no_grad():
    nonstream_codes = mimi.encode(wav)  # [B, K = 8, T]
    non_stream_decoded = mimi.decode(nonstream_codes)

    # Supports streaming too.
    frame_size = int(mimi.sample_rate / mimi.frame_rate) # 1920
    all_codes = []
    with mimi.streaming(batch_size=1):
        for offset in range(0, wav.shape[-1], frame_size):
            frame = wav[:, :, offset: offset + frame_size]
            codes = mimi.encode(frame)
            assert codes.shape[-1] == 1, codes.shape
            all_codes.append(codes)
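
(As a quick sanity check, not in the README snippet: the streamed frames should concatenate to the same token grid as the one-shot encode above.)

# Streaming and non-streaming encodes should line up frame for frame.
stream_codes = torch.cat(all_codes, dim=-1)
assert stream_codes.shape == nonstream_codes.shape, (stream_codes.shape, nonstream_codes.shape)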

import gc
def clear_cache():
    gc.collect()
    torch.cuda.empty_cache()
    
out_wav_chunks = []
# Now we will stream over both Moshi I/O, and decode on the fly with Mimi.
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for idx, code in enumerate(all_codes):
        tokens_out = lm_gen.step(code.cuda())
        # tokens_out is [B, 1 + 8, 1]; tokens_out[:, 0] is the text token,
        # tokens_out[:, 1:] are the audio tokens passed to mimi.decode.
        if tokens_out is not None:
            wav_chunk = mimi.decode(tokens_out[:, 1:])
            out_wav_chunks.append(wav_chunk)
        print(idx, end='\r')
out_wav = torch.cat(out_wav_chunks, dim=-1)
clear_cache()
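
(To inspect what Moshi is "saying" on the text channel, which should show whether it only greets, here is a hedged sketch; it assumes the local model dir also ships the SentencePiece text tokenizer referenced by loaders.TEXT_TOKENIZER_NAME, and the special-id filtering below is a guess rather than a documented contract:)

import sentencepiece

# Assumption: the text tokenizer sits next to the weights in DEFAULT_REPO.
text_tokenizer = sentencepiece.SentencePieceProcessor(
    model_file=os.path.join(loaders.DEFAULT_REPO, loaders.TEXT_TOKENIZER_NAME)
)

text_ids = []
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for code in all_codes:
        tokens_out = lm_gen.step(code.cuda())
        if tokens_out is not None:
            # tokens_out[:, 0] is the text stream.
            text_ids.append(tokens_out[0, 0, 0].item())

# Drop special/padding ids before decoding (the cutoff here is a guess).
print(text_tokenizer.decode([i for i in text_ids if i > 3]))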

# Save the output file.
import torchaudio

output_path = "output_moshi.wav"
torchaudio.save(output_path, out_wav.squeeze(0).cpu(), sample_rate=24000)

I tried many times with audio of different lengths, but it always just returned Moshi saying something like "hey, what's up" or "hey, how's it going". Once or twice it replied with something meaningful after the greeting, but still, I hope it can just listen to my words and reply without always greeting first. I am trying to look into the code too, but I don't think I am doing it the proper way. Could you please give more guidance on how to use the API to play around with it? Thank you! Any suggestion is much appreciated!

treya-lin added the question label on Nov 21, 2024

treya-lin commented Nov 21, 2024

Examples (GitHub does not accept wav, so I had to upload them as webm, sorry...):

  1. In this example, Moshi only greets and doesn't reply with meaningful content; it greets and then remains silent till the end.
    A_4.webm
    output_moshi_4.webm

  2. This is one of the very few times it did reply, but it does not consistently respond like this; sometimes it only greets. And I don't understand why it greets when the input is already talking.
    A_0.webm
    output_moshi.webm

jlian2 commented Dec 24, 2024

Hey, do you think this API can take two-channel speech as input (the same as dGSLM)?
