Hi VITA team,
Thanks for open-sourcing this. I've learnt a lot from it.
Do you have the training code for the audio encoder and connector? It doesn't have to be neat; a code dump of whatever you have would be fine. I'm trying to reproduce the training but having trouble, and I have questions such as: did you align audio with Qwen by freezing Qwen and training only the encoder or connector? Or did you fine-tune part of the Qwen model to align it with the encoder or connector?
All the scripts in the repo seem to freeze the audio_encoder, so I'm assuming the code that trained it isn't included.
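
To make the question concrete, here is a rough sketch of the first option (Qwen frozen, only the audio side trained). It is my guess at the setup, not something taken from the repo. It assumes a PyTorch-style model exposing `named_parameters()`; the prefixes `connector.` and `llm.` are hypothetical names I made up, and only `audio_encoder` matches what I see in the scripts.

```python
def set_trainable(model, trainable_prefixes):
    """Freeze every parameter except those whose name starts with a given prefix.

    Works with any object that exposes named_parameters() yielding
    (name, param) pairs where param has a requires_grad attribute,
    e.g. a torch.nn.Module.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        # A parameter is trained only if it belongs to one of the chosen submodules.
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Option 1 (assumed): train the audio side, keep the LLM frozen.
# The "connector." prefix is a hypothetical module name.
AUDIO_ONLY = ["audio_encoder.", "connector."]

# Option 2 (assumed): additionally unfreeze the LLM under a hypothetical "llm." prefix.
AUDIO_PLUS_LLM = AUDIO_ONLY + ["llm."]
```

Calling `set_trainable(model, AUDIO_ONLY)` would correspond to the first option and `set_trainable(model, AUDIO_PLUS_LLM)` to the second. Which of these (if either) matches what you did?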
