I made some changes to the model (3D convs) and trained the small variant with 128 tokens on 128p, 16-frame videos, pre-compressed with CogVideoX's VAE and trained with MSE loss.
It turned out better than I expected, considering how fast the training was on consumer hardware (a couple of hours).
There's a lot of potential here, and I think I can improve the performance a lot further.
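For context, the MSE reconstruction objective mentioned above can be sketched in a few lines. This is a dependency-free illustration with plain Python lists standing in for latent tensors; a real training loop would use torch tensors and autograd, and the `pred`/`target` names here are just placeholders.

```python
def mse_loss(pred, target):
    """Mean squared error between two flat lists of latent values.

    In the setup described above, `target` would be the CogVideoX VAE
    latents of a clip and `pred` the tokenizer's reconstruction of them.
    """
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

print(mse_loss([0.0, 1.0, 2.0], [0.0, 0.0, 0.0]))  # (0 + 1 + 4) / 3
```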
@NilanEkanayake I would be extremely interested in this because I am currently trying to tokenise sign language videos to input into an LLM here for translation tasks!
It compresses fixed-length videos, so I'm not sure how well it would work for that. You'd have to string multiple tokenized clips together, depending on the length of the input.
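Stringing clips together could look something like the sketch below. It is a minimal illustration, not the project's actual API: `tokenize_clip` is a hypothetical stand-in for the real tokenizer, and the 16-frame / 128-token figures are taken from the comment above.

```python
CLIP_LEN = 16          # frames per tokenized clip (from the comment above)
TOKENS_PER_CLIP = 128  # token budget per clip (from the comment above)

def tokenize_clip(clip):
    # Placeholder: a real tokenizer would run the video model here and
    # return TOKENS_PER_CLIP discrete token ids for the 16-frame clip.
    return list(range(TOKENS_PER_CLIP))

def tokenize_video(frames):
    """Split a variable-length frame list into fixed 16-frame clips
    (padding the tail by repeating the last frame) and concatenate
    each clip's tokens into one sequence for the LLM."""
    tokens = []
    for start in range(0, len(frames), CLIP_LEN):
        clip = frames[start:start + CLIP_LEN]
        if len(clip) < CLIP_LEN:  # pad the final partial clip
            clip = clip + [clip[-1]] * (CLIP_LEN - len(clip))
        tokens.extend(tokenize_clip(clip))
    return tokens

seq = tokenize_video([f"frame{i}" for i in range(40)])  # 40 frames -> 3 clips
print(len(seq))  # 3 * 128 = 384
```

One caveat with this approach is that clip boundaries are arbitrary, so motion spanning two clips (common in signing) gets split across token groups.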
You might have better luck training a custom model from scratch, where the model takes in the videos and produces a translation, instead of using an LLM with a video tokenizer on top.
Have you tried feeding the LLM pose-estimation outputs instead? That would bypass the tokenizer-quality issue and be a lot more flexible.