I made some changes to the model (3D convs) and trained the small variant with 128 tokens on 128p, 16-frame videos, pre-compressed with CogVideoX's VAE and trained with MSE loss.
It turned out better than I expected, considering how fast the training was on consumer hardware (a couple of hours).
There's a lot of potential here, and I think I can improve the performance a lot further.
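For context, the MSE reconstruction objective mentioned above can be sketched in a few lines. This is a dependency-free illustration with plain Python lists standing in for latent tensors; a real training loop would use torch tensors and autograd, and the `pred`/`target` names here are just placeholders.

```python
def mse_loss(pred, target):
    """Mean squared error between two flat lists of latent values.

    In the setup described above, `target` would be the CogVideoX VAE
    latents of a clip and `pred` the tokenizer's reconstruction of them.
    """
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

print(mse_loss([0.0, 1.0, 2.0], [0.0, 0.0, 0.0]))  # (0 + 1 + 4) / 3
```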
@NilanEkanayake I would be extremely interested in this because I am currently trying to tokenise sign language videos to input into an LLM here for translation tasks!
It compresses fixed-length videos, so I'm not sure how well it would work for that. You'd have to string multiple tokenized clips together, depending on the length of the input.
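Stringing clips together could look something like the sketch below. It is a minimal illustration, not the project's actual API: `tokenize_clip` is a hypothetical stand-in for the real tokenizer, and the 16-frame / 128-token figures are taken from the comment above.

```python
CLIP_LEN = 16          # frames per tokenized clip (from the comment above)
TOKENS_PER_CLIP = 128  # token budget per clip (from the comment above)

def tokenize_clip(clip):
    # Placeholder: a real tokenizer would run the video model here and
    # return TOKENS_PER_CLIP discrete token ids for the 16-frame clip.
    return list(range(TOKENS_PER_CLIP))

def tokenize_video(frames):
    """Split a variable-length frame list into fixed 16-frame clips
    (padding the tail by repeating the last frame) and concatenate
    each clip's tokens into one sequence for the LLM."""
    tokens = []
    for start in range(0, len(frames), CLIP_LEN):
        clip = frames[start:start + CLIP_LEN]
        if len(clip) < CLIP_LEN:  # pad the final partial clip
            clip = clip + [clip[-1]] * (CLIP_LEN - len(clip))
        tokens.extend(tokenize_clip(clip))
    return tokens

seq = tokenize_video([f"frame{i}" for i in range(40)])  # 40 frames -> 3 clips
print(len(seq))  # 3 * 128 = 384
```

One caveat with this approach is that clip boundaries are arbitrary, so motion spanning two clips (common in signing) gets split across token groups.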
You might have better luck training a custom model from scratch, where the model takes in the videos and produces a translation, instead of using an LLM with a video tokenizer on top.
Have you tried feeding the LLM pose-estimation outputs instead? That would bypass the tokenizer-quality issue and be a lot more flexible.