User/rcadene/2024 10 07 vla #467

Draft: wants to merge 9 commits into base: main
Conversation

danaaubakirova (Collaborator)
What this does

Explain what this PR does. Feel free to tag your PR with the appropriate label(s).

Examples:

Title                  Label
Fixes #[issue]         (πŸ› Bug)
Adds new dataset       (πŸ—ƒοΈ Dataset)
Optimizes something    (⚑️ Performance)

How it was tested

Explain/show how you tested your changes.

Examples:

  • Added test_something in tests/test_stuff.py.
  • Added new_feature and checked that training converges with policy X on dataset/environment Y.
  • Optimized some_function, it now runs X times faster than previously.

How to checkout & try? (for the reviewer)

Provide a simple way for the reviewer to try out your changes.

Examples:

DATA_DIR=tests/data pytest -sx tests/test_stuff.py::test_something
python lerobot/scripts/train.py --some.option=true

SECTION TO REMOVE BEFORE SUBMITTING YOUR PR

Note: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Note: Before submitting this PR, please read the contributor guideline.

@danaaubakirova danaaubakirova marked this pull request as draft October 10, 2024 15:31
)

hidden_states = llava_output.hidden_states[-1]  # use the last layer's hidden state
hidden_states = hidden_states[:, -4:, :]  # TODO: make 4 a config parameter
Taking the last 4 embeddings as input to the action decoder, because chunk_size is divisible by 4.
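The selection described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `num_action_tokens` is a hypothetical name standing in for the hard-coded 4 that the inline TODO says should become a config parameter, under the stated assumption that chunk_size is divisible by it.

```python
import torch

def select_action_embeddings(hidden_states: torch.Tensor, num_action_tokens: int = 4) -> torch.Tensor:
    """Keep only the last `num_action_tokens` token embeddings for the action decoder.

    hidden_states: (batch, seq_len, hidden_dim), last layer of the VLM.
    """
    return hidden_states[:, -num_action_tokens:, :]

# Dummy last-layer hidden states: batch of 2, sequence of 32 tokens, dim 768.
x = torch.randn(2, 32, 768)
out = select_action_embeddings(x, num_action_tokens=4)
print(out.shape)  # torch.Size([2, 4, 768])
```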

Comment on lines +84 to +86

processed_inputs = self.processor(
    text=batch["prompt"],
    videos=list(batch["observation.images"]),
    return_tensors="pt",
    padding=True,
    do_rescale=False,
).to(self.device)
For some reason I lose the original batch size (it should be 2). Maybe because we have a 5-dim input; I'm not sure what to do with the camera index. I also removed normalization, because the processor takes unnormalized PIL images/frames.
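One plausible reading of the lost batch size, sketched under assumptions (the shape `(batch, num_cameras, C, H, W)` and the regrouping step are illustrative, not confirmed by the PR): converting a 5-dim observation tensor to a flat list of frames merges the batch and camera indices, so the processor sees batch * num_cameras items. Flattening explicitly and restoring the batch dimension afterwards keeps the two indices distinguishable.

```python
import torch

# Dummy observations: batch of 2, 3 cameras, 3-channel 224x224 frames.
images = torch.rand(2, 3, 3, 224, 224)  # (B, num_cams, C, H, W)
b, n_cams = images.shape[:2]

# Flattening batch and camera dims yields B * num_cams frames, not B.
frames = list(images.flatten(0, 1))  # 6 frames of shape (C, H, W)
print(len(frames))  # 6, not the original batch size of 2

# After processing/encoding, outputs can be regrouped per batch element.
regrouped = torch.stack(frames).view(b, n_cams, *images.shape[2:])
print(regrouped.shape)  # torch.Size([2, 3, 3, 224, 224])
```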

resolved

Returns:
    action_logits: Tensor of predicted actions.
"""
batch_size = hidden_states.size(0)  # ensure the batch size is extracted
This part is the core, and I think it might contain mistakes: the way we repeat the input to match the chunk_size, and also the encoder_out value, need to be checked and reviewed.
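The repetition step flagged above can be sketched as follows. This is a hypothetical reconstruction for review purposes, not the PR's implementation: `expand_to_chunk` and its argument names are illustrative, and the assumption (consistent with the earlier comment) is that each of the `num_tokens` embeddings is repeated `chunk_size // num_tokens` times to produce the decoder input.

```python
import torch

def expand_to_chunk(hidden_states: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Repeat token embeddings along the sequence dim to reach `chunk_size` queries.

    hidden_states: (batch, num_tokens, hidden_dim).
    """
    batch_size, num_tokens, hidden_dim = hidden_states.shape
    assert chunk_size % num_tokens == 0, "chunk_size must be divisible by num_tokens"
    repeats = chunk_size // num_tokens
    # repeat_interleave keeps token order grouped: t0,t0,...,t1,t1,...
    return hidden_states.repeat_interleave(repeats, dim=1)

# 4 embeddings expanded to a chunk of 100 action queries.
h = torch.randn(2, 4, 768)
encoder_out = expand_to_chunk(h, chunk_size=100)
print(encoder_out.shape)  # torch.Size([2, 100, 768])
```

Whether plain repetition is the right inductive bias here (versus, say, learned query embeddings) is exactly the kind of question the comment asks reviewers to check.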
