User/rcadene/2024 10 07 vla #467

Draft: wants to merge 9 commits into base: main
Conversation

danaaubakirova (Collaborator)
What this does

Explain what this PR does. Feel free to tag your PR with the appropriate label(s).

Examples:

Title                  Label
Fixes #[issue]         (πŸ› Bug)
Adds new dataset       (πŸ—ƒοΈ Dataset)
Optimizes something    (⚑️ Performance)

How it was tested

Explain/show how you tested your changes.

Examples:

  • Added test_something in tests/test_stuff.py.
  • Added new_feature and checked that training converges with policy X on dataset/environment Y.
  • Optimized some_function, it now runs X times faster than previously.

How to checkout & try? (for the reviewer)

Provide a simple way for the reviewer to try out your changes.

Examples:

DATA_DIR=tests/data pytest -sx tests/test_stuff.py::test_something
python lerobot/scripts/train.py --some.option=true

SECTION TO REMOVE BEFORE SUBMITTING YOUR PR

Note: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Note: Before submitting this PR, please read the contributor guideline.

@danaaubakirova danaaubakirova marked this pull request as draft October 10, 2024 15:31
)

hidden_states = llava_output.hidden_states[-1]  # use the last layer's hidden state
hidden_states = hidden_states[:, -4:, :]  # TODO: make 4 a config parameter
Taking the last 4 embeddings as input to the action decoder, because chunk_size is divisible by 4.
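The selection described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `num_action_tokens` is a hypothetical name standing in for the hard-coded 4 that the inline TODO says should become a config parameter, under the stated assumption that chunk_size is divisible by it.

```python
import torch

def select_action_embeddings(hidden_states: torch.Tensor, num_action_tokens: int = 4) -> torch.Tensor:
    """Keep only the last `num_action_tokens` token embeddings for the action decoder.

    hidden_states: (batch, seq_len, hidden_dim), last layer of the VLM.
    """
    return hidden_states[:, -num_action_tokens:, :]

# Dummy last-layer hidden states: batch of 2, sequence of 32 tokens, dim 768.
x = torch.randn(2, 32, 768)
out = select_action_embeddings(x, num_action_tokens=4)
print(out.shape)  # torch.Size([2, 4, 768])
```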

Comment on lines +84 to +86

processed_inputs = self.processor(
    text=batch["prompt"],
    videos=list(batch["observation.images"]),
    return_tensors="pt",
    padding=True,
    do_rescale=False,
).to(self.device)
For some reason I lose the original batch size (it should be 2). Maybe because we have a 5-dim input; I'm not sure what to do with the camera index. I also removed normalization, because the processor takes unnormalized PIL images/frames.
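One plausible reading of the lost batch size, sketched under assumptions (the shape `(batch, num_cameras, C, H, W)` and the regrouping step are illustrative, not confirmed by the PR): converting a 5-dim observation tensor to a flat list of frames merges the batch and camera indices, so the processor sees batch * num_cameras items. Flattening explicitly and restoring the batch dimension afterwards keeps the two indices distinguishable.

```python
import torch

# Dummy observations: batch of 2, 3 cameras, 3-channel 224x224 frames.
images = torch.rand(2, 3, 3, 224, 224)  # (B, num_cams, C, H, W)
b, n_cams = images.shape[:2]

# Flattening batch and camera dims yields B * num_cams frames, not B.
frames = list(images.flatten(0, 1))  # 6 frames of shape (C, H, W)
print(len(frames))  # 6, not the original batch size of 2

# After processing/encoding, outputs can be regrouped per batch element.
regrouped = torch.stack(frames).view(b, n_cams, *images.shape[2:])
print(regrouped.shape)  # torch.Size([2, 3, 3, 224, 224])
```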

resolved

Returns:
    action_logits: Tensor of predicted actions.
"""
batch_size = hidden_states.size(0)  # ensure the batch size is extracted
This part is the core, and I think it might contain mistakes: the way we repeat the input to match the chunk_size, and also the encoder_out value, need to be checked and reviewed.
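The repetition step flagged above can be sketched as follows. This is a hypothetical reconstruction for review purposes, not the PR's implementation: `expand_to_chunk` and its argument names are illustrative, and the assumption (consistent with the earlier comment) is that each of the `num_tokens` embeddings is repeated `chunk_size // num_tokens` times to produce the decoder input.

```python
import torch

def expand_to_chunk(hidden_states: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Repeat token embeddings along the sequence dim to reach `chunk_size` queries.

    hidden_states: (batch, num_tokens, hidden_dim).
    """
    batch_size, num_tokens, hidden_dim = hidden_states.shape
    assert chunk_size % num_tokens == 0, "chunk_size must be divisible by num_tokens"
    repeats = chunk_size // num_tokens
    # repeat_interleave keeps token order grouped: t0,t0,...,t1,t1,...
    return hidden_states.repeat_interleave(repeats, dim=1)

# 4 embeddings expanded to a chunk of 100 action queries.
h = torch.randn(2, 4, 768)
encoder_out = expand_to_chunk(h, chunk_size=100)
print(encoder_out.shape)  # torch.Size([2, 100, 768])
```

Whether plain repetition is the right inductive bias here (versus, say, learned query embeddings) is exactly the kind of question the comment asks reviewers to check.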
