
DinoV2 & Depth Anything V2: Bigger Models #2288

Open · wants to merge 38 commits into base: main
Conversation

@jeroenvlek (Contributor) commented Jun 25, 2024

Sorry to open a draft again, but while adding more models I found out why the DinoV2 example loads its safetensors from a different location than Facebook's own HF space: Facebook's safetensors keep the query, key and value projections as separate tensors, whereas the original Python code, and thus the Candle implementation, uses a single packed QKV layer. Furthermore, the original model file relies on separate head weights (I'm guessing the head is not even needed for Depth Anything, so I might remove it from the example).

So what I've done here:

  • Loaded the weights from Facebook's HF space and packed them together into a single QKV layer
  • Made the head optional with an optional VarBuilder
  • Downloaded all head files from Facebook's GitHub and converted them to safetensors in my own HF space
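The packing step can be sketched like this (plain Python with nested lists standing in for tensors; the real code would concatenate candle tensors, and the names here are illustrative):

```python
def pack_qkv(q_w, k_w, v_w, q_b, k_b, v_b):
    """Pack separate query/key/value projection weights into the single
    fused QKV layout: q rows first, then k, then v. Weights are lists of
    rows ([dim] each); the packed weight has 3*dim rows."""
    return q_w + k_w + v_w, q_b + k_b + v_b

# Tiny dim=2 example:
qkv_w, qkv_b = pack_qkv(
    [[1, 0], [0, 1]],   # query weight
    [[2, 0], [0, 2]],   # key weight
    [[3, 0], [0, 3]],   # value weight
    [0, 0], [1, 1], [2, 2],
)
print(len(qkv_w), qkv_w[2])  # 6 rows; row 2 is the first key row: [2, 0]
```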

Question: Do you agree with this approach?

Happy to adapt to other ideas, but if you agree, I will continue adding the base, large and giant models.

PS: I saw you also did some formatting on my previous PR and fixed some Clippy warnings. Maybe it's an idea to have a Contributor page that details the formatting requirements and other conventions?

@jeroenvlek jeroenvlek marked this pull request as draft June 25, 2024 09:35
@jeroenvlek (Contributor, Author) commented Jun 26, 2024

Disclaimer: I had some extra time, so I just proceeded in this direction. Happy to receive any feedback and adapt, of course!

I've added base, large and giant. The latter only for Dino, as the Depth Anything giant model is still pending review.

I also removed the awkward config object handling and added the model to the README.

@jeroenvlek jeroenvlek marked this pull request as ready for review June 26, 2024 08:55
@jeroenvlek jeroenvlek changed the title Depth Anything V2: Bigger Models DinoV2 & Depth Anything V2: Bigger Models Jun 26, 2024
@jeroenvlek commented Jun 26, 2024

There still seems to be a bug in there: comparing this large implementation with the web-based model, this output has jitter and less pronounced depth. Will investigate.

// edit

I think I found it. The way they compute the patch dimensions and then pass them along through the forward method still gives me a headache.

@jeroenvlek commented Jun 27, 2024

Just noticed that load_image returns (channels, height, width) but load_image_and_resize returns (channels, width, height).

I'll open an issue for that one.

// edit

#2291

@jeroenvlek jeroenvlek marked this pull request as draft June 27, 2024 12:55
@jeroenvlek commented Jun 29, 2024

Small update: originally I thought the difference in output quality was due to model size, but there were still some bugs in there; most seem to be fixed now. It's still quite hard to get the input conditions identical. For example, they convert from BGR to RGB, but then they resize with OpenCV and the data is back in BGR (I verified this in the Python code by zeroing out individual channels).
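The channel-order trap can be illustrated with a toy sketch (pure Python, pixels as tuples; not the actual pipeline code):

```python
def swap_rb(img):
    """Reverse the per-pixel channel order (BGR -> RGB, or back)."""
    return [[px[::-1] for px in row] for row in img]

def zero_channel(img, c):
    """The debugging trick mentioned above: zero out one channel and see
    which colour actually disappears in the rendered output."""
    return [[tuple(0 if i == c else v for i, v in enumerate(px))
             for px in row] for row in img]

bgr = [[(255, 0, 0)]]       # pure blue, stored as BGR
rgb = swap_rb(bgr)          # (0, 0, 255): same blue, stored as RGB
print(rgb[0][0])
```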

@jeroenvlek commented Jun 29, 2024

@LaurentMazare I uploaded a new depth image of the bike picture. In your experience, do you think this could be caused by the difference between UpsampleNearest2D and F.interpolate(img, size=(height, width), mode="bilinear", align_corners=True)?

// edit

GPT-4o suggests it may be a source of discrepancies:

To fully align the Rust implementation with the PyTorch interpolate function, especially for mode='bilinear' with align_corners=True, you would need a more detailed and comprehensive implementation of bilinear interpolation as shown above. The nearest-neighbor interpolation provided in the Rust code serves as a good starting point but does not handle the complexity required for bilinear interpolation with the align_corners parameter.

I'll experiment a bit with an implementation that does bilinear filtering and aligns the corners, like PyTorch does.
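For reference, a minimal 1-D sketch of the coordinate mapping behind the two modes (my own reading of PyTorch's behaviour, not code from this PR):

```python
def resize_1d(src, out_len, align_corners=True):
    """Bilinear 1-D resample mirroring PyTorch's coordinate mapping:
    align_corners=True  -> x_src = x_dst * (in-1)/(out-1)
    align_corners=False -> x_src = (x_dst + 0.5) * in/out - 0.5, clamped at 0."""
    n = len(src)
    out = []
    for i in range(out_len):
        if align_corners:
            x = i * (n - 1) / (out_len - 1) if out_len > 1 else 0.0
        else:
            x = max(0.0, (i + 0.5) * n / out_len - 0.5)
        lo = min(int(x), n - 1)
        hi = min(lo + 1, n - 1)
        frac = x - lo
        out.append(src[lo] * (1 - frac) + src[hi] * frac)
    return out

# With align_corners=True the endpoints map exactly onto the endpoints:
print(resize_1d([0.0, 1.0], 5, align_corners=True))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```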

@LaurentMazare (Collaborator) commented:

It's hard to say, but I wouldn't expect it to make that much of a difference. I'm assuming the image you uploaded is the Python output, which looks great, vs the Rust output, which is not that good. Usually, for the bits that are not supported in candle, I just modify the Python code to match candle's behavior, e.g. I would try without the bilinear on the Python side here to check that the Python output is still meaningful.
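Following that suggestion, the candle-style nearest-neighbour behaviour could be swapped into the Python side; a 1-D sketch of the usual floor(dst * in/out) index mapping (my assumption of what UpsampleNearest2D does, not verified against candle's kernel):

```python
def upsample_nearest_1d(src, out_len):
    """Nearest-neighbour upsampling via src_index = floor(dst * in/out),
    the behaviour one would patch into the Python side to mimic
    candle's UpsampleNearest2D for this comparison."""
    n = len(src)
    return [src[min(i * n // out_len, n - 1)] for i in range(out_len)]

print(upsample_nearest_1d([10, 20], 4))  # [10, 10, 20, 20]
```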

@jeroenvlek commented Jun 29, 2024

What I uploaded is the Rust output of the large model. The Python output is smoother. That's why I'm suspicious of the interpolation (which is also used deep inside the feature fusion blocks).

Let me investigate and apologies for all the noise :)

@jeroenvlek commented Jul 3, 2024

@LaurentMazare I need a bit of guidance here: the pushed code follows the flow of the Python implementation fairly closely; the logic is the same. However, the results are still not as good as they should be.

Python small:

small_20240224_072209

Rust small (this branch):

3_depth_20240224_072209

This is still not great. So I went further and opened another branch (not pushed) where I added:

  • Image loading and resizing using OpenCV (I understand you might not want that as a dependency; this is just for testing)
  • Bilinear interpolation (CPU storage only; there doesn't seem to be a default CUDA kernel, and Candle doesn't seem to support custom kernels)
  • Lots of debug statements and intermediate statistics (min, max and mean)
  • Loading the Dino weights from the DepthAnythingV2 safetensors file instead of from Facebook

The OpenCV loading and prep brings the inputs fairly close together (still a -5.86% difference in means, though), but then, already in the DinoV2 model, things start to diverge rather badly:

model_layer_analysis.ods

Now, I'm not expecting you to look at or follow the analysis, but there are differences of over 1400% between the means there.

My main question would be: how closely do other ports of PyTorch models to Candle follow the original numerically? How much tolerance is there for numerical deviation? I would argue that in principle there shouldn't be any deviation at all, and these statistics should match rather closely. And how chaotic can a model behave, i.e. could a 5.86% difference in the input snowball into these kinds of deviations?
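The kind of layer-by-layer comparison behind the spreadsheet can be sketched like this (plain Python; the activation values below are made up):

```python
def stats(xs):
    """Min / max / mean summary for one layer's activations."""
    return {"min": min(xs), "max": max(xs), "mean": sum(xs) / len(xs)}

def mean_rel_diff(a, b):
    """Relative difference of the means of two activation sets, the
    rough divergence metric quoted above (e.g. 0.5 == 50% off)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return abs(ma - mb) / abs(mb)

rust_act = [2.0, 4.0]     # hypothetical Rust-side activations for one layer
python_act = [1.0, 3.0]   # hypothetical Python-side activations
print(stats(rust_act), mean_rel_diff(rust_act, python_act))
```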

@jeroenvlek jeroenvlek marked this pull request as ready for review July 6, 2024 08:18
@jeroenvlek commented:

Submitting this PR for now, as it is definitely an improvement over #2279.

There are still unexplained numerical deviations in there, which already start in the application of the DinoV2 blocks. Whether this is something fundamental or something obvious, I was not able to discover, but I have stared at this problem long enough :)

@LaurentMazare (Collaborator) commented:

Thanks for looking into this. The blocky results with your latest version certainly feel like some form of interpolation is off (or some upsampling of the depth results?). Anyway, the PR looks good, but would it be possible not to make changes on the dinov2 side? This model code might already be in use, so it's better not to break compatibility and to reduce the diff as much as possible.

@jeroenvlek commented Jul 6, 2024

I wouldn't mind doing that, but that was the point of the first post: the weights Facebook put on HF have the query, key and value tensors separate.

However, I believe the Depth Anything weights (which contain a copy of the Dino weights) are packed in the old/different format, so let me try with that.

Regarding interpolation: I tried bilinear on the candle side (as mentioned above, CPU only), and I tried nearest-neighbor interpolation inside the PyTorch implementation's interpolate_pos_encoding(), as in your comment here.

According to my rudimentary analysis, however, the mean divergence already starts when getting the intermediate blocks (so the forward method of the Dino blocks produces different results).

Anyway, happy to reduce the diff. Will work on that tomorrow :)

@jeroenvlek commented Jul 8, 2024

> Thanks for looking into this, the blocky results with your latest version certainly feel like some form of interpolation is off (or some upsampling of the depth results?). Anyway the PR looks good but would it be possible not to make changes on the dinov2 side? This model code might already be used so better not to break compatibility and try to reduce the diff as much as possible.

@LaurentMazare Just to be clear: these new prefixes follow the Facebook naming, and they also have the bigger models (base, large, giant). If we keep the old prefixes, then we need to pack the Q, K, V layers into a single layer, add the head weights, and export the three models to safetensors.

Is that really what you want? (I mean, Candle uses versioning, although I do understand these are breaking changes.)
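Keeping the old prefixes would essentially mean a key-remapping pass over the checkpoint before export; a hypothetical sketch (the tensor names below are invented, not the real Facebook or candle names):

```python
# Hypothetical rename table; the actual Facebook and candle tensor
# names differ and are only illustrated here.
RENAMES = {
    "new.prefix.weight": "old.prefix.weight",
}

def remap_keys(tensors, renames):
    """Return a new dict with any key that has a rename translated,
    leaving all other keys untouched."""
    return {renames.get(k, k): v for k, v in tensors.items()}

remapped = remap_keys({"new.prefix.weight": 1, "other.weight": 2}, RENAMES)
print(sorted(remapped))  # ['old.prefix.weight', 'other.weight']
```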

@jeroenvlek commented:

Back to the old prefixes. Oddly, using their weights instead of the Facebook weights yields a worse result:

depth_20240224_072209

Bilinear upsampling does improve it a bit, but I removed it because:

  • I didn't completely trust the GPT-4o-generated code, especially after I challenged GPT-4o and it changed the way strides were handled
  • To keep the PR small
  • To my knowledge, there's no default CUDA kernel that does bilinear upsampling
  • It only improves the result a bit; it doesn't fix the fundamentally blobby output

For posterity, this is with Facebook weights and CPU bilinear upscaling:

depth_20240224_072209
