RT-DETR postprocessing bug #32579
…RTDetrForObjectDetection The format of labels for RTDetrForObjectDetection was not clearly specified, leading to confusion. Added detailed comments explaining the label structure to reduce ambiguity and improve ease of use.
@dwchoo Hi thanks for bringing this issue. Initially this model was aimed for Just visualizing |
@SangbumChoi Thank you for your response. I'd like to clarify that I haven't performed any fine-tuning on the model. My concerns are based on an analysis of the existing code. I believe there's an issue with the line Here's my reasoning:
When I modify the code to Given this, I suggest that the issue isn't about visualizing a Could you please review this specific part of the code? I believe addressing this issue would improve the model's performance regardless of the |
Here is some code you can test.

import torch
import requests
from PIL import Image
import matplotlib.pyplot as plt
from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

# toothbrush coco image
url = 'http://images.cocodataset.org/train2017/000000464902.jpg'
#url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = image_processor.post_process_object_detection(
    outputs,
    target_sizes=torch.tensor([image.size[::-1]]),
    threshold=0.3,
    use_focal_loss=True)  ###### Change here True -> False

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")

Here's a visualization code you can use:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
plt.imshow(image)
ax = plt.gca()

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        label = model.config.id2label[label_id.item()]
        score = score.item()
        box = [round(i, 2) for i in box.tolist()]
        x_min, y_min, x_max, y_max = box
        rect = plt.Rectangle((x_min, y_min), x_max - x_min, y_max - y_min,
                             linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        plt.text(x_min, y_min, f'{label} {score:.2f}', color='white',
                 fontsize=12, bbox=dict(facecolor='red', alpha=0.5))

plt.axis('off')
plt.show()
Hi, @dwchoo and @SangbumChoi thanks for taking the time to dive into a problem! As far as I understand RT-DETR in the current implementation in The solution might be one of
|
@qubvel @SangbumChoi
I'm open to further discussion on this approach. What are your thoughts? |
Since the transformers-based RT-DETR follows the COCO format, I also generally agree with @dwchoo's comments (which means I agree with the current PR). Also, I think I have to correct myself about the |
@dwchoo @SangbumChoi thanks for the discussion! I've experimented a bit and fixed the ability of RT-DETR to be trained with a "void" class + cross-entropy loss for labels (and other losses too). I guess "use_focal_loss" is not the right name for the argument here; it should be more about how the model was trained and which activation function is expected to be applied. For a model that will be fine-tuned with cross-entropy loss + a "void" class, we should later apply softmax + remove the last "void" class, so I would argue not to remove I agree that it should be specified in the docs to avoid misunderstanding of the parameter. Please let me know what you think! |
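To make the two decoding paths discussed above concrete, here is a minimal sketch in plain PyTorch. It is not the library's implementation: the function names are hypothetical, and it assumes, for illustration only, that the softmax path treats the last logit channel as a "void" class and drops it. Under that assumption, a model trained without a void class can never predict its real last class through the softmax path, which is the behavior questioned in this thread.

```python
import torch


def decode_sigmoid_topk(logits, num_top_queries):
    """Focal-loss style decoding (hypothetical helper): an independent sigmoid
    per class, then a top-k over the flattened (query, class) scores."""
    _, _, num_classes = logits.shape
    scores = torch.sigmoid(logits).flatten(1)            # (batch, queries * classes)
    scores, index = torch.topk(scores, num_top_queries, dim=-1)
    labels = index % num_classes                          # class id of each hit
    query_index = index // num_classes                    # query that produced it
    return scores, labels, query_index


def decode_softmax_drop_last(logits):
    """Cross-entropy style decoding (hypothetical helper): softmax over classes,
    ASSUMING the last channel is a "void" class that must be discarded."""
    probs = torch.softmax(logits, dim=-1)[..., :-1]       # (batch, queries, classes - 1)
    scores, labels = probs.max(dim=-1)
    return scores, labels


# Toy logits: 1 image, 2 queries, 4 real classes (no void class).
# Query 0 is confident about the LAST class (id 3), query 1 about class 0.
logits = torch.tensor([[[-4.0, -4.0, -4.0, 4.0],
                        [3.0, -4.0, -4.0, -4.0]]])

scores, labels, query_index = decode_sigmoid_topk(logits, num_top_queries=2)
ce_scores, ce_labels = decode_softmax_drop_last(logits)

print("sigmoid path labels:", labels.tolist())      # class 3 is recovered
print("softmax path labels:", ce_labels.tolist())   # class 3 can never appear
```

In this toy case the sigmoid path returns the last class for query 0, while the softmax path silently replaces it with a low-scoring other class. That is consistent with the toothbrush test image shared earlier, since toothbrush is the last class in the COCO label set.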
@qubvel @SangbumChoi RT-DETR's architecture fundamentally differs from traditional YOLO models. Unlike YOLO, which uses a grid system with a fixed number of anchors necessitating a "void" class, RT-DETR employs 'Object Queries' and 'Uncertainty-minimal Query Selection' to identify and locate objects. This approach from the paper enables RT-DETR to effectively distinguish between background and objects without the need for a "void" class. Given this structural difference, the I believe that maintaining the model's original design without a "void" class aligns better with RT-DETR's architecture and intended functionality. Perhaps we could explore alternative ways to enhance the model's flexibility that are more in line with its core design principles? I'm open to further discussion and would be interested in hearing your thoughts on these points. |
I agree with this. The reason we are having a bit of confusion is a naming issue. Let me make this discussion clear. Also be aware that @qubvel added an additional loss function, which covers the multiclass vs. multilabel scenario. I have one small objection to the following statement.
We cannot be sure that RT-DETR was designed to work without a "void" class. Uncertainty-minimal Query Selection is for accurate query selection from the vanilla model, while the void class adds foreground/background classification. The effect is similar, but the two work at independent stages (query selection vs. the loss part). In conclusion, it is all about the policy of |
@SangbumChoi
The authors address foreground object detection through the Uncertainty U (Eq.2), which minimizes discrepancies between object location (P) and classification (C). This approach, integrated into the loss function (Eq.3), effectively distinguishes background from foreground without an explicit void class.
Unlike YOLO-based models that use fixed anchors to differentiate objects and background, DETR-series models, including RT-DETR, directly locate and classify objects through Transformer encoder-decoder architecture. RT-DETR advances this concept by replacing DETR's 'no object' class with the Uncertainty U mechanism. I believe optimizing RT-DETR's performance within its original framework would be more beneficial. However, I'm open to further discussion on this matter, as diverse perspectives can lead to valuable insights in our field. |
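For readers without the paper at hand, the two equations referred to above can be written roughly as follows. This is a reconstruction from the description in this comment and from the paper's notation as best recalled, so the exact symbols should be checked against the RT-DETR paper before being relied on. Here the encoder feature is denoted by X-hat, P and C are the predicted localization and classification distributions, and (b-hat, c-hat) and (b, c) are the predicted and ground-truth boxes and classes.

```latex
% Eq. (2): uncertainty as the discrepancy between localization and classification
\mathcal{U}(\hat{\mathcal{X}}) = \left\lVert \mathcal{P}(\hat{\mathcal{X}}) - \mathcal{C}(\hat{\mathcal{X}}) \right\rVert,
\qquad \hat{\mathcal{X}} \in \mathbb{R}^{D}

% Eq. (3): the uncertainty enters the classification term of the loss
\mathcal{L}(\hat{\mathcal{X}}, \hat{\mathcal{Y}}, \mathcal{Y})
  = \mathcal{L}_{box}(\hat{b}, b)
  + \mathcal{L}_{cls}\bigl(\mathcal{U}(\hat{\mathcal{X}}), \hat{c}, c\bigr)
```

Read this way, the classification term is the only place where foreground and background are separated, which is the point being made about not needing an explicit void class.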
Hi @dwchoo, thanks for your answer and for providing details from the original paper. As you mentioned RT-DETR was designed to be trained without void class, but it was also designed to use a sigmoid function in postprocessing. As far as I see all model configs in original repo specify However, in the original code, there is a cross-entropy loss function and postprocessing that may be used for it (I mean softmax + So the one option is to strictly follow the original implementation and remove all unused code, including the not used loss functions and postprocessing steps. The second option, while keeping the default implementation aligned with the original one, is to fix optional loss functions and allow the user to decide which losses to use and whether the I don't see how it may compromise RT-DETR advantages, taking into consideration that default behavior kept aligned with the original one. Please let me know what you think! |
@qubvel Could I ask for a bit more time on this? I've contacted the RT-DETR authors by email to get their direct feedback on this issue. While I'm still waiting for their response, I'm also planning to try contacting them through the original RT-DETR repository. The question of the void class seems to be quite significant for this model, and I'm eager to ensure that we're proceeding with the most accurate understanding possible. I'm hoping to clarify whether this is simply an oversight or if perhaps I've misunderstood some aspect of the model's design. I think hearing from the authors themselves will help clear things up and show us what to do next. I hope it's okay if we wait a little bit for their response. Thank you again for your patience and understanding. I'm looking forward to continuing our discussion once I have more information to share. |
@SangbumChoi @qubvel Quick update: I raised a PR with the original RT-DETR repo. The author acknowledged the issue but preferred to keep things as is for now, given the extensive changes required. That said, I think we could still improve things in the How about we:
This should clear up confusion and potential bugs. What do you think? If you're on board, I'm happy to update the PR with these changes. Looking forward to hearing your thoughts! |
@dwchoo After reviewing your conversation with the author and your PR, I think the current PR and suggestion look like a workaround. As you mentioned, I think you are right that the author discussed situations 0, 1, and 2, but they have only shared the code for solution 0. (The 2nd solution does not work properly at the moment in either the original code or this Hugging Face code.) |
@qubvel Thank you for your ongoing engagement. I'd like to propose the following based on my analysis:
This approach maintains the model's integrity while addressing the specific issue at hand. I believe it offers a good balance between fixing the problem and minimizing changes to the codebase. I'm looking forward to your thoughts on this proposal. |
Hi @dwchoo, thanks a lot for working on this and for your structured responses, I really appreciate this discussion and hope it will also be useful for the community members who will find it 🤗 Here are my thoughts on these points:
The RTDetrHungarianMatcher is an extended version of DetrHungarianMatcher, and it can handle the void class the same way (see transformers/src/transformers/models/detr/modeling_detr.py, lines 2190 to 2200 in 132e875).
As you can see in this PR, not too many changes have to be made to add a void class; it's more about fixing already existing code. I was able to conduct some experiments with it without any issues.
Agreed, that we are addressing the postprocessing problem, however, it also touches the main functionality because each model has its own pre- and post-processing pipelines that can change obtained results significantly.
We, potentially, can remove the If you have the bandwidth and agree with my thoughts, you can handle renaming + deprecating of the argument in this PR to keep your contribution to the library. Please let me know what you think 🤗 |
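As a rough illustration of what "renaming + deprecating" an argument could look like, here is a minimal, standalone sketch. Everything in it is an assumption made for illustration: the function `resolve_activation` and the replacement argument name `activation` are hypothetical and not part of transformers, and the sigmoid default only reflects the statement in this thread that the original model uses a sigmoid in postprocessing.

```python
import warnings


def resolve_activation(activation=None, use_focal_loss=None):
    """Hypothetical helper: map the old boolean flag onto a new, more
    descriptive argument while warning callers that the old one is going away."""
    if use_focal_loss is not None:
        warnings.warn(
            "`use_focal_loss` is deprecated in this sketch; pass `activation` instead.",
            FutureWarning,
        )
        # The new argument wins if both are given.
        if activation is None:
            activation = "sigmoid" if use_focal_loss else "softmax"

    if activation is None:
        activation = "sigmoid"  # assumed default, matching the original behavior

    if activation not in ("sigmoid", "softmax"):
        raise ValueError(f"Unknown activation: {activation!r}")
    return activation


print(resolve_activation())                      # default path
print(resolve_activation(activation="softmax"))  # new-style call
```

Old call sites that still pass `use_focal_loss=False` keep working and get a `FutureWarning`, which gives users a release cycle to migrate before the old argument is removed.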
@qubvel , I apologize for the delayed reply. I believe your perspective is correct. Given the mix of various opinions, it might be beneficial to clarify and organize our approach. I would greatly appreciate your guidance on how to proceed with the PR modifications (in conjunction with #32658). May I suggest summarizing our approach as follows?
Could you please confirm if this summary aligns with your vision, or if there are any adjustments needed? Thank you for your patience and guidance throughout this process. |
What does this PR do?
This PR addresses two important issues:
- post_process_object_detection method of RTDetrImageProcessor for the RT-DETR model.
- labels parameter in RTDetrForObjectDetection.

Fixes #32578 (issue)
Who can review?
@amyeroberts
Improve RT-DETR documentation: Clarify bounding box format for labels
Current Documentation
The bounding box format expected by the labels parameter is not explicitly stated in the RTDetrForObjectDetection documentation.
Missing Information
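The labels parameter requires bounding boxes in the following format:
- Coordinates given as (center_x, center_y, width, height)
- Values normalized to the range [0, 1]

This information is crucial for correctly calculating the loss but is currently missing from the documentation.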
Proposed Solution
Add the following clarification to the documentation for the labels parameter in RTDetrForObjectDetection:
"The bounding box coordinates in the 'boxes' key should be in the format (center_x, center_y, width, height) and have normalized values in the range [0, 1]."
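As an illustration of the format described above, the following sketch converts pixel-space corner boxes (x_min, y_min, x_max, y_max) into normalized (center_x, center_y, width, height) boxes. The helper name is hypothetical, and the 'class_labels' key and the class id used below are assumptions based on the usual DETR-style label dictionaries; only the 'boxes' key and its format come from the text above.

```python
import torch


def xyxy_pixels_to_normalized_cxcywh(boxes, image_width, image_height):
    """Convert (x_min, y_min, x_max, y_max) pixel boxes of shape (N, 4) into
    (center_x, center_y, width, height) boxes normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = boxes.unbind(-1)
    center_x = (x_min + x_max) / 2 / image_width
    center_y = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return torch.stack([center_x, center_y, width, height], dim=-1)


# One box on a 640x480 image, given as pixel corners.
pixel_boxes = torch.tensor([[100.0, 50.0, 300.0, 250.0]])
normalized_boxes = xyxy_pixels_to_normalized_cxcywh(pixel_boxes, 640, 480)

# Assumed label structure: one dict per image in the batch.
labels = [{
    "class_labels": torch.tensor([17]),  # arbitrary example class id
    "boxes": normalized_boxes,           # (center_x, center_y, width, height) in [0, 1]
}]
print(labels[0]["boxes"])
```

Passing pixel-space or corner-format boxes instead would not raise an error, but the box losses would be computed against the wrong targets, which is why stating the format in the docstring matters.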
Impact
Adding this information will significantly improve the user experience by:
Additional Notes
Related Links