Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Issue dedup relevance #26

Merged

Conversation

sshivaditya2019
Copy link
Collaborator

@sshivaditya2019 sshivaditya2019 commented Oct 1, 2024

Resolves #25

  • Developed a new similarity search function, uses l2 distance (Euclidean Distance) now.

@sshivaditya2019 sshivaditya2019 marked this pull request as draft October 1, 2024 20:56
@sshivaditya2019 sshivaditya2019 changed the title feat: added edit distance re-ranker fix: Issue dedup relevance Oct 2, 2024
@sshivaditya2019 sshivaditya2019 marked this pull request as ready for review October 2, 2024 04:22
@sshivaditya2019
Copy link
Collaborator Author

QA Using Euclidean Distance

Using 90% similarity for match and 75% for warning works best right now.

@0x4007
Copy link
Member

0x4007 commented Oct 2, 2024

Why did you use Euclidean Distance?

I don't have a lot of deep context on the problem you are solving, but the last time I did similar work (years ago) I used dice's coefficient which worked quite well.

Can you implement and compare performance?


Just read wikipedia and asked ChatGPT. Both suggest Dice is more appropriate for the task.

matchRepoOrgToSimilarIssueRepoOrg(payload.repository.owner.login, issue.node.repository.owner.login, payload.repository.name, issue.node.repository.name)
)
.map((issue) => {
const modifiedUrl = issue.node.url.replace("github.com", "www.github.com");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const modifiedUrl = issue.node.url.replace("github.com", "www.github.com");
const modifiedUrl = issue.node.url.replace("https://github.com", "https://www.github.com");

Seems more precise.

@sshivaditya2019
Copy link
Collaborator Author

Why did you use Euclidean Distance?

It represents the straight-line distance between any two points (vectors) on the plane. Similar vectors tend to cluster closely together, while unrelated vectors are positioned further apart.

I don't have a lot of deep context on the problem you are solving, but the last time I did similar work (years ago) I used dice's coefficient which worked quite well.

Can you implement and compare performance?

Just read wikipedia and asked ChatGPT. Both suggest Dice is more appropriate for the task.

I believe that to compute the Dice coefficient, we need to generate bit vectors and perform operations on them. However, I tried using a similar metric, Jaccard similarity, and it didn't yield good results. These metrics primarily evaluate the number of bits or elements that intersect between the two sets. Since we're using unit normalized vectors (length 1), most vector elements will range between 0 and 1, making the Dice coefficient less suitable, as it is better suited for binary or categorical data rather than continuous data.

@0x4007
Copy link
Member

0x4007 commented Oct 2, 2024

4o:

I believe that to compute the Dice coefficient, we need to generate bit vectors and perform operations on them.

Right, I see the challenge with Dice on our continuous data. Since we’ve already tried Cosine Similarity and are now testing Euclidean Distance, it’s good to explore different approaches.

Cosine focuses on vector direction, whereas Euclidean captures the magnitude differences. Both have their merits depending on the nature of the vectors (angle vs distance). If Cosine wasn’t optimal in previous attempts, I’m curious about how Euclidean compares in performance on this data.

That said, we might also want to experiment with Manhattan Distance or even Minkowski Distance as a middle ground between Euclidean and Cosine. They could provide further insight into how distance measures impact our similarity task.

@sshivaditya2019
Copy link
Collaborator Author

I experimented with Manhattan distance (also known as Taxi Cab Distance or L1 Distance) and found that it produced results similar to L2 Distance (Euclidean Distance). One advantage of L2 Distance is that, when normalized, the distance value can be constrained within a unit circle around the vector point, meaning it often falls between 0 and 1. Both Euclidean and Manhattan distances are specific cases of Minkowski Distance, where L1 corresponds to Manhattan and L2 corresponds to Euclidean

Source

@sshivaditya2019
Copy link
Collaborator Author

sshivaditya2019 commented Oct 2, 2024

Updated the new UI for similar issues message. It would not create a new comment; instead, it would edit the issue specification itself to include details about similar issues.

QA: Issue

@0x4007
Copy link
Member

0x4007 commented Oct 2, 2024

It would not edit inside the issue specification itself without creating a new comment:

What does that mean? I can edit anybody's specification with collaborator permissions. We just need to ensure the app has sufficient permissions I'm sure.

@sshivaditya2019
Copy link
Collaborator Author

It would not edit inside the issue specification itself without creating a new comment:

What does that mean? I can edit anybody's specification with collaborator permissions. We just need to ensure the app has sufficient permissions I'm sure.

Sorry, I meant to say that it wouldn’t create a comment; instead, it would edit directly within the issue specification.

src/handlers/issue-deduplication.ts Outdated Show resolved Hide resolved
src/handlers/issue-deduplication.ts Outdated Show resolved Hide resolved
src/handlers/issue-deduplication.ts Outdated Show resolved Hide resolved
@0x4007
Copy link
Member

0x4007 commented Oct 3, 2024

I made changes here but why is there no build CI?

0x4007 added a commit to ubiquity-os/plugin-template that referenced this pull request Oct 3, 2024
@0x4007 0x4007 merged commit bb7f7d6 into ubiquity-os-marketplace:development Oct 3, 2024
2 checks passed
@ubiquity-os ubiquity-os bot mentioned this pull request Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Similar Issues Adjustments
3 participants