feat(SemanticSearch): Added a new semantic search agent that uses fuzzy string matching and Levenshtein distance. #103

Open
wants to merge 2 commits into
base: master

Hero2323

Added a new Semantic Search Agent that can be used as follows:

atarashi -a SemanticSearch /path/to/file.c
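
For readers unfamiliar with the metric named in the title, here is a minimal sketch of Levenshtein distance and a similarity score normalized from it. It is illustrative only and is not the agent's code: the function names `levenshtein` and `similarity` are hypothetical, and the agent itself calls a fuzzy-matching library's `fuzz.ratio`, whose exact scoring may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 100.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

A comment block extracted from a source file would be scored against each known license text this way, and the highest-scoring license reported.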

…. The project was not building without this update, using the same package values specified.
…zy string matching and levenshtein distance.
Member

@Kaushl2208 left a comment

The changes look good, but they need tests!

Maybe we can add the agent to the Build and Test stage?

@Kaushl2208 added the "need test" and "GSoC-24" (Pull request submitted under Google Summer Of Code 2024) labels on Oct 24, 2024
Member

@Kaushl2208 left a comment

@Hero2323, I tested it and found some issues :)
Also, please update the README to explain how to use the SemanticSearch agent (the processedLicenseList flag, etc.).
Please take a look.

Comment on lines +39 to +40
def __init__(self, licenseList):
    super().__init__(licenseList)

Suggested change
- def __init__(self, licenseList):
-     super().__init__(licenseList)
+ def __init__(self, licenseList, verbose=0):
+     super().__init__(licenseList)

If verbose output is planned: the verbose input flag is defined, but the constructor does not accept it, so passing it is likely to throw an error.
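
A minimal sketch of the point above, assuming the base class accepts an optional `verbose` argument. `BaseAgent` is a hypothetical stand-in, not Atarashi's actual base class. The subclass here also forwards `verbose` to `super().__init__`, which the suggestion does not show; whether that is required depends on how the real base class stores the flag.

```python
class BaseAgent:
    """Hypothetical stand-in for the agent base class."""

    def __init__(self, licenseList, verbose=0):
        self.licenseList = licenseList
        self.verbose = verbose


class SemanticSearch(BaseAgent):
    def __init__(self, licenseList, verbose=0):
        # Accept the flag and forward it, so callers that pass verbose do not
        # hit a TypeError for an unexpected argument.
        super().__init__(licenseList, verbose)


agent = SemanticSearch("processedLicenses.csv", verbose=1)
print(agent.verbose)
```

Keeping `verbose=0` as the default means existing callers that pass only `licenseList` continue to work.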

fuzzy_similarity_matrix_2 = np.zeros(len(self.licenseList))
for i in range(len(self.licenseList)):
    fuzzy_similarity_matrix_2[i] = fuzz.ratio(appended_comment, self.licenseList.loc[i, 'text'])
    if pd.notna(licenseList.loc[i, 'license_header']):

Suggested change
- if pd.notna(licenseList.loc[i, 'license_header']):
+ if pd.notna(self.licenseList.loc[i, 'license_header']):

The bare licenseList variable is not accessible in this scope; it has to be referenced as self.licenseList.
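
To illustrate why the bare name fails, here is a small sketch. Inside a method, a constructor argument that was stored on the instance is reachable only through `self`. The class `Scorer`, its sample data, and the use of `difflib` in place of `fuzz.ratio` are assumptions made to keep the example dependency-free; the license list is modeled as a list of dicts rather than a DataFrame.

```python
from difflib import SequenceMatcher


class Scorer:
    def __init__(self, licenseList):
        self.licenseList = licenseList  # the only place the list is stored

    def broken(self, comment):
        # Bare name: neither a local nor a global, so this raises NameError.
        return [row["license_header"] for row in licenseList]

    def fixed(self, comment):
        scores = []
        for row in self.licenseList:
            header = row["license_header"]
            if header is not None:  # plays the role of pd.notna(...)
                scores.append(100 * SequenceMatcher(None, comment, header).ratio())
        return scores


scorer = Scorer([{"license_header": "MIT License"}, {"license_header": None}])
try:
    scorer.broken("MIT License")
except NameError as err:
    print("broken:", err)
print("fixed:", scorer.fixed("MIT License"))
```

The error only surfaces when that line actually runs, which is one reason a test covering the header comparison would be useful.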

args = parser.parse_args()

inputFile = args.inputFile
licenseList = args.processedLicenseList

Can we also make it more robust? If the user doesn't provide processedLicenseList, the agent should fall back to the list we already ship :)

Something like:

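from pkg_resources import resource_filename

# Path to the processed license list shipped with the package
defaultProcessed = resource_filename("atarashi",
                                     "data/licenses/processedLicenses.csv")

if processedLicense is None:
    processedLicense = defaultProcessed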
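
The fallback could be wired into argument parsing roughly as follows. This is a sketch under assumptions: the option spelling `--processedLicenseList` and the positional `inputFile` are inferred from the `args` attributes in the snippet under review, `DEFAULT_PROCESSED` is a hypothetical placeholder for the path that `resource_filename` would return, and `parse` is an illustrative helper rather than a function in the PR.

```python
import argparse

# Hypothetical placeholder; in the package this would come from resource_filename(...).
DEFAULT_PROCESSED = "data/licenses/processedLicenses.csv"


def parse(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("inputFile", help="file to scan")
    parser.add_argument("--processedLicenseList", default=None,
                        help="processed license list CSV (optional)")
    args = parser.parse_args(argv)
    # Fall back to the bundled list when the user did not supply one.
    licenseList = args.processedLicenseList
    if licenseList is None:
        licenseList = DEFAULT_PROCESSED
    return args.inputFile, licenseList


print(parse(["file.c"]))
print(parse(["file.c", "--processedLicenseList", "custom.csv"]))
```

Resolving the default after parsing, rather than passing it as the argparse `default`, keeps it possible to tell whether the user supplied a value.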
