[Task]: Spotting commands in the stream from coding assistants like cline #844
Comments
Initial implementation in https://github.com//pull/917 Note that:
Accuracy: 0.88
This was reverted in #930 because it caused the runner to run out of space. See the Slack discussion.
Another PR, #931, was created to fix the build space problem.
This should be closed by #931.
Reopened this (again) and disabled the suspicious commands (in #1204). We need to restrict more tightly the context in which this runs.
After discussion, we need this merged first: https://github.com/jhrozek/codegate-open/blob/51dfd5e50f50e2a9b5deb61afcc52297872520bc/src/codegate/pipeline/functions/output.py#L53 (and possibly to gather some of the tool semantics). We will reconvene when that is done. Added @jhrozek as an assignee to flag when this is ready. We also need to look at the top N MCP servers to see what tool parameters they support. We need to support 'experimental' flags. This is not specific to this case, but it would allow curious folk to switch features on and off. This does not block the accuracy work, which can proceed in parallel. CC @lukehinds, @poppysec, @blkt
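One way the 'experimental' flags mentioned above could work is a comma-separated environment variable that names the features to switch on. This is only a sketch: the variable name CODEGATE_EXPERIMENTAL, the feature name suspicious_commands, and the helper itself are hypothetical and do not exist in codegate.

```python
import os
from typing import Mapping, Optional


def experimental_enabled(feature: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if `feature` is listed in CODEGATE_EXPERIMENTAL.

    CODEGATE_EXPERIMENTAL is a hypothetical variable name used for
    illustration only; it holds a comma-separated list of feature names.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CODEGATE_EXPERIMENTAL", "")
    # Normalise entries so " Suspicious_Commands " matches "suspicious_commands".
    enabled = {name.strip().lower() for name in raw.split(",") if name.strip()}
    return feature.strip().lower() in enabled


# Usage: the detection step would only run when the flag is set.
if experimental_enabled("suspicious_commands"):
    print("suspicious command detection is on")
```

Defaulting to off keeps the feature out of normal runs, which fits the need to restrict where the check is applied.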
Description
We have done some work to spot suspicious commands in #34. The task here is to write this code into codegate. This involves
Extensions for the future
We will probably have to intercept the commands at
and write the comment back at
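As a rough illustration of spotting commands in a stream, the sketch below buffers streamed text chunks, splits them into complete lines, and yields the lines found inside fenced shell blocks. It is not the codegate pipeline: the function name, the list of shell language tags, and the assumption that commands arrive inside fenced blocks are all illustrative.

```python
from typing import Iterable, Iterator

FENCE = "`" * 3  # Markdown code fence marker
# Assumed set of language tags that indicate a shell block (illustrative).
SHELL_LANGS = {"bash", "sh", "shell", "zsh", "console"}


def commands_in_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield non-empty lines that appear inside fenced shell blocks.

    Chunks may split a line (or a fence) at any point, so text is
    buffered until a newline completes the line.
    """
    buffer = ""
    block = None  # None = outside a fence, "shell" or "other" = inside one
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            stripped = line.strip()
            if stripped.startswith(FENCE):
                if block is None:
                    lang = stripped[len(FENCE):].strip().lower()
                    block = "shell" if lang in SHELL_LANGS else "other"
                else:
                    block = None  # closing fence
            elif block == "shell" and stripped:
                yield stripped
```

Each yielded line could then be passed to the classifier. Note that a final line without a trailing newline stays in the buffer and is not yielded; a real implementation would need to flush it when the stream ends.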
As a baseline we decided to use hybrid-all-MiniLM-L6-v2, with post-processing by a small ANN. We didn't want the extra cost of codebert, but the local ANN seems to produce some benefit.
Additional Context
We need to decide which model to use for the embeddings. all-minilm-L6-v2 works well, especially with an ANN post-processing step. It is already in codegate, so we get it for free. microsoft/codebert-base works better, as expected, but at a cost of 476 MB.
The ANNs are much smaller:
ls -lh | grep hybrid
-rw-r--r-- 1 nigel staff 228K 29 Jan 18:21 hybrid-all-MiniLM-L6-v2.model
-rw-r--r-- 1 nigel staff 420K 29 Jan 18:21 hybrid-microsoft-codebert-base.model
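The embedding-plus-ANN approach described above can be sketched as a small feed-forward head that turns a sentence embedding into a score. This is a minimal sketch, not the model in the files listed: the single hidden layer, its size, and the random placeholder weights are assumptions, and the embedding is expected to come from elsewhere (all-MiniLM-L6-v2 produces 384-dimensional vectors).

```python
import math
import random
from typing import List, Sequence

EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2


class TinyHead:
    """One hidden ReLU layer followed by a sigmoid output.

    The weights are random placeholders; a real head would load
    trained weights instead.
    """

    def __init__(self, in_dim: int, hidden_dim: int = 16, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1: List[List[float]] = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)
        ]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def score(self, embedding: Sequence[float]) -> float:
        """Return a value in (0, 1); higher would mean 'more suspicious'."""
        if len(embedding) != len(self.w1[0]):
            raise ValueError("embedding has the wrong dimension")
        hidden = [
            max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


head = TinyHead(EMBEDDING_DIM)
print(head.score([0.0] * EMBEDDING_DIM))
```

Because the head only sees a fixed-size vector, it stays small regardless of which embedding model sits in front of it, which is consistent with the file sizes shown above.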