
Refactor db #16

Closed

Conversation

@Keyrxng (Contributor) commented Sep 18, 2024

Resolves #13
Resolves #15

  • Combined embedding storage into a single table
  • Combined the adapters into a single embeddings entity
  • Added mentlegen's proxy with my types to get rid of the need for lots of type guards for each webhook callback
  • No longer storing body > title; to make up for that, simSearch now runs on the title and body separately (see the sketch below)
  • Refactored tests to use the mswjs DB and handlers, as other plugins tend to do (mocking modules kills me for some reason, but the DB and handlers I understand 😂)
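
For reviewers, a minimal sketch of the shape this refactor aims for, assuming a Supabase-backed adapter. The table name "embeddings", the RPC "find_similar_embeddings", and the column names are illustrative assumptions, not the PR's actual schema:

```typescript
// Illustrative only: a single adapter backed by one "embeddings" table,
// querying the title and body separately and merging the results.
import { SupabaseClient } from "@supabase/supabase-js";

interface SimilarRow {
  id: string;
  similarity: number;
}

export class Embeddings {
  constructor(
    private supabase: SupabaseClient,
    private embed: (text: string) => Promise<number[]> // e.g. an embedding API call
  ) {}

  async store(id: string, type: "comment" | "spec", plaintext: string) {
    const embedding = await this.embed(plaintext);
    // One table for every embedding source, discriminated by `type`.
    return await this.supabase.from("embeddings").insert({ id, type, plaintext, embedding });
  }

  async simSearch(title: string, body: string, threshold = 0.8): Promise<SimilarRow[]> {
    // Title and body are embedded and matched separately, then merged,
    // since the concatenated "body > title" text is no longer stored.
    const [byTitle, byBody] = await Promise.all([this.matchOne(title, threshold), this.matchOne(body, threshold)]);
    const best = new Map<string, SimilarRow>();
    for (const row of [...byTitle, ...byBody]) {
      const existing = best.get(row.id);
      if (!existing || row.similarity > existing.similarity) best.set(row.id, row);
    }
    return [...best.values()].sort((a, b) => b.similarity - a.similarity);
  }

  private async matchOne(text: string, threshold: number): Promise<SimilarRow[]> {
    const embedding = await this.embed(text);
    const { data } = await this.supabase.rpc("find_similar_embeddings", { query_embedding: embedding, threshold });
    return (data ?? []) as SimilarRow[];
  }
}
```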

QA:

Ignore row id 8 with the "::"; I added the body > title logic and removed it immediately.

I have another pull request with the setup_instructions logic included; it was left out as it was not in scope for this PR.

[screenshot of the embeddings table rows]

@Keyrxng requested a review from 0x4007 · September 18, 2024 00:08
@Keyrxng mentioned this pull request · Sep 18, 2024
@gentlementlegen (Member) left a comment

I am fine with the code changes themselves, but I am not sure what to test here. @Keyrxng could you give me a use case?

@Keyrxng (Contributor, Author) commented Sep 21, 2024

> I am fine with the code changes themselves, but I am not sure what to test here. @Keyrxng could you give me a use case?

I'm unsure what you mean by a use case. Do you mean practically working with the embeddings in unison with an LLM?

If I were testing this PR, I'd ensure that the comments and specs are being stored properly. You could create embeddings for a spec or a readme, then create a query embedding and see how fruitful the results are with that alone. Chatbot/embedding effectiveness, though, is I'd think somewhat out of scope here; the intention of the PR was to make it easier to work with the embeddings from more than one angle.

The class system should make it easier for us to refine our embedding scope before we compare; that way we improve its accuracy, because we can remove entire "sections" of data and have it consider only what we tell it to. The infra isn't in place yet, of course, but with the class system, if for example we install a chatbot on ubq.fi (easy to do and good value for money), it should be scoped to high-level org content and guardrailed.

Moreover, the class system can be used as a weighted tier system for prioritizing certain sections over others under certain conditions, sketched below.
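
To make that concrete, a purely hypothetical sketch of class-scoped, weighted ranking; none of these class names, weights, or helpers exist yet:

```typescript
// Hypothetical sketch: filter matches down to the classes a caller is scoped to,
// then apply per-class weights before ranking. Names are illustrative only.
type EmbeddingClass = "org" | "repo" | "issue" | "comment" | "setup_instructions";

interface Match {
  id: string;
  class: EmbeddingClass;
  similarity: number;
}

export function rankMatches(
  matches: Match[],
  allowedClasses: EmbeddingClass[],
  weights: Partial<Record<EmbeddingClass, number>> = {}
): Match[] {
  return matches
    // Guardrail: drop entire "sections" of data the caller did not ask for,
    // e.g. an org-level chatbot scoped to ["org", "repo"] only.
    .filter((m) => allowedClasses.includes(m.class))
    // Weighted tiers: prioritize certain classes over others.
    .map((m) => ({ ...m, similarity: m.similarity * (weights[m.class] ?? 1) }))
    .sort((a, b) => b.similarity - a.similarity);
}

// Example: an ubq.fi chatbot scoped to high-level org content.
// rankMatches(results, ["org", "repo"], { org: 1.2 });
```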

#18 introduces logic that will gather readmes, which would give us something we could effectively test: a well-rounded sort of chatbot with actionable milestones for confirming its effectiveness.

I'm not sure whether I answered your question; please let me know.

@gentlementlegen (Member) commented

@Keyrxng thank you for the clarification. To ask simply, I wondered what I should be looking for when testing: comments that contain HTML should be stripped of their tags and stored as raw text in the DB, is that correct?

@0x4007 (Member) commented Sep 21, 2024

I think we should store both the full source markup and the stripped plain text, and compare the results.

But eventually we should settle on one way, and I'm pretty sure it's going to end up being the full markup.
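
A minimal sketch of storing both variants for comparison, assuming a markdown-it based stripper; the helper and record shapes here are illustrative, and the actual implementation in src/adapters/utils/markdown-to-plaintext.ts may differ:

```typescript
// Illustrative sketch: keep the full markup alongside a stripped plaintext
// variant so search quality over the two can be compared.
import MarkdownIt from "markdown-it";

const md = new MarkdownIt({ html: true });

export function markdownToPlaintext(source: string): string {
  // Render markdown (including inline HTML) to HTML, then drop the tags
  // and collapse whitespace so only the raw text remains.
  const html = md.render(source);
  return html
    .replace(/<[^>]*>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical storage record carrying both variants of a comment body.
export function toStorageRecord(commentBody: string) {
  return {
    markup: commentBody, // full source markup
    plaintext: markdownToPlaintext(commentBody), // stripped text
  };
}
```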

Resolved review threads: src/adapters/utils/markdown-to-plaintext.ts · src/handlers/create-comment-embedding.ts
@Keyrxng (Contributor, Author) commented Sep 21, 2024

> I think we should store both the full source markup and the stripped plain text

I agree, I think we ought to refer to others in the space and see how they are doing it.

> Data preparation
>
> Chunking the application sources into smaller parts is a non-trivial task. In general, functions, class methods, structs, enums, and all the other language-specific constructs are good candidates for chunks. They are big enough to contain some meaningful information, but small enough to be processed by embedding models with a limited context window. Docstrings, comments, and other metadata can be used to enrich the chunks with additional information.
>
> Parsing the codebase
>
> While our example uses Rust, you can use our approach with any other language. You can parse code with a Language Server Protocol (LSP) compatible tool. You can use an LSP to build a graph of the codebase, and then extract chunks. We did our work with the rust-analyzer. We exported the parsed codebase into the LSIF format, a standard for code intelligence data. Next, we used the LSIF data to navigate the codebase and extract the chunks. For details, see our code search demo.

Everyone does something similar but different, although they all chunk their codebase and split it into semantically effective sections. So we need to determine how we are going to prep our codebase(s) for chunking: whether we'll do it all manually or involve some other recommended tooling.

I've mentioned this before, but we need to effectively parse multiple different programming languages. We have a top five that we use within our org, so we should aim to cover at least those with whatever implementation we go with; others should not be considered a priority imo unless whatever tools we use easily support multiple languages.
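
As a starting point for that discussion, a naive and purely illustrative sketch of function-level chunking for TypeScript sources; a real implementation would lean on an LSP or a proper parser rather than a regex, and every name here is hypothetical:

```typescript
// Hypothetical regex-based chunker: splits a source file into per-declaration
// chunks small enough for an embedding model's context window.
interface CodeChunk {
  filePath: string;
  name: string;
  content: string;
}

export function chunkSource(filePath: string, source: string): CodeChunk[] {
  const chunks: CodeChunk[] = [];
  // Very rough heuristic: treat top-level function/class declarations as chunk boundaries.
  const declaration = /^(?:export\s+)?(?:async\s+)?(?:function|class)\s+(\w+)/;
  const lines = source.split("\n");
  let current: { name: string; start: number } | null = null;

  for (let i = 0; i < lines.length; i++) {
    const match = declaration.exec(lines[i]);
    if (!match) continue;
    if (current) {
      chunks.push({ filePath, name: current.name, content: lines.slice(current.start, i).join("\n") });
    }
    current = { name: match[1], start: i };
  }

  if (current) {
    chunks.push({ filePath, name: current.name, content: lines.slice(current.start).join("\n") });
  }
  return chunks;
}
```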

@Keyrxng (Contributor, Author) commented Sep 30, 2024

[screenshot]

I can't help but notice that the last two features did not come with tests included, despite being quite pricey in terms of reward. I thought we were enforcing that new features extend the test suites to at least cover the basic added functionality?

I'd appreciate being included in code review for AI plugins/features going forward if that is alright with the team.

Successfully merging this pull request may close these issues: "Handle HTML input", "Store issue if not stored already".