-
Notifications
You must be signed in to change notification settings - Fork 1
feat: Add full-text search support via PostgreSQL to_tsquery syntax #23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit adds support for full-text search (FTS) in ChronDB using PostgreSQL's to_tsquery syntax. The implementation includes: - Support for @@ to_tsquery syntax in WHERE clauses - Integration with Lucene index for FTS operations - Mapping of PostgreSQL text search operators to Lucene queries - Proper handling of FTS conditions in query execution - Support for multiple FTS conditions with AND/OR operators The changes allow users to perform full-text searches using familiar PostgreSQL syntax while leveraging Lucene's powerful search capabilities under the hood. Key files modified: - src/chrondb/api/sql/parser/clauses.clj: Added FTS condition parsing - src/chrondb/api/sql/execution/query.clj: Added FTS query handling - src/chrondb/index/lucene.clj: Enhanced Lucene index for FTS - src/chrondb/index/protocol.clj: Updated index protocol for FTS Signed-off-by: Avelino <[email protected]>
|
Caution Review failedThe pull request is closed. 📝 WalkthroughWalkthroughThis pull request integrates full-text search (FTS) support into the SQL execution module by introducing a function to detect FTS conditions and enhancing query handling logic. It updates the SQL parser, server startup, and API search functions to accommodate an additional index parameter and normalization of search terms. Furthermore, it overhauls the Lucene and memory index implementations to improve logging, error handling, and functionality, while also adding comprehensive tests for these components along with new tests for Git storage operations. Changes
Possibly related PRs
Suggested labels
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
This commit adds support for PostgreSQL-style to_tsquery syntax in SQL WHERE clauses,
allowing users to write queries like:
SELECT * FROM documents WHERE content @@ to_tsquery('query')
The implementation:
- Parses to_tsquery expressions in WHERE clauses
- Maps them to Lucene full-text search
- Maintains compatibility with existing FTS_MATCH syntax
- Updates API endpoints to use consistent field/branch parameters
This change improves SQL compatibility and provides a more familiar syntax
for PostgreSQL users.
fix: lint issues in clauses.clj
Signed-off-by: Avelino <[email protected]>
This commit updates the Lucene index tests to expect document IDs instead of full documents in search results. This change aligns with the index protocol specification and makes the tests more focused on the core indexing functionality. The changes include: - Modifying test assertions to check for document IDs instead of full documents - Updating test cases for basic operations, document updates, and nil value handling - Maintaining test coverage while improving test clarity This refactoring helps ensure consistent behavior across different index implementations and makes the tests more maintainable. Signed-off-by: Avelino <[email protected]>
This commit adds support for PostgreSQL's to_tsquery syntax in the Lucene index implementation. The changes include: - Added to_tsquery parsing and conversion to Lucene query syntax - Updated memory index tests to use consistent formatting - Added test cases for to_tsquery functionality - Improved error handling for invalid query syntax The implementation allows users to use PostgreSQL-style text search queries while maintaining compatibility with existing Lucene search functionality. Signed-off-by: Avelino <[email protected]>
The commit simplifies the test case for git commit errors by using the `_` wildcard for unused arguments instead of `args`. This makes the code more concise and follows Clojure conventions for ignoring unused parameters. Signed-off-by: Avelino <[email protected]>
This commit adds comprehensive documentation for ChronDB's full-text search capabilities using PostgreSQL's to_tsquery syntax. The documentation covers: - Basic FTS syntax and examples - How the search works internally - Field optimization with _fts suffix - Comparison with FTS_MATCH function The documentation is added to examples-postgresql.md and linked in SUMMARY.md. Signed-off-by: Avelino <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (11)
src/chrondb/example.clj (1)
13-13: Improved code readability by removing extraneous blank linesRemoval of unnecessary blank lines improves code readability without affecting functionality.
Also applies to: 38-38, 48-48, 53-53, 57-57, 60-60, 65-65, 67-67
src/chrondb/api/v1/routes.clj (3)
68-71: The TODO comments should be considered for implementationThe TODO comments indicate plans to allow specifying search fields and branches via query parameters, which would make the API more flexible. Consider implementing these features as they align well with the full-text search capabilities being added.
Also, I notice the comments on lines 70-71 are in Portuguese while the rest of the codebase is in English. For consistency, consider translating these comments to English.
72-73: Fix indentation in the response callThere's a slight indentation issue in the response/response call that should be fixed for code readability and consistency.
(response/response - {:results (index/search index default-field query default-branch)}))) + {:results (index/search index default-field query default-branch)})))
61-73: Update function documentation to reflect new parametersThe function docstring doesn't mention the new parameters being passed to the
index/searchfunction. Consider updating it to mention the default field and branch being used, as this is important for users to understand the search behavior.(defn handle-search "Handles document search requests. Parameters: - index: The index implementation - query: The search query string + + Internally, the function uses 'content' as the default search field and 'main' as the default branch. Returns: HTTP response with search results" [index query]src/chrondb/api/sql/parser/statements.clj (1)
35-42: Consider moving table spec cleaning to a helper functionThe table specification cleaning logic is duplicated across all four SQL statement parsers. To follow the DRY principle, consider extracting this logic into a helper function.
+ (defn- clean-table-spec + "Cleans a table specification by removing trailing semicolons and spaces. + Returns [schema table-name] tuple." + [raw-table-spec] + (when raw-table-spec + (let [cleaned-spec (-> raw-table-spec + (str/replace #";$" "") + (str/trim)) + parts (str/split cleaned-spec #"\.")] + (if (= (count parts) 2) + [(first parts) (str/trim (second parts))] + [nil cleaned-spec])))) ;; Then in each parser: (let [raw-table-spec (when (and from-index (> from-index 0)) ...) - [schema table-name] (when raw-table-spec - (let [cleaned-spec (-> raw-table-spec - (str/replace #";$" "") - (str/trim)) - parts (str/split cleaned-spec #"\.")] - (if (= (count parts) 2) - [(first parts) (str/trim (second parts))] - [nil cleaned-spec]))) + [schema table-name] (clean-table-spec raw-table-spec) ...]Also applies to: 71-78, 125-132, 176-183
src/chrondb/api/sql/server.clj (1)
10-11: Ensure the docstring clarifies argument types.
While the docstring mentionsindex-or-port, consider specifying that if it’s a numeric value, it’s treated as the port; otherwise, it’s an index implementation. This helps clarify function usage for new developers.- - index-or-port: Either an index implementation OR a port number (default: 5432) + - index-or-port: Either an index implementation (Lucene or custom) OR a port number (default: 5432)src/chrondb/api/sql/parser/clauses.clj (1)
74-145: Robust FTS parsing with fallback warnings.
- Parsing logic for
FTS_MATCH(...)andfield @@ to_tsquery(...)is well-structured, including logs for debugging.- Consider explicitly handling OR conditions in the future, as they’re currently skipped or assumed as AND.
src/chrondb/index/lucene.clj (4)
15-41: Be cautious about double-appending_fts.
When the field name itself already ends with_fts, the code may generate a second_ftssuffix, resulting in verbose fields liketitle_fts_fts. If that’s unintended, you could skip adding_ftsa second time.Consider applying a safeguard like:
- (when (or (= field-name "name") - (= field-name "description") - (= field-name "content") - (= field-name "text") - (= field-name "location") - (str/ends-with? field-name "_fts")) + (when (or (and (#{ "name" "description" "content" "text" "location"} field-name)) + (and (str/ends-with? field-name "_fts") + (not (str/ends-with? field-name "_fts_fts"))))
43-49: Remove or document the forgottendoc->mapfunction.
Since this function is commented out, consider removing it entirely or adding comments explaining future use cases to reduce accidental clutter.
60-65: Check concurrency implications forreader-atomandsearcher-atom.
Multiple threads might update these atoms simultaneously, risking transient read/write issues. Evaluate if additional synchronization or a single-threaded index writer policy is necessary.
92-152: Leading wildcard usage may cause performance challenges.
Allowing leading wildcards and adding*for short queries can significantly expand search complexity on large datasets. Consider restricting or optimizing such searches to avoid heavy performance overhead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
src/chrondb/api/sql/execution/query.clj(7 hunks)src/chrondb/api/sql/parser/clauses.clj(2 hunks)src/chrondb/api/sql/parser/statements.clj(4 hunks)src/chrondb/api/sql/server.clj(1 hunks)src/chrondb/api/v1.clj(1 hunks)src/chrondb/api/v1/routes.clj(2 hunks)src/chrondb/core.clj(2 hunks)src/chrondb/example.clj(4 hunks)src/chrondb/index/lucene.clj(1 hunks)src/chrondb/index/memory.clj(2 hunks)src/chrondb/index/protocol.clj(1 hunks)test/chrondb/api/sql/execution/query_test.clj(12 hunks)test/chrondb/api/sql/sql_query_test.clj(1 hunks)test/chrondb/index/lucene_test.clj(2 hunks)test/chrondb/index/memory_test.clj(1 hunks)test/chrondb/storage/git_test.clj(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: test
- GitHub Check: test
🔇 Additional comments (51)
src/chrondb/api/v1.clj (1)
25-27:Details
✅ Verification successful
Updated search function signature to support full-text search
The
handle-searchfunction has been modified to callindex/searchwith additional parameters, specifying the field to search ("name") and the branch to search in ("main"). This change aligns with the PR's objective of implementing full-text search using PostgreSQL's to_tsquery syntax.Let's verify that this change is consistent with other search function implementations:
🏁 Script executed:
#!/bin/bash # Searching for other index/search calls to ensure consistent parameter usage rg -A 2 "index/search" --glob "!src/chrondb/api/v1.clj" --glob "!src/chrondb/example.clj"Length of output: 5070
Verified: Consistent Full-Text Search Implementation
The
handle-searchfunction insrc/chrondb/api/v1.cljnow calls(index/search index "name" query "main"), which is fully consistent with other usages throughout the codebase (e.g., intest/chrondb/index/memory_test.cljandtest/chrondb/index/lucene_test.clj). This modification meets the PR’s goal of leveraging PostgreSQL'sto_tsquerysyntax for full-text search.
- Confirmed that similar search calls across tests use the fields "name" and branch "main".
- No discrepancies were found in parameter consistency with other implementations.
- The revised search signature aligns with the intended functionality and coding standards.
test/chrondb/api/sql/sql_query_test.clj (1)
106-106: Modified handle-select function call to support the FTS index parameterThe change from passing
parsedtonilas the second argument inquery/handle-selectadapts the test to the updated function signature, which now likely accepts an optional index parameter for full-text search support.src/chrondb/example.clj (1)
26-27: Updated search-users implementation to support full-text searchSimilar to the change in
src/chrondb/api/v1.clj, theindex/searchcall now includes field and branch parameters. This ensures consistent implementation of the full-text search capability across the codebase.test/chrondb/api/sql/execution/query_test.clj (1)
92-92:Details
✅ Verification successful
Consistently updated all handle-select calls to support FTS capability
All calls to
query/handle-selecthave been updated to passnilas the second argument instead of the query variable. This consistent change across all test cases adapts the tests to the updated function signature, which now likely accepts an optional index parameter for full-text search functionality.Let's verify the updated function signature of handle-select:
Also applies to: 105-105, 118-118, 131-131, 145-145, 157-157, 171-171, 199-199, 219-219, 236-236, 247-247, 258-258
🏁 Script executed:
#!/bin/bash # Search for the handle-select function definition to confirm the new signature rg -A 5 "defn handle-select" --glob "src/chrondb/api/sql/execution/*.clj"Length of output: 1053
FTS Update Verification Complete: Consistent Parameter Usage Confirmed
All instances ofquery/handle-selecthave been updated to passnilas the second argument, matching the newly defined function signature insrc/chrondb/api/sql/execution/query.cljwhich now accepts an optional index parameter for full-text search (FTS). This update is correctly applied in all affected test cases (lines 92, 105, 118, 131, 145, 157, 171, 199, 219, 236, 247, 258).src/chrondb/core.clj (2)
103-105: LGTM: Improved logging for index creationAdding logging statements for the Lucene index creation process improves observability of the application startup sequence.
123-123: LGTM: Updated SQL server initialization to include indexProperly passing the index to the SQL server is essential for enabling the new full-text search capabilities.
src/chrondb/index/memory.clj (2)
9-18: Use of deprecated code marker for document-matches? functionGood practice to mark the unused function with
#_rather than removing it entirely, as it documents the evolution of the codebase and might be useful for reference.
32-32: Appropriate logging for MemoryIndex searchThe logging statement clearly indicates that this implementation uses basic string matching rather than full-text search, which is important for users to understand the limitations of the memory index implementation.
src/chrondb/api/sql/parser/statements.clj (2)
36-39: Improved SQL parsing robustnessThe addition of code to clean table specifications by removing trailing semicolons and trimming whitespace is a good improvement to make the SQL parser more robust. This change handles common syntax variations that users might include in their queries.
72-75: Consistent cleanup logic across all SQL statement parsersI notice the same cleanup logic has been consistently applied across SELECT, INSERT, UPDATE, and DELETE statements. This ensures uniform handling of table specifications throughout the parser, which is a good practice.
Also applies to: 126-129, 177-180
src/chrondb/api/sql/server.clj (2)
4-4: Dependency update looks good.
Requiring[chrondb.index.lucene :as lucene-index]aligns with the new Lucene-based indexing feature, removing the dependency on any memory-based index.
14-22: Constructor logic is clear.
The fallback tolucene-index/create-lucene-indexwhenindexis nil is straightforward, ensuring the server starts gracefully on the specified port. No issues found.test/chrondb/storage/git_test.clj (7)
46-52: Repository opener appears correct.
Theopen-repofunction properly constructs and builds a repository from the provided path usingFileRepositoryBuilder. Good job handling environment and Git directory discovery.
120-140: Branch operations test coverage.
Verifying that documents differ across branches is important for multi-branch scenarios. These tests provide good coverage of separate states in “main” and “dev.”
141-176: Query operations tests look comprehensive.
Testing prefix and table-based queries, plus multi-branch queries, ensures robust validation of Git-based storage retrieval logic.
177-190: File operation error handling.
Forcing commit errors and testing non-existent documents verify edge-case behaviors. Nicely done.
191-211: Concurrent operations handled effectively.
Gradual delays and verifying final saved documents validate concurrency mechanics. Code looks good.
212-221: Existing directory setup test.
Ensuring Git storage starts even if the directory already exists is a valuable scenario. Nice coverage.
222-236: Document path generation coverage.
These tests confirm flexible ID patterns and prefix retrieval. This helps ensure consistent pathing in Git storage.src/chrondb/index/protocol.clj (1)
2-3: Protocol documentation improvements.
Simplified descriptions are concise, and addingclosemethod is a welcome addition for resource management. No concerns.Also applies to: 7-11
src/chrondb/api/sql/parser/clauses.clj (2)
5-6: Logging and utility references look good.
The addition of[chrondb.util.logging :as log]is consistent with the rest of the codebase and should help with debugging.
64-69: Docstring correctly outlines new FTS parsing.
Highlighting recognition ofFTS_MATCHand@@ to_tsqueryclarifies the function scope.test/chrondb/index/lucene_test.clj (13)
27-29: LGTM: Properly adjusted theindex/searchcall with new parameters.
34-35: LGTM: Consistent update to include the field and branch for searching.
37-38: LGTM: Correct deletion flow followed by a fresh search to confirm removal.
46-48: LGTM: Updated the search call in the test to reflect the new function signature.
53-55: LGTM: Searching by age field now includes the branch argument, which is appropriate.
60-61: LGTM: Verifies the engine handles empty search strings properly.
66-67: LGTM: Checks handling of non-existent terms, ensuring the search returns an empty result.
69-83: Comprehensive accent-handling test.
This test effectively checks diacritic folding and ensures accented and unaccented queries align.
84-103: Thorough coverage for wildcard and prefix searches.
This suite adequately tests infix and prefix matching scenarios.
104-120: Good job verifying FTS-optimized fields.
Ensures that custom fields tagged for full-text search are correctly indexed and retrieved.
122-135: Robust error handling tests.
Checking writer-closed scenarios helps confirm stability under failure conditions.
137-141: Verifies invalid directory handling.
Helps ensure the index creation gracefully returns nil without crashing.
143-150: Tests behavior with a nil index reader.
This scenario is critical to confirm the search operation fails gracefully.test/chrondb/index/memory_test.clj (7)
1-5: Namespace and dependencies initialization look correct.
Sets up the test environment properly with references to memory index protocols.
6-11: Fixture creation is straightforward and well-structured.
Ensures each test runs against a fresh memory index instance.
14-20: Verifies successful indexing.
Checks both the returned result and storage data structure for consistency.
21-31: Tests removal of an indexed document.
Appropriately confirms the document can no longer be retrieved post-deletion.
32-48: Ensures field-based search logic works with multiple documents.
Validates the count and content of returned results.
49-62: Good coverage of searching via a content field.
Shows that custom search fields function as expected.
63-79: Checks case insensitivity in searches.
Verifies consistent result sets for different text cases.src/chrondb/api/sql/execution/query.clj (4)
86-95: Neat helper to identify FTS conditions.
Clearly checks the:typefield for:fts-match—simple and effective.
97-235: Implementation of combined FTS and standard conditions is comprehensive.
You’re only processing the first FTS condition (first-fts), so if multiple FTS conditions exist, only the first is used. Verify if that’s intended or if you need to handle multiple. Otherwise, the branching logic and standard condition filtering appear coherent, and the code is well-instrumented with logs.
275-355: Index parameter integration looks consistent.
Properly accounts for potential count and aggregate queries, preserving existing behaviors while extending the logic to handle FTS results.
549-567: Handles the new index parameter inhandle-querygracefully.
The error logging and fallback for unknown queries are consistent with the rest of the flow.src/chrondb/index/lucene.clj (5)
3-13: No issues with the newly imported namespaces and classes.
These imports align well with the subsequent functionality and appear correctly scoped.
78-90: Document removal is well-structured.
Good use of exception handling and a clear boolean return to signify success or failure. This pattern helps avoid silent errors.
154-160: Graceful resource teardown path.
Closing the writer, reader, and directory in this order mitigates risk of resource leaks or read/write conflicts. Implementation looks solid here.
162-187: Comprehensive index creation and fallback strategy.
Building or reopening an index with a graceful fallback for missing or corrupt indexes ensures reliability. The logging provides helpful diagnostic details.
67-77:Details
❓ Verification inconclusive
Verify thread safety in
index-document.
Because theindex-documentmethod resets thereader-atomandsearcher-atomafter committing, concurrent indexing could lead to inconsistent states if not properly synchronized.
🏁 Script executed:
#!/bin/bash # Searching for concurrency references and usage patterns of `index-document` across the codebase rg "index-document"Length of output: 2685
Attention: Potential Concurrency Issue in Document Indexing
The current implementation in
src/chrondb/index/lucene.clj(lines 67–77) resets thereader-atomandsearcher-atomimmediately after calling.commiton the writer. Given that this function is invoked from multiple parts of the codebase (as evidenced by its usage in several API endpoints and tests), concurrent calls toindex-documentcan lead to race conditions and an inconsistent state between the writer, reader, and searcher.
- Location:
src/chrondb/index/lucene.clj, lines 67–77- Issue: Lack of synchronization when resetting shared atoms can cause conflicts if multiple threads index documents simultaneously.
- Suggestion: Evaluate the thread safety of the indexing process. If concurrent indexing is expected, consider introducing synchronization (e.g., via locking mechanisms or atomic updates) around the operations that commit the writer and update the reader and searcher.
| (search [_ field query-string branch] | ||
| (log/log-warn (str "MemoryIndex search called (field: " field ", query: " query-string ", branch: " branch "). Basic string matching performed, not full FTS.")) | ||
| (let [all-docs (seq (.values data)) | ||
| query-lower (str/lower-case query-string)] | ||
| (filter | ||
| (fn [doc] | ||
| (some (fn [[k v]] | ||
| (when (string? v) | ||
| (cond | ||
| (= (name k) field) (str/includes? (str/lower-case v) query-lower) | ||
| (= field "content") (str/includes? (str/lower-case v) query-lower) | ||
| :else false))) | ||
| doc)) | ||
| all-docs)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branch parameter is not used in filtering
While the search signature has been updated to accept the branch parameter, it's not actually used in the filtering logic. This could lead to inconsistencies with other index implementations that might use this parameter.
Consider updating the filter function to also check if the document's branch matches the specified branch parameter.
(filter
(fn [doc]
+ (and
+ (or (nil? branch) (= (:branch doc) branch))
(some (fn [[k v]]
(when (string? v)
(cond
(= (name k) field) (str/includes? (str/lower-case v) query-lower)
(= field "content") (str/includes? (str/lower-case v) query-lower)
:else false)))
- doc))
+ doc)))
all-docs))))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| (search [_ field query-string branch] | |
| (log/log-warn (str "MemoryIndex search called (field: " field ", query: " query-string ", branch: " branch "). Basic string matching performed, not full FTS.")) | |
| (let [all-docs (seq (.values data)) | |
| query-lower (str/lower-case query-string)] | |
| (filter | |
| (fn [doc] | |
| (some (fn [[k v]] | |
| (when (string? v) | |
| (cond | |
| (= (name k) field) (str/includes? (str/lower-case v) query-lower) | |
| (= field "content") (str/includes? (str/lower-case v) query-lower) | |
| :else false))) | |
| doc)) | |
| all-docs)))) | |
| (search [_ field query-string branch] | |
| (log/log-warn (str "MemoryIndex search called (field: " field ", query: " query-string ", branch: " branch "). Basic string matching performed, not full FTS.")) | |
| (let [all-docs (seq (.values data)) | |
| query-lower (str/lower-case query-string)] | |
| (filter | |
| (fn [doc] | |
| (and | |
| (or (nil? branch) (= (:branch doc) branch)) | |
| (some (fn [[k v]] | |
| (when (string? v) | |
| (cond | |
| (= (name k) field) (str/includes? (str/lower-case v) query-lower) | |
| (= field "content") (str/includes? (str/lower-case v) query-lower) | |
| :else false))) | |
| doc))) | |
| all-docs)))) |
This commit adds support for full-text search (FTS) in ChronDB using PostgreSQL's to_tsquery syntax. The implementation includes:
The changes allow users to perform full-text searches using familiar PostgreSQL syntax while leveraging Lucene's powerful search capabilities under the hood.
Key files modified:
Summary by CodeRabbit
New Features
Enhancements
Documentation
to_tsqueryin the documentation.Tests