Skip to content

Conversation

@temrb
Copy link
Owner

@temrb temrb commented Oct 28, 2025

Summary by CodeRabbit

  • Breaking Changes

    • Core application functionality has been removed. Existing functionality will no longer work.
  • New Features

    • Added support for crawling Upstash documentation (Redis, Vector, QStash, and Workflow).
  • Chores

    • Reorganized build and execution scripts for improved workflow management.
    • Centralized filesystem path configuration for better maintainability.

@coderabbitai
Copy link

coderabbitai bot commented Oct 28, 2025

Walkthrough

The changes introduce a centralized path configuration module, refactor all file operations to use configurable paths instead of hardcoded values, reorganize npm scripts for improved build orchestration, add Upstash documentation job configuration, enhance crawling discovery with timeout mechanisms and improved exclude pattern handling, and strengthen error handling and initialization guards throughout the codebase.

Changes

Cohort / File(s) Summary
Path Centralization
src/paths.ts
New module defining centralized PATHS constant with environment-aware path resolution for storage, indexes, databases, and job outputs. Exports PathKey type and getPath() function.
Configuration Jobs
configurations/jobs/upstash.ts, configurations/index.ts
Adds Upstash job configuration covering Redis, Vector, QStash, and Workflow documentation with CSS selectors and exclude patterns. Registers upstash entry in configurations/index.ts jobs export.
Path Integration Across Core
src/core.ts, src/job-store.ts, src/llm-service.ts, src/queue.ts, src/schema.ts, src/scripts/*
Updates all modules to import and use PATHS instead of hardcoded directory strings (e.g., PATHS.storage, PATHS.jobsDb, PATHS.jobsOutput, PATHS.llms, PATHS.indexes). Consistent pattern across files.
Discovery Refactoring
src/core.ts
Introduces discovery timeout mechanism with Promise.race, expands navigation selector patterns, normalizes exclude patterns with /** variants, filters sitemap URLs against expanded excludes, and improves timeout cleanup.
Queue Initialization Guards
src/queue.ts
Adds isInitialized() public method and ensureInitialized() private guard. Enforces initialization checks across public operations. Updates constructor to default to PATHS.queueDb. Adds SQLITE_CONSTRAINT error handling for duplicate job IDs.
Server Health Endpoints
src/server.ts
Adds GET /health and GET /ready endpoints for liveness and readiness checks (including queue stats). Simplifies graceful shutdown logic by removing intermediate promise mechanism.
Infrastructure Improvements
src/llm-service.ts, src/worker.ts
Introduces timestamp threshold (1000ms) for staleness checks in LLM artifacts. Converts worker's fire-and-forget LLM processing to awaited try/catch with success logging.
Script Reorganization
package.json
Restructures npm scripts: adds build, cli:run, cli, check:llm-artifacts:run, check:llm-artifacts. Updates dev and cli to orchestrate via new composite scripts using bun run generate:jobs. Removes prebuild.

Sequence Diagram(s)

sequenceDiagram
    participant Main
    participant Discovery
    participant Browser
    participant Timeout
    
    rect rgb(200, 220, 255)
    Note over Main,Timeout: New Discovery Phase with Timeout
    Main->>Discovery: Start discovery task
    Discovery->>Timeout: Create timeoutPromise (configurable delay)
    par Navigation & Discovery
        Discovery->>Browser: Load extended navigational selectors<br/>(sidebar, nav, etc.)
        Browser->>Browser: Wait for selector with per-step timeout
    and Timeout Monitoring
        Timeout->>Timeout: Wait for timeout
    end
    alt Discovery succeeds first
        Browser->>Discovery: Selector found
        Discovery->>Main: Return discovered URLs
        Discovery->>Timeout: Cleanup timeout
    else Timeout fires first
        Timeout->>Discovery: Timeout elapsed
        Discovery->>Main: Log warning, continue with seed URLs
        Discovery->>Browser: Cleanup context
    end
    end
    
    rect rgb(220, 255, 220)
    Note over Main: URL Processing (Sitemap + Seed)
    Main->>Main: Expand exclude patterns (add /** variants)
    Main->>Main: Filter sitemap URLs against expanded excludes
    Main->>Main: Add both sitemap + seed URLs to requests
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Path centralization pattern: Consistent across ~8 files but requires verification that all path references are correctly updated and environment variable fallbacks work as intended.
  • Discovery timeout mechanism (src/core.ts): New Promise.race logic and navigation selector expansion warrant careful review of timeout handling, cleanup, and fallback to seed URLs.
  • Queue initialization guards (src/queue.ts): Verify ensureInitialized() is invoked at all necessary entry points and that SQLITE_CONSTRAINT error handling doesn't mask other constraint violations.
  • Artifact staleness logic (src/llm-service.ts): Timestamp threshold (1000ms) addition requires verification against typical filesystem time resolution on target platforms.

Poem

🐰 Organized burrows, paths made clear,
Timeouts keep the flow sincere,
Upstash docs now crawled with care,
Guard clauses everywhere,
The rabbit's warren—structured, fair!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title Check ❓ Inconclusive The pull request title "init" is extremely vague and generic. While the changeset contains several meaningful updates including the introduction of a new PATHS module for centralizing filesystem paths, refactoring existing modules to use this new path management system, and adding a new Upstash job configuration, the single-word title "init" conveys none of these details. A developer scanning the commit history would have no meaningful understanding of what this pull request actually accomplishes. The title violates the principle of being clear and specific enough that teammates can understand the primary change. Update the pull request title to be more descriptive and specific about the main changes. Consider titles such as "Centralize filesystem paths and refactor path usage" or "Add PATHS module for centralized path management" that would clearly communicate the primary objective of the changeset to reviewers scanning the history.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch init-docker

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/llm-service.ts (1)

90-101: Bug: only handles single ${jobName}.json; fails for split outputs (jobName-1.json, jobName-2.json, ...).

When write() splits outputs, processJobOutput will throw and search()’s JIT regeneration path breaks. Aggregate all matching files.

Apply this diff:

@@
   public async processJobOutput(jobName: string): Promise<void> {
     await this.initializationPromise;
-    const jsonPath = join(JOBS_OUTPUT_DIR, `${jobName}.json`);
-    if (!existsSync(jsonPath)) {
-      throw new Error(
-        `Job output file for '${jobName}' does not exist at ${jsonPath}.`
-      );
-    }
-
-    await this.generateArtifacts(jobName, jsonPath);
+    const primary = join(JOBS_OUTPUT_DIR, `${jobName}.json`);
+    let jsonPaths: string[] = [];
+    if (existsSync(primary)) {
+      jsonPaths = [primary];
+    } else {
+      // Fallback to split files pattern
+      const pattern = join(JOBS_OUTPUT_DIR, `${jobName}*.json`);
+      const matches = await (await import("glob")).glob(pattern);
+      jsonPaths = matches.sort();
+    }
+    if (jsonPaths.length === 0) {
+      throw new Error(
+        `No job output files for '${jobName}' under ${JOBS_OUTPUT_DIR}.`
+      );
+    }
+    await this.generateArtifacts(jobName, jsonPaths);
   }
@@
-  public async generateArtifacts(
-    jobName: string,
-    jsonPath: string
-  ): Promise<void> {
+  public async generateArtifacts(
+    jobName: string,
+    jsonPathOrPaths: string | string[],
+  ): Promise<void> {
@@
-    const llmTextPath = join(LLMS_DIR, `${jobName}.txt`);
-    const rawJson = await readFile(jsonPath, 'utf-8');
-    const data = JSON.parse(rawJson) as CrawledData[];
+    const llmTextPath = join(LLMS_DIR, `${jobName}.txt`);
+    const paths = Array.isArray(jsonPathOrPaths)
+      ? jsonPathOrPaths
+      : [jsonPathOrPaths];
+    const dataArrays = await Promise.all(
+      paths.map(async (p) => JSON.parse(await readFile(p, "utf-8")) as CrawledData[]),
+    );
+    const data = dataArrays.flat();
src/server.ts (3)

225-229: Output filename collisions across jobs/configs; include jobId to ensure uniqueness.

Multiple jobs with the same name (batch or concurrent) write to the same base file, risking overwrites and wrong result links.

Apply this diff:

@@
-      const configWithFileName: Config = {
-        ...config,
-        outputFileName: generateOutputFileName(jobName),
-      };
+      const configWithFileName: Config = {
+        ...config,
+        outputFileName: generateOutputFileName(`${jobName}-${jobId}`),
+      };
@@
-        const configWithFileName = {
-          ...config,
-          outputFileName: generateOutputFileName(name),
-        };
+        const configWithFileName = {
+          ...config,
+          outputFileName: generateOutputFileName(`${name}-${jobId}`),
+        };

Also applies to: 288-292


439-473: Path traversal risk in /get/:jobName/llms.txt. Sanitize/validate jobName before use.

Untrusted jobName can escape LLMS_DIR. Validate against a safe pattern or allowlist before passing to llmService.

Apply this diff:

@@
-  app.get('/get/:jobName/llms.txt', async (req: Request, res: Response) => {
-    const { jobName } = req.params;
+  app.get('/get/:jobName/llms.txt', async (req: Request, res: Response) => {
+    const { jobName } = req.params;
+    // Allow only alphanumerics, dash, underscore (mirrors filenames we create)
+    const SAFE_NAME_RE = /^[A-Za-z0-9_-]+$/;
+    if (!SAFE_NAME_RE.test(jobName)) {
+      logger.warn({ jobName }, 'Rejected unsafe jobName');
+      res.status(400).json({ message: 'Invalid job name.' });
+      return;
+    }
@@
-    if (!jobName || !llmService.jobExists(jobName)) {
+    if (!jobName || !llmService.jobExists(jobName)) {
       res.status(404).json({
         message: `Knowledge file for job '${jobName}' not found.`,
         availableJobs: getAllJobNames(),
       });
       return;
     }

404-408: createReadStream options: pass an object, not a string.

Use { encoding: 'utf-8' } for correctness and typings.

Apply this diff:

-      const fileStream = createReadStream(job.outputFile, 'utf-8');
+      const fileStream = createReadStream(job.outputFile, { encoding: 'utf-8' });
src/core.ts (1)

480-499: Bug: isWithinTokenLimit returns boolean, but code treats it as a token count. Data may be dropped.

Compute token count explicitly (e.g., encode().length) and compare against limits.

Apply this diff:

-    if (globalConfig.maxTokens !== 'unlimited') {
-      const tokenCount: number | false = isWithinTokenLimit(
-        contentString,
-        globalConfig.maxTokens
-      );
-
-      if (typeof tokenCount === 'number') {
-        if (estimatedTokens + tokenCount > globalConfig.maxTokens) {
+    if (globalConfig.maxTokens !== 'unlimited') {
+      // Compute token count explicitly
+      const { encode } = await import('gpt-tokenizer');
+      const tokenCount: number = encode(contentString).length;
+      if (estimatedTokens + tokenCount > globalConfig.maxTokens) {
           // Only write the batch if it's not empty (something to write)
           if (currentResults.length > 0) {
             await writeBatchToFile();
           }
           // Since the addition of a single item exceeded the token limit, halve it.
           estimatedTokens = Math.floor(tokenCount / 2);
           currentResults.push(data);
-        } else {
-          currentResults.push(data);
-          estimatedTokens += tokenCount;
-        }
-      }
+        } else {
+          currentResults.push(data);
+          estimatedTokens += tokenCount;
+        }
+      }
🧹 Nitpick comments (5)
configurations/jobs/upstash.ts (1)

5-65: Consider extracting common exclude patterns to reduce duplication.

All four job configurations share several identical exclude patterns (**/tutorials, **/integrations, **/help, **/commands/**, https://context7.com/**). Extracting these into a constant would improve maintainability.

Example refactor:

const COMMON_EXCLUDES = [
  '**/tutorials',
  '**/integrations',
  '**/help',
  '**/commands/**',
  'https://context7.com/**',
] as const;

export default defineJob([
  {
    entry: 'https://upstash.com/docs/redis/overall/getstarted',
    match: ['https://upstash.com/docs/redis/**'],
    selector: '#content-area',
    exclude: [
      ...COMMON_EXCLUDES,
      '**/compare',
    ],
    waitForSelectorTimeout: 15000,
  },
  // ... other configs
]);

Note: The https://context7.com/** exclusion appears unrelated to the Upstash domain. Verify this is intentional and not a copy-paste remnant from another configuration.

src/llm-service.ts (1)

202-208: Prefer mtimeMs to avoid ad-hoc threshold logic.

Use high‑resolution mtimeMs and drop thresholdMs; simpler and more robust across filesystems.

Apply this diff:

-    const thresholdMs = 1000; // Allow for coarse filesystem timestamp resolution
-    const jsonMtime = jsonStats.mtime.getTime();
+    const jsonMtime = jsonStats.mtimeMs;
...
-      jsonMtime > indexStats.mtime.getTime() + thresholdMs ||
-      jsonMtime > metadataStats.mtime.getTime() + thresholdMs
+      jsonMtime > indexStats.mtimeMs ||
+      jsonMtime > metadataStats.mtimeMs
src/paths.ts (2)

5-7: Normalize to absolute paths to avoid surprises with relative env values.

Resolve env/fallback to absolute paths for consistency.

Apply this diff:

-import { join } from "path";
+import { join, resolve } from "path";
@@
-function resolvePath(envVar: string | undefined, fallback: string): string {
-  return envVar && envVar.trim().length > 0 ? envVar : fallback;
-}
+function resolvePath(envVar: string | undefined, fallback: string): string {
+  const p = envVar && envVar.trim().length > 0 ? envVar : fallback;
+  return resolve(p);
+}

16-29: Freeze PATHS at runtime.

Prevent accidental mutation; TS’s as const is compile-time only.

Apply this diff:

-export const PATHS = {
+export const PATHS = Object.freeze({
   root: ROOT_DIR,
   data: dataDir,
   output: outputDir,
   storage: storageDir,
   llms: resolvePath(process.env.LLMS_DIR, join(dataDir, "llms")),
   indexes: resolvePath(process.env.INDEXES_DIR, join(dataDir, "indexes")),
   queueDb: resolvePath(process.env.QUEUE_DB_PATH, join(dataDir, "queue.db")),
   jobsDb: resolvePath(process.env.JOBS_DB_PATH, join(dataDir, "jobs.db")),
   jobsOutput: resolvePath(
     process.env.JOBS_OUTPUT_DIR,
     join(outputDir, "jobs"),
   ),
-} as const;
+}) as const;
src/queue.ts (1)

65-68: Add busy_timeout to reduce SQLITE_BUSY under load.

Helps concurrent access in a single process with WAL.

Apply this diff:

     this.db = new Database(this.dbPath);
     this.db.pragma("journal_mode = WAL"); // Better concurrency
+    this.db.pragma("busy_timeout = 5000"); // Avoid immediate SQLITE_BUSY
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53f2fa9 and cb0ec22.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • concatenated.xml (0 hunks)
  • configurations/index.ts (2 hunks)
  • configurations/jobs/upstash.ts (1 hunks)
  • package.json (1 hunks)
  • src/core.ts (12 hunks)
  • src/job-store.ts (2 hunks)
  • src/llm-service.ts (2 hunks)
  • src/paths.ts (1 hunks)
  • src/queue.ts (11 hunks)
  • src/schema.ts (2 hunks)
  • src/scripts/check-llm-artifacts.ts (1 hunks)
  • src/scripts/generate-llm-artifacts.ts (1 hunks)
  • src/server.ts (3 hunks)
  • src/worker.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • concatenated.xml
🧰 Additional context used
🧬 Code graph analysis (10)
src/llm-service.ts (1)
src/paths.ts (1)
  • PATHS (16-29)
src/scripts/generate-llm-artifacts.ts (1)
src/paths.ts (1)
  • PATHS (16-29)
configurations/jobs/upstash.ts (1)
configurations/types.ts (1)
  • defineJob (99-105)
src/server.ts (3)
src/paths.ts (1)
  • PATHS (16-29)
src/queue.ts (1)
  • crawlQueue (360-360)
src/job-store.ts (1)
  • jobStore (180-180)
src/queue.ts (1)
src/paths.ts (1)
  • PATHS (16-29)
src/job-store.ts (1)
src/paths.ts (1)
  • PATHS (16-29)
src/worker.ts (2)
src/llm-service.ts (1)
  • llmService (266-266)
src/logger.ts (1)
  • logger (8-20)
src/scripts/check-llm-artifacts.ts (1)
src/paths.ts (1)
  • PATHS (16-29)
src/schema.ts (1)
src/paths.ts (1)
  • PATHS (16-29)
src/core.ts (2)
src/paths.ts (1)
  • PATHS (16-29)
src/schema.ts (1)
  • generateOutputFileName (155-158)
🔇 Additional comments (16)
configurations/index.ts (1)

13-13: LGTM!

The upstash job registration follows the established pattern and integrates correctly with the existing job registry.

Also applies to: 27-27

package.json (1)

49-55: LGTM!

The script reorganization improves the build orchestration by:

  • Ensuring job index generation runs before TypeScript compilation
  • Creating clear separation between build and run phases
  • Making the workflow more maintainable with intermediate steps
src/job-store.ts (1)

5-5: LGTM!

Clean refactor to use centralized path configuration from PATHS while maintaining the flexibility of allowing a custom path override.

Also applies to: 33-33

src/schema.ts (1)

2-2: LGTM!

Proper use of path.join with centralized PATHS ensures cross-platform compatibility for output file path generation.

Also applies to: 5-5, 156-157

src/scripts/generate-llm-artifacts.ts (1)

5-5: LGTM!

Correct use of PATHS with backslash normalization for cross-platform glob compatibility.

Also applies to: 7-7

src/scripts/check-llm-artifacts.ts (1)

5-5: LGTM!

Consistent with the parallel changes in generate-llm-artifacts.ts, correctly using PATHS with backslash normalization.

Also applies to: 7-7

src/worker.ts (1)

62-77: Excellent error handling improvement!

The change from fire-and-forget to awaited execution with try/catch properly handles LLM artifact generation failures without failing the main job. The "non-fatal" designation is appropriate since artifact generation is a post-processing step.

src/llm-service.ts (1)

8-14: Good move: centralizing paths via PATHS.

Replacing hardcoded directories with PATHS.llms/indexes/jobsOutput improves configurability and testability.

src/server.ts (2)

18-27: Good: PATHS.jobsOutput and PATHS.indexes centralization.

This removes hardcoded paths and aligns with the new paths module.


598-618: Shutdown: ensure close() calls are safe and awaited if async.

If jobStore.close() becomes async, unhandled rejection may skip process.exit.

Please confirm both crawlQueue.close() and jobStore.close() are synchronous. If not, wrap in try/catch and await Promises. I can provide a patch once confirmed.

src/queue.ts (2)

48-51: Good default to PATHS.queueDb.

Keeps configuration centralized.


93-101: Solid initialization guards across public methods.

Prevents misuse before initialize() and simplifies failure modes.

Also applies to: 118-119, 151-152, 199-200, 219-220, 261-262, 280-281, 297-298, 317-318, 351-356

src/core.ts (4)

156-158: Good: storage dir now under PATHS.storage per job.

Isolated storage reduces cross-job interference.


271-334: Nice discovery timeout and cleanup pattern.

Promise.race with clearTimeout in finally prevents hangs; discovery selectors look comprehensive.


275-282: No issues found — chromium.launch timeout option is valid.

Playwright's chromium.launch supports a timeout option (milliseconds) with a default of 30000ms. The code's 15-second timeout is correctly implemented and requires no changes.


512-515: No changes required — Dataset.forEach properly awaits async callbacks.

Dataset.forEach in Crawlee awaits any Promise returned by the callback before processing the next item, so the batching logic is safe from race conditions. The code at lines 512-515 is correct as written.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants