init #10
base: main
Walkthrough
The changes introduce a centralized path configuration module, refactor all file operations to use configurable paths instead of hardcoded values, reorganize npm scripts for improved build orchestration, add Upstash documentation job configuration, enhance crawling discovery with timeout mechanisms and improved exclude pattern handling, and strengthen error handling and initialization guards throughout the codebase.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Main
participant Discovery
participant Browser
participant Timeout
rect rgb(200, 220, 255)
Note over Main,Timeout: New Discovery Phase with Timeout
Main->>Discovery: Start discovery task
Discovery->>Timeout: Create timeoutPromise (configurable delay)
par Navigation & Discovery
Discovery->>Browser: Load extended navigational selectors<br/>(sidebar, nav, etc.)
Browser->>Browser: Wait for selector with per-step timeout
and Timeout Monitoring
Timeout->>Timeout: Wait for timeout
end
alt Discovery succeeds first
Browser->>Discovery: Selector found
Discovery->>Main: Return discovered URLs
Discovery->>Timeout: Cleanup timeout
else Timeout fires first
Timeout->>Discovery: Timeout elapsed
Discovery->>Main: Log warning, continue with seed URLs
Discovery->>Browser: Cleanup context
end
end
rect rgb(220, 255, 220)
Note over Main: URL Processing (Sitemap + Seed)
Main->>Main: Expand exclude patterns (add /** variants)
Main->>Main: Filter sitemap URLs against expanded excludes
Main->>Main: Add both sitemap + seed URLs to requests
end
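The URL-processing phase in the diagram (expand excludes, filter sitemap URLs, then add seed URLs) could look roughly like the sketch below. The PR's actual implementation is not shown in this review, so every name here (`expandExcludes`, `globToRegExp`, `buildRequests`) is hypothetical, the glob matcher is a deliberately simplified stand-in for whatever matching library the project uses, and the de-duplication step is an assumption.

```typescript
// Hypothetical sketch of the "URL Processing" phase; not the PR's code.

// For each exclude pattern, also add a variant ending in "/**" so that a
// pattern naming a section excludes the pages beneath it as well.
function expandExcludes(patterns: readonly string[]): string[] {
  const expanded = new Set<string>();
  for (const pattern of patterns) {
    expanded.add(pattern);
    if (!pattern.endsWith("/**")) {
      expanded.add(`${pattern.replace(/\/+$/, "")}/**`);
    }
  }
  return Array.from(expanded);
}

// Simplified glob matcher (assumption): a double star spans path
// separators, a single star does not. Everything else is literal.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .split("**")
    .map((part) =>
      part
        .split("*")
        .map((literal) => literal.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
        .join("[^/]*"),
    )
    .join(".*");
  return new RegExp(`^${source}$`);
}

// Drop every URL that matches one of the expanded exclude patterns.
function filterExcluded(
  urls: readonly string[],
  excludes: readonly string[],
): string[] {
  const matchers = expandExcludes(excludes).map(globToRegExp);
  return urls.filter((url) => !matchers.some((re) => re.test(url)));
}

// Sitemap URLs are filtered; seed URLs are added alongside them unfiltered.
// De-duplicating the combined list is an assumption of this sketch.
function buildRequests(
  sitemapUrls: readonly string[],
  seedUrls: readonly string[],
  excludes: readonly string[],
): string[] {
  const combined = filterExcluded(sitemapUrls, excludes).concat(seedUrls);
  return Array.from(new Set(combined));
}
```

With an exclude of `**/tutorials`, both `.../docs/tutorials` and `.../docs/tutorials/first` are dropped from the sitemap list, while the seed URL is still enqueued.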
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/llm-service.ts (1)
90-101: Bug: only handles single ${jobName}.json; fails for split outputs (jobName-1.json, jobName-2.json, ...).
When write() splits outputs, processJobOutput will throw and search()’s JIT regeneration path breaks. Aggregate all matching files.
Apply this diff:
@@
 public async processJobOutput(jobName: string): Promise<void> {
   await this.initializationPromise;
-  const jsonPath = join(JOBS_OUTPUT_DIR, `${jobName}.json`);
-  if (!existsSync(jsonPath)) {
-    throw new Error(
-      `Job output file for '${jobName}' does not exist at ${jsonPath}.`
-    );
-  }
-
-  await this.generateArtifacts(jobName, jsonPath);
+  const primary = join(JOBS_OUTPUT_DIR, `${jobName}.json`);
+  let jsonPaths: string[] = [];
+  if (existsSync(primary)) {
+    jsonPaths = [primary];
+  } else {
+    // Fallback to split files pattern
+    const pattern = join(JOBS_OUTPUT_DIR, `${jobName}*.json`);
+    const matches = await (await import("glob")).glob(pattern);
+    jsonPaths = matches.sort();
+  }
+  if (jsonPaths.length === 0) {
+    throw new Error(
+      `No job output files for '${jobName}' under ${JOBS_OUTPUT_DIR}.`
+    );
+  }
+  await this.generateArtifacts(jobName, jsonPaths);
 }
@@
-  public async generateArtifacts(
-    jobName: string,
-    jsonPath: string
-  ): Promise<void> {
+  public async generateArtifacts(
+    jobName: string,
+    jsonPathOrPaths: string | string[],
+  ): Promise<void> {
@@
-  const llmTextPath = join(LLMS_DIR, `${jobName}.txt`);
-  const rawJson = await readFile(jsonPath, 'utf-8');
-  const data = JSON.parse(rawJson) as CrawledData[];
+  const llmTextPath = join(LLMS_DIR, `${jobName}.txt`);
+  const paths = Array.isArray(jsonPathOrPaths)
+    ? jsonPathOrPaths
+    : [jsonPathOrPaths];
+  const dataArrays = await Promise.all(
+    paths.map(async (p) => JSON.parse(await readFile(p, "utf-8")) as CrawledData[]),
+  );
+  const data = dataArrays.flat();
src/server.ts (3)
225-229: Output filename collisions across jobs/configs; include jobId to ensure uniqueness.
Multiple jobs with the same name (batch or concurrent) write to the same base file, risking overwrites and wrong result links.
Apply this diff:
@@
-    const configWithFileName: Config = {
-      ...config,
-      outputFileName: generateOutputFileName(jobName),
-    };
+    const configWithFileName: Config = {
+      ...config,
+      outputFileName: generateOutputFileName(`${jobName}-${jobId}`),
+    };
@@
-    const configWithFileName = {
-      ...config,
-      outputFileName: generateOutputFileName(name),
-    };
+    const configWithFileName = {
+      ...config,
+      outputFileName: generateOutputFileName(`${name}-${jobId}`),
+    };
Also applies to: 288-292
439-473: Path traversal risk in /get/:jobName/llms.txt. Sanitize/validate jobName before use.
Untrusted jobName can escape LLMS_DIR. Validate against a safe pattern or allowlist before passing to llmService.
Apply this diff:
@@
-  app.get('/get/:jobName/llms.txt', async (req: Request, res: Response) => {
-    const { jobName } = req.params;
+  app.get('/get/:jobName/llms.txt', async (req: Request, res: Response) => {
+    const { jobName } = req.params;
+    // Allow only alphanumerics, dash, underscore (mirrors filenames we create)
+    const SAFE_NAME_RE = /^[A-Za-z0-9_-]+$/;
+    if (!SAFE_NAME_RE.test(jobName)) {
+      logger.warn({ jobName }, 'Rejected unsafe jobName');
+      res.status(400).json({ message: 'Invalid job name.' });
+      return;
+    }
@@
-    if (!jobName || !llmService.jobExists(jobName)) {
+    if (!jobName || !llmService.jobExists(jobName)) {
       res.status(404).json({
         message: `Knowledge file for job '${jobName}' not found.`,
         availableJobs: getAllJobNames(),
       });
       return;
     }
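As a complement to the allowlist regex in the diff above, a handler can also confirm that the resolved file path stays inside the target directory. The following is a minimal sketch of that defense-in-depth idea, not code from the PR; `resolveJobFile` and its `extension` parameter are hypothetical names introduced for illustration.

```typescript
import { resolve, sep } from "node:path";

// Same allowlist as in the suggested diff: alphanumerics, dash, underscore.
const SAFE_NAME_RE = /^[A-Za-z0-9_-]+$/;

// Hypothetical helper: returns the absolute path for a job's file, or null
// when the name is unsafe or the resolved path would leave baseDir.
function resolveJobFile(
  baseDir: string,
  jobName: string,
  extension: string = ".txt",
): string | null {
  if (!SAFE_NAME_RE.test(jobName)) {
    return null;
  }
  const base = resolve(baseDir);
  const candidate = resolve(base, `${jobName}${extension}`);
  // Containment check: the candidate must live strictly under baseDir.
  return candidate.startsWith(base + sep) ? candidate : null;
}
```

A route handler would respond with 400 when this returns null and only then continue to the existence check.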
404-408: createReadStream options: pass an object, not a string.
Use { encoding: 'utf-8' } for correctness and typings.
Apply this diff:
-    const fileStream = createReadStream(job.outputFile, 'utf-8');
+    const fileStream = createReadStream(job.outputFile, { encoding: 'utf-8' });
src/core.ts (1)
480-499: Bug: isWithinTokenLimit returns the token count only while the text is within the limit and false otherwise; the false case is never handled, so oversized items are silently dropped.
Compute the token count explicitly (e.g., encode().length) and compare against limits.
Apply this diff:
-      if (globalConfig.maxTokens !== 'unlimited') {
-        const tokenCount: number | false = isWithinTokenLimit(
-          contentString,
-          globalConfig.maxTokens
-        );
-
-        if (typeof tokenCount === 'number') {
-          if (estimatedTokens + tokenCount > globalConfig.maxTokens) {
+      if (globalConfig.maxTokens !== 'unlimited') {
+        // Compute token count explicitly
+        const { encode } = await import('gpt-tokenizer');
+        const tokenCount: number = encode(contentString).length;
+        if (estimatedTokens + tokenCount > globalConfig.maxTokens) {
             // Only write the batch if it's not empty (something to write)
             if (currentResults.length > 0) {
               await writeBatchToFile();
             }
             // Since the addition of a single item exceeded the token limit, halve it.
             estimatedTokens = Math.floor(tokenCount / 2);
             currentResults.push(data);
-          } else {
-            currentResults.push(data);
-            estimatedTokens += tokenCount;
-          }
-        }
+        } else {
+          currentResults.push(data);
+          estimatedTokens += tokenCount;
+        }
+      }
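To make the batching behaviour easier to reason about in isolation, here is a minimal, self-contained sketch of token-bounded batching in which no item is ever dropped. It is not the code in src/core.ts: `batchByTokens` is a hypothetical name, the tokenizer is injected as a plain function (the whitespace counter below is only a stand-in for a real tokenizer), and an oversized item is given its own batch rather than using the halving heuristic from the diff.

```typescript
type CountTokens = (text: string) => number;

// Hypothetical helper: group items into batches whose summed token counts
// stay within maxTokens. An item that alone exceeds the limit still gets
// its own batch, so nothing is silently discarded.
function batchByTokens<T>(
  items: readonly T[],
  maxTokens: number,
  countTokens: CountTokens,
  serialize: (item: T) => string = (item) => JSON.stringify(item),
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let used = 0;

  for (const item of items) {
    const tokens = countTokens(serialize(item));
    // Flush the current batch first if adding this item would overflow it.
    if (current.length > 0 && used + tokens > maxTokens) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(item);
    used += tokens;
  }

  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}

// Stand-in tokenizer for illustration only: counts whitespace-separated words.
const countWords: CountTokens = (text) =>
  text.split(/\s+/).filter((word) => word.length > 0).length;
```

In the real code path the counter would be something like `(text) => encode(text).length`, as the diff suggests.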
🧹 Nitpick comments (5)
configurations/jobs/upstash.ts (1)
5-65: Consider extracting common exclude patterns to reduce duplication.
All four job configurations share several identical exclude patterns (**/tutorials, **/integrations, **/help, **/commands/**, https://context7.com/**). Extracting these into a constant would improve maintainability.
Example refactor:
const COMMON_EXCLUDES = [
  '**/tutorials',
  '**/integrations',
  '**/help',
  '**/commands/**',
  'https://context7.com/**',
] as const;

export default defineJob([
  {
    entry: 'https://upstash.com/docs/redis/overall/getstarted',
    match: ['https://upstash.com/docs/redis/**'],
    selector: '#content-area',
    exclude: [
      ...COMMON_EXCLUDES,
      '**/compare',
    ],
    waitForSelectorTimeout: 15000,
  },
  // ... other configs
]);
Note: The https://context7.com/** exclusion appears unrelated to the Upstash domain. Verify this is intentional and not a copy-paste remnant from another configuration.
src/llm-service.ts (1)
202-208: Prefer mtimeMs to avoid ad-hoc threshold logic.
Use high‑resolution mtimeMs and drop thresholdMs; simpler and more robust across filesystems.
Apply this diff:
-  const thresholdMs = 1000; // Allow for coarse filesystem timestamp resolution
-  const jsonMtime = jsonStats.mtime.getTime();
+  const jsonMtime = jsonStats.mtimeMs;
 ...
-    jsonMtime > indexStats.mtime.getTime() + thresholdMs ||
-    jsonMtime > metadataStats.mtime.getTime() + thresholdMs
+    jsonMtime > indexStats.mtimeMs ||
+    jsonMtime > metadataStats.mtimeMs
src/paths.ts (2)
5-7: Normalize to absolute paths to avoid surprises with relative env values.
Resolve env/fallback to absolute paths for consistency.
Apply this diff:
-import { join } from "path";
+import { join, resolve } from "path";
@@
-function resolvePath(envVar: string | undefined, fallback: string): string {
-  return envVar && envVar.trim().length > 0 ? envVar : fallback;
-}
+function resolvePath(envVar: string | undefined, fallback: string): string {
+  const p = envVar && envVar.trim().length > 0 ? envVar : fallback;
+  return resolve(p);
+}
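The effect of that change is easiest to see with a few concrete inputs. The sketch below restates the suggested `resolvePath` as a standalone function; trimming the environment value before resolving is an extra assumption of this sketch and is not part of the diff.

```typescript
import { isAbsolute, resolve } from "node:path";

// Standalone restatement of the suggested helper. Trimming the env value
// is an addition made for this sketch only.
function resolvePath(envVar: string | undefined, fallback: string): string {
  const value = envVar && envVar.trim().length > 0 ? envVar.trim() : fallback;
  // resolve() anchors relative values at the current working directory.
  return resolve(value);
}

// A relative override becomes absolute relative to process.cwd().
const fromRelativeEnv = resolvePath("data/llms", "/var/lib/app/llms");

// Blank or missing values fall back to the default, which is also resolved.
const fromBlankEnv = resolvePath("   ", "/var/lib/app/llms");
const fromMissingEnv = resolvePath(undefined, "data/llms");

console.log(isAbsolute(fromRelativeEnv), fromBlankEnv, fromMissingEnv);
```

Because the result depends on the working directory at import time, a module like this behaves most predictably when the process is always started from the project root.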
16-29: Freeze PATHS at runtime.
Prevent accidental mutation; TS’s as const is compile-time only.
Apply this diff:
-export const PATHS = {
+export const PATHS = Object.freeze({
   root: ROOT_DIR,
   data: dataDir,
   output: outputDir,
   storage: storageDir,
   llms: resolvePath(process.env.LLMS_DIR, join(dataDir, "llms")),
   indexes: resolvePath(process.env.INDEXES_DIR, join(dataDir, "indexes")),
   queueDb: resolvePath(process.env.QUEUE_DB_PATH, join(dataDir, "queue.db")),
   jobsDb: resolvePath(process.env.JOBS_DB_PATH, join(dataDir, "jobs.db")),
   jobsOutput: resolvePath(
     process.env.JOBS_OUTPUT_DIR,
     join(outputDir, "jobs"),
   ),
-} as const;
+} as const);
src/queue.ts (1)
65-68: Add busy_timeout to reduce SQLITE_BUSY under load.
Helps concurrent access in a single process with WAL.
Apply this diff:
 this.db = new Database(this.dbPath);
 this.db.pragma("journal_mode = WAL"); // Better concurrency
+this.db.pragma("busy_timeout = 5000"); // Avoid immediate SQLITE_BUSY
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
bun.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
concatenated.xml (0 hunks)
configurations/index.ts (2 hunks)
configurations/jobs/upstash.ts (1 hunks)
package.json (1 hunks)
src/core.ts (12 hunks)
src/job-store.ts (2 hunks)
src/llm-service.ts (2 hunks)
src/paths.ts (1 hunks)
src/queue.ts (11 hunks)
src/schema.ts (2 hunks)
src/scripts/check-llm-artifacts.ts (1 hunks)
src/scripts/generate-llm-artifacts.ts (1 hunks)
src/server.ts (3 hunks)
src/worker.ts (1 hunks)
💤 Files with no reviewable changes (1)
- concatenated.xml
🧰 Additional context used
🧬 Code graph analysis (10)
src/llm-service.ts (1)
src/paths.ts (1)
PATHS(16-29)
src/scripts/generate-llm-artifacts.ts (1)
src/paths.ts (1)
PATHS(16-29)
configurations/jobs/upstash.ts (1)
configurations/types.ts (1)
defineJob(99-105)
src/server.ts (3)
src/paths.ts (1)
PATHS(16-29)
src/queue.ts (1)
crawlQueue(360-360)
src/job-store.ts (1)
jobStore(180-180)
src/queue.ts (1)
src/paths.ts (1)
PATHS(16-29)
src/job-store.ts (1)
src/paths.ts (1)
PATHS(16-29)
src/worker.ts (2)
src/llm-service.ts (1)
llmService(266-266)
src/logger.ts (1)
logger(8-20)
src/scripts/check-llm-artifacts.ts (1)
src/paths.ts (1)
PATHS(16-29)
src/schema.ts (1)
src/paths.ts (1)
PATHS(16-29)
src/core.ts (2)
src/paths.ts (1)
PATHS(16-29)
src/schema.ts (1)
generateOutputFileName(155-158)
🔇 Additional comments (16)
configurations/index.ts (1)
13-13: LGTM!
The upstash job registration follows the established pattern and integrates correctly with the existing job registry.
Also applies to: 27-27
package.json (1)
49-55: LGTM!
The script reorganization improves the build orchestration by:
- Ensuring job index generation runs before TypeScript compilation
- Creating clear separation between build and run phases
- Making the workflow more maintainable with intermediate steps
src/job-store.ts (1)
5-5: LGTM!
Clean refactor to use centralized path configuration from PATHS while maintaining the flexibility of allowing a custom path override.
Also applies to: 33-33
src/schema.ts (1)
2-2: LGTM!
Proper use of path.join with centralized PATHS ensures cross-platform compatibility for output file path generation.
Also applies to: 5-5, 156-157
src/scripts/generate-llm-artifacts.ts (1)
5-5: LGTM!
Correct use of PATHS with backslash normalization for cross-platform glob compatibility.
Also applies to: 7-7
src/scripts/check-llm-artifacts.ts (1)
5-5: LGTM!
Consistent with the parallel changes in generate-llm-artifacts.ts, correctly using PATHS with backslash normalization.
Also applies to: 7-7
src/worker.ts (1)
62-77: Excellent error handling improvement!
The change from fire-and-forget to awaited execution with try/catch properly handles LLM artifact generation failures without failing the main job. The "non-fatal" designation is appropriate since artifact generation is a post-processing step.
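The awaited, non-fatal pattern praised here can be expressed as a small wrapper. This is an illustrative sketch only: `runNonFatal` is a hypothetical name, and the worker in the PR presumably inlines the try/catch around its own artifact-generation call and uses its own logger.

```typescript
type LogFn = (message: string, error: unknown) => void;

// Hypothetical helper: awaits a post-processing step and reports failure
// through the logger instead of letting it fail the surrounding job.
async function runNonFatal(
  label: string,
  step: () => Promise<void>,
  log: LogFn = (message, error) => console.warn(message, error),
): Promise<boolean> {
  try {
    await step();
    return true;
  } catch (error) {
    log(`${label} failed (non-fatal)`, error);
    return false;
  }
}
```

Compared with fire-and-forget, awaiting the step means the failure is logged before the job is marked complete, and no unhandled rejection can surface later.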
src/llm-service.ts (1)
8-14: Good move: centralizing paths via PATHS.
Replacing hardcoded directories with PATHS.llms/indexes/jobsOutput improves configurability and testability.
src/server.ts (2)
18-27: Good: PATHS.jobsOutput and PATHS.indexes centralization.
This removes hardcoded paths and aligns with the new paths module.
598-618: Shutdown: ensure close() calls are safe and awaited if async.
If jobStore.close() becomes async, unhandled rejection may skip process.exit.
Please confirm both crawlQueue.close() and jobStore.close() are synchronous. If not, wrap in try/catch and await Promises. I can provide a patch once confirmed.
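One way to make the shutdown path indifferent to whether close() is synchronous or asynchronous is sketched below. The `Closable` type and `closeAll` function are hypothetical and not part of the PR; the real crawlQueue and jobStore objects are only assumed to expose a close() method.

```typescript
// Works for both sync and async close() implementations, because awaiting
// a non-Promise value simply yields that value.
type Closable = { close(): void | Promise<void> };

type NamedResource = { name: string; resource: Closable };

// Hypothetical helper: closes every resource in order, logs failures, and
// returns how many close() calls failed so the caller can pick an exit code.
async function closeAll(
  resources: readonly NamedResource[],
  log: (message: string, error: unknown) => void = (message, error) =>
    console.error(message, error),
): Promise<number> {
  let failures = 0;
  for (const { name, resource } of resources) {
    try {
      await resource.close();
    } catch (error) {
      failures += 1;
      log(`Failed to close ${name}`, error);
    }
  }
  return failures;
}

// Possible wiring in a signal handler (assumed names, shown as a comment):
//
//   process.on("SIGTERM", async () => {
//     const failures = await closeAll([
//       { name: "crawlQueue", resource: crawlQueue },
//       { name: "jobStore", resource: jobStore },
//     ]);
//     process.exit(failures === 0 ? 0 : 1);
//   });
```

Since every close() call is wrapped individually, one failing resource does not prevent the others from closing or the process from exiting.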
src/queue.ts (2)
48-51: Good default to PATHS.queueDb.
Keeps configuration centralized.
93-101: Solid initialization guards across public methods.
Prevents misuse before initialize() and simplifies failure modes.
Also applies to: 118-119, 151-152, 199-200, 219-220, 261-262, 280-281, 297-298, 317-318, 351-356
src/core.ts (4)
156-158: Good: storage dir now under PATHS.storage per job.
Isolated storage reduces cross-job interference.
271-334: Nice discovery timeout and cleanup pattern.
Promise.race with clearTimeout in finally prevents hangs; discovery selectors look comprehensive.
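For readers unfamiliar with the pattern, a generic version of "Promise.race with clearTimeout in finally" is sketched below. It is not the code from src/core.ts: `withTimeout` is a hypothetical helper, and resolving to a fallback value (for example, the seed URLs) on timeout is an assumption based on the walkthrough's description.

```typescript
// Hypothetical helper: resolves with the task's result, or with `fallback`
// if the task has not settled within `ms` milliseconds. The timer is always
// cleared so it cannot keep the process alive after the race is decided.
async function withTimeout<T>(
  task: Promise<T>,
  ms: number,
  fallback: T,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolveTimeout) => {
    timer = setTimeout(() => resolveTimeout(fallback), ms);
  });
  try {
    return await Promise.race([task, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

Note that the losing task is not cancelled by the race; it keeps running in the background, which is why the sequence diagram also shows an explicit browser-context cleanup on the timeout branch. A rejection from the task still propagates to the caller if it happens before the timeout.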
275-282: No issues found: chromium.launch timeout option is valid.
Playwright's chromium.launch supports a timeout option (milliseconds) with a default of 30000ms. The code's 15-second timeout is correctly implemented and requires no changes.
512-515: No changes required: Dataset.forEach properly awaits async callbacks.
Dataset.forEach in Crawlee awaits any Promise returned by the callback before processing the next item, so the batching logic is safe from race conditions. The code at lines 512-515 is correct as written.
Summary by CodeRabbit
Breaking Changes
New Features
Chores