Context0

Context0 is a powerful and flexible web crawling toolkit designed to generate structured knowledge bases from websites. It features a robust, queue-based architecture for asynchronous processing, a REST API for programmatic control, and a rich command-line interface for direct execution.

Its core purpose is to extract meaningful content and transform it into a format ready for Large Language Models (LLMs), complete with automatic embedding generation and a built-in semantic search API.

✨ Features

  • Multiple Operation Modes: Use the interactive CLI, the REST API, or run the background worker for maximum flexibility.
  • Asynchronous & Resilient: A queue-based worker system (powered by SQLite) ensures reliable, concurrent, and fault-tolerant crawling.
  • LLM-Ready Artifacts: Automatically generates text embeddings and vector indexes from crawled content for powerful semantic search.
  • Semantic Search API: Query your crawled knowledge base using natural language directly via a simple API endpoint.
  • Rich Configuration: Define complex crawl jobs with fine-grained control over URLs to match/exclude, content selectors (CSS & XPath), and more.
  • Persistent Job Store: Keeps a complete history of all crawl jobs, their status, and links to their output.
  • Automatic Data Handling: Intelligently splits large outputs based on token count or file size to respect LLM context limits.
  • Developer Friendly: Built with TypeScript, Zod for validation, and Pino for structured logging.

🚀 Getting Started

Prerequisites

  • Bun: This project uses Bun as the package manager and runtime.
  • Node.js: Although Bun is the runtime, a recent Node.js version is still recommended.
  • Windows Users: It is highly recommended to use Windows Subsystem for Linux (WSL) for compatibility.

Installation

  1. Clone the repository:

    git clone <your-repository-url>
    cd <repository-directory>
  2. Install dependencies: This command will also download the necessary Playwright browser binaries.

    bun install
  3. Set up your environment: Copy the example environment file and customize it as needed.

    cp .env.example .env

    Key variables in .env:

    • API_PORT: Port for the API server.
    • API_KEY: An optional secret key to protect your API endpoints.
    • WORKER_CONCURRENCY: Number of crawl jobs the worker can run in parallel.
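
    A minimal example .env (the values below are placeholders; API_PORT=5000 matches the curl example in the Quick Start):

    API_PORT=5000
    # Optional secret; set it to require a key on API requests
    API_KEY=replace-with-a-secret
    WORKER_CONCURRENCY=2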

Quick Start: Crawl, Index, and Search

Let's run a pre-defined job, generate its LLM artifacts, and query it.

  1. List available crawl jobs:

    bun run cli list
  2. Run a single job: This command runs all configurations for the zod job and saves the output to output/jobs/zod.json.

    bun run cli single zod
  3. Generate LLM artifacts: This script processes the JSON output, creates a text file, generates embeddings, and saves a vector index.

    bun run generate:llm-artifacts
  4. Start the API server:

    bun run start:server
  5. Perform a semantic search: Query the indexed zod documentation for information about "refinements".

    curl "http://localhost:5000/get/zod/llms.txt?subject=what%20are%20refinements"

🛠️ Usage

Context0 can be operated in three main ways:

1. Command-Line Interface (CLI)

The CLI is ideal for direct control and running jobs synchronously.

  • List all jobs:

    bun run cli list
  • Run a single job (e.g., react-19): The CLI will run all configs found in react-19.ts and aggregate the results.

    bun run cli single react-19
  • Run a batch of specific jobs: This runs the prisma and trpc jobs sequentially.

    bun run cli batch prisma trpc
  • Run in interactive mode: Running single or batch without arguments will launch an interactive prompt to select jobs.

    bun run cli batch

2. API Server & Worker

For automated workflows, the API server and worker provide a robust, asynchronous system.

  1. Start the Worker: The worker polls the queue for new jobs to process.

    bun run start:worker
  2. Start the API Server: The server exposes endpoints to queue jobs and check their status.

    bun run start:server
  3. (Optional) Start Both:

    bun start

Key API Endpoints:

  • POST /crawl: Queue a crawl job by name (e.g., { "name": "zod" }).
  • GET /crawl/status/:jobId: Check the status of a job (pending, running, completed, failed).
  • GET /crawl/results/:jobId: Download the JSON output of a completed job.
  • GET /get/:jobName/llms.txt: Stream the full text knowledge file.
  • GET /get/:jobName/llms.txt?subject=<query>: Perform a semantic search on the knowledge file.
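
Example requests, assuming the server is running on port 5000 as in the Quick Start (if API_KEY is set, you will also need to authenticate; the exact mechanism is not shown here):

    # Queue the zod job
    curl -X POST "http://localhost:5000/crawl" \
      -H "Content-Type: application/json" \
      -d '{"name": "zod"}'

    # Poll its status (the queue response is assumed to include the jobId)
    curl "http://localhost:5000/crawl/status/<jobId>"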

The full OpenAPI documentation is available at /api-docs while the server is running.

3. LLM Features

The core value of Context0 is its ability to create searchable knowledge bases.

  • Artifacts: For each job, the system generates "LLM artifacts":

    • data/llms/{job-name}.txt: A clean, concatenated text version of all crawled content.
    • data/indexes/{job-name}.index: An HNSWLib vector index for fast semantic search.
  • Generation: Artifacts are generated automatically when a job completes via the worker. You can also generate or update them manually:

    # Generate artifacts for any outdated job outputs
    bun run generate:llm-artifacts
    
    # Check if any artifacts are out of sync with their source files
    bun run check:llm-artifacts
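
The generated text file is plain text, so it can be consumed directly. A minimal TypeScript sketch using Bun's file API (the zod path assumes you ran the Quick Start):

    // Preview the first 200 characters of the generated knowledge file.
    const knowledge = await Bun.file('data/llms/zod.txt').text();
    console.log(knowledge.slice(0, 200));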

⚙️ Configuration

All crawl jobs are defined as TypeScript files in the configurations/jobs/ directory.

Job vs. Config

  • A Job is a file in configurations/jobs/, like prisma.ts. Its name is derived from the filename (e.g., prisma).
  • A Config is a single crawl configuration object within a job file. A job file can export a single config or an array of configs.

Example: A Single-Config Job (zod.ts)

import { defineJob } from '../types.js';

export default defineJob({
  entry: 'https://zod.dev',
  match: [
    'https://zod.dev/basics',
    'https://zod.dev/api',
    // ... more URLs
  ],
  selector: 'article',
});

Example: A Multi-Config Job (next-js-16.ts)

import { defineJob } from '../types.js';

export default defineJob([
  {
    entry: 'https://nextjs.org/docs/app/getting-started/proxy',
    match: ['https://nextjs.org/docs/app/api-reference/**'],
    selector: 'article',
  },
  {
    entry: 'https://nextjs.org/docs/architecture/accessibility',
    match: ['https://nextjs.org/docs/architecture/accessibility'],
    selector: 'article',
  },
]);

Main Configuration Fields

  • entry: Required. The starting URL for the crawl.
  • match: Required. A glob pattern or array of patterns for URLs to crawl.
  • selector: Required. A CSS selector or XPath expression used to extract content from each page.
  • exclude: An optional glob pattern or array of patterns to exclude from the crawl.
  • outputFileName: A custom filename for the output JSON. Defaults to {job-name}.json.
  • maxFileSize: Maximum output file size in MB before the output is split.
  • onVisitPage: A custom function to perform actions on a page (e.g., click buttons).
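
A hedged sketch of an onVisitPage hook. The exact callback signature is an assumption here (a Playwright page handle is assumed, since the project installs Playwright), and the site and selector are hypothetical:

import { defineJob } from '../types.js';

export default defineJob({
  entry: 'https://example-docs.dev',
  match: ['https://example-docs.dev/**'],
  selector: 'article',
  // Assumed signature: the crawler passes a Playwright page to the hook.
  onVisitPage: async ({ page }) => {
    // Expand collapsed sections before content extraction runs.
    await page.click('button.expand-all');
  },
});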

Global Configuration

Global settings like maxPagesToCrawl and maxTokens (for output splitting) can be adjusted in configurations/global.config.ts.
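
For reference, a minimal sketch of what configurations/global.config.ts might look like; only maxPagesToCrawl and maxTokens are named in this README, so the export shape and the values are assumptions:

// Hypothetical shape and values; check the actual file in the repository.
export default {
  maxPagesToCrawl: 500, // illustrative cap on pages visited per crawl
  maxTokens: 2_000_000, // illustrative token threshold for splitting outputs
};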


💻 Development Commands

  • bun run build: Compile TypeScript to the dist/ directory.
  • bun run dev: Run the CLI in batch mode (interactive).
  • bun run fmt: Format the codebase with Prettier.
  • bun run swagger: Regenerate the Swagger/OpenAPI documentation.
  • bun run generate:jobs: Important: Run this after adding or removing a job file to update the central job registry.
