Conversation
Pull request overview
This pull request introduces support for proxying crates.io packages alongside the existing RubyGems proxy functionality. The implementation adds a sparse registry index handler and crate download support following the Cargo sparse registry protocol. Additionally, the PR refactors caching logic into a reusable http_cache module and removes several gem-specific features including tree-sitter symbol extraction, binary architecture validation, and recursive SBOM generation.
Changes:
- Adds crates.io sparse registry protocol support with index caching and download proxying
- Refactors HTTP caching logic into a dedicated http_cache module for reuse across ecosystems
- Removes tree-sitter, goblin, and other dependencies related to Ruby-specific analysis features
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/crates.rs | Module declaration for crates.io support |
| src/crates/types.rs | Index path computation and config types for sparse registry |
| src/crates/handlers.rs | Sparse index request handler with caching |
| src/http_cache.rs | Extracted HTTP caching utilities with ETag/Last-Modified support |
| src/proxy/types.rs | Added crate download path parsing and validation |
| src/proxy.rs | Integrated crates handlers and refactored compact index to use http_cache |
| src/gem_metadata/*.rs | Removed modules for symbols, binary_arch, indexer, version_req |
| src/gem_metadata/sbom.rs | Simplified SBOM generation, removed recursive resolution |
| src/config/database.rs | Removed max_connections field from database config |
| Cargo.toml | Removed dependencies: tree-sitter, goblin, semver, and others |
```rust
pub fn index_path(name: &str) -> String {
    let name_lower = name.to_lowercase();
    match name_lower.len() {
        0 => panic!("empty crate name"),
```
Using panic! for empty crate names is inappropriate for a library function that processes user input. This should return a Result or Option instead to allow proper error handling. If a malicious or malformed request provides an empty crate name, this will crash the server.
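A minimal sketch of the Option-returning shape; the non-empty arms below are assumed to mirror the standard sparse-index layout (`1/`, `2/`, `3/{first char}/`, then `{first two}/{next two}/`) that the existing function presumably already implements:

```rust
/// Compute the sparse-index path for a crate, returning None instead of
/// panicking on an empty name so the handler can answer with 400/404.
pub fn index_path(name: &str) -> Option<String> {
    // crates.io names are ASCII, so byte slicing on the lowercased name is safe.
    let name_lower = name.to_lowercase();
    let path = match name_lower.len() {
        0 => return None,
        1 => format!("1/{}", name_lower),
        2 => format!("2/{}", name_lower),
        3 => format!("3/{}/{}", &name_lower[..1], name_lower),
        _ => format!("{}/{}/{}", &name_lower[..2], &name_lower[2..4], name_lower),
    };
    Some(path)
}
```

Callers such as handle_sparse_index can then map None to a NOT_FOUND response rather than crashing the worker.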
```rust
async fn fetch_with_headers(url: &str, extra_headers: &HeaderMap) -> Result<Response<Body>> {
    let client = EasyHttpWebClient::default();
```
Similar to fetch_crate, this function creates a new EasyHttpWebClient on every call. This is inefficient for index requests which may be frequent. Consider reusing a client instance to enable connection pooling.
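One possible shape for sharing a client across index requests, assuming EasyHttpWebClient is cheap to build once and safe to reuse across tasks (Send + Sync); if it is not, holding it as a field on the proxy, as sketched for fetch_crate further down, works equally well:

```rust
use std::sync::OnceLock;

use rama::http::client::EasyHttpWebClient;

// One lazily-initialized client for all sparse-index fetches, so repeated
// index requests can reuse connections instead of rebuilding a client.
static INDEX_CLIENT: OnceLock<EasyHttpWebClient> = OnceLock::new();

fn index_client() -> &'static EasyHttpWebClient {
    INDEX_CLIENT.get_or_init(EasyHttpWebClient::default)
}
```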
```rust
Ok(resp) => {
    ctx.cache = CacheStatus::Hit; // May be revalidated, but close enough
```
The cache status is hard-coded to CacheStatus::Hit at line 154, but the actual outcome from fetch_cached_text could be Miss, Revalidated, or Pass. The comment acknowledges this is not accurate. This will cause incorrect cache metrics and logging. The actual cache outcome should be extracted from the result and used instead.
Suggested change:

```diff
- Ok(resp) => {
-     ctx.cache = CacheStatus::Hit; // May be revalidated, but close enough
+ Ok((resp, cache_status)) => {
+     ctx.cache = cache_status;
```
```rust
pub fn from_crate_download_path(path: &str) -> Option<Self> {
    let parts = path.strip_prefix("/api/v1/crates/")?;
    let segments: Vec<&str> = parts.split('/').collect();
    if segments.len() < 3 || segments[2] != "download" {
```
The validation only checks if there are fewer than 3 segments, but the code proceeds to access segments[2] without verifying the length is exactly 3. If there are more than 3 segments (e.g., /api/v1/crates/name/version/download/extra), this will pass validation but may represent an invalid path. Consider checking for exactly 3 segments or using segments.get(2) instead of direct indexing.
Suggested change:

```diff
- if segments.len() < 3 || segments[2] != "download" {
+ if segments.len() != 3 || segments[2] != "download" {
```
```rust
pub async fn handle_sparse_index(
    path: &str,
    our_base: &str,
    storage: Arc<FilesystemStorage>,
    index: Arc<CacheBackend>,
) -> Result<Response<Body>> {
    // Handle config.json specially - serve our own
    if path == "/index/config.json" || path == "config.json" {
        return serve_index_config(our_base);
    }

    // Extract crate name from path
    // Path format: /index/{prefix}/{crate_name}
    let crate_name = path
        .rsplit('/')
        .next()
        .filter(|s| !s.is_empty())
        .context("invalid index path")?;

    // Validate the path matches expected prefix
    let expected_path = index_path(crate_name);
    let clean_path = path.trim_start_matches("/index/");

    if clean_path != expected_path {
        return respond_text(StatusCode::NOT_FOUND, "crate not found");
    }

    let storage_path = format!("crates_index/{}", expected_path);
    let meta_key = format!("crates:index:{}", crate_name);
    let upstream_url = format!("https://index.crates.io/{}", expected_path);

    let result = fetch_cached_text(
        storage.as_ref(),
        index.as_ref(),
        &storage_path,
        &meta_key,
        "text/plain; charset=utf-8",
        "public, max-age=60",
        false,
        MetaStoreMode::BestEffort,
        false,
        |headers| async move { fetch_with_headers(&upstream_url, &headers).await },
        |body| async move { Ok(body) },
    )
    .await?;

    Ok(result.response)
}
```
The handle_sparse_index function lacks test coverage for the main index path handling logic. Only the config.json serving is tested. Consider adding tests for valid and invalid index paths, caching behavior, and error cases.
```rust
pub async fn fetch_cached_text<F, Fut, T, TFut>(
    storage: &FilesystemStorage,
    index: &CacheBackend,
    storage_path: &str,
    meta_key: &str,
    content_type: &str,
    cache_control: &str,
    include_content_length: bool,
    meta_mode: MetaStoreMode,
    strip_transfer_encoding: bool,
    fetch: F,
    transform: T,
) -> Result<CachedFetchResult>
where
    F: FnOnce(HeaderMap) -> Fut,
    Fut: Future<Output = Result<Response<Body>>>,
    T: FnOnce(Vec<u8>) -> TFut,
    TFut: Future<Output = Result<Vec<u8>>>,
{
    let cached_bytes = tokio::fs::read(storage.resolve(storage_path)).await.ok();
    let had_cache = cached_bytes.is_some();

    let cached_meta = load_cached_meta(index, meta_key, meta_mode).await?;
    let request_headers = build_conditional_headers(&cached_meta);

    let response = fetch(request_headers).await?;
    let status = response.status();

    if status == StatusCode::NOT_MODIFIED && cached_bytes.is_some() {
        let body = cached_bytes.unwrap_or_default();
        let meta = cached_meta.unwrap_or_default();
        let transformed = transform(body).await?;
        let response = build_cached_response(
            &transformed,
            &meta,
            content_type,
            cache_control,
            include_content_length,
        )?;
        return Ok(CachedFetchResult {
            response,
            outcome: CacheOutcome::Revalidated,
        });
    }

    if status.is_success() {
        use rama::http::body::util::BodyExt;
        let headers = response.headers().clone();
        let body = response
            .into_body()
            .collect()
            .await
            .context("reading cached response body")?
            .to_bytes();
        let meta = CacheEntryMeta::from_headers(&headers);

        persist_body(storage, storage_path, &body).await?;
        store_cached_meta(index, meta_key, &meta, meta_mode).await?;

        let transformed = transform(body.to_vec()).await?;
        let response = build_cached_response(
            &transformed,
            &meta,
            content_type,
            cache_control,
            include_content_length,
        )?;
        let outcome = if had_cache {
            CacheOutcome::Revalidated
        } else {
            CacheOutcome::Miss
        };
        return Ok(CachedFetchResult { response, outcome });
    }

    let forwarded = forward_response(response, strip_transfer_encoding).await?;
    Ok(CachedFetchResult {
        response: forwarded,
        outcome: CacheOutcome::Pass,
    })
}
```
The fetch_cached_text function, which is a core caching utility handling HTTP conditional requests, ETag/Last-Modified headers, and multiple cache outcomes, lacks test coverage. Given its complexity and importance, it should have tests covering cache hits, misses, revalidation, error handling, and the transform callback functionality.
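Because the upstream fetch and the transform are injected as closures, the happy path can be exercised without touching the network: hand the function a canned 200 response and assert the outcome. In the sketch below, `test_backend()` is a hypothetical helper for building a tempdir-backed FilesystemStorage and an empty CacheBackend, and the Body construction assumes rama's Body can be built from a static string:

```rust
#[tokio::test]
async fn cold_cache_success_is_a_miss() {
    // Hypothetical helper: construct storage and cache backend however the
    // rest of the test suite does.
    let (storage, index) = test_backend();

    let result = fetch_cached_text(
        &storage,
        &index,
        "crates_index/se/rd/serde",
        "crates:index:serde",
        "text/plain; charset=utf-8",
        "public, max-age=60",
        false,
        MetaStoreMode::BestEffort,
        false,
        // Fake upstream: a plain 200 with a body, no network involved.
        |_headers| async move {
            Ok(Response::builder()
                .status(StatusCode::OK)
                .body(Body::from(r#"{"name":"serde"}"#))
                .unwrap())
        },
        // Identity transform, mirroring the index handler.
        |body| async move { Ok(body) },
    )
    .await
    .unwrap();

    assert!(matches!(result.outcome, CacheOutcome::Miss));
}
```

Similar cases could cover a 304 with a warm cache (Revalidated) and a non-success upstream status (Pass).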
```rust
let crate_name = path
    .rsplit('/')
    .next()
    .filter(|s| !s.is_empty())
    .context("invalid index path")?;
```
The crate name extracted from the path is not validated for security or correctness. Cargo crate names have specific requirements (lowercase alphanumeric plus hyphens and underscores, starting with a letter). Without validation, malicious inputs could potentially cause issues in file paths or upstream requests. Consider adding validation to ensure the crate name matches the expected pattern.
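One possible shape for that check; the helper name and the exact limits below are assumptions drawn from the policy described in this comment, not from the PR:

```rust
/// Accept only names that look like crates.io package names: ASCII letters,
/// digits, '-' and '_', starting with a letter, at most 64 characters.
fn is_valid_crate_name(name: &str) -> bool {
    let mut chars = name.chars();
    let first_ok = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic());
    first_ok
        && name.len() <= 64
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}
```

handle_sparse_index could call this right after extracting crate_name and return NOT_FOUND for anything that fails, which also keeps untrusted input out of storage_path and upstream_url.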
```rust
}
// Crates.io sparse index (with caching)
p if p.starts_with("/index/") => {
    let our_base = format!("http://{}:{}", self.config.server.host, self.config.server.port);
```
The our_base URL is constructed using the server host and port from config, but it doesn't account for the protocol (http vs https) or any base path. If the server is behind a reverse proxy with HTTPS or a different base path, the config.json served to cargo clients will have incorrect download URLs. Consider making this configurable or detecting it from request headers.
Suggested change:

```diff
- let our_base = format!("http://{}:{}", self.config.server.host, self.config.server.port);
+ // Derive our_base from the incoming request, so that generated URLs
+ // respect reverse proxies (protocol, host, and optional base path).
+ let headers = req.headers();
+ // Determine scheme (protocol).
+ let scheme = headers
+     .get("x-forwarded-proto")
+     .and_then(|v| v.to_str().ok())
+     .unwrap_or("http");
+ // Determine host (and port if present).
+ let host = headers
+     .get("x-forwarded-host")
+     .and_then(|v| v.to_str().ok())
+     .map(|s| s.to_owned())
+     .or_else(|| {
+         headers
+             .get(header::HOST)
+             .and_then(|v| v.to_str().ok().map(|s| s.to_owned()))
+     })
+     .unwrap_or_else(|| {
+         format!("{}:{}", self.config.server.host, self.config.server.port)
+     });
+ // Optional base path provided by reverse proxy.
+ let prefix = headers
+     .get("x-forwarded-prefix")
+     .and_then(|v| v.to_str().ok())
+     .unwrap_or("")
+     .trim_end_matches('/');
+ let mut our_base = format!("{}://{}", scheme, host);
+ if !prefix.is_empty() {
+     if !prefix.starts_with('/') {
+         our_base.push('/');
+     }
+     our_base.push_str(prefix);
+ }
```
```rust
pub fn from_crate_download_path(path: &str) -> Option<Self> {
    let parts = path.strip_prefix("/api/v1/crates/")?;
    let segments: Vec<&str> = parts.split('/').collect();
    if segments.len() < 3 || segments[2] != "download" {
        return None;
    }
    let name = segments[0];
    let version = segments[1];

    // Validate - no path traversal
    if name.contains("..")
        || name.contains('/')
        || version.contains("..")
        || version.contains('/')
    {
        tracing::warn!(name = %name, version = %version, "Rejected path traversal in crate request");
        return None;
    }

    let file_name = format!("{}-{}.crate", name, version);
    let relative_path = format!("crates/{}/{}", name, file_name);

    Some(Self {
        kind: AssetKind::Crate,
        name: name.to_string(),
        version: version.to_string(),
        platform: None,
        file_name,
        relative_path,
    })
}
```
The from_crate_download_path function lacks test coverage. Given the security-sensitive nature of path validation (checking for path traversal), this function should have comprehensive tests covering valid paths, invalid paths, path traversal attempts, and edge cases.
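A sketch of the kind of coverage this asks for; `AssetPath` is a stand-in for whatever type actually owns from_crate_download_path in src/proxy/types.rs, and the exact assertions would shift if the stricter `!= 3` check suggested above is adopted:

```rust
#[cfg(test)]
mod crate_download_path_tests {
    use super::*;

    #[test]
    fn parses_a_well_formed_download_path() {
        // AssetPath is a placeholder name for the owning type.
        let asset = AssetPath::from_crate_download_path("/api/v1/crates/serde/1.0.210/download")
            .expect("well-formed path should parse");
        assert_eq!(asset.name, "serde");
        assert_eq!(asset.version, "1.0.210");
        assert_eq!(asset.file_name, "serde-1.0.210.crate");
        assert_eq!(asset.relative_path, "crates/serde/serde-1.0.210.crate");
    }

    #[test]
    fn rejects_traversal_and_malformed_paths() {
        // Wrong API prefix.
        assert!(AssetPath::from_crate_download_path("/api/v1/gems/foo/1.0/download").is_none());
        // Too few segments.
        assert!(AssetPath::from_crate_download_path("/api/v1/crates/serde/1.0.210").is_none());
        // Path traversal in the crate name.
        assert!(AssetPath::from_crate_download_path("/api/v1/crates/../1.0.0/download").is_none());
    }
}
```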
```rust
async fn fetch_crate(&self, name: &str, version: &str) -> Result<Response<Body>> {
    use rama::http::client::EasyHttpWebClient;

    let url = format!(
        "https://static.crates.io/crates/{}/{}-{}.crate",
        name, name, version
    );

    let client = EasyHttpWebClient::default();
```
The fetch_crate function creates a new EasyHttpWebClient on every call. This is inefficient as it doesn't reuse connections. Consider storing a dedicated HTTP client for crates.io requests in the VeinProxy struct to enable connection pooling and better performance.
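A rough sketch of that suggestion; the field name and constructor shape are assumptions about how VeinProxy is built, and the idea is simply to construct the client once and let fetch_crate (and fetch_with_headers) borrow it:

```rust
use rama::http::client::EasyHttpWebClient;

pub struct VeinProxy {
    // ... existing fields (config, storage, cache, ...) ...
    /// Long-lived client for crates.io traffic, built once at startup so
    /// downloads and index fetches can pool connections.
    crates_client: EasyHttpWebClient,
}

impl VeinProxy {
    pub fn new(/* existing constructor arguments */) -> Self {
        Self {
            // ... existing field initialization ...
            crates_client: EasyHttpWebClient::default(),
        }
    }
}
```

fetch_crate would then issue its request through self.crates_client instead of calling EasyHttpWebClient::default() on every download.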
While cleaning up this codebase, I noticed we could reuse Vein for other ecosystems.
I removed the tree-sitter and binary features (I will re-add them later).
The plan is:
- [x] Add a crates.io-compatible API
- [ ] Add a Python-compatible API
- [ ] Add an npm-compatible API
Vein will act as a proxy: fetch once from upstream, use forever.
Other ecosystems, like Go, use repositories directly.
I might add Dart, but I need to study its API first so it won't conflict.