Skip to content

feat: introduce crates proxing#10

Merged
seuros merged 4 commits intomasterfrom
crates
Jan 20, 2026
Merged

feat: introduce crates proxing#10
seuros merged 4 commits intomasterfrom
crates

Conversation

@seuros
Copy link
Collaborator

@seuros seuros commented Jan 20, 2026

While cleaning this codebase, i noticed we could reuse vein for other ecosystems.

I removed the tree-sitter and binary feature (i will readd them later).

The plan is :
[X] Add Crates compatible api
[ ] Add Python compatible api
[ ] Add Npm compatible api

Vein will act as a proxy.

Fetch once from upstream , use forever.

Other ecosystem like Go use repos directly.

I might Add Dart (But i need to study the api first so it won't conflict).

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces support for proxying crates.io packages alongside the existing RubyGems proxy functionality. The implementation adds a sparse registry index handler and crate download support following the Cargo sparse registry protocol. Additionally, the PR refactors caching logic into a reusable http_cache module and removes several gem-specific features including tree-sitter symbol extraction, binary architecture validation, and recursive SBOM generation.

Changes:

  • Adds crates.io sparse registry protocol support with index caching and download proxying
  • Refactors HTTP caching logic into a dedicated http_cache module for reuse across ecosystems
  • Removes tree-sitter, goblin, and other dependencies related to Ruby-specific analysis features

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/crates.rs Module declaration for crates.io support
src/crates/types.rs Index path computation and config types for sparse registry
src/crates/handlers.rs Sparse index request handler with caching
src/http_cache.rs Extracted HTTP caching utilities with ETag/Last-Modified support
src/proxy/types.rs Added crate download path parsing and validation
src/proxy.rs Integrated crates handlers and refactored compact index to use http_cache
src/gem_metadata/*.rs Removed modules for symbols, binary_arch, indexer, version_req
src/gem_metadata/sbom.rs Simplified SBOM generation, removed recursive resolution
src/config/database.rs Removed max_connections field from database config
Cargo.toml Removed dependencies: tree-sitter, goblin, semver, and others

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pub fn index_path(name: &str) -> String {
let name_lower = name.to_lowercase();
match name_lower.len() {
0 => panic!("empty crate name"),
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using panic! for empty crate names is inappropriate for a library function that processes user input. This should return a Result or Option instead to allow proper error handling. If a malicious or malformed request provides an empty crate name, this will crash the server.

Copilot uses AI. Check for mistakes.
Comment on lines +74 to +75
async fn fetch_with_headers(url: &str, extra_headers: &HeaderMap) -> Result<Response<Body>> {
let client = EasyHttpWebClient::default();
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to fetch_crate, this function creates a new EasyHttpWebClient on every call. This is inefficient for index requests which may be frequent. Consider reusing a client instance to enable connection pooling.

Copilot uses AI. Check for mistakes.
Comment on lines +153 to +154
Ok(resp) => {
ctx.cache = CacheStatus::Hit; // May be revalidated, but close enough
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache status is hard-coded to CacheStatus::Hit at line 154, but the actual outcome from fetch_cached_text could be Miss, Revalidated, or Pass. The comment acknowledges this is not accurate. This will cause incorrect cache metrics and logging. The actual cache outcome should be extracted from the result and used instead.

Suggested change
Ok(resp) => {
ctx.cache = CacheStatus::Hit; // May be revalidated, but close enough
Ok((resp, cache_status)) => {
ctx.cache = cache_status;

Copilot uses AI. Check for mistakes.
pub fn from_crate_download_path(path: &str) -> Option<Self> {
let parts = path.strip_prefix("/api/v1/crates/")?;
let segments: Vec<&str> = parts.split('/').collect();
if segments.len() < 3 || segments[2] != "download" {
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation only checks if there are fewer than 3 segments, but the code proceeds to access segments[2] without verifying the length is exactly 3. If there are more than 3 segments (e.g., /api/v1/crates/name/version/download/extra), this will pass validation but may represent an invalid path. Consider checking for exactly 3 segments or using segments.get(2) instead of direct indexing.

Suggested change
if segments.len() < 3 || segments[2] != "download" {
if segments.len() != 3 || segments[2] != "download" {

Copilot uses AI. Check for mistakes.
Comment on lines +24 to +71
pub async fn handle_sparse_index(
path: &str,
our_base: &str,
storage: Arc<FilesystemStorage>,
index: Arc<CacheBackend>,
) -> Result<Response<Body>> {
// Handle config.json specially - serve our own
if path == "/index/config.json" || path == "config.json" {
return serve_index_config(our_base);
}

// Extract crate name from path
// Path format: /index/{prefix}/{crate_name}
let crate_name = path
.rsplit('/')
.next()
.filter(|s| !s.is_empty())
.context("invalid index path")?;

// Validate the path matches expected prefix
let expected_path = index_path(crate_name);
let clean_path = path.trim_start_matches("/index/");

if clean_path != expected_path {
return respond_text(StatusCode::NOT_FOUND, "crate not found");
}

let storage_path = format!("crates_index/{}", expected_path);
let meta_key = format!("crates:index:{}", crate_name);
let upstream_url = format!("https://index.crates.io/{}", expected_path);

let result = fetch_cached_text(
storage.as_ref(),
index.as_ref(),
&storage_path,
&meta_key,
"text/plain; charset=utf-8",
"public, max-age=60",
false,
MetaStoreMode::BestEffort,
false,
|headers| async move { fetch_with_headers(&upstream_url, &headers).await },
|body| async move { Ok(body) },
)
.await?;

Ok(result.response)
}
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handle_sparse_index function lacks test coverage for the main index path handling logic. Only the config.json serving is tested. Consider adding tests for valid and invalid index paths, caching behavior, and error cases.

Copilot uses AI. Check for mistakes.
Comment on lines +47 to +127
pub async fn fetch_cached_text<F, Fut, T, TFut>(
storage: &FilesystemStorage,
index: &CacheBackend,
storage_path: &str,
meta_key: &str,
content_type: &str,
cache_control: &str,
include_content_length: bool,
meta_mode: MetaStoreMode,
strip_transfer_encoding: bool,
fetch: F,
transform: T,
) -> Result<CachedFetchResult>
where
F: FnOnce(HeaderMap) -> Fut,
Fut: Future<Output = Result<Response<Body>>>,
T: FnOnce(Vec<u8>) -> TFut,
TFut: Future<Output = Result<Vec<u8>>>,
{
let cached_bytes = tokio::fs::read(storage.resolve(storage_path)).await.ok();
let had_cache = cached_bytes.is_some();

let cached_meta = load_cached_meta(index, meta_key, meta_mode).await?;
let request_headers = build_conditional_headers(&cached_meta);

let response = fetch(request_headers).await?;
let status = response.status();

if status == StatusCode::NOT_MODIFIED && cached_bytes.is_some() {
let body = cached_bytes.unwrap_or_default();
let meta = cached_meta.unwrap_or_default();
let transformed = transform(body).await?;
let response = build_cached_response(
&transformed,
&meta,
content_type,
cache_control,
include_content_length,
)?;
return Ok(CachedFetchResult {
response,
outcome: CacheOutcome::Revalidated,
});
}

if status.is_success() {
use rama::http::body::util::BodyExt;
let headers = response.headers().clone();
let body = response
.into_body()
.collect()
.await
.context("reading cached response body")?
.to_bytes();
let meta = CacheEntryMeta::from_headers(&headers);

persist_body(storage, storage_path, &body).await?;
store_cached_meta(index, meta_key, &meta, meta_mode).await?;

let transformed = transform(body.to_vec()).await?;
let response = build_cached_response(
&transformed,
&meta,
content_type,
cache_control,
include_content_length,
)?;
let outcome = if had_cache {
CacheOutcome::Revalidated
} else {
CacheOutcome::Miss
};
return Ok(CachedFetchResult { response, outcome });
}

let forwarded = forward_response(response, strip_transfer_encoding).await?;
Ok(CachedFetchResult {
response: forwarded,
outcome: CacheOutcome::Pass,
})
}
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fetch_cached_text function, which is a core caching utility handling HTTP conditional requests, ETag/Last-Modified headers, and multiple cache outcomes, lacks test coverage. Given its complexity and importance, it should have tests covering cache hits, misses, revalidation, error handling, and the transform callback functionality.

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +41
let crate_name = path
.rsplit('/')
.next()
.filter(|s| !s.is_empty())
.context("invalid index path")?;
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crate name extracted from the path is not validated for security or correctness. Cargo crate names have specific requirements (lowercase alphanumeric plus hyphens and underscores, starting with a letter). Without validation, malicious inputs could potentially cause issues in file paths or upstream requests. Consider adding validation to ensure the crate name matches the expected pattern.

Copilot uses AI. Check for mistakes.
}
// Crates.io sparse index (with caching)
p if p.starts_with("/index/") => {
let our_base = format!("http://{}:{}", self.config.server.host, self.config.server.port);
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The our_base URL is constructed using the server host and port from config, but it doesn't account for the protocol (http vs https) or any base path. If the server is behind a reverse proxy with HTTPS or a different base path, the config.json served to cargo clients will have incorrect download URLs. Consider making this configurable or detecting it from request headers.

Suggested change
let our_base = format!("http://{}:{}", self.config.server.host, self.config.server.port);
// Derive our_base from the incoming request, so that generated URLs
// respect reverse proxies (protocol, host, and optional base path).
let headers = req.headers();
// Determine scheme (protocol).
let scheme = headers
.get("x-forwarded-proto")
.and_then(|v| v.to_str().ok())
.unwrap_or("http");
// Determine host (and port if present).
let host = headers
.get("x-forwarded-host")
.and_then(|v| v.to_str().ok())
.map(|s| s.to_owned())
.or_else(|| {
headers
.get(header::HOST)
.and_then(|v| v.to_str().ok().map(|s| s.to_owned()))
})
.unwrap_or_else(|| {
format!("{}:{}", self.config.server.host, self.config.server.port)
});
// Optional base path provided by reverse proxy.
let prefix = headers
.get("x-forwarded-prefix")
.and_then(|v| v.to_str().ok())
.unwrap_or("")
.trim_end_matches('/');
let mut our_base = format!("{}://{}", scheme, host);
if !prefix.is_empty() {
if !prefix.starts_with('/') {
our_base.push('/');
}
our_base.push_str(prefix);
}

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +117
pub fn from_crate_download_path(path: &str) -> Option<Self> {
let parts = path.strip_prefix("/api/v1/crates/")?;
let segments: Vec<&str> = parts.split('/').collect();
if segments.len() < 3 || segments[2] != "download" {
return None;
}
let name = segments[0];
let version = segments[1];

// Validate - no path traversal
if name.contains("..")
|| name.contains('/')
|| version.contains("..")
|| version.contains('/')
{
tracing::warn!(name = %name, version = %version, "Rejected path traversal in crate request");
return None;
}

let file_name = format!("{}-{}.crate", name, version);
let relative_path = format!("crates/{}/{}", name, file_name);

Some(Self {
kind: AssetKind::Crate,
name: name.to_string(),
version: version.to_string(),
platform: None,
file_name,
relative_path,
})
}
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The from_crate_download_path function lacks test coverage. Given the security-sensitive nature of path validation (checking for path traversal), this function should have comprehensive tests covering valid paths, invalid paths, path traversal attempts, and edge cases.

Copilot uses AI. Check for mistakes.
Comment on lines +394 to +402
async fn fetch_crate(&self, name: &str, version: &str) -> Result<Response<Body>> {
use rama::http::client::EasyHttpWebClient;

let url = format!(
"https://static.crates.io/crates/{}/{}-{}.crate",
name, name, version
);

let client = EasyHttpWebClient::default();
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fetch_crate function creates a new EasyHttpWebClient on every call. This is inefficient as it doesn't reuse connections. Consider storing a dedicated HTTP client for crates.io requests in the VeinProxy struct to enable connection pooling and better performance.

Copilot uses AI. Check for mistakes.
@seuros seuros merged commit 28668a5 into master Jan 20, 2026
2 checks passed
@seuros seuros deleted the crates branch January 20, 2026 20:55
@github-actions github-actions bot mentioned this pull request Jan 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant