Conversation
Pull request overview
This pull request introduces support for proxying crates.io packages alongside the existing RubyGems proxy functionality. The implementation adds a sparse registry index handler and crate download support following the Cargo sparse registry protocol. Additionally, the PR refactors caching logic into a reusable http_cache module and removes several gem-specific features including tree-sitter symbol extraction, binary architecture validation, and recursive SBOM generation.
Changes:
- Adds crates.io sparse registry protocol support with index caching and download proxying
- Refactors HTTP caching logic into a dedicated http_cache module for reuse across ecosystems
- Removes tree-sitter, goblin, and other dependencies related to Ruby-specific analysis features
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/crates.rs | Module declaration for crates.io support |
| src/crates/types.rs | Index path computation and config types for sparse registry |
| src/crates/handlers.rs | Sparse index request handler with caching |
| src/http_cache.rs | Extracted HTTP caching utilities with ETag/Last-Modified support |
| src/proxy/types.rs | Added crate download path parsing and validation |
| src/proxy.rs | Integrated crates handlers and refactored compact index to use http_cache |
| src/gem_metadata/*.rs | Removed modules for symbols, binary_arch, indexer, version_req |
| src/gem_metadata/sbom.rs | Simplified SBOM generation, removed recursive resolution |
| src/config/database.rs | Removed max_connections field from database config |
| Cargo.toml | Removed dependencies: tree-sitter, goblin, semver, and others |
```rust
pub fn index_path(name: &str) -> String {
    let name_lower = name.to_lowercase();
    match name_lower.len() {
        0 => panic!("empty crate name"),
```
Using panic! for empty crate names is inappropriate for a library function that processes user input. This should return a Result or Option instead to allow proper error handling. If a malicious or malformed request provides an empty crate name, this will crash the server.
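A minimal sketch of the Option-returning shape; the non-empty arms below are assumed to mirror the standard sparse-index layout (`1/`, `2/`, `3/{first char}/`, then `{first two}/{next two}/`) that the existing function presumably already implements:

```rust
/// Compute the sparse-index path for a crate, returning None instead of
/// panicking on an empty name so the handler can answer with 400/404.
pub fn index_path(name: &str) -> Option<String> {
    // crates.io names are ASCII, so byte slicing on the lowercased name is safe.
    let name_lower = name.to_lowercase();
    let path = match name_lower.len() {
        0 => return None,
        1 => format!("1/{}", name_lower),
        2 => format!("2/{}", name_lower),
        3 => format!("3/{}/{}", &name_lower[..1], name_lower),
        _ => format!("{}/{}/{}", &name_lower[..2], &name_lower[2..4], name_lower),
    };
    Some(path)
}
```

Callers such as handle_sparse_index can then map None to a NOT_FOUND response rather than crashing the worker.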
```rust
async fn fetch_with_headers(url: &str, extra_headers: &HeaderMap) -> Result<Response<Body>> {
    let client = EasyHttpWebClient::default();
```
Similar to fetch_crate, this function creates a new EasyHttpWebClient on every call. This is inefficient for index requests which may be frequent. Consider reusing a client instance to enable connection pooling.
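One possible shape for sharing a client across index requests, assuming EasyHttpWebClient is cheap to build once and safe to reuse across tasks (Send + Sync); if it is not, holding it as a field on the proxy, as sketched for fetch_crate further down, works equally well:

```rust
use std::sync::OnceLock;

use rama::http::client::EasyHttpWebClient;

// One lazily-initialized client for all sparse-index fetches, so repeated
// index requests can reuse connections instead of rebuilding a client.
static INDEX_CLIENT: OnceLock<EasyHttpWebClient> = OnceLock::new();

fn index_client() -> &'static EasyHttpWebClient {
    INDEX_CLIENT.get_or_init(EasyHttpWebClient::default)
}
```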
```rust
Ok(resp) => {
    ctx.cache = CacheStatus::Hit; // May be revalidated, but close enough
```
The cache status is hard-coded to CacheStatus::Hit at line 154, but the actual outcome from fetch_cached_text could be Miss, Revalidated, or Pass. The comment acknowledges this is not accurate. This will cause incorrect cache metrics and logging. The actual cache outcome should be extracted from the result and used instead.
Suggested change:

```diff
- Ok(resp) => {
-     ctx.cache = CacheStatus::Hit; // May be revalidated, but close enough
+ Ok((resp, cache_status)) => {
+     ctx.cache = cache_status;
```
```rust
pub fn from_crate_download_path(path: &str) -> Option<Self> {
    let parts = path.strip_prefix("/api/v1/crates/")?;
    let segments: Vec<&str> = parts.split('/').collect();
    if segments.len() < 3 || segments[2] != "download" {
```
The validation only checks if there are fewer than 3 segments, but the code proceeds to access segments[2] without verifying the length is exactly 3. If there are more than 3 segments (e.g., /api/v1/crates/name/version/download/extra), this will pass validation but may represent an invalid path. Consider checking for exactly 3 segments or using segments.get(2) instead of direct indexing.
Suggested change:

```diff
- if segments.len() < 3 || segments[2] != "download" {
+ if segments.len() != 3 || segments[2] != "download" {
```
```rust
pub async fn handle_sparse_index(
    path: &str,
    our_base: &str,
    storage: Arc<FilesystemStorage>,
    index: Arc<CacheBackend>,
) -> Result<Response<Body>> {
    // Handle config.json specially - serve our own
    if path == "/index/config.json" || path == "config.json" {
        return serve_index_config(our_base);
    }

    // Extract crate name from path
    // Path format: /index/{prefix}/{crate_name}
    let crate_name = path
        .rsplit('/')
        .next()
        .filter(|s| !s.is_empty())
        .context("invalid index path")?;

    // Validate the path matches expected prefix
    let expected_path = index_path(crate_name);
    let clean_path = path.trim_start_matches("/index/");

    if clean_path != expected_path {
        return respond_text(StatusCode::NOT_FOUND, "crate not found");
    }

    let storage_path = format!("crates_index/{}", expected_path);
    let meta_key = format!("crates:index:{}", crate_name);
    let upstream_url = format!("https://index.crates.io/{}", expected_path);

    let result = fetch_cached_text(
        storage.as_ref(),
        index.as_ref(),
        &storage_path,
        &meta_key,
        "text/plain; charset=utf-8",
        "public, max-age=60",
        false,
        MetaStoreMode::BestEffort,
        false,
        |headers| async move { fetch_with_headers(&upstream_url, &headers).await },
        |body| async move { Ok(body) },
    )
    .await?;

    Ok(result.response)
}
```
The handle_sparse_index function lacks test coverage for the main index path handling logic. Only the config.json serving is tested. Consider adding tests for valid and invalid index paths, caching behavior, and error cases.
```rust
pub async fn fetch_cached_text<F, Fut, T, TFut>(
    storage: &FilesystemStorage,
    index: &CacheBackend,
    storage_path: &str,
    meta_key: &str,
    content_type: &str,
    cache_control: &str,
    include_content_length: bool,
    meta_mode: MetaStoreMode,
    strip_transfer_encoding: bool,
    fetch: F,
    transform: T,
) -> Result<CachedFetchResult>
where
    F: FnOnce(HeaderMap) -> Fut,
    Fut: Future<Output = Result<Response<Body>>>,
    T: FnOnce(Vec<u8>) -> TFut,
    TFut: Future<Output = Result<Vec<u8>>>,
{
    let cached_bytes = tokio::fs::read(storage.resolve(storage_path)).await.ok();
    let had_cache = cached_bytes.is_some();

    let cached_meta = load_cached_meta(index, meta_key, meta_mode).await?;
    let request_headers = build_conditional_headers(&cached_meta);

    let response = fetch(request_headers).await?;
    let status = response.status();

    if status == StatusCode::NOT_MODIFIED && cached_bytes.is_some() {
        let body = cached_bytes.unwrap_or_default();
        let meta = cached_meta.unwrap_or_default();
        let transformed = transform(body).await?;
        let response = build_cached_response(
            &transformed,
            &meta,
            content_type,
            cache_control,
            include_content_length,
        )?;
        return Ok(CachedFetchResult {
            response,
            outcome: CacheOutcome::Revalidated,
        });
    }

    if status.is_success() {
        use rama::http::body::util::BodyExt;
        let headers = response.headers().clone();
        let body = response
            .into_body()
            .collect()
            .await
            .context("reading cached response body")?
            .to_bytes();
        let meta = CacheEntryMeta::from_headers(&headers);

        persist_body(storage, storage_path, &body).await?;
        store_cached_meta(index, meta_key, &meta, meta_mode).await?;

        let transformed = transform(body.to_vec()).await?;
        let response = build_cached_response(
            &transformed,
            &meta,
            content_type,
            cache_control,
            include_content_length,
        )?;
        let outcome = if had_cache {
            CacheOutcome::Revalidated
        } else {
            CacheOutcome::Miss
        };
        return Ok(CachedFetchResult { response, outcome });
    }

    let forwarded = forward_response(response, strip_transfer_encoding).await?;
    Ok(CachedFetchResult {
        response: forwarded,
        outcome: CacheOutcome::Pass,
    })
}
```
The fetch_cached_text function, which is a core caching utility handling HTTP conditional requests, ETag/Last-Modified headers, and multiple cache outcomes, lacks test coverage. Given its complexity and importance, it should have tests covering cache hits, misses, revalidation, error handling, and the transform callback functionality.
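Because the upstream fetch and the transform are injected as closures, the happy path can be exercised without touching the network: hand the function a canned 200 response and assert the outcome. In the sketch below, `test_backend()` is a hypothetical helper for building a tempdir-backed FilesystemStorage and an empty CacheBackend, and the Body construction assumes rama's Body can be built from a static string:

```rust
#[tokio::test]
async fn cold_cache_success_is_a_miss() {
    // Hypothetical helper: construct storage and cache backend however the
    // rest of the test suite does.
    let (storage, index) = test_backend();

    let result = fetch_cached_text(
        &storage,
        &index,
        "crates_index/se/rd/serde",
        "crates:index:serde",
        "text/plain; charset=utf-8",
        "public, max-age=60",
        false,
        MetaStoreMode::BestEffort,
        false,
        // Fake upstream: a plain 200 with a body, no network involved.
        |_headers| async move {
            Ok(Response::builder()
                .status(StatusCode::OK)
                .body(Body::from(r#"{"name":"serde"}"#))
                .unwrap())
        },
        // Identity transform, mirroring the index handler.
        |body| async move { Ok(body) },
    )
    .await
    .unwrap();

    assert!(matches!(result.outcome, CacheOutcome::Miss));
}
```

Similar cases could cover a 304 with a warm cache (Revalidated) and a non-success upstream status (Pass).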
```rust
let crate_name = path
    .rsplit('/')
    .next()
    .filter(|s| !s.is_empty())
    .context("invalid index path")?;
```
The crate name extracted from the path is not validated for security or correctness. Cargo crate names have specific requirements (lowercase alphanumeric plus hyphens and underscores, starting with a letter). Without validation, malicious inputs could potentially cause issues in file paths or upstream requests. Consider adding validation to ensure the crate name matches the expected pattern.
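One possible shape for that check; the helper name and the exact limits below are assumptions drawn from the policy described in this comment, not from the PR:

```rust
/// Accept only names that look like crates.io package names: ASCII letters,
/// digits, '-' and '_', starting with a letter, at most 64 characters.
fn is_valid_crate_name(name: &str) -> bool {
    let mut chars = name.chars();
    let first_ok = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic());
    first_ok
        && name.len() <= 64
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}
```

handle_sparse_index could call this right after extracting crate_name and return NOT_FOUND for anything that fails, which also keeps untrusted input out of storage_path and upstream_url.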
```rust
}
// Crates.io sparse index (with caching)
p if p.starts_with("/index/") => {
    let our_base = format!("http://{}:{}", self.config.server.host, self.config.server.port);
```
The our_base URL is constructed using the server host and port from config, but it doesn't account for the protocol (http vs https) or any base path. If the server is behind a reverse proxy with HTTPS or a different base path, the config.json served to cargo clients will have incorrect download URLs. Consider making this configurable or detecting it from request headers.
Suggested change:

```diff
- let our_base = format!("http://{}:{}", self.config.server.host, self.config.server.port);
+ // Derive our_base from the incoming request, so that generated URLs
+ // respect reverse proxies (protocol, host, and optional base path).
+ let headers = req.headers();
+ // Determine scheme (protocol).
+ let scheme = headers
+     .get("x-forwarded-proto")
+     .and_then(|v| v.to_str().ok())
+     .unwrap_or("http");
+ // Determine host (and port if present).
+ let host = headers
+     .get("x-forwarded-host")
+     .and_then(|v| v.to_str().ok())
+     .map(|s| s.to_owned())
+     .or_else(|| {
+         headers
+             .get(header::HOST)
+             .and_then(|v| v.to_str().ok().map(|s| s.to_owned()))
+     })
+     .unwrap_or_else(|| {
+         format!("{}:{}", self.config.server.host, self.config.server.port)
+     });
+ // Optional base path provided by reverse proxy.
+ let prefix = headers
+     .get("x-forwarded-prefix")
+     .and_then(|v| v.to_str().ok())
+     .unwrap_or("")
+     .trim_end_matches('/');
+ let mut our_base = format!("{}://{}", scheme, host);
+ if !prefix.is_empty() {
+     if !prefix.starts_with('/') {
+         our_base.push('/');
+     }
+     our_base.push_str(prefix);
+ }
```
```rust
pub fn from_crate_download_path(path: &str) -> Option<Self> {
    let parts = path.strip_prefix("/api/v1/crates/")?;
    let segments: Vec<&str> = parts.split('/').collect();
    if segments.len() < 3 || segments[2] != "download" {
        return None;
    }
    let name = segments[0];
    let version = segments[1];

    // Validate - no path traversal
    if name.contains("..")
        || name.contains('/')
        || version.contains("..")
        || version.contains('/')
    {
        tracing::warn!(name = %name, version = %version, "Rejected path traversal in crate request");
        return None;
    }

    let file_name = format!("{}-{}.crate", name, version);
    let relative_path = format!("crates/{}/{}", name, file_name);

    Some(Self {
        kind: AssetKind::Crate,
        name: name.to_string(),
        version: version.to_string(),
        platform: None,
        file_name,
        relative_path,
    })
}
```
The from_crate_download_path function lacks test coverage. Given the security-sensitive nature of path validation (checking for path traversal), this function should have comprehensive tests covering valid paths, invalid paths, path traversal attempts, and edge cases.
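A sketch of the kind of coverage this asks for; `AssetPath` is a stand-in for whatever type actually owns from_crate_download_path in src/proxy/types.rs, and the exact assertions would shift if the stricter `!= 3` check suggested above is adopted:

```rust
#[cfg(test)]
mod crate_download_path_tests {
    use super::*;

    #[test]
    fn parses_a_well_formed_download_path() {
        // AssetPath is a placeholder name for the owning type.
        let asset = AssetPath::from_crate_download_path("/api/v1/crates/serde/1.0.210/download")
            .expect("well-formed path should parse");
        assert_eq!(asset.name, "serde");
        assert_eq!(asset.version, "1.0.210");
        assert_eq!(asset.file_name, "serde-1.0.210.crate");
        assert_eq!(asset.relative_path, "crates/serde/serde-1.0.210.crate");
    }

    #[test]
    fn rejects_traversal_and_malformed_paths() {
        // Wrong API prefix.
        assert!(AssetPath::from_crate_download_path("/api/v1/gems/foo/1.0/download").is_none());
        // Too few segments.
        assert!(AssetPath::from_crate_download_path("/api/v1/crates/serde/1.0.210").is_none());
        // Path traversal in the crate name.
        assert!(AssetPath::from_crate_download_path("/api/v1/crates/../1.0.0/download").is_none());
    }
}
```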
```rust
async fn fetch_crate(&self, name: &str, version: &str) -> Result<Response<Body>> {
    use rama::http::client::EasyHttpWebClient;

    let url = format!(
        "https://static.crates.io/crates/{}/{}-{}.crate",
        name, name, version
    );

    let client = EasyHttpWebClient::default();
```
The fetch_crate function creates a new EasyHttpWebClient on every call. This is inefficient as it doesn't reuse connections. Consider storing a dedicated HTTP client for crates.io requests in the VeinProxy struct to enable connection pooling and better performance.
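A rough sketch of that suggestion; the field name and constructor shape are assumptions about how VeinProxy is built, and the idea is simply to construct the client once and let fetch_crate (and fetch_with_headers) borrow it:

```rust
use rama::http::client::EasyHttpWebClient;

pub struct VeinProxy {
    // ... existing fields (config, storage, cache, ...) ...
    /// Long-lived client for crates.io traffic, built once at startup so
    /// downloads and index fetches can pool connections.
    crates_client: EasyHttpWebClient,
}

impl VeinProxy {
    pub fn new(/* existing constructor arguments */) -> Self {
        Self {
            // ... existing field initialization ...
            crates_client: EasyHttpWebClient::default(),
        }
    }
}
```

fetch_crate would then issue its request through self.crates_client instead of calling EasyHttpWebClient::default() on every download.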
While cleaning up this codebase, I noticed we could reuse Vein for other ecosystems.
I removed the tree-sitter and binary features (I will re-add them later).
The plan is:
- [x] Add a crates.io-compatible API
- [ ] Add a Python-compatible API
- [ ] Add an npm-compatible API
Vein will act as a proxy: fetch once from upstream, use forever.
Other ecosystems, like Go, use repositories directly.
I might add Dart, but I need to study its API first so it won't conflict.