Skip to content

Implement Aho-Corasick for Fast, Multi-Keyword Filtering of GitHub Archive Events #136

@halcyondude

Description

@halcyondude

Enhancement: Implement Aho-Corasick for Fast, Multi-Keyword Filtering of GitHub Archive Events

Problem

Currently, landscape-graph processes GitHub Archive data by filtering events based on metadata such as organization, repository, and actor. While this is efficient for known sets of repositories, it doesn't allow for searching the actual content of event payloads when fast-filtering thru the full hourly archives. This limitation prevents a broader analysis of a project's ecosystem, such as mentions in issues, comments, or commit messages from other repositories.

Answering questions like "When do other projects talk about or integrate with Project X?" is currently infeasible without building massive, expensive indices or databases. The goal is to provide this enhanced capability to contributors and maintainers without incurring significant cloud computing costs.

For context, the GitHub Archive is a massive dataset, comprising over 5 TB of compressed JSONL events since 2015.

Proposed Solution

This issue proposes a major update to the GitHub Archive processing code by leveraging the Aho-Corasick algorithm. This algorithm allows for searching multiple keywords in a single pass with linear time complexity, making it highly efficient for large-scale text analysis.

This will enable filtering the entire dataset not just by repository metadata, but also for any mention of a project or related terms within the event payloads. This provides a powerful tool to quantify project adoption, integration, and general discussion across the entire GitHub ecosystem, in a way that's accessible to those without an unlimited budget, or the desire to load the full data set into a [data] store.

The implementation should ideally follow a two-phase approach to optimize performance:

  1. Aho-Corasick "Lite" Pre-filter: A fast, initial pass that searches for the presence of any of the specified keywords in an event. It returns early as soon as a match is found, significantly reducing the amount of data for the next phase. It still benefits from compiling
  2. Full Aho-Corasick Analysis: An optional full Aho-Corasick search is then run on the smaller, pre-filtered dataset. This allows for more detailed analysis, such as identifying which specific keywords were matched in each event. This two-phase approach improves data locality and cache performance, should the user wish (or even care) about the substring <-> event id mappings

The existing implementation in the sgm-gharchive sub-graph module capitalizes on the structure of events. The common header is ALWAYS at the beginning of the line, with the potentially massive payloads following. The current code can filter on org or repo, found in these initial key-value pairs by using ijson with an early exit. In this way (when we know what org(s) or repos(s) we're looking for) it only needs to touch the beginning of those lines, and only attempts to JSON decode the resulting, filtered events. Because it has to plow thru id, actor, and a timestamp for each line, we opportunistically extract those keys in addition to org and repo. We'll be leaving this alone, as it's both a useful "fast filter" as well as a way to quickly get "just the envelope" in one quick filtering pass.

Futures

To ensure broad usability, the implementation should include a library with SIMD-optimized versions for both x86_64 and Apple Silicon (ARM64) architectures. Note that while there are libraries implementing this algorithm already (pyahocorasick) it doesn't fully support the full scope of this issue - namely - no "early exit" when ANY of the terms are found (the first time), and isn't a pure python implementation, without any SIMD optimizations.

Since this is a gap in the ecosystem (based on present understanding), will likely factor this into a new (does not yet exist) pip module as it's useful for these sorts of use cases. Initial implementation might happen here (in landscape-graph) for expediency, but will plan to keep things modular so it can be moved to a discrete pip module and taken as a dependency instead.

Value Proposition

The benefits of this enhancement include:

  • Landscape(s) Analysis: The ability to quantify how and when projects are discussed and integrated across GitHub, providing a more comprehensive view of their impact and adoption.
  • Filtering: The ability to use many terms in a single-pass effeciently when scanning, such as "Observability," "traces," "metrics," "otel, "opentelemetry," "open-telemetry," and "profiling," to get a clearer picture of trends and discussions.
  • Significant Performance Improvement: Back-of-the-napkin calculations suggest a 20x performance improvement, especially for complex queries with many search terms, such as analyzing all CNCF Graduated and Incubating projects. Roadmap to SIMD and other hw accelerated better yet.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestsgmSub-Graph Module (sgm)sgm/gharchiveGitHub Archive (gharchive.org)

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions