Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives.
Start now by running your custom analysis/experiment, scraping your own query log, or just looking at our example files.
The data in the Archive Query Log is highly sensitive (still, you can re-crawl everything from the Wayback Machine). For that reason, we use TIRA as a platform for custom analyses/experiments, which ensures that they cannot leak sensitive data (please get in touch if you have questions). In TIRA, you submit a Docker image that implements your experiment. Your software is then executed in sandboxed mode (without an internet connection) so that it cannot leak sensitive information. After your software has finished executing, administrators will review your submission and unblind it so that you can access the outputs.
Please refer to our dedicated TIRA tutorial as a starting point for your experiments.
- Install Python 3.10
- Create and activate virtual environment:
python3.10 -m venv venv/
source venv/bin/activate
- Install dependencies:
pip install -e .
To quickly scrape a sample query log, jump to the TL;DR.
If you want to learn more about each step, here are some more detailed guides:
- Search providers
- Fetch archived URLs
- Parse archived query URLs
- Download archived raw SERPs
- Parse archived SERPs
Let's start with a small example and construct a query log for the ChatNoir search engine:
python -m archive_query_log make archived-urls chatnoir
python -m archive_query_log make archived-query-urls chatnoir
python -m archive_query_log make archived-raw-serps chatnoir
python -m archive_query_log make archived-parsed-serps chatnoir
Got the idea? Now you're ready to scrape your own query logs! To scale things up and understand the data, just keep on reading. For more details on how to add more search providers, see below.
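The four commands above can also be chained from a script. The following is a minimal sketch, not part of the project: the names `STEPS`, `build_pipeline_commands`, and `run_pipeline` are hypothetical, and only the `make` subcommands are taken from the example above.

```python
import subprocess
import sys

# Pipeline steps in the order used in the ChatNoir example.
STEPS = [
    "archived-urls",
    "archived-query-urls",
    "archived-raw-serps",
    "archived-parsed-serps",
]


def build_pipeline_commands(provider: str) -> list[list[str]]:
    """Build one command line per pipeline step (hypothetical helper)."""
    return [
        [sys.executable, "-m", "archive_query_log", "make", step, provider]
        for step in STEPS
    ]


def run_pipeline(provider: str) -> None:
    """Run all steps in order and stop at the first failing step."""
    for command in build_pipeline_commands(provider):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)


# Usage (requires the package to be installed in the active environment):
# run_pipeline("chatnoir")
```

Each step reads the output of the previous one, so stopping at the first failure avoids running later steps on incomplete data.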
Manually or semi-automatically collect a list of search providers that you would like to scrape query logs from.
The list of search providers should be stored in a single YAML file at data/selected-services.yaml and contain one entry per search provider, as shown below:
- name: string # search provider's name (alexa_domain - alexa_public_suffix)
  public_suffix: string # public suffix (https://publicsuffix.org/) of alexa_domain
  alexa_domain: string # domain as it appears in Alexa top-1M ranks
  alexa_rank: int # rank from fused Alexa top-1M rankings
  category: string # manual annotation
  notes: string # manual annotation
  input_field: bool # manual annotation
  search_form: bool # manual annotation
  search_div: bool # manual annotation
  domains: # known domains of the search providers (including the main domain)
    - string
    - string
    - ...
  query_parsers: # query parsers in order of precedence
    - pattern: regex
      type: query_parameter # for URLs like https://example.com/search?q=foo
      parameter: string
    - pattern: regex
      type: fragment_parameter # for URLs like https://example.com/search#q=foo
      parameter: string
    - pattern: regex
      type: query_parameter # for URLs like https://example.com/search/foo
      path_prefix: string
    - ...
  page_parsers: # page number parsers in order of precedence
    - pattern: regex
      type: query_parameter # for URLs like https://example.com/search?page=2
      parameter: string
    - ...
  offset_parsers: # page offset parsers in order of precedence
    - pattern: regex
      type: query_parameter # for URLs like https://example.com/search?start=11
      parameter: string
    - ...
  interpreted_query_parsers: # interpreted query parsers in order of precedence
    - ...
  results_parsers: # search result and snippet parsers in order of precedence
    - ...
- ...
In the source code, a search provider corresponds to the Python class Service.
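To illustrate how query parsers "in order of precedence" might be applied, here is a minimal sketch using only the standard library. It is not the project's implementation: the function `parse_query`, the example patterns, and the choice to match `pattern` against the full URL with `re.search` are assumptions made for illustration, and only the `query_parameter` and `fragment_parameter` types are covered.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def parse_query(url: str, parsers: list[dict]) -> Optional[str]:
    """Return the query from the first parser that yields one, else None."""
    parts = urlsplit(url)
    for parser in parsers:
        # Assumption: the pattern is tested against the full URL.
        if re.search(parser["pattern"], url) is None:
            continue
        if parser["type"] == "query_parameter":
            values = parse_qs(parts.query).get(parser["parameter"])
        elif parser["type"] == "fragment_parameter":
            values = parse_qs(parts.fragment).get(parser["parameter"])
        else:
            values = None  # other parser types are not covered in this sketch
        if values:
            return values[0]
    return None


# Hypothetical parser configuration mirroring the YAML structure above.
example_parsers = [
    {"pattern": r"^https?://[^/]+/search\?",
     "type": "query_parameter", "parameter": "q"},
    {"pattern": r"^https?://[^/]+/search#",
     "type": "fragment_parameter", "parameter": "q"},
]
```

Because the parsers are tried in list order, a more specific pattern should be listed before a more general one.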
Fetch all archived URLs for a search provider from the Internet Archive's Wayback Machine.
You can run this step with the following command line, where <PROVIDER> is the name of the search provider you want to fetch archived URLs from:
python -m archive_query_log make archived-urls <PROVIDER>
This will create multiple files in the archived-urls subdirectory under the data directory, based on the search provider's name (<PROVIDER>), domain (<DOMAIN>), and the Wayback Machine's CDX page number (<CDXPAGE>) from which the URLs were originally fetched:
<DATADIR>/archived-urls/<PROVIDER>/<DOMAIN>/<CDXPAGE>.jsonl.gz
Here, the <CDXPAGE> is a 10-digit number with leading zeros, e.g., 0000000001.
Each individual file is a GZIP-compressed JSONL file with one archived URL per line, in arbitrary order. Each line contains the following fields:
{
  "url": "string", // archived URL
  "timestamp": "int" // archive timestamp as POSIX integer
}
In the source code, an archived URL corresponds to the Python class ArchivedUrl.
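The files from this step can be read with the standard library alone. The sketch below is illustrative: the helper names `archived_urls_path`, `read_jsonl_gz`, and `archived_at` are hypothetical and not part of the package; the path layout and field names follow the description above.

```python
import gzip
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator


def archived_urls_path(
    data_dir: Path, provider: str, domain: str, cdx_page: int
) -> Path:
    """Build the file path, zero-padding the CDX page to 10 digits."""
    file_name = f"{cdx_page:010d}.jsonl.gz"
    return data_dir / "archived-urls" / provider / domain / file_name


def read_jsonl_gz(path: Path) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as file:
        for line in file:
            if line.strip():
                yield json.loads(line)


def archived_at(record: dict) -> datetime:
    """Convert the POSIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(record["timestamp"]), tz=timezone.utc)
```

Since the lines are in arbitrary order, sort by `timestamp` after reading if chronological order matters.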
Parse and filter archived URLs that contain a query and may point to a search engine result page (SERP).
You can run this step with the following command line, where <PROVIDER> is the name of the search provider you want to parse query URLs from:
python -m archive_query_log make archived-query-urls <PROVIDER>
This will create multiple files in the archived-query-urls subdirectory under the data directory, based on the search provider's name (<PROVIDER>), domain (<DOMAIN>), and the Wayback Machine's CDX page number (<CDXPAGE>) from which the URLs were originally fetched:
<DATADIR>/archived-query-urls/<PROVIDER>/<DOMAIN>/<CDXPAGE>.jsonl.gz
Here, the <CDXPAGE> is a 10-digit number with leading zeros, e.g., 0000000001.
Each individual file is a GZIP-compressed JSONL file with one archived query URL per line, in arbitrary order. Each line contains the following fields:
{
  "url": "string", // archived URL
  "timestamp": "int", // archive timestamp as POSIX integer
  "query": "string", // parsed query
  "page": "int", // result page number (optional)
  "offset": "int" // result page offset (optional)
}
In the source code, an archived query URL corresponds to the Python class ArchivedQueryUrl.
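As a quick sanity check on the parsed query URLs, one could count the most frequent queries. The following sketch assumes the records have already been loaded as dictionaries with the fields listed above; `normalize`, `top_queries`, and `paging_info` are hypothetical helpers, and the lowercasing and whitespace normalization are illustrative choices rather than something the pipeline does.

```python
from collections import Counter
from typing import Iterable, Optional


def normalize(query: str) -> str:
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def top_queries(records: Iterable[dict], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent normalized queries with their counts."""
    counts = Counter(normalize(record["query"]) for record in records)
    return counts.most_common(n)


def paging_info(record: dict) -> tuple[Optional[int], Optional[int]]:
    """Read the optional page and offset fields; missing ones become None."""
    return record.get("page"), record.get("offset")
```

Using `dict.get` for `page` and `offset` keeps the code robust for records where these optional fields are absent.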
Download the raw HTML content of archived search engine result pages (SERPs).
You can run this step with the following command line, where <PROVIDER> is the name of the search provider you want to download raw SERP HTML contents from:
python -m archive_query_log make archived-raw-serps <PROVIDER>
This will create multiple files in the archived-raw-serps subdirectory under the data directory, based on the search provider's name (<PROVIDER>), domain (<DOMAIN>), and the Wayback Machine's CDX page number (<CDXPAGE>) from which the URLs were originally fetched. Archived raw SERPs are stored as WARC chunk files of up to 1GB each: chunks are filled sequentially, and once a chunk is full, a new chunk is created.
<DATADIR>/archived-raw-serps/<PROVIDER>/<DOMAIN>/<CDXPAGE>/<WARCCHUNK>.warc.gz
Here, the <CDXPAGE> and <WARCCHUNK> are both 10-digit numbers with leading zeros, e.g., 0000000001.
Each individual file is a GZIP-compressed WARC file with one WARC request and one WARC response per archived raw SERP. WARC records are arbitrarily ordered within or across chunks, but the WARC request and response for the same archived query URL are kept together. The archived query URL is stored in the WARC request's and response's Archived-URL field in JSONL format (the same format as in the previous step):
{
  "url": "string", // archived URL
  "timestamp": "int", // archive timestamp as POSIX integer
  "query": "string", // parsed query
  "page": "int", // result page number (optional)
  "offset": "int" // result page offset (optional)
}
In the source code, an archived raw SERP corresponds to the Python class ArchivedRawSerp.
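The sequential filling of WARC chunks described above can be sketched as follows. This is only an illustration of the idea, not the project's code: the class `ChunkAllocator`, the byte limit taken for "1GB", the chunk numbering starting at 0, and the decision to check the size before writing a record are all assumptions.

```python
from dataclasses import dataclass

MAX_CHUNK_BYTES = 1_000_000_000  # assumed interpretation of "1GB"


@dataclass
class ChunkAllocator:
    """Assign records to sequentially numbered chunks (illustrative only)."""

    max_bytes: int = MAX_CHUNK_BYTES
    chunk: int = 0  # assumption: chunk numbering starts at 0
    used: int = 0  # bytes already assigned to the current chunk

    def allocate(self, record_bytes: int) -> str:
        """Return the zero-padded name of the chunk receiving this record."""
        # Open a new chunk if the record would overflow the current one.
        # An oversized record still lands in a chunk of its own.
        if self.used > 0 and self.used + record_bytes > self.max_bytes:
            self.chunk += 1
            self.used = 0
        self.used += record_bytes
        return f"{self.chunk:010d}"
```

In practice, the request and response of one SERP would be passed as a single unit so that they end up in the same chunk.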
Parse and filter archived SERPs from raw contents.
You can run this step with the following command line, where <PROVIDER> is the name of the search provider you want to parse SERPs from:
python -m archive_query_log make archived-parsed-serps <PROVIDER>
This will create multiple files in the archived-serps subdirectory under the data directory, based on the search provider's name (<PROVIDER>), domain (<DOMAIN>), and the Wayback Machine's CDX page number (<CDXPAGE>) from which the URLs were originally fetched:
<DATADIR>/archived-serps/<PROVIDER>/<DOMAIN>/<CDXPAGE>.jsonl.gz
Here, the <CDXPAGE> is a 10-digit number with leading zeros, e.g., 0000000001.
Each individual file is a GZIP-compressed JSONL file with one archived parsed SERP per line, in arbitrary order. Each line contains the following fields:
{
  "url": "string", // archived URL
  "timestamp": "int", // archive timestamp as POSIX integer
  "query": "string", // parsed query
  "page": "int", // result page number (optional)
  "offset": "int", // result page offset (optional)
  "interpreted_query": "string", // query displayed on the SERP (e.g. with spelling correction; optional)
  "results": [
    {
      "url": "string", // URL of the result
      "title": "string", // title of the result
      "snippet": "string" // snippet of the result (highlighting normalized to <em>)
    },
    ...
  ]
}
In the source code, an archived parsed SERP corresponds to the Python class ArchivedParsedSerp.
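For downstream analysis, a parsed SERP line can be mapped to typed objects. The sketch below is not the package's `ArchivedParsedSerp` class: the names `Result`, `ParsedSerp`, and `serp_from_dict` are hypothetical, and treating a missing snippet as an empty string is an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Result:
    url: str
    title: str
    snippet: str

    def plain_snippet(self) -> str:
        """Drop the <em> highlighting tags and keep the text."""
        return re.sub(r"</?em>", "", self.snippet)


@dataclass
class ParsedSerp:
    url: str
    timestamp: int
    query: str
    page: Optional[int] = None
    offset: Optional[int] = None
    interpreted_query: Optional[str] = None
    results: list[Result] = field(default_factory=list)


def serp_from_dict(data: dict) -> ParsedSerp:
    """Build a ParsedSerp from one decoded JSONL line."""
    return ParsedSerp(
        url=data["url"],
        timestamp=int(data["timestamp"]),
        query=data["query"],
        page=data.get("page"),  # optional field
        offset=data.get("offset"),  # optional field
        interpreted_query=data.get("interpreted_query"),  # optional field
        results=[
            Result(url=r["url"], title=r["title"], snippet=r.get("snippet") or "")
            for r in data.get("results", [])
        ],
    )
```

Comparing `query` with `interpreted_query` is one way to spot SERPs where the search engine applied a spelling correction.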
By default, the data directory is set to data/. You can change this with the --data-directory option, e.g.:
python -m archive_query_log make archived-urls --data-directory /mnt/ceph/storage/data-in-progress/data-research/web-search/web-archive-query-log/
If the search provider you're scraping queries for is very large and has many domains, testing your settings on a smaller sample from that search provider can be helpful. You can specify a single domain to scrape from like this:
python -m archive_query_log make archived-urls <PROVIDER> <DOMAIN>
If a domain is very popular and therefore has many archived URLs, you can further limit the number of archived URLs to scrape by selecting a page from the Wayback Machine's CDX API:
python -m archive_query_log make archived-urls <PROVIDER> <DOMAIN> <CDX_PAGE>
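When scripting test runs on a sample, the optional narrowing arguments can be assembled programmatically. This is a minimal sketch: `make_command` is a hypothetical helper, and passing the CDX page as a plain integer string is an assumption, since the expected format of <CDX_PAGE> is not stated above.

```python
import sys
from typing import Optional


def make_command(
    step: str,
    provider: str,
    domain: Optional[str] = None,
    cdx_page: Optional[int] = None,
) -> list[str]:
    """Build a `make` command, optionally limited to a domain and CDX page."""
    # The CDX page is the third positional argument, so it needs a domain.
    if cdx_page is not None and domain is None:
        raise ValueError("a CDX page can only be selected together with a domain")
    command = [sys.executable, "-m", "archive_query_log", "make", step, provider]
    if domain is not None:
        command.append(domain)
    if cdx_page is not None:
        command.append(str(cdx_page))  # assumption: plain integer format
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` to execute the step.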
If you use the Archive Query Log dataset or the code to generate it in your research, please cite the following paper describing the AQL and its use-cases:
TODO
You can use the following BibTeX entry for citation:
% TODO
Run tests:
flake8 archive_query_log
pylint -E archive_query_log
pytest archive_query_log
Add new tests for parsers:
- Select the number of tests to run per service and the number of services.
- Auto-generate unit tests and download WARCs with generate_tests.py
- Run the tests.
- Failing tests will open a diff editor with the approval and a web browser tab with the Wayback URL.
- Use the web browser dev tools to find the query input field and search result CSS paths.
- Close diffs and tabs and re-run tests.
- Kaggle dataset of the manual test SERPs, thanks to @DiTo97
If you've found an important search provider to be missing from this query log, please suggest it by creating an issue. We also very gratefully accept pull requests for adding search providers or new parser configurations!
If you're unsure about anything, post an issue or contact us.
We're happy to help!
This repository is released under the MIT license. Files in the data/ directory are exempt from this license.
If you use the AQL in your research, we'd be glad if you'd cite us.
The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.