Merge pull request #25 from edsu/headers-view
Add a view for HTTP headers
Florents-Tselai authored Oct 30, 2023
2 parents 626f443 + 3b705a4 commit 43866ef
Showing 3 changed files with 72 additions and 0 deletions.
23 changes: 23 additions & 0 deletions README.md
@@ -52,6 +52,29 @@ Here's the relational schema of the `.warcdb` file.
![WarcDB Schema](schema.png)
### Views
In addition to the core tables that map to the WARC record types, there are helper *views* that make the data easier to query:
#### v_request_http_header
A view of HTTP headers in WARC request records:
| Column Name | Column Type | Description |
| -------------- | ----------- | ---------------------------------------------------------------------- |
| warc_record_id | text | The WARC-Record-ID of the *request* record that the header was extracted from. |
| name | text | The lowercased HTTP header name (e.g. content-type) |
| value | text | The HTTP header value (e.g. text/html) |
#### v_response_http_header
A view of HTTP headers in WARC response records:
| Column Name | Column Type | Description |
| -------------- | ----------- | ---------------------------------------------------------------------- |
| warc_record_id | text | The WARC-Record-ID of the *response* record that the header was extracted from. |
| name | text | The lowercased HTTP header name (e.g. content-type) |
| value | text | The HTTP header value (e.g. text/html) |
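
As a rough illustration of how the views might be queried, the sketch below counts the distinct values of one response header. It is not part of the commit. So that it runs without a `.warcdb` file, it assumes a stand-in table named `v_response_http_header` with the three documented columns; `count_header_values` and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: a plain in-memory table stands in for the real view so the
# example is self-contained. Against a real .warcdb file you would connect
# to that file instead and skip the CREATE/INSERT statements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v_response_http_header "
    "(warc_record_id TEXT, name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO v_response_http_header VALUES (?, ?, ?)",
    [
        ("<urn:uuid:a>", "content-type", "text/html"),
        ("<urn:uuid:a>", "server", "example"),
        ("<urn:uuid:b>", "content-type", "text/html"),
        ("<urn:uuid:c>", "content-type", "image/png"),
    ],
)


def count_header_values(conn, header_name):
    """Return (value, count) pairs for one header, most frequent first.

    Hypothetical helper: the view stores lowercased names, so the
    argument is lowercased before comparison.
    """
    sql = """
        SELECT value, COUNT(*) AS n
        FROM v_response_http_header
        WHERE name = ?
        GROUP BY value
        ORDER BY n DESC, value
    """
    return conn.execute(sql, (header_name.lower(),)).fetchall()


content_types = count_header_values(conn, "Content-Type")
print(content_types)
```

Because the `name` column is lowercased in the view, lowercasing the lookup key lets callers pass header names in any case.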
## Motivation
25 changes: 25 additions & 0 deletions tests/test_warcdb.py
@@ -64,3 +64,28 @@ def test_column_names():
assert re.match(r"^[a-z_]+", col.name), f"column {col.name} named correctly"

os.remove(db_file)


def test_http_header():
    runner = CliRunner()
    runner.invoke(
        warcdb_cli, ["import", db_file, str(pathlib.Path("tests/google.warc"))]
    )

    db = sqlite_utils.Database(db_file)

    resp_headers = list(db["v_response_http_header"].rows)
    assert len(resp_headers) == 43
    assert {
        "name": "content-type",
        "value": "text/html; charset=UTF-8",
        "warc_record_id": "<urn:uuid:2008CBED-030B-435B-A4DF-09A842DDB764>",
    } in resp_headers

    req_headers = list(db["v_request_http_header"].rows)
    assert len(req_headers) == 17
    assert {
        "name": "user-agent",
        "value": "Wget/1.21.3",
        "warc_record_id": "<urn:uuid:6E9096E2-5D54-4CD6-A157-1DE4A7040DEB>",
    } in req_headers
24 changes: 24 additions & 0 deletions warcdb/migrations.py
@@ -95,3 +95,27 @@ def m001_initial(db):
("warc_concurrent_to", "metadata", "warc_record_id"),
],
)


@migration()
def m002_headers(db):
    db.create_view(
        "v_request_http_header",
        """
        SELECT
            request.warc_record_id AS warc_record_id,
            LOWER(JSON_EXTRACT(header.VALUE, '$.header')) AS name,
            JSON_EXTRACT(header.VALUE, '$.value') AS value
        FROM request, JSON_EACH(request.http_headers) AS header
        """,
    )
    db.create_view(
        "v_response_http_header",
        """
        SELECT
            response.warc_record_id AS warc_record_id,
            LOWER(JSON_EXTRACT(header.VALUE, '$.header')) AS name,
            JSON_EXTRACT(header.VALUE, '$.value') AS value
        FROM response, JSON_EACH(response.http_headers) AS header
        """,
    )
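
To see how `JSON_EACH` expands one record into one row per header, here is a minimal standalone sketch using Python's built-in `sqlite3` module (with SQLite's JSON functions available). It assumes, based only on the `$.header` and `$.value` paths in the view above, that `http_headers` holds a JSON array of objects with `header` and `value` keys. The two-column `request` table and the sample record ID are simplified stand-ins, not the real schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for the real `request` table (assumption: only the
# two columns the view reads are modelled here).
conn.execute(
    "CREATE TABLE request (warc_record_id TEXT PRIMARY KEY, http_headers TEXT)"
)

# Assumed storage format: a JSON array of {"header": ..., "value": ...}.
headers = [
    {"header": "User-Agent", "value": "Wget/1.21.3"},
    {"header": "Accept", "value": "*/*"},
]
conn.execute(
    "INSERT INTO request VALUES (?, ?)",
    ("<urn:uuid:example>", json.dumps(headers)),
)

# Same SELECT as the migration: JSON_EACH yields one row per array element,
# and JSON_EXTRACT pulls the two fields out of each element.
conn.execute(
    """
    CREATE VIEW v_request_http_header AS
    SELECT
        request.warc_record_id AS warc_record_id,
        LOWER(JSON_EXTRACT(header.value, '$.header')) AS name,
        JSON_EXTRACT(header.value, '$.value') AS value
    FROM request, JSON_EACH(request.http_headers) AS header
    """
)

rows = conn.execute(
    "SELECT name, value FROM v_request_http_header ORDER BY name"
).fetchall()
print(rows)
```

The single `request` row becomes two view rows, with the header names lowercased by `LOWER`.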