Add min- and max-created filters to lddb_json_shape.py #1312

Merged
niklasl merged 4 commits into develop from feature/update-shapes-script
Oct 2, 2023

Member

@niklasl niklasl commented Sep 27, 2023

Use like:

$ zcat SOME_DUMP.json.linez.gz |
    python3.11 lddb_json_shape.py /lddb_shapes/ --min-created "2023-08-01T00:00:00Z" --max-created "2023-09-01T00:00:00Z"
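For readers who want to see how such a filter might work, here is a minimal sketch. It is not the code from this PR: only the `--min-created` and `--max-created` flags come from the usage above, while `parse_ts`, `in_created_range` and `build_parser` are hypothetical names. It also assumes the upper bound is exclusive, which the PR does not state.

```python
import argparse
from datetime import datetime


def parse_ts(value):
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on;
    # rewriting it to "+00:00" keeps this sketch usable on older versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def in_created_range(created, min_created=None, max_created=None):
    """Return True if `created` lies within the optional bounds.

    Assumption: the range is half-open, [min_created, max_created).
    """
    if min_created is not None and created < min_created:
        return False
    if max_created is not None and created >= max_created:
        return False
    return True


def build_parser():
    # Hypothetical parser; "outdir" stands in for the positional argument.
    parser = argparse.ArgumentParser()
    parser.add_argument("outdir")
    parser.add_argument("--min-created", type=parse_ts)
    parser.add_argument("--max-created", type=parse_ts)
    return parser
```

Both bounds are optional here, so omitting a flag leaves that side of the range open. All timestamps compared this way should carry a UTC offset, since Python refuses to compare naive and aware datetimes.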

Member

@andersju andersju left a comment

LGTM!

@@ -109,6 +123,14 @@ def count_value(k, v, shape):

    try:
        data = json.loads(l)

        if '@graph' in data:
            created = datetime.fromisoformat(data['@graph'][0]['created'])
Contributor

I get a traceback (at different lines on repeated runs). On quick inspection, the data doesn't seem to be missing a created property, but I haven't had time to investigate properly.

  File "/libris/librisxl/librisxl-tools/scripts/lddb_json_shape.py", line 128, in <module>
    created = datetime.fromisoformat(data['@graph'][0]['created'])
                                     ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'created'

Member

When processing definitions/build/*.lines, I see a few dozen records with no created property, e.g.:

{'@graph': [{'@type': 'SystemRecord', 'mainEntity': {'@id': 'https://libris.kb.se/'}, '@id': 'p76szt07r0kw1bjb', 'inDataset': [{'@id': 'https://libris.kb.se/dataset/syscore'}, {'@id': 'https://libris.kb.se/dataset/sys/apps'}]}, {'@id': 'https://libris.kb.se/', '@type': 'DataCatalog', 'title': 'libris.kb.se', 'article': {'@type': 'Article', 'articleBody': "<p xml:lang='sv'>Data på <b>LIBRIS.KB.SE</b>.</p>"}}]}
{'@graph': [{'@type': 'SystemRecord', 'mainEntity': {'@id': 'https://libris.kb.se/data'}, '@id': 'p76szt07r4pwm3dk', 'inDataset': [{'@id': 'https://libris.kb.se/dataset/syscore'}, {'@id': 'https://libris.kb.se/dataset/sys/apps'}]}, {'@id': 'https://libris.kb.se/data', '@type': 'DataService', 'titleByLang': {'en': 'LIBRIS-XL Linked Data Platform API'}, 'statistics': {'sliceList': [{'dimensionChain': ['rdf:type'], 'itemLimit': 400}]}}]}
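One way to tolerate records like these would be to look up `created` defensively instead of indexing directly. The following is only a sketch, not what was merged: `get_created` is a hypothetical helper, and treating a missing timestamp as "skip the date filter for this record" is an assumption about the desired behavior.

```python
from datetime import datetime


def get_created(data):
    """Return the created timestamp of the first @graph item, or None.

    None is returned when the record has no usable @graph list or when
    its first item carries no 'created' key, as in the SystemRecord
    examples above.
    """
    graph = data.get("@graph") if isinstance(data, dict) else None
    if not isinstance(graph, list) or not graph:
        return None
    record = graph[0]
    if not isinstance(record, dict):
        return None
    value = record.get("created")
    if value is None:
        return None
    # Rewrite a trailing "Z" so this also parses on Python before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))
```

The caller can then decide whether a `None` result means the record is excluded by the date filters or counted regardless.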

Contributor

Hmm, the test sets I used only contain bib, auth or hold data (using the method described here). I tried to fetch the line with sed when I got the At: 228,661Traceback (most recent call) output, but didn't immediately see anything suspicious, unless it yields the wrong line.
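To rule out an off-by-one between the progress counter and sed, the offending records could also be located directly. This is a rough sketch: `records_without_created` is a hypothetical helper, and it reports 1-based line numbers to match sed addressing. Whether the script's "At:" counter is 0-based or 1-based is not known from this thread.

```python
import json


def records_without_created(lines):
    """Yield (line_number, record_id) for records lacking 'created'.

    Line numbers are 1-based, so a reported N corresponds to
    `sed -n 'Np'` on the same input.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        data = json.loads(line)
        graph = data.get("@graph") if isinstance(data, dict) else None
        if not isinstance(graph, list) or not graph:
            continue
        record = graph[0]
        if isinstance(record, dict) and "created" not in record:
            yield lineno, record.get("@id")
```

Any iterable of lines works as input, for example `sys.stdin` when piping from zcat, or a file opened with `gzip.open(path, "rt", encoding="utf-8")`.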

@niklasl niklasl merged commit 0f72406 into develop Oct 2, 2023
1 check passed
@niklasl niklasl deleted the feature/update-shapes-script branch October 2, 2023 16:06