This repository is a collection of vocabulary and scheme definitions and mappings to external resources. These define the foundation of linked library data used by the National Library of Sweden.
Requires Python 3.7+. (Use PyPy for a general speed improvement.)
Preferably set up a virtualenv:
$ python3 -m venv PATH_TO_VENV_OF_YOUR_CHOICE
$ source PATH_TO_VENV_OF_YOUR_CHOICE/bin/activate
Install the Python-based dependencies:
$ pip install -r requirements.txt
See the files in `source/datasets/` for definitions of what is included in each dataset (or set of datasets).
Run the following to build the full set of datasets:
$ python datasets.py -l
This is often not needed, as not all datasets are updated all the time. Instead, prefer to set up and use the following to produce a load file only for what has been worked on.
This builds the system core dataset:
$ python syscore.py -l
Run the following to build the full set of common datasets for id.kb.se:
$ python common.py -l
You can also pass dataset names to generate the different parts in isolation.
Pass `-h` or `--help` to the script for details.
Finally, this builds a set of documentation articles for id.kb.se:
$ python docs.py -l
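The build scripts above can also be driven from Python, for instance to rebuild several parts in one go. The following is a minimal sketch and not part of this repository: the helper names `build_command` and `run_builds` are hypothetical, and it assumes that dataset names are passed as plain positional arguments after `-l`. Check each script's `--help` output before relying on that.

```python
import subprocess
import sys


def build_command(script, dataset_names=(), load_file=True):
    """Return the argument list for one build script (hypothetical helper)."""
    command = [sys.executable, script]
    if load_file:
        command.append("-l")  # the flag used in the commands above
    # Assumption: dataset names are given as positional arguments.
    command.extend(dataset_names)
    return command


def run_builds(scripts):
    """Run each build script in order, stopping at the first failure."""
    for script in scripts:
        subprocess.run(build_command(script), check=True)


# Example (run from the repository root, inside the virtualenv):
# run_builds(["syscore.py", "common.py", "docs.py"])
```

Using `sys.executable` keeps the child processes on the same interpreter (and thus the same virtualenv) as the calling script.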
The `source/` directory contains the main vocabulary mappings for linked data at KB.
Here you will also find a number of common definitions and mappings, as well as language labels more or less manually synced with various external origins. See `datasets.py` for the details.
It also contains a hand-curated set of RDF types extracted from MARC fixed field definitions (006, 007 and 008 for bib, auth and hold).
The vocabulary is split into the formally decided "vocab" terms (which we call the KBV namespace) and the legacy, often unstable, "marc" terms. The latter stem from MARC21 constructs not yet interpreted according to the new modelling principles, which are based on RDF and linked data (see `source/doc/model.en.mkd`).
Among these, the following files are special:
- `source/vocab/bf-to-kbv-base.rq` and `source/vocab/bf-map.ttl` are used to automatically wire up the base BF2 mappings and term hierarchies.
- `source/vocab/display.jsonld` defines lenses used to display data (as "cards" or "chips").
- `source/vocab/platform.ttl` and `source/vocab/services.ttl` map various technical terms to public vocabularies.
- `source/vocab/enums.ttl`, `source/vocab/construct-enum-restrictions.rq`, and `source/marc/enums.ttl` define the terms (properties and classes) for controlled, "enumerable" values. A lot of these stem from controlled values for columns in fixed fields in MARC21. Some come from RDA, and some from cleaned-up definitions in BibFrame 2, or our own vocabulary. (See links in the data for references.) These files also contain certain instances of these classes. Specifically, these correspond to the domain of the properties defined as `@type: @vocab` in `source/vocab-overlay.jsonld`. These are special values defined within the vocabulary (often because they are very "type-like"). A prime example is `IssuanceType`, whose values are kept together with the vocabulary itself.
- `source/marc/construct-enums.rq` is used to create all other "enumerable" values, which may or may not become merged with other controlled lists in the future. (When that is done, the definition here must be removed and its URI placed in a `sameAs` relation in whatever term replaces it.)
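To illustrate the replacement step for merged "enumerable" values, the sketch below builds a JSON-LD-style description in which a replacing term points back at the retired one. Everything in it is an assumption made for illustration: the helper name, the example.org URIs, and the use of a plain `sameAs` key (the actual key depends on the JSON-LD context in use).

```python
def with_same_as(replacing_term, retired_uri):
    """Return a copy of replacing_term that links to retired_uri via sameAs."""
    updated = dict(replacing_term)
    links = list(updated.get("sameAs", []))
    link = {"@id": retired_uri}
    if link not in links:  # avoid adding the same link twice
        links.append(link)
    updated["sameAs"] = links
    return updated


# Hypothetical URIs, for illustration only.
new_term = {"@id": "https://example.org/controlled-list/item"}
linked = with_same_as(new_term, "https://example.org/marc/RetiredEnumValue")
print(linked)
```

Keeping the retired URI reachable this way means that data still referring to it can be resolved to the replacing term.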
Note: the file `source/vocab/check-bases.rq` is used to sanity-check the generated structures. Heed any warnings it produces by correcting the relevant sources.
Tip: during vocab development, regularly run just:
$ python syscore.py
which generates the vocab build file.
Look at it as Turtle by running:
$ rdfpipe -ijson-ld:context=sys/context/kbv.jsonld build/vocab.jsonld
You can also make a nice, digested tree view by running:
$ python scripts/misc/vocab-summary.py build/vocab.jsonld -c sys/context/kbv.jsonld -v
When bigger changes are made, you can generate more predictable output by calling e.g.:
$ PYTHONHASHSEED=1 python datasets.py -l
Use this in conjunction with switching between a stable branch and a feature branch, backing up the build directory when doing so. Then run e.g.:
$ diff -qr /tmp/build-develop-bak build
to see the resulting differences.
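If you prefer to do the comparison from Python, for example to post-process the list of changed files, the following sketch gives roughly what `diff -qr` reports. It is an illustration only; the function name `changed_files` is hypothetical and nothing in this repository depends on it.

```python
import filecmp
from pathlib import Path


def changed_files(old_dir, new_dir):
    """Return sorted relative paths of files that differ between two build directories."""
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    old = {p.relative_to(old_dir) for p in old_dir.rglob("*") if p.is_file()}
    new = {p.relative_to(new_dir) for p in new_dir.rglob("*") if p.is_file()}
    differing = set(old ^ new)  # files present on only one side
    for rel in old & new:
        # shallow=False compares file contents, not just size and mtime.
        if not filecmp.cmp(old_dir / rel, new_dir / rel, shallow=False):
            differing.add(rel)
    return sorted(p.as_posix() for p in differing)


# Example: changed_files("/tmp/build-develop-bak", "build")
```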
To categorize classes and properties, we use our own `kbv:category` property, which links to various terms we've defined for various purposes, such as `:pending`.
We do not use `vs:term_status` for this, since:
- We have a broader set of categories than "status" implies. Categories are defined for various application-specific purposes, e.g. to state that a term is a shorthand term, or that a class belongs to a group of classes mappable to MARC bibliographic records.
- Its use of string literals is poor practice, since out-of-band definitions are then needed to discover applicable values and their meanings. With linked data, the natural approach is to simply mint a URI for the status item and define it with labels and definition texts (in any languages needed).
We have put `vs:term_status "unstable"` to use in some places, to clearly indicate that status using a common colloquialism. But for our application purposes, we use `:category :pending`.
For deprecation we use `owl:deprecated true`, to facilitate any eventual tooling requiring this exact form.
We also mark terms using `ptg:abstract true` if they are not supposed to be used for resources directly (and thus should not be choosable e.g. in an editing interface), but rather represent a point in a class or property hierarchy defined for structuring the vocabulary.
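As a rough illustration of how a consumer, such as an editing interface, might act on these markers, the sketch below inspects a term description given as a plain dictionary. The key names and the `pending` identifier are assumptions made for the example; the actual compact forms depend on the JSON-LD context. Hiding deprecated terms from selection is likewise an assumption, not a rule stated here.

```python
# Assumed compact key names; adjust to the JSON-LD context actually in use.
CATEGORY_KEY = "category"
DEPRECATED_KEY = "owl:deprecated"
ABSTRACT_KEY = "abstract"
PENDING_ID = "pending"


def _ids(value):
    """Normalise a missing, single or list value to a set of @id strings."""
    if value is None:
        return set()
    items = value if isinstance(value, list) else [value]
    return {item["@id"] for item in items if isinstance(item, dict) and "@id" in item}


def term_flags(term):
    """Summarise the status markers found on a term description."""
    return {
        "pending": PENDING_ID in _ids(term.get(CATEGORY_KEY)),
        "deprecated": term.get(DEPRECATED_KEY) is True,
        "abstract": term.get(ABSTRACT_KEY) is True,
    }


def is_choosable(term):
    """Abstract terms are not offered; excluding deprecated ones is an assumption."""
    flags = term_flags(term)
    return not (flags["abstract"] or flags["deprecated"])
```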
In principle, we should keep any published terms indefinitely. Everything at id.kb.se is potentially used externally (even without us knowing so), as we're an official agency tasked with ensuring long-term stability and promoting data reuse.
If we consider a certain term ill-defined and detrimental to use, do not expect anyone else to be using it, and consider keeping it along with an `owl:deprecated true` as potentially problematic, it is OK to comment it out along with a note like:
# Dropped at 2021-09-08. Feel free to delete this after 5 years.
If its disappearance prompts any complaints, this gives us an easy way of seeing that we've removed it, and provides a window for restoring it.
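Such notes can be checked mechanically once the window has passed. The following sketch assumes that notes follow the exact wording of the example above; the function names are hypothetical and no such script exists in this repository.

```python
import re
from datetime import date

# Matches notes worded exactly like the example above.
DROP_NOTE = re.compile(
    r"#\s*Dropped at (\d{4})-(\d{2})-(\d{2})\. "
    r"Feel free to delete this after (\d+) years?\."
)


def deletable_from(note_line):
    """Return the date from which a commented-out term may be deleted, or None."""
    match = DROP_NOTE.search(note_line)
    if match is None:
        return None
    year, month, day, years = (int(group) for group in match.groups())
    try:
        dropped = date(year, month, day)
    except ValueError:
        return None  # not a valid calendar date
    try:
        return dropped.replace(year=year + years)
    except ValueError:
        # Dropped on 29 February and the target year is not a leap year.
        return date(year + years, 3, 1)


def expired_notes(lines, today=None):
    """Return the lines whose retention window has passed."""
    today = today or date.today()
    expired = []
    for line in lines:
        limit = deletable_from(line)
        if limit is not None and limit <= today:
            expired.append(line)
    return expired
```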
This is a public application vocabulary. As such, we have no contract in terms of stability or officiality, other than that all terms we use in our data are to be defined within it. In general, this holds even if our data for certain resources is deleted, since their descriptions may have been kept in other systems. We do not guarantee this indefinitely though, and in particular we might drop terms if they are deemed incorrect. Other than that, we will use `owl:deprecated true` to signal the intended disappearance of a term.
All of the legacy "marc" terms are implicitly `owl:deprecated true` and can in theory be dropped at any time (after removing any use of them from our datasets). No external use should depend on them. Any long-term use of these which indicates meaningful requirements should be reworked into proper KBV terms.
Using utilities in the `whelk-core` repository, you can generate a SPARQL construct file from the `marcframe.json` mappings, from which you can in turn generate a basic vocab file:
$ cd ../whelk-core/ && gradle -q vocabFromMarcFrame #.rq
To generate RDF descriptions from legacy MARC definitions, use:
$ python scripts/marcframe-skeleton-from-marcmap.py scripts/marc/marcmap.json --enums
See that script for other options.
Pipe the output to `rdfpipe -ijson-ld:base=source/ -oturtle -` to get it as Turtle.