DatAasee Software Documentation

Version: 0.2

DatAasee is a metadata-lake for centralizing bibliographic and scientific metadata from various sources. It increases research data findability and discoverability as well as metadata availability, and thus supports FAIR research and research reporting in university, research, academic, and scientific libraries.

Particularly, DatAasee is developed for and by the University and State Library of Münster.



Explanations

In this section understanding-oriented explanations are collected.


About

  • What is DatAasee?
    • DatAasee is a metadata-lake!
  • What is a Metadata-Lake?
    • A metadata-lake (a.k.a. metalake) is a data-lake whose data is restricted to metadata!
  • What is a Data-Lake?
    • A data-lake is a data architecture for structured, semi-structured and unstructured data!
  • How does a data-lake differ from a database?
    • A data-lake includes a database, but requires further components to import and export data!
  • How does a data-lake differ from a data warehouse?
    • A data warehouse transforms incoming data to fit its schema (cf ETL), a data-lake ingests incoming data as-is and transforms it on-demand (cf ELT)!
  • How is data in a data-lake organized?
    • A data-lake includes a metadata catalog that stores data locations, their metadata, and transformations!
  • What makes a metadata-lake special?
    • The metadata-lake's data-lake and metadata catalog coincide; this implies incoming data is partially transformed (cf. EtLT) to hydrate the catalog aspect of the metadata-lake!
  • How does a metadata-lake differ from a data catalog?
    • A metadata-lake's data is (textual) metadata while a data catalog's data is databases (and their contents)!
  • How is a metadata-lake a data-lake?
    • The ingested metadata is stored in raw form in the metadata-lake in addition to the partially transformed catalog metadata, and transformations are performed on the raw or catalog metadata upon request.
  • How does a metadata-lake relate to a virtual data-lake?
    • A metadata-lake can act as a central metadata catalog for a set of distributed data sources and thus define a virtual data-lake.

Features

  • Search via: full-text, filter, [SRU]
  • Query by: SQL, Gremlin, Cypher, MQL, GraphQL, [SPARQL]
  • Ingest: DataCite (XML), DC (XML), MARC (XML), MODS (XML), [LIDO], [EAD], [DCAT], [RDF]
  • Ingest via: OAI-PMH (HTTP), S3 (HTTP), [GraphQL], [Postgres], [Self]
  • Deploy via: Docker, Podman, [Kubernetes]
  • REST-like API with CQRS aspects
  • Best-of statistics of enumerated properties
  • CRUD frontend for manual interaction and observation.

Components

DatAasee uses a three-tier architecture with these separately containerized components:

| Function | Abstraction | Tier | Product |
|---|---|---|---|
| Metadata Catalog | Multi-Model Database | Data | ArcadeDB |
| EtLT Processor | Declarative Streaming Processor | Logic | Benthos |
| Web Frontend | Declarative Web Framework | Presentation | Lowdefy |

Design

  • Each component is encapsulated in its own container.
  • External access happens through an HTTP API transporting JSON and conforming to JSON:API.
  • Ingests may happen via compatible protocols, e.g. OAI-PMH, S3.
  • The frontend is optional, as it exclusively uses the (external) HTTP-API.
  • Internal communication happens via the components' HTTP-APIs.
  • Only the database component holds state, the backend (and frontend) are stateless.
  • For more details see the architecture documentation.

Data Model

The internal data model follows the one big table (OBT) approach, with the exception of linked enumerated dimensions (look-up tables), making it effectively a denormalized wide table with a star schema, named metadata.
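
For illustration, a custom SQL query (see Custom Queries) can read this wide table directly and follow a linked dimension; the dot-notation link traversal below is a sketch assuming ArcadeDB SQL semantics:

SELECT name, publicationYear, language.name AS language FROM metadata LIMIT 10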

EtLT Process

Combining the ETL (Extract-Transform-Load / schema-on-write) and ELT (Extract-Load-Transform / schema-on-read) concepts, processing is built upon the EtLT approach:

  • Extract: Ingest from data source, see ingest endpoint.
  • transform: Partial parsing and cleaning.
  • Load: Write to database.
  • Transform: Parse to export format on-demand.

Particularly, this means "EtL" happens (batch-wise) during ingest, while "T" occurs when requested.

Security

Secrets:

  • Two secrets need to be handled: database admin and datalake admin passwords.
  • The default datalake admin user name is admin, the password can be passed during initial deploy.
  • The database admin user name is root, the password can be passed during initial deploy.
  • The passwords are handled as file-based secrets by the deploying compose file (loaded from a file and provided to containers as a file).
  • The database credentials are used by the backend and may be used for manual database access.
  • If the secrets are kept on the host, they need to be protected, for example via openssl, SOPS, or similar tools.
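
For example, a minimal sketch for protecting the secret files at rest with openssl (a passphrase is prompted; file names as used in the deploy how-to):

$ openssl enc -aes-256-cbc -pbkdf2 -in dl_pass -out dl_pass.enc && rm dl_pass   # encrypt, then remove the plaintext
$ openssl enc -aes-256-cbc -pbkdf2 -d -in dl_pass.enc -out dl_pass              # decrypt again before the next deploy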

Infrastructure:

  • Component containers are custom-built and hardened.
  • Only HTTP and Basic Authentication are used; HTTPS is assumed to be provided by a user-supplied proxy server.

Interface:

  • HTTP-API GET requests are idempotent and thus unchallenged.
  • HTTP-API POST requests may change the state of the database and thus need to be authorized by the data-lake admin user credentials.
  • See the DatAasee OpenAPI definition.

How-Tos

In this section, step-by-step guides for real-world problems are listed.


Prerequisite

The (virtual) machine deploying DatAasee requires docker-compose or podman-compose; see also the container engine compatibility.

Resources

The compute and memory resources for DatAasee can be configured via the compose.yaml. Overall, a bare-metal machine or virtual machine requires:

  • Minimum: 2 CPU, 4G RAM
  • Recommended: 4 CPU, 8G RAM

So, a Raspberry Pi would be sufficient. In terms of DatAasee components this breaks down to:

  • Database:
    • Minimum: 1 CPU, 2G RAM
    • Recommended: 2 CPU, 4G RAM
  • Backend:
    • Minimum: 1 CPU, 1G RAM
    • Recommended: 2 CPU, 2G RAM
  • Frontend:
    • Minimum: 1 CPU, 1G RAM
    • Recommended: 2 CPU, 2G RAM

Note that resource and system requirements depend on load; in particular, the database and backend are under heavy load during ingest. After an ingest, (new) metadata records are interrelated, which also causes heavy database load. Generally, the database drives overall performance, so to improve performance, first try increasing the memory of the database component (e.g., from 4G to 8G).
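
As a sketch, per-service limits can be declared via the Compose deploy.resources keys in the compose.yaml (shown here for the database service; the shipped file may already define similar entries):

services:
  database:
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4G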

Deploy

$ mkdir -p backup  # or: ln -s /path/to/backup/volume backup
$ wget https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)

NOTE: The required secrets are kept temporarily in the files dl_pass and db_pass.

NOTE: Make sure to delete (or encrypt) secret files dl_pass and db_pass after use!

NOTE: To customize your deploy, use these environment variables.

NOTE: The runtime configuration environment variables can be stored in an .env file.

NOTE: A custom backup location can alternatively also be specified inside the compose.yaml.
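
A minimal .env sketch (variable names from the Runtime Configuration reference; values are placeholders):

DL_VERSION=0.2
DL_PORT=8343
DL_BASE=http://my.url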

Test

$ wget -SqO- http://localhost:8343/api/v1/ready

NOTE: The default port for the HTTP API is 8343.

Shutdown

$ docker-compose down

NOTE: A (database) backup is automatically triggered on every shutdown.

Ingest

$ wget -O- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data \
  '{"source":"https://my.url/to/oai","method":"oai-pmh","format":"mods","steward":"https://my.url/identifying/steward"}'

NOTE: A (database) backup is automatically triggered after every ingest.

Backup Manually

$ wget -O- http://localhost:8343/api/v1/backup --user admin --ask-password --post-data=''

Logs

$ docker-compose logs backend

Update

$ docker compose down
$ docker compose pull
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)

NOTE: "Update" means: if available, new images of the same DatAasee version but updated dependencies will be installed, whereas "Upgrade" means: a new version of DatAasee will be installed.

Upgrade

$ docker compose down
$ echo -n 'password1' > dl_pass && echo -n 'password2' > db_pass && DL_VERSION=0.3 docker compose up -d; rm -f dl_pass db_pass; history -d $(history 1)

NOTE: docker-compose restart cannot be used here because environment variables (such as DL_VERSION) are not updated when using restart.

NOTE: Make sure to put the DL_VERSION variable also into the .env file for a permanent upgrade.

Web Interface (Prototype)

NOTE: The default port for the web frontend is 8000 for development and 80 for deployment.

Screenshots of the prototype pages: Index, Filter, Query, Overview, About, Fetch, Insert, Admin.

API Indexing

Add the JSON object below to the apis array in your global apis.json API index.

{
  "name": "DatAasee API",
  "description": "The DatAasee API enables research data search and discovery via metadata",
  "keywords": ["Metadata"],
  "attribution": "DatAasee",
  "baseURL": "http://your-dataasee.url/api/v1",
  "properties": [
    {
      "type": "InterfaceLicense",
      "url": "https://spdx.org/licenses/MIT.html"
    },
    {
      "type": "x-openapi",
      "url": "http://your-dataasee.url/api/v1/api"
    }
  ]
}

References

In this section technical descriptions are summarized.


HTTP-API

The HTTP-API is served under http://<your-url-here>/api/v1 and provides the following endpoints:

| Method | Endpoint | Type | Summary |
|---|---|---|---|
| GET | /ready | system | Return service status |
| GET | /api | special | Return API specification and schemas |
| GET | /schema | metadata | Return database schema |
| GET | /attributes | metadata | Return enumerated properties |
| GET | /stats | data | Return statistics about records |
| GET | /metadata | data | Return metadata record(s) |
| POST | /insert | data | Create new record |
| POST | /ingest | system | Trigger ingest from source |
| POST | /backup | system | Trigger database backup |
| POST | /health | system | Return service health |
| GET | /export | data | TODO: |
| GET | /sru | data | TODO: |
| POST | /forward | system | TODO: |

NOTE: The base path for all endpoints is /api/v1.

NOTE: All GET requests are unchallenged; all POST requests are challenged and handled via Basic Authentication.

NOTE: All request and response bodies have content type JSON, and if provided, the Content-Type HTTP header must be application/json!

NOTE: As the metadata-lake's data is metadata, a type "data" means metadata, and a type "metadata" means metadata about metadata.

NOTE: Responses follow the JSON:API format.

NOTE: The id property is the server's Unix timestamp.
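
For orientation, response bodies follow the general JSON:API envelope sketched below; the concrete member and attribute names per endpoint are defined by the response schemas available from the /api endpoint:

{
  "data": {
    "type": "...",
    "id": "1700000000",
    "attributes": {}
  }
}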


/ready Endpoint

Returns a boolean indicating whether the service is ready.

NOTE: The ready endpoint can be used as readiness probe.

Status:

  • 200 OK
  • 406 Not Acceptable.
  • 503 Service Unavailable.

Example:

Get service readiness:

$ wget -qO- http://localhost:8343/api/v1/ready

/api Endpoint

Returns OpenAPI specification if no parameter is given, otherwise returns a request or response schema.

NOTE: In case of a successful request, the response is NOT in the JSON:API format, but the requested JSON file directly.

  • Method: GET
  • Parameters:
    • request (Optional; if provided, a request schema for the endpoint in the parameter value is returned.)
    • response (Optional; if provided, a response schema for the endpoint in the parameter value is returned.)
  • Response Schema: response/api.json
  • Access: Public
  • Process: see architecture

Statuses:

  • 200 OK
  • 400 Parameter value is not an endpoint or has no request schema.
  • 406 Not Acceptable.

Examples:

Get OpenAPI definition:

$ wget -qO- http://localhost:8343/api/v1/api

Get ingest endpoint request schema:

$ wget -qO- http://localhost:8343/api/v1/api?request=ingest

Get metadata endpoint response schema:

$ wget -qO- http://localhost:8343/api/v1/api?response=metadata

/schema Endpoint

Returns internal metadata schema.

Statuses:

  • 200 OK
  • 406 Not Acceptable.
  • 500 Database error.

Example:

Get native metadata schema:

$ wget -qO- http://localhost:8343/api/v1/schema

/attributes Endpoint

Returns list of enumerated attribute values.

Statuses:

  • 200 OK
  • 400 Invalid request.
  • 406 Not Acceptable.
  • 500 Database error.

Example:

Get all enumerated attributes:

$ wget -qO- http://localhost:8343/api/v1/attributes

Get language attributes:

$ wget -qO- http://localhost:8343/api/v1/attributes?type=languages

/stats Endpoint

Return statistics about records.

Statuses:

  • 200 OK
  • 406 Not Acceptable.
  • 500 Database error.

Example:

$ wget -qO- http://localhost:8343/api/v1/stats

/metadata Endpoint

Fetch, search, filter, or query metadata record(s).

  • Method: GET
  • Parameters:
    • id (Optional; if provided, a metadata-set with this value is returned.)
    • search (Optional; if provided, full-text search results for this value are returned.)
    • query (Optional; if provided, query results using this value are returned, no language parameter implies sql.)
    • language (Optional; if provided, filter results by language are returned, also used to set query language.)
    • resourcetype (Optional; if provided, filter results by resourceType are returned.)
    • license (Optional; if provided, filter results by license are returned.)
    • category (Optional; if provided, filter results by category are returned.)
    • from (Optional; if provided, filter results with publicationYear greater than or equal to this value are returned.)
    • till (Optional; if provided, filter results with publicationYear less than or equal to this value are returned.)
    • skip (Optional; if provided, this number of results is skipped, use for paging.)
    • newest (Optional; if provided, results are sorted newest-to-oldest if true (default), or oldest-to-newest if false.)
  • Response Body: response/metadata.json
  • Access: Public
  • Process: see architecture

NOTE: Only idempotent read operations are permitted in custom queries.

NOTE: A full-text search always matches for all argument terms (AND-based) in titles, descriptions and keywords in any order, while accepting * as wildcards and _ to build phrases.

Statuses:

  • 200 OK
  • 400 Invalid request.
  • 404 Not found.
  • 406 Not Acceptable.
  • 500 Database error.

Examples:

Get record by record identifier:

$ wget -qO- http://localhost:8343/api/v1/metadata?id=

Search records by single filter:

$ wget -qO- http://localhost:8343/api/v1/metadata?language=chinese

Search records by multiple filters:

$ wget -qO- 'http://localhost:8343/api/v1/metadata?resourcetype=book&language=german'

Search records by full-text for word "History":

$ wget -qO- http://localhost:8343/api/v1/metadata?search=History

Search records by full-text and filter, oldest first:

$ wget -qO- 'http://localhost:8343/api/v1/metadata?search=Geschichte&resourcetype=book&language=german&newest=false'

Search records by custom SQL query:

$ wget -qO- 'http://localhost:8343/api/v1/metadata?language=sql&query=SELECT%20FROM%20metadata%20LIMIT%2010'
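
Search records by full-text phrase and wildcard (using the _ and * syntax described in the note above; the search term is illustrative):

$ wget -qO- 'http://localhost:8343/api/v1/metadata?search=modern_history*'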

/insert Endpoint

Inserts a new record into the database, parsing it if necessary.

NOTE: This endpoint is meant for metadata records that cannot be ingested, such as a report about ingested sources, or for testing; general use is discouraged. For details on the request body, see the associated JSON schema.

Status:

  • 201 Created
  • 400 Invalid request.
  • 403 Invalid credentials.
  • 406 Not Acceptable.
  • 500 Database error.

Example:

Insert record with given fields: TODO:

$ wget -qO- http://localhost:8343/api/v1/insert --user admin --ask-password --post-file=myinsert.json
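
For illustration only, a hypothetical myinsert.json covering the mandatory fields of the native schema; the authoritative structure is the request schema returned by the /api endpoint via ?request=insert:

$ wget -qO- http://localhost:8343/api/v1/api?request=insert   # fetch the actual request schema first
$ cat > myinsert.json << 'EOF'
{
  "name": "Example Record",
  "creators": [{"name": "Doe, Jane", "data": ""}],
  "publisher": "Example Publisher",
  "publicationYear": 2024,
  "resourceType": "book",
  "identifiers": [{"name": "urn", "data": "urn:nbn:de:example-123"}]
}
EOF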

/ingest Endpoint

Trigger ingest from data source.

  • Method: POST
  • Request Body: request/ingest.json
    • source must be a URL
    • method can be oai-pmh or s3
    • format can be datacite, oai_datacite, dc, oai_dc, marc21, marcxml, mods, or rawmods
    • steward should be a URL or email address.
  • Response Body: response/ingest.json
  • Access: Challenged (Basic Authentication)
  • Process: see architecture

NOTE: To test if the server is busy, send an empty (POST) body to this endpoint. HTTP status 400 means available, status 503 means currently ingesting.

NOTE: The method and format are case-sensitive.

Status:

  • 202 Accepted.
  • 400 Invalid request.
  • 403 Invalid credentials.
  • 406 Not Acceptable.
  • 503 Already ingesting.

Example:

Start ingest from a given source:

$ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data='{"source":"https://datastore.uni-muenster.de/oai", "method":"oai-pmh", "format":"datacite", "steward":"[email protected]"}'
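
Check whether an ingest is currently running (per the note above, an empty body yields status 400 if the server is available and 503 if an ingest is in progress):

$ wget -SqO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data=''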

/backup Endpoint

Trigger database backup.

Status:

  • 200 OK
  • 403 Invalid credentials.
  • 406 Not Acceptable.
  • 500 Database error.

Example:

$ wget -qO- http://localhost:8343/api/v1/backup --user admin --ask-password --post-data=''

/health Endpoint

Returns internal status of service components.

NOTE: The health endpoint can be used as liveness probe.

Status:

  • 200 OK
  • 403 Invalid credentials.
  • 406 Not Acceptable.
  • 500 Internal Server Error.

Example:

Get service health:

$ wget -qO- http://localhost:8343/api/v1/health --user admin --ask-password --post-data=''

/export Endpoint

TODO:

/sru Endpoint

TODO:

/forward Endpoint

TODO:

Ingest Protocols

  • OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
    • Supported Versions: 2.0
  • S3 (Simple Storage Service)
    • Supported Versions: 2006-03-01

Ingest Encodings

Currently, XML (eXtensible Markup Language) is the sole encoding for ingested metadata.

Ingest Formats

Native Schema

| Key | Class | Entry | Type | Constraints |
|---|---|---|---|---|
| schemaVersion | Process | Automatic | Integer | min 0 |
| recordId | Process | Automatic | String | max 31 |
| metadataQuality | Process | Automatic | String | max 255 |
| dataSteward | Process | Automatic | String | max 4095 |
| source | Process | Automatic | String | max 4095 |
| createdAt | Process | Automatic | Datetime | |
| updatedAt | Process | Automatic | Datetime | |
| sizeBytes | Technical | Automatic | Integer | min 0 |
| fileFormat | Technical | Automatic | String | max 255 |
| dataLocation | Technical | Automatic | String | max 4095, regexp |
| numberDownloads | Social | Automatic | Integer | min 0 |
| keywords | Social | Optional | String | max 255 |
| categories | Social | Optional | List(String) | max 4 |
| name | Descriptive | Mandatory | String | max 255 |
| creators | Descriptive | Mandatory | List(pair) | max 255 |
| publisher | Descriptive | Mandatory | String | min 1, max 255 |
| publicationYear | Descriptive | Mandatory | Integer | min -9999, max 9999 |
| resourceType | Descriptive | Mandatory | Link(attribute) | resourceTypes |
| identifiers | Descriptive | Mandatory | List(pair) | max 255 |
| synonyms | Descriptive | Optional | List(pair) | max 255 |
| language | Descriptive | Optional | Link(attribute) | languages |
| subjects | Descriptive | Optional | List(pair) | max 255 |
| version | Descriptive | Optional | String | max 255 |
| license | Descriptive | Optional | Link(pair) | licenses |
| rights | Descriptive | Optional | String | max 65535 |
| project | Descriptive | Optional | Embedded(pair) | |
| fundings | Descriptive | Optional | List(pair) | max 255 |
| description | Descriptive | Optional | String | max 65535 |
| message | Descriptive | Optional | String | max 65535 |
| externalItems | Descriptive | Optional | List(pair) | max 255 |
| rawType | Raw | Optional | String | max 255 |
| raw | Raw | Optional | String | max 1048575 |
| rawChecksum | Raw | Optional | String | max 255 |

NOTE: See also the schema diagram: schema.md

NOTE: The preloaded set of categories (see preload.sql) is highly opinionated.

Helper types

attributes:

| Property | Type | Constraints |
|---|---|---|
| name | String | min 3, max 255 |
| also | List(String) | |

pair:

| Property | Type | Constraints |
|---|---|---|
| name | String | max 255 |
| data | String | max 4095, regexp |

Global Metadata

Each schema property has a label; additionally, the descriptive properties have a comment property.

| Key | Type | Comment |
|---|---|---|
| label | String | For UI labels |
| comment | String | For UI helper texts |

Interrelation Edges

| Type | Comment |
|---|---|
| isRelatedTo | Base edge type |
| isNewVersionOf | Derived from isRelatedTo |
| isDerivedFrom | Derived from isRelatedTo |
| isPartOf | Derived from isRelatedTo |
| isSameExpressionAs | Derived from isRelatedTo |
| isSameManifestationAs | Derived from isRelatedTo |

Ingestable to Native Schema Crosswalk

TODO: Add sub elements

DatAasee DataCite DC MARC MODS
name titles title 245, 130 titleInfo, part
creators creators, contributors creator, contributor 100, 700 name, relatedItem
publisher publisher publisher 260, 264 originInfo
publicationYear publicationYear date 260, 264 originInfo, part, recordInfo
resourceType resourceType type 007, genre
identifiers identifier, alternateIdentifiers identifier 001, 020, 856 identifier, recordInfo
synonyms titles title 210, 222, 240, 242, 246, 247 titleInfo
language language language 008, 041 language
subjects subjects subjects 655, 689 subject
version version 250
license rights accessCondition
rights rights 506, 540
project
fundings fundingReferences
description description description 520
message format 500 note
externalItems relatedIdentifiers identifier identifier
isRelatedTo relatedItems, relatedIdentifiers related 773 relatedItem
isNewVersionOf relatedItems, relatedIdentifiers relatedItem
isDerivedFrom relatedItems, relatedIdentifiers relatedItem
isPartOf relatedItems, relatedIdentifiers relatedItem
isSameExpressionAs relatedItem
isSameManifestationAs recordInfo

Query Languages

| Language | Identifier | Documentation |
|---|---|---|
| SQL | sql | ArcadeDB SQL |
| Cypher | cypher | Neo4J Cypher |
| GraphQL | graphql | GraphQL Spec |
| Gremlin | gremlin | Tinkerpop Gremlin |
| MQL | mongo | Mongo MQL |
| SPARQL | sparql | SPARQL (WIP) |

Runtime Configuration

The following environment variables affect DatAasee if set before starting.

| Symbol | Value | Meaning |
|---|---|---|
| TZ | CET (Default) | Timezone of server |
| DL_VERSION | 0.2 (Example) | Requested DatAasee version |
| DL_BACKUP | $PWD/backup (Default) | Path to backup folder |
| DL_USER | admin (Default) | DatAasee admin username |
| DL_BASE | http://my.url (Example) | Outward DatAasee base URL (including protocol and port, no trailing slash) |
| DL_PORT | 8343 (Default) | DatAasee API port |
| FE_PORT | 8000 (Default) | Web Frontend port (Development) |
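
For example, to pin the DatAasee version and API port for a single start (cf. the Upgrade how-to):

$ DL_VERSION=0.2 DL_PORT=8343 docker compose up -d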

Tutorials

In this section, learning-oriented lessons for newcomers are given.


Getting Started

  1. Set up a compatible compose orchestrator
  2. Download DatAasee release
    $ wget https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
    or:
    $ curl -O https://raw.githubusercontent.com/ulbmuenster/dataasee/0.2/compose.yaml
  3. Unpack and enter the release archive (only needed if a release archive was downloaded instead of the raw compose.yaml)
    $ tar -xf dataasee-0.2.tar.gz
    and:
    $ cd dataasee-0.2
  4. Create or mount folder for backups (assuming your backup volume is mounted under /backup)
    $ mkdir -p backup
    or:
    $ ln -s /backup backup
  5. Create the DatAasee API and database admin passwords. The leading space before each echo prevents the command from being added to the shell history (given HISTCONTROL includes ignorespace); echo -n is used because most editors add a trailing newline to a file.
    $  echo -n 'password1' > dl_pass
    and:
    $  echo -n 'password2' > db_pass
  6. Start DatAasee service
    $ docker-compose up -d
    or:
    $ podman-compose up -d

Now, if started locally, point a browser to http://localhost:8000 to use the web frontend, or send requests to http://localhost:8343/api/v1/ to use the HTTP API directly.

Example Ingest

For demonstration purposes, the collection of the "Directory of Open Access Journals" (DOAJ) is ingested. An ingest has four phases: First, the administrator collects the necessary information about the metadata source, i.e. URL, protocol, format, and data steward. Second, the ingest is triggered via the HTTP-API. Third, the backend ingests the metadata records from the source into the database. Fourth and lastly, the ingested records are interconnected inside the database.

  1. Check the documentation of DOAJ:

    https://doaj.org/docs

    The OAI-PMH protocol is available.

  2. Check the documentation about OAI-PMH:

    https://doaj.org/docs/oai-pmh/

    The OAI-PMH endpoint URL is: https://doaj.org/oai.

  3. Check the OAI-PMH for available metadata formats:

    https://doaj.org/oai?verb=ListMetadataFormats

    A compatible metadata format is oai_dc.

  4. Start an ingest:

    $ wget -qO- http://localhost:8343/api/v1/ingest --user admin --ask-password --post-data='{"source":"https://doaj.org/oai", "method":"oai-pmh", "format":"oai_dc", "steward":"[email protected]"}'

    A status 202 confirms the start of the ingest. Since no steward is listed in the DOAJ documentation, a general contact is set here. Alternatively, the "Ingest" form of the "Admin" page in the web frontend can be used.

  5. DatAasee reports the start of the ingest in the backend logs:

    $ docker logs dataasee-backend-1

    with a message akin to: Starting ingest from https://doaj.org/oai via oai-pmh as oai_dc..

  6. DatAasee reports completion of the ingest in the backend logs:

    $ docker logs dataasee-backend-1

    with a message akin to: Completed ingest of 20812 records from https://doaj.org/oai after 0.05h..

  7. DatAasee starts interconnecting the ingested metadata records:

    $ docker logs dataasee-database-1

    with a message akin to: Interconnect Started!.

  8. DatAasee finishes interconnecting the ingested metadata records:

    $ docker logs dataasee-database-1

    with a message akin to: Interconnect Completed!.

NOTE: The interconnection is a potentially long-running, asynchronous operation, whose status is only reported in the database logs.
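
Afterwards, the ingested records can be inspected, for example, via the statistics endpoint:

$ wget -qO- http://localhost:8343/api/v1/stats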

Container Engines

DatAasee is deployed via a compose.yaml (see How to deploy), which is compatible with the following orchestration tools:

Docker-Compose

  • docker
  • docker-compose >= 2

Installation see: docs.docker.com/compose/install/

$ docker-compose up -d
$ docker-compose ps
$ docker-compose down

Docker-Compose (with Podman)

  • podman
  • podman-docker
  • docker-compose

Installation see: docs.docker.com/compose/install/

NOTE: This tutorial assumes a Debian-based Linux host like Ubuntu.

$ sudo apt-get -y install dnsmasq podman-plugins containernetworking-plugins podman-docker
$ docker-compose up -d
$ docker-compose ps
$ docker-compose down

Podman-Compose

  • podman >= 4.6.2
  • podman-compose >= 1.0.6

NOTE: This tutorial assumes a Debian-based Linux host like Ubuntu.

Additionally:

$ sudo apt-get -y install dnsmasq podman-plugins containernetworking-plugins python3-pip
$ pip3 install podman-compose
$ podman-compose up -d
$ podman-compose ps
$ podman-compose down

Kompose (Minikube)

  • minikube
  • kubectl
  • kompose

Installation see: kompose.io/installation/

Prepare compose.yaml:

  • Add port to database service:
    services:
      database:
        ports:  # value changed from []
          - "2480:2480"
$ kompose -f compose.yaml convert

Particularly, for kompose versions 1.33.0 and 1.34.0, the following manual changes need to be made in:

  • database-deployment.yaml:
    spec:
      template:
        spec:
          containers:
            - env:
                volumeMounts:
                  - name: "database"
                    mountPath: "/run/secrets/database"  # value changed from "/run/secrets"
  • backend-deployment.yaml:
    spec:
      template:
        spec:
          containers:
            - env:
                volumeMounts:
                  - name: "database"
                    mountPath: "/run/secrets/database"  # value changed from "/run/secrets"
                  - name: "datalake"
                    mountPath: "/run/secrets/datalake"  # value changed from "/run/secrets"
$ rm compose.yaml
$ minikube start
$ kubectl apply -f .
$ kubectl port-forward service/backend 8343:8343  # now the backend can be accessed via `http://localhost:8343/api/v1`
$ minikube stop

Container Probes

The following endpoints are available for monitoring the respective containers; here the compose.yaml host names (service names) are used. Logs are written to the standard output.

Backend

Ready:

http://backend:4195/ready

returns HTTP status 200 if ready, see also Benthos ready.

Liveness:

http://backend:4195/ping

returns HTTP status 200 if live, see also Benthos ping.

Metrics:

http://backend:4195/metrics

allows Prometheus scraping, see also Connect prometheus.

Database

Ready:

http://database:2480/api/v1/ready

returns HTTP status 204 if ready, see also ArcadeDB ready.

Frontend

Ready:

http://frontend:3000

returns HTTP status 200 if ready.

Custom Queries

NOTE: All custom query results are limited to 100 items.

SQL

DatAasee uses the ArcadeDB SQL dialect. For custom SQL queries, only single, read-only queries are admissible.

The vertex type (cf. table) holding the metadata records is named metadata.

Examples:

Get the schema:

SELECT FROM schema:types

Get one-hundred metadata record titles:

SELECT name FROM metadata

Gremlin

TODO:

Get one-hundred metadata records:

g.V().hasLabel("metadata")

Cypher

DatAasee supports a subset of OpenCypher. For custom Cypher queries, only read-queries are admissible, meaning:

  • MATCH
  • OPTIONAL MATCH
  • RETURN

Examples:

Get labels:

MATCH (n) RETURN DISTINCT labels(n)

Get one-hundred metadata records:

MATCH (m:metadata) RETURN m

MQL

TODO:

GraphQL

TODO:

SPARQL

TODO:

Custom Frontend

Remove Prototype Frontend

Remove the YAML object "frontend" in the compose.yaml (all lines below ## Frontend # ...).


Appendix

In this section development-related guidelines are gathered.


Development Decision Rationales:

Infrastructure

  • What versioning scheme is used?

    • DatAasee uses SimVer versioning, with the addition that the minor version starts at one for the first release of a major version (X.1), so during the development of a major version the minor version is zero (X.0).
  • How stable is the upgrade to a release?

    • During the development releases (0.X), every release will likely be breaking, particularly with respect to the backend API and database schema. Once version 1.0 is released, breaking changes will only occur between major versions.
  • What are the three compose files for?

    • The compose.develop.yaml is only for the development environment,
    • The compose.package.yaml is only for building the release container images,
    • The compose.yaml is the only file making up a release.
  • Why does a release consist only of the compose.yaml?

    • The compose configuration acts as an installation script and deploy recipe. Given access to a repository with DatAasee, all containers are set up on-the-fly by pulling. No other files are needed.
  • Why is Ubuntu 24.04 used as base image for database and backend?

    • Overall, the calendar-based versioning together with the 5-year support policy for Ubuntu LTS makes keeping current easier. Generally, glibc is used, and specifically for the database, OpenJDK is supported, as opposed to Alpine.
  • Why does building the backend Docker image fail?

    • This is likely a timeout when downloading Go module packages. Multiple retries may be necessary to complete a build.

Database

  • Why is an init.sh script used instead of a plain command in the database container?
    • This is a security measure; the script is designed to hide secrets which need to be passed on startup. A secondary use is setting up the database schema in case the container is freshly created.

Backend

  • Why are the main processing components part of the input and not a separate pipeline?

    • Since an ingest may take very long, it is only triggered, and the successful triggering is reported in the response while the ingest keeps running. This asynchronous behavior is only possible with a buffer, which has to sit directly after the input and after the sync_response of the trigger; thus the input's post-processing processors are used as the main pipeline.
  • Why is the content type application/json used for responses and not application/vnd.api+json?

    • Using the official JSON MIME-type makes a response more compatible and states what it is in more general terms. Requested content types on the other hand may be either empty, */*, application/json, or application/vnd.api+json.

Frontend

  • Why is the frontend a prototype?

    • The frontend is not meant for direct production use but serves as a system-testing device, a proof of concept, living documentation, and a simplification for manual testing. Thus it has the layout of an internal tool. Nonetheless, it can be used as a basis or template for a production frontend.
  • Why is there custom JS defined?

    • This is necessary to enable triggering the submit button when pressing the "Enter" key.
  • Why does the frontend container use the backend name explicitly and not the host loopback, i.e. extra_hosts: [host.docker.internal:host-gateway]?

    • Because podman does not seem to support it yet.

Development Workflows

Development Setup

  1. git clone https://github.com/ulbmuenster/dataasee && cd dataasee (clone repository)
  2. make setup (builds container images locally)
  3. make start (starts development setup)

Dependency Updates

  1. Dependency documentation
  2. Dependency versions
  3. Version verification (Frontend only)

Schema Changes

  1. Schema definition
  2. Schema documentation
  3. Schema implementation

API Changes

  1. API definition
  2. API architecture
  3. API documentation
  4. API implementation
  5. API testing

Coding Standards

  • YAML and SQL files must have a comment header line containing: dialect, project, license, author.
  • YAML should be restricted to StrictYAML (except github-ci and compose).
  • SQL commands should be all-caps.