
# All things data

## Emerging outbreaks

Any new outbreak that is tracked by Global.health will use the reusable-data-service for CRUD operations on line list cases. It incorporates a simple "day zero" schema along with facilities for flexibly extending the data model.

### Configuration

TBD as the service gets written!

## COVID-19 instances

The data-service folder contains the service for CRUD operations involving the COVID-19 schema.

### Database design

G.h COVID-19 case data, as well as the source, user, and session data for the curator portal, is stored in a MongoDB database.

We run multiple MongoDB instances: local instances for development; dev and qa instances (backing https://dev-data.covid-19.global.health/ and https://qa-data.covid-19.global.health for COVID-19 data); and a prod instance (backing https://data.covid-19.global.health/ for COVID-19).

Each instance has a covid19 database, which in turn has collections for each type of data, e.g. cases, users, etc.
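
As a quick way to see that layout from code, a minimal TypeScript sketch using the Node MongoDB driver (the local connection string is an assumption for a development setup) might look like:

    import { MongoClient } from 'mongodb';

    // Assumes a local development instance; adjust the connection string for
    // the instance you actually want to inspect.
    async function listCovid19Collections(): Promise<void> {
        const client = new MongoClient('mongodb://localhost:27017');
        await client.connect();
        try {
            // Each type of data (cases, users, etc.) lives in its own collection.
            const collections = await client.db('covid19').listCollections().toArray();
            console.log(collections.map((c) => c.name));
        } finally {
            await client.close();
        }
    }

    listCovid19Collections().catch(console.error);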

#### Case data

The data in the cases collection is the primary data that G.h collects, verifies, and shares. Each document in the cases collection represents a single disease case.

##### Shape of the data

To learn more about what a case consists of, try importing some sample data into a local MongoDB instance and connecting to it with MongoDB Compass. Alternatively, you can peruse the schema.
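
If you prefer to script the import, a rough sketch along those lines (the sample file path here is hypothetical; point it at whichever sample data file you want to inspect) is:

    import { readFileSync } from 'fs';
    import { MongoClient } from 'mongodb';

    // Hypothetical path to a JSON array of sample case documents.
    const SAMPLE_FILE = './sample_cases.json';

    async function loadSampleCases(): Promise<void> {
        const cases = JSON.parse(readFileSync(SAMPLE_FILE, 'utf8'));
        const client = new MongoClient('mongodb://localhost:27017');
        await client.connect();
        try {
            const result = await client
                .db('covid19')
                .collection('cases')
                .insertMany(cases);
            console.log(`Imported ${result.insertedCount} sample cases`);
        } finally {
            await client.close();
        }
    }

    loadSampleCases().catch(console.error);

Once the documents are loaded, point MongoDB Compass at mongodb://localhost:27017 and browse the covid19.cases collection to see the field structure.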

##### Version history

We store past revisions of case documents in the caserevisions collection. These revisions are made in the application layer when a case is updated; we follow the MongoDB Document Version Pattern.

A caserevision document has a case field containing a snapshot of the case at a given revision. The collection indexes the id of the case and its revision for quick lookups.
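
A minimal sketch of that flow in the application layer, with an invented revision field (the real schema is authoritative), could be:

    import { MongoClient, ObjectId } from 'mongodb';

    // Document Version Pattern sketch: before updating a case, snapshot the
    // current document into `caserevisions`, then apply the update to `cases`.
    async function updateCaseWithRevision(
        client: MongoClient,
        caseId: ObjectId,
        update: Record<string, unknown>,
    ): Promise<void> {
        const db = client.db('covid19');
        const current = await db.collection('cases').findOne({ _id: caseId });
        if (!current) {
            throw new Error(`No case found with _id ${caseId}`);
        }
        // Counting existing snapshots is a simplification of the real revision
        // counter; the case id and revision are what the collection indexes.
        const revision = await db
            .collection('caserevisions')
            .countDocuments({ 'case._id': caseId });
        await db.collection('caserevisions').insertOne({ case: current, revision });
        await db.collection('cases').updateOne({ _id: caseId }, { $set: update });
    }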

### Importing cases

G.h has millions of COVID-19 case records that predate the new curator portal. These are exported to a gzipped CSV in a separate repo.

We can convert these cases to a JSON format that conforms to the cases collection schema and ingest them into a MongoDB instance using these scripts:

  1. Convert the data only

### Exporting data

We provide a flattened version of the complete dataset, accessible through the Data page, refreshed nightly and stored on S3. The scripts that orchestrate this are organized using AWS SAM; they export the dataset in parallel chunks, parse them, recombine them, and then compress and upload the result along with a data dictionary.
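
A much-simplified, single-process sketch of that export (the chunk size, flattening logic, and output path are all invented; the real pipeline parallelizes the chunks under SAM and uploads the result to S3) might be:

    import { createWriteStream } from 'fs';
    import { createGzip } from 'zlib';
    import { MongoClient } from 'mongodb';

    const CHUNK_SIZE = 10000; // invented; the real chunking is configured in the SAM scripts

    async function exportCases(uri: string, outFile: string): Promise<void> {
        const client = new MongoClient(uri);
        await client.connect();
        const gzip = createGzip();
        gzip.pipe(createWriteStream(outFile));
        try {
            const cases = client.db('covid19').collection('cases');
            const total = await cases.countDocuments();
            // Read the collection in fixed-size chunks, flatten each document
            // to a row, and stream the rows through gzip.
            for (let skip = 0; skip < total; skip += CHUNK_SIZE) {
                const chunk = await cases.find().skip(skip).limit(CHUNK_SIZE).toArray();
                for (const doc of chunk) {
                    gzip.write(flattenToRow(doc) + '\n');
                }
            }
        } finally {
            gzip.end();
            await client.close();
        }
    }

    // Placeholder for the real flattening logic, which picks specific fields
    // and writes the published CSV columns.
    function flattenToRow(doc: Record<string, unknown>): string {
        return JSON.stringify(doc);
    }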

The script that aggregates and exports counts for the Map visualizations may also be found in this set of scripts.

### Updating the case schema

#### What needs to be changed & where

Schema updates can affect the whole stack, including:

- The MongoDB case JSON schema, which is a 'validator' applied to the cases collection on the database. This schema is similar to a relational database schema in that it enforces which fields are allowed to be present and what their types are; it does not validate any values. The validator applies to all case data regardless of how it's entered into the database.
- The Mongoose schema/data model in the dataservice, which mirrors the MongoDB JSON schema and adds further validation logic (required fields, regexes, etc.). It is more stringent than the MongoDB JSON schema because only new data has to pass through it, and not all existing data in the database is expected to meet those requirements; for example, imports from the existing CSV may not have all the data that we've made required through the curator portal. (A sketch of changing these first two layers follows this list.)
- The OpenAPI specs, which, in addition to providing API documentation, validate requests and responses. For the most part, our API case object mirrors the MongoDB case object, so changes to the schema will, unless intentionally accounted for in the dataservice, also affect the API.
- The curator UI, which sends and receives case objects via the aforementioned curator API. So again, if something changes in the API, it will affect the model objects used in the UI.
- The CSV → JSON converter, which converts the existing (Sheets-originated) CSV data to JSON that conforms to the aforementioned MongoDB schema. If you add a new field to the case schema that is not present in the old data, you don't need to worry about this; however, if you're modifying a field that is part of the conversion process, the converter will need to be updated to generate the correct fields/data.
- The MongoDB → CSV exporter, which exports specified fields from the MongoDB cases into a CSV format that we can make available to researchers, similar to the CSV that was generated from Google Sheets originally. If you add, remove, or rename a field and it's part of (or should be added to) the CSV export, you'll need to update the exporter. [TBD/WIP]
- The sample data, which unfortunately is sprinkled throughout the stack and is used for seeding local databases and for unit & integration testing. These examples need to be updated to match the changes.
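
As an illustration of the first two layers, adding a hypothetical top-level notes field (the field name and constraints are made up) would mean widening the MongoDB validator and mirroring the field in the Mongoose schema, roughly:

    import mongoose from 'mongoose';
    import { MongoClient } from 'mongodb';

    // 1. MongoDB JSON schema validator: declare the new field and its type.
    //    As noted above, this layer does not validate values.
    async function allowNotesField(client: MongoClient): Promise<void> {
        await client.db('covid19').command({
            collMod: 'cases',
            validator: {
                $jsonSchema: {
                    bsonType: 'object',
                    properties: {
                        // ...existing case properties elided...
                        notes: { bsonType: 'string' },
                    },
                },
            },
        });
    }

    // 2. Mongoose schema in the dataservice: mirror the field and add the
    //    stricter validation that only new data has to satisfy.
    const caseSchema = new mongoose.Schema({
        // ...existing case fields elided...
        notes: {
            type: String,
            maxlength: 500, // invented constraint, purely illustrative
        },
    });

The OpenAPI specs, UI models, converter, exporter, and sample data would then need matching updates as described above.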
#### Sample PRs

#### Testing a schema update

  1. First, make the changes to the affected parts of the stack, including sample data and test fixtures.

  2. Run dev/setup_db.sh and check the output for errors. If one or more documents fails to import, there may be a mismatch between your sample data and your MongoDB json schema.

    ./dev/setup_db.sh

  3. Run the CSV → JSON importer locally and check the output for errors. If one or more documents fails to import, there may be a mismatch between the data the converter outputs and your MongoDB json schema.

    cd data-serving/scripts/data-pipeline
    python3 -m pip install -r requirements.txt
    ./convert_and_import_latest_data.sh -r .01
    
  4. Run the dataservice unit tests. If one of the model tests fails, there may be a mismatch between your test fixtures and the Mongoose schema/data model; if one of the controller tests fails, there may be a mismatch between your test fixtures and the OpenAPI spec.

    cd data-serving/data-service
    npm run test
    
  5. Run the curator API unit tests. If one of the controller tests fails, there may be a mismatch between the dataservice OpenAPI spec and the curator service OpenAPI spec.

    cd verification/curator-service/api
    npm run test
    
  6. Run the integration tests. If one of the tests fails, there may be a mismatch between the curator OpenAPI spec and the curator UI.

    ./dev/test_all.sh

  7. Run the exporter locally. If it fails, there may be a mismatch between the MongoDB json schema and the exporter script. [TBD/WIP]

### Adding an index

Add your index as a migration

#### Sample PRs