Any new outbreak that is tracked by Global.health will use the reusable-data-service
for CRUD operations on line list cases.
It incorporates a simple "day zero" schema along with facilities for flexibly extending the data model.
TBD as the service gets written!
The `data-service` folder contains the service for CRUD operations involving the COVID-19 schema.
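To give a feel for what such a CRUD service looks like, here is a minimal sketch of an Express + Mongoose read endpoint. The route path, port, model definition, and error handling are assumptions for illustration, not the actual data-service code.

```typescript
// crud-sketch.ts — a hypothetical Express + Mongoose read endpoint, for
// illustration only; it is not the actual data-service code.
import express from 'express';
import mongoose from 'mongoose';

// Schemaless model bound to the `cases` collection so the sketch stays small;
// the real service defines a full Mongoose schema (see the schema notes below).
const Case = mongoose.model('Case', new mongoose.Schema({}, { strict: false }), 'cases');

const app = express();
app.use(express.json());

// GET a single case by its MongoDB id.
app.get('/api/cases/:id', async (req, res) => {
  try {
    const found = await Case.findById(req.params.id);
    if (!found) {
      return res.status(404).json({ error: 'Case not found' });
    }
    return res.json(found);
  } catch (err) {
    // An unparseable id (or other lookup failure) ends up here.
    return res.status(422).json({ error: String(err) });
  }
});

async function start(): Promise<void> {
  await mongoose.connect('mongodb://localhost:27017/covid19');
  app.listen(3000);
}

start();
```

The production service layers the stricter Mongoose validation and OpenAPI request/response validation described later in this document on top of handlers like this.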
G.h COVID-19 case data, as well as the source, user, and session data for the curator portal, is stored in a MongoDB database.
We run multiple MongoDB instances: local instances for development; dev and qa instances (backing https://dev-data.covid-19.global.health/ and https://qa-data.covid-19.global.health in the case of COVID-19 data); and a prod instance (backing https://data.covid-19.global.health/ for COVID-19).
Each instance has a `covid19` database, which in turn has collections for each type of data, e.g. `cases`, `users`, etc. The data in the `cases` collection is the primary data that G.h collects, verifies, and shares. Each document in the `cases` collection represents a single disease case.
To learn more about what a `case` consists of, try importing some sample data into a local MongoDB instance and connecting to it with MongoDB Compass. Alternatively, you can peruse the schema.
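If you prefer the command line to a GUI, a quick way to peek at a document is the MongoDB Node.js driver. This is a minimal sketch, assuming a local instance on the default port with the sample data already imported:

```typescript
// inspect-case.ts — print one document from the local `cases` collection.
// Assumes a local MongoDB on the default port with sample data already imported,
// and that the `mongodb` driver is installed (npm install mongodb).
import { MongoClient } from 'mongodb';

async function main(): Promise<void> {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  try {
    const cases = client.db('covid19').collection('cases');
    console.log('case count:', await cases.countDocuments());
    const example = await cases.findOne();
    console.log(JSON.stringify(example, null, 2));
  } finally {
    await client.close();
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```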
We store past revisions of `case` documents in the `caserevisions` collection. These revisions are made in the application layer when a case is updated; we follow the MongoDB Document Version Pattern. A `caserevision` document has a `case` field containing a snapshot of the case at a given revision. The collection indexes the id of the case and its revision for quick lookups.
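Concretely, the pattern amounts to snapshotting the current document before an update is applied. The sketch below is illustrative only; the helper name and the `revisionMetadata.revisionNumber` field path are assumptions rather than a description of the production code.

```typescript
// save-revision.ts — a sketch of the MongoDB Document Version Pattern as applied
// to cases; field paths and helper names are assumptions for illustration.
import { Db, ObjectId } from 'mongodb';

async function saveRevisionBeforeUpdate(db: Db, caseId: ObjectId): Promise<void> {
  const current = await db.collection('cases').findOne({ _id: caseId });
  if (!current) {
    return;
  }
  // Snapshot the current state under a `case` field. A compound index on
  // { 'case._id': 1, 'case.revisionMetadata.revisionNumber': 1 } makes
  // "case X at revision N" lookups fast, per the description above.
  await db.collection('caserevisions').insertOne({ case: current });
}

export { saveRevisionBeforeUpdate };
```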
G.h has millions of COVID-19 case records that predate the new curator portal. These are exported to a gzipped CSV in a separate repo. We can convert these cases to JSON that conforms to the `cases` collection schema and ingest them into a MongoDB instance using the conversion scripts in `data-serving/scripts/data-pipeline` (see the checklist below for an example invocation).
We provide a flattened version of the complete dataset accessible through the Data page, refreshed on a nightly basis and stored on S3. The scripts that orchestrate this are organized using AWS SAM, and export the dataset in chunks in parallel, parse them, recombine them and then compress and upload the result along with a data dictionary.
The script that aggregates and exports counts for the Map visualizations may also be found in this set of scripts.
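As a rough illustration of the recombine-and-compress step (not the actual AWS SAM scripts; the assumption that every chunk repeats the CSV header row is mine):

```typescript
// recombine.ts — concatenate exported CSV chunks into a single gzipped file,
// keeping the header row from the first chunk only. Illustrative sketch, not
// the production export pipeline.
import { createReadStream, createWriteStream } from 'fs';
import { createInterface } from 'readline';
import { createGzip } from 'zlib';
import { finished } from 'stream/promises';

async function recombineAndCompress(chunkPaths: string[], outPath: string): Promise<void> {
  const gzip = createGzip();
  const sink = createWriteStream(outPath);
  gzip.pipe(sink);
  for (const [chunkIndex, path] of chunkPaths.entries()) {
    let lineNumber = 0;
    for await (const line of createInterface({ input: createReadStream(path) })) {
      // Skip the repeated CSV header in every chunk after the first.
      if (lineNumber++ === 0 && chunkIndex > 0) {
        continue;
      }
      if (!gzip.write(line + '\n')) {
        // Respect backpressure from the gzip stream.
        await new Promise<void>((resolve) => gzip.once('drain', () => resolve()));
      }
    }
  }
  gzip.end();
  await finished(sink); // wait until the gzipped file is fully written
}

export { recombineAndCompress };
```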
Schema updates can affect the whole stack, including:
- The MongoDB case JSON schema, which is a 'validator' applied to the `cases` collection on the database. This schema is similar to a relational database schema in that it enforces which fields are allowed to be present and what their types are; it does not validate any values. This validator applies to all case data regardless of how it's entered into the database (see the sketch after this list).
- The Mongoose schema/data model in the dataserver, which mirrors the MongoDB JSON schema and adds validation logic of its own (required fields, regexes, etc.). It is more stringent than the MongoDB JSON schema because only new data has to pass through the Mongoose schema, and not all existing data in the database meets those expectations: for example, imports from the existing CSV may not have all the fields that we've made required through the curator portal.
- The OpenAPI specs, which, in addition to providing API documentation, validate requests and responses. For the most part, our API case object mirrors the MongoDB case object, so changes to the schema will also affect the API unless they are intentionally accounted for in the dataserver.
- The curator UI, which sends and receives case objects via the aforementioned curator API. So again, if something changes in the API, it will affect the model objects used in the UI.
- The CSV → JSON converter, which converts the existing (Sheets-originated) CSV data to JSON that conforms to the aforementioned MongoDB schema. If you add a new field to the case schema that is not present in the old data, you don't need to worry about this; however, if you're modifying a field that is part of the conversion process, the converter will need to be updated to generate the correct fields/data.
- The MongoDB → CSV exporter, which exports specified fields from the MongoDB cases into a CSV that we can make available to researchers, similar to the CSV originally generated from Google Sheets. If you add, remove, or rename a field that is part of (or should be added to) the CSV export, you'll need to update the exporter. [TBD/WIP]
- The sample data, which is unfortunately sprinkled throughout the stack and needs to be updated along with the schema. It is used for:
  - Seeding local databases
  - Dataserver unit test fixtures
  - Integration test fixtures [1], [2]
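To make the first two items concrete, here is a minimal sketch of what adding a hypothetical `variantOfConcern` string field could look like in both database-side layers. The field name, regex, and structure are illustrative assumptions, not part of the real schema:

```typescript
// schema-change-sketch.ts — a hypothetical `variantOfConcern` field added in the
// two database-side layers. Field name, regex, and structure are illustrative only.
import mongoose from 'mongoose';

// 1. Fragment of the MongoDB JSON schema validator applied to the `cases`
//    collection. This layer only constrains which fields may be present and
//    what their BSON types are; it does not validate values.
const caseJsonSchemaFragment = {
  bsonType: 'object',
  properties: {
    variantOfConcern: { bsonType: 'string' },
  },
};

// 2. Fragment of the Mongoose schema in the dataserver. This layer can be
//    stricter: it can require the field for new data and validate its value.
const caseSchemaFragment = new mongoose.Schema({
  variantOfConcern: {
    type: String,
    required: false,
    match: /^[A-Za-z0-9.]+$/, // e.g. "B.1.1.7"
  },
});

export { caseJsonSchemaFragment, caseSchemaFragment };
```

The same field would then also need to appear in the OpenAPI specs, and in the sample data and test fixtures listed above.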
1. First, make the changes to the affected parts of the stack, including sample data and test fixtures.
2. Run `dev/setup_db.sh` and check the output for errors. If one or more documents fail to import, there may be a mismatch between your sample data and your MongoDB JSON schema.

   ```shell
   ./dev/setup_db.sh
   ```
3. Run the CSV → JSON importer locally and check the output for errors. If one or more documents fail to import, there may be a mismatch between the data the converter outputs and your MongoDB JSON schema.

   ```shell
   cd data-serving/scripts/data-pipeline
   python3 -m pip install -r requirements.txt
   ./convert_and_import_latest_data.sh -r .01
   ```
4. Run the dataservice unit tests. If one of the model tests fails, there may be a mismatch between your test fixtures and the Mongoose schema/data model; if one of the controller tests fails, there may be a mismatch between your test fixtures and the OpenAPI spec.

   ```shell
   cd data-serving/data-service
   npm run test
   ```
5. Run the curator API unit tests. If one of the controller tests fails, there may be a mismatch between the dataservice OpenAPI spec and the curator service OpenAPI spec.

   ```shell
   cd verification/curator-service/api
   npm run test
   ```
6. Run the integration tests. If one of the tests fails, there may be a mismatch between the curator OpenAPI spec and the curator UI.

   ```shell
   ./dev/test_all.sh
   ```
7. Run the exporter locally. If it fails, there may be a mismatch between the MongoDB JSON schema and the exporter script. [TBD/WIP]
If your change requires a new index, add the index as a migration.
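A minimal sketch of such a migration, assuming migrate-mongo-style `up`/`down` functions (the tooling, collection, index fields, and index name here are all assumptions for illustration):

```typescript
// add-caserevisions-index.ts — a hypothetical migration; the migrate-mongo-style
// up/down shape, collection, fields, and index name are assumptions.
import { Db } from 'mongodb';

export async function up(db: Db): Promise<void> {
  await db.collection('caserevisions').createIndex(
    { 'case._id': 1, 'case.revisionMetadata.revisionNumber': 1 },
    { name: 'caseIdAndRevision' },
  );
}

export async function down(db: Db): Promise<void> {
  await db.collection('caserevisions').dropIndex('caseIdAndRevision');
}
```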