Species Enumeration README

Alex gave me the idea to create this file before I leave the CEP project. The purpose of this README is to document the code from the last stage of my project: enumerating all the different species of data (i.e. document structures) that can be found in the existing CEP mongo databases. It should be a useful, time-saving reference for whoever continues the project after I have left.

List of Files with Descriptions

  • remote_connector_template.py This is a template Python script for connecting to the molspace.rc.fas.harvard.edu server that hosts the existing CEP mongo databases. It handles connecting to the database (taking the username and password as command-line arguments) and closing the connection cleanly on exit (even when that exit is triggered by Ctrl-C). If you want to use PyMongo to connect to the remote databases, it is highly recommended that you base your code on this script; the basic framework is tried and tested. A minimal sketch of the same pattern is given after this list.

  • species_enumerator.py This is the Python script used to poll each collection of each database on a mongo server and determine all the species (i.e. structures of documents) in each collection. Note that it works against localhost, since I already copied all the CEP mongo data to my local machine using the scripts here. For the sake of efficiency, this script does not read every single document in a collection; it uses a probabilistic sampling approach instead (sketched after this list). The process_collection() method is somewhat slow for small databases (e.g. 1 million records) but scales linearly, allowing all 1.5 TB of data to be processed in about 12 hours. The new_process_collection() method polls a larger proportion of the documents in each collection, and thus has a lower chance of missing rare species. It runs at about the same speed as process_collection() on databases of about 1 million records, but it scales quasilinearly (i.e. n log n) and is therefore much slower on the larger databases. This script dumps its output in the output folder.

  • outputs_parser.py This script takes as input the files generated by species_enumerator.py and, for each collection in each database, compiles a summary file in parsed-output that does the following (the underlying field-set logic is sketched after this list):

    • creates a "super species" -- a sample document that contains every single field and subfield that appears anywhere in that collection
    • enumerates the set of species in terms of which (sub)fields each species is missing relative to the super species
    • writes out the intersection of all the species -- a sample document that contains only the fields present in every document in that collection
  • parsed_outputs_concatenator.sh This is a simple bash script that takes the outputs of outputs_parser.py and concatenates all the outputs belonging to a single database, formatting everything into a relatively pretty, human-readable layout. It puts its output files in printable-parsed-output.
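For reference, here is a minimal sketch of the connection pattern that remote_connector_template.py implements. It is not the original script: the usage string, argument handling, and example query are illustrative assumptions; only the hostname comes from the description above.

```python
# Minimal PyMongo connection sketch in the spirit of remote_connector_template.py.
# The argument handling and the example query are placeholders, not the
# original script's exact interface.
import atexit
import signal
import sys

from pymongo import MongoClient


def main():
    if len(sys.argv) != 3:
        sys.exit("usage: remote_connector.py <username> <password>")
    username, password = sys.argv[1], sys.argv[2]

    client = MongoClient(
        "molspace.rc.fas.harvard.edu",
        username=username,
        password=password,
    )

    # Close the connection cleanly on normal exit and on Ctrl-C.
    atexit.register(client.close)
    signal.signal(signal.SIGINT, lambda signum, frame: sys.exit(0))

    # Your queries go here; as a trivial example:
    print(client.list_database_names())


if __name__ == "__main__":
    main()
```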
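The probabilistic approach in species_enumerator.py can be approximated as follows. This is a sketch under assumptions: the $sample aggregation stage and the fixed sample size stand in for whatever sampling scheme process_collection() and new_process_collection() actually use, and a "species" is keyed here by the set of dotted field paths in a document.

```python
# Sketch of probabilistic species enumeration over one collection.
# A "species" is keyed by the frozenset of (nested) field paths in a document;
# the $sample stage and the sample size are illustrative assumptions.
def field_paths(doc, prefix=""):
    """Return the set of dotted field paths in a (possibly nested) document."""
    paths = set()
    for key, value in doc.items():
        path = prefix + key
        paths.add(path)
        if isinstance(value, dict):
            paths |= field_paths(value, prefix=path + ".")
    return paths


def sample_species(collection, sample_size=10_000):
    """Sample documents from `collection` and collect the distinct species seen."""
    species = set()
    for doc in collection.aggregate([{"$sample": {"size": sample_size}}]):
        species.add(frozenset(field_paths(doc)))
    return species
```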
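The "super species" and intersection that outputs_parser.py produces boil down to a union and an intersection over the species' field sets. A sketch of that logic, operating on bare field-path sets rather than on the sample documents the real script writes out:

```python
# Sketch of the super-species / intersection bookkeeping in outputs_parser.py,
# expressed over sets of field paths rather than full sample documents.
def summarize_collection(species_sets):
    """`species_sets` is a non-empty list of frozensets of field paths, one per species."""
    super_species = set().union(*species_sets)                 # every field seen anywhere
    common_fields = set.intersection(*map(set, species_sets))  # fields present in every species
    missing = {s: super_species - s for s in species_sets}     # what each species lacks
    return super_species, common_fields, missing
```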

Next Steps

Now that we have an understanding of all the different fields present in the mongo databases, we should consider how to migrate that data to a new, non-mongo schema in one unified database.

Suggested method: entity-attribute-value (EAV) model. Instead of having each row in the database be an entire record, each row holds a single entity-attribute-value triple (a record id, a field name, and that field's value). The number of rows increases vastly, but the database is much easier to store and maintain, and relational queries are much more manageable with this format than with the mongo blob-type format.
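To make the idea concrete, here is a hedged sketch of flattening one mongo-style document into EAV rows. The (entity id, attribute, value) row format and the example field names are hypothetical, not an agreed-upon schema.

```python
# Sketch of flattening a mongo-style document into entity-attribute-value rows.
# The (entity_id, attribute, value) row format and example fields are illustrative.
def to_eav_rows(entity_id, doc, prefix=""):
    rows = []
    for key, value in doc.items():
        attribute = prefix + key
        if isinstance(value, dict):
            rows.extend(to_eav_rows(entity_id, value, prefix=attribute + "."))
        else:
            rows.append((entity_id, attribute, value))
    return rows


# Example: one record becomes one row per (sub)field.
doc = {"smiles": "c1ccccc1", "properties": {"homo": -5.6, "lumo": -1.2}}
for row in to_eav_rows("mol-00001", doc):
    print(row)
# ('mol-00001', 'smiles', 'c1ccccc1')
# ('mol-00001', 'properties.homo', -5.6)
# ('mol-00001', 'properties.lumo', -1.2)
```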