This repository has been archived by the owner on Feb 14, 2023. It is now read-only.

Data Analysis Process (Proposed)

Justin Littman edited this page Jan 18, 2019 · 2 revisions

Steps:

  1. Manually obtain and analyze sample or real data.
  2. Write a harvester script to obtain the real data programmatically, with documentation on the method, frequency, status of any agreement, etc. The harvester script(s) should fetch and normalize the source data so it can be fed from this repository into a Traject+ pipeline.
  3. Use that analysis to document a proposed mapping of the harvested data to the DLME IR MAP, including any requested normalizations.
  4. Attempt to write a Traject+ mapping config for the above mapping, using the existing patterns. Ticket any mappings or normalizations that aren't currently supported. Use the mapping config only for this first pass (i.e. don't try to write modules). Consider whether the mapping needs tests.
  5. Attempt to run the DLME application locally with the above Traject+ mappings, and analyze the output to ticket any needed actions or questions about using the system.
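
Step 2's harvest-and-normalize flow can be sketched as a small shell script. Everything below (the endpoint in the comment, the directory layout, the whitespace-trimming rule) is a hypothetical illustration, not a project convention:

```shell
# Hypothetical harvester sketch: obtain source records and normalize them
# into a per-source directory for a Traject+ pipeline.
set -e
mkdir -p output/example_source

# In a real harvester this record would come from curl/wget against the
# provider's API or OAI-PMH endpoint, e.g.:
#   curl -s "https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc" > raw.xml
cat > output/example_source/raw.xml <<'XML'
<record>
  <title>  An Example Title  </title>
</record>
XML

# Normalize: trim stray whitespace around element content before handing
# the file to Traject+.
sed -E 's/>[[:space:]]+/>/g; s/[[:space:]]+</</g' \
  output/example_source/raw.xml > output/example_source/normalized.xml

cat output/example_source/normalized.xml
```

The point of the sketch is the shape, not the tooling: fetch, write the raw copy somewhere reproducible, and keep normalization as a separate, documented step.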

All of the above then feeds into the work cycle, in which the development team can review the documentation (with the intended goals of the harvesting and mapping), the attempted configs, and the tickets, and work on pipeline issues that ease the process for staging and production.

Running Traject Locally

Run the following from wherever you have Traject and Traject+ installed (for ease of use, the examples below use the DLME codebase).

Run a Prototype Config to JSON Output in Shell

$ bundle exec traject -w JsonWriter -c config/traject.rb -c lib/traject/fgdc_config.rb -s source='harvard_fgdc' spec/fixtures/fgdc/HARVARD.SDE2.AFRICOVER_EG_RIVERS.fgdc.xml
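
JsonWriter emits one JSON document per record, one per line on stdout, so the result is easy to capture and inspect with line-oriented tools. Below is a sketch with a stand-in two-record file; the ids and field values are made up for illustration, and in real use `records.json` would come from redirecting the command above:

```shell
# Stand-in for redirecting the JsonWriter command above to a file, e.g.:
#   bundle exec traject -w JsonWriter -c config/traject.rb \
#     -c lib/traject/fgdc_config.rb -s source='harvard_fgdc' FILE > records.json
# The ids and field values below are made up for illustration.
cat > records.json <<'EOF'
{"id":["example_001"],"cho_title":["Example Record One"]}
{"id":["example_002"],"cho_title":["Example Record Two"]}
EOF

# Line-oriented tools then give quick sanity checks on the output.
wc -l < records.json                  # number of records written
grep -c '"cho_title"' records.json    # records carrying a title field
```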

Run a Prototype Config to Debug Output in Shell

$ bundle exec traject -w DebugWriter -c config/traject.rb -c lib/traject/fgdc_config.rb -s source='harvard_fgdc' spec/fixtures/fgdc/HARVARD.SDE2.AFRICOVER_EG_RIVERS.fgdc.xml

Run a Prototype Config to a local DLME Application

In a separate shell, start up the DLME application's Solr (this presumes you've already run bundle install and are running from your local copy of the DLME codebase):

$ bundle exec solr_wrapper

In another shell, start the DLME Rails application (same presumptions as above):

$ bundle exec rails s

Now, in a third shell (same presumptions), run the DLME application's Traject+ installation with whatever mapping and a pointer to whatever harvested data you want to transform (note that the difference here is the Writer: SolrWriter sends the transformed records to the running Solr instance):

$ bundle exec traject -w SolrWriter -c config/traject.rb -c lib/traject/fgdc_config.rb -s source='harvard_fgdc' spec/fixtures/fgdc/HARVARD.SDE2.AFRICOVER_EG_RIVERS.fgdc.xml

Alternatively, for that last step you can run the DLME application's bin script to load all of the prototype data:

$ ./bin/fetch_and_import
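
The batch shape behind a fetch-and-import style script is a loop that runs the same Traject+ config over every harvested file. In this sketch `run_traject` is a hypothetical stand-in so the loop runs without the DLME stack; in real use its body would be the `bundle exec traject -w SolrWriter ...` command shown above, with `"$f"` as the input file:

```shell
# Sketch of looping one Traject+ config over a directory of harvested files.
# 'run_traject' is a hypothetical stand-in for the real traject invocation.
run_traject() { echo "transformed $1" > "$1.out"; }

mkdir -p harvested
touch harvested/a.xml harvested/b.xml

for f in harvested/*.xml; do
  run_traject "$f"
done
```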

Some Other Config to JSON Output in Shell

The same pattern applies to any other config: substitute your own config file and source key (shown here as placeholders) and point at the data you want to transform:

$ bundle exec traject -w JsonWriter -c config/traject.rb -c lib/traject/your_config.rb -s source='your_source' path/to/harvested/data.xml