APIs for "shadow" metadata
- Get list of creator names
GET https://umbra.edirepository.org/creators/names
Returns a list of all normalized names:
["Abbaszadegan, Morteza","Abbott, Benjamin","Abendroth, Diane","Aber, John", etc.]
Status 200
- Get list of creator names for a particular scope (e.g., edi, knb-lter-arc)
GET https://umbra.edirepository.org/creators/names_for_scope/knb-lter-arc
Returns a list of normalized names for creators associated with the scope:
["Abbott, Benjamin","Asmus, Ashley L","Barker Plotkin, Audrey","Bennington, Cynthia", etc.]
Status 200
For an invalid scope, e.g., "XYZ"
GET https://umbra.edirepository.org/creators/names_for_scope/XYZ
Returns:
[Scope XYZ not found]
Status 400
- Get variants of a creator name
For a name in the list returned by the names API, e.g., "McKnight, Diane M"
GET https://umbra.edirepository.org/creators/name_variants/McKnight, Diane M
Returns a list of variants found for that creator name:
["McKnight, Diane","McKnight, Diane M","Mcknight, Diane","Mcnight, Diane"]
Status 200
For a name NOT in the list returned by the names API, e.g., "Python, Monty"
GET https://umbra.edirepository.org/creators/name_variants/Python, Monty
Returns:
[Name “Python, Monty” not found]
Status 400
- Update creator names
POST https://umbra.edirepository.org/creators/names
This is run as a cron job on each umbra server. It gets the newly added EML files from PASTA and processes them to find new creator names, if any.
- Get possible duplicates
GET https://umbra.edirepository.org/creators/possible_dups
There's information on how to use this API below in the section on maintaining the creator names database.
- Flush possible duplicates
POST https://umbra.edirepository.org/creators/possible_dups
There's information on how to use this API below in the section on maintaining the creator names database.
- Repair a data package that was processed incorrectly
POST https://umbra.edirepository.org/creators/repair
If a data package was processed incorrectly (e.g., if UTF-8 characters were incorrectly decoded), force it to be re-processed. The repair API takes the package ID as a parameter.
E.g., POST https://umbra.edirepository.org/creators/repair/edi.1157.1
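Creator names contain spaces and commas, so a client generally needs to percent-encode them before placing them in a URL path. The following is a minimal Python sketch, not part of umbra, that builds the request URLs described above. The helper names are hypothetical, and the sketch assumes that the server decodes percent-encoded path segments in the usual way and that successful responses are JSON arrays, as the examples suggest.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://umbra.edirepository.org/creators"


def names_for_scope_url(scope):
    """URL for the names_for_scope API (hypothetical helper)."""
    return f"{BASE_URL}/names_for_scope/{quote(scope, safe='')}"


def name_variants_url(name):
    """URL for the name_variants API; spaces and commas are percent-encoded."""
    return f"{BASE_URL}/name_variants/{quote(name, safe='')}"


def repair_url(package_id):
    """URL for the repair API, e.g. for package ID 'edi.1157.1'."""
    return f"{BASE_URL}/repair/{quote(package_id, safe='')}"


def get_json(url):
    """GET a URL and decode the body as JSON (assumed response format).

    urlopen raises urllib.error.HTTPError for error statuses such as the
    400 returned for an unknown scope or name.
    """
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    print(name_variants_url("McKnight, Diane M"))
```

Only the URL builders are exercised when the script is run; `get_json` needs network access to an umbra server.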
The following steps apply to a newly instantiated umbra server, i.e., they are the steps needed to set up umbra initially.
- Create the database
Create a PostgreSQL database called pasta with user pasta (with the usual password).
Edit config.py to contain the correct password.
In the data directory, run the following to initialize the database schema:
psql -d pasta -U pasta -h localhost < create_eml_schemas.sql
- Edit the configuration file
Besides the database password, the configuration file config.py needs to contain the base folder path. Typically, this will be '/home/pasta/umbra'.
- Get the initial set of EML files
A newly instantiated umbra server needs to acquire a complete set of EML files from PASTA. A standalone Python program, download_eml.py, is provided for this purpose. It uses asynchronous I/O but still takes several hours to complete.
- Initialize the "raw" responsible parties database table
After the EML files have been downloaded via download_eml.py, they need to be parsed and their "responsible parties" entries saved in a database table. Accomplish this via the following API:
POST https://umbra.edirepository.org/creators/init_raw_db
The two steps above, getting the initial set of EML files and initializing the "raw" responsible parties database table, only need to be done once. Subsequently, new EML files will be downloaded and the database updated via the update creator names API described above.
The umbra software does its best to resolve and normalize creator names programmatically. To determine whether several name variants (e.g., James T Kirk, James Kirk, Jim Kirk, J Kirk) actually refer to the same person, it looks at various forms of "evidence" (email address, organization name, etc.) in the EML files. In some cases, however, it cannot confidently conclude that two variants are the same person, and manual intervention is needed. For example, evidence may be lacking, or a surname may be misspelled in a particular case.
The API GET https://umbra.edirepository.org/creators/possible_dups returns a list of cases that should be looked at manually. They are cases where a given surname has multiple normalized givennames. The list starts with dups that are new, followed by a line that reads "==================================================". For example, a call to this API returned (partial list):
[
"Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
"Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",
"Bailey: Amey, John, Rosemary, Scott W, Vanessa L",
"Brown: Cindi, Cynthia S, Dana Rachel, James, Jeffrey, Joseph K, Kerry, Renee F",
"Martin: Chris A, Jonathan E, Mac, Mary",
"McDowell: Nate G, Nathan, William H",
"Simmons: Breana, Joseph",
"Smith: Alexander, C Scott, Colin A, Curt, David R, Dylan J, G Jason, Jane E, Jane G, Jason M, Jayme, John W, Jonathan W, Katherine, Kerry, Lesley, Lori, Matthew, Melinda D, Michael, Ned, Nicole J, Rachel, Raymond, Richard, Sarah J, Stacy A, Thomas C",
"Zhou: Jiayu, Jizhong, Weiqi",
"Zimmerman: Jess, Jess K, Richard C",
"==================================================",
"Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
"Adhikari: Ashish, Bishwo",
"Alexander: Clark R, Heather D, Mara, Pezzuoli R",
"Allen: Dennis, Jonathan, Scott Thomas",
"Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",
etc.
]
Scanning down this list, we see several cases that look suspicious:
Anderson: James, Jim
Brown: Cynthia S, Cindi
McDowell: Nate G, Nathan
Smith: Jane E, Jane G
Zimmerman: Jess, Jess K
We check these out by running psql queries on the server.
select surname, givenname, scope, address, organization, email, url, orcid, organization_keywords from eml_files.responsible_parties_test where rp_type='creator' and surname='Anderson' and givenname in ('James','Jim') order by scope;
shows that there is no evidence connecting Jim and James Anderson, so we do nothing.
select surname, givenname, scope, address, organization, email, url, orcid, organization_keywords from eml_files.responsible_parties_test where rp_type='creator' and surname='Brown' and givenname like 'C%' order by scope;
shows that Cindi and Cynthia S Brown are almost certainly different people, so again we do nothing.
The one case that needs fixing is Zimmerman: Jess, Jess K.
The query
select surname, givenname, scope, organization, email, organization_keywords from eml_files.responsible_parties where rp_type='creator' and surname='Zimmerman' and givenname like 'Jes%' order by scope;
returns (partial results):
 surname   | givenname | scope        | organization                                  | email                  | organization_keywords
-----------+-----------+--------------+-----------------------------------------------+------------------------+-----------------------
 Zimmerman | Jess      | edi          | LUQ LTER                                      |                        |
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]      | UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]      | UPuertoRico
 Zimmerman | Jess      | knb-lter-luq |                                               | [email protected]      | UPuertoRico
 Zimmerman | Jess      | knb-lter-luq |                                               | [email protected]      | UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]      | UPuertoRico
 Zimmerman | Jess K    | knb-lter-luq | University of Puerto Rico - Rio Piedras       | [email protected]      | UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico                     | [email protected]      | UPuertoRico
 Zimmerman | Jess K    | knb-lter-luq | University of Puerto Rico - Rio Piedras       | [email protected]      | UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]      | UPuertoRico
umbra will figure out that the Jess and Jess K Zimmerman in the knb-lter-luq scope are the same person, since they're both at the University of Puerto Rico.
The problem is the Jess Zimmerman in the edi scope. But note that here we do have LUQ LTER as the organization, so that seems to make it a safe bet that it's the same Jess Zimmerman as in knb-lter-luq. To tell umbra that they are the same person, we edit the data file corrections_name_variants.xml and add these entries:
<person>
<variant>
<surname>Zimmerman</surname>
<givenname>Jess</givenname>
<scope>edi</scope>
</variant>
<variant>
<surname>Zimmerman</surname>
<givenname>Jess</givenname>
<scope>knb-lter-luq</scope>
</variant>
<variant>
<surname>Zimmerman</surname>
<givenname>Jess K</givenname>
<scope>knb-lter-luq</scope>
</variant>
</person>
These changes should be made in GitHub and pulled down to the server.
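To avoid hand-typing errors, an entry like the one above could be generated rather than written manually. The following Python sketch is only an illustration: `build_person` is a hypothetical helper, not part of umbra, and it produces just the `<person>` element, which still has to be pasted into corrections_name_variants.xml by hand.

```python
import xml.etree.ElementTree as ET


def build_person(variants):
    """Build a <person> element from (surname, givenname, scope) tuples.

    Hypothetical helper; the element names follow the example entry above.
    """
    person = ET.Element("person")
    for surname, givenname, scope in variants:
        variant = ET.SubElement(person, "variant")
        ET.SubElement(variant, "surname").text = surname
        ET.SubElement(variant, "givenname").text = givenname
        ET.SubElement(variant, "scope").text = scope
    ET.indent(person, space="    ")  # requires Python 3.9 or later
    return ET.tostring(person, encoding="unicode")


if __name__ == "__main__":
    print(build_person([
        ("Zimmerman", "Jess", "edi"),
        ("Zimmerman", "Jess", "knb-lter-luq"),
        ("Zimmerman", "Jess K", "knb-lter-luq"),
    ]))
```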
Once we have resolved all of the suspicious cases, we need to tell umbra to flush the "new" possible dups so the next time we ask for possible dups we aren't given the same new cases to check out all over again. To flush, POST https://umbra.edirepository.org/creators/possible_dups.
Now, if we do a GET https://umbra.edirepository.org/creators/possible_dups, the returned list will look like:
[
"==================================================",
"Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
"Adhikari: Ashish, Bishwo",
"Alexander: Clark R, Heather D, Mara, Pezzuoli R",
"Allen: Dennis, Jonathan, Scott Thomas",
"Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",
etc.
]
That is, the list of new possible dups above the "==================================================" line is now empty.
As we update the creator names over the following days, new possible dups will probably show up, and we repeat the same process.
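When reviewing the possible_dups response, it can help to separate the new entries from the previously seen ones programmatically. The sketch below assumes the response has already been decoded into a list of strings in the format shown above; the function names are hypothetical and not part of umbra.

```python
# Separator line copied from the example responses above.
SEPARATOR = "=================================================="


def split_possible_dups(entries):
    """Split the list into (new, previously_seen) around the separator line.

    Raises ValueError if the separator line is absent.
    """
    index = entries.index(SEPARATOR)
    return entries[:index], entries[index + 1:]


def parse_entry(entry):
    """Turn 'Surname: Given1, Given2' into (surname, [givennames])."""
    surname, _, rest = entry.partition(": ")
    givennames = [name.strip() for name in rest.split(",") if name.strip()]
    return surname, givennames


if __name__ == "__main__":
    sample = [
        "Zimmerman: Jess, Jess K, Richard C",
        SEPARATOR,
        "Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
    ]
    new, seen = split_possible_dups(sample)
    for entry in new:
        print(parse_entry(entry))
```

After a flush, the first element of the response is the separator line, so `split_possible_dups` returns an empty list of new entries.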
There are several other data files to be aware of.
corrections_nicknames.xml lists nicknames that we want to recognize. E.g.,
<nickname>
<name1>jim</name1>
<name2>james</name2>
</nickname>
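As an illustration of how such entries might be consumed, the following Python sketch loads nickname pairs into a lookup. It is not umbra's implementation: the `<nicknames>` root element and the function names are hypothetical, and treating each pair as symmetric and case-insensitive is an assumption.

```python
import xml.etree.ElementTree as ET

# Sample input; the <nicknames> root element is an assumption.
SAMPLE = """<nicknames>
    <nickname>
        <name1>jim</name1>
        <name2>james</name2>
    </nickname>
</nicknames>"""


def load_nicknames(xml_text):
    """Return a set of unordered, lowercased nickname pairs."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for nickname in root.iter("nickname"):
        name1 = nickname.findtext("name1", default="").strip().lower()
        name2 = nickname.findtext("name2", default="").strip().lower()
        if name1 and name2:
            pairs.add(frozenset((name1, name2)))
    return pairs


def are_nicknames(pairs, first, second):
    """True if the two given names are listed as a nickname pair."""
    return frozenset((first.lower(), second.lower())) in pairs


if __name__ == "__main__":
    pairs = load_nicknames(SAMPLE)
    print(are_nicknames(pairs, "Jim", "James"))
```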
corrections_orcids.xml lists ORCIDs for creators that have incorrect ORCIDs in one or more EML files. E.g.,
<correction>
<surname>Stanley</surname>
<givenname>Emily%</givenname>
<orcid>0000-0003-4922-8121</orcid>
</correction>
corrections_overrides.xml lists cases where a name is misspelled and we want to correct the spelling, not just treat it as a variant. E.g.,
<override>
<original>
<surname>Morse</surname>
<givenname>Jennfier F</givenname>
</original>
<corrected>
<surname>Morse</surname>
<givenname>Jennifer F</givenname>
</corrected>
<scope>knb-lter-nwt</scope>
</override>
Note that the name_variants API will still return the misspelled name as one of the variants so that searches will find datasets with the name misspelled.
organizations.xml lists organization names and emails that correspond to a particular organization (usually a university). E.g.,
<organization>
<name>U%New Mexico</name>
<name>U%NM</name>
<name>UNM</name>
<email>unm.edu</email>
<keyword>UNM</keyword>
</organization>
Any organization name, address, or email address that matches or contains one of the variants will mark a record in the database as being in the given organization (UNM, in this example). This helps umbra determine what organization a record is associated with, despite the many variations in the ways organization names, addresses, and email addresses are spelled.
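The matching described above can be sketched in Python as follows. This is an illustration under stated assumptions, not umbra's actual code: it assumes that `%` in a name variant matches any run of characters, as in SQL LIKE; that name variants are matched case-sensitively against the organization and address fields; and that email variants are matched as substrings of the email address. The function names and the sample email address are hypothetical.

```python
import re

# Values copied from the organizations.xml example above.
UNM = {
    "names": ["U%New Mexico", "U%NM", "UNM"],
    "emails": ["unm.edu"],
    "keyword": "UNM",
}


def like_to_regex(pattern):
    """Translate a pattern with '%' wildcards into a compiled regex.

    Assumption: '%' matches any run of characters, as in SQL LIKE.
    """
    return re.compile(".*".join(re.escape(part) for part in pattern.split("%")))


def organization_keyword(org, organization="", address="", email=""):
    """Return org's keyword if any field contains one of its variants, else None."""
    for pattern in org["names"]:
        regex = like_to_regex(pattern)
        if regex.search(organization) or regex.search(address):
            return org["keyword"]
    if any(domain in email.lower() for domain in org["emails"]):
        return org["keyword"]
    return None


if __name__ == "__main__":
    print(organization_keyword(UNM, organization="University of New Mexico"))
    print(organization_keyword(UNM, email="jdoe@unm.edu"))  # made-up address
```

Short patterns such as "U%NM" are permissive, so a real implementation would need to guard against false positives; this sketch does not attempt that.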