umbra

APIs for "shadow" metadata

Creator Names APIs

APIs used by the Data Portal:

Get list of creator names
GET https://umbra.edirepository.org/creators/names
Returns a list of all normalized names:
["Abbaszadegan, Morteza","Abbott, Benjamin","Abendroth, Diane","Aber, John", etc.]
Status 200

Get list of creator names for a particular scope (e.g., edi, knb-lter-arc)
GET https://umbra.edirepository.org/creators/names_for_scope/knb-lter-arc
Returns a list of normalized names for creators associated with the scope:
["Abbott, Benjamin","Asmus, Ashley L","Barker Plotkin, Audrey","Bennington, Cynthia", etc.]
Status 200

For an invalid scope, e.g., "XYZ"
GET https://umbra.edirepository.org/creators/names_for_scope/XYZ
Returns:
[Scope XYZ not found]
Status 400

Get variants of a creator name
For a name in the list returned by the names API, e.g., "McKnight, Diane M"
GET https://umbra.edirepository.org/creators/name_variants/McKnight, Diane M
Returns a list of variants found for that creator name:
["McKnight, Diane","McKnight, Diane M","Mcknight, Diane","Mcnight, Diane"]
Status 200

For a name NOT in the list returned by the names API, e.g., "Python, Monty"
GET https://umbra.edirepository.org/creators/name_variants/Python, Monty
Returns:
[Name "Python, Monty" not found]
Status 400
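
The read-only endpoints above can be called from a small client. The following is a minimal Python sketch, not part of umbra: the helper names (names_url, name_variants_url, fetch_list) are illustrative, and it assumes that the server accepts percent-encoded names in the path and that successful responses are JSON lists. A 400 response (unknown scope or name) is reported as None.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "https://umbra.edirepository.org/creators"


def names_url(scope=None):
    """URL for all normalized names, or for the names in one scope."""
    if scope is None:
        return f"{BASE_URL}/names"
    return f"{BASE_URL}/names_for_scope/{urllib.parse.quote(scope, safe='')}"


def name_variants_url(name):
    """URL for the variants of a normalized name such as 'McKnight, Diane M'."""
    # Assumption: the server accepts the comma and space percent-encoded.
    return f"{BASE_URL}/name_variants/{urllib.parse.quote(name, safe='')}"


def fetch_list(url):
    """Return the decoded list, or None if the server answers with status 400."""
    try:
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 400:  # unknown scope or name
            return None
        raise
```

For example, fetch_list(name_variants_url("McKnight, Diane M")) would request the variants shown above.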

APIs used to keep the names database up-to-date:

Update creator names
POST https://umbra.edirepository.org/creators/names
This is run as a cron job on each umbra server. It gets the newly added EML files from PASTA and processes them to find new creator names, if any.

Get possible duplicates
GET https://umbra.edirepository.org/creators/possible_dups
For how to use this API, see the section on maintaining the creator names database below.

Flush possible duplicates
POST https://umbra.edirepository.org/creators/possible_dups
For how to use this API, see the section on maintaining the creator names database below.

Repair a data package that was processed incorrectly
POST https://umbra.edirepository.org/creators/repair
If a data package was processed incorrectly (e.g., if UTF-8 characters were incorrectly decoded), force it to be re-processed. The repair API takes the package ID as a parameter.
E.g., POST https://umbra.edirepository.org/creators/repair/edi.1157.1
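
A repair request can be prepared in Python as sketched below. This is an illustration only: the function name repair_request is hypothetical, the package ID pattern (scope.identifier.revision) is inferred from the edi.1157.1 example, and sending an empty request body is an assumption about what the server expects.

```python
import re
import urllib.request

REPAIR_URL = "https://umbra.edirepository.org/creators/repair"

# Assumed shape of a package ID: scope.identifier.revision, e.g. edi.1157.1
PACKAGE_ID = re.compile(r"[A-Za-z][A-Za-z0-9-]*\.\d+\.\d+")


def repair_request(package_id):
    """Build (but do not send) a POST request that re-processes one package."""
    if not PACKAGE_ID.fullmatch(package_id):
        raise ValueError(f"not a package ID: {package_id!r}")
    # Assumption: the endpoint needs no request body.
    return urllib.request.Request(
        f"{REPAIR_URL}/{package_id}", data=b"", method="POST"
    )
```

Passing the returned object to urllib.request.urlopen sends the request.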

Manual steps involved in creating the creator names database:

The following steps apply to a newly instantiated umbra server, i.e., they are the steps needed to set up umbra for the first time.

Create the database
Create a PostgreSQL database called pasta with user pasta (with the usual password).
Edit config.py to contain the correct password.
In the data directory, run the following to initialize the database schema:
psql -d pasta -U pasta -h localhost < create_eml_schemas.sql

Edit the configuration file
Besides the database password, the configuration file config.py needs to contain the base folder path. Typically, this will be '/home/pasta/umbra'.

Get the initial set of EML files
A newly instantiated umbra server needs to acquire a complete set of EML files from PASTA. A standalone Python program, download_eml.py, is provided for this purpose. It uses async I/O but still takes several hours to complete.

Initialize the "raw" responsible parties database table
After the EML files have been downloaded via download_eml.py, they need to be parsed and their "responsible parties" entries saved in a database table. Accomplish this via the following API:
POST https://umbra.edirepository.org/creators/init_raw_db

The two steps above, getting the initial set of EML files and initializing the "raw" responsible parties database table, only need to be done once. Subsequently, new EML files will be downloaded and the database updated via the update creator names API described above.

Manual steps involved in maintaining the creator names database:

The umbra software does its best to resolve and normalize creator names programmatically. To determine whether several name variants (e.g., James T Kirk, James Kirk, Jim Kirk, J Kirk) actually refer to the same person, it looks at various forms of "evidence" (email address, organization name, etc.) in the EML files. In some cases, however, it cannot confidently conclude that two variants are the same person, for example because evidence is lacking or a surname is misspelled in a particular case. Manual intervention is then needed.

The API GET https://umbra.edirepository.org/creators/possible_dups returns a list of cases that should be looked at manually. They are cases where a given surname has multiple normalized givennames. The list starts with dups that are new, followed by a line that reads "==================================================". For example, a call to this API returned (partial list):

[
    "Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
    "Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",
    "Bailey: Amey, John, Rosemary, Scott W, Vanessa L",
    "Brown: Cindi, Cynthia S, Dana Rachel, James, Jeffrey, Joseph K, Kerry, Renee F",
    "Martin: Chris A, Jonathan E, Mac, Mary",
    "McDowell: Nate G, Nathan, William H",
    "Simmons: Breana, Joseph",
    "Smith: Alexander, C Scott, Colin A, Curt, David R, Dylan J, G Jason, Jane E, Jane G, Jason M, Jayme, John W, Jonathan W, Katherine, Kerry, Lesley, Lori, Matthew, Melinda D, Michael, Ned, Nicole J, Rachel, Raymond, Richard, Sarah J, Stacy A, Thomas C",
    "Zhou: Jiayu, Jizhong, Weiqi",
    "Zimmerman: Jess, Jess K, Richard C",
    "==================================================",
    "Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
    "Adhikari: Ashish, Bishwo",
    "Alexander: Clark R, Heather D, Mara, Pezzuoli R",
    "Allen: Dennis, Jonathan, Scott Thomas",
    "Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",    
    etc.
]
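
A response like the one above can be split into its "new" part and the rest with a few lines of Python. This is a sketch with illustrative function names (split_possible_dups, parse_entry). It treats any entry consisting only of "=" characters as the separator, and treats a response without a separator as entirely new, which is an assumption rather than documented behavior.

```python
def is_separator(entry):
    """True for the divider line made up only of '=' characters."""
    return bool(entry) and set(entry) == {"="}


def split_possible_dups(entries):
    """Split the API response into (new_entries, remaining_entries)."""
    for index, entry in enumerate(entries):
        if is_separator(entry):
            return entries[:index], entries[index + 1:]
    # Assumption: with no separator present, review everything as new.
    return list(entries), []


def parse_entry(entry):
    """Turn 'Surname: Given A, Given B' into (surname, [givennames])."""
    surname, _, givennames = entry.partition(": ")
    return surname, [name.strip() for name in givennames.split(",")]
```

The new entries are the ones to review by hand; parse_entry makes each one easier to feed into a query.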

Scanning down this list, we see several cases that look suspicious:
Anderson: James, Jim
Brown: Cynthia S, Cindi
McDowell: Nate G, Nathan
Smith: Jane E, Jane G
Zimmerman: Jess, Jess K

We check these out by running psql queries on the server.

select surname, givenname, scope, address, organization, email, url, orcid, organization_keywords from eml_files.responsible_parties_test where rp_type='creator' and surname='Anderson' and givenname in ('James','Jim') order by scope;

shows that there is no evidence connecting Jim and James Anderson, so we do nothing.

select surname, givenname, scope, address, organization, email, url, orcid, organization_keywords from eml_files.responsible_parties_test where rp_type='creator' and surname='Brown' and givenname like 'C%' order by scope;

shows that Cindi and Cynthia S Brown are almost certainly different people, so again we do nothing.
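
The queries above follow one pattern, so they can be generated rather than typed by hand. The sketch below builds the SQL text and its parameters for a surname and a list of givennames. It is not part of umbra: the function name creator_query is hypothetical, the %s placeholders assume a driver such as psycopg2, and the table name is interpolated directly, so it must come from trusted input.

```python
COLUMNS = (
    "surname", "givenname", "scope", "address", "organization",
    "email", "url", "orcid", "organization_keywords",
)


def creator_query(surname, givennames, table="eml_files.responsible_parties"):
    """Return (sql, params) selecting creator rows for the given names."""
    if not givennames:
        raise ValueError("at least one givenname is required")
    # One placeholder per givenname; values are passed separately to the driver.
    placeholders = ", ".join(["%s"] * len(givennames))
    sql = (
        f"select {', '.join(COLUMNS)} from {table} "
        "where rp_type = 'creator' and surname = %s "
        f"and givenname in ({placeholders}) order by scope"
    )
    return sql, (surname, *givennames)
```

With psycopg2, the result would be used as cursor.execute(*creator_query("Anderson", ["James", "Jim"])).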

The one case that needs fixing is Zimmerman: Jess, Jess K.
The query

select surname, givenname, scope, organization, email, organization_keywords from eml_files.responsible_parties where rp_type='creator' and surname='Zimmerman' and givenname like 'Jes%' order by scope;

returns (partial results):

  surname  | givenname |    scope     |                 organization                  |         email          | organization_keywords 
-----------+-----------+--------------+-----------------------------------------------+------------------------+-----------------------
 Zimmerman | Jess      | edi          | LUQ LTER                                      |                        | 
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq |                                               | [email protected] |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq |                                               | [email protected] |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico
 Zimmerman | Jess K    | knb-lter-luq | University of Puerto Rico - Rio Piedras       | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico                     | [email protected] |  UPuertoRico
 Zimmerman | Jess K    | knb-lter-luq | University of Puerto Rico - Rio Piedras       | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico

umbra will figure out that the Jess and Jess K Zimmerman in the knb-lter-luq scope are the same person, since they're both at the University of Puerto Rico.

The problem is the Jess Zimmerman in the edi scope. But note that here we do have LUQ LTER as the organization, so that seems to make it a safe bet that it's the same Jess Zimmerman as in knb-lter-luq. To tell umbra that they are the same person, we edit the data file corrections_name_variants.xml and add these entries:

    <person>
        <variant>
            <surname>Zimmerman</surname>
            <givenname>Jess</givenname>
            <scope>edi</scope>
        </variant>
        <variant>
            <surname>Zimmerman</surname>
            <givenname>Jess</givenname>
            <scope>knb-lter-luq</scope>
        </variant>
        <variant>
            <surname>Zimmerman</surname>
            <givenname>Jess K</givenname>
            <scope>knb-lter-luq</scope>
        </variant>
    </person>

These changes should be made in GitHub and pulled down to the server.
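
To avoid hand-typing errors, a person entry like the one above can be generated with Python's standard library. This is a sketch only: the function name person_element is illustrative, and ET.indent requires Python 3.9 or later. The output still has to be pasted into corrections_name_variants.xml by hand.

```python
import xml.etree.ElementTree as ET


def person_element(variants):
    """Build a <person> entry from (surname, givenname, scope) tuples."""
    person = ET.Element("person")
    for surname, givenname, scope in variants:
        variant = ET.SubElement(person, "variant")
        ET.SubElement(variant, "surname").text = surname
        ET.SubElement(variant, "givenname").text = givenname
        ET.SubElement(variant, "scope").text = scope
    ET.indent(person, space="    ")  # pretty-print; Python 3.9+
    return ET.tostring(person, encoding="unicode")


if __name__ == "__main__":
    print(person_element([
        ("Zimmerman", "Jess", "edi"),
        ("Zimmerman", "Jess", "knb-lter-luq"),
        ("Zimmerman", "Jess K", "knb-lter-luq"),
    ]))
```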

Once we have resolved all of the suspicious cases, we need to tell umbra to flush the "new" possible dups so the next time we ask for possible dups we aren't given the same new cases to check out all over again. To flush, POST https://umbra.edirepository.org/creators/possible_dups.

Now, if we do a GET https://umbra.edirepository.org/creators/possible_dups, the returned list will look like:

[
    "==================================================",
    "Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
    "Adhikari: Ashish, Bishwo",
    "Alexander: Clark R, Heather D, Mara, Pezzuoli R",
    "Allen: Dennis, Jonathan, Scott Thomas",
    "Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",    
    etc.
]

i.e., the list of new possible dups above the "==================================================" line is now empty.

As we update the creator names over the course of some days, new possible dups will probably show up, and we do the same process over again.

There are several other data files to be aware of.

corrections_nicknames.xml lists nicknames that we want to recognize. E.g.,

    <nickname>
        <name1>jim</name1>
        <name2>james</name2>
    </nickname>
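
As an illustration of how such entries might be used, the Python sketch below loads nickname pairs into a lookup table and compares two givennames. It is not umbra's implementation: the function names are hypothetical, the pairing is assumed to be symmetric and case-insensitive, and the root element of the file is not assumed (any nickname element found anywhere in the document is read).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def load_nicknames(xml_text):
    """Map each lower-cased name to the set of names it is paired with."""
    table = defaultdict(set)
    for entry in ET.fromstring(xml_text).iter("nickname"):
        first = (entry.findtext("name1") or "").strip().lower()
        second = (entry.findtext("name2") or "").strip().lower()
        if first and second:
            # Assumption: a pairing applies in both directions.
            table[first].add(second)
            table[second].add(first)
    return table


def may_be_same_givenname(a, b, table):
    """True if the names are equal or listed as nicknames of each other."""
    a, b = a.lower(), b.lower()
    return a == b or b in table.get(a, set())
```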

corrections_orcids.xml lists ORCIDs for creators that have incorrect ORCIDs in one or more EML files. E.g.,

    <correction>
        <surname>Stanley</surname>
        <givenname>Emily%</givenname>
        <orcid>0000-0003-4922-8121</orcid>
    </correction>

corrections_overrides.xml lists cases where a name is misspelled and we want to correct the spelling, not just treat it as a variant. E.g.,

    <override>
        <original>
            <surname>Morse</surname>
            <givenname>Jennfier F</givenname>
        </original>
        <corrected>
            <surname>Morse</surname>
            <givenname>Jennifer F</givenname>
        </corrected>
        <scope>knb-lter-nwt</scope>
    </override>

Note that the name_variants API will still return the misspelled name as one of the variants so that searches will find datasets with the name misspelled.

organizations.xml lists organization names and emails that correspond to a particular organization (usually a university). E.g.,

    <organization>
        <name>U%New Mexico</name>
        <name>U%NM</name>
        <name>UNM</name>
        <email>unm.edu</email>
        <keyword>UNM</keyword>
    </organization>

Any organization name, address, or email address that matches or contains one of the variants will mark a record in the database as being in the given organization (UNM, in this example). This helps umbra determine what organization a record is associated with, despite the many variations in the ways organization names, addresses, and email addresses are spelled.
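
The matching described above can be sketched in Python as follows. This is an illustration, not umbra's code: it assumes that % in a variant acts as a wildcard for any run of characters (as in SQL LIKE), that matching is case-sensitive, and that a variant may match anywhere inside the text. The function names and the sample dictionary are hypothetical.

```python
import re


def variant_regex(variant):
    """Compile a variant such as 'U%New Mexico', treating % as a wildcard."""
    literal_parts = (re.escape(part) for part in variant.split("%"))
    return re.compile(".*".join(literal_parts))


def organization_keyword(text, organizations):
    """Return the keyword of the first organization with a matching variant."""
    for keyword, variants in organizations.items():
        # search() finds the variant anywhere in the text ("matches or contains").
        if any(variant_regex(variant).search(text) for variant in variants):
            return keyword
    return None


# Names and email domain from the UNM example, grouped under its keyword.
ORGANIZATIONS = {"UNM": ["U%New Mexico", "U%NM", "UNM", "unm.edu"]}
```

Here organization_keyword("University of New Mexico", ORGANIZATIONS) yields "UNM", while a string with no matching variant yields None.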
