umbra

APIs for "shadow" metadata

Creator Names APIs

APIs used by the Data Portal:

APIs used to keep the names database up-to-date:

Manual steps involved in creating the creator names database:

The following steps apply to a newly instantiated umbra server; i.e., they are the steps needed to set up umbra for the first time.

  • Create the database
    Create a PostgreSQL database called pasta with user pasta (with the usual password).
    Edit config.py to contain the correct password.
    In the data directory, run the following to initialize the database schema:
    psql -d pasta -U pasta -h localhost < create_eml_schemas.sql

  • Edit the configuration file
    Besides the database password, the configuration file config.py needs to contain the base folder path. Typically, this will be '/home/pasta/umbra'.

  • Get the initial set of EML files
    A newly instantiated umbra server needs to acquire a complete set of EML files from PASTA. A standalone Python program, download_eml.py, is provided for this purpose. It uses asynchronous I/O but still takes several hours to complete.

  • Initialize the "raw" responsible parties database table
    After the EML files have been downloaded via download_eml.py, they need to be parsed and their "responsible parties" entries saved in a database table. Accomplish this via the following API:
    POST https://umbra.edirepository.org/creators/init_raw_db

  • The two steps above, getting the initial set of EML files and initializing the "raw" responsible parties database table, only need to be done once. Subsequently, new EML files will be downloaded and the database updated via the update creator names API described above.

Manual steps involved in maintaining the creator names database:

The umbra software does its best to resolve and normalize creator names programmatically. To determine whether several name variants (e.g., James T Kirk, James Kirk, Jim Kirk, J Kirk) actually refer to the same person, it looks at various forms of "evidence" (email address, organization name, etc.) in the EML files. In some cases, however, it cannot confidently conclude that two variants are the same person, and manual intervention is needed. For example, evidence may be lacking, or a surname may be misspelled in a particular case.

The API GET https://umbra.edirepository.org/creators/possible_dups returns a list of cases that should be reviewed manually: cases where a given surname has multiple normalized given names. The list starts with the possible dups that are new, followed by a separator line that reads "==================================================", followed by the complete list of possible dups. For example, a call to this API returned (partial list):

[
    "Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
    "Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",
    "Bailey: Amey, John, Rosemary, Scott W, Vanessa L",
    "Brown: Cindi, Cynthia S, Dana Rachel, James, Jeffrey, Joseph K, Kerry, Renee F",
    "Martin: Chris A, Jonathan E, Mac, Mary",
    "McDowell: Nate G, Nathan, William H",
    "Simmons: Breana, Joseph",
    "Smith: Alexander, C Scott, Colin A, Curt, David R, Dylan J, G Jason, Jane E, Jane G, Jason M, Jayme, John W, Jonathan W, Katherine, Kerry, Lesley, Lori, Matthew, Melinda D, Michael, Ned, Nicole J, Rachel, Raymond, Richard, Sarah J, Stacy A, Thomas C",
    "Zhou: Jiayu, Jizhong, Weiqi",
    "Zimmerman: Jess, Jess K, Richard C",
    "==================================================",
    "Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
    "Adhikari: Ashish, Bishwo",
    "Alexander: Clark R, Heather D, Mara, Pezzuoli R",
    "Allen: Dennis, Jonathan, Scott Thomas",
    "Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",    
    etc.
]
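
When reviewing a response like the one above, it can help to separate the new cases from the rest before reading through them. The following is a minimal sketch, not part of umbra: the function names are hypothetical, and it assumes the response has already been decoded into a Python list of strings and that the separator is the only entry made up entirely of "=" characters.

```python
def is_separator(line):
    """True for an entry consisting only of '=' characters (assumed separator format)."""
    return bool(line) and set(line) == {"="}


def split_possible_dups(listing):
    """Return (new_dups, all_dups), split at the first separator entry.

    If no separator is found, every entry is treated as part of the full list.
    """
    for idx, line in enumerate(listing):
        if is_separator(line):
            return listing[:idx], listing[idx + 1:]
    return [], list(listing)


def parse_entry(entry):
    """Split 'Surname: Given1, Given2' into (surname, [given names])."""
    surname, _, givennames = entry.partition(": ")
    return surname, [g.strip() for g in givennames.split(",")]
```

Only the entries in `new_dups` need to be examined on a given pass; `parse_entry` makes it easier to compare the given names listed under one surname.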

Scanning down this list, we see several cases that look suspicious:
Anderson: James, Jim
Brown: Cynthia S, Cindi
McDowell: Nate G, Nathan
Smith: Jane E, Jane G
Zimmerman: Jess, Jess K

We check these out by running psql queries on the server.

select surname, givenname, scope, address, organization, email, url, orcid, organization_keywords from eml_files.responsible_parties_test where rp_type='creator' and surname='Anderson' and givenname in ('James','Jim') order by scope;

shows that there is no evidence connecting Jim and James Anderson, so we do nothing.

select surname, givenname, scope, address, organization, email, url, orcid, organization_keywords from eml_files.responsible_parties_test where rp_type='creator' and surname='Brown' and givenname like 'C%' order by scope;

shows that Cindi and Cynthia S Brown are almost certainly different people, so again we do nothing.
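
Since these checks follow the same pattern for each surname, the query text can be generated rather than retyped. The sketch below is an illustration only, not something umbra provides: `build_creator_query` and `ALLOWED_TABLES` are hypothetical names, and the `%s` placeholders assume a DB-API driver that uses that parameter style (psycopg2 does). It builds the statement and its parameters but does not connect to the database.

```python
# Columns taken from the queries shown above.
COLUMNS = ("surname, givenname, scope, address, organization, email, url, "
           "orcid, organization_keywords")

# Table names that appear in this document; anything else is rejected
# because the table name is interpolated into the SQL text directly.
ALLOWED_TABLES = {
    "eml_files.responsible_parties",
    "eml_files.responsible_parties_test",
}


def build_creator_query(surname, givennames, table="eml_files.responsible_parties"):
    """Return (sql, params) selecting creator records for one surname."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table}")
    if not givennames:
        raise ValueError("at least one given name is required")
    placeholders = ", ".join(["%s"] * len(givennames))
    sql = (
        f"select {COLUMNS} from {table} "
        f"where rp_type='creator' and surname=%s "
        f"and givenname in ({placeholders}) order by scope;"
    )
    return sql, (surname, *givennames)
```

With a driver such as psycopg2, the result would typically be passed as `cursor.execute(sql, params)`, which leaves quoting of the names to the driver.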

The one case that needs fixing is Zimmerman: Jess, Jess K.
The query

select surname, givenname, scope, organization, email, organization_keywords from eml_files.responsible_parties where rp_type='creator' and surname='Zimmerman' and givenname like 'Jes%' order by scope;

returns (partial results):

  surname  | givenname |    scope     |                 organization                  |         email          | organization_keywords 
-----------+-----------+--------------+-----------------------------------------------+------------------------+-----------------------
 Zimmerman | Jess      | edi          | LUQ LTER                                      |                        | 
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq |                                               | [email protected] |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq |                                               | [email protected] |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico
 Zimmerman | Jess K    | knb-lter-luq | University of Puerto Rico - Rio Piedras       | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico                     | [email protected] |  UPuertoRico
 Zimmerman | Jess K    | knb-lter-luq | University of Puerto Rico - Rio Piedras       | [email protected]    |  UPuertoRico
 Zimmerman | Jess      | knb-lter-luq | University of Puerto Rico, Rio Piedras Campus | [email protected]    |  UPuertoRico

umbra will figure out that the Jess and Jess K Zimmerman in the knb-lter-luq scope are the same person, since they're both at the University of Puerto Rico.

The problem is the Jess Zimmerman in the edi scope. But note that here we do have LUQ LTER as the organization, so that seems to make it a safe bet that it's the same Jess Zimmerman as in knb-lter-luq. To tell umbra that they are the same person, we edit the data file corrections_name_variants.xml and add these entries:

    <person>
        <variant>
            <surname>Zimmerman</surname>
            <givenname>Jess</givenname>
            <scope>edi</scope>
        </variant>
        <variant>
            <surname>Zimmerman</surname>
            <givenname>Jess</givenname>
            <scope>knb-lter-luq</scope>
        </variant>
        <variant>
            <surname>Zimmerman</surname>
            <givenname>Jess K</givenname>
            <scope>knb-lter-luq</scope>
        </variant>
    </person>

These changes should be made in GitHub and pulled down to the server.
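
Typing these entries by hand is error-prone when a person has many variants. As a rough sketch, a `<person>` element in the format shown above could be generated with Python's standard library and then pasted into corrections_name_variants.xml. The helper name `build_person_entry` is hypothetical, and `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_person_entry(variants):
    """Build a <person> element from (surname, givenname, scope) tuples.

    Returns the element serialized as an indented string.
    """
    person = ET.Element("person")
    for surname, givenname, scope in variants:
        variant = ET.SubElement(person, "variant")
        ET.SubElement(variant, "surname").text = surname
        ET.SubElement(variant, "givenname").text = givenname
        ET.SubElement(variant, "scope").text = scope
    ET.indent(person, space="    ")  # Python 3.9+
    return ET.tostring(person, encoding="unicode")


if __name__ == "__main__":
    print(build_person_entry([
        ("Zimmerman", "Jess", "edi"),
        ("Zimmerman", "Jess", "knb-lter-luq"),
        ("Zimmerman", "Jess K", "knb-lter-luq"),
    ]))
```

The printed element still has to be placed inside the existing file by hand; the sketch does not read or modify corrections_name_variants.xml.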

Once we have resolved all of the suspicious cases, we need to tell umbra to flush the "new" possible dups so the next time we ask for possible dups we aren't given the same new cases to check out all over again. To flush, POST https://umbra.edirepository.org/creators/possible_dups.

Now, if we do a GET https://umbra.edirepository.org/creators/possible_dups, the returned list will look like:

[
    "==================================================",
    "Adams: Byron, Henry D, Jesse B, Leslie M, Mary Beth, Phyllis C",
    "Adhikari: Ashish, Bishwo",
    "Alexander: Clark R, Heather D, Mara, Pezzuoli R",
    "Allen: Dennis, Jonathan, Scott Thomas",
    "Anderson: Christopher B, Clarissa, Cody A, Craig, Iris, James, Jim, John P, Kathryn, Lucy, Lyle, Mike D, Rebecca, Robert A, Suzanne Prestrud, Thomas, William",    
    etc.
]

i.e., the list of new possible dups above the "==================================================" line is now empty.

As we update the creator names over the course of some days, new possible dups will probably show up, and we do the same process over again.
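
The review cycle uses the same URL with two HTTP methods: GET to list possible dups and POST to flush the new ones. A minimal sketch of that pairing with Python's standard library follows. The function names are hypothetical, and it assumes the endpoint needs no authentication, headers, or request body beyond what is shown, which this document does not state.

```python
import json
import urllib.request

BASE_URL = "https://umbra.edirepository.org/creators"


def possible_dups_request(flush=False):
    """Build the request: GET lists possible dups, POST flushes the new ones."""
    if flush:
        # An empty body is an assumption; the API may not require one.
        return urllib.request.Request(
            f"{BASE_URL}/possible_dups", data=b"", method="POST"
        )
    return urllib.request.Request(f"{BASE_URL}/possible_dups", method="GET")


def fetch_possible_dups():
    """Send the GET request and decode the JSON list (performs network I/O)."""
    with urllib.request.urlopen(possible_dups_request()) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it easy to check the method and URL before anything is flushed on the server.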

There are several other data files to be aware of.

corrections_nicknames.xml lists nicknames that we want to recognize. E.g.,

    <nickname>
        <name1>jim</name1>
        <name2>james</name2>
    </nickname>
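
To illustrate how such entries might be consumed, the sketch below loads nickname pairs and checks whether two given names are listed as nicknames of each other. It is not umbra's implementation: the enclosing `<nicknames>` root element, the function names, and the symmetric, case-insensitive treatment of each pair are all assumptions.

```python
import xml.etree.ElementTree as ET

# Sample data; the <nicknames> wrapper element is assumed, not taken from umbra.
SAMPLE = """<nicknames>
    <nickname>
        <name1>jim</name1>
        <name2>james</name2>
    </nickname>
</nicknames>"""


def load_nicknames(xml_text):
    """Return a set of unordered, lowercased name pairs."""
    pairs = set()
    for nickname in ET.fromstring(xml_text).iter("nickname"):
        name1 = nickname.findtext("name1", default="").strip().lower()
        name2 = nickname.findtext("name2", default="").strip().lower()
        if name1 and name2:
            pairs.add(frozenset((name1, name2)))
    return pairs


def are_nickname_variants(a, b, pairs):
    """True if the two given names appear together in a nickname entry."""
    return frozenset((a.strip().lower(), b.strip().lower())) in pairs
```

Storing each pair as a `frozenset` means the order of `name1` and `name2` in the file does not matter for the lookup.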

corrections_orcids.xml lists ORCIDs for creators that have incorrect ORCIDs in one or more EML files. E.g.,

    <correction>
        <surname>Stanley</surname>
        <givenname>Emily%</givenname>
        <orcid>0000-0003-4922-8121</orcid>
    </correction>

corrections_overrides.xml lists cases where a name is misspelled and we want to correct the spelling, not just treat it as a variant. E.g.,

    <override>
        <original>
            <surname>Morse</surname>
            <givenname>Jennfier F</givenname>
        </original>
        <corrected>
            <surname>Morse</surname>
            <givenname>Jennifer F</givenname>
        </corrected>
        <scope>knb-lter-nwt</scope>
    </override>

Note that the name_variants API will still return the misspelled name as one of the variants so that searches will find datasets with the name misspelled.

organizations.xml lists organization names and emails that correspond to a particular organization (usually a university). E.g.,

    <organization>
        <name>U%New Mexico</name>
        <name>U%NM</name>
        <name>UNM</name>
        <email>unm.edu</email>
        <keyword>UNM</keyword>
    </organization>

Any organization name, address, or email address that matches or contains one of the variants will mark a record in the database as being in the given organization (UNM, in this example). This helps umbra determine what organization a record is associated with, despite the many variations in the ways organization names, addresses, and email addresses are spelled.
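
The matching rule described above can be sketched as follows. This is an illustration under stated assumptions rather than umbra's actual logic: it treats `%` as a wildcard for any run of characters (as in SQL LIKE), matches organization names case-sensitively anywhere in the value, and matches email entries as case-insensitive substrings. The function names are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Convert a pattern with '%' wildcards into a compiled regular expression."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts))


def matches_organization(value, names, emails):
    """True if value matches or contains one of the name or email variants."""
    if not value:
        return False
    if any(like_to_regex(name).search(value) for name in names):
        return True
    lowered = value.lower()
    return any(email.lower() in lowered for email in emails)


if __name__ == "__main__":
    unm_names = ["U%New Mexico", "U%NM", "UNM"]
    unm_emails = ["unm.edu"]
    for text in ("University of New Mexico", "University of Puerto Rico"):
        print(text, matches_organization(text, unm_names, unm_emails))
```

A record whose organization, address, or email passes this check would be tagged with the entry's keyword (UNM in the example). Short variants such as "UNM" can also match inside unrelated strings, so the real matching rules may be stricter than this sketch.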
