Author Disambiguation #25

Open
keith-ingentium opened this issue Apr 10, 2020 · 4 comments
Labels
Priority: High (This issue should be dealt with as soon as possible)
Status: Suggested (This issue is a suggestion for doing something new or different in CovidGraph)
Tag: Help Wanted (Extra attention is needed)
Type: Data Source (To identify an issue as a data source)

Comments

@keith-ingentium

Authors are currently duplicated for each reference node; they should be unified across all references. A single author node should then link to every reference on which that person is an author.
covidgraph/documentation#1

@motey
Member

motey commented Apr 10, 2020

Currently, every :Author node has the property _hash_id, which is an md5 hash of all other properties of that :Author node. :Author nodes are merged on _hash_id. As a result, two :Author records with identical properties never produce a duplicate.
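A minimal sketch of hash-based merging as described above. The thread does not say how the properties are serialized before hashing, so the JSON canonicalization here is an assumption, and the property names `first` and `last` are hypothetical.

```python
import hashlib
import json


def hash_id(properties: dict) -> str:
    """Return an md5 hex digest over all properties of an author record."""
    # Assumption: serialize with sorted keys so key order does not matter.
    canonical = json.dumps(properties, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def merge_authors(records: list) -> dict:
    """Keep one record per hash, mimicking a merge on _hash_id."""
    merged = {}
    for record in records:
        # The first record seen for a given hash wins; later ones are dropped.
        merged.setdefault(hash_id(record), record)
    return merged


# Hypothetical author records, for illustration only.
records = [
    {"first": "Tom", "last": "Miller"},
    {"last": "Miller", "first": "Tom"},  # same properties, different key order
    {"first": "T.", "last": "Miller"},   # different representation of the name
]
merged = merge_authors(records)
print(len(merged))  # 2: the first two records collapse, the third does not
```

The third record shows why a purely property-based hash cannot unify name variants: any difference in the stored properties yields a different key.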

Every "duplicate" stems from poor source data. It is usually the result of authors using different representations of their names (or being referenced with a different representation of their names).

At the other end of the spectrum, authors who share the same name (e.g. a common name like Tom Miller) are currently merged into one Author.

To avoid these problems, authors can attach an ORCID iD to their papers. More and more authors do this, but unfortunately ORCID iDs are missing from the CORD19 dataset.
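If ORCID iDs were available, the merge key could prefer them over the property hash. The following is only a sketch of that idea, not the CovidGraph implementation; the `orcid` property name is an assumption and the iDs are placeholders, not real identifiers.

```python
import hashlib
import json


def merge_key(author: dict) -> str:
    """Prefer an ORCID iD as the merge key; fall back to a property hash."""
    orcid = (author.get("orcid") or "").strip()
    if orcid:
        return "orcid:" + orcid.upper()
    # Fallback: md5 over the remaining properties (serialization is assumed).
    props = {k: v for k, v in author.items() if k != "orcid"}
    canonical = json.dumps(props, sort_keys=True, ensure_ascii=False)
    return "hash:" + hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Placeholder iDs for illustration only.
miller_a = {"first": "Tom", "last": "Miller", "orcid": "0000-0000-0000-0001"}
miller_b = {"first": "Tom", "last": "Miller", "orcid": "0000-0000-0000-0002"}
miller_a_variant = {"first": "T.", "last": "Miller", "orcid": "0000-0000-0000-0001"}
no_orcid = {"first": "Tom", "last": "Miller"}

print(merge_key(miller_a) == merge_key(miller_b))          # False: same name, different people
print(merge_key(miller_a) == merge_key(miller_a_variant))  # True: name variants, same person
```

Records without an iD still fall back to the hash, so this only helps for the share of authors whose iD is known.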

One could improve the current situation by creating a new data source script that matches papers against PubMed data and tries to obtain more detailed author data from there.
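As a rough starting point, such a script could look up a paper by title through the NCBI E-utilities ESearch endpoint. The sketch below only builds the query URL and reads PMIDs from a response; matching by title is an assumption (a DOI or other identifier may work better), the helper names are hypothetical, and fetching the author details for each PMID is not shown.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_title_query(title: str) -> str:
    """Build an ESearch URL that searches PubMed for a paper title."""
    params = {"db": "pubmed", "term": f"{title}[Title]", "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def extract_pmids(esearch_json: dict) -> list:
    """Read the list of PMIDs from a decoded ESearch JSON response."""
    return esearch_json.get("esearchresult", {}).get("idlist", [])


# Hypothetical title, for illustration only.
url = pubmed_title_query("Example coronavirus paper title")
query = parse_qs(urlparse(url).query)
print(query["term"])

# Hand-written sample shaped like an ESearch JSON response, not real data.
sample_response = {"esearchresult": {"idlist": ["1", "2"]}}
print(extract_pmids(sample_response))
```

A real script would also need rate limiting and a strategy for titles that return zero or several hits.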

As the author name representations in the references in the CORD19 data are very poor, this data will be dropped with the next data model release anyway.

@motey motey closed this as completed Apr 10, 2020
@mpreusse
Member

mpreusse commented Apr 11, 2020

Some additional ideas from the Matrix chat:

The disambiguation problem is a big one for any graph project. The CORD19 dataset didn't even include links to PubMed, which exacerbates the problem. I think what is called for is the ability to preprocess papers to disambiguate authors against a standard database. If you look through the Wikipedia entry on author disambiguation (https://en.wikipedia.org/wiki/Author_name_disambiguation), you will see two efforts at building such a reference database: AMiner and CiteSeer. For this limited dataset, I think we could build a disambiguated database and process all the literature references through the pipeline to disambiguate the authors. It would be an interesting project to work on, and it would provide real value to CovidGraph.

@mpreusse mpreusse reopened this Apr 11, 2020
@amalic
Member

amalic commented Apr 13, 2020

I used Springer's SciGraph in the past; it contains links between persons and organisations. Keep in mind that a person may switch organisations over time.

see: SciGraph Ontology

more: scigraph/docs/jsonld/examples/person.jsonld

@keith-ingentium
Author

Just took a quick look at the data they make available for download. Not sure how useful it is. We may need to develop a database of our own that is specific to the COVID authors and relies on information about institutions, co-authors, etc. in the COVID-19 dataset.
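One way to use co-author information is sketched below: author mentions are grouped when they share a coarse name key (last name plus first initial) and at least one co-author. This is an illustrative heuristic, not a method proposed in the thread; the names, paper ids, and the "First Last" name order are assumptions, and institutions could be added to the context in the same way.

```python
from itertools import combinations


def name_key(name: str) -> str:
    """Coarse key: lowercase last name plus first initial, e.g. 'miller t'."""
    # Assumes non-empty names written in "First Last" order.
    parts = name.replace(".", " ").split()
    return f"{parts[-1].lower()} {parts[0][0].lower()}"


def cluster_mentions(papers: dict) -> list:
    """Group author mentions that share a name key and a co-author."""
    # One mention per (paper, author); its context is the co-authors' keys.
    mentions = []
    for paper_id, authors in papers.items():
        keys = [name_key(a) for a in authors]
        for i, key in enumerate(keys):
            mentions.append((paper_id, key, set(keys[:i] + keys[i + 1:])))

    parent = list(range(len(mentions)))  # union-find forest over mentions

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        _, key_i, ctx_i = mentions[i]
        _, key_j, ctx_j = mentions[j]
        if key_i == key_j and ctx_i & ctx_j:
            parent[find(i)] = find(j)

    clusters = {}
    for i, (paper_id, key, _) in enumerate(mentions):
        clusters.setdefault(find(i), []).append((paper_id, key))
    return list(clusters.values())


# Hypothetical papers and author names, for illustration only.
papers = {
    "p1": ["Tom Miller", "Ana Silva"],
    "p2": ["T. Miller", "A. Silva"],
    "p3": ["Tom Miller", "Wei Chen"],
}
clusters = cluster_mentions(papers)
for cluster in clusters:
    print(cluster)
```

Here the "miller t" mentions on p1 and p2 end up together because both have "silva a" as a co-author, while the mention on p3 stays separate. The pairwise comparison is quadratic in the number of mentions, so a larger dataset would need to compare only within each name key.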

@Jiros Jiros transferred this issue from covidgraph/documentation Dec 7, 2020
@Jiros Jiros added the Type: Data Source, Status: Suggested, Tag: Help Wanted, and Priority: High labels Dec 7, 2020

5 participants