A repository containing support code and resources initially developed at the Institute for Medical Informatics, Statistics and Documentation at the Medical University of Graz (Austria) for participation at the 2017 TREC Precision Medicine Track. For further information on this track and the final results please check the official TREC-PM 2017 overview paper. Team name: imi_mug
It was then further improved for participation at the 2018 TREC Precision Medicine Track. Improvements include: support for subtemplates and the possibility to use disjunctive queries (dis_max) allowing e.g. synonyms and hypernyms to have different weights. Team name: hpi-dhc.
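To illustrate the disjunctive queries mentioned above, here is a minimal sketch of how a `dis_max` query body could weight a term, its synonyms and its hypernyms differently. The field name `abstract`, the example terms and the boost values are assumptions for illustration only; they are not taken from the templates in this repository.

```python
import json


def build_dis_max(field, weighted_terms, tie_breaker=0.0):
    """Build an Elasticsearch dis_max query body.

    weighted_terms: list of (term, boost) pairs. Each pair becomes one
    match clause; dis_max scores a document by its best-matching clause
    (plus tie_breaker times the scores of the other matching clauses).
    """
    clauses = [
        {"match": {field: {"query": term, "boost": boost}}}
        for term, boost in weighted_terms
    ]
    return {"query": {"dis_max": {"queries": clauses, "tie_breaker": tie_breaker}}}


# Hypothetical example: original term, a synonym and a hypernym,
# each with its own (made-up) weight.
query = build_dis_max(
    "abstract",
    [("melanoma", 1.0), ("malignant melanoma", 0.8), ("skin neoplasm", 0.3)],
)
print(json.dumps(query, indent=2))
```

Because `dis_max` takes the best-scoring clause instead of summing all of them, a document matching both a term and its synonym is not counted twice.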
If you use `imi_mug`'s original data or code in your work, please cite their TREC 2017 proceedings paper:
TREC 2017 Precision Medicine - Medical University of Graz. Pablo López-García, Michel Oleynik, Zdenko Kasáč and Stefan Schulz. Text REtrieval Conference, Gaithersburg, MD. 2017. Available at https://trec.nist.gov/pubs/trec26/papers/imi_mug-PM.pdf.
If you use any of the improvements mentioned above, please also cite our TREC 2018 proceedings paper:
HPI-DHC at TREC 2018 Precision Medicine Track. Michel Oleynik, Erik Faessler, Ariane Morassi Sasso, et al. Text REtrieval Conference, Gaithersburg, MD. 2018. Available at https://trec.nist.gov/pubs/trec27/papers/hpi-dhc-PM.pdf.
- hpi_dhc TREC 2018 presentation slides
- hpi_dhc TREC 2018 Poster
- hpi_dhc TREC 2018 Data Artifacts
- TREC 2018 proceedings.
- JDK 11+ (won't compile with JDK8)
- maven
- make (for `trec_eval` tool)
- gcc (for `trec_eval` tool)
- perl (for `sample_eval` tool)
- Elasticsearch 5.4.0+
- python3 (to parse UMLS, get fasttext embeddings)
You require the `MRCONSO.RRF` file, which can be obtained from the official UMLS downloads. Then adapt the paths in the `scripts/createUmlsTermSynsets.py` script to read from your `MRCONSO.RRF` file and create the `resources/umlsSynsets.txt` file. Framework classes making use of the UMLS synsets will expect the file at this location.
- Download https://download.nlm.nih.gov/umls/kss/2019AA/umls-2019AA-mrconso.zip
- `unzip umls-2019AA-mrconso.zip`
- `python3 scripts/createUmlsTermSynsets.py MRCONSO.RRF ENG > resources/umlsSynsets.txt`
- `wc -c resources/umlsSynsets.txt` (expected size: 338449057 bytes)
- `gzip resources/umlsSynsets.txt`
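For orientation, the following sketch shows one way to group term strings by concept from `MRCONSO.RRF`-style lines. It is not the logic of `scripts/createUmlsTermSynsets.py`, and the output format expected in `resources/umlsSynsets.txt` may differ. It relies on the standard pipe-delimited MRCONSO layout (CUI in column 1, language in column 2, term string in column 15); the sample rows and identifiers are made up for illustration.

```python
from collections import defaultdict

# Column positions in the pipe-delimited MRCONSO.RRF layout (0-based).
CUI, LAT, STR = 0, 1, 14


def build_synsets(lines, language="ENG"):
    """Group distinct term strings by concept identifier (CUI)."""
    synsets = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR or fields[LAT] != language:
            continue  # skip malformed rows and other languages
        synsets[fields[CUI]].add(fields[STR])
    # Keep only concepts with more than one distinct term.
    return {cui: sorted(terms) for cui, terms in synsets.items() if len(terms) > 1}


# Made-up rows in MRCONSO column order (most columns hold dummy values).
sample = [
    "C0000001|ENG|P|L1|PF|S1|Y|A1||||SRC|PT|X1|Melanoma|0|N||",
    "C0000001|ENG|S|L2|PF|S2|Y|A2||||SRC|SY|X1|Malignant melanoma|0|N||",
    "C0000001|GER|P|L3|PF|S3|Y|A3||||SRC|PT|X1|Melanom|0|N||",
    "C0000002|ENG|P|L4|PF|S4|Y|A4||||SRC|PT|X2|Aspirin|0|N||",
]
synsets = build_synsets(sample)
print(synsets)
```

For the real file, the same function can be fed an open file handle, since it iterates line by line and does not load the whole file at once.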
`FastText` embeddings are used to create document embeddings for LtR features. Note that their performance impact seemed to be minor in experiments on the TREC-PM 17/18 data, so they can probably be left out without great performance penalties. However, this can't be said for sure before evaluation on the 2019 gold standard.
The embeddings can be recreated by:
- Run the BANNER gene tagger from jcore-projects, version>=2.4 on the Medline/PubMed 2019 baseline.
- Extract the document text from those documents with at least one tagged gene in them. This should be around 8 million documents. The text is the title plus abstract text (e.g. by using the JCoRe PubMed reader and the JCoRe To TXT consumer in the `DOCUMENT` mode). No postprocessing was applied (it should be done for better models but hasn't been done for the embeddings used).
- Create `FastText` word embeddings with a dimension of 300. We used the `.bin` output for LtR features.
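How word vectors are turned into document embeddings is not described here. A common approach, shown below purely as an assumption, is to average the vectors of a document's tokens. The tiny 3-dimensional lookup table stands in for a trained 300-dimensional FastText model, and whitespace tokenization is a simplification.

```python
def document_embedding(text, word_vectors, dim):
    """Average the vectors of all known tokens; zero vector if none match."""
    total = [0.0] * dim
    count = 0
    for token in text.lower().split():
        vector = word_vectors.get(token)
        if vector is None:
            continue  # a real FastText model would build OOV vectors from subwords
        total = [t + v for t, v in zip(total, vector)]
        count += 1
    return [t / count for t in total] if count else total


# Toy vectors (hypothetical values, dimension 3 instead of 300).
toy_vectors = {
    "braf": [1.0, 0.0, 2.0],
    "mutation": [3.0, 2.0, 0.0],
}
embedding = document_embedding("BRAF mutation in melanoma", toy_vectors, dim=3)
print(embedding)
```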
# All executions should be run where the pom file is, usually the root of the project
# How to run the pubmed experimenter
# Necessary to define the year and type of gold-standard (for evaluation)
mvn clean install
mvn exec:java -Dexec.mainClass="at.medunigraz.imi.bst.trec.LiteratureArticlesExperimenter"
# How to run the clinical trials experimenter
# Necessary to define the year and type of gold-standard (for evaluation)
mvn clean install
mvn exec:java -Dexec.mainClass="at.medunigraz.imi.bst.trec.ClinicalTrialsExperimenter"
# How to run the KeywordExperimenter
# Necessary to define the year and type of gold-standard (for evaluation)
# For positive booster, in the keyword template leave boost = 1
# For negative booster, in the keyword template leave boost = -1
# Also, in the KeywordExperimenter the keywordsSource needs to be specified
mvn clean install
mvn exec:java -Dexec.mainClass="at.medunigraz.imi.bst.trec.KeywordExperimenter" > out.txt &
cat out.txt | grep -e "\(^[0-9\.]*\)\(\;.*\)\(with.*\)\(\\[.*\\]\)\(.*\)" | sed -r "s/"\(^[0-9\.]*\)\(\;.*\)\(with.*\)\(\\[.*\\]\)\(.*\)"/\1 \2 \4/" > results.txt
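The `grep`/`sed` filter above keeps the leading score, the text after the semicolon and the bracketed part of each matching line. A roughly equivalent sketch in Python follows; the sample lines are hypothetical, since the actual format of `out.txt` is not shown here.

```python
import re

# Same groups as the grep/sed pattern: score, ';...' part, 'with...' part,
# bracketed part, remainder. Groups 1, 2 and 4 are kept.
PATTERN = re.compile(r"^([0-9.]*)(;.*)(with.*)(\[.*\])(.*)")


def filter_results(lines):
    """Yield 'score ;description [parameters]' for every matching line."""
    for line in lines:
        match = PATTERN.search(line.rstrip("\n"))
        if match:
            yield f"{match.group(1)} {match.group(2)} {match.group(4)}"


# Hypothetical log lines; only the first has the expected shape.
sample = [
    "0.5312;infNDCG with template [boost=1] done",
    "INFO starting experiment",
]
results = list(filter_results(sample))
print(results)
```

To process a finished run, pass an open handle on `out.txt` to `filter_results` and write the yielded lines to `results.txt`.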
The databases can be re-created using the components in the `uima` subdirectory.
All UIMA pipelines have been created and run with the JCoRe Pipeline Components in version 0.4.0. Note that all pipelines require their libraries in the `lib/` directory, which does not exist at first. It is automatically created and populated by opening the pipeline with the `JCoRe Pipeline Builder CLI` under the above link. Opening the pipeline should be enough. If this does not create and populate the `lib/` directory, try opening and saving the pipeline.
- Install `ElasticSearch 5.4` and `Postgres >= 9.6`. The experiments were run with `Postgres 9.6.13`.
- Change into the `uima` directory on the command line and execute `./gradlew install-uima-components`. This must run through successfully in order to complete the following steps. Note that Gradle is only used for scripting; the projects are all built with Maven. Thus, check the Maven output for success or failure messages. Gradle may report success despite Maven failing.
- Run the `pm-to-xmi-db-pipeline` and the `ct-to-xmi-db-pipeline` with the `JCoRe Pipeline Runner`. Before you actually run those, check the `pipelinerunner.xml` configuration files in both projects for the number of threads being used. Adapt them to the capabilities of your system, if necessary.
- Configure the `preprocessing` and `preprocessing_ct` pipelines with the `JCoRe Pipeline Builder` to activate nearly all components (explained in a moment). Some are deactivated in this release. Note that there are some components specific to `BANNER` gene tagging and `FLAIR` gene tagging. Use the `BANNER` components; Flair hasn't been used in our submitted runs. You might also leave the `LingScope` and `MutationFinder` components off because those haven't been used either. Configure the `uima/costosys.xml` file in all pipelines to point to your Postgres database. Run the components. They will write the annotation data into the Postgres database. We used multiple machines for this, employing the SLURM scheduler (not required). All in all we had 96 CPU cores available. Processing time was in the hours, much less than a day for PubMed. The processing will accordingly take longer or shorter depending on the resources at your disposal.
- Configure the `pubmed-indexer` and `ct-indexer` projects to work with your ElasticSearch index using the `JCoRe Pipeline Builder`. Execute `mvn package` in both pipeline directories to build the indexing code, which is packaged as a `jar` and automatically put into the `lib` directory of the pipelines. Run the components.
If all steps have been performed successfully, the indices should now be present in your ElasticSearch instance. To run the experiments, also configure the `<repository root>/config/costosys.xml` file to point to your database. Then run the `at.medunigraz.imi.bst.trec.LiteratureArticlesExperimenter` and `at.medunigraz.imi.bst.trec.ClinicalTrialsExperimenter` classes.
There are a few settings that are configured via Java system properties. Such settings do not count as regular configuration settings but change basic behaviour of the system and are often used for tests.
- `at.medunigraz.imi.bst.retrieval.subtemplates.folder` - sets the folder where the subtemplates are expected (default: `/subtemplates/`)
- `de.julielab.java.utilities.cache.enabled` - if set to `false`, the caching library is deactivated. The caching code is still there but the `CacheAccess` objects always return `null` when retrieving cached objects.
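One way to set these properties is to pass them as `-D` options on the command line. The sketch below only assembles and prints such a command rather than running it. It assumes that properties given to `mvn` are visible to the class started by `exec:java`, which runs in the same JVM as Maven; the folder value `/my-subtemplates/` is a made-up example.

```python
import shlex


def build_command(main_class, properties):
    """Assemble an `mvn exec:java` call with extra -D system properties."""
    command = ["mvn", "exec:java", f"-Dexec.mainClass={main_class}"]
    for key, value in properties.items():
        command.append(f"-D{key}={value}")
    return command


command = build_command(
    "at.medunigraz.imi.bst.trec.LiteratureArticlesExperimenter",
    {
        # Hypothetical folder; the default is /subtemplates/
        "at.medunigraz.imi.bst.retrieval.subtemplates.folder": "/my-subtemplates/",
        # Deactivate the caching library
        "de.julielab.java.utilities.cache.enabled": "false",
    },
)
# Print a copy-pasteable shell line; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```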
Q: Do I really need to store all the documents into the database? Wouldn't it be quicker just to index everything directly from the source data?
A: Indexing directly from the source data is entirely possible by combining the respective parts of the three steps (reading, preprocessing, indexing). Note, however, that the LtR feature generation makes use of the documents stored in the database. Thus, LtR wouldn't work this way.