Stitcher

Software for the ingestion and semantic normalization of datasets. Stitcher employs entity resolution algorithms to partition entities within a given dataset into disjoint sets such that those within the same set are considered equivalent. Thus, Stitcher is used to untangle a web of connections between entities from multiple sources, form clusters representing unique substances, and thereby locate the unified set of properties for each substance. At the last step, derived variables are computed by traversing the unified property set.

A technical description of this approach can be found in https://github.com/ncats/stitcher/tree/master/paper

Stitching Approach

We propose a graph-based approach to entity stitching and resolution. Briefly, our approach uses clique detection to do the stitching and resolution as follows:

1. For a given hypergraph (multi-edge) of stitched entities, extract connected components based on stitching keys as defined in StitchKey.
2. For each connected component, perform exhaustive clique enumeration over each stitch key. A clique is a complete subgraph of size 3 or larger.
3. Identify the set of high-confidence cliques. A high-confidence clique is a clique whose members do not belong to any other clique. All nodes in such a clique are merged to become a stitched node.
4. Sort the leftover cliques in descending order of the value |V| * |E|, where |V| is the clique size and |E| is the cardinality of its stitch keys. Stitched nodes are created while iterating through the cliques in this order, ignoring any nodes that have already been stitched.
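
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code Stitcher runs: the function names, the `(node_a, node_b, stitch_key)` edge format, and the key labels in the usage note are hypothetical. The sketch enumerates maximal cliques with the Bron-Kerbosch algorithm, treats two cliques as the same clique when they have the same members, and takes |E| to be the number of stitch keys supporting a clique. Connected-component extraction is skipped because a clique can never span two components.

```python
from collections import defaultdict


def maximal_cliques(adj):
    """Enumerate maximal cliques of an adjacency map (Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def stitch(edges, min_size=3):
    """Group nodes into stitched sets; edges are (node_a, node_b, stitch_key)."""
    # Build one adjacency map per stitch key.
    per_key = defaultdict(lambda: defaultdict(set))
    for a, b, key in edges:
        per_key[key][a].add(b)
        per_key[key][b].add(a)

    # Map each clique (by its member set) to the stitch keys that support it.
    clique_keys = defaultdict(set)
    for key, adj in per_key.items():
        for clique in maximal_cliques(adj):
            if len(clique) >= min_size:
                clique_keys[clique].add(key)

    # Count how many distinct cliques each node belongs to.
    membership = defaultdict(int)
    for clique in clique_keys:
        for node in clique:
            membership[node] += 1

    stitched, used, leftover = [], set(), []
    for clique, keys in clique_keys.items():
        if all(membership[n] == 1 for n in clique):
            stitched.append(set(clique))  # high-confidence clique: merge as is
            used |= clique
        else:
            leftover.append((clique, keys))

    # Rank leftover cliques by |V| * |E|, largest first.
    leftover.sort(key=lambda ck: len(ck[0]) * len(ck[1]), reverse=True)
    for clique, _ in leftover:
        remaining = set(clique) - used  # ignore nodes that are already stitched
        if remaining:
            stitched.append(remaining)
            used |= remaining
    return stitched
```

For example, given triangles {1, 2, 3} and {7, 8, 9} on one key and a triangle {3, 4, 5} supported by two keys, the sketch merges {7, 8, 9} as a high-confidence clique, then {3, 4, 5} (score 6), and finally the remaining nodes {1, 2} of the lower-scoring clique.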

Building and Running Stitcher (new way)

make a project directory to hold two git repos
- mkdir stitcher-project
- cd stitcher-project
clone two repositories into it, this one, and stitcher-data-inxight
- git clone https://github.com/ncats/stitcher
- clone stitcher-data-inxight alongside that folder
  - install git-lfs
    - sudo apt-get install -y git-lfs
    - git lfs install
  - git clone https://github.com/ncats/stitcher-data-inxight
build stitcher - takes 12 hours or so
- get to the stitcher folder, open a new screen
- UPDATE THE VERSION NUMBER, UNLESS YOU WANT TO OVERWRITE THE EXISTING ONE
- docker compose -f build.docker-compose.yml up
- you can close this container, after the database is built
stand up the app / api
- (optional) update the version in api.docker-compose.yml
- docker compose -f api.docker-compose.yml up
add the curations
- get the container id for the API process
  - docker ps
- docker exec -it {{container}} bash
- run the curations
  - python3 scripts/stitcher-curation/applyCurations.py docker --filename scripts/stitcher-curation/dbCurations-2023-02-13.txt
stand up the neo4j browser
- (optional) update the version number in neo4j.docker-compose.yml
- docker compose -f neo4j.docker-compose.yml up
- the first time you run the container, you might have to update the docker compose file to allow writing to the database
  - NEO4J_dbms_read__only=false
- make sure you change it back afterwards, and restart the container

Updating Stitcher (current - 1/25/2024)

download the new gsrs dump file
- https://gsrs.ncats.nih.gov/#/release
- put it in /stitcher-project/stitcher-data-inxight/files
- delete the old one
- commit & push changes to the repo
delete /temp
get updated dailymed files
- check that scripts/dailymed/dailymed_get_noel.sh has all the partial files listed here
  - https://dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm
- run scripts/dailymed/dailymed_get_noel.sh to get all the updated dailymed files
- run dailymed_prepare.sh
  - make sure it says "All done!" at the end
run "python scripts/approvalYears.py"
- I had to fix the purple book download a few times, because there were some random new lines in the dose field for some reason
- run it until it says "done"

Building Stitcher (old way)

This codebase is based on the latest version of the Play framework and as such it needs Java 8 to build. Modules are defined under modules. The main Play app is defined in app. To build the main app, type

$ ./activator {target}

where {target} can be one of {compile, run, test, dist}. Building modules is similar:

$ ./activator {module}/{target}

where {module} is the module name as it appears under modules/ and {target} can be {compile, test}. To run a particular class in a particular module, use the runMain syntax, e.g.,

$ ./activator "project stitcher" "runMain ncats.stitcher.tools.DuctTape"

Detailed Instructions (old way)

Preparing the Database and Stitching

1. Try invoking the sbt shell to check if it is available, then exit.
```
$ sbt
```
2. Initiate (define auxiliary functions, check the Java version, etc.), then exit.
```
$ bash activator2
```
Build, stitch, and calculate events.
1. Make sure you have a file .sbtopts in your stitcher directory that has the following content:
```
-J-Xms1024M -J-Xmx16G -J-Xss1024M -J-XX:+CMSClassUnloadingEnabled -J-XX:+UseConcMarkSweepGC
```
2. From the stitcher directory, run:
```
$ ./scripts/stitching/stitch-all-current.sh
```
  The script will create a date- and time-stamped database named according to the following convention: stitchvYYYYMMDD-hhmmss.db.
  NOTE: Building the database and stitching should take about 14 hours total on a server with two Intel(R) Xeon(R) E5-2665 CPUs. The application uses about 20GB of RAM.
3. Alternatively, to create a log file, run:
```
$ ./scripts/stitching/master-stitch-all-current.sh
```
  NOTE: Since the process takes a while, it's better to run it in a separate screen session so that it keeps running if the connection to the server/terminal is reset. While nohup is another option, it is problematic in this case, as it will stop the job at the end of every command due to a tty output attempt.
```
$ screen
$ ./scripts/stitching/master-stitch-all-current.sh
#press 'ctrl+a', then 'd' to disconnect from the screen
```
NOTE: If you encounter errors, try cleaning the project by removing all target directories directly, and then re-run the script:
```
$ find . -name target -type d -exec rm -rf {} \;
$ bash scripts/stitching/stitch-all-current.sh
```
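
Because each run produces a new database named stitchvYYYYMMDD-hhmmss.db, it can be handy to pick out the most recent one programmatically. The following is a minimal sketch in Python; the function name `latest_database` is hypothetical and not part of the Stitcher scripts, and the sketch assumes every database name follows the convention exactly.

```python
import re
from datetime import datetime

# Matches names such as stitchv20240125-093000.db and captures the timestamp.
DB_NAME = re.compile(r"^stitchv(\d{8}-\d{6})\.db$")


def latest_database(names):
    """Return the name with the most recent embedded timestamp, or None."""
    stamped = []
    for name in names:
        match = DB_NAME.match(name)
        if match:
            when = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
            stamped.append((when, name))
    return max(stamped)[1] if stamped else None
```

Names that do not match the convention are skipped, so the function can be fed a plain directory listing, e.g. `latest_database(os.listdir("."))` after importing `os`.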

Testing Locally (old way)

Stitching (Inxight)

Since the stitching takes a long time, one might want to test a small subset of substances.

Prepare test data sources by selecting a desired subset of substances in each.
To make a G-SRS data source, run:
```
$ ./scripts/stitcher-testing/make-test-gsrs-dump.sh UNII
```
The script takes a UNII and a path to the G-SRS dump as arguments, and excises the record for that UNII from the dump.
NOTE: the first run is slow, but the follow-up runs are fast, as the script will attempt to locate temporary files it produced in /tmp directory.

Modify the test script accordingly and run it:

$ ./scripts/stitching/test/stitch-all-current.sh

App Deployment (old way)

In your stitcher directory, run:
```
$ ./scripts/deployment/restart-stitcher-from-repo.sh YOUR-DATABASE-PATH
```
The script takes one argument, the path to your desired database.
When prompted in the console, in your browser navigate to
http://localhost:9000/app/stitches/latest

Deployment

Build the Binary Distribution

NOTE: only do this if you have changed the stitcher code or are starting anew.

Please make sure you run the following test when you update the stitching algorithm
```
sbt stitcher/"testOnly ncats.stitcher.test.TestStitcher"
```
and ensure that all the basic stitching test cases pass before doing a build.
Make a distribution. In the stitcher directory run:
```
sbt dist
```
It will be created in stitcher/target/universal/ and have a name similar to ncats-stitcher-master-20171110-400d1f1.zip.

Copy the archive to the deployment server (e.g., dev.ncats.io). For example:

#navigate to path-to-stitcher-parent-directory/stitcher/target/universal/ 
#scp to the server
$ scp ncats-stitcher-master-20171110-400d1f1.zip [email protected]:/tmp

Unzip into the desired folder.

#navigate to the desired folder on the deployment server
$ ssh [email protected]
#unzip
$ unzip /tmp/ncats-stitcher-master-20171110-400d1f1.zip

Deploy

In the stitcher folder (where you have prepared the database), archive the database folder and copy it over to the deployment server.
```
$ zip -r stitchv1db.zip stitchv1.db/
$ scp stitchv1db.zip [email protected]:/tmp
```
On the deployment server, navigate to a directory containing the stitcher distribution folder and unzip the database.
```
$ ssh [email protected]
$ unzip /tmp/stitchv1db.zip
```

Start up the app. The script takes the distribution and db folders as arguments.

$ ./scripts/deployment/restart-stitcher.sh ncats-stitcher-master-20171110-400d1f1 stitchv1.db

Summary

To run a new stitcher instance you'll need:

A distribution folder (e.g. ~/ncats-stitcher-master-20171110-400d1f1).
A database (e.g. stitchv1.db).
A files-for-stitcher.ix folder with three files.
The script for (re)starting stitcher restart-stitcher.sh.

Examples

Sample API Queries

https://stitcher.ncats.io/app/stitches/latest
https://stitcher.ncats.io/app/stitches/latest/ + UNII
https://stitcher.ncats.io/app/stitches/latest/aspirin
https://stitcher.ncats.io/api/datasources

Scraping Inxight Target Data

The activity data is linked to the substance records using the UNII identifier.

For example, the UNII for cannabidiol is 19GBJ60SN5, and Inxight contains the following data:

| Primary Target | Pharmacology | Potency |
| --- | --- | --- |
| Vanilloid receptor | Agonist | 3.2 µM [EC50] |
| Dopamine D2 receptor | Partial Agonist | 11.0 nM [Ki] |
| Glycine receptor subunit alpha-3 | Binding Agent | |
| G-protein coupled receptor 55 | Antagonist | 445.0 nM [IC50] |
| Serotonin 1a (5-HT1a) receptor | Agonist | |

The target data can be found in the json in Stitcher as well under:
sgroup / properties / targets
For example:

"targets": [
  {
    "node": 380111,
    "value": "eyJwcmltYXJ5X3RhcmdldF9pZCI6IkNIRU1CTDIxNCIsImNvbXBvdW5kX2lkIjo4MjMzLjAsInRhcmdldF9wcmltYXJ5X3RhcmdldF90eXBlIjoiQ2hFTUJMIiwicHJpbWFyeV90YXJnZXRfdXJpIjoiaHR0cHM6Ly93d3cubmNiaS5ubG0ubmloLmdvdi9wdWJtZWQvMTYyNTg4NTMiLCJ0YXJnZXRfcHJpbWFyeV9wb3RlbmN5X3R5cGUiOiJVbmtub3duIiwicHJpbWFyeV9wb3RlbmN5X3VyaSI6IlVua25vd24iLCJ0YXJnZXRfcGhhcm1hY29sb2d5IjoiQWdvbmlzdCIsInByaW1hcnlfdGFyZ2V0X2xhYmVsIjoiU2Vyb3RvbmluIDFhICg1LUhUMWEpIHJlY2VwdG9yIiwidGFyZ2V0X2lkIjo5NjA4fQ"
  }
]

The value properties are base64 encoded json objects and decoding them yields all relevant target information including the source URL, e.g.:

{
    "primary_target_id": "CHEMBL214",
    "compound_id": 8233.0,
    "target_primary_target_type": "ChEMBL",
    "primary_target_uri": "https://www.ncbi.nlm.nih.gov/pubmed/16258853",
    "target_primary_potency_type": "Unknown",
    "primary_potency_uri": "Unknown",
    "target_pharmacology": "Agonist",
    "primary_target_label": "Serotonin 1a (5-HT1a) receptor",
    "target_id": 9608
}

NOTE: The node property refers to the Stitcher data source.
NOTE: All homonymous properties from different sources are merged into a single array in Stitcher; therefore, some targets coming from different sources may not be base64 encoded.
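
Decoding a value can be sketched in Python as follows. The helper name `decode_target` is hypothetical. The sketch assumes the standard base64 alphabet, restores the trailing `=` padding that the example value above lacks, and returns the value unchanged when it is not base64-encoded JSON, which covers the mixed arrays mentioned in the note.

```python
import base64
import binascii
import json


def decode_target(value):
    """Decode one 'value' string into a dict, or return it unchanged."""
    # Restore any '=' padding that was stripped from the encoded string.
    padded = value + "=" * (-len(value) % 4)
    try:
        return json.loads(base64.b64decode(padded, validate=True))
    except (binascii.Error, ValueError):
        # Not base64, or not JSON once decoded: treat as a plain value.
        return value
```

Applied to every entry under sgroup / properties / targets, this yields dictionaries like the one shown above for the encoded entries and leaves plain strings as they are.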

To get the entire dataset, you can iterate over all entries in the stitcher API using 'top' (must be <11) and 'skip', e.g.:
https://stitcher.ncats.io/api/stitches/v1?top=10&skip=590
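
A paging loop over 'top' and 'skip' might look like the sketch below. The names `iter_stitches` and `fetch_page` are hypothetical, and the `contents` key used to pull records out of each response is an assumption about the response shape that should be checked against a real response. The loop stops when a page comes back with fewer than `top` records.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(top, skip, base="https://stitcher.ncats.io/api/stitches/v1"):
    """Fetch one page of records; the 'contents' key is an assumption."""
    query = urllib.parse.urlencode({"top": top, "skip": skip})
    with urllib.request.urlopen(f"{base}?{query}") as response:
        return json.load(response).get("contents", [])


def iter_stitches(fetch=fetch_page, top=10):
    """Yield records page by page until a short or empty page is returned."""
    if not 1 <= top <= 10:
        raise ValueError("top must be between 1 and 10")
    skip = 0
    while True:
        page = fetch(top, skip)
        yield from page
        if len(page) < top:
            break
        skip += top
```

Passing the fetch function in as a parameter keeps the loop easy to test without network access and makes it simple to add rate limiting or retries around the request.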

Troubleshooting

Issue #1

Description:

java.lang.NumberFormatException: For input string: "0x100"

Cause:
SBT uses jline for terminal output. The latter in turn uses the infocmp utility provided by ncurses, and expects it to report only decimal values. This behaviour was fixed in newer versions of jline and SBT; however, version 0.13.15, which is used for this project, still suffers from it.

Solution:
Add the following to your ~/.bashrc:

export TERM=xterm-color

Access to underlying Neo4j database

The underlying Neo4j for stitcher is publicly accessible here.
Please specify stitcher.ncats.io:80 in the Host field.
No credentials are needed.

Data Preparation

Scripts for Recent Approval Data from FDA

cd scripts
python approvalYears.py   # requires python 3+

In the /data folder, there should now be a file named according to the following convention: approvalYears-YYYY-MM-DD.txt.
If acceptable, update the filename reference in /data/conf/ob.conf to point to this new file.
