Ensembl Tark Data Provider #86

Closed
davmlaw opened this issue Sep 20, 2024 · 7 comments


davmlaw commented Sep 20, 2024

Andy Yates suggested https://tark.ensembl.org/

This has Ensembl transcripts in a format we can use.

However, it doesn't have alignments (CIGAR etc.) for RefSeq, so it can't handle gaps. I have raised an issue on the project: Ensembl/tark#81

So I think we should just do Ensembl to start with.


Example:

http://tark.ensembl.org/api/transcript/?stable_id=ENST00000256078&stable_id_version=4&expand_all=true

We can get sequence out via:

data["results"][0]["sequence"]["sequence"]

Can also get the protein accession out, which covers get_pro_ac_for_tx_ac:

t = data["results"][0]["translations"][0]
In [17]: f'{t["stable_id"]}.{t["stable_id_version"]}'
Out[17]: 'ENSP00000256078.4'
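
To make the two lookups above concrete, here is a minimal sketch that builds the example URL and pulls the sequence and protein accession out of an already-parsed JSON response. The helper names (`transcript_url`, `get_sequence`, `get_pro_ac`) are hypothetical and not taken from cdot, and the sample response is a trimmed stand-in with a placeholder sequence, not real Tark output.

```python
from typing import Optional
from urllib.parse import urlencode

TARK_TRANSCRIPT_URL = "http://tark.ensembl.org/api/transcript/"


def transcript_url(tx_ac: str) -> str:
    """Build the lookup URL for a versioned accession, e.g. 'ENST00000256078.4'."""
    stable_id, version = tx_ac.split(".")
    params = {
        "stable_id": stable_id,
        "stable_id_version": version,
        "expand_all": "true",
    }
    return f"{TARK_TRANSCRIPT_URL}?{urlencode(params)}"


def get_sequence(data: dict) -> str:
    """Transcript sequence of the first result."""
    return data["results"][0]["sequence"]["sequence"]


def get_pro_ac(data: dict) -> Optional[str]:
    """Versioned protein accession of the first translation, or None if there is none."""
    translations = data["results"][0].get("translations") or []
    if not translations:
        return None
    t = translations[0]
    return f'{t["stable_id"]}.{t["stable_id_version"]}'


# Stand-in for the parsed JSON response (placeholder sequence, assumed shape)
sample = {
    "results": [
        {
            "sequence": {"sequence": "ACGT"},
            "translations": [
                {"stable_id": "ENSP00000256078", "stable_id_version": 4}
            ],
        }
    ]
}

print(transcript_url("ENST00000256078.4"))
print(get_sequence(sample))
print(get_pro_ac(sample))
```

Returning None when there are no translations is a choice made for this sketch (non-coding transcripts); the real provider may handle that case differently.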

Can implement `get_tx_for_gene`:

http://tark.ensembl.org/api/transcript/search/?identifier_field=KRAS&expand=transcript_release_set%2Cgenes

Can even implement get_tx_for_region via eg:

http://tark.ensembl.org/api/transcript/?loc_start=25362365&loc_end=25403737&loc_region=12&expand_all=false
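
The two query URLs above can be generated rather than hand-written. This is only a sketch: `gene_search_url` and `region_url` are hypothetical helper names, and the parameter names are simply copied from the example URLs in this comment.

```python
from urllib.parse import urlencode

TARK_API = "http://tark.ensembl.org/api"


def gene_search_url(gene_symbol: str) -> str:
    """Search URL for all transcripts of a gene symbol (for get_tx_for_gene)."""
    params = {
        "identifier_field": gene_symbol,
        # urlencode escapes the comma as %2C, matching the example URL
        "expand": "transcript_release_set,genes",
    }
    return f"{TARK_API}/transcript/search/?{urlencode(params)}"


def region_url(contig: str, start: int, end: int) -> str:
    """Lookup URL for transcripts in a genomic region (for get_tx_for_region)."""
    params = {
        "loc_start": start,
        "loc_end": end,
        "loc_region": contig,
        "expand_all": "false",
    }
    return f"{TARK_API}/transcript/?{urlencode(params)}"


print(gene_search_url("KRAS"))
print(region_url("12", 25362365, 25403737))
```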

@davmlaw davmlaw changed the title Ensembl Tark client Ensembl Tark Data Provider Sep 20, 2024
davmlaw added a commit that referenced this issue Sep 24, 2024

davmlaw commented Sep 24, 2024

working in branch ensembl_tark

davmlaw added a commit that referenced this issue Sep 25, 2024
davmlaw added a commit that referenced this issue Sep 26, 2024
davmlaw added a commit that referenced this issue Oct 2, 2024
davmlaw added a commit that referenced this issue Oct 2, 2024

davmlaw commented Oct 2, 2024

OK, merged into main. Need to try it against a test set for a while.

Also need to disable RefSeq due to the gap problem.

Also need a test for _get_most_recent_release_date.

davmlaw added a commit that referenced this issue Oct 3, 2024

davmlaw commented Oct 9, 2024

Ran TARK using benchmark, 50 out of 50 correct

   count      mean       std       min       25%       50%       75%       max
0   50.0  2.090134  0.460153  1.597495  1.828673  2.057176  2.110917  4.637631
Correct: 50, incorrect: 0, no data: 0, errors: 0

performance is 0.478 / second vs 3/second with cdot REST, so ~6 times slower
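
For reference, the throughput figure follows from the mean column of the table, assuming that column is seconds per benchmark item (which is consistent with the 0.478/second quoted above):

```python
mean_seconds = 2.090134  # "mean" from the benchmark table above
cdot_rest_rate = 3.0     # items/second quoted for cdot REST

tark_rate = 1 / mean_seconds           # items/second for Tark
slowdown = cdot_rest_rate / tark_rate  # how many times slower Tark is

print(f"{tark_rate:.3f}/second, {slowdown:.1f}x slower")
# prints: 0.478/second, 6.3x slower
```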

Still need the RefSeq check

davmlaw added a commit that referenced this issue Oct 9, 2024
davmlaw added a commit that referenced this issue Oct 9, 2024

davmlaw commented Oct 9, 2024

Done initial implementation

@davmlaw davmlaw closed this as completed Oct 9, 2024

davmlaw commented Oct 9, 2024

@holtgrewe - you use Ensembl right? Maybe you'll be interested in this

holtgrewe commented

Thanks

CC @tedil


davmlaw commented Oct 9, 2024

Can be used with no arguments for Ensembl (it uses the hgvs SeqFetcher for genome sequences), but for RefSeq, or if you want local genome fetching, you need to initialise it with the special seq fetcher, which is itself initialised with FASTA files:

from cdot.hgvs.dataproviders.ensembl_tark_data_provider import EnsemblTarkDataProvider, EnsemblTarkSeqFetcher

seqfetcher = EnsemblTarkSeqFetcher(fasta_files=["/data/grch38.fa.gz"])
hdp = EnsemblTarkDataProvider(seqfetcher=seqfetcher)
