Sample data #33

Open
spficklin opened this issue May 31, 2018 · 4 comments

Comments

@spficklin

To write unit tests for the Newick file importer that comes with Tripal, do you have a file that could be shared? We would need the tree file in Newick format, a FASTA file containing all of the gene/protein sequences, and the organism to which those FASTA sequences belong.

Thanks much!

@adf-ncgr
Contributor

adf-ncgr commented Jun 1, 2018

Hi @spficklin -
all the data for our gene family trees is available here:
https://legumeinfo.org/data/public/Gene_families/legume.genefam.fam1.M65K/

If you grab the tarball of trees:
legume.genefam.fam1.M65K.trees_ML_rooted.tar.gz
and the corresponding tarball of per-family fastas:
legume.genefam.fam1.M65K.family_fasta.tar.gz

I think that will give you what you wanted. Note that these sequences are the unaligned versions, but their IDs should correspond to the leaf node labels in the trees. If they don't, let me know; it's possible the tarball hasn't been updated to reflect some fixes in that regard.
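One way to check that correspondence before loading anything is to compare the FASTA IDs against the leaf labels of the matching tree. The following is only a minimal sketch: the function names are made up for illustration, the sequence IDs in the example are hypothetical, and the regex assumes unquoted Newick labels (quoted labels or labels containing parentheses would need a real parser).

```python
import re


def fasta_ids(fasta_text):
    """Return the set of sequence IDs (first token of each '>' header line)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def newick_leaf_labels(newick_text):
    """Return the set of leaf labels in a Newick string.

    A leaf label is taken to be an unquoted name directly after '(' or ','.
    Internal node labels and support values follow ')' and are not matched.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick_text))


def compare_ids(fasta_text, newick_text):
    """Return (IDs only in the FASTA, labels only in the tree), both sorted."""
    seqs = fasta_ids(fasta_text)
    leaves = newick_leaf_labels(newick_text)
    return sorted(seqs - leaves), sorted(leaves - seqs)


# Hypothetical example data, not taken from the real tarballs.
EXAMPLE_TREE = "((glyma.G1:0.1,medtr.M1:0.2)0.95:0.05,phavu.P1:0.3);"
EXAMPLE_FASTA = ">glyma.G1 some description\nMKV\n>medtr.M1\nMKL\n>vigun.V1\nMRT\n"

only_in_fasta, only_in_tree = compare_ids(EXAMPLE_FASTA, EXAMPLE_TREE)
print("in FASTA but not in tree:", only_in_fasta)
print("in tree but not in FASTA:", only_in_tree)
```

If both lists come back empty for a family, the FASTA and tree for that family agree on their IDs.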

Regarding the organisms, I'm not sure exactly what you'll need, but we use "gensp." prefixes to denote the species of origin (i.e. "glyma" => Glycine max, "medtr" => Medicago truncatula, etc.). I can give you a more detailed list if I know how you plan to handle this; in our case, the loader expects that the annotations have already been loaded and just does a lookup for them.

@spficklin
Author

spficklin commented Jun 18, 2018

Thanks @adf-ncgr. I've gotten back to this. Do you have a lookup table that maps your organism "gensp" prefix to the taxonomic name? I want to import one of the FASTA files from the tarball you mentioned above, but I need to know the species each sequence belongs to.

@adf-ncgr
Contributor

Hi @spficklin, there may be a few quirks in the following extraction from our organism table, particularly with some of the non-legume species, but hopefully it will be close enough to give you the relevant info (e.g. you'll probably see easily that Arabidopsis thaliana would be "arath" in "gensp" representation instead of "A. thaliana"). Let me know if there's anything in the FASTA you grabbed that you can't glean from this, or if you have other questions. Thanks for moving it along...

 abbreviation           | genus        | species
------------------------+--------------+--------------------------
 glyma                  | Glycine      | max
 lupal                  | Lupinus      | albus
 O. sativa              | Oryza        | sativa
 A. thaliana            | Arabidopsis  | thaliana
 phaco                  | Phaseolus    | coccineus
 vicfa                  | Vicia        | faba
 P. persica             | Prunus       | persica
 S. lycopersicum        | Solanum      | lycopersicum
 V. vinifera            | Vitis        | vinifera
 Z. mays                | Zea          | mays
 A. trichopoda          | Amborella    | trichopoda
 araip                  | Arachis      | ipaensis
 consensus              | consensus    | consensus
 lencu                  | Lens         | culinaris
 cajca                  | Cajanus      | cajan
 cicar.ICC4958          | Cicer        | arietinum_ICC4958
 trire                  | Trifolium    | repens
 cicar.CDCFrontier      | Cicer        | arietinum_CDCFrontier
 medtr                  | Medicago     | truncatula
 vigra                  | Vigna        | radiata
 lotja                  | Lotus        | japonicus
 lupan                  | Lupinus      | angustifolius
 tripr                  | Trifolium    | pratense
 medsa                  | Medicago     | sativa
 vigun                  | Vigna        | unguiculata
 apiam                  | Apios        | americana
 cucsa                  | Cucumis      | sativus
 chafa                  | Chamaecrista | fasciculata
 prupe.Lovell.gnm2.ann1 | Prunus       | persica.Lovell.gnm2.ann1
 vigan                  | Vigna        | angularis
 pea                    | Pisum        | sativum
 arahy                  | Arachis      | hypogaea
 aradu                  | Arachis      | duranensis
 phavu                  | Phaseolus    | vulgaris
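
For anyone wanting to use this table programmatically, here is a rough sketch of turning the pipe-delimited rows into a lookup and resolving a sequence ID to its organism. The function names and the sequence IDs are hypothetical, and it assumes the abbreviation appears at the start of the ID followed by a dot. Only a few rows from the table are included to keep it short.

```python
# A few rows copied from the table above; the real table has more.
SAMPLE_TABLE = """\
abbreviation | genus | species
------------------------+--------------+--------------------------
glyma | Glycine | max
A. thaliana | Arabidopsis | thaliana
cicar.ICC4958 | Cicer | arietinum_ICC4958
medtr | Medicago | truncatula
"""


def parse_organism_table(table_text):
    """Map each abbreviation to a (genus, species) tuple."""
    lookup = {}
    for line in table_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip the header, the dashed separator, and blank lines.
        if len(parts) != 3 or parts[0] == "abbreviation":
            continue
        abbreviation, genus, species = parts
        lookup[abbreviation] = (genus, species)
    return lookup


def organism_for(seq_id, lookup):
    """Return (genus, species) for a sequence ID, or None if no prefix matches.

    Longer abbreviations are tried first so that a dotted abbreviation such as
    "cicar.ICC4958" is preferred over any shorter one that also matches.
    """
    for abbreviation in sorted(lookup, key=len, reverse=True):
        if seq_id == abbreviation or seq_id.startswith(abbreviation + "."):
            return lookup[abbreviation]
    return None


lookup = parse_organism_table(SAMPLE_TABLE)
# Hypothetical sequence IDs, for illustration only.
for seq_id in ["glyma.Glyma.01G000100.1", "cicar.ICC4958.Ca_00001", "arath.AT1G01010.1"]:
    print(seq_id, "->", organism_for(seq_id, lookup))
```

The last ID resolves to nothing because the table lists Arabidopsis as "A. thaliana" rather than "arath", so the non-legume rows would need their abbreviations normalized by hand before a lookup like this is useful for them.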

@spficklin
Author

This is great. Thanks. I'll let you know how it goes.
