A trial analysis API client stub #681

Closed
wants to merge 17 commits into from

@Nuanda Nuanda commented Nov 29, 2016

When finished, this fixes #680.

@teatree1212 per your request I started with a client stub. Currently it finds the Plant Trial (PT) by name, gets all Trait Descriptors (TDs), Plant Scoring Units (PSUs) and Trait Scores (TSes) related to that PT, and outputs a CSV document which closely resembles the one available at https://bip.earlham.ac.uk/trial_scorings/5.

Please take it from here and add further data. When in doubt about how to query an individual table, see the 'Q' marker in the docs, or the definition of permitted_params inside the individual model classes in the sources, to see which fields are available for filtering.

@teatree1212
Contributor

Thanks! In this case, it would also be good to have the option of a .json file as output. I will feed some of the objects into a workflow I am developing.

There is the Active Model Serializers gem which, to my understanding, reduces the .json output. I think this would be good, as some of the objects are not necessary. Have you used it before?

@teatree1212
Contributor

  1. When running the script, I get an error:
    trial_analysis.rb:114:in `block (3 levels) in <main>': undefined method `[]' for nil:NilClass (NoMethodError)
    from trial_analysis.rb:114:in `map'
    from trial_analysis.rb:114:in `block (2 levels) in <main>'
    from trial_analysis.rb:112:in `each'
    from trial_analysis.rb:112:in `block in <main>'
    from /Users/hildegaa/.rvm/rubies/ruby-2.2.0/lib/ruby/2.2.0/csv.rb:1157:in `generate'
    from trial_analysis.rb:110:in `<main>'
  2. To have some output to look at, I went through the script and displayed the response using
    puts JSON.pretty_generate(response) or JSON.pretty_generate(trait_scores) after the API queries, to see what had been queried, commenting out everything below with =begin ... =end.
    I am starting to understand what things are doing.

I won't have time to do anything else until Monday. But be prepared for questions then (:

@teatree1212 teatree1212 assigned Nuanda and unassigned teatree1212 Nov 29, 2016
@Nuanda
Author

Nuanda commented Nov 30, 2016

Are you sure you haven't changed that file? Mine has 111 lines and yours, judging from the error output, has at least 114. Anyway, please try with the trial name 'whri_2005_GE2_02' on the BIP public server - it should work (I just ran it and it worked fine).

@Nuanda Nuanda assigned teatree1212 and unassigned Nuanda Nov 30, 2016
@Nuanda
Author

Nuanda commented Nov 30, 2016

Re. JSON output - first you'd need to decide how to structure the output (by PSUs? by TDs?). Active Model is not available in the vanilla client script unless you load it, and even then it can't be used, because what you get back from the API are plain hashes, not Active Model objects. I think using the JSON generator methods, as you already do, is the way to go.

@teatree1212
Contributor

You are right, I must have changed the script before running it.
To be super sure, I went back to master and checked out the remote branch again, to have your version. I get the same error:

N80569:client_example hildegaa$ ruby trial_analysis.rb U.Nottm_2016_RIPRleafminerals_REMLmeans api-key

  1. Finding the Plant Trial
  • Found, plant_trial_id = 47
  2. Loading all Trait Scores for this Plant Trial.
  • Progress: ......................................................
  • 10780 Trait Scores loaded
  3. Finding Trait Descriptors
  • The Trait Descriptors scored in this Plant Trial: ["Leaf Silver concentration", "Leaf Aluminium concentration", "Leaf Arsenic concentration", "Leaf Boron concentration", "Leaf Calcium concentration", "Leaf Cadmium concentration", "Leaf Chromium concentration", "Leaf Caesium concentration", "Leaf Copper concentration", "Leaf Iron concentration", "Leaf Potassium concentration", "Leaf Magnesium concentration", "Leaf Manganese concentration", "Leaf Molybdenum concentration", "Leaf Sodium concentration", "Leaf Nickel concentration", "Leaf Phosphorus concentration", "Leaf Lead concentration", "Leaf Rubidium concentration", "Leaf Sulphur concentration", "Leaf Selenium concentration", "Leaf Strontium concentration", "Leaf Titanium concentration", "Leaf Uranium concentration", "Leaf Vanadium concentration", "Leaf Zinc concentration", "mineral and ion content related trait", "cobalt concentration"]
  4. Iterating through Plant Scoring Units
  5. Generating output CSV to STDOUT
    trial_analysis.rb:105:in `block (3 levels) in <main>': undefined method `[]' for nil:NilClass (NoMethodError)
    from trial_analysis.rb:105:in `map'
    from trial_analysis.rb:105:in `block (2 levels) in <main>'
    from trial_analysis.rb:104:in `each'
    from trial_analysis.rb:104:in `block in <main>'
    from /Users/hildegaa/.rvm/rubies/ruby-2.2.0/lib/ruby/2.2.0/csv.rb:1157:in `generate'
    from trial_analysis.rb:102:in `<main>'

but I don't get an error when using your suggested trial.

I will now use the rest of my day to add to the script and test it on your trial name.

@teatree1212
Contributor

I am failing to add more columns to the current CSV output on STDOUT. Could you have a look and give me a hint? I was trying to add the plant_accession name as a column between the Plant Scoring Unit and the Trait Scores.

Re. JSON output: I thought I would start with plant_trials, then PSU, and within each PSU have PA and PL as well as TS and TD (and associated records). Does that make sense, do you think?

@teatree1212 teatree1212 assigned Nuanda and unassigned teatree1212 Dec 5, 2016
@Nuanda Nuanda assigned teatree1212 and unassigned Nuanda Dec 6, 2016
@Nuanda
Author

Nuanda commented Dec 6, 2016

Done - see the changes. One more thing about the PAs is that you probably don't get them all, as the API returns only the first 50 hits by default (you can raise that to 200 per request, but not beyond - an anti-attack measure). So you need to implement a loop - see the one for trait scores and try to do something similar for PAs.
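A minimal sketch of such a paging loop, in plain Ruby. The `fetch_page` helper and its `page` / `per_page` arguments are hypothetical stand-ins for a single API request; the actual BIP parameter names are not given in this thread and should be checked against the API docs.

```ruby
# Hypothetical stand-in for one API request: returns the records found on the
# given page. A real script would issue an HTTP GET and parse the JSON body.
FAKE_RECORDS = (1..450).map { |i| { 'id' => i } }

def fetch_page(page, per_page)
  FAKE_RECORDS.each_slice(per_page).to_a[page - 1] || []
end

# Keep requesting pages until one comes back shorter than the page size,
# which means there is nothing left to fetch.
def fetch_all(per_page = 200)
  records = []
  page = 1
  loop do
    batch = fetch_page(page, per_page)
    records.concat(batch)
    break if batch.size < per_page
    page += 1
  end
  records
end

plant_accessions = fetch_all
puts "#{plant_accessions.size} records loaded"
```

Stopping on a short page costs one extra (empty) request when the total is an exact multiple of the page size, but it avoids needing a total count from the server.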

@teatree1212
Contributor

Thanks, will have a look.
Would it be possible to use the activesupport gem? I keep reading that it is the best way to select or extract key-value pairs from a hash.

@Nuanda
Author

Nuanda commented Dec 6, 2016

I guess you can do that, but remember that it introduces an external dependency - all script users will need to install at least rubygems and the activesupport gem, along with its own dependencies. For their sake, I'd advise staying with vanilla Ruby.
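For illustration, picking a few key-value pairs out of a hash needs nothing beyond core Ruby. The sketch below uses a made-up trait score record; the `value_type` value and the other field values are assumptions, not data from BIP.

```ruby
# A made-up record shaped like one parsed trait score.
trait_score = {
  'id' => 1,
  'score_value' => '4.629',
  'value_type' => 'mean',
  'plant_scoring_unit_id' => 7
}
wanted = %w[score_value value_type]

# Hash#select keeps only the pairs for which the block returns true,
# and returns a new Hash.
subset = trait_score.select { |key, _value| wanted.include?(key) }

# Hash#values_at returns just the values, in the order the keys are given.
score, type = trait_score.values_at('score_value', 'value_type')

puts subset.inspect
puts "#{score} (#{type})"
```

Both methods are available in the Ruby 2.2 interpreter shown in the stack traces above, so no extra gem is needed.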

@teatree1212
Contributor

Okay... I can't manage to select multiple keys (e.g. score_value and value_type).
I am also looping over the entire hash in a not very pretty manner. I think Ruby can do better, but I can't.

Please have a look for me, I am at the banging-my-head stage (:

See the most recent commit.

@teatree1212
Contributor

With regard to your commit "showing PAs in the CSV output":
how do you make sure you link the correct plant_accession name to the correct scoring unit name? And how do you call the plant_accession?
I don't think I really understand what the "data" object is.

@Nuanda
Author

Nuanda commented Dec 6, 2016

  1. PA - PSU link. When I get PSUs from BIP (step 4), they include their FKey value for related PAs (the FKey column name is plant_accession_id). I make sure to save them in the outputs hash. Then, in step 7, I find (detect in Ruby nomenclature) the correct PA using this FKey value - see current line 132. Having the correct PA, I extract its plant_accession column value in line 134.

  2. outputs is a Hash (or an associative array, see https://en.wikipedia.org/wiki/Associative_array) with keys and values. I use PSU.scoring_unit_name as keys and Hashes (yes, it's a Hash of Hashes, or a nested Hash - quite common in Ruby) as values. Each of these values may in turn have multiple keys, to record data about a given PSU. At the moment we have plant_accession_id (why - see above) and trait_scores, containing all trait score objects for a given PSU.
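The two points above can be sketched as follows. This is an illustration only, not the code from trial_analysis.rb: the records are made up, and only the field names (scoring_unit_name, plant_accession_id, plant_accession, trait_scores) are taken from the description.

```ruby
# Made-up records shaped like parsed API responses.
plant_scoring_units = [
  { 'scoring_unit_name' => 'psu_1', 'plant_accession_id' => 10 },
  { 'scoring_unit_name' => 'psu_2', 'plant_accession_id' => nil }
]
plant_accessions = [
  { 'id' => 10, 'plant_accession' => 'accession_A' }
]

# A Hash of Hashes: scoring unit name => data recorded about that PSU,
# including the foreign key that points at its plant accession.
outputs = {}
plant_scoring_units.each do |psu|
  outputs[psu['scoring_unit_name']] = {
    'plant_accession_id' => psu['plant_accession_id'],
    'trait_scores' => []
  }
end

# Later, resolve the foreign key: detect returns the first accession whose
# id matches, or nil when the PSU has no accession assigned.
names = outputs.map do |scoring_unit_name, data|
  pa = plant_accessions.detect { |a| a['id'] == data['plant_accession_id'] }
  [scoring_unit_name, pa ? pa['plant_accession'] : '']
end

names.each { |row| puts row.join(',') }
```

Because the foreign key is stored per scoring unit name, each accession name ends up attached to the right PSU regardless of the order in which the two tables were fetched.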

@teatree1212
Contributor

teatree1212 commented Dec 12, 2016

Could you have a look at '5. -Finding Plant Accessions... '?
I tried it in the manner of your trait_scores loop, but I only get an empty JSON output when printing it to STDOUT. When running the commented-out section, I get 50 objects, as you said, but no more. Nothing comes up in the .csv.
Could you please have a look, @Nuanda?

@Nuanda
Author

Nuanda commented Dec 13, 2016

@teatree1212 for demonstration I tried a different technique this time. Notable changes:

  • removed the unnecessary second PSU loop and used the first one to record all PA_ids
  • uniq! removes all duplicates from an array in place
  • each_slice is a nice utility that chops an array into pieces - I use chunks of 200, as this is the maximum for a single BIP API GET request
  • you can use a similar technique to retrieve all PLs in step 6.
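A small sketch of the uniq! plus each_slice technique described above. The `fetch_by_ids` helper is a hypothetical stand-in for one filtered API request, and the chunk size is 3 here (rather than 200) only so that the tiny sample produces more than one chunk.

```ruby
# Ids collected while looping over PSUs; duplicates are expected because
# several scoring units can share one accession.
plant_accession_ids = [3, 1, 3, 2, 1, 5, 4]

# uniq! de-duplicates in place. Note that it returns nil when there was
# nothing to remove, so it should not be chained.
plant_accession_ids.uniq!

# Hypothetical stand-in for one API request filtered by a list of ids.
def fetch_by_ids(ids)
  ids.map { |id| { 'id' => id } }
end

plant_accessions = []
# each_slice yields consecutive chunks of at most the given size.
plant_accession_ids.each_slice(3) do |ids|
  plant_accessions.concat(fetch_by_ids(ids))
end

puts "#{plant_accessions.size} plant accessions loaded"
```

Requesting by explicit id chunks sidesteps the default page limit, since every request asks for no more records than the server allows.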

@teatree1212
Contributor

A few notes on the next commit:
100 plant_lines are now being retrieved, but without using each_slice or uniq!.

- data['plant_line_id'] doesn't exist or is empty or something in between (:
  which is why it is not being displayed in the .csv

- data is part of the outputs hash, and as no connection is established between PL and PSU, there is no plant_line information in the outputs hash at all.

- I don't yet understand how to add the PLs to the outputs hash, as I find that table is "too far away" (it is only reachable via plant_accessions).

To do:
apply Nuanda's technique to PLs in step 6.

To be able to generate a meaningful JSON output:

  1. add accession_name to outputs
  2. add plant_line and sequence_id to outputs.

@teatree1212
Contributor

teatree1212 commented Dec 19, 2016

Notes:

outputs.each do |scoring_unit_name, data|

scoring_unit_name = key
data = value
The value is itself a hash, containing keys and values.

Need to add plant_lines hashes to PSU.scoring_unit_name keys

http://www.slideshare.net/harkamalsingh355/ruby-data-types-and-objects
slide 9

@teatree1212
Contributor

teatree1212 commented Dec 19, 2016

@Nuanda

Problem 1:
I have managed to link the entire PA object to the PSU.scoring_unit_name key. I intended to use this in the same fashion as you did for the trait_descriptors, mapping the trait_descriptors objects to the values of data['trait_scores'] where the trait_descriptor_id matches.
I get the following error: undefined method `[]' for nil:NilClass (NoMethodError)
(my least favourite error message in Ruby)
Can you have a look and tell me what is wrong, please?

Problem 2:
Ideally, I want to create a more sophisticated outputs hash, so that I can get a nice JSON output with what I need for the workflow I am building.
-- having it look like this:
[
  {
    "plant_trial_name": "whri_2005_GE2_02",
    {
      [
        "su.WHRI2006_A215.P4_01": {
          "plant_accession_id": 1459,
          "trait_scores": [
            {
              "score_value": "4.629",
              "trait_descriptor": "leaf Silver content"
            },
            {
              "score_value": "0.6052",
              "trait_descriptor": "leaf Iron content"
            }
          ],
          "plant_accession_name": "whri2005_A215",
          "plant_line_name": "a_name",
          "sequence_identifier": "SRRXXXX"
        },
        "su.WHRI2006_A215.P4_01": {
          .....
        }
      ]
    }
  }
]

I was trying to build something in line 162/3, where I want to put the content of plant_accessions['plant_accession'] into outputs[plant_scoring_unit['scoring_unit_name']]['plant_accession_name'].
Can you explain to me why it is not working with the detect function?
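One possible way to get valid JSON close to the sketch above is to wrap the existing outputs hash in one more Hash and let the json library from Ruby's standard distribution serialise it. The top-level key `plant_scoring_units` is an assumption introduced here for illustration (JSON objects need a key for every value), and the sample data is abbreviated from the sketch.

```ruby
require 'json'

plant_trial_name = 'whri_2005_GE2_02'

# The nested Hash built by the script, abbreviated to a single PSU.
outputs = {
  'su.WHRI2006_A215.P4_01' => {
    'plant_accession_id' => 1459,
    'plant_accession_name' => 'whri2005_A215',
    'trait_scores' => [
      { 'score_value' => '4.629', 'trait_descriptor' => 'leaf Silver content' },
      { 'score_value' => '0.6052', 'trait_descriptor' => 'leaf Iron content' }
    ]
  }
}

# Wrap everything in one document. 'plant_scoring_units' is a hypothetical
# key name chosen for this sketch.
document = {
  'plant_trial_name' => plant_trial_name,
  'plant_scoring_units' => outputs
}

json = JSON.pretty_generate(document)
puts json
```

Keeping scoring unit names as object keys means each name can appear only once; if duplicates are possible, an array of objects with a `scoring_unit_name` field would be the safer shape.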

@Nuanda
Author

Nuanda commented Dec 19, 2016

Dear @teatree1212

  1. I don't think you need the loop in step "7." since the PA you are looking for is set in line 181 (in the plant_accession variable, in the context of each PSU you loop through)
  2. Also please note an important difference between the detect and select array methods - the former returns the first matching element (or nil, if not found), the latter returns all matching elements (or an empty array, if nothing was found) - even if there is only 1 matching element, select still returns an array, not just the element
  3. Regarding "Problem 1" - I'm not sure I follow. What I do for TDs in line 190 is there because we need to traverse all TDs (for this given PT) per each PSU, since each PSU has (probably) scores for each TD. Remember lines 181-191 are executed in context of a single PSU and they, in fact, output a single CSV row - representing this PSU. So, this is probably not what you want to do for PAs.
  4. If you need to output more columns about the PA (organisation, year produced, or similar) you can use the technique from line 185 (the ternary ?: operator is there to make sure the script does not trip on PSUs with no PA assigned).
  5. If you need to output PL columns, you can add another detect right after the mentioned line 181 in this vein: plant_line = plant_accession ? plant_lines.detect{ |pl| pl['id'] == plant_accession['plant_line_id'] } : nil. Then, things like [plant_line ? plant_line['sequence_id']: ''] should start to work.
  6. If this is not what you are after, please tell me exactly what you want to do.
  7. "Problem 2" - one problem at a time ;).
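Points 2, 4 and 5 can be illustrated with a short, self-contained sketch. The records are made up and the `sequence_id_for` helper is hypothetical; only the field names and the shape of the detect / ternary expressions follow the comment above.

```ruby
# Made-up records; the second accession has no plant line assigned.
plant_accessions = [
  { 'id' => 1, 'plant_accession' => 'acc_1', 'plant_line_id' => 100 },
  { 'id' => 2, 'plant_accession' => 'acc_2', 'plant_line_id' => nil }
]
plant_lines = [{ 'id' => 100, 'sequence_id' => 'seq_100' }]

# detect returns a single element or nil; select always returns an Array.
first_match = plant_accessions.detect { |pa| pa['id'] == 1 }
all_matches = plant_accessions.select { |pa| pa['id'] == 1 }
no_match    = plant_accessions.detect { |pa| pa['id'] == 99 }

# Calling no_match['plant_accession'] here would raise
# "undefined method `[]' for nil:NilClass", hence the ternary guards below.
def sequence_id_for(plant_accession_id, plant_accessions, plant_lines)
  plant_accession = plant_accessions.detect { |pa| pa['id'] == plant_accession_id }
  plant_line = plant_accession ? plant_lines.detect { |pl| pl['id'] == plant_accession['plant_line_id'] } : nil
  plant_line ? plant_line['sequence_id'] : ''
end

puts first_match.inspect
puts all_matches.inspect
puts no_match.inspect
puts sequence_id_for(1, plant_accessions, plant_lines)
```

Returning an empty string for missing links keeps every CSV row the same width, whether or not a PSU has an accession or a plant line.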

@Nuanda
Author

Nuanda commented Feb 15, 2017

@teatree1212 Annemarie - since @kammerer created #702 I am closing this pull request. I will, however, leave the git branch in case you'd like to use it further.

@Nuanda Nuanda closed this Feb 15, 2017
@kammerer kammerer deleted the 680_trial_analysis_script branch January 18, 2018 19:53

Successfully merging this pull request may close these issues.

Associating SRA accession numbers with trait scoring within a given trial
3 participants