Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CropStore schema update #687

Open
wjurkowski opened this issue Dec 19, 2016 · 18 comments
Open

CropStore schema update #687

wjurkowski opened this issue Dec 19, 2016 · 18 comments

Comments

@wjurkowski
Copy link
Collaborator

Incorporate new version of the CropStore schema v9.01

@nowakowski
Copy link
Contributor

Hello. Need to consult this with @Nuanda since modifying the data model affects all layers of the software stack.

I have reviewed the changes in 9.0.1 and it seems that they include a number of geometry columns. Now, geometry is not a standard SQL datatype so we need more details to implement this properly (all I've been able to find regarding 9.0.1 is a set of DDL commands - some usage examples would be very useful). Regardless, I see two possible options:

  • PostgreSQL directly provides some rudimentary geometric types such as POINT, LINE, BOX etc. These can be used to express planar coordinates but the plane itself remains fictitious (i.e. these types are not well suited for storing geographical data - and they're also not part of the SQL standard).
  • For a more advanced use case, such as seems to be the case here (note that the word geolocation appearing in the names of new columns), I can deploy a dedicated GIS extension - for example, PostGIS. This is the mechanism of choice as it provides powerful data manipulation capabilities when used in conjunction with an application-level geospatial library (such as RGeo for Ruby); however it enforces a certain rigor when it comes to expressing data items so I'm not sure whether it would be applicable here (again - we would need some examples of the actual data which is to be stored in the new columns).

Let us know what your feelings are.

@wjurkowski
Copy link
Collaborator Author

Thank's for checking. First of all, we don't need to literally follow these changes. Let's see how much value 9.01 really adds, I'll try to find some exemplary data to illustrate and then perhaps it would make sense to discuss.

@nowakowski
Copy link
Contributor

Okay, I'll await usage examples. On a side note -- I wasn't aware that the CropstoreDB schema was still evolving. This might cause problems in the future if the changes conflict with the design principles upon which BIP is based. Could you point us to the person(s) responsible for introducing those changes?

@wjurkowski
Copy link
Collaborator Author

I added fields descriptions with examples. The new columns should be highlighted in yellow (please check if anything omitted)

@Nuanda
Copy link

Nuanda commented Apr 19, 2017

Related PR: #688

@Nuanda Nuanda removed the on hold label Apr 19, 2017
@Nuanda
Copy link

Nuanda commented Apr 19, 2017

@nowakowski Piotr, Wiktor asked us to go ahead with that. As your first task please take a look at the CropStore data model version we used as the entry point for BIP, and compare it with the CropStore data model version 9.01 (provided by @wjurkowski in the PR mentioned above), and check if the geolocation and srid columns are the only ones which were added. If you spot more columns/tables added between these two CS data model versions, please tell us (in this GitHub thread).

@Nuanda
Copy link

Nuanda commented Apr 19, 2017

@kammerer Tomasz, I know that you recently were trying to integrate some R/Shiny tools with BIP. @wjurkowski suggested that these tools may require some data that is currently not available in BIP, due to the different data model. Are you able to list the additional tables/columns, which would need to be added to BIP, in order to make it fully compliant with those R/Shiny tools?

@wjurkowski
Copy link
Collaborator Author

@kammerer from what you sent me so far we have these columns missing in our data model:
plant_lines.species
plant_lines.genus
plant_lines.taxa
plant_varieties.taxa_authority
plant_varieties.subtaxa_authority
scoring_occasions.score_start_date
scoring_occasions.score_end_date
trait_score.replicate_score_reading
plant_population_lists.data_status

@kammerer
Copy link
Member

@Nuanda These tools are quite flexible because they operate on a huge join table and just show what is available. This join table however originally includes some columns which are not existent in BIP database (mainly the ones Wiktor mentioned above).

@nowakowski
Copy link
Contributor

I've run an automated comparison of both database schemas (our original cs_full_web and @wjurkowski 's schema dump). These results should be self-explanatory but if you need a more concise description of differences between both DBs, let me know.

# WARNING: Objects in server1.cs_full_web but not in server1.cs_full_web_9_01:
#        TABLE: plant_variety_detail
#
# WARNING: Objects in server1.cs_full_web_9_01 but not in server1.cs_full_web:
#        TABLE: plant_variety_details
#
#                                                   Defn    Row     Data   
# Type      Object Name                             Diff    Count   Check  
# ------------------------------------------------------------------------- 
# TABLE     clone_libraries                         pass    SKIP    SKIP    
# TABLE     clones                                  pass    SKIP    SKIP    
# TABLE     countries                               pass    SKIP    SKIP    
# TABLE     cs_additional_information               pass    SKIP    SKIP    
# TABLE     cs_images                               pass    SKIP    SKIP    
# TABLE     cs_images_lookup                        pass    SKIP    SKIP    
# TABLE     cs_institutions                         pass    SKIP    SKIP    
# TABLE     cs_people                               pass    SKIP    SKIP    
# TABLE     cs_publications                         pass    SKIP    SKIP    
# TABLE     cs_publications_lookup                  pass    SKIP    SKIP    
# TABLE     cs_synonyms                             pass    SKIP    SKIP    
# TABLE     design_factors                          pass    SKIP    SKIP    
# TABLE     genotype_matrices                       pass    SKIP    SKIP    
# TABLE     linkage_groups                          pass    SKIP    SKIP    
# TABLE     linkage_maps                            pass    SKIP    SKIP    
# TABLE     map_linkage_group_lists                 pass    SKIP    SKIP    
# TABLE     map_locus_hits                          pass    SKIP    SKIP    
# TABLE     map_positions                           pass    SKIP    SKIP    
# TABLE     marker_assays                           pass    SKIP    SKIP    
# TABLE     marker_sequence_assignments             pass    SKIP    SKIP    
# TABLE     marker_sequence_hits                    pass    SKIP    SKIP    
# TABLE     marker_variations                       pass    SKIP    SKIP    
# TABLE     occasions                               pass    SKIP    SKIP    
# TABLE     plant_accessions                        FAIL    SKIP    SKIP    
#
# Object definitions differ. (--changes-for=server1)
#

--- `cs_full_web`.`plant_accessions`
+++ `cs_full_web_9_01`.`plant_accessions`
@@ -10,6 +10,8 @@
   `date_harvested` date NOT NULL DEFAULT '0000-00-00',
   `female_parent_plant_id` varchar(30) NOT NULL DEFAULT 'unspecified',
   `male_parent_plant_id` varchar(30) NOT NULL DEFAULT 'unspecified',
+  `plant_accession_geolocation` geometry NOT NULL,
+  `plant_accession_srid` int(11) NOT NULL,
   `comments` longtext NOT NULL,
   `entered_by_whom` varchar(90) NOT NULL DEFAULT 'unspecified',
   `date_entered` date NOT NULL DEFAULT '0000-00-00',

# TABLE     plant_individuals                       pass    SKIP    SKIP    
# TABLE     plant_line_assigned_genotypes           pass    SKIP    SKIP    
# TABLE     plant_lines                             FAIL    SKIP    SKIP    
#
# Object definitions differ. (--changes-for=server1)
#

--- `cs_full_web`.`plant_lines`
+++ `cs_full_web_9_01`.`plant_lines`
@@ -9,6 +9,8 @@
   `organisation` varchar(100) NOT NULL DEFAULT 'unspecified',
   `genetic_status` varchar(100) NOT NULL DEFAULT 'unspecified',
   `previous_line_name` varchar(100) NOT NULL DEFAULT 'unspecified',
+  `plant_line_geolocation` geometry NOT NULL,
+  `plant_line_srid` int(11) NOT NULL,
   `comments` longtext NOT NULL,
   `entered_by_whom` varchar(90) NOT NULL DEFAULT 'unspecified',
   `date_entered` date NOT NULL DEFAULT '0000-00-00',

# TABLE     plant_marker_fragments                  pass    SKIP    SKIP    
# TABLE     plant_marker_variations                 pass    SKIP    SKIP    
# TABLE     plant_parts                             pass    SKIP    SKIP    
# TABLE     plant_population_lists                  pass    SKIP    SKIP    
# TABLE     plant_populations                       pass    SKIP    SKIP    
# TABLE     plant_scoring_units                     FAIL    SKIP    SKIP    
#
# Object definitions differ. (--changes-for=server1)
#

--- `cs_full_web`.`plant_scoring_units`
+++ `cs_full_web_9_01`.`plant_scoring_units`
@@ -9,6 +9,8 @@
   `scoring_unit_frame_size` varchar(15) NOT NULL DEFAULT 'unspecified',
   `date_planted` date NOT NULL DEFAULT '0000-00-00',
   `seed_packet_id` varchar(50) NOT NULL DEFAULT 'unspecified',
+  `plant_scoring_unit_geolocation` geometry NOT NULL,
+  `plant_scoring_unit_srid` int(11) NOT NULL,
   `described_by_whom` varchar(90) NOT NULL DEFAULT 'unspecified',
   `comments` longtext NOT NULL,
   `entered_by_whom` varchar(90) NOT NULL DEFAULT 'unspecified',

# TABLE     plant_trial_collection_lists            pass    SKIP    SKIP    
# TABLE     plant_trial_collections                 pass    SKIP    SKIP    
# TABLE     plant_trials                            FAIL    SKIP    SKIP    
#
# Object definitions differ. (--changes-for=server1)
#

--- `cs_full_web`.`plant_trials`
+++ `cs_full_web_9_01`.`plant_trials`
@@ -8,6 +8,8 @@
   `trial_location_site_name` varchar(100) NOT NULL DEFAULT 'unspecified',
   `country` char(3) NOT NULL DEFAULT 'xxx',
   `place_name` varchar(100) NOT NULL DEFAULT 'unspecified',
+  `plant_trial_geolocation` geometry NOT NULL,
+  `plant_trial_srid` int(11) NOT NULL,
   `latitude` varchar(15) NOT NULL DEFAULT 'unspecified',
   `longitude` varchar(15) NOT NULL DEFAULT 'unspecified',
   `altitude` varchar(15) NOT NULL DEFAULT 'unspecified',

# TABLE     plant_varieties                         FAIL    SKIP    SKIP    
#
# Object definitions differ. (--changes-for=server1)
#

--- `cs_full_web`.`plant_varieties`
+++ `cs_full_web_9_01`.`plant_varieties`
@@ -6,6 +6,8 @@
   `taxa_authority` varchar(100) NOT NULL DEFAULT 'unspecified',
   `subtaxa_authority` varchar(100) NOT NULL DEFAULT 'unspecified',
   `crop_type` varchar(100) NOT NULL DEFAULT 'unspecified',
+  `plant_variety_geolocation` geometry NOT NULL,
+  `plant_variety_srid` int(11) NOT NULL,
   `comments` longtext NOT NULL,
   `entered_by_whom` varchar(90) NOT NULL DEFAULT 'unspecified',
   `date_entered` date NOT NULL DEFAULT '0000-00-00',

# TABLE     pop_locus_genotype_alleles              pass    SKIP    SKIP    
# TABLE     pop_type_lookup                         pass    SKIP    SKIP    
# TABLE     population_genotypes                    pass    SKIP    SKIP    
# TABLE     population_loci                         pass    SKIP    SKIP    
# TABLE     population_locus_alleles                pass    SKIP    SKIP    
# TABLE     primers                                 pass    SKIP    SKIP    
# TABLE     probes                                  pass    SKIP    SKIP    
# TABLE     proccessed_trait_datasets               pass    SKIP    SKIP    
# TABLE     processed_trait_datasets                pass    SKIP    SKIP    
# TABLE     qtl                                     pass    SKIP    SKIP    
# TABLE     qtl_jobs                                pass    SKIP    SKIP    
# TABLE     restriction_enzymes                     pass    SKIP    SKIP    
# TABLE     row_plot_positions                      pass    SKIP    SKIP    
# TABLE     scoring_occasions                       pass    SKIP    SKIP    
# TABLE     trait_descriptors                       pass    SKIP    SKIP    
# TABLE     trait_grades                            pass    SKIP    SKIP    
# TABLE     trait_scores                            pass    SKIP    SKIP    
# TABLE     version                                 FAIL    SKIP    SKIP    
#
# Object definitions differ. (--changes-for=server1)
#

--- `cs_full_web`.`version`
+++ `cs_full_web_9_01`.`version`
@@ -1,7 +1,10 @@
 CREATE TABLE `version` (
-  `date` date NOT NULL DEFAULT '0000-00-00',
-  `version` varchar(100) NOT NULL DEFAULT '',
+  `crop_group` varchar(25) NOT NULL DEFAULT 'unspecified crop',
+  `data_version` varchar(20) NOT NULL DEFAULT 'v_1_01',
+  `cs_version` varchar(25) NOT NULL DEFAULT 'unspecified version',
+  `geolocation_system` varchar(25) NOT NULL,
+  `last_updated` date NOT NULL DEFAULT '0000-00-00',
   `updated_by_whom` varchar(90) NOT NULL DEFAULT 'unspecified',
   `comments` longtext NOT NULL,
-  PRIMARY KEY (`version`)
-) ENGINE=MyISAM DEFAULT CHARSET=latin1
+  PRIMARY KEY (`data_version`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1

@kammerer
Copy link
Member

kammerer commented May 9, 2017

The most important column for me is plant_lines.species. I understand that it was replaced with a reference to taxonomy_terms table added in BIP schema.

I do not know why some other tables/columns (like scoring occasions table) are absent from BIP database even if they were included in the initial CropStoreDB dump.

@nowakowski
Copy link
Contributor

@kammerer Indeed, the taxonomy_terms reference supersedes species information in plant_lines.
Regarding your other question, we did remove certain tables/columns which either contained no data or were otherwise deemed unnecessary. In each case this was agreed upon with @wjurkowski.

@kammerer
Copy link
Member

@nowakowski Thanks, it is clear now.

@nowakowski
Copy link
Contributor

@wjurkowski I guess we need to move forward with this issue. I am ready to incorporate the changes outlined in my previous comment (see above) but we need to figure out how to treat the geolocation column - is it a simple string or a more complex data type? Could you share any examples of its actual usage?

@teatree1212
Copy link
Contributor

teatree1212 commented Jun 9, 2017 via email

@wjurkowski
Copy link
Collaborator Author

wjurkowski commented Jun 9, 2017 via email

@nowakowski
Copy link
Contributor

@wjurkowski Have you managed to come to an agreement re: geolocation data? If not, I can follow the standard suggested by @teatree1212.

@wjurkowski
Copy link
Collaborator Author

wjurkowski commented Jul 18, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants