Tripal_analysis_expression

This is an extension module for the Tripal project.

This documentation is for the Tripal 2 version of Tripal Analysis Expression. This version is no longer maintained; future releases will require Tripal 3.

Tripal Analysis: Expression

  1. Introduction
  2. Installation
  3. Module Features
  4. Loading Biomaterials
  5. Loading Expression Data
  6. Viewing Data
  7. Searching features and displaying expression data in a heatmap
  8. Administrative Pages
  9. Example Files

Introduction

Tripal Analysis: Expression is a Drupal module built to extend the functionality of the Tripal toolset. The purpose of the module is to visually represent gene expression for Tripal features. This module requires the following Tripal modules:

  1. Tripal Core
  2. Tripal Views
  3. Tripal DB
  4. Tripal CV
  5. Tripal Analysis
  6. Tripal Feature
  7. Tripal Organism
  8. Tripal Contact

Installation

  1. Click on the green "Clone or download" button in the top right corner of this page to obtain the web URL. Download this module by running git clone <URL> on the command line.
  2. Place the cloned module folder "tripal_analysis_expression" inside your /sites/all/modules directory. Then enable the module by running drush en tripal_analysis_expression (for more instructions, read the Drupal documentation page).

Module Features

Content Types

This module provides four content types: Analysis: Expression, Biomaterial, Array Design, and Protocol.

  1. Analysis: Expression - The analysis: expression content type is built by hooking into the Tripal 2 Analysis module. It was modeled after the content types provided by the Tripal InterPro Analysis module and the Tripal Blast Analysis module. This content type provides an interface to describe the experiment from which expression data was gathered, and it can describe either microarray expression data or next generation sequencing expression data (such as data obtained from RNASeq). It also provides a form to load expression data associated with the analysis.

  2. Biomaterial - The biomaterial content type represents the Chado biomaterial table. The Chado biomaterial table is a member of the Chado MAGE module. The biomaterial content type is similar to the BioSample content type provided by NCBI. See the biomaterial description at the GMOD wiki.

  3. Array Design - The array design content type represents the Chado arraydesign table. This table is only used when describing the experimental design of data collected from a microarray expression experiment. See the arraydesign description at the GMOD wiki.

  4. Protocol - The protocol content type represents the Chado protocol table. This table is used to describe the protocol, software, and hardware used in different steps of the experiment. See the protocol description at the GMOD wiki.

Module Administrative Pages

For each of the above content types, this module provides full administrative capabilities, which include the following pages: administrative list of content, sync, delete, TOC (table of contents), settings, and help. These pages were modeled after and created using other Tripal modules.

User Searches

A simple anonymous user search (using Views) is also provided for each content type. These searches can be found at the following URLs:

site_name/chado/analysis-expression

site_name/chado/arraydesign

site_name/chado/biomaterial

site_name/chado/protocol

Data Loaders

This module provides two loaders: a biomaterial loader and an expression loader. The biomaterial loader can load data from a flat file or from an XML file downloaded from NCBI. The expression loader is included in the analysis: expression content type form.

Expression Display

Once expression data is loaded, a display will be shown on each feature page that has corresponding biomaterials and expression values.

Loading Biomaterials

Biomaterials may be loaded from a flat file or from a BioSample XML file downloaded from NCBI. The steps for loading biomaterials are as follows (detailed instructions can be found further below):

  1. First download or generate the flat (.csv, .tsv) or .xml file with biomaterials data you want to load.
  2. Add the organism associated with the biomaterial if it doesn't exist yet (Add content->Organism).
  3. Navigate to the Tripal site's Tripal Biomaterial Loader to submit the job with a .xml file or a flat file. Run the job on the command line with the Drush command.
  4. Sync the biomaterial(s) on the Tripal site. Run the sync job on the command line with the Drush command. Note that this step is not needed if a biomaterial with the same "sample_name" already exists in the database. In that case, the database entries for that biomaterial should be updated.
  5. Verify that the biomaterial(s) loaded correctly by viewing them via Find content.

Downloading XML BioSample File From NCBI

To obtain an XML BioSample file from NCBI, go to the NCBI BioSample database. Search for and select the BioSamples you would like to download.

Click the "Send to:" link. Then select "File" and select "Full XML (text)" as the format. Then click "Create File".

Click here to see an example XML BioSample file from NCBI.

Loading NCBI XML BioSample File into Tripal

To upload the file into Chado/Tripal, navigate to:
Tripal->Extensions->Expression Analysis->Tripal Biomaterial Loader

Select the organism for which you are uploading expression data. Select "NCBI biosample xml file" and then enter the path in the "File Path" field.

After clicking "Submit job", the page should reload with the job status and Drush command to run the job. Copy and paste the Drush command and run it on command line. Upon running the Drush command, any warning/error/success/status message should be displayed.


Loading Biomaterials From a Flat File

Alternatively, biomaterials may be loaded from a flat file (CSV or TSV). The flat file loader is designed to upload files that are in the NCBI BioSample submission format, which can be downloaded here. Download the TSV version of the file.

The file must have a header that specifies the type of data in each column, and one column must be labeled "sample_name". The loader treats the line containing "sample_name" as the header line and begins collecting data from the line that follows it. Columns are not required to be in any order. All other columns are either attributes or accessions. Available NCBI attributes can be found here. Available accession headers are bioproject_accession, sra_accession, and biosample_accession. All remaining columns will be uploaded as properties.

To upload other accessions, use the bulk loader provided with this module, labeled "Biomaterial Accession Term Loader". This loader will load a flat file with 3 columns (sample name, database name, accession term). A Tripal database must be created with the same name as the database name in the upload file.

Click here to see an example of a CSV file and a TSV file.
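The flat file rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the module's own loader (which is part of the Drupal module): the function name, the tab delimiter default, the choice to skip empty cells, and the sample values are all assumptions made for the example.

```python
import csv
import io

# Accession headers named in this README; other columns are treated as properties.
ACCESSION_HEADERS = {"bioproject_accession", "sra_accession", "biosample_accession"}


def parse_biomaterial_flat_file(text, delimiter="\t"):
    """Return one dict per biomaterial with its name, accessions, and properties."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # The header is the first line that has a "sample_name" column.
    header_index = next(
        (i for i, row in enumerate(rows) if "sample_name" in row), None
    )
    if header_index is None:
        raise ValueError('no header line containing "sample_name" was found')
    header = rows[header_index]
    biomaterials = []
    for row in rows[header_index + 1:]:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        record = dict(zip(header, row))
        entry = {"sample_name": record["sample_name"], "accessions": {}, "properties": {}}
        for column, value in record.items():
            # Assumption for this sketch: empty cells are simply ignored.
            if column == "sample_name" or value == "":
                continue
            target = "accessions" if column in ACCESSION_HEADERS else "properties"
            entry[target][column] = value
        biomaterials.append(entry)
    return biomaterials


# Made-up sample data; the accession value is a placeholder, not a real record.
example = (
    "# free text before the header is skipped\n"
    "tissue\tsample_name\tbiosample_accession\n"
    "leaf\tsample_A\tSAMN00000001\n"
    "root\tsample_B\t\n"
)
for entry in parse_biomaterial_flat_file(example):
    print(entry)
```

Note that the column order in the example is deliberately shuffled, since the loader does not require any particular order.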


Syncing Biomaterials

After loading, biomaterials must be synced to create nodes for each biomaterial content type. As an administrator or user with correct permissions, navigate to Tripal->Extensions->Expression Analysis->Tripal Expression Analysis Content Types->Biomaterial->SYNC. Select the biomaterials to sync and click "Sync Biomaterials".


Similarly, after clicking "Sync Biomaterials", run the Drush command on the command line and monitor for any warnings or error messages.

Loading a Single Biomaterial

Biomaterials may also be loaded one at a time. As an administrator or a user with permission to create content, go to: Add content->Biomaterial. Available biomaterial fields include the following.

  • Biomaterial Name (must be unique - required)
  • Biomaterial description - A description of the biomaterial.
  • Biomaterial Provider - The person or organization responsible for collecting the biomaterial
  • Organism - The organism from which the biomaterial was collected.
  • Analysis - The expression analysis associated with the biomaterial. Note that a biomaterial can be created before an expression analysis.

There is also the ability to add properties or accession values to the biomaterial.

Loading Expression Data

The steps for loading expression data are as follows (detailed instructions can be found further below):

  1. Obtain expression data. Click here to read about the file formats accepted for expression data.
  2. Add the organism associated with the expression data (Add content->Organism) if it hasn't been added.
  3. Upload all features in the expression data to the Chado database. To bulk upload features, go to Tripal->Chado Data Loaders->FASTA file Loader or Tripal->Chado Modules->Features->Import via FASTA file and upload a fasta file (click here to see an example of fasta file of transcriptome sequences). Or upload one feature at a time via Add content->Feature or Tripal->Chado Modules->Features->Add feature->Feature. Submit and run job with Drush command. Then sync the features via Tripal->Chado Modules->Features->Sync. Submit and run job with Drush command. Finally, verify that the features have been added correctly via Find content.
  4. Create the experiment setup. Provide file path for the expression data or directory and make sure "Submit a job to parse the expression data into Chado" is checked. Save analysis and run the job with Drush command.
  5. View the expression data by going to Find content and clicking into the features just added.

Creating the Experiment Setup

Before loading data, describe the experimental setup used to collect the data. As an administrator or a user with permission to create content, go to: Add content->Analysis: Expression. The "Analysis: Expression" content type is a sub-type of the analysis content type. It contains all fields used in the analysis content type as well as fields that allow the description of the experimental design and the data loader.

Note that program name, program version, and source name must be unique as a whole for analysis to be inserted correctly (click here to read more about the data structure for analysis).

Analysis Fields:

  • Analysis Name (required)
  • Program, Pipeline Name or Method Name (required, part of unique constraint)
  • Program, Pipeline or Method version (required, part of unique constraint)
  • Algorithm
  • Source Name (required, part of unique constraint)
  • Source Version
  • Source URI
  • Time Executed (required)
  • Materials & Methods (Description and/or Program Settings)
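To make the uniqueness rule concrete, the sketch below checks a list of planned analyses for repeated (program, version, source name) combinations before they are entered. It is an illustration only: the function name and sample values are hypothetical, and the dictionary keys are assumed to follow the Chado analysis column names.

```python
def find_conflicts(analyses):
    """Return names of analyses whose (program, version, source) repeats an earlier one."""
    seen = set()
    conflicts = []
    for analysis in analyses:
        # These three values together must be unique across all analyses.
        key = (analysis["program"], analysis["programversion"], analysis["sourcename"])
        if key in seen:
            conflicts.append(analysis["name"])
        seen.add(key)
    return conflicts


# Hypothetical analyses: the second one reuses the first one's combination.
planned = [
    {"name": "RNA-seq run 1", "program": "aligner", "programversion": "1.0", "sourcename": "leaf reads"},
    {"name": "RNA-seq run 2", "program": "aligner", "programversion": "1.0", "sourcename": "leaf reads"},
    {"name": "RNA-seq run 3", "program": "aligner", "programversion": "1.1", "sourcename": "leaf reads"},
]
print(find_conflicts(planned))
```

Changing any one of the three values, as in the third entry, is enough to make the combination unique.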

There is also the ability to add analysis properties to this content type.

Experimental Design Fields

The "Experimental Design" fields allow a complete description of the experimental design. The Analysis: Expression module uses the Chado MAGE module, which describes an experiment with the arraydesign, assay, quantification, and acquisition tables. This is reflected in the following fields available to describe an experiment.

  • Organism (required)
  • Biomaterial Provider - The person or organization responsible for collecting the biomaterial.
  • Array Design - This is only applicable for microarray expression data. This may be left blank for next generation sequencing expression data.
  • Assay Details - A description of the physical instance of the array used in the experiment
  • Date Assay Run - The date the assay was run.
  • Assay Description - A short description of the assay.
  • Assay Operator - The person or organization that ran the assay.
  • Assay Protocol - The assay protocol used in the experiment. (See protocol description below).
  • Acquisition Details - The scanning of the experiment.
  • Data Acquisition Run - The date the acquisition was run.
  • Acquisition URI - A web address to a page that describes the acquisition process.
  • Acquisition Protocol - The acquisition protocol used in the experiment. (See protocol description below).
  • Quantification Details - A description of the method used to transform raw expression data into numeric data.
  • Date Quantification Run - The date the quantification was run.
  • Quantification URI - A web address to a page that describes the quantification process.
  • Quantification Operator - The person or organization that ran the quantification.
  • Quantification Protocol - The quantification protocol used in the experiment. (See protocol description below).

Protocol Description - Protocol content can be created by navigating to Add content->Protocol. A protocol can be used to add extra detail to an experimental design. A protocol can be used to describe the assay, acquisition, and quantification steps of the experiment design. A protocol can also be used to further describe the array design content type. The fields of a protocol are:

  • Protocol Name (must be unique - required)
  • Protocol Link - A web address to a page that describes the protocol.
  • Protocol Description - A description of the protocol.
  • Hardware Description - A description of the hardware used in the protocol.
  • Software Description - A description of the software used in the protocol.
  • Protocol Type (required) - The protocol type can be acquisition, array design, assay, or quantification. The user can also create new protocol types.
  • Publication - A publication that describes the protocol.

Data Loader

The data loader fields provide a way for the user to load expression data associated with the experiment. The loader can load data from two types of formats, matrix and column. The matrix format expects a header row containing biomaterial names. The first column should contain unique feature names. Features must already be loaded into the database. Biomaterials will be added if not present. Each expression value maps to the biomaterial in its column and the feature in its row. Only one matrix file may be loaded at a time. The column format expects the first column to contain features and the second column to contain expression values.

For an example column file, click here. For an example matrix file, click here.
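As a rough illustration of the matrix format, the sketch below turns matrix lines into (feature, biomaterial, value) triples. It is not the module's loader. It assumes tab-separated values and a header row with a corner cell above the feature column; the function name and sample data are made up for the example.

```python
def parse_expression_matrix(lines):
    """Yield (feature_name, biomaterial_name, value) from matrix-format lines."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return
    header, data_rows = rows[0], rows[1:]
    # Assumption: the first header cell sits above the feature column and is ignored.
    biomaterials = header[1:]
    seen = set()
    for row in data_rows:
        feature, values = row[0], row[1:]
        if feature in seen:
            raise ValueError(f"duplicate feature name: {feature}")
        seen.add(feature)
        if len(values) != len(biomaterials):
            raise ValueError(f"{feature}: expected {len(biomaterials)} values")
        # Each value belongs to the biomaterial in its column and the feature in its row.
        for biomaterial, value in zip(biomaterials, values):
            yield feature, biomaterial, float(value)


# Made-up matrix with two features and two biomaterials.
matrix = [
    "feature\tleaf\troot\n",
    "gene_1\t12.5\t0\n",
    "gene_2\t3\t7.25\n",
]
for triple in parse_expression_matrix(matrix):
    print(triple)
```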

The biomaterial name will be taken as the name of the file minus the file extension. Features must already be loaded into the database. Biomaterials will be added if not present. Multiple column format files may be loaded at the same time given that the files are in the same directory and contain the same file suffix. Either format may have header or footer information. Regex can be used in the form to only record data after the header and before the footer. Any file suffix can be used. The data loader fields are the following:

  • Source File Type - This can be either "Column Format" or "Matrix Format".
  • Checkbox - Check this box to submit a job to parse the data into Chado.
  • File Type Suffix - The suffix of the files to load. This is used to submit multiple column format files in the same directory. A suffix is not required for a matrix file.
  • File Path - The path to a single matrix or column format file. The path may also be set to a directory, in which case all column files with the "File Type Suffix" specified above will be loaded. When loading multiple files from a directory, a file suffix must be specified.
  • Regex for Start of Data - If the expression file has a header, use this field to capture the line that occurs before the start of expression data. This line of text and any text preceding this line will be ignored.
  • Regex for End of Data - If the expression file has a footer, use this field to capture the line that occurs after the end of expression data. This line of text and all text following will be ignored.
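The column format rules and the two regex fields can be sketched as follows. This is an illustration under stated assumptions rather than the module's implementation: the function names, the whitespace-separated columns, the "#START"/"#END" markers, and the ".exp" suffix are all invented for the example.

```python
import re
import tempfile
from pathlib import Path


def read_column_file(path, start_regex=None, end_regex=None):
    """Return (biomaterial_name, {feature: value}) for one column-format file."""
    lines = Path(path).read_text().splitlines()
    if start_regex:
        # Ignore the matching line and everything before it.
        for i, line in enumerate(lines):
            if re.search(start_regex, line):
                lines = lines[i + 1:]
                break
    if end_regex:
        # Ignore the matching line and everything after it.
        for i, line in enumerate(lines):
            if re.search(end_regex, line):
                lines = lines[:i]
                break
    values = {}
    for line in lines:
        if not line.strip():
            continue
        # First column is the feature, second column is the expression value.
        feature, value = line.split()[:2]
        values[feature] = float(value)
    # The biomaterial name is the file name minus its extension.
    return Path(path).stem, values


def read_column_directory(directory, suffix, **regexes):
    """Load every file in `directory` whose name ends with `suffix`."""
    paths = sorted(p for p in Path(directory).iterdir() if p.name.endswith(suffix))
    return dict(read_column_file(p, **regexes) for p in paths)


# Made-up files: only the one with the requested suffix is loaded.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "leaf.exp").write_text(
        "header text\n#START\ngene_1 12.5\ngene_2 0\n#END\nfooter text\n"
    )
    Path(tmp, "notes.txt").write_text("not expression data\n")
    print(read_column_directory(tmp, ".exp", start_regex=r"^#START", end_regex=r"^#END"))
```

In this sketch a regex that never matches leaves the file untrimmed, which is a choice made here for simplicity and may differ from how the module behaves.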


Viewing Data

The following panes are added to the following content types:

Feature

  • Expression - After biomaterials and expression data have been loaded, the expression pane will appear on the corresponding feature page. The pane has five links: Sort Descending, Sort Ascending, Only Non-Zero Values, Tile/Chart, and Reset.
  • Sort Descending/Sort Ascending - Sort expression data based on expression values - descending or ascending.
  • Only Non-Zero Values - Remove biomaterials that do not express the feature.
  • Tile/Chart - Toggle figure between a tile heatmap view or a chart view.
  • Reset - Reset the figure. Return the figure to its original state.
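The sort and filter links behave like the small operations sketched below. This only illustrates the described behavior on made-up values; the function names are hypothetical and the module's actual figure code is not shown here.

```python
def sort_expression(values, descending=True):
    """Order (biomaterial, value) pairs by expression value."""
    return sorted(values, key=lambda pair: pair[1], reverse=descending)


def only_non_zero(values):
    """Drop biomaterials whose expression value is zero."""
    return [pair for pair in values if pair[1] != 0]


# Made-up expression values for one feature.
expression = [("root", 0.0), ("leaf", 12.5), ("stem", 3.0)]
print(sort_expression(expression))
print(sort_expression(expression, descending=False))
print(only_non_zero(expression))
```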


Organism

  • Biomaterial Browser - After loading biomaterials, a new pane with a list of biomaterials will appear on the corresponding organism page. Biomaterials are not required to be synced to appear in this list.

Analysis: Expression

  • Overview (base) - The generic tripal overview pane.
  • Protocol - Protocols used in this analysis (acquisition protocol, assay protocol, and quantification protocol).

Biomaterial

  • Overview (base) - The generic tripal overview pane.
  • Properties - Properties associated with the biomaterial.
  • Cross References - Accession terms associated with the biomaterial.

Array Design

  • Overview (base) - The generic tripal overview pane.
  • Properties - Properties associated with the array design.

Protocol

  • Overview (base) - The generic tripal overview pane.

Searching features and displaying expression data in a heatmap

This module creates two blocks: one for features input and the other displaying a heatmap for the input features.

Turn On blocks

Go to Structure->blocks and find these two blocks: tripal_analysis_expression features form for heatmap and tripal_elasticsearch block for search form: blast_merged_transcripts. Configure these two blocks so that they display in a specific region and on specific page(s). The tripal_analysis_expression features form for heatmap block displays a form that allows you to input feature IDs.

After you enter some feature IDs, click the "Display Expression Heatmap" button to generate a heatmap for the features.

Administrative Pages

Content Type Administrative Pages

Each Analysis: Expression content type has administrative pages. As an administrator or a user with correct permissions navigate to the following: Tripal->Extensions->Expression Analysis->Tripal Expression Analysis Content Types. Each content type has the following administrative pages.

  • Administrative Search - Administrative search to find, create, edit, or delete content type.
  • Sync - Page to sync content type from the chado database. Also provides a method to clean up orphaned nodes.
  • Delete - Page where content type can be deleted in bulk.
  • TOC - Page to change the default order and display of table of contents and panes for content type pages.
  • Settings - Page to set default page titles and default page urls for content type.
  • Help - Description of the content type.


Expression Display Administrative Page

The display of expression data on feature pages can be configured. To configure the expression figure, navigate to Tripal->Extensions->Expression Analysis->Tripal Expression Analysis Settings. Available options are:

  • Hide Expression Figure - Hide expression figures on all feature pages. With this option you can load expression data without displaying the expression figure.
  • Hide Biomaterial Labels - Hide the name of the biomaterial under the expression figure tile or column. Biomaterial names will still appear in tooltips.
  • Maximum Label Length - Set the maximum acceptable biomaterial name length. Biomaterial names that are longer than this length will be truncated.
  • Expression Column Width - Change the width of the tile or column in the figure. Value must be 15 or greater.
  • Default Heatmap Display - The default display can be either a one dimensional heatmap or a bar chart.

Example Files

Biomaterial Loader

  1. Flat files: CSV file, TSV file
  2. XML file

Expression Data Loader

  1. Column file
  2. Matrix file