OSG-KINC is a Pegasus workflow that is configured to run on the Open Science Grid. The workflow uses the Knowledge Independent Network Construction (KINC) software package to build a similarity matrix representing correlation analysis of all pairwise comparisons of genes/transcripts in a Gene Expression Matrix (GEM).
This repository contains a pre-compiled kinc binary that is executable on the OSG resources requested by this workflow. The user must provide a properly formatted GEM, and specify the number of jobs that the computation will be split into.
The OSG-KINC workflow is designed to execute on the Open Science Grid via the OSG Connect infrastructure. Access to the system can be requested on the sign up page.
Once you have an account and have joined a project, log in to the login02 submit node. This node can be accessed via ssh.
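For example (the hostname below assumes the standard OSG Connect login node address; replace <username> with your OSG Connect user name):

$ ssh <username>@login02.osgconnect.net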
A workflow-specific SSH key has to be created. This key is used for some of the data staging steps of the workflow.
$ mkdir -p ~/.ssh
$ ssh-keygen -t rsa -b 2048 -f ~/.ssh/workflow
(just hit enter when asked for a passphrase)
Once you have the ~/.ssh/workflow.pub file, add it to your profile as described in https://support.opensciencegrid.org/support/solutions/articles/12000027675-generate-ssh-key-pair-and-add-the-public-key-to-your-account#step-2-add-the-public-ssh-key-to-login-node
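To display the public key so it can be copied into the web form, simply print it on the submit node:

$ cat ~/.ssh/workflow.pub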
The workflow will automatically identify the dimensions of the input Gene Expression Matrix. However, this matrix must be formatted as follows:
- Tab delimited text file
- No 'GeneID' string in the header. The first value of the header line is a sample ID, so the header row appears shifted one column relative to the data rows (it looks frameshifted; don't worry)
Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | |
---|---|---|---|---|---|
Gene1 | 0.215 | 12.770 | 15.112 | 12.111 | 501.211 |
Gene2 | 0.111 | 10.265 | 50.555 | 200.331 | 423.221 |
Gene3 | 0.212 | 0.472 | 12.223 | 111.121 | 456.778 |
Gene4 | 0.543 | 0.778 | 23.421 | 224.431 | 333.421 |
Gene5 | 0.321 | 0.783 | 44.431 | 353.765 | 334.431 |
For this dataset, the kinc parameters would be identified as follows:
- rows: 6
- columns: 5
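If you would like to double-check the dimensions that the workflow will detect, a quick sanity check can be done with standard shell tools (these commands are not part of the workflow; 'test.txt' is used as an example file name):

$ wc -l < test.txt
$ head -1 test.txt | awk -F'\t' '{print NF}'

The first command prints the number of lines in the file, and the second prints the number of sample columns in the header line.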
Once the input GEM is formatted properly, it must be named with the '.txt' file extension and placed in the 'task-files' directory of the workflow.
In addition, a compressed tar archive with the extension 'tar.gz' of this text file must be provided. This can be generated using the following command, using 'test.txt' as an example:
$ tar czf test.tar.gz test.txt
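To confirm that the archive was created correctly, you can list its contents:

$ tar tzf test.tar.gz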
An example GEM, 'Yeast-GEM.txt', is included in the task-files directory of this repository. Please inspect this file to see an example of a properly formatted input file. The task-files directory also contains the corresponding 'Yeast-GEM.tar.gz' file.
Once the input files are placed in the task-files directory, the workflow can be submitted by executing the submit script, with one additional argument representing the number of jobs that you want to submit. For example:
$ ./submit 1000
This will submit the workflow and split the computation into 1000 jobs, which is the suggested number of jobs for this test workflow.
In addition to properly formatting the input GEM, the user must specify an appropriate number of jobs. The number of pairwise comparisons performed grows quadratically with the number of rows in the matrix:
(n x (n - 1)) / 2, where n = the number of rows in the matrix.
For example, a GEM containing 88,520 rows will result in over 3.9 billion comparisons being performed. For a matrix of this size, with less than 500 samples, we recommend submitting the workflow with 80,000 jobs.
The number of columns (samples) also has a large effect on the computational time needed to run this workflow. As a result, we have not yet determined a formula or algorithm that determines the optimal number of jobs for a given matrix. However, please aim for less than 1 GB of output per job and an average job runtime of less than 4 hours. If you submit a workflow and none of the jobs finish within several hours, please resubmit with a larger number of jobs.
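As a rough sizing sketch, the formula above can be evaluated on the command line to estimate the number of comparisons per job (the values below reproduce the 88,520-row example; adjust n and jobs for your own matrix):

$ awk 'BEGIN { n = 88520; jobs = 80000; pairs = n * (n - 1) / 2; printf "%.0f comparisons total, about %.0f per job\n", pairs, pairs / jobs }'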
The kinc-wrapper script in the tools directory contains the parameters that are passed to the KINC software. The user may modify this file to change clustering or correlation parameters, or to specify the missing value field (the default is NA). Do not remove the --header parameter, as doing so will result in incorrect identification of the matrix dimensions.
Once the input GEM .txt file and corresponding .tar.gz archive have been placed in the task-files directory, the user must delete the example input files that ship with the repository ('Yeast-GEM.txt' and 'Yeast-GEM.tar.gz') before submitting the workflow. To avoid confusion, the workflow will fail to submit if multiple input datasets are present in the task-files directory.
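A quick way to confirm that exactly one dataset remains is to list the relevant files (a generic shell check, not part of the workflow):

$ ls task-files/*.txt task-files/*.tar.gz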
Pegasus provides a set of commands to monitor the progress of the workflow.
$ pegasus-analyzer
$ pegasus-status
$ pegasus-statistics
- pegasus-analyzer will identify failed jobs for troubleshooting.
- pegasus-status will report workflow progress, presented as the number of successful, queued, and failed jobs.
- pegasus-statistics will produce basic workflow statistics once the workflow has completed. Use the following command to generate a detailed breakdown of job statistics:
$ pegasus-statistics -s all
To run these commands, the user must provide the full path to the workflow directory, or cd into this directory. This directory is separate from where the workflow is submitted, and will be located on the /local-scratch filesystem:
/local-scratch/<user_id>/workflows/kinc-<workflow_id>/workflow/kinc-<workflow_id>
Here is an example of a full path to the workflow directory:
/local-scratch/wpoehlm/workflows/kinc-1489505008/workflow/kinc-1489505008
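For example, the monitoring commands can either be given the workflow directory as an argument or be run from inside it (the path below reuses the example above):

$ pegasus-status /local-scratch/wpoehlm/workflows/kinc-1489505008/workflow/kinc-1489505008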
If the workflow finishes with failed jobs, these can be resubmitted with the following command:
$ pegasus-run
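As with the monitoring commands, pegasus-run should be pointed at the workflow directory, either by running it from inside that directory or by passing the path, for example:

$ pegasus-run /local-scratch/wpoehlm/workflows/kinc-1489505008/workflow/kinc-1489505008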
Output will be temporarily stored on a scratch filesystem, and then copied onto the /local-scratch filesystem. This output directory is present at the base of the workflow directory:
/local-scratch/<user_id>/workflows/kinc-<workflow_id>/outputs
for example:
/local-scratch/wpoehlm/workflows/kinc-1489505008/outputs
OSG-KINC may pull in an input dataset from a remote iRODS server, use iRODS as the workflow staging site, and transfer output files to the iRODS server. To do so, the user must take the following steps:
- install the iRODS icommands on the submit host and any worker nodes
- create an iRODS environment file, ~/.irods/irods_environment.json (a quick connectivity check is sketched after this list)
This file must have the following content:
{
    "irods_host": "some.host.edu",
    "irods_port": 1247,
    "irods_user_name": "someuser",
    "irods_zone_name": "somezone",
    "irodsPassword": "somesecretpassword"
}
- provide an iRODS path to a compressed input file as a second argument to the submit script. For example:
$ ./submit 1000 /myirodszone/home/user/test.tar.gz
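Before submitting with an iRODS path, it is a good idea to restrict the permissions on the environment file and, if the icommands are installed on the submit host, confirm that the iRODS server can be reached (iinit caches your credentials and ils lists your home collection):

$ chmod 600 ~/.irods/irods_environment.json
$ iinit
$ ils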