MetaPipe: A High-Performance Computing Pipeline for QTL Mapping of Large Ionomic and Metabolomic Datasets
The goal of MetaPipe is to provide an easy-to-use and powerful tool for performing QTL mapping analyses.
You can install the released version of MetaPipe from CRAN with:
install.packages("MetaPipe")
And the development version from GitHub with:
# install.packages(c("hexSticker", "kableExtra", "qpdf", "remotes"))
remotes::install_github("villegar/MetaPipe", build_vignettes = TRUE)
For details about the data structure and extended documentation, see the vignette Load Raw Data.
vignette("load-raw-data", package = "MetaPipe")
MetaPipe::load_raw(raw_data_filename = "FILE.CSV", excluded_columns = c(...))
where raw_data_filename is the filename containing the raw data; both absolute and relative paths are accepted. Next, the argument excluded_columns is a vector containing the indices of the properties, e.g. c(2, 3, ..., M).
# F1 Seedling Ionomics dataset
ionomics_path <- system.file("extdata",
"ionomics.csv",
package = "MetaPipe",
mustWork = TRUE)
ionomics <- MetaPipe::load_raw(ionomics_path)
knitr::kable(ionomics[1:5, 1:8])
ID | SampleWeight | Ca44 | K39 | P31 | Li7 | B11 | Na23 |
---|---|---|---|---|---|---|---|
E_199 | 79 | 32675.79 | 6051.023 | 2679.338 | 0.1159068 | 23.32975 | 9.372606 |
E_209 | 81 | 28467.95 | 5642.651 | 2075.403 | 0.0104801 | 27.31206 | 8.787553 |
E_035 | 81 | 27901.35 | 7357.856 | 2632.343 | 0.0561879 | 16.87480 | 14.369062 |
E_197 | 79 | 27855.36 | 5225.275 | 1761.725 | 0.0104453 | 25.34740 | 11.009597 |
E_016 | 79 | 27377.40 | 6141.001 | 2145.715 | 0.0172996 | 24.64500 | 6.999958 |
For extended documentation, see the vignette Replace Missing Data.
vignette("replace-missing-data", package = "MetaPipe")
MetaPipe::replace_missing(raw_data = example_data,
excluded_columns = c(2),
# Optional
out_prefix = "metapipe",
prop_na = 0.5,
replace_na = FALSE)
where raw_data is a data frame containing the raw data, as described in Load Raw Data, and excluded_columns is a vector containing the indices of the properties, e.g. c(2, 3, ..., M). The other arguments are optional: out_prefix is the prefix for output files, prop_na is the proportion of NA values at or above which a trait is dropped, and replace_na is a logical flag indicating whether NAs should be replaced by half of the minimum value within each variable.
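The two optional behaviours described above (dropping traits by proportion of NAs, and filling NAs with half of the column minimum) can be sketched in base R. This is a minimal illustration on hypothetical toy data, not MetaPipe's own implementation; the names toy, excluded, kept, and filled are assumptions made for the example.

```r
# Hypothetical toy data: one ID column and three traits with missing values
toy <- data.frame(
  ID = c("s1", "s2", "s3", "s4"),
  T1 = c(4, NA, 8, 2),
  T2 = c(NA, NA, NA, 10),
  T3 = c(1, 3, 5, 7)
)
excluded <- 1    # index of the non-trait column
prop_na  <- 0.5  # traits with this proportion of NAs (or more) are dropped

traits <- toy[, -excluded, drop = FALSE]

# Proportion of missing values per trait
na_share <- colMeans(is.na(traits))

# Keep only traits below the threshold (T2 has 75% NAs and is dropped)
kept <- traits[, na_share < prop_na, drop = FALSE]

# Replace the remaining NAs by half of the minimum observed value per trait
filled <- as.data.frame(lapply(kept, function(v) {
  v[is.na(v)] <- min(v, na.rm = TRUE) / 2
  v
}))

# Re-attach the excluded column
result <- cbind(toy[, excluded, drop = FALSE], filled)
print(result)
```

In this sketch the threshold is applied before the replacement, so a trait that is mostly missing is removed rather than filled with a constant.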
# F1 Seedling Ionomics dataset
data(ionomics) # Includes some missing data
ionomics_rev <- MetaPipe::replace_missing(ionomics, c(1, 2))
ionomics_rev <- MetaPipe::replace_missing(ionomics,
excluded_columns = c(1, 2),
prop_na = 0.025)
#> The following trait was dropped because it has 2.5% or more missing values:
#> - Se78
ionomics_rev <- MetaPipe::replace_missing(ionomics,
excluded_columns = c(1, 2),
replace_na = TRUE)
knitr::kable(ionomics_rev[1:5, 1:8])
ID | SampleWeight | Ca44 | K39 | P31 | Li7 | B11 | Na23 |
---|---|---|---|---|---|---|---|
E_001 | 79 | 15894.22 | 5888.311 | 1743.118 | 0.0128699 | 18.66673 | 6.970224 |
E_002 | 93 | 13155.45 | 7013.400 | 2244.684 | 0.0119316 | 14.47693 | 5.866392 |
E_004 | 97 | 14182.51 | 7966.273 | 2311.057 | 0.0212316 | 14.71313 | 10.251955 |
E_005 | 82 | 22550.82 | 7514.089 | 2315.675 | 0.0233063 | 20.10630 | 11.773697 |
E_006 | 99 | 15982.76 | 7608.464 | 1995.193 | 0.0588128 | 12.97801 | 11.043837 |
For extended documentation, see the vignette Assess Normality.
vignette("assess-normality", package = "MetaPipe")
MetaPipe assesses the normality of variables (traits) by performing a Shapiro-Wilk test on the raw data (see Load Raw Data and Replace Missing Data). Depending on whether the data approximate a normal distribution, an array of transformations is computed and normality is assessed once more.
MetaPipe::assess_normality(raw_data = raw_data,
excluded_columns = c(2, 3, ..., M),
# Optional
cpus = 1,
out_prefix = "metapipe",
plots_dir = tempdir(),
transf_vals = c(2, exp(1), 3, 4, 5, 6, 7, 8, 9, 10),
alpha = 0.05,
pareto_scaling = FALSE,
show_stats = TRUE)
where raw_data is a data frame containing the raw data, as described in Load Raw Data, and excluded_columns is a vector containing the indices of the properties, e.g. c(2, 3, ..., M). The other arguments are optional: cpus is the number of cores to use, in other words, the number of traits to process concurrently; out_prefix is the prefix for output files; plots_dir is the output directory where the plots will be stored; transf_vals is a vector containing the transformation values to be used when transforming the original data; alpha is the significance level for the Shapiro-Wilk tests; pareto_scaling is a boolean flag indicating whether to scale the traits to the same scale; and show_stats is a boolean flag to show or hide some general statistics of the normalisation process.
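For readers unfamiliar with Pareto scaling, the sketch below shows the conventional definition: each trait is mean-centred and divided by the square root of its standard deviation. It assumes this textbook formula; whether MetaPipe applies exactly this variant when pareto_scaling = TRUE is not confirmed here, and pareto_scale is a hypothetical helper name.

```r
# Conventional Pareto scaling: centre on the mean, divide by sqrt(sd)
# (hypothetical helper, not part of MetaPipe's exported API)
pareto_scale <- function(v) {
  (v - mean(v, na.rm = TRUE)) / sqrt(sd(v, na.rm = TRUE))
}

# Two toy traits measured on very different scales
toy <- data.frame(
  A = c(1, 2, 3, 4, 5),
  B = c(100, 200, 300, 400, 500)
)

scaled <- as.data.frame(lapply(toy, pareto_scale))

# After scaling, each trait has mean 0 and a standard deviation equal to
# the square root of its original standard deviation
print(colMeans(scaled))
print(sapply(scaled, sd))
```

Compared with unit-variance scaling, this reduces the dominance of high-magnitude traits while preserving more of the original data structure.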
# F1 Seedling Ionomics dataset
data(ionomics) # Includes some missing data
ionomics_rev <- MetaPipe::replace_missing(ionomics,
excluded_columns = c(1, 2),
replace_na = TRUE)
ionomics_normalised <-
MetaPipe::assess_normality(ionomics_rev,
excluded_columns = c(1, 2),
transf_vals = c(2, exp(1)),
out_prefix = "README-ionomics",
plots_dir = "man/figures/",
pareto_scaling = FALSE)
#> Total traits (excluding all NAs traits): 21
#> Normal traits (without transformation): 2
#> Normal traits (transformed): 4
#> Total normal traits: 6
#> Total skewed traits: 15
#>
#> Transformations summary:
#> f(x) Value # traits
#> log 2 3
#> root e 1
# Extract normalised features
ionomics_norm <- ionomics_normalised$norm
ionomics_skew <- ionomics_normalised$skew
The function call to MetaPipe::assess_normality will print a summary of the transformations performed (if any), as well as an overview of the number of traits that should be considered normal and skewed. Next, we can preview some of the partial output of the normality assessment process:
# Normal traits
knitr::kable(ionomics_norm[1:5, ])
ID | Ca44 | B11 | Na23 | Mg26 | Rb85 | Sr88 |
---|---|---|---|---|---|---|
E_001 | 15894.22 | 4.222397 | 2.042740 | 10.77021 | 1.555742 | 7.347059 |
E_002 | 13155.45 | 3.855684 | 1.917202 | 10.54095 | 2.058711 | 6.890243 |
E_004 | 14182.51 | 3.879033 | 2.354263 | 10.51931 | 2.198422 | 9.025915 |
E_005 | 22550.82 | 4.329576 | 2.477233 | 11.13450 | 1.791578 | 15.292360 |
E_006 | 15982.76 | 3.697997 | 2.419593 | 11.72734 | 2.229866 | 13.901449 |
# Skewed traits (partial output)
knitr::kable(ionomics_skew[1:5, 1:8])
ID | K39 | P31 | Li7 | Al27 | S34 | Fe54 | Mn55 |
---|---|---|---|---|---|---|---|
E_001 | 5888.311 | 1743.118 | 0.0128699 | 3.845879 | 1152.944 | 27.59340 | 54.53991 |
E_002 | 7013.400 | 2244.684 | 0.0119316 | 5.825639 | 1600.442 | 35.49159 | 52.57114 |
E_004 | 7966.273 | 2311.057 | 0.0212316 | 8.036047 | 1039.098 | 39.13434 | 36.66475 |
E_005 | 7514.089 | 2315.675 | 0.0233063 | 9.482051 | 1091.607 | 40.22041 | 43.24368 |
E_006 | 7608.464 | 1995.193 | 0.0588128 | 29.329605 | 1096.871 | 75.23614 | 53.64705 |
Among the transformed traits are B11 and Na23. Both appear skewed in the raw data but, after a simple transformation, can be classified as normal traits.
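The test-transform-retest idea behind traits such as B11 and Na23 can be sketched with stats::shapiro.test from base R. The example uses a synthetic right-skewed trait built from normal quantiles (so the result does not depend on a random seed); the helper is_normal and the use of p-value >= alpha as the decision rule are assumptions for illustration, not MetaPipe's confirmed logic.

```r
# Hypothetical helper: TRUE when the Shapiro-Wilk test does not reject normality
is_normal <- function(v, alpha = 0.05) {
  shapiro.test(v)$p.value >= alpha
}

# Synthetic right-skewed trait: exponentiated normal quantiles (log-normal shape)
z <- qnorm(ppoints(50))
trait <- exp(z)

# Step 1: test the raw values
raw_ok <- is_normal(trait)

# Step 2: if skewed, apply a log transformation (base 2 here) and test again
trait_log2 <- log(trait, base = 2)
log_ok <- is_normal(trait_log2)

print(c(raw = raw_ok, log2 = log_ok))
```

Here the raw values fail the test while the log-transformed values pass, which mirrors how a trait can move from the skewed set to the normal set after transformation.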
qtl_scone <- function(x_data, cpus = 1, ...)
where x_data
# F1 Seedling Ionomics dataset
data(father_riparia) # Genetic map
# Load cross file with genetic map and raw data for normal traits
x <- MetaPipe::read.cross(father_riparia,
ionomics_norm,
genotypes = c("nn", "np", "--"))
#> --Read the following data:
#> 166 individuals
#> 1115 markers
#> 7 phenotypes
#> Warning in summary.cross(cross): Some markers at the same position on chr
#> 1,4,5,7,8,9,10,12,14,15,16,17; use jittermap().
#> --Cross type: f2
set.seed(123)
x <- qtl::jittermap(x)
x <- qtl::calc.genoprob(x, step = 1, error.prob = 0.001)
x_scone <- MetaPipe::qtl_scone(x, 1, model = "normal", method = "hk")
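As a point of comparison, a single-trait genome scan can be run directly with the qtl package using the same model and method arguments. This sketch uses the hyper example dataset shipped with qtl rather than the ionomics data, and it assumes (without confirming from the source) that this per-trait scan is similar in spirit to what qtl_scone performs for each trait.

```r
# Requires the qtl package (R/qtl)
library(qtl)

# Example backcross dataset bundled with qtl
data(hyper, package = "qtl")

# Genotype probabilities are needed for Haley-Knott regression
hyper <- qtl::calc.genoprob(hyper, step = 1, error.prob = 0.001)

# Single-QTL genome scan for the first phenotype
out <- qtl::scanone(hyper, pheno.col = 1, model = "normal", method = "hk")

# Highest LOD score per chromosome
print(summary(out))
```

The result is a data frame of class scanone with one row per tested position and columns for chromosome, position, and LOD score.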