Accurate Bayesian reconstruction of cancer phylogenies from bulk sequencing.

Roth-Lab/PhyClone

PhyClone

Accurate Bayesian reconstruction of cancer phylogenies from bulk sequencing. An implementation of the forest structured Chinese restaurant process with a Dirichlet prior on the node parameters.


Overview

  1. PhyClone Installation
  2. Input File Formats
  3. Running PhyClone: Basic Usage
  4. PhyClone Output

Installation

The recommended way to install PhyClone is through mamba and the Bioconda package channel.

To install into a newly created environment (Recommended):

mamba create --name phyclone phyclone

Or if installing into a pre-existing environment:

mamba install phyclone

Input File Formats

A PhyClone analysis takes up to two input files: the main input file and an optional cluster file, both described below.

Caution

In principle, PhyClone can be used without pre-clustering. However, doing so drastically increases the computational complexity, so pre-clustering is recommended for WGS data.


Main input format

Tip

There is an example file in examples/data/mixing.tsv

To run a PhyClone analysis you will need to prepare an input file. The file should be a tab-delimited tidy data frame with the following columns:

  1. mutation_id - Unique identifier for the mutation. This is free form but should match across all samples.

Warning

PhyClone will remove any mutations without entries for all provided samples. If there are mutations with no data in a subset of the samples, the correct procedure is to extract ref and alt counts for these mutations from each affected sample's associated BAM file. Please refer to this thread for further detail.

  2. sample_id - Unique identifier for the sample.

  3. ref_counts - Number of reads matching the reference allele.

  4. alt_counts - Number of reads matching the alternate allele.

  5. major_cn - Major copy number of the segment overlapping the mutation.

  6. minor_cn - Minor copy number of the segment overlapping the mutation.

  7. normal_cn - Total copy number of the segment in healthy tissue. For autosomes this will be 2; for male sex chromosomes, 1.

You can include the following optional columns:

  1. tumour_content - The tumour content (cellularity) of the sample. Default value is 1.0 if column is not present.

Note

In principle this could be different for each mutation/sample. However, in most cases it should be the same for all mutations in a sample.

  2. error_rate - Sequencing error rate. Default value is 0.001 if column is not present.
  3. chrom - Chromosome on which mutation_id is found.
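
The column requirements and the completeness rule above can be checked before starting a run. The following is a minimal sketch, not part of PhyClone: check_input is a hypothetical helper, and it assumes only the column names listed above and that "all provided samples" means every sample_id appearing in the file.

```python
import csv
import io

# Required columns of the main input file, as listed above.
REQUIRED_COLUMNS = [
    "mutation_id", "sample_id", "ref_counts", "alt_counts",
    "major_cn", "minor_cn", "normal_cn",
]


def check_input(handle):
    """Return (missing_columns, incomplete_mutations) for a tab-delimited input.

    Hypothetical pre-flight helper, not part of PhyClone. A mutation is
    reported as incomplete when it lacks a row for at least one sample,
    which is the situation in which PhyClone would remove it.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return missing, []

    samples = set()
    seen = {}  # mutation_id -> set of sample_ids that have a row
    for row in reader:
        samples.add(row["sample_id"])
        seen.setdefault(row["mutation_id"], set()).add(row["sample_id"])

    incomplete = sorted(m for m, s in seen.items() if s != samples)
    return missing, incomplete


# Toy data (made up for illustration): mutation "m2" has no row for sample "S2".
example = io.StringIO(
    "mutation_id\tsample_id\tref_counts\talt_counts\tmajor_cn\tminor_cn\tnormal_cn\n"
    "m1\tS1\t90\t10\t1\t1\t2\n"
    "m1\tS2\t80\t20\t1\t1\t2\n"
    "m2\tS1\t70\t30\t2\t1\t2\n"
)
missing_columns, incomplete_mutations = check_input(example)
print(missing_columns, incomplete_mutations)
```

Any mutation reported as incomplete would need ref and alt counts extracted from the affected samples' BAM files, as described in the warning above.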

Cluster file format

Tip

While any mutation pre-clustering method can be used, we recommend PyClone-VI, both for its established strong performance and because its output can be fed directly into PhyClone as-is.

The file should be a tab-delimited tidy data frame with the following columns:

  1. mutation_id - Unique identifier for the mutation.

    This is free form but should match across all samples and must match the identifiers provided in the main input file.

  2. sample_id - Unique identifier for the sample.

  3. cluster_id - Cluster that the mutation has been assigned to.

You can include the following optional columns:

  1. chrom - Chromosome on which mutation_id is found.

  2. cellular_prevalence - Cluster cellular prevalence estimate (included in all PyClone-VI clustering results).

Note

To make use of PhyClone's data-informed loss probability prior assignment, the optional chrom and cellular_prevalence columns are required.

Tip

There is an example file in examples/data/mixing_clusters.tsv
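
Because the cluster file's mutation_id values must match those in the main input file, it can help to compare the two before sampling. The sketch below is an illustration only; unmatched_ids is a hypothetical helper and the toy data is made up.

```python
import csv
import io


def mutation_ids(handle):
    """Collect the mutation_id values of a tab-delimited file."""
    return {row["mutation_id"] for row in csv.DictReader(handle, delimiter="\t")}


def unmatched_ids(input_handle, cluster_handle):
    """Return (only_in_input, only_in_clusters) as sorted lists.

    Hypothetical helper, not part of PhyClone.
    """
    input_ids = mutation_ids(input_handle)
    cluster_ids = mutation_ids(cluster_handle)
    return sorted(input_ids - cluster_ids), sorted(cluster_ids - input_ids)


# Toy data: "m3" appears in the main input but has no cluster assignment.
main_input = io.StringIO(
    "mutation_id\tsample_id\n"
    "m1\tS1\n"
    "m2\tS1\n"
    "m3\tS1\n"
)
clusters = io.StringIO(
    "mutation_id\tsample_id\tcluster_id\n"
    "m1\tS1\t0\n"
    "m2\tS1\t1\n"
)
only_in_input, only_in_clusters = unmatched_ids(main_input, clusters)
print(only_in_input, only_in_clusters)
```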


Running PhyClone

PhyClone analyses are broken into two parts. First, sampling is performed using the run sub-command. Second, the output trace from the sampling run can be summarised as either a point-estimate tree (MAP or Consensus) or topology report.

Sampling can be run as follows:

phyclone run -i INPUT.tsv -c CLUSTERS.tsv -o TRACE.pkl.gz --num-chains 4

This takes INPUT.tsv and (optionally) CLUSTERS.tsv, as described above, and writes the trace file TRACE.pkl.gz in compressed Python pickle format.

Relevant program options:

  • The --num-chains option controls how many independent parallel PhyClone sampling chains to use. Although the default value is 1, PhyClone benefits from running multiple chains; we recommend ≥4 chains if the compute cores can be spared.
  • The -n option controls the number of sampling iterations to perform.
  • The -b option controls the number of burn-in iterations to perform.
  • The --seed option seeds the random number generator for reproducible results.

Note

Burn-in is done using a heuristic strategy of unconditional SMC. All samples from the burn-in are discarded as they will not target the posterior.

  • The -d option selects the emission density.
    • As in PyClone, the binomial and beta-binomial densities are available.
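
When runs are launched from a script, the options above can be assembled programmatically. The following is a sketch under stated assumptions: build_run_command and its parameter names are hypothetical, and only flags described in this section are used. It builds the argument list without executing it; running it would require PhyClone to be installed.

```python
import shlex


def build_run_command(input_tsv, trace_out, clusters_tsv=None,
                      num_chains=4, num_iterations=None, burn_in=None, seed=None):
    """Assemble a `phyclone run` argument list (hypothetical helper)."""
    cmd = ["phyclone", "run", "-i", input_tsv, "-o", trace_out]
    if clusters_tsv is not None:
        cmd += ["-c", clusters_tsv]  # the cluster file is optional
    cmd += ["--num-chains", str(num_chains)]
    if num_iterations is not None:
        cmd += ["-n", str(num_iterations)]
    if burn_in is not None:
        cmd += ["-b", str(burn_in)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_run_command("INPUT.tsv", "TRACE.pkl.gz",
                        clusters_tsv="CLUSTERS.tsv", seed=1)
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with file paths that contain spaces.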

For more advanced options, run:

phyclone run --help

Outlier Modelling

As explored in the PhyClone paper, PhyClone is equipped with the ability to model mutational outliers and loss. There are two main approaches to running PhyClone with outlier modelling:

  1. Using a global outlier probability.
    • If running on un-clustered data, this is the only option available to activate outlier modelling.
      • Use --outlier-prob with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice.

Note

The --outlier-prob option can also be used to set a global loss probability prior on clustered runs.

  2. Assigning the outlier probability from clustered data.
    • PhyClone is also able to assign clusters either a high or low outlier prior probability, based on the input data.
    • This feature requires two columns: chrom, the chromosome of each mutation (which can be supplied in either the data.tsv or cluster.tsv file), and cellular_prevalence, the cluster cellular prevalence (CCF) estimate (which should be included in the cluster.tsv file).
    • To activate this feature, ensure the input files are populated with the appropriate columns and include the --assign-loss-prob flag in the PhyClone run command.

Tip

If using PyClone-VI for clustering, the CCF column comes as part of its results, so you need only add the chrom column to either input file.
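
One way to add the chrom column to a cluster file is sketched below. This is an illustration, not a PhyClone utility: add_chrom_column is a hypothetical helper, and the mutation-to-chromosome mapping is assumed to come from your own variant annotations.

```python
import csv
import io


def add_chrom_column(cluster_handle, chrom_by_mutation, out_handle):
    """Copy a tab-delimited cluster file, adding (or overwriting) a chrom column.

    Hypothetical helper. Raises KeyError if a mutation_id is absent from
    the chrom_by_mutation mapping.
    """
    reader = csv.DictReader(cluster_handle, delimiter="\t")
    fieldnames = list(reader.fieldnames)
    if "chrom" not in fieldnames:
        fieldnames.append("chrom")
    writer = csv.DictWriter(out_handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["chrom"] = chrom_by_mutation[row["mutation_id"]]
        writer.writerow(row)


# Toy cluster table and mapping, made up for illustration.
cluster_text = io.StringIO(
    "mutation_id\tsample_id\tcluster_id\tcellular_prevalence\n"
    "m1\tS1\t0\t0.9\n"
    "m2\tS1\t1\t0.4\n"
)
out = io.StringIO()
add_chrom_column(cluster_text, {"m1": "chr1", "m2": "chrX"}, out)
lines = out.getvalue().splitlines()
print(lines)
```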

Important

With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of -1.
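
Mutations assigned to the -1 clone can be pulled out of a results table with a few lines of Python. The sketch below is hedged heavily: the column names of the results table are not specified in this README, so both are passed in as parameters, and the names used in the toy example (mutation_id, clone_id) are assumptions to be checked against the header of your own table.

```python
import csv
import io


def outlier_mutations(handle, clone_column, mutation_column):
    """Return the sorted, de-duplicated mutations whose clone id is -1.

    Hypothetical helper; clone_column and mutation_column must be set to
    the actual column names found in your results table.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return sorted({row[mutation_column] for row in reader
                   if row[clone_column].strip() == "-1"})


# Toy table with assumed column names, for illustration only.
table = io.StringIO(
    "mutation_id\tsample_id\tclone_id\n"
    "m1\tS1\t0\n"
    "m1\tS2\t0\n"
    "m2\tS1\t-1\n"
    "m2\tS2\t-1\n"
)
lost_or_outlier = outlier_mutations(table, clone_column="clone_id",
                                    mutation_column="mutation_id")
print(lost_or_outlier)
```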


PhyClone Output

PhyClone includes three ways to summarise the results from a sampling trace file. Two produce a point estimate (a single tree); the third reports on, and can optionally build results for, all uniquely sampled topologies:

  1. MAP tree
    • (Recommended) Retrieves the tree with the highest sampled joint-likelihood.
  2. Consensus tree
    • Produces a tree built from the consensus of clades across the entire sample trace.
  3. Topology report and archive
    • Produces a summary report table and (optionally) archive file of all uniquely sampled topologies from an analysis run.

MAP Point-Estimate Tree

To build the PhyClone MAP tree, you can run the map command as follows:

phyclone map -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv

Where TRACE.pkl.gz is the result from a PhyClone sampling run.

Expected output:

  • TREE.nwk, the inferred MAP clone tree topology in Newick format.
  • TABLE.tsv, a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.

For more advanced options, run:

phyclone map --help

Consensus Point-Estimate Tree

To build the PhyClone consensus tree, you can run the consensus command as follows:

phyclone consensus -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv

Where TRACE.pkl.gz is the result from a PhyClone sampling run.

Expected output:

  • TREE.nwk, the inferred consensus clone tree topology in Newick format.
  • TABLE.tsv, a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.

For more advanced options, run:

phyclone consensus --help

Topology Report and Sampled Topologies Archive

Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from a sampling run.

To build the PhyClone topology report and full sampled topologies archive, run the topology-report command as follows:

phyclone topology-report -i TRACE.pkl.gz -o TOPOLOGY_TABLE.tsv -t SAMPLED_TOPOLOGIES.tar.gz

Where TRACE.pkl.gz is the result from a PhyClone sampling run.

Expected output:

  • TOPOLOGY_TABLE.tsv, a high-level report table detailing each topology's log-likelihood, number of times sampled, and topology identifier (which can be used to identify the tree in the accompanying topologies archive).
  • SAMPLED_TOPOLOGIES.tar.gz, a compressed archive in which each folder represents a uniquely sampled topology. Folder names align with the topology identifiers found in TOPOLOGY_TABLE.tsv.

Expected output, for each sampled topology folder in the SAMPLED_TOPOLOGIES.tar.gz (sampled-topologies archive):

  • TREE.nwk, the inferred clone tree topology in Newick format.
  • TABLE.tsv, a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.

Additional options:

  • --top-trees limits the archive to the top x trees, where x is a user-defined value.
    • Trees are ranked by their log-likelihood, so --top-trees 3 would populate the archive with only the 3 most likely trees.
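
To see which topologies an archive contains without unpacking it, the folder names can be listed with Python's standard tarfile module. This is a minimal sketch: topology_folders is a hypothetical helper, it assumes the topology folders sit at the top level of the archive, and the folder and file names in the toy archive are made up.

```python
import io
import tarfile
from pathlib import PurePosixPath


def topology_folders(fileobj):
    """Return the sorted top-level folder names of a .tar.gz archive.

    Hypothetical helper; assumes each topology folder is at the top level.
    """
    folders = set()
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            parts = [p for p in PurePosixPath(member.name).parts if p != "."]
            # A directory entry, or any file nested inside a folder,
            # contributes its first path component.
            if len(parts) > 1 or (parts and member.isdir()):
                folders.add(parts[0])
    return sorted(folders)


# Build a small in-memory archive with two made-up topology folders.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for folder in ("topology_a", "topology_b"):
        data = b"(A,B);\n"
        info = tarfile.TarInfo(name=f"{folder}/TREE.nwk")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
buffer.seek(0)

folders = topology_folders(buffer)
print(folders)
```

For a file on disk, open it in binary mode and pass the handle, for example `with open("SAMPLED_TOPOLOGIES.tar.gz", "rb") as f: topology_folders(f)`. The resulting names can then be matched against the topology identifiers in TOPOLOGY_TABLE.tsv.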

License

PhyClone is licensed under the GNU General Public License v3 or later (GPLv3+), see the LICENSE file for details.
