diff --git a/notebooks/extensions.ipynb b/notebooks/extensions.ipynb
new file mode 100644
index 0000000..562c740
--- /dev/null
+++ b/notebooks/extensions.ipynb
@@ -0,0 +1,118 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "In this tutorial we will explore how different combinations of image channels can be defined for image generation in Cue, as well as how to define\n",
+    "new SV channels and extend the framework to new sequencing platforms.\n",
+    "\n",
+    "### Creating new SV signal sets using existing SV channels\n",
+    "\n",
+    "SV-informative signals are defined in the ```SVSignals``` enum inside ```img/constants```. For example, ```SVSignals.RD``` corresponds to the read-depth channel\n",
+    "and ```SVSignals.RD_CLIPPED``` to the clipped-read channel.\n",
+    "\n",
+    "To generate a specific combination of image channels, we need to define a new ```SV_SIGNAL_SET``` type and provide its channel composition\n",
+    "as a new entry in the ```SV_SIGNALS_BY_TYPE``` dictionary inside ```img/constants```. For example, we can create a new set called ```DEMO``` that includes\n",
+    "the read-depth and clipped-read signals as follows:\n",
+    "\n",
+    "```\n",
+    "SV_SIGNAL_SET = Enum(\"SV_SIGNAL_SET\", 'SHORT '\n",
+    "                                      ...\n",
+    "                                      *'DEMO'*)\n",
+    "SV_SIGNALS_BY_TYPE = {\n",
+    "    SV_SIGNAL_SET.SHORT: [SVSignals.RD, SVSignals.RD_LOW, SVSignals.SR_RP, SVSignals.LLRR, SVSignals.RL,\n",
+    "                          SVSignals.LLRR_VS_LR],\n",
+    "    ...\n",
+    "    *SV_SIGNAL_SET.DEMO: [SVSignals.RD, SVSignals.RD_CLIPPED]*\n",
+    "}\n",
+    "```\n",
+    "Since both of these signal types are already supported by the framework, we just need to specify this new signal set in the data generation YAML\n",
+    "config in order to switch images to this channel configuration. 
In particular, we need to update the ```signal_set``` entry in ```generate.yaml``` as follows:\n",
+    "\n",
+    "```\n",
+    "signal_set: \"DEMO\"\n",
+    "```\n",
+    "\n",
+    "Providing the updated ```generate.yaml``` config file to the ```generate.py``` script will then produce 2-channel images with the read-depth and clipped-read channels.\n",
+    "\n",
+    "### Creating custom SV signals\n",
+    "\n",
+    "In order to create a new SV channel, we need to extend the ```SVSignals``` definition and add this channel to a signal set (we'll use the ```DEMO``` set).\n",
+    "For example, we can define a channel ```RD_MAX``` that computes the scalar maximum of the read depth across two loci as follows:\n",
+    "\n",
+    "```\n",
+    "class SVSignals(str, Enum):\n",
+    "    RD_MAX = \"RD_MAX\"\n",
+    "    ...\n",
+    "\n",
+    "SV_SIGNALS_BY_TYPE = {\n",
+    "    SV_SIGNAL_SET.DEMO: [SVSignals.RD, SVSignals.RD_MAX, SVSignals.RD_CLIPPED]\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "We can then define the computation for this channel inside the ```make_image()``` function of the ```img/datasets``` file by adding\n",
+    "a conditional clause for it:\n",
+    "\n",
+    "```\n",
+    "if signal == constants.SVSignals.RD_MAX:\n",
+    "    counts = self.aln_index.scalar_apply(SVSignals.RD, interval_pair.intervalA, interval_pair.intervalB, op=max)\n",
+    "```\n",
+    "\n",
+    "This definition applies the ```max``` operator to the read counts at each pair of loci represented in the image.\n",
+    "\n",
+    "### Collecting custom alignment features\n",
+    "\n",
+    "In order to generate channels from custom alignment features that are not yet supported by the framework, we need to add these features\n",
+    "to the BAM index file generated by Cue and used to create the image channels.\n",
+    "\n",
+    "The ```seq/aln_index.py``` file contains the logic for BAM indexing. In particular, the method ```add_by_signal(...read, signal...)``` determines\n",
+    "which read alignment properties should be extracted, and how, to support a specific signal. 
The ```read``` input is a pysam ```AlignedSegment``` object,\n",
+    "which can be queried for various alignment properties of a particular read in the BAM file (this function is called on every read in the file).\n",
+    "\n",
+    "For example, in order to support linked reads, we construct a channel from the barcodes associated with each read.\n",
+    "In particular, the split-molecule channel ```SVSignals.SM``` captures how many barcodes were shared by the reads across\n",
+    "each pair of loci in the image. In order to compute the intersection of barcodes across two loci, we store the barcodes observed\n",
+    "at each locus as follows:\n",
+    "```\n",
+    "bin_id = get_bin_id(...read...)\n",
+    "if signal == SVSignals.SM:\n",
+    "    barcode = read.get_tag('BX')\n",
+    "    self.bins[signal][chr_id][bin_id].add(barcode)\n",
+    "```\n",
+    "\n",
+    "In the code above, the read barcode is extracted from the ```BX``` tag in the BAM file and added to the set of barcodes associated with\n",
+    "the locus of this read. Once the information is collected in the index, the functions ```scalar_apply()``` or ```intersect()``` provided\n",
+    "by the ```AlnIndex``` class (in ```aln_index.py```) can be used to implement the desired operation over the values stored in\n",
+    "the bins associated with a pair of loci. In the case of the ```SM``` signal, we can use ```intersect()``` to find the intersection of the barcode\n",
+    "sets at each pair of loci. Additional features specific to a given sequencing platform can be collected and processed using a similar approach."
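+    ,
+    "\n",
+    "To make the index-level operations above concrete, here is a minimal, self-contained sketch of how a scalar operator (such as ```max```) and a barcode intersection could be applied over the bins of two loci. The ```bins``` layout and function signatures below are illustrative assumptions and do not reproduce the actual ```AlnIndex``` implementation:\n",
+    "\n",
+    "```\n",
+    "from collections import defaultdict\n",
+    "\n",
+    "# Illustrative stand-in for the per-signal, per-chromosome bin storage\n",
+    "# (assumption: bins are keyed by signal, chromosome, and bin id)\n",
+    "bins = defaultdict(lambda: defaultdict(dict))\n",
+    "\n",
+    "# Read counts collected for the RD signal at two loci (bin ids 10 and 42)\n",
+    "bins['RD']['chr1'][10] = 7\n",
+    "bins['RD']['chr1'][42] = 12\n",
+    "\n",
+    "# Barcode sets collected for the SM signal at the same two loci\n",
+    "bins['SM']['chr1'][10] = {'BX01', 'BX02', 'BX03'}\n",
+    "bins['SM']['chr1'][42] = {'BX02', 'BX03', 'BX04'}\n",
+    "\n",
+    "def scalar_apply(signal, chrom, bin_a, bin_b, op):\n",
+    "    # apply a scalar operator (e.g. max) to the values stored at two loci\n",
+    "    return op(bins[signal][chrom][bin_a], bins[signal][chrom][bin_b])\n",
+    "\n",
+    "def intersect(signal, chrom, bin_a, bin_b):\n",
+    "    # count the elements (e.g. barcodes) shared by the two loci\n",
+    "    return len(bins[signal][chrom][bin_a] & bins[signal][chrom][bin_b])\n",
+    "\n",
+    "print(scalar_apply('RD', 'chr1', 10, 42, op=max))  # -> 12\n",
+    "print(intersect('SM', 'chr1', 10, 42))             # -> 2\n",
+    "```"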
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file