We just put a new manuscript up on bioRxiv (link). Here is an update on the thoughts behind the analysis pipeline.
Mandalorion: From Nanopore RNAseq Reads to Quantified Isoforms
When I first received an ONT MinION sequencer while I was still a postdoc in Stephen Quake’s lab (it must have been some time in 2014), my first thought was: Single Cell Full-Length RNAseq. I had been working on analyzing single B cells with Illumina at the time and was pretty sure that the low number of genes expressed in a single cell would be a good match for the then relatively low throughput of the MinION device.
It wasn’t meant to be back then, because the first couple of runs failed and we gave up on the device. Fast forward 1.5 years and I’m now running my own lab at UC Santa Cruz, which just happens to be the epicenter of cool nanopore tech development, with Mark Akeson’s lab really leading the way. With their help, Ashley Byrne, one of the graduate students in my lab, got our MinION devices running, and we finally got around to analyzing single B cells.
And the data came out beautifully. Full-length cDNA reads, little-to-no length bias, and widespread isoform variation in the cells we analyzed. Seeing entire isoforms covered by a single read for the first time is pretty special.
Because the throughput of a single flowcell is enough to analyze 2-4 cells right now, Ashley developed a custom barcoding scheme to easily multiplex several cells per flowcell.
At that point we needed a pipeline to demultiplex, align, and systematically identify isoforms. To have a ground truth to work with, we sequenced a synthetic mix of RNA transcripts (Lexogen SIRVs) and developed our Mandalorion pipeline (obscure Star Wars reference? Check!) using that data as a test bed.
The idea behind Mandalorion is quite simple: what if we could write an algorithm that defines isoforms just like you would by hand? So, after aligning and filtering reads, it first identifies isoform features based on those alignments: Transcription Start/End Sites (TSS/TES) and Splice Sites (SS).
It then checks individual read alignments to identify which of those features are used alternatively and sorts raw reads into bins based on their alternative TSS/TES and SS usage. Counting those bins gives us isoform quantification, and creating consensus reads gives us pretty high-quality isoform sequences as well.
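The binning step can be sketched in a few lines. This is just an illustration of the idea, with a hypothetical read format (in the real pipeline the reads come from alignments, and site coordinates would already be collapsed into clustered features):

```python
from collections import defaultdict

def bin_reads(read_alignments):
    """Sort reads into isoform bins by their (TSS, TES, splice-site) signature.

    read_alignments: list of dicts with 'name', 'tss', 'tes', and
    'splice_sites' keys -- a made-up format for illustration.
    """
    bins = defaultdict(list)
    for read in read_alignments:
        signature = (read['tss'], read['tes'], tuple(read['splice_sites']))
        bins[signature].append(read['name'])
    # Bin sizes give isoform quantification; the reads in each bin would
    # then go to a consensus caller (e.g. poaV2) for isoform sequences.
    return dict(bins)

reads = [
    {'name': 'read1', 'tss': 100, 'tes': 5000, 'splice_sites': [400, 900]},
    {'name': 'read2', 'tss': 100, 'tes': 5000, 'splice_sites': [400, 900]},
    {'name': 'read3', 'tss': 100, 'tes': 5000, 'splice_sites': []},  # intron retained
]
bins = bin_reads(reads)
for signature, names in bins.items():
    print(len(names), signature)
```

Here the first two reads share a signature and land in one bin (count 2), while the third, which skips the splice sites, forms its own bin (count 1).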
The pipeline we provide for download here is still very experimental, so let me know if you run into trouble. This is by far the most heavy duty pipeline my lab and I have developed so far, so this is sure to be a learning curve for everybody involved.
There is a README.txt file bundled with the scripts that I hope explains the usage of Mandalorion in a way that somebody familiar with the command line and Python should have little trouble understanding.
A couple of notes:
1.) The genome fasta you use to build the gmap genome index and your gtf annotation file have to use the same chromosome names (e.g. both chr1, not chr1 in one file and 1 in the other)
2.) As provided, the pipeline uses gmap for read alignments. If you took a look at the bioRxiv paper, it uses blat. Overall, the aligners do a very similar job. We are still doing a systematic comparison, but so far it looks like gmap is doing a bit better with splice junctions and read ends and is considerably faster. We continue to use blat for the manuscript for consistency reasons. We have also implemented a split-merge parallelization of blat on our server that gets around the speed issue, but that would be impractical to include in a pipeline like this.
3.) The pipeline uses gmap, blat, and poaV2, all of which are great and have their own citations.
4.) gtf files are weird. Seriously, who came up with that format? Basically, there can be considerable differences from gtf file to gtf file. For now, I have tried to make Mandalorion work with gtf files containing “exon” features and “gene_id” fields. We have run it on Mouse, SIRV, and Drosophila transcripts, all of which had very distinct gtf files, and it worked.
5.) I recommend loading the read alignments and the TSS.bed/TES.bed and SS.bed files into the genome browser to check whether the features make sense to you. The files should be compatible as long as your chromosome names are chr1, chr2, … instead of 1, 2, …
6.) Things Mandalorion will get wrong: TSS/TES that are close together (<60bp) and SS that are close together (<20bp). It will treat those as a single site. Better read accuracy/algorithms should help with this in the future.
7.) We will ultimately move the code to github once I figure out how that works.
8.) We also have some test data with an example sample sheet and gene list to download and test Mandalorion on.
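Since notes 1.) and 5.) both hinge on consistent chromosome names, a quick sanity check along these lines can save some head-scratching. The function names and example data are made up for illustration; with real files you would pass `open('genome.fa')` and `open('annotation.gtf')` instead of the inline lists:

```python
def fasta_chrom_names(lines):
    """Sequence names from fasta header lines ('>chr1 description' -> 'chr1')."""
    return {line[1:].split()[0] for line in lines if line.startswith('>')}

def gtf_chrom_names(lines):
    """Chromosome names from column 1 of non-comment gtf lines."""
    return {line.split('\t')[0] for line in lines
            if line.strip() and not line.startswith('#')}

# Inline example data standing in for a genome fasta and a gtf annotation.
fasta = ['>chr1 some description\n', 'ACGT\n', '>chr2\n', 'TTTT\n']
gtf = ['# comment line\n', '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n']

# Names used in the gtf but absent from the fasta point at exactly the
# 'chr1 vs. 1' mismatch that would make features get silently dropped.
missing = gtf_chrom_names(gtf) - fasta_chrom_names(fasta)
print(missing)  # here: {'1'}
```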
I’m one with the Force and the Force is with me,
Our lab just published its first paper! Here is an update with my thoughts on the paper and some hopefully useful information on how to analyze the data.
TMIseq: Sequencing Full Length Antibody Repertoires
Antibody repertoires have proven to be valuable tools in both basic and translational research. Antibody repertoire sequencing relies on creating amplicons of the antibody heavy chain transcript or gene that contain the incredible antibody diversity created by VDJ recombination, somatic hypermutation, and class-switch recombination.
There are about as many protocols out there to sequence antibody repertoires as there are laboratories sequencing them. All of these protocols are the result of compromises. Even the longest Illumina MiSeq reads (2x300bp) currently don’t perform well enough to sequence complete heavy chain variable regions plus enough of the constant region to call isotype subtypes. Therefore, either information in the variable region or the constant region is missed, or longer-read technology (i.e. 454) has to be used, which is more expensive and error-prone.
I spent most of my time as a postdoc in Steve Quake’s lab at Stanford working with Rene Sit, an extremely talented RA, to develop a protocol that allowed us to correct sequencing and PCR errors in antibody repertoires by using molecular identifiers. While successfully eliminating most sequencing errors, the protocol we developed captured only a limited part of the antibody heavy chain variable and constant regions.
When I started my own lab at UCSC, I wanted to develop a protocol that would combine molecular identifiers to achieve high accuracy with the necessary read length to capture all the information in an antibody heavy chain transcript. At the same time the financial realities of a small lab inherently required this protocol to be inexpensive.
The TMIseq protocol we ultimately settled on achieved both of these aims, with the added benefit of being quite easy to implement. A brisk 8am-6pm work day should be enough to go from RNA to final sequencing library. Wet lab work, in the form of troubleshooting different versions of the protocol and producing the final libraries that made it into the paper, was performed by Research Specialist Camille Scelfo-Dalbey, Undergraduate Researcher Roger Volden, and MS student Sumedha Darmadhikari.
The data this protocol produces is quite complex. Several analysis steps have to be performed to make it from raw read data to assembled antibody sequences. Python scripts to go from a pair of fastq files to assembled amplicons can be downloaded here. Credit for the AMPssembler script (which you can also find on github) that performs the final assembly goes to Charles Cole. Charles wrote this assembler to take advantage of the unique structure of the data our approach generates, which makes it pretty fast.
Please understand that these scripts were written for the exact adapter/primer sequences, similarity cutoffs, and folder structure we used and haven’t been run on more than two (Linux Mint and Ubuntu) computers. What I am trying to say is, they might not work and your computer might catch fire. For example, while I am a little proud of my implementation of the fuzzy string matching of molecular identifiers, which can be computationally quite expensive when not optimized, the current implementation will not work if the cutoffs for adapter differences are set to >5 or >2, respectively. There is also a pre-filter step that only lets reads pass that contain primer sequences, etc. You will also need the awesome trimmomatic program to run the pipeline.
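For the curious, the general idea behind fuzzy adapter/identifier matching can be sketched like this. This is a naive, unoptimized version with made-up function names, not the implementation in the scripts, but it shows the basic mismatch-budget trick:

```python
def hamming_within(a, b, max_diff):
    """True if equal-length strings a and b differ at <= max_diff positions.

    Bails out as soon as the mismatch budget is spent, which is where
    most of the speedup over a full distance computation comes from.
    """
    diffs = 0
    for x, y in zip(a, b):
        if x != y:
            diffs += 1
            if diffs > max_diff:
                return False
    return True

def find_adapter(read, adapter, max_diff=2):
    """Scan a read for a fuzzy occurrence of an adapter sequence.

    Returns the first offset with <= max_diff mismatches, or -1.
    A naive O(len(read) * len(adapter)) sketch for illustration only.
    """
    for i in range(len(read) - len(adapter) + 1):
        if hamming_within(read[i:i + len(adapter)], adapter, max_diff):
            return i
    return -1

# One mismatch (last base) against the adapter at offset 3.
print(find_adapter('AAACGTACGTTTT', 'CGTACGA', max_diff=1))  # 3
```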
To run the pipeline, simply put all scripts in the same folder, edit the path to your trimmomatic binary in Tn5_analysis_v1.py, and run
python Tn5_analysis_v1.py /path/to/content_file
I have provided an example content file in the tar ball containing the scripts. Paired reads for the three libraries of each sample should be in subfolders labeled ‘A’, ‘B’, and ‘Uncut’.
I hope that providing these scripts will lower the entry barrier to using this protocol a little bit. If you have any questions on how to run the pipeline, don’t hesitate to contact us.
May the Force be with you,