De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-seq data in non-model organisms. We also present Trinity-supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples and approaches to identify protein-coding genes. In the procedure, we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sourceforge.net. The run time of this protocol is highly dependent on the size and complexity of data to be analyzed. The example data set analyzed in the procedure detailed herein can be processed in less than 5 h.
Bibliographical note
Funding Information:
Acknowledgments: We are grateful to D. Jaffe and S. Young for access to additional computing resources, to Z. Chen for help in R-scripting, to L. Gaffney for help with figure illustrations, to C. Titus Brown for essential discussions and inspiration related to digital normalization strategies, to G. Marcais and C. Kingsford for supporting the use of their Jellyfish software in Trinity and to B. Walenz for supporting our earlier use of Meryl. We are grateful to our users and their feedback, in particular J. Wortman and P. Bain for comments on earlier drafts of the manuscript. This project has been funded in part (B.J.H.) with Federal funds from the National Institute of Allergy and Infectious Diseases (NIAID), US National Institutes of Health (NIH), Department of Health and Human Services (DHHS), under contract no. HHSN272200900018C. Work was supported by Howard Hughes Medical Institute (HHMI), a NIH PIONEER award, a Center for Excellence in Genome Science grant no. 5P50HG006193-02 from the National Human Genome Research Institute (NHGRI) and the Klarman Cell Observatory at the Broad Institute (A.R.). A.P. was supported by the CSIRO Office of the Chief Executive (OCE). M.Y. was supported by the Clore Foundation. P.B. was supported by the National Science Foundation (NSF) grant no. OCI-1053575 for the Extreme Science and Engineering Discovery Environment (XSEDE) project. B.L. and C.D. were partially supported by NIH grant no. 1R01HG005232-01A1. In addition, B.L. was partially funded by J. Thomson’s MacArthur Professorship and by the Morgridge Institute for Research support for Computation and Informatics in Biology and Medicine. M.L. was supported by the Bundesministerium für Bildung und Forschung via the project ‘NGSgoesHPC’. N.P. was funded by the Fund for Scientific Research, Flanders (Fonds Wetenschappelijk Onderzoek (FWO) Vlaanderen), Belgium. R.H. and R.D.L. were funded by the NSF under grant nos. ABI-1062432 and CNS-0521433 to Indiana University, and by Indiana METACyt Initiative, which is supported in part by Lilly Endowment, Inc. J.B. was supported through a CSIRO eResearch Accelerated Computing Project. Any opinions, findings and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of any of the funding bodies and institutions including the National Science Foundation, the National Center for Genome Analysis Support and Indiana University.