Abstract

We present SeqOthello, an ultra-fast and memory-efficient indexing structure that supports arbitrary sequence queries against large collections of RNA-seq experiments. SeqOthello takes only 5 min and 19.1 GB of memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of the tier-1 fusions curated by the TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs.

Document Type

Article

Publication Date

10-19-2018

Notes/Citation Information

Published in Genome Biology, v. 19, 167, p. 1-13.

© The Author(s). 2018

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Digital Object Identifier (DOI)

https://doi.org/10.1186/s13059-018-1535-9

Funding Information

This work was supported by the US National Science Foundation [grant numbers 1054631 to J.L., CNS-1717948 and CNS-1750704 to C.Q.] and the National Institutes of Health [grant numbers P30CA177558 and 1UL1TR001998-01 to J.L.]. The Seven Bridges Cancer Genomics Cloud has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Contract No. HHSN261201400008C and ID/IQ Agreement No. 17X146 under Contract No. HHSN261201500003I.

Related Content

SeqOthello is open-source software released under the GPL 3.0 license. The source code is available in the GitHub repository at https://github.com/LiuBioinfo/SeqOthello. The SeqOthello version and the scripts used to build and query the SeqOthello mapping are also available on Zenodo.

The results shown here are partially based upon data publicly available from the Sequence Read Archive (SRA) and data generated by the TCGA Research Network. Detailed lists of the datasets used for evaluation are provided in Additional file 2 and Additional file 6, respectively.

All final data generated or analyzed during this study are included in this published article and its supplementary information files.

13059_2018_1535_MOESM1_ESM.pdf (155 kB)
Additional file 1: Figure S1.

13059_2018_1535_MOESM2_ESM.xlsx (51 kB)
Additional file 2: Details of SRA samples used for performance comparison between SeqOthello and other SBT-based methods.

13059_2018_1535_MOESM3_ESM.pdf (12 kB)
Additional file 3: Table S1.

13059_2018_1535_MOESM4_ESM.pdf (53 kB)
Additional file 4: Table S2.

13059_2018_1535_MOESM5_ESM.pdf (142 kB)
Additional file 5: Figure S2.

13059_2018_1535_MOESM6_ESM.xlsx (3527 kB)
Additional file 6: Details of TCGA samples used to construct SeqOthello as well as fusion occurrences detected by querying the index.

13059_2018_1535_MOESM7_ESM.pdf (107 kB)
Additional file 7: Figure S3.

13059_2018_1535_MOESM8_ESM.docx (166 kB)
Additional file 8: Review history.
