Abstract
Over 100 years of studies in Drosophila melanogaster and related species in the genus Drosophila have facilitated key discoveries in genetics, genomics, and evolution. While high-quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus. Recent advances in long-read sequencing allow high-quality genome assemblies for tens or even hundreds of species to be efficiently generated. Here, we utilize Oxford Nanopore sequencing to build an open community resource of genome assemblies for 101 lines of 93 drosophilid species encompassing 14 species groups and 35 sub-groups. The genomes are highly contiguous and complete, with an average contig N50 of 10.5 Mb and greater than 97% BUSCO completeness in 97/101 assemblies. We show that Nanopore-based assemblies are highly accurate in coding regions, particularly with respect to coding insertions and deletions. These assemblies, along with a detailed laboratory protocol and assembly pipelines, are released as a public resource and will serve as a starting point for addressing broad questions of genetics, ecology, and evolution at the scale of hundreds of species.
Document Type
Article
Publication Date
7-19-2021
Digital Object Identifier (DOI)
https://doi.org/10.7554/eLife.66405
Funding Information
National Institute of General Medical Sciences (F32GM135998): Bernard Y Kim
National Institute of General Medical Sciences (R35GM118165): Dmitri A Petrov
National Institute of Diabetes and Digestive and Kidney Diseases (K01DK119582): Jeremy R Wang
National Science Foundation (DEB-1457707): Corbin D Jones
National Institute of General Medical Sciences (R01GM121750): Daniel R Matute
National Institute of General Medical Sciences (R01GM125715): Daniel R Matute
Google (Google Cloud Research Credits): Bernard Y Kim, Jeremy R Wang
National Institute of General Medical Sciences (R35GM122592): Artyom Kopp
National Institute of General Medical Sciences (R35GM119816): Noah Whiteman
Uehara Memorial Foundation (201931028): Teruyuki Matsunaga
Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200178): Marina Stamenković-Radak, Mihailo Jelić, Marija Savić Veselinović
Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200007): Marija Tanasković, Pavle Erić
National Natural Science Foundation of China (32060112): Jian-Jun Gao
Japan Society for the Promotion of Science (JP18K06383): Masayoshi Watada
Horizon 2020 - Research and Innovation Framework Programme (765937-CINCHRON): Giulia Manoli, Enrico Bertolini
Czech Science Foundation (19-13381S): Vladimír Košťál
Japan Society for the Promotion of Science (JP19H03276): Aya Takahashi
National Science Foundation (1345247): Donald K Price
Related Content
All sequencing data and assemblies generated by this study are deposited at NCBI SRA and GenBank under NCBI BioProject PRJNA675888. Accession numbers for all data used but not generated by this study are provided in the supporting files. Dockerfiles and scripts for reproducing pipelines and analyses are provided on GitHub (https://github.com/flyseq/drosophila_assembly_pipelines; copy archived at https://archive.softwareheritage.org/swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead). A detailed wet lab protocol is provided at https://Protocols.io (https://doi.org/10.17504/protocols.io.bdfqi3mw).
The following data sets were generated:
Kim BY, Wang JR (2020) NCBI BioProject ID PRJNA675888. Nanopore-based assembly of many drosophilid genomes. https://www.ncbi.nlm.nih.gov/bioproject/?term=prjna675888
The following previously published data sets were used:
Miller DE (2018) NCBI BioProject ID PRJNA427774. Sequencing and assembly of 14 Drosophila species. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA427774
The Drosophila modENCODE Project (2011) NCBI BioProject ID 63477. modENCODE Drosophila reference genome sequencing (fruit flies). https://www.ncbi.nlm.nih.gov/bioproject/63477
Yang H (2018) NCBI BioProject ID PRJNA484408. DNA-seq of sexed Drosophila grimshawi, Drosophila silvestris, and Drosophila heteroneura. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA484408
Bronski M (2019) NCBI BioProject ID PRJNA554346. Drosophila montium Species Group Genomes Project. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA554346
Rane R (2018) NCBI BioProject ID 476692. Invertebrate sample from Drosophila repleta. https://www.ncbi.nlm.nih.gov/bioproject/476692
Turissini D (2017) NCBI BioProject ID 395473. Fly lines. https://www.ncbi.nlm.nih.gov/bioproject/395473
National Institute of Genetics [Japan] (2016) NCBI BioProject ID PRJDB4817. Genome sequences of 10 Drosophila species. https://www.ncbi.nlm.nih.gov/bioproject/PRJDB4817
Ellison C (2019) NCBI BioProject ID PRJNA550077. Raw genomic sequencing data from 16 Drosophila species. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA550077
Repository Citation
Kim, Bernard Y.; Wang, Jeremy R.; Miller, Danny E.; Barmina, Olga; Delaney, Emily; Thompson, Ammon; Comeault, Aaron A.; Peede, David; D'Agostino, Emmanuel R. R.; Pelaez, Julianne; Aguilar, Jessica M.; Haji, Diler; Matsunaga, Teruyuki; Armstrong, Ellie E.; Zych, Molly; Ogawa, Yoshitaka; Stamenković-Radak, Marina; Jelić, Mihailo; Veselinović, Marija Savić; Tanasković, Marija; and Davis, Jeremy S., "Highly Contiguous Assemblies of 101 Drosophilid Genomes" (2021). Biology Faculty Publications. 215.
https://uknowledge.uky.edu/biology_facpub/215
Supplementary file 1: Detailed information on both long-read and short-read data used for this project, including accession numbers if publicly available data were used for assembly. https://cdn.elifesciences.org/articles/66405/elife-66405-supp1-v2.xlsx
elife-66405-supp2-v2.xlsx (32 kB)
Supplementary file 2: Assembly summary statistics and genome size estimates. https://cdn.elifesciences.org/articles/66405/elife-66405-supp2-v2.xlsx
elife-66405-supp3-v2.xlsx (30 kB)
Supplementary file 3: Counts of SNPs, indels, and per-site heterozygosity estimated from both long reads and short reads. https://cdn.elifesciences.org/articles/66405/elife-66405-supp3-v2.xlsx
elife-66405-supp4-v2.xlsx (27 kB)
Supplementary file 4: Consensus quality scores estimated with reference-free and reference-based methods. https://cdn.elifesciences.org/articles/66405/elife-66405-supp4-v2.xlsx
elife-66405-supp5-v2.xlsx (11 kB)
Supplementary file 5: Characterization of all coding sequence indel differences between the Nanopore and Release 6 reference D. melanogaster assemblies. https://cdn.elifesciences.org/articles/66405/elife-66405-supp5-v2.xlsx
elife-66405-supp6-v2.xlsx (41 kB)
Supplementary file 6: Detailed sample information. https://cdn.elifesciences.org/articles/66405/elife-66405-supp6-v2.xlsx
elife-66405-transrepform-v2.pdf (201 kB)
Transparent reporting form: https://cdn.elifesciences.org/articles/66405/elife-66405-transrepform-v2.pdf
Notes/Citation Information
Published in eLife, v. 10, e66405.
© 2021, Kim et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
The first 20 authors and the author from the University of Kentucky are shown on the author list above. Please refer to the downloaded document for the complete author list.