Abstract

Over 100 years of studies in Drosophila melanogaster and related species in the genus Drosophila have facilitated key discoveries in genetics, genomics, and evolution. While high-quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus. Recent advances in long-read sequencing allow high-quality genome assemblies for tens or even hundreds of species to be efficiently generated. Here, we utilize Oxford Nanopore sequencing to build an open community resource of genome assemblies for 101 lines of 93 drosophilid species encompassing 14 species groups and 35 sub-groups. The genomes are highly contiguous and complete, with an average contig N50 of 10.5 Mb and greater than 97% BUSCO completeness in 97/101 assemblies. We show that Nanopore-based assemblies are highly accurate in coding regions, particularly with respect to coding insertions and deletions. These assemblies, along with a detailed laboratory protocol and assembly pipelines, are released as a public resource and will serve as a starting point for addressing broad questions of genetics, ecology, and evolution at the scale of hundreds of species.

Document Type

Article

Publication Date

7-19-2021

Notes/Citation Information

Published in eLife, v. 10, e66405.

© 2021, Kim et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

The first 20 authors and the author from the University of Kentucky are shown on the author list above. Please refer to the downloaded document for the complete author list.

Digital Object Identifier (DOI)

https://doi.org/10.7554/eLife.66405

Funding Information

National Institute of General Medical Sciences (F32GM135998): Bernard Y Kim

National Institute of General Medical Sciences (R35GM118165): Dmitri A Petrov

National Institute of Diabetes and Digestive and Kidney Diseases (K01DK119582): Jeremy R Wang

National Science Foundation (DEB-1457707): Corbin D Jones

National Institute of General Medical Sciences (R01GM121750): Daniel R Matute

National Institute of General Medical Sciences (R01GM125715): Daniel R Matute

Google (Google Cloud Research Credits): Bernard Y Kim, Jeremy R Wang

National Institute of General Medical Sciences (R35GM122592): Artyom Kopp

National Institute of General Medical Sciences (R35GM119816): Noah Whiteman

Uehara Memorial Foundation (201931028): Teruyuki Matsunaga

Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200178): Marina Stamenković-Radak, Mihailo Jelić, Marija Savić Veselinović

Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200007): Marija Tanasković, Pavle Erić

National Natural Science Foundation of China (32060112): Jian-Jun Gao

Japan Society for the Promotion of Science (JP18K06383): Masayoshi Watada

Horizon 2020 - Research and Innovation Framework Programme (765937-CINCHRON): Giulia Manoli, Enrico Bertolini

Czech Science Foundation (19-13381S): Vladimír Košťál

Japan Society for the Promotion of Science (JP19H03276): Aya Takahashi

National Science Foundation (1345247): Donald K Price

Related Content

All sequencing data and assemblies generated by this study are deposited at NCBI SRA and GenBank under NCBI BioProject PRJNA675888. Accession numbers for all data used but not generated by this study are provided in the supporting files. Dockerfiles and scripts for reproducing pipelines and analyses are provided on GitHub (https://github.com/flyseq/drosophila_assembly_pipelines; copy archived at https://archive.softwareheritage.org/swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead). A detailed wet lab protocol is provided at protocols.io (https://doi.org/10.17504/protocols.io.bdfqi3mw).

The following data sets were generated:

Kim BY, Wang JR (2020) NCBI BioProject ID PRJNA675888. Nanopore-based assembly of many drosophilid genomes. https://www.ncbi.nlm.nih.gov/bioproject/?term=prjna675888

The following previously published data sets were used:

Miller DE (2018) NCBI BioProject ID PRJNA427774. Sequencing and assembly of 14 Drosophila species. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA427774

The Drosophila modENCODE Project (2011) NCBI BioProject ID 63477. modENCODE Drosophila reference genome sequencing (fruit flies). https://www.ncbi.nlm.nih.gov/bioproject/63477

Yang H (2018) NCBI BioProject ID PRJNA484408. DNA-seq of sexed Drosophila grimshawi, Drosophila silvestris, and Drosophila heteroneura. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA484408

Bronski M (2019) NCBI BioProject ID PRJNA554346. Drosophila montium Species Group Genomes Project. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA554346

Rane R (2018) NCBI BioProject ID 476692. Invertebrate sample from Drosophila repleta. https://www.ncbi.nlm.nih.gov/bioproject/476692

Turissini D (2017) NCBI BioProject ID 395473. Fly lines. https://www.ncbi.nlm.nih.gov/bioproject/395473

National Institute of Genetics [Japan] (2016) NCBI BioProject ID PRJDB4817. Genome sequences of 10 Drosophila species. https://www.ncbi.nlm.nih.gov/bioproject/PRJDB4817

Ellison C (2019) NCBI BioProject ID PRJNA550077. Raw genomic sequencing data from 16 Drosophila species. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA550077

elife-66405-supp1-v2.xlsx (24 kB)
Supplementary file 1: Detailed information on both long-read and short-read data used for this project, including accession numbers if publicly available data were used for assembly. https://cdn.elifesciences.org/articles/66405/elife-66405-supp1-v2.xlsx

elife-66405-supp2-v2.xlsx (32 kB)
Supplementary file 2: Assembly summary statistics and genome size estimates. https://cdn.elifesciences.org/articles/66405/elife-66405-supp2-v2.xlsx

elife-66405-supp3-v2.xlsx (30 kB)
Supplementary file 3: Counts of SNPs, indels, and per-site heterozygosity estimated from both long reads and short reads. https://cdn.elifesciences.org/articles/66405/elife-66405-supp3-v2.xlsx

elife-66405-supp4-v2.xlsx (27 kB)
Supplementary file 4: Consensus quality scores estimated with reference-free and reference-based methods. https://cdn.elifesciences.org/articles/66405/elife-66405-supp4-v2.xlsx

elife-66405-supp5-v2.xlsx (11 kB)
Supplementary file 5: Characterization of all coding sequence indel differences between Nanopore and Release 6 reference D. melanogaster assemblies. https://cdn.elifesciences.org/articles/66405/elife-66405-supp5-v2.xlsx

elife-66405-supp6-v2.xlsx (41 kB)
Supplementary file 6: Detailed sample information. https://cdn.elifesciences.org/articles/66405/elife-66405-supp6-v2.xlsx

elife-66405-transrepform-v2.pdf (201 kB)
Transparent reporting form: https://cdn.elifesciences.org/articles/66405/elife-66405-transrepform-v2.pdf
