Comparing De Novo Genome Assembly: The Long and Short of It
Giuseppe Narzisi¹*, Bud Mishra¹,²
1 Courant Institute of Mathematical Sciences, New York University, New York, New York, United States of America
2 NYU School of Medicine, New York University, New York, New York, United States of America
Abstract
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have
rekindled interest in the whole-genome sequence assembly (WGSA) problem, inundating the field with a
plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to
comparative assessments of these assemblers’ quality and accuracy. No commonly accepted, standardized method for
comparison exists yet. Even worse, widely used metrics for comparing assembled sequences emphasize only size and
capture contig quality and accuracy poorly. This paper addresses these concerns: it highlights common anomalies in
assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage,
contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC
transparently captures the trade-off between contig quality and contig size. For this purpose, most of the publicly
available major sequence assemblers – for both low-coverage long (Sanger) and high-coverage short (Illumina) read
technologies – are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia,
Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-
leng
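
The abstract's point that size-based metrics say little about correctness can be illustrated with N50, one of the standard metrics it names. The sketch below uses the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length). The function name `n50` and the contig lengths are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (common definition).

    Contigs are visited from longest to shortest; N50 is the length of
    the contig at which the running total first reaches half of the
    total assembled length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths (in kbp), for illustration only.
correct_assembly = [80, 70, 50, 40, 30, 20]
# Same sequence content, but the 50 and 40 kbp contigs have been joined.
# If that join is a misassembly, the result is less accurate, yet longer.
misjoined_assembly = [90, 80, 70, 30, 20]

print(n50(correct_assembly))    # 70
print(n50(misjoined_assembly))  # 80
```

In this toy case the assembly containing the join scores a higher N50 whether or not the join is correct, which is the kind of blind spot a quality-aware comparison is meant to expose.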