Thursday, February 14, 2013

Improving PacBio Long Read Accuracy by Short Read Alignment

Hi,

I am very excited to start using third generation sequencing (TGS) data. We aim to use a hybrid assembly approach to produce a good quality genome. As you probably know, PacBio produces reads of 2,500-10,000 base pairs, but these long reads have error rates of up to 15%.

Here, we discuss the LSC approach, which uses a homopolymer compression (HC) transformation. In simple words, every run of identical consecutive bases is merged into a single base, e.g., 'TTT' is replaced by 'T'. The authors claim that datasets are reduced to 60% of their size at the cost of a little sensitivity and specificity.
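To make the HC transformation concrete, here is a minimal sketch in Python. It is only my illustration of the idea described above, not the LSC implementation; the function names are made up, and keeping the run lengths so that a read can be expanded again is my own assumption about what a corrector would need.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical consecutive bases into one base."""
    return "".join(base for base, _run in groupby(seq))


def homopolymer_compress_with_runs(seq):
    """Compress and also record each run length (hypothetical helper)."""
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(len(list(run)))
    return "".join(bases), run_lengths


def decompress(compressed, run_lengths):
    """Rebuild the original sequence from a compressed read and its run lengths."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


if __name__ == "__main__":
    read = "AATTTGC"
    compressed, runs = homopolymer_compress_with_runs(read)
    print(compressed, runs)
    print(decompress(compressed, runs) == read)
```

Two reads that differ only in the length of a homopolymer run become identical after compression, which is presumably why this helps with reads that carry many insertion and deletion errors.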

In principle, the method sounds very simple. Long and short reads are transformed as shown in the following figure. Reads with fewer than 40 non-N bases are filtered out. In the third step, short reads are mapped onto the long reads using NovoAlign, but we can also use BWA or Seqalto.
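The filtering step can be sketched like this. Again, this is just a rough illustration under my own assumptions: the function names are hypothetical, the check is applied to whatever sequence is passed in, and I treat both 'N' and 'n' as ambiguous bases.

```python
MIN_NON_N = 40  # threshold mentioned above


def count_non_n(seq):
    """Count the bases in a read that are not the ambiguous base N."""
    return sum(1 for base in seq.upper() if base != "N")


def passes_filter(seq, min_non_n=MIN_NON_N):
    """Keep a read only if it has at least min_non_n non-N bases."""
    return count_non_n(seq) >= min_non_n


def filter_reads(reads, min_non_n=MIN_NON_N):
    """Return the reads that survive the filter, in their original order."""
    return [seq for seq in reads if passes_filter(seq, min_non_n)]


if __name__ == "__main__":
    reads = ["ACGT" * 10, "ACGT" * 9 + "NNNN"]
    print(len(filter_reads(reads)))
```

In the small example, the first read has exactly 40 non-N bases and is kept, while the second has only 36 and is dropped.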

The long reads are then modified based on the alignments, which corrects a substantial number of nucleotide sites. In their example, the authors used 100,000 LRs and 60 million SRs. It takes around 10 hours and 20 Gb of space, around 10 times less time and space compared to the PacBioToCA tools.




This review represents my understanding of the article. For a better explanation, please refer to....

http://www.ncbi.nlm.nih.gov/pubmed/23056399  
