Saturday, February 16, 2013

Sequence assembly demystified

This is another good article by Niranjan with Dr. Pop. They have produced a wealth of articles on NGS analysis during Niranjan's stay at the Computer Science department of the University of Maryland.

Here the authors address commonly known issues with genomic/transcriptomic assembly, and the trade-offs of many approaches are discussed in detail. Some issues during assembly, such as the extent of repetitive genomic sequence, will always affect genome assembly quality independent of the assembly tool being used.

The article also provides a brief overview of the different classes of algorithms used to merge short reads into contigs. These approaches are:
1. Greedy - extending contigs locally using the maximum overlap between reads - e.g., TIGR Assembler
2. Overlap-layout-consensus - overlaps are computed independently between all read pairs and a graph is built in which reads are nodes connected by overlap edges - e.g., Celera Assembler
3. De Bruijn graph - a k-mer based approach that connects reads through exact (k-1)-base overlaps between k-mers. Commonly used; errors are corrected both before and after assembly to produce good quality output - e.g., Velvet, SOAPdenovo and ALLPATHS (see the sketch after this list).

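To make the k-mer idea concrete, below is a minimal Python sketch (my own toy code, not taken from any of the assemblers above) of de Bruijn graph construction: every k-mer in a read adds an edge between its two overlapping (k-1)-mers, and contigs correspond to unambiguous paths through this graph.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge
    from its (k-1)-prefix to its (k-1)-suffix."""
    edges = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
    return edges

# Toy example: two overlapping reads drawn from the sequence ACGTAC
graph = de_bruijn_graph(["ACGTA", "CGTAC"], k=4)
for prefix, suffixes in sorted(graph.items()):
    print(prefix, "->", suffixes)
# ACG -> ['CGT']
# CGT -> ['GTA', 'GTA']
# GTA -> ['TAC']
```
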
Some of the assembly tools and the sequencing technologies they support:

Genome assemblers
ALLPATHS-LG - Illumina, Pacific Biosciences
SOAPdenovo - Illumina
Velvet - Illumina, SOLiD, 454, Sanger
ABySS - Illumina, SOLiD, 454, Sanger

Transcriptome assemblers
Trinity - Illumina, 454
Oases - Illumina, SOLiD, 454, Sanger

The authors summarize criteria to consider when planning sequencing in order to obtain good quality assemblies. They also urge closer interaction between molecular biologists and bioinformaticians.

The article also discusses approaches for evaluating assemblies. People mostly use parameters such as N50 to describe quality (a small sketch of its computation is given below), but such metrics can be misleading for sequences present at varying abundance, i.e., in transcriptomics and metagenomics. Tools such as AMOSvalidate, FR-curve and GAV can be used for quality evaluation. We evaluated the Lotus genome assembly using BreakDancer, which detects structural genomic variants; it requires remapping all the reads back to the target assembly.
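For reference, here is a minimal Python sketch (my own, not from the article) of how N50 is typically computed; note that it is purely a length statistic and says nothing about assembly correctness, which is why feature-based tools like those above are needed.

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    return 0

# Two assemblies with the same total size but different contiguity
print(n50([80, 70, 50, 30, 20]))   # 70
print(n50([50, 50, 50, 50, 50]))   # 50
```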

The following figure is from the article, where the authors show an example FR-curve relating assembly size to the corresponding number of errors detected by mapping against the correct sequence. The authors ask readers not to draw conclusions about the tools' performance from the figure.




--------------------------------------------------------------------------------------------------
I do not intend to break any copyright rules.
Please find article at: http://www.ncbi.nlm.nih.gov/pubmed/?term=Sequence+assembly+demystified 
Authors: Niranjan Nagarajan and Mihai Pop
Journal: Nature Reviews Genetics
Publishers: Macmillan Publishers Limited
If you would like review to be removed, please write me at vikas0633@gmail.com
--------------------------------------------------------------------------------------------------


Thursday, February 14, 2013

Improving PacBio Long Read Accuracy by Short Read Alignment

Hi,

I am very excited to start using third generation sequencing (TGS) data; we aim to use a hybrid assembly approach to produce a good quality genome. As you probably know, PacBio produces reads of 2,500-10,000 base pairs, but these long reads have error rates of up to 15%.

Here, we discuss the LSC approach, which uses a homopolymer compression (HC) transformation. In simple words, every run of identical consecutive bases is merged into one, e.g., 'TTT' is replaced by 'T' (a small sketch is shown below). The authors claim that the datasets are reduced to 60% of their original size at little cost in sensitivity and specificity.
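A minimal Python sketch of this transformation as I understand it (my own toy code; a real implementation would also need to record where the runs were so that corrections can be mapped back to the original coordinates):

```python
import re

def homopolymer_compress(seq):
    """Collapse every run of identical consecutive bases into a single base,
    e.g. 'GATTTACAA' -> 'GATACA'."""
    return re.sub(r'(.)\1+', r'\1', seq)

print(homopolymer_compress("GATTTACAA"))  # GATACA
```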

In principle, the method sounds very simple. Long and short reads are HC-transformed as shown in the following figure. Reads with fewer than 40 non-N bases are filtered out (a simple illustration of such a filter is given below). In the third step, short reads are mapped onto the long reads using NovoAlign, but BWA or SeqAlto can also be used.
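For illustration, such a length filter might look like the sketch below (my own code; the function name and the interpretation of the 40-base cut-off as non-N bases are assumptions, not the authors' implementation):

```python
def passes_filter(seq, min_informative=40):
    """Keep a read only if it contains at least `min_informative` non-N bases."""
    return sum(1 for base in seq.upper() if base != 'N') >= min_informative

reads = ["ACGT" * 15, "N" * 100, "ACGTN" * 10]
kept = [r for r in reads if passes_filter(r)]
print(len(kept))  # 2 (the all-N read is discarded)
```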

The long reads are then modified based on the alignments, which corrects a substantial number of nucleotide sites. In the example, the authors used 100,000 LRs and 60 million SRs; the run took around 10 hours and 20 GB of space, roughly 10 times less time and space compared to the PacBioToCA tool.




This review represents my understanding of the article. For a better explanation, please refer to:

http://www.ncbi.nlm.nih.gov/pubmed/23056399