This is another good article by Niranjan with Dr. Pop. They produced a wealth of articles on NGS analysis during Niranjan's stay at the Computer Science department of the University of Maryland.
Here the authors address commonly known issues with genomic/transcriptomic assembly. The trade-offs of many approaches are also discussed in detail. Some issues, such as the extent of repetitive genomic sequence, will always affect genome assembly quality, independent of the assembly tool being used.
The article also provides a brief overview of three different algorithms used for merging short reads into contigs. These approaches are:
1. Greedy - extends local contigs based on the maximum overlap between reads - TIGR Assembler
2. Overlap-layout-consensus - computes overlaps between all read pairs independently, then builds a graph with reads as nodes, connected by edges where they overlap - Celera Assembler
3. De Bruijn graph - k-mer-based approach; reads are broken into k-mers, which are connected through k-1 base overlaps. Commonly used. Errors are corrected both before and after assembly to produce good-quality output - Velvet, SOAPdenovo and ALLPATHS.
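To make the de Bruijn idea concrete, here is a minimal sketch, not how Velvet or any other assembler is actually implemented. The function names, the toy reads and k=4 are made up for illustration, and the sketch assumes error-free reads with no repeats.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Connect consecutive k-mers of each read; neighbours overlap by k-1 bases."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each k-mer has exactly one successor.

    Simplification: only out-degree is checked, so this is not a full
    unitig construction (real tools also look at in-degree and coverage).
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle, e.g. a repeat
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATGG")
print(contig)
```

With repeats longer than k-1 bases, some k-mers would have more than one successor and the walk would stop there, which is one way to see why repetitive sequence limits assembly quality regardless of the tool.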
Some of the assembly tools and the sequencing technologies they support:
Genome assemblers
ALLPATHS-LG - Illumina, Pacific Biosciences
SOAPdenovo - Illumina
Velvet - Illumina, SOLiD, 454, Sanger
ABySS - Illumina, SOLiD, 454, Sanger
Transcriptome assemblers
Trinity - Illumina, 454
Oases - Illumina, SOLiD, 454, Sanger
The authors summarize criteria to consider when sequencing in order to obtain good-quality assemblies. They also urge more interaction between molecular biologists and bioinformaticians.
The article also discusses approaches for evaluating assemblies. People mostly use statistics such as N50 to describe quality, but these can be misleading when the sequences are present in varying abundance, as in transcriptomics and metagenomics. Tools such as AMOSvalidate, FR-curve and GAV can be used for quality evaluation. We evaluated the Lotus genome assembly using BreakDancer, which detects structural genomic variants. It requires remapping all the reads back to the target assembly.
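As a rough illustration of why N50 alone can mislead, here is a small sketch that computes N50 for two hypothetical sets of contig lengths. The numbers are invented for this example and do not come from the article.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical transcript assembly, each transcript assembled separately.
separate = [1500, 1200, 900, 600, 300]
# Same bases, but the two longest transcripts wrongly fused into one contig.
misjoined = [2700, 900, 600, 300]

print(n50(separate))   # 1200
print(n50(misjoined))  # 2700
```

The misjoined set scores a higher N50 even though it contains an assembly error, so N50 is best read together with a correctness check such as remapping the reads.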
The following figure is from the article, where the authors show an example of an FR-curve relating assembly size to the number of errors detected by mapping to the correct sequence. The authors ask readers not to draw any conclusions about the tools' performance from the figure.
--------------------------------------------------------------------------------------------------
I do not intend to break any copyright rules.
Please find article at: http://www.ncbi.nlm.nih.gov/pubmed/?term=Sequence+assembly+demystified
Authors: Niranjan Nagarajan and Mihai Pop
Journal: Nature Reviews Genetics
Publishers: Macmillan Publishers Limited
If you would like this review to be removed, please write to me at vikas0633@gmail.com
--------------------------------------------------------------------------------------------------