Friday, August 9, 2013
Transcriptome analysis at four developmental stages of grape berry (Vitis vinifera cv. Shiraz) provides insights into regulated and coordinated gene expression
Hi,
I have been working on RNA-seq data from blueberry for a while, and a couple of days ago I came across this article. It describes the developmental stages of grape berry using gene expression data.
Like blueberry, grape is a berry-producing plant, so it makes more sense to compare blueberry against grape than against Arabidopsis. Both blueberry and grape produce the anthocyanins responsible for the red color of ripening. The authors motivate the use of RNA-seq over microarrays for finding the genes that control the biological processes involved in berry development. They sequenced a single lane on an Illumina HiSeq 2000, getting around 16 billion bases. They do not have any replicates, and all four sample types have about 40 million reads. For transcript reconstruction they used something called GAZE, which I had not heard of, and I wonder why they did not use the well-established TopHat/Cufflinks. In particular, I liked the clustering in the figure, done with Cluster 3.0.
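Cluster 3.0 essentially does hierarchical clustering of expression profiles across samples. As a rough sketch of what that kind of clustering looks like (the expression values, gene count, and cluster count below are made up for illustration, not taken from the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical expression values for 5 genes across 4 developmental stages
expr = np.array([
    [1.0,  5.0, 20.0, 40.0],   # rises through ripening
    [2.0,  6.0, 18.0, 35.0],   # similar rising profile
    [30.0, 25.0, 10.0,  2.0],  # falls through ripening
    [28.0, 22.0,  8.0,  1.0],  # similar falling profile
    [10.0, 10.0, 11.0,  9.0],  # roughly flat
])

# Correlation distance groups genes by profile shape, not absolute level
dist = pdist(expr, metric="correlation")
tree = linkage(dist, method="average")
clusters = fcluster(tree, t=3, criterion="maxclust")
print(clusters)  # genes with similar profiles share a cluster label
```

With correlation distance, the two rising genes land in one cluster, the two falling genes in another, and the flat gene on its own.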
--------------------------------------------------------------------------------------------------
I do not intend to break any copyright rules.
Please find article at: http://www.biomedcentral.com/1471-2164/13/691
Authors: Crystal Sweetman, Darren CJ Wong, Christopher M Ford and Damian P Drew
journal: BMC Genomics
Publishers: BioMed Central Ltd.
If you would like review to be removed, please write me at vikas0633@gmail.com
--------------------------------------------------------------------------------------------------
Saturday, February 16, 2013
Sequence assembly demystified
This is another good article by Niranjan with Dr. Pop. They produced a wealth of articles on NGS analysis during Niranjan's stay at the Computer Science department of the University of Maryland.
Here the authors address commonly known issues with genomic/transcriptomic assembly, and the trade-offs of many approaches are discussed in detail. Some issues, such as the extent of repetitive genomic sequence, will always affect genome assembly quality independent of the assembly tool being used.
The article also provides a brief overview of the main algorithmic approaches used to merge short reads into contigs. These approaches are:
1. Greedy - extend contigs locally based on the maximum overlap between reads - TIGR Assembler
2. Overlap-layout-consensus - compute overlaps between all read pairs independently, then build a graph of reads as nodes connected by overlap edges - Celera Assembler
3. de Bruijn graph - a k-mer-based approach that connects reads through (k-1)-base overlaps; the most commonly used today. Errors are corrected both before and after assembly to produce good-quality output - Velvet, SOAPdenovo and ALLPATHS.
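The de Bruijn idea is easy to sketch: break each read into k-mers and connect each k-mer's (k-1)-base prefix to its (k-1)-base suffix. A toy illustration (the reads and k are chosen arbitrarily; this is the concept, not how Velvet or SOAPdenovo implement it internally):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Graph whose nodes are (k-1)-mers; each k-mer in a read adds one edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # (k-1)-prefix -> (k-1)-suffix
    return graph

# Two toy reads that share the k-mer "GGCG"
g = de_bruijn(["ATGGCG", "GGCGTG"], k=4)
for node, neighbors in sorted(g.items()):
    print(node, "->", neighbors)
```

Walking an Eulerian-style path through such a graph (ATG → TGG → GGC → GCG → CGT → GTG) spells out the merged contig ATGGCGTG.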
Some of the assembly tools and the sequencing technologies they support:
Genome assemblers
ALLPATHS-LG - Illumina, Pacific Biosciences
SOAPdenovo - Illumina
Velvet - Illumina, SOLiD, 454, Sanger
ABySS - Illumina, SOLiD, 454, Sanger
Transcriptome assemblers
Trinity - Illumina, 454
Oases - Illumina, SOLiD, 454, Sanger
The authors summarize criteria to consider when sequencing in order to obtain good-quality assemblies, and they urge more interaction between molecular biologists and bioinformaticians.
The article also covers approaches for evaluating assemblies. Metrics such as N50 are most often used to describe quality, but they can be misleading for sequences present at varying abundance, i.e., in transcriptomics and metagenomics. Tools such as AMOSvalidate, FR-curve and GAV can be used for quality evaluation. We evaluated the Lotus genome assembly using BreakDancer, which detects structural genomic variants; it requires remapping all the reads back to the target assembly.
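For reference, N50 itself is simple to compute: sort contig lengths in decreasing order and take the length at which the running sum first reaches half the total assembly size. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Same total assembly size (300 bp), very different contiguity
print(n50([100, 80, 60, 40, 20]))  # -> 80
print(n50([150, 60, 50, 40]))      # -> 150
```

This also shows why N50 alone is misleading: it says nothing about whether those long contigs are actually correct.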
The following figure from the article shows an example FR-curve: the relationship between assembly size and the number of errors detected by mapping against the correct sequence. The authors ask readers not to draw conclusions about the tools' performance from this figure.
--------------------------------------------------------------------------------------------------
I do not intend to break any copyright rules.
Please find article at: http://www.ncbi.nlm.nih.gov/pubmed/?term=Sequence+assembly+demystified
Authors: Niranjan Nagarajan and Mihai Pop
journal: Nature Reviews Genetics
Publishers: Macmillan Publishers Limited
If you would like review to be removed, please write me at vikas0633@gmail.com
--------------------------------------------------------------------------------------------------
Thursday, February 14, 2013
Improving PacBio Long Read Accuracy by Short Read Alignment
Hi,
I am very excited to start using third-generation sequencing (TGS) data; we aim to use a hybrid assembly approach to produce a good-quality genome. As you probably know, PacBio produces reads of 2,500-10,000 base pairs, but these long reads have error rates of up to 15%.
Here, we discuss the LSC approach, which uses a homopolymer compression (HC) transformation. In simple words, each run of identical consecutive bases is merged into one, i.e., 'TTT' is replaced by 'T'. The authors claim the datasets are reduced to 60% of their size at the cost of a little sensitivity and specificity.
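The HC transformation itself is trivial to sketch (this is just the compression idea, not the actual LSC code):

```python
from itertools import groupby

def hc_transform(seq):
    """Collapse each homopolymer run to a single base: 'TTT' -> 'T'."""
    return "".join(base for base, _ in groupby(seq))

print(hc_transform("AATTTGGC"))  # -> ATGC
```

Since many PacBio errors are insertions/deletions inside homopolymer runs, short and long reads that disagree only in run length become identical after this transformation, which is what makes the subsequent short-to-long alignment step work.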
In principle, the method sounds very simple. Long and short reads are transformed as shown in the following figure. Reads with fewer than 40 non-N bases are filtered out. In the third step, short reads are mapped onto the long reads using Novoalign, though BWA or SeqAlto can also be used.
The long reads are then modified based on the alignments, correcting a substantial number of nucleotide sites. In their example, the authors used 100,000 long reads (LRs) and 60 million short reads (SRs); this takes around 10 hours and 20 GB of space, roughly 10 times less time and space than the PacBioToCA tool.
This review represents my understanding of the article. For a better explanation, please refer to:
http://www.ncbi.nlm.nih.gov/pubmed/23056399