Bioinformatics Article Reviews
Friday, August 9, 2013
Transcriptome analysis at four developmental stages of grape berry (Vitis vinifera cv. Shiraz) provides insights into regulated and coordinated gene expression
Hi,
I have been working on RNA-seq data from blueberry for a while, and a couple of days ago I came across this article. It describes the developmental stages of grape berry using gene expression data.
Like blueberry, grape is a berry-producing plant, so it makes more sense to compare blueberry against grape rather than against Arabidopsis. Both blueberry and grape produce the anthocyanins responsible for the red color of ripening fruit. The authors motivate the use of RNA-seq over microarrays for finding the genes controlling the biological processes involved in berry development. They sequenced a single lane on an Illumina HiSeq 2000, getting around 16 billion bases. They do not have any replicates, and each of the four sample types has about 40 million reads. For transcript reconstruction they used something called GAZE, which I know nothing about, and I wonder why they did not use the well-established TopHat/Cufflinks. In particular, I liked the clustering figure produced with Cluster 3.0.
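Since I liked that clustering figure, here is a minimal sketch of the kind of hierarchical clustering Cluster 3.0 performs, done with SciPy instead; the gene-by-stage matrix, its values and the gene descriptions are made up by me, not taken from the paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# rows = genes, columns = four berry developmental stages (hypothetical values)
expression = np.array([
    [5.1, 7.3, 9.8, 12.0],   # e.g. a gene ramping up towards ripening
    [8.0, 6.5, 3.2, 1.1],    # a gene switched off after the early stages
    [2.2, 2.4, 2.1, 2.3],    # a roughly constant gene
])

# cluster genes by the correlation of their expression profiles
distances = pdist(expression, metric="correlation")
tree = linkage(distances, method="average")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(clusters)  # one cluster label per gene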
--------------------------------------------------------------------------------------------------
I do not intend to break any copyright rules.
Please find article at: http://www.biomedcentral.com/1471-2164/13/691
Authors: Crystal Sweetman, Darren CJ Wong, Christopher M Ford and Damian P Drew
journal: BMC Genomics
Publishers: BioMed Central Ltd.
If you would like review to be removed, please write me at vikas0633@gmail.com
--------------------------------------------------------------------------------------------------
Saturday, February 16, 2013
Sequence assembly demystified
This is another good article by Niranjan Nagarajan with Dr. Mihai Pop. They produced a wealth of articles on NGS analysis during Niranjan's stay at the Computer Science department of the University of Maryland.
Here the authors address commonly known issues with genomic/transcriptomic assembly, and the trade-offs of many approaches are discussed in detail. Some issues, such as the extent of repetitive genomic sequence, will always affect genome assembly quality independently of the assembly tool being used.
The article also provides a brief overview of the different algorithmic approaches used to merge short reads into contigs (see the sketch after this list). These approaches are:
1. Greedy - extend contigs locally using the maximum overlap between reads - TIGR Assembler
2. Overlap-layout-consensus - compute overlaps between all pairs of reads, then build a graph in which reads are nodes connected by overlap edges - Celera Assembler
3. De Bruijn graph - a k-mer-based approach that connects reads through (k-1)-base overlaps; the most commonly used. Errors are corrected both before and after assembly to produce good-quality output - Velvet, SOAPdenovo and ALLPATHS.
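To make point 3 concrete, here is a toy Python sketch of the de Bruijn graph idea: reads are cut into k-mers, and (k-1)-mer nodes are connected wherever a k-mer links them. It is only an illustration of mine, not how Velvet, SOAPdenovo or ALLPATHS are actually implemented.

from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a simple de Bruijn graph: edges are k-mers, nodes are (k-1)-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix node -> suffix node
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for node, successors in de_bruijn_graph(reads, k=4).items():
    print(node, "->", successors)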
Some of the assembly tools and the sequencing technologies they support:
Genome assemblers
ALLPATHS-LG - Illumina, Pacific Biosciences
SOAPdenovo - Illumina
Velvet - Illumina, SOLiD, 454, Sanger
ABySS - Illumina, SOLiD, 454, Sanger
Transcriptome assemblers
Trinity - Illumina, 454
Oases - Illumina, SOLiD, 454, Sanger
The authors summarize criteria to consider when planning sequencing in order to obtain good-quality assemblies, and they urge more interaction between molecular biologists and bioinformaticians.
The article also discusses approaches for evaluating assemblies. People mostly use metrics such as N50 to describe quality, but these can be misleading for sequences present at varying abundance, e.g. in transcriptomics and metagenomics. Tools such as AMOSvalidate, FRCurve and GAV can be used for quality evaluation. We evaluated the Lotus genome assembly with BreakDancer, which detects structural genomic variants; it requires remapping all the reads back to the target assembly.
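Since N50 comes up so often, here is a minimal sketch of how it is computed (my own illustration, not code from the article): it is the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, passes half of the total assembly size.

def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

print(n50([100, 200, 300, 400, 500]))  # total 1500, half 750 -> N50 is 400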
The following figure from the article shows an example FRCurve, relating assembly size to the corresponding number of errors detected by mapping against the correct sequence. The authors ask that no conclusions about tool performance be drawn from the figure.
--------------------------------------------------------------------------------------------------
I do not intend to break any copyright rules.
Please find article at: http://www.ncbi.nlm.nih.gov/pubmed/?term=Sequence+assembly+demystified
Authors: Niranjan Nagarajan and Mihai Pop
journal: Nature Reviews Genetics
Publishers: Macmillan Publishers Limited
If you would like review to be removed, please write me at vikas0633@gmail.com
--------------------------------------------------------------------------------------------------
Thursday, February 14, 2013
Improving PacBio Long Read Accuracy by Short Read Alignment
Hi,
I am very excited to start using third-generation sequencing (TGS) data; we aim to use a hybrid assembly approach to produce a good-quality genome. As you probably know, PacBio produces reads of 2,500-10,000 base pairs, but these long reads have error rates of up to 15%.
Here we discuss the LSC approach, which uses a homopolymer compression (HC) transformation. In simple words, every run of identical consecutive bases is merged into one, e.g. 'TTT' is replaced by 'T'. The authors claim the datasets are reduced to about 60% of their original size at the cost of a little sensitivity and specificity.
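To make the HC transformation concrete, here is a minimal Python sketch of my own (not code from the LSC package):

from itertools import groupby

def homopolymer_compress(seq):
    """Collapse each run of identical consecutive bases to a single base."""
    return "".join(base for base, _run in groupby(seq))

print(homopolymer_compress("GGTTTACCCCA"))  # -> "GTACA"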
In principle the method sounds very simple. Long and short reads are transformed as shown in the following figure. Reads with fewer than 40 non-N bases are filtered out. In the third step, short reads are mapped onto the long reads using Novoalign, but BWA or SeqAlto can also be used.
The long reads are then modified based on the alignments, correcting a substantial number of nucleotide sites. In the example, the authors used 100,000 LRs and 60 million SRs; this takes around 10 hours and 20 GB of space, roughly 10 times less time and space than the PacBioToCA tool.
This review represents my understanding of the article. For a better explanation, please refer to:
http://www.ncbi.nlm.nih.gov/pubmed/23056399
Thursday, September 15, 2011
A whole-genome phylogeny of the family Pasteurellaceae
I have been delaying the work on whole-genome-based phylogeny for almost two months, but there is no escape now and I have to work on it. In principle it is pretty simple, but the data mining and formatting are themselves a challenge. I will try to warm up by reading this article published in Molecular Phylogenetics and Evolution. This work was done at the Sackler Institute last year.
They used 12 whole-genome sequences comprising 3,130 genes. It is shown that at least 160 genes should be concatenated in order to produce reliable results.
"More recent phylogenetic studies
(Christensen et al., 2004; Gioia et al., 2006; Redfield et al., 2006)
have included the added power of considering multiple genes in
phylogenetic analysis. With over 10 species of Pasteurellaceae with
whole-genome sequences it is now possible to use whole-genome
datasets to assess the evolutionary relationships in this family."
Here is the list of species used in the article:
Now comes the tough part, the methods and materials:
1. Matrix construction
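To make the matrix construction step concrete, here is a minimal sketch of my own of how per-gene alignments are concatenated into one supermatrix (one long aligned sequence per species); the species names and short sequences below are placeholders, not data from the paper.

# each dict is one gene alignment: species name -> aligned sequence (hypothetical)
gene_alignments = [
    {"H. influenzae": "ATG-CA", "P. multocida": "ATGACA", "A. pleuropneumoniae": "ATGTCA"},
    {"H. influenzae": "GGTTAA", "P. multocida": "GGATAA", "A. pleuropneumoniae": "GGTTAA"},
]

supermatrix = {}
for alignment in gene_alignments:
    for species, seq in alignment.items():
        supermatrix[species] = supermatrix.get(species, "") + seq

for species, seq in supermatrix.items():
    print(species, seq)  # each species now has one concatenated sequence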
Tuesday, September 6, 2011
Structure-Function Analysis of a CVNH-LysM Lectin Expressed during Plant Infection by the Rice Blast Fungus Magnaporthe oryzae
Hi guys (probably only me :P),
This article was discussed in our group and is worth a short review here. To start with, CVNH stands for Cyanovirin-N homolog, a family of proteins that is not very well studied. The authors discuss the different types of CVNH and differentiate them on a functional/structural basis. Here are the type III domains of CVNH:
Both the CVNH and LysM domains are shown here to be carbohydrate specific, and the similarity between the CVNH and LysM domains is used to argue for a similar function.
Further, they determined the 3D NMR structure of this construct (MoCVNH-LysM).
That is all, except for more explanation of the carbohydrate binding; for more details you can always refer to the article.