Sunday, November 7, 2010

How to map billions of short reads onto genomes (Nature Biotechnology)

Hi Guys, 

I am taking a course in Next Generation Sequencing, so I am trying to summarize some of the articles we have been following. This particular article was published in May 2009 by Cole Trapnell and Steven L. Salzberg. It gives a brief overview of the tools available for dealing with the huge numbers of short reads produced by deep sequencing technologies. The paper also gives a concise, well-explained account of the concepts behind tools such as Maq and Bowtie. I will try to explain more as we progress through this post.

Let's talk a bit about the so-called 'read mapping problem'. Next Generation Sequencing technologies produce millions of reads as output, and the challenge is then to map these reads against a known or predicted genome. Because there are so many short reads and a target must be found for each one, software such as BLAST might take a huge amount of time. This problem led to the development of tools that search for short reads efficiently in both time and space. A few such tools are given in the article.

  

We used Bowtie in our exercise because it produces alignments in less time than competing software, although it does not guarantee the best possible alignment for every read. As far as we have seen, though, its results are quite extensive.

As explained in the article, both methods, Bowtie and Maq, transform the genomic information into an index so that reads can be aligned in less time, using more or less space depending on the method. Maq uses the spaced seeds technique, in which the reference genome is divided into short seeds that are stored in an index. In the same way, each read is divided into short seeds; we first compare these shorter pieces, and only if they match do we compare the rest of the read. In simple words: chop up both the reference genome and the reads, then look for matches between the smaller pieces. If none of the smaller pieces match, the bigger sequence cannot match either. You can follow this flow in the following figure.

Bowtie follows the Burrows-Wheeler concept, a transformation that was originally used to compress large files. With this index, Bowtie matches a read one character at a time, starting from its last character and working backward to the first. This is very time efficient. The basic procedure finds exact matches, and Bowtie extends it with backtracking to tolerate a small number of mismatches; its main drawback is that it does not allow gaps (insertions or deletions) in the alignment.
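To make the Burrows-Wheeler idea more concrete, here is a minimal sketch of exact matching over a Burrows-Wheeler index. It is not Bowtie's actual code: the function names (`build_index`, `find_exact`) and the toy reference are made up for illustration, the suffix array is built naively, and no mismatches are handled. The search consumes the read from its last character to its first and narrows a range of suffix-array rows at each step.

```python
def build_index(reference):
    """Build a toy Burrows-Wheeler index (hypothetical helper, not Bowtie's API)."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    # Naive suffix array: start positions of all suffixes in sorted order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find_exact(read, index):
    """Return sorted reference positions where `read` occurs exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(read):  # last character of the read first
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:  # range is empty: the read does not occur
            return []
    return sorted(sa[lo:hi])


index = build_index("ACGTACGTTAGC")  # toy reference, not real data
print(find_exact("ACG", index))
print(find_exact("TAG", index))
```

The index is built once per reference and then reused for every read, which is why this style of search pays off when there are millions of reads. A real aligner would use a compressed index and add backtracking for mismatches.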


Many challenges and questions remain for developers of read mapping software. As all the sequencing machine vendors are trying to produce longer reads, will the short-read mapping programs scale well as the reads get longer? Maq, Bowtie and several other short-read packages support reads longer than 100 bp, but at some point, software designed for longer reads, such as BLAST, may be a better fit for downstream analysis. Furthermore, when mapping reads from an organism that has diverged significantly from its reference genome, how should a program’s parameters be adjusted, and can that adjustment happen automatically? How useful is mapping quality in downstream analysis, and should it be computed while aligning reads, as Maq does, or later? The answers to each of these questions will depend on the type of assay and the scale of the analysis, and as long as the technology continues to change, the programs will have to change rapidly to keep up.

Tuesday, October 19, 2010

An Integrated View of Molecular Coevolution in Protein–Protein Interactions

Hi Guys,

Recently, an article giving an overview of protein coevolution theory was published in Molecular Biology and Evolution by David L. Robertson and Simon C. Lovell from the University of Manchester. It contains useful references to current work on this topic, such as Hakes (2007), Suel (2003), Socolich (2005), Halabi (2009), Wang and Pollock (2005) and Fares (2006).

This article discusses how evolutionary pressure acts on the regions of interacting proteins that contribute to binding. Basically, it describes how a mutation at one residue in a binding site may lead to a mutation at another residue involved in the same binding site.

The following figure shows how lowering the fitness at one interdependent site leads to an increase in fitness at the other site.


They also give the definition of coevolution from the 1950s: "reciprocal evolutionary change in interacting species". Covariation analysis is well established in RNA structure prediction, but the use of coevolution in protein structure prediction is still at the development stage. Nevertheless, it has been shown that coevolving residues are present in many protein domains, interacting or not. The same article shows an example of coevolving residues taken from Yeang and Haussler (2007), who detect coevolving residues with a single-parameter model of amino acid pair substitution and then extend the analysis to the whole Pfam database.


An example of sites that demonstrate intermolecular coevolution. (a) Cyanobacterial and (b) human superoxide dismutase. The residues highlighted are at structurally equivalent positions and exhibit strong covariation (Yeang and Haussler 2007). In the proteins shown, the Phe and the Asn/Gln residues have exchanged positions. (c and d) Sequence profile for the equivalent positions. In each case, the cyanobacterial sequence corresponding to panel (a) is at the top and the human sequence corresponding to panel (b) is at the bottom. For other sequences, the Pfam family names are used.

It is known that only a few residues in each protein domain may show coevolution. In the next table, the authors list a few of the existing methods for detecting coevolving residues and compare the residues they predict.
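To give a feel for what detecting covariation between two alignment columns means, here is a minimal sketch using mutual information, a commonly used covariation measure. This is my own illustration and I am not claiming it is one of the methods compared in the article's table. The alignment is a made-up toy in which columns 1 and 3 swap Phe (F) and Asn (N), loosely echoing the exchanged positions in the superoxide dismutase example; the function names are hypothetical too.

```python
from collections import Counter
from math import log2


def column(alignment, j):
    """Extract column j from a list of equal-length aligned sequences."""
    return [seq[j] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in count_ab.items():
        p_ab = count / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment, not real sequences.
toy_alignment = [
    "AFGNK",
    "AFGNK",
    "ANGFK",
    "ANGFK",
    "AFGNR",
    "ANGFR",
]

covarying = mutual_information(column(toy_alignment, 1), column(toy_alignment, 3))
unrelated = mutual_information(column(toy_alignment, 1), column(toy_alignment, 4))
print(covarying, unrelated)
```

In this toy case the two swapping columns share one full bit of information, while column 1 and column 4 share essentially none. On real alignments, a high score alone does not prove coevolution: shared ancestry between the sequences also produces correlated columns, which is the point Waddell et al. make in the quote that follows.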

There are several mechanisms that contribute to the degree of correlation of amino acid replacements, either within one protein chain or between chains. Waddell et al. (2007) are explicit: "correlated evolution is what is detected, whereas coevolution is the hypothesized cause."

There are, however, a set of causes that may be hypothesized:

(i) Site-specific coevolution between interacting proteins has been detected in a range of systems (Moyle et al. 1994; Atchley et al. 2000; Mintseris and Weng 2005; Travers and Fares 2007; Yeang and Haussler 2007; Madaoui and Guerois 2008). It is relatively strong on a "per-residue" basis, indicated by its identification from the analysis of a handful of residues. The signal is most easily detected in the "rim" regions surrounding the interaction interface (Travers and Fares 2007; Yeang and Haussler 2007; Kann et al. 2009) rather than the core of the interface itself (Hakes et al. 2007). This is probably because the interface itself can be somewhat conserved.

(ii) Correlations of evolutionary rates between interacting proteins when measured over the entire protein length (Williams and Hurst 2000; Fraser et al. 2002). The evidence suggests that these rate correlations are unrelated to coevolution; rather they are due to external factors. This suggestion solves the puzzle of evolutionary correlations between spatially distant sites within protein structures (Hakes et al. 2007) and between proteins that do not directly interact (Juan et al. 2008a). It also explains the relative strength of the observed correlated rates. For obligate complexes (i.e., those that are constitutively bound to their interacting partners), the rate correlation between proteins distant in the complex is as strong as for those directly interacting (Hakes et al. 2007). By contrast, for proteins with a more tenuous functional link, the correlation is much weaker (Juan et al. 2008a).

It is clear that site-specific molecular coevolution not only exists but is also necessary to maintain biological function. The authors argue for improved methods for predicting coevolving residues, and they also ask that such methods take into account that the sequences considered might have different origins.

Sunday, October 17, 2010

An Integrated View of Protein Evolution (review in Nature Reviews Genetics, 2006)

Hi Guys,

This article discusses a very important issue in protein evolution, which is well stated in the abstract: "Protein evolution is not determined exclusively by selection on protein structure and function, but is also affected by the genome position of the encoding genes, their expression patterns, their position in biological networks and possibly their robustness to mistranslation". So before we go deep into this paper, which basically talks about the evolutionary rates of different sites, we should take a look at some basics.

I know that most of you understand the definition of a transition matrix; if not, it is a matrix that contains the probabilities of each type of amino acid substitution for a given period of evolution. The last part of the definition is very important: we can't use a transition matrix unless it matches the evolutionary distances in the data used for the multiple sequence alignment. One can use PAM1, PAM120, PAM250, etc., depending on the data. If you don't know which one to use, PAM120 is often considered a reasonable general-purpose choice.
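The link between PAM1, PAM120 and PAM250 can be sketched with a little matrix arithmetic: the transition matrix for n PAM units is the PAM1 matrix multiplied by itself n times. The example below is only a sketch under strong assumptions. It uses a made-up two-letter alphabet with a 1% chance of change per unit instead of the real 20 x 20 amino acid matrix, and the names `toy_pam1`, `mat_mul` and `mat_pow` are my own.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical two-state "PAM1": each residue has a 1% chance of changing
# per unit of evolutionary time. Each row sums to 1.
toy_pam1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]

toy_pam120 = mat_pow(toy_pam1, 120)
toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a residue is observed unchanged after each period.
print(toy_pam1[0][0], toy_pam120[0][0], toy_pam250[0][0])
```

The probability of seeing a residue unchanged drops as the period gets longer, which is why a matrix built for one evolutionary distance gives misleading probabilities when applied to sequences separated by a very different distance.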

A good point to notice is that proteins encoded by genes in regions with high recombination rates should evolve quickly. Well, it makes sense ;). In terms of applicability, mutations at the most conserved sites of disease-associated genes are those most likely to be involved in pathology, but no one knows whether disease-related genes as a class evolve slower or faster than the rest of the genome. You may remember the molecular clock hypothesis, which says that protein evolution proceeds at an approximately constant rate over time; current research has shown that it does not hold strictly. In reality, evolutionary rates of the proteome vary considerably across species.

I have included a few figures from the article that give precise information about how the rate of protein evolution relates to gene dispensability and gene expression level.


The first figure was obtained by Wall et al. using sequences from four yeast species of the Saccharomyces genus. The rate of protein evolution is weakly associated with the severity of the fitness effect of gene deletions in yeast.


In the above figure, we can see that gene expression level correlates strongly and negatively with the rate of evolution in yeast. These values were again calculated by Wall et al. on the same data set as for gene dispensability.



In the last figure, we can see the factors that affect the rate of evolution. I am going to describe each relationship step by step:
a) Transcription causes increased spontaneous mutation rates in Saccharomyces cerevisiae and E. coli, probably by exposing the non-transcribed ssDNA to mutagenic chemicals.
b) Recombinational repair of double-stranded breaks in S. cerevisiae increases the frequency of nearby point mutations.
c) Genes that are close to recombination hotspots in S. cerevisiae are expressed at higher levels during vegetative growth than most other genes.
d) Essential genes are clustered in regions of low recombination in S. cerevisiae and Caenorhabditis elegans.
e) Proteins that are more dispensable tend to be expressed at lower levels than less dispensable ones.
f) More protein–protein interactions have been reported for highly expressed proteins than for low-abundance proteins in S. cerevisiae. However, this correlation is not supported by all interaction-detection methods, and might reflect a detection bias towards high-abundance proteins.
g) It has been reported that essential genes have more protein–protein interactions than non-essential genes. However, this correlation might be an artefact of biases in certain interaction data sets.

In conclusion, this article states that genomic data reveal a range of influences on protein evolution, but there are still plenty of gaps to fill. Future studies can incorporate further information based on duplication and functional divergence of genes, protein domain shuffling, and horizontal gene transfer across species. Such studies can improve our understanding of protein evolution, and this knowledge can then be used to validate protein interaction data by comparing evolutionary rates, or to identify potential drug targets in microbes, under the assumption that they are slowly evolving.

This post might contain some minor mistakes; if you want more details, you can find them in the article by Martin J. Lercher.

Thanks.

Best Regards,
Vikas Gupta