Analysis and quantification of information in biological networks for protein function prediction

Frens Tedeschini

A major challenge in bioinformatics is to devise algorithmic methods that, given a gene, generate a hypothesis about its function that can then be validated by wet-lab assays.

Recently, new experimental techniques have become available, producing data that offer clues about gene function (e.g. protein-protein interaction data, gene expression data, metabolic profiling, etc.).
However, it has become clear that while each data type contains important information that can help determine the function of a gene, no single data type suffices by itself. It has also been shown that large-scale functional inference improves greatly when evidence from different sources is integrated.

The aim of this project is to analyze and quantify the amount of information that each data type conveys about gene function.
This project will begin by collecting publicly available experimental and computational datasets for the following organisms:
- S. cerevisiae
- C. elegans
- D. melanogaster
- A. thaliana
- H. sapiens
In particular, we shall focus on the following types of data (when available for a given organism): homology data, sequence patterns, structure patterns, gene expression, protein expression, metabolite expression, protein-protein interaction, genetic interaction and pathway information.
For genes that have already been experimentally annotated, we shall also collect their functional annotations (based on GO, the Gene Ontology).
We shall then attempt to quantify the amount of information about gene function contained in each data type. To do this, we shall use both standard statistical techniques and Random Graph Dependency, a novel information-theoretic measure for quantifying distances between large graphs.
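
As an illustration of what quantifying "the amount of information" could look like in practice, the sketch below computes the empirical mutual information between a discretized data-type feature and a functional label. This is only one possible standard statistic and is an assumption here, not the project's stated method; it is not an implementation of Random Graph Dependency. The function names, the expression-cluster feature, and the placeholder labels term_A and term_B are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy H(X) in bits of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Empirical (plug-in) mutual information I(X;Y) in bits.

    xs and ys are paired discrete observations, one pair per gene.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x,y) * log2( p(x,y) / (p(x) * p(y)) ) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Hypothetical toy data: four genes, one functional label each.
function_label = ["term_A", "term_A", "term_B", "term_B"]
# A feature that tracks the label perfectly ...
informative_cluster = ["c1", "c1", "c2", "c2"]
# ... and one that is statistically independent of it.
uninformative_cluster = ["c1", "c2", "c1", "c2"]

mi_informative = mutual_information(informative_cluster, function_label)
mi_uninformative = mutual_information(uninformative_cluster, function_label)

# Fraction of annotation uncertainty explained by the feature.
explained = mi_informative / entropy(function_label)
```

On this toy input the informative feature carries 1 bit about the label and the independent feature carries 0 bits. On real data, the plug-in estimate is biased upward for small samples, so a comparison across data types would need a correction or a permutation baseline.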