Frens Tedeschini
A major challenge in bioinformatics is to devise algorithmic methods that, given a gene, can generate a hypothesis about its function that can then be validated by wet-lab assays.
Recently, new experimental techniques have become available that produce data offering clues about gene function (e.g. protein-protein interaction data, gene expression data and metabolic profiling).
However, it has become clear that, while each data type contains important information that can help determine the function of a gene, no single data type suffices by itself. It has also been shown that large-scale functional inference improves greatly when evidence from different sources is integrated.
The aim of this project is to analyze and quantify the amount of information that each data type conveys about gene function.
This project will begin by collecting publicly available experimental and computational datasets for the following organisms:
- S. cerevisiae
- C. elegans
- D. melanogaster
- A. thaliana
- H. sapiens
In particular, we shall focus on the following types of data (when available for a given organism): homology data, sequence patterns, structure patterns, gene expression, protein expression, metabolite expression, protein-protein interactions, genetic interactions and pathway information.
For genes that have already been experimentally annotated, we shall also collect their functional annotations (based on the Gene Ontology, GO).
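A minimal sketch of this collection step is given here, assuming the annotations are obtained as tab-separated lines in the GO Annotation File (GAF) format and that "experimentally annotated" is taken to mean an experimental evidence code (EXP, IDA, IPI, IMP, IGI or IEP). The function name and the toy records are hypothetical and serve only as an illustration; the project itself may select annotations differently.

```python
# Evidence codes treated as "experimental" in this sketch (an assumption).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimental_annotations(gaf_lines):
    """Map each gene ID to the set of GO terms it is experimentally annotated with.

    gaf_lines: iterable of GAF-formatted lines (tab-separated; header and
    comment lines start with '!').
    """
    annotations = {}
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed line: evidence code column is missing
        # GAF columns (1-based): 2 = object ID, 4 = qualifier, 5 = GO ID, 7 = evidence
        gene_id, qualifier, go_id, evidence = cols[1], cols[3], cols[4], cols[6]
        if "NOT" in qualifier.split("|"):
            continue  # negated annotations state what a gene does NOT do
        if evidence in EXPERIMENTAL_CODES:
            annotations.setdefault(gene_id, set()).add(go_id)
    return annotations


# Toy records with hypothetical gene IDs (G1, G2, G3), for illustration only.
toy_gaf = [
    "!gaf-version: 2.2",
    "DB\tG1\tsym1\tinvolved_in\tGO:0006412\tREF:1\tIDA\t\tP",
    "DB\tG2\tsym2\tinvolved_in\tGO:0006412\tREF:2\tIEA\t\tP",   # electronic, dropped
    "DB\tG3\tsym3\tNOT|located_in\tGO:0005634\tREF:3\tIMP\t\tC",  # negated, dropped
]
print(experimental_annotations(toy_gaf))
```

Only G1 survives the filter in this toy input, since G2 carries an electronically inferred annotation and G3 a negated one.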
We shall then attempt to quantify the amount of information about gene function contained in each data type. To do this, we shall use both standard statistical techniques and Random Graph Dependency, a novel information-theoretic measure for quantifying distances between large graphs.
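As an illustration of what a standard statistical measure could look like here, the sketch below computes the empirical mutual information between a discretized feature from one data type and a functional label, with one entry per gene. This is a generic textbook estimator chosen for illustration, not Random Graph Dependency, and the function name and toy values are assumptions rather than part of the project.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two paired label sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: an expression cluster and a function label per gene.
expression_cluster = ["c1", "c1", "c2", "c2", "c1", "c2"]
function_label = ["term_A", "term_A", "term_B", "term_B", "term_A", "term_B"]
print(mutual_information(expression_cluster, function_label))
```

In this toy input the cluster determines the label exactly and both are balanced, so the estimate is one bit; a feature that is independent of the label would give zero. On real data, estimates of this kind are biased upwards for small samples, so comparisons across data types would need a correction or a permutation baseline.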