Max-Planck-Institut für Informatik
max planck institut
informatik
mpii logo Minerva of the Max Planck Society
 

FunSimMat - Functional Similarity Matrix

Semantic similarity scores

Resnik (Res columns)

Resnik's measure is a measure of semantic similarity between ontology terms. It is based on the information content of a term,
which uses the term probability. We define the relative frequency of a GO term in the UniProt database as its probability.
Resnik's measure for comparing two terms t1 and t2 is defined as follows:
Resnik


S(t1, t2) is the set of common ancestors of terms t1 and t2 in the ontology. It ranges from 0 for terms without similarity to infinity.
We use the abbreviation Res for referring to Resnik's measure. By default, these columns are hidden in the results of functional
comparison.
Ref.: Resnik, J Artif Intell Res (1999), 11:95-130.

Lin (Lin columns)

Lin's measure of semantic similarity is also based on the information content of GO terms. Lin's measure for comparing two terms
t1 and t2 is defined as follows:
Lin


Again, S(t1, t2) is the set of common ancestors of terms t1 and t2 in the ontology. Lin's measure ranges from 0 for terms without
similarity to 1 for terms with maximum similarity. We use the abbreviation Lin for referring to Lin's measure.
Ref.: Lin, Proc 15th Int'l Conf. on Machine Learning (ICML-98) (1998), 296-304.

simRel (simRel columns)

The simRel score is a functional similarity measure for comparing two GO terms with each other. It is based on Resnik's and Lin's
similarity measures. The simRel score ranges from 0 for terms that have no similarity to 1 for terms with maximum similarity. It is
calculated as follows:
simrel


t1 and t2 are two GO terms, and S(t1, t2) is the set of their common ancestors in the ontology. simRel ranges from 0 for terms without
similarity to 1 for terms with maximum similarity. We use the abbreviation simRel for referring to the simRel score.
Ref.: Schlicker et al., BMC Bioinformatics (2006), 7(1):302

Jiang and Conrath (Jiang columns)

Jiang and Conrath defined a distance measure between GO terms that is based on the information content. The similarity measure
using Jiang and Conrath's distance is defined as follows:
Jiang and Conrath


S(t1, t2) is the set of common ancestors of terms t1 and t2 in the ontology. It ranges from 0 for terms without similarity to 1.
We use the abbreviation Jiang for referring to this score. By default, these columns are hidden in the results of functional
comparison.
Ref.: Couto et al., Data & Knowledge Engineering (2007), 137-152


Functional similarity scores

Functional similarity measures are used to compare two proteins or protein families. There are two types of functional similarity
measures, structure-based and semantic similarity-based. The different functional similarity measures between two proteins
or protein families p and q annotated with the sets GOp and GOq of sizes N and M, respectively, are defined as follows.

UI

The UI score is defined as follows: UI


gp and gq are the nodes in graphs induced by the sets GOp and GOq, respectively. The graph induced by a term t
contains t and all of its ancestor terms. The UI score ranges from 0 for no similarity to 1 for highest similarity.
By default, these columns are hidden in the results of functional comparison.
Ref.: Guo et al., Bioinformatics (2006), 967-973

simGIC

The simGIC score is based on the UI score. Instead of counting the number of nodes in the union and intersection of the two induced
graphs, it sums up their information content. It is defined as follows:
simGIC


The simGIC score ranges from 0 for no similarity to 1 for highest similarity. By default, these columns are hidden in the
results of functional comparison.
Ref.: Pesquita et al., Proc 10th Annual Bio-Ontologies Meeting (2007)

TO and NTO

The term overlap (TO) and normalized term overlap (NTO) scores are based on the number of annotated terms shared between two proteins.
The TO score is defined as the number of terms shared between gp and gq, but excluding the root terms:
term overlap


The NTO score is defined as the TO score normalized by the number of terms in the smaller graphs:
normalized term overlap


Ref.: Mistry and Pavlidis, BMC Bioinformatics (2008), 9:327

GOscore BM

A GOscore is a measure of functional similarity between two proteins or protein families with respect to either biological process
(BPscore), molecular function (MFscore), or cellular component (CCscore). Considering two gene products A and B annotated
with the sets GOA and GOB of GO terms with sizes N and M, respectively, a similarity matrix S is calculated. This matrix contains
all pair wise similarity values of mappings GOAi of gene product A and mappings GOBj of gene product B:
simmatrix


The matrix S is not necessarily symmetric or square since the proteins can have different types and numbers of GO mappings. The
rows and the columns of S represent two different directional comparisons, row vectors correspond to a comparison of A to B
and column vectors to a comparison of B to A. The best hits for the comparison of A with B are determined as maximum values in the
rows in matrix S (row maxima). The maximum values in the columns of S (column maxima) are the best hits for the direction B to A.
The averages over the row maxima and the column maxima give similarity values for the comparison of A to B and the comparison of
B to A, respectively:
rowscore


columnscore


The GOscore is then computed as the maximum of rowScore and columnScore:
goscore bm


It can be computed either by Resnik's measure, Lin's measure or simRel. We use the abbreviations BP, MF, and CC for referring to
the BPscore, MFscore, and CCscore, respectively.
Ref.: Schlicker et al., BMC Bioinformatics (2006), 7(1):302

GOscore max (max columns)

GOscore max is a measure of functional similarity between two proteins or protein families with respect to either biological process
(BPscore), molecular function (MFscore), or cellular component (CCscore). The matrix S is computed as described for GOscore BM.
GOscore max is defined as the maximum over all sij:
goscore max


It can be computed either by Resnik's measure, Lin's measure or simRel.
Ref.: Lord et al., Bioinformatics (2003), 19(10):1275-1283

GOscore avg (avg columns)

GOscore avg is a measure of functional similarity between two proteins or protein families with respect to either biological process (BPscore),
molecular function (MFscore), or cellular component (CCscore). The matrix S is computed as described for GOscore BM.
GOscore max is defined as the maximum over all sij:
goscore avg


It can be computed either by Resnik's measure, Lin's measure or simRel.
Ref.: Lord et al., Bioinformatics (2003), 19(10):1275-1283

funSim

The funSim score is calculated from the BPscore and the MFscore of a pair of proteins or protein families. It is defined as follows:
funsim


Here, max(BPscore) and max(MFscore) denote the maximal score for biological process and molecular function, respectively.
The funSim score is computed using simRel, and GOscore. It ranges from 0 for no functional similarity to 1 for maximal functional similarity.
Ref.: Schlicker et al., BMC Bioinformatics (2006), 7(1):302

rfunSim

The rfunSim score is calculated from the funSim of a pair of proteins or protein families. It is defined as square root of the funSim score. It
ranges from 0 for no functional similarity to 1 for maximal functional similarity.
Ref.: Schlicker et al., Genome Biol (2007), 8(3):R33

funSimAll

The funSimAll score is calculated from the BPscore, MFscore and the CCscore of a pair of proteins or protein families. It is defined as:
funsimall


Here, max(BPscore), max(MFscore) and max(CCscore) denote the maximum possible score for biological process, molecular function, and
cellular component, respectively. The funSim score is computed using simRel, and GOscore. It ranges from 0 for no functional similarity to 1
for maximal functional similarity.

rfunSimAll

The rfunSimAll score is calculated as the square root of the funSimAll score of a pair of proteins or protein families.
It ranges from 0 for no functional similarity to 1 for maximal functional similarity.