FunSimMat - Functional Similarity Matrix

Semantic similarity scores

Resnik (Res columns)

Resnik's measure is a measure of semantic similarity between ontology terms. It is based on the information content of a term,
which uses the term probability. We define the relative frequency of a GO term in the UniProt database as its probability.
Resnik's measure for comparing two terms t₁ and t₂ is defined as follows:
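\[ \mathrm{sim}_{Res}(t_1, t_2) = \max_{t \in S(t_1, t_2)} \bigl( -\log p(t) \bigr) \]
Here, p(t) is the probability of term t.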

S(t₁, t₂) is the set of common ancestors of terms t₁ and t₂ in the ontology. Resnik's measure ranges from 0 for terms without similarity to infinity.
We use the abbreviation Res for referring to Resnik's measure. By default, these columns are hidden in the results of functional
comparison.
Ref.: Resnik, J Artif Intell Res (1999), 11:95-130.
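
As a rough illustration of Resnik's measure, the following Python sketch computes it on a made-up four-term ontology. The term
names, probabilities, and ancestor sets are hypothetical. The sketch also assumes that each term counts as its own ancestor and
that the information content is the negative natural logarithm of the term probability.

```python
import math

# Hypothetical term probabilities (relative annotation frequencies).
p = {"root": 1.0, "A": 0.5, "B": 0.25, "C": 0.125}

# Hypothetical ancestor sets; assumption: a term is included in its own set.
ancestors = {
    "root": {"root"},
    "A": {"root", "A"},
    "B": {"root", "A", "B"},
    "C": {"root", "A", "C"},
}

def information_content(term):
    """Information content: negative log of the term probability."""
    return -math.log(p[term])

def resnik(t1, t2):
    """Highest information content among the common ancestors of t1 and t2."""
    common = ancestors[t1] & ancestors[t2]
    return max(information_content(t) for t in common)

# "B" and "C" share the ancestors "A" and "root"; "A" is the more informative one.
print(resnik("B", "C"))
```

Two terms whose only common ancestor is the root score 0 in this sketch, because the root has probability 1.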

Lin (Lin columns)

Lin's measure of semantic similarity is also based on the information content of GO terms. Lin's measure for comparing two terms
t₁ and t₂ is defined as follows:
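\[ \mathrm{sim}_{Lin}(t_1, t_2) = \max_{t \in S(t_1, t_2)} \frac{2 \log p(t)}{\log p(t_1) + \log p(t_2)} \]
Here, p(t) is the probability of term t.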

Again, S(t₁, t₂) is the set of common ancestors of terms t₁ and t₂ in the ontology. Lin's measure ranges from 0 for terms without
similarity to 1 for terms with maximum similarity. We use the abbreviation Lin for referring to Lin's measure.
Ref.: Lin, Proc 15th Int'l Conf. on Machine Learning (ICML-98) (1998), 296-304.
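
The normalization in Lin's measure can be sketched in Python as follows. The toy ontology (term names, probabilities, ancestor
sets) is hypothetical, each term is assumed to be its own ancestor, and returning 0 when both terms have probability 1 is a
choice made for this sketch only.

```python
import math

# Hypothetical term probabilities (relative annotation frequencies).
p = {"root": 1.0, "A": 0.5, "B": 0.25, "C": 0.125}

# Hypothetical ancestor sets; assumption: a term is included in its own set.
ancestors = {
    "root": {"root"},
    "A": {"root", "A"},
    "B": {"root", "A", "B"},
    "C": {"root", "A", "C"},
}

def lin(t1, t2):
    """Lin similarity: shared information relative to the terms' own information."""
    denominator = math.log(p[t1]) + math.log(p[t2])
    if denominator == 0:
        # Both terms have probability 1 (e.g. root vs. root); treated as 0 by assumption.
        return 0.0
    common = ancestors[t1] & ancestors[t2]
    return max(2 * math.log(p[t]) / denominator for t in common)

print(lin("B", "C"))  # similarity through the shared ancestor "A"
print(lin("B", "B"))  # a term compared with itself
```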

simRel (simRel columns)

The simRel score is a functional similarity measure for comparing two GO terms with each other. It is based on Resnik's and Lin's
similarity measures. The simRel score ranges from 0 for terms that have no similarity to 1 for terms with maximum similarity. It is
calculated as follows:
\[ \mathrm{sim}_{Rel}(t_1, t_2) = \max_{t \in S(t_1, t_2)} \left( \frac{2 \log p(t)}{\log p(t_1) + \log p(t_2)} \cdot \bigl( 1 - p(t) \bigr) \right) \]

t₁ and t₂ are two GO terms, S(t₁, t₂) is the set of their common ancestors in the ontology, and p(t) is the probability of term t.
We use the abbreviation simRel for referring to the simRel score.
Ref.: Schlicker et al., BMC Bioinformatics (2006), 7(1):302
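
A minimal Python sketch of simRel is given below: it weights the Lin ratio of each common ancestor by one minus that ancestor's
probability and keeps the best value. The toy ontology is hypothetical, each term is assumed to be its own ancestor, and the
handling of the zero denominator is an assumption of this sketch.

```python
import math

# Hypothetical term probabilities (relative annotation frequencies).
p = {"root": 1.0, "A": 0.5, "B": 0.25, "C": 0.125}

# Hypothetical ancestor sets; assumption: a term is included in its own set.
ancestors = {
    "root": {"root"},
    "A": {"root", "A"},
    "B": {"root", "A", "B"},
    "C": {"root", "A", "C"},
}

def sim_rel(t1, t2):
    """simRel: Lin ratio of a common ancestor, weighted by (1 - p(ancestor))."""
    denominator = math.log(p[t1]) + math.log(p[t2])
    if denominator == 0:
        # Both terms have probability 1; treated as 0 by assumption.
        return 0.0
    common = ancestors[t1] & ancestors[t2]
    return max(
        (2 * math.log(p[t]) / denominator) * (1 - p[t])
        for t in common
    )

print(sim_rel("B", "C"))  # shared ancestor "A" has probability 0.5
print(sim_rel("B", "B"))  # identical terms, damped by the probability of "B"
```

In this sketch, identical terms only approach a score of 1 when their probability is close to 0, which reflects the weighting
factor that distinguishes simRel from Lin's measure.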

Jiang and Conrath (Jiang columns)

Jiang and Conrath defined a distance measure between GO terms that is based on the information content. The similarity measure
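using Jiang and Conrath's distance is defined as follows:
\[ \mathrm{sim}_{JC}(t_1, t_2) = \max_{t \in S(t_1, t_2)} \frac{1}{1 - \log p(t_1) - \log p(t_2) + 2 \log p(t)} \]
Here, p(t) is the probability of term t.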

S(t₁, t₂) is the set of common ancestors of terms t₁ and t₂ in the ontology. The similarity ranges from 0 for terms without similarity to 1.
We use the abbreviation Jiang for referring to this score. By default, these columns are hidden in the results of functional
comparison.
Ref.: Couto et al., Data & Knowledge Engineering (2007), 137-152

Functional similarity scores

Functional similarity measures are used to compare two proteins or protein families. There are two types of functional similarity
measures, structure-based and semantic similarity-based. The different functional similarity measures between two proteins
or protein families p and q annotated with the sets GO^p and GO^q of sizes N and M, respectively, are defined as follows.

UI

The UI score is defined as follows:
\[ UI(p, q) = \frac{|g^p \cap g^q|}{|g^p \cup g^q|} \]

g^p and g^q are the nodes in graphs induced by the sets GO^p and GO^q, respectively. The graph induced by a term t
contains t and all of its ancestor terms. The UI score ranges from 0 for no similarity to 1 for highest similarity.
By default, these columns are hidden in the results of functional comparison.
Ref.: Guo et al., Bioinformatics (2006), 967-973
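
The induced graphs and the UI score can be sketched in Python as shown below. The parent relations form a hypothetical toy
ontology, and the function names are chosen for this illustration only. Both annotation sets are assumed to be non-empty.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["A"],
    "D": ["root"],
}

def induced_graph(terms):
    """Nodes of the graph induced by a set of terms: the terms plus all ancestors."""
    nodes = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in nodes:
            nodes.add(term)
            stack.extend(parents[term])
    return nodes

def ui_score(go_p, go_q):
    """Shared nodes divided by all nodes of the two induced graphs."""
    g_p, g_q = induced_graph(go_p), induced_graph(go_q)
    return len(g_p & g_q) / len(g_p | g_q)

print(ui_score({"B"}, {"C"}))  # graphs {B, A, root} and {C, A, root}
```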

simGIC

The simGIC score is based on the UI score. Instead of counting the number of nodes in the union and intersection of the two induced
graphs, it sums up their information content. It is defined as follows:
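\[ simGIC(p, q) = \frac{\sum_{t \in g^p \cap g^q} IC(t)}{\sum_{t \in g^p \cup g^q} IC(t)} \]
Here, IC(t) = -log p(t) is the information content of term t, and p(t) is its probability.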

The simGIC score ranges from 0 for no similarity to 1 for highest similarity. By default, these columns are hidden in the
results of functional comparison.
Ref.: Pesquita et al., Proc 10th Annual Bio-Ontologies Meeting (2007)
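
A minimal Python sketch of simGIC follows. It takes the node sets of the two induced graphs as input. The term probabilities
and node sets are hypothetical, and returning 0 when the union carries no information content is an assumption of this sketch.

```python
import math

# Hypothetical term probabilities (relative annotation frequencies).
p = {"root": 1.0, "A": 0.5, "B": 0.25, "C": 0.125}

def information_content(term):
    """Information content: negative log of the term probability."""
    return -math.log(p[term])

def sim_gic(g_p, g_q):
    """Information content of shared nodes divided by that of all nodes."""
    union_ic = sum(information_content(t) for t in g_p | g_q)
    if union_ic == 0:
        # Only terms with probability 1 are present; treated as 0 by assumption.
        return 0.0
    shared_ic = sum(information_content(t) for t in g_p & g_q)
    return shared_ic / union_ic

# Node sets of two hypothetical induced graphs.
g_p = {"root", "A", "B"}
g_q = {"root", "A", "C"}
print(sim_gic(g_p, g_q))
```

Unlike a plain node count, the shared root contributes nothing here because its information content is 0.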

TO and NTO

The term overlap (TO) and normalized term overlap (NTO) scores are based on the number of annotated terms shared between two proteins.
The TO score is defined as the number of terms shared between g^p and g^q, but excluding the root terms:
\[ TO(p, q) = \bigl| (g^p \cap g^q) \setminus \{\text{root terms}\} \bigr| \]

The NTO score is defined as the TO score normalized by the number of terms in the smaller graph:
\[ NTO(p, q) = \frac{TO(p, q)}{\min(|g^p|, |g^q|)} \]

Ref.: Mistry and Pavlidis, BMC Bioinformatics (2008), 9:327
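
The two overlap scores can be sketched in Python as follows. The node sets and the root term name are hypothetical. As an
assumption of this sketch, root terms are also left out when determining the size of the smaller graph.

```python
# Hypothetical name of the root term(s) to exclude from the counts.
ROOT_TERMS = {"root"}

def term_overlap(g_p, g_q):
    """TO: number of shared terms, not counting root terms."""
    return len((g_p & g_q) - ROOT_TERMS)

def normalized_term_overlap(g_p, g_q):
    """NTO: TO divided by the size of the smaller graph (roots excluded by assumption)."""
    smaller = min(len(g_p - ROOT_TERMS), len(g_q - ROOT_TERMS))
    if smaller == 0:
        return 0.0
    return term_overlap(g_p, g_q) / smaller

# Node sets of two hypothetical induced graphs.
g_p = {"root", "A", "B"}
g_q = {"root", "A", "C", "D"}
print(term_overlap(g_p, g_q), normalized_term_overlap(g_p, g_q))
```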

GOscore BM

A GOscore is a measure of functional similarity between two proteins or protein families with respect to either biological process
(BPscore), molecular function (MFscore), or cellular component (CCscore). Considering two gene products A and B annotated
with the sets GO^A and GO^B of GO terms with sizes N and M, respectively, a similarity matrix S is calculated. This matrix contains
all pairwise similarity values of mappings GO^A_i of gene product A and mappings GO^B_j of gene product B:
\[ s_{ij} = \mathrm{sim}(GO^A_i, GO^B_j), \quad i \in \{1, \dots, N\},\ j \in \{1, \dots, M\} \]

The matrix S is not necessarily symmetric or square since the proteins can have different types and numbers of GO mappings. The
rows and the columns of S represent two different directional comparisons, row vectors correspond to a comparison of A to B
and column vectors to a comparison of B to A. The best hits for the comparison of A with B are determined as maximum values in the
rows in matrix S (row maxima). The maximum values in the columns of S (column maxima) are the best hits for the direction B to A.
The averages over the row maxima and the column maxima give similarity values for the comparison of A to B and the comparison of
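B to A, respectively:
\[ \mathrm{rowScore} = \frac{1}{N} \sum_{i=1}^{N} \max_{1 \le j \le M} s_{ij}, \qquad \mathrm{columnScore} = \frac{1}{M} \sum_{j=1}^{M} \max_{1 \le i \le N} s_{ij} \]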

The GOscore is then computed as the maximum of rowScore and columnScore:
\[ \mathrm{GOscore} = \max(\mathrm{rowScore}, \mathrm{columnScore}) \]

It can be computed either by Resnik's measure, Lin's measure or simRel. We use the abbreviations BP, MF, and CC for referring to
the BPscore, MFscore, and CCscore, respectively.
Ref.: Schlicker et al., BMC Bioinformatics (2006), 7(1):302
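
The best-match procedure described above can be sketched in Python as follows. The matrix values are made up, and the function
name is chosen for this illustration. The matrix is given as a list of rows and is assumed to be non-empty.

```python
def goscore_bm(S):
    """Best-match GOscore: maximum of the averaged row maxima and column maxima."""
    # Direction A to B: best hit per row, averaged over the N rows.
    row_score = sum(max(row) for row in S) / len(S)
    # Direction B to A: best hit per column, averaged over the M columns.
    columns = list(zip(*S))
    column_score = sum(max(column) for column in columns) / len(columns)
    return max(row_score, column_score)

# Hypothetical similarity matrix for N = 2 and M = 3 annotations.
S = [
    [1.0, 0.25, 0.5],
    [0.25, 0.5, 0.0],
]
print(goscore_bm(S))
```

For this matrix, the row maxima are 1.0 and 0.5, and the column maxima are 1.0, 0.5, and 0.5, so the row direction gives the
higher average.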

GOscore max (max columns)

GOscore max is a measure of functional similarity between two proteins or protein families with respect to either biological process
(BPscore), molecular function (MFscore), or cellular component (CCscore). The matrix S is computed as described for GOscore BM.
GOscore max is defined as the maximum over all s_ij:
\[ \mathrm{GOscore}_{max} = \max_{1 \le i \le N,\ 1 \le j \le M} s_{ij} \]

It can be computed either by Resnik's measure, Lin's measure or simRel.
Ref.: Lord et al., Bioinformatics (2003), 19(10):1275-1283

GOscore avg (avg columns)

GOscore avg is a measure of functional similarity between two proteins or protein families with respect to either biological process (BPscore),
molecular function (MFscore), or cellular component (CCscore). The matrix S is computed as described for GOscore BM.
GOscore avg is defined as the average over all s_ij:
\[ \mathrm{GOscore}_{avg} = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} s_{ij} \]

It can be computed either by Resnik's measure, Lin's measure or simRel.
Ref.: Lord et al., Bioinformatics (2003), 19(10):1275-1283
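
For comparison, GOscore max and GOscore avg can be sketched in a few lines of Python. The sketch assumes that GOscore avg is
the plain average over all entries of S, as its name suggests. The matrix values and function names are made up for this
illustration.

```python
def goscore_max(S):
    """Largest pairwise similarity in the matrix."""
    return max(max(row) for row in S)

def goscore_avg(S):
    """Average over all N * M pairwise similarities (assumed definition)."""
    values = [s for row in S for s in row]
    return sum(values) / len(values)

# Hypothetical similarity matrix for N = 2 and M = 3 annotations.
S = [
    [1.0, 0.25, 0.5],
    [0.25, 0.5, 0.0],
]
print(goscore_max(S), goscore_avg(S))
```

A single well-matching pair of terms is enough to raise the maximum, whereas the average is lowered by every poorly matching pair.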

funSim

The funSim score is calculated from the BPscore and the MFscore of a pair of proteins or protein families. It is defined as follows:
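\[ \mathrm{funSim} = \frac{1}{2} \left[ \left( \frac{\mathrm{BPscore}}{\max(\mathrm{BPscore})} \right)^2 + \left( \frac{\mathrm{MFscore}}{\max(\mathrm{MFscore})} \right)^2 \right] \]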

Here, max(BPscore) and max(MFscore) denote the maximal score for biological process and molecular function, respectively.
The funSim score is computed using simRel and the GOscore. It ranges from 0 for no functional similarity to 1 for maximal functional similarity.
Ref.: Schlicker et al., BMC Bioinformatics (2006), 7(1):302

rfunSim

The rfunSim score is calculated from the funSim score of a pair of proteins or protein families. It is defined as the square root of the
funSim score, and it ranges from 0 for no functional similarity to 1 for maximal functional similarity.
Ref.: Schlicker et al., Genome Biol (2007), 8(3):R33
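
The following Python sketch shows one way to compute funSim and rfunSim from a BPscore and an MFscore. It assumes that funSim
is the mean of the two squared, normalized scores. The default maximal scores of 1.0 are an assumption that fits scores based on
simRel, and the function names are chosen for this illustration.

```python
import math

def fun_sim(bp_score, mf_score, max_bp=1.0, max_mf=1.0):
    """Mean of the squared BPscore and MFscore, each normalized by its maximum."""
    return 0.5 * ((bp_score / max_bp) ** 2 + (mf_score / max_mf) ** 2)

def rfun_sim(bp_score, mf_score, max_bp=1.0, max_mf=1.0):
    """Square root of the funSim score."""
    return math.sqrt(fun_sim(bp_score, mf_score, max_bp, max_mf))

# Hypothetical scores for one pair of proteins.
print(fun_sim(0.5, 1.0), rfun_sim(0.5, 1.0))
```

Because the individual scores are squared, funSim values tend to be small for moderately similar proteins. Taking the square
root spreads these values over a wider part of the range from 0 to 1.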

funSimAll

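The funSimAll score is calculated from the BPscore, MFscore and the CCscore of a pair of proteins or protein families. It is defined as:
\[ \mathrm{funSimAll} = \frac{1}{3} \left[ \left( \frac{\mathrm{BPscore}}{\max(\mathrm{BPscore})} \right)^2 + \left( \frac{\mathrm{MFscore}}{\max(\mathrm{MFscore})} \right)^2 + \left( \frac{\mathrm{CCscore}}{\max(\mathrm{CCscore})} \right)^2 \right] \]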

Here, max(BPscore), max(MFscore) and max(CCscore) denote the maximum possible score for biological process, molecular function, and
cellular component, respectively. The funSimAll score is computed using simRel and the GOscore. It ranges from 0 for no functional similarity to 1
for maximal functional similarity.

rfunSimAll

The rfunSimAll score is calculated as the square root of the funSimAll score of a pair of proteins or protein families.
It ranges from 0 for no functional similarity to 1 for maximal functional similarity.
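
A minimal Python sketch of funSimAll and rfunSimAll is given below. It assumes that funSimAll is the mean of the three squared,
normalized scores. The function names and the default maximal score of 1.0 for each ontology are assumptions of this sketch.

```python
import math

def fun_sim_all(bp_score, mf_score, cc_score, max_scores=(1.0, 1.0, 1.0)):
    """Mean of the squared BPscore, MFscore, and CCscore, each normalized by its maximum."""
    scores = (bp_score, mf_score, cc_score)
    squared = [(score / maximum) ** 2 for score, maximum in zip(scores, max_scores)]
    return sum(squared) / len(squared)

def rfun_sim_all(bp_score, mf_score, cc_score, max_scores=(1.0, 1.0, 1.0)):
    """Square root of the funSimAll score."""
    return math.sqrt(fun_sim_all(bp_score, mf_score, cc_score, max_scores))

# Hypothetical scores for one pair of proteins.
print(fun_sim_all(0.5, 1.0, 0.5), rfun_sim_all(0.5, 1.0, 0.5))
```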