G-SESAME
Gene Semantic Similarity Analysis and Measurement Tools


Help and Tool Descriptions

Many biological data analysis methods require a numeric representation of the functional similarity of genes, so automatically discovering the descriptive similarities of genes and converting them into measurable numeric values is very important for such analyses. This research project aims to solve this problem by designing novel algorithms to measure the semantic similarity of the vocabularies used to annotate genes and, in turn, devising effective algorithms to determine the functional similarity of genes.

Based on these algorithms, the following online tools have been implemented, with more tools still under development.

Using tools in interactive mode

Based on our novel method, which decodes a GO term's semantics into a numeric value by aggregating the semantic contributions of its ancestor terms (including the term itself) in the GO hierarchy, we implemented the following tools to measure the semantic similarity of GO terms:

  1. Semantic similarity of two GO terms.
    Input:
      1. Two GO terms (GO term 1 and GO term 2), entered as numbers only, without the "GO:" prefix;
      2. "is-a" value (default value is 0.8, G-SESAME only);
      3. "part-of" value (default value is 0.6, G-SESAME only);
    Output:
      1. Numerical similarity value of the two GO terms;
      2. Accession IDs;
      3. Names;
      4. Definitions;
      5. Synonyms;
      6. Similarity DAGs of the two GO terms and their ancestors;
  2. Semantic similarity of two GO term sets.
    Input:
      1. Two files containing the GO terms of set 1 and set 2, written as numbers only, without the "GO:" prefix;
      2. "is-a" value (default value is 0.8, G-SESAME only);
      3. "part-of" value (default value is 0.6, G-SESAME only);
      4. Email (Optional);
      5. Name of the result (Optional);
    Output:
      1. The similarity table of the two sets of GO terms;
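
The aggregation idea behind these tools can be illustrated with a small sketch. This page does not spell out the formula, so the code below assumes one formulation: a term contributes 1 to its own semantics, each ancestor contributes the largest product of edge weights along any path from the term up to that ancestor, and two terms are compared through the contributions of their shared ancestors. The toy DAG, the term names, and the function names are hypothetical; they are not real GO data and not part of G-SESAME.

```python
from typing import Dict, List, Tuple

# Hypothetical toy DAG: child -> list of (parent, relation). Not real GO data.
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "root": [],
    "A": [("root", "is_a")],
    "B": [("root", "is_a")],
    "C": [("A", "is_a"), ("B", "part_of")],
    "D": [("A", "part_of")],
}

# Default edge weights quoted on this page for "is-a" and "part-of".
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(term: str) -> Dict[str, float]:
    """Contribution of `term` and each of its ancestors to the semantics of `term`."""
    contributions = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in PARENTS[child]:
            candidate = contributions[child] * WEIGHTS[relation]
            # Keep only the strongest path to each ancestor (assumed rule).
            if candidate > contributions.get(parent, 0.0):
                contributions[parent] = candidate
                stack.append(parent)
    return contributions


def similarity(term_a: str, term_b: str) -> float:
    """Shared-ancestor contributions divided by the two total semantic values."""
    sa, sb = s_values(term_a), s_values(term_b)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


if __name__ == "__main__":
    print(similarity("C", "D"))
```

Under these assumptions a term compared with itself scores 1, and lowering the "is-a" or "part-of" value reduces the weight that distant ancestors carry.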

Based on the semantic correlation of GO terms used to annotate genes, we implemented the following tools to measure the functional similarity of genes:

  1. Functional similarity of two genes.
    Input:
      1. Gene symbol 1 and gene symbol 2;
      2. "is-a" value (default value is 0.8, G-SESAME only);
      3. "part-of" value (default value is 0.6, G-SESAME only);
      4. Ontology;
      5. One source and one target species;
      6. Source and target data sources;
      7. Source and target evidence codes;
    Output:
      1. Numerical similarity value of the two genes;
      2. Accession IDs;
      3. Names;
      4. Definitions;
      5. Synonyms;
      6. Annotation DAG of the two genes;
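
This page does not state how the similarities of individual GO terms are combined into a single score for two genes. One common choice, assumed here purely for illustration, is a best-match average: each GO term of one gene is paired with its most similar term in the other gene, and the best scores are averaged over both genes. The term names and the similarity numbers below are made-up placeholders, and the function names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical pairwise GO term similarities (made-up numbers, not tool output).
TERM_SIM: Dict[FrozenSet[str], float] = {
    frozenset(("t1", "t3")): 0.5,
    frozenset(("t2", "t3")): 0.9,
    frozenset(("t1", "t4")): 0.2,
    frozenset(("t2", "t4")): 0.4,
}


def term_similarity(a: str, b: str) -> float:
    """Symmetric lookup; identical terms score 1, unknown pairs score 0."""
    if a == b:
        return 1.0
    return TERM_SIM.get(frozenset((a, b)), 0.0)


def gene_similarity(
    terms1: Sequence[str],
    terms2: Sequence[str],
    term_sim: Callable[[str, str], float] = term_similarity,
) -> float:
    """Best-match average of the GO terms annotating two genes (assumed rule)."""
    if not terms1 or not terms2:
        return 0.0
    best1 = sum(max(term_sim(t, u) for u in terms2) for t in terms1)
    best2 = sum(max(term_sim(u, t) for t in terms1) for u in terms2)
    return (best1 + best2) / (len(terms1) + len(terms2))


if __name__ == "__main__":
    print(gene_similarity(["t1", "t2"], ["t3", "t4"]))
```

In practice `term_sim` would be replaced by whichever GO term similarity measure is in use; the lookup table here only keeps the sketch self-contained.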

Based on the gene functional similarity measurement, we implemented the following tools for gene functionality analyses:

  1. Cluster genes based on their functional similarities.
    Input:
      1. Two files containing gene symbols, with one gene symbol per line;
      2. "is-a" value (default value is 0.8, G-SESAME only);
      3. "part-of" value (default value is 0.6, G-SESAME only);
      4. Ontology;
      5. One species;
      6. Data sources;
      7. Evidence codes;
      8. Threshold step;
      9. Email (Optional);
      10. Name of the result (Optional);
    Output:
      1. Gene similarity table;
      2. Gene clustering result;
  2. Cluster genes based on their functional similarities using Resnik's, Jiang's, or Lin's method.
    Input:
      1. Two files containing gene symbols, with one gene symbol per line;
      2. One of Resnik's, Jiang's, or Lin's methods;
      3. Ontology;
      4. One species;
      5. Data sources;
      6. Evidence codes;
      7. Threshold step;
      8. Email (Optional);
      9. Name of the result (Optional);
    Output:
      1. Gene similarity table;
      2. Gene clustering result;

Using tools in batch mode

The contents of the GO project can be roughly separated into two parts: GO terms and genes/gene products. Our tools reflect this division, and the names of the programs listed here fall into the same two groups. All the programs are also named and numbered accordingly.

You need to write your own program (in whatever language you are comfortable with) that simulates the HTTP queries issued by the programs numbered 1 and then calls the corresponding programs numbered 2.

For example, check the HTML source of geneCompareTwo1.php and reproduce what it does (the fields it needs) in your own local program. Then let your program send an HTTP query (the form action that a mouse click triggers once the number 1 program is loaded) to geneCompareTwo2.php. The results from geneCompareTwo2.php are sent back to you via HTTP, and your local program then needs to accept and parse them. This is also the standard way of simulating HTTP queries in batch mode.
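
The steps above can be sketched in Python with the standard library only. This is a rough illustration under stated assumptions, not the tool's documented API: the page URL, the sample form, and the field names (gene1, gene2, isA, partOf) are hypothetical placeholders, and the real names and the form method must be read from the actual HTML source of geneCompareTwo1.php.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

# Hypothetical stand-in for the HTML of a "number 1" page; real field names may differ.
SAMPLE_FORM = """
<form action="geneCompareTwo2.php" method="post">
  <input type="text" name="gene1" value="">
  <input type="text" name="gene2" value="">
  <input type="text" name="isA" value="0.8">
  <input type="text" name="partOf" value="0.6">
</form>
"""


class FormFields(HTMLParser):
    """Record the first form's action and method, plus named <input> defaults."""

    def __init__(self):
        super().__init__()
        self.seen_form = False
        self.action = ""
        self.method = "get"
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.seen_form:
            self.seen_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_request(page_url: str, html: str, overrides: dict) -> Request:
    """Turn the form found in `html` into the request a browser would send."""
    parser = FormFields()
    parser.feed(html)
    fields = {**parser.fields, **overrides}
    target = urljoin(page_url, parser.action)
    encoded = urlencode(fields)
    if parser.method == "post":
        return Request(target, data=encoded.encode("ascii"), method="POST")
    return Request(target + "?" + encoded, method="GET")


def fetch(request: Request, timeout: float = 60.0) -> str:
    """Send the request and return the raw response text for later parsing."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    req = build_request(
        "http://example.org/tools/geneCompareTwo1.php",  # placeholder URL
        SAMPLE_FORM,
        {"gene1": "TP53", "gene2": "BRCA1"},
    )
    print(req.get_method(), req.full_url, req.data)
```

Calling `fetch(req)` would perform the actual query. Forms that use `<select>` or radio inputs would need extra handling that this sketch leaves out.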

The limitations of our multiple GO comparison and gene clustering programs

Every program has limitations, and ours are no exception. We try to let our programs handle as many inputs as possible. However, because our tools are used so heavily, we have to balance the maximum number of queries against the maximum number of GO terms or gene symbols per query.

The multiple GO terms comparison tool accepts at most 4,000 GO terms, and the gene clustering tool accepts at most 5,000 gene symbols.

What do you need to do if you have a huge amount of data that exceeds the limitations of our programs?

If you have many GO terms or gene symbols to analyze, there are two options. One, write your own program and call our programs in batch mode; please refer to the section "Using tools in batch mode" for how to call our public APIs. Two, run one of the multiple GO terms comparison or multiple gene clustering tools. However, if your data exceed the limits of our programs, you have to use a divide-and-conquer approach.

For example, if you have 10,000 gene symbols while our program accepts 5,000 at a time, you can split your data into two parts of 5,000 each. You then have to run the program 2 * 2 = 4 times to cover all combinations of your data.
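
The arithmetic in this example can be checked with a short sketch that splits a list into chunks no larger than the limit and enumerates every ordered pair of chunks, one pair per run. The function names and the generated gene symbols are hypothetical placeholders.

```python
from itertools import product
from typing import List, Sequence, Tuple


def split_into_chunks(items: Sequence[str], limit: int) -> List[Sequence[str]]:
    """Cut `items` into consecutive chunks of at most `limit` entries."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [items[i:i + limit] for i in range(0, len(items), limit)]


def chunk_pairs(items: Sequence[str], limit: int) -> List[Tuple[Sequence[str], Sequence[str]]]:
    """Every ordered (chunk, chunk) pair, including a chunk paired with itself."""
    chunks = split_into_chunks(items, limit)
    return list(product(chunks, repeat=2))


if __name__ == "__main__":
    genes = [f"GENE{i}" for i in range(10_000)]  # placeholder symbols
    runs = chunk_pairs(genes, 5_000)
    print(len(runs))  # 2 chunks -> 2 * 2 = 4 runs
```

If the comparison turns out to be symmetric, the mirrored pairs could be skipped, but that depends on how the tool treats its two input files and is not stated on this page.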

Do you have the source program available to download and run locally?

No. Our paper presents the algorithm clearly, so you can implement it in your own language. Setting up the whole environment is also not easy. We believe it is easier to run our online tools interactively or in batch mode.

How do you query species that are not listed on the web page?

The web page lists only a limited number of ncbi_taxa_id values, while the database contains tens of thousands of species_id - ncbi_taxa_id pairs. It is impossible to list all of them, so we list only the same ones as the Gene Ontology web page. If you want to explore more, first check the species table to find the related species_id and ncbi_taxa_id, then use them in your own program to make a query.

More tools are under development. We promise to bring them online as soon as possible.