sig_gutc_tool

Compute the similarity of input queries to CMap perturbagens, adjusting the results with respect to a background distribution.

Synopsis

sig_gutc_tool [--query_result QUERY_RESULT] [--up, --uptag UP] [--down, --dntag DOWN] [--query_meta QUERY_META] [--is_matched IS_MATCHED] [--match_group MATCH_GROUP] [--metric METRIC] [--es_tail ES_TAIL] [--score SCORE] [--rank RANK] [--build_id BUILD_ID] [--feature_space FEATURE_SPACE] [--sample_space SAMPLE_SPACE] [--pcl_set PCL_SET] [--bkg_path BKG_PATH] [--save_matrices SAVE_MATRICES] [--save_digests SAVE_DIGESTS]

Arguments

--query_result QUERY_RESULT : Load pre-computed query results from supplied connectivity matrix.

--up, --uptag UP : Geneset(s) to use for the up portion of the query

--down, --dntag DOWN : Geneset(s) to use for the down portion of the query

--query_meta QUERY_META : Metadata for each query. This is required for matched mode (see is_matched). The following fields are required for matching with default parameters: [pert_id, cell_id, pert_idose, pert_itime]

--is_matched IS_MATCHED : If true, compute GUTC in cell-line matched mode. Default is 0

--match_group MATCH_GROUP : Query grouping variable(s) for cell-line matching. Note that the tool expects 1 query per cell-line for each unique grouping. Default is pert_id|pert_idose|pert_itime

--metric METRIC : Similarity metric. Default is wtcs. Options are {wtcs}

--es_tail ES_TAIL : Specify two-tailed or one-tailed statistic for enrichment metrics. Default is both. Options are {both|up|down}

--score SCORE : Custom dataset of differential expression scores (e.g. zscores) in GCT(X) format. Use in combination with rank parameter.

--rank RANK : Custom dataset of ranks corresponding to the score matrix in GCT(X) format. Use in combination with score parameter. Note that if

--build_id BUILD_ID : Data build identifier. a2 refers to the GSE92742 dataset with Affymetrix feature ids. a2geneid is the same dataset mapped to Entrez GeneIDs. Default is a2geneid. Options are {a2|a2geneid}

--feature_space FEATURE_SPACE : Feature space for query comparisons. Select lm for landmark space, bing for best-inferred gene space, or full for the complete gene space. Default is bing. Options are {lm|bing|full}

--sample_space SAMPLE_SPACE : Signature space. Default is full. Options are {full}

--pcl_set PCL_SET : Perturbational classes in GMT format. Default is /cmap/data/vdb/touchstone_v1.1/matched/annot/pcl_n171_20170201.gmt

--bkg_path BKG_PATH : Path to background signature definition and percentile transforms. Default is /cmap/data/vdb/touchstone_v1.1/matched

--save_matrices SAVE_MATRICES : Save result matrices. Default is 1

--save_digests SAVE_DIGESTS : Save per-query digest folders. Default is 1

Description

Sig GUTC computes the similarity between input genesets (queries) and perturbational gene expression signatures in the CMap database. The results are transformed to a percentile scale and reported at different levels of granularity to aid interpretation.

Briefly, the algorithm operates as follows. First, raw similarity scores between a query and CMap signatures are computed. While the method is agnostic to the specific similarity metric used, the default choice is a two-tailed weighted enrichment score.
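To make the idea of a two-tailed weighted enrichment score concrete, the following is a minimal Python sketch. It uses a generic GSEA-style weighted running-sum statistic and combines the up and down enrichment scores only when they have opposite signs. The function names, the weight exponent, and the combination rule are assumptions made for illustration; the tool's actual wtcs implementation may differ in its details.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running-sum enrichment of gene_set in a ranked signature.

    ranked_genes and scores must be ordered from most up-regulated to most
    down-regulated. Returns the running-sum value with the largest
    absolute deviation from zero. (Generic GSEA-style statistic; an
    assumption, not necessarily the tool's exact formula.)
    """
    in_set = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        return 0.0

    running = 0.0
    best = 0.0
    for w, hit in zip(hit_weights, in_set):
        # Step up by the normalized weight on a hit, down uniformly on a miss.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def two_tailed_score(es_up, es_down):
    """Combine up and down enrichment scores into one similarity score.

    Assumed rule: if the two scores have opposite signs, return half their
    difference; otherwise return 0.
    """
    if es_up * es_down < 0:
        return (es_up - es_down) / 2.0
    return 0.0


# Toy signature: six hypothetical genes with made-up z-scores.
genes = ["a", "b", "c", "d", "e", "f"]
zscores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_up = enrichment_score(genes, zscores, {"a", "b"})
es_down = enrichment_score(genes, zscores, {"e", "f"})
print(two_tailed_score(es_up, es_down))
```

With this rule, a query whose up genes sit at the top of a signature and whose down genes sit at the bottom scores strongly positive, and the reverse arrangement scores strongly negative.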

The raw scores are then scaled (normalized) to adjust for covariates such as cell line and perturbation type. The normalized scores are transformed to percentile scores by comparing the test scores to those of a reference collection of signatures called Touchstone.
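The percentile step can be pictured as an empirical percentile rank of one score within a reference distribution. The sketch below shows only that generic idea. The function name and the tie-handling convention are assumptions, and the transform the tool actually applies (for example, how it treats the sign of a score) is not specified here.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, reference_sorted):
    """Percentile (0-100) of score within an ascending-sorted reference list.

    Ties are counted as half below and half above (an assumed convention).
    """
    n = len(reference_sorted)
    if n == 0:
        raise ValueError("reference distribution is empty")
    below = bisect_left(reference_sorted, score)
    ties = bisect_right(reference_sorted, score) - below
    return 100.0 * (below + 0.5 * ties) / n


# Hypothetical normalized scores standing in for a reference collection.
reference = sorted([-2.0, -1.0, 0.0, 1.0, 2.0])
print(percentile_rank(1.5, reference))
```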

The per-signature normalized connectivity scores are summarized to yield connectivity to individual perturbagens within a cell line, across cell lines, and for perturbational classes (PCLs). Any summary statistic can be employed, but in practice the maximal-quantile (MAXQ) score is used. Given a set of scores X and a pair of percentiles PL and PU, MAXQ computes the PL-th and PU-th percentiles of X and returns whichever of the two has the larger absolute value. By default GUTC uses PL=33 and PU=67.
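The MAXQ summary described above can be sketched in a few lines of Python. The percentile helper uses linear interpolation between order statistics, which is an assumption; the tool may use a different quantile definition, so values near the boundaries can differ slightly. The function names are illustrative.

```python
def percentile(values, q):
    """q-th percentile (0-100) of values, by linear interpolation (assumed)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must not be empty")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] + (xs[hi] - xs[lo]) * frac


def maxq(values, pl=33, pu=67):
    """Return whichever of the pl-th and pu-th percentiles has the larger
    absolute value; the upper percentile wins ties."""
    q_lo = percentile(values, pl)
    q_hi = percentile(values, pu)
    return q_hi if abs(q_hi) >= abs(q_lo) else q_lo


# Hypothetical scores for one perturbagen across several signatures.
print(maxq([-90.0, -80.0, -70.0, 10.0, 20.0]))  # mostly negative connectivity
print(maxq([10.0, 20.0, 95.0, 96.0, 97.0]))     # mostly positive connectivity
```

Because the two percentiles sit on either side of the median, MAXQ follows the tail in which most of the scores lie, and a single outlying signature does not dominate the summary.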

At each level of summarization, percentile scores are re-computed by comparing to the corresponding results obtained for the Touchstone signatures. For a given connection, the percentiles are computed within perturbagens in the cell type that the connection corresponds to.

An important variant of GUTC is the matched mode specified by the is_matched parameter. Matched mode incorporates cell-line information when query data has been generated systematically in cell types that match the Touchstone signatures. Currently this includes the following 9 cell types: [A375, A549, HEPG2, HCC515, HA1E, HT29, MCF7, PC3, VCAP]. To run GUTC in this mode, set the is_matched flag to true and provide the required metadata using the query_meta argument. Note that the tool expects 1 query per cell line for each unique [pert_id, pert_idose, pert_itime] combination. The default query grouping variables can be changed using the match_group argument.
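Before running in matched mode, it can help to check that the query metadata satisfies the one-query-per-cell-line expectation. The following Python sketch is a hypothetical pre-flight check, not part of the tool. It assumes the metadata has already been loaded as a list of dictionaries keyed by the field names listed above, and it treats any cell line outside the nine listed as a problem.

```python
from collections import Counter

# The nine cell types listed for matched mode.
MATCHED_CELLS = {"A375", "A549", "HEPG2", "HCC515", "HA1E",
                 "HT29", "MCF7", "PC3", "VCAP"}


def check_matched_meta(rows, match_group=("pert_id", "pert_idose", "pert_itime")):
    """Return a list of human-readable problems found in the query metadata.

    rows: iterable of dicts with at least cell_id and the match_group fields.
    An empty list means the check found nothing to report.
    """
    problems = []
    counts = Counter()
    for row in rows:
        cell = row["cell_id"]
        if cell not in MATCHED_CELLS:
            problems.append(f"unsupported cell line: {cell}")
            continue
        key = tuple(row[field] for field in match_group)
        counts[(key, cell)] += 1
    for (key, cell), n in sorted(counts.items()):
        if n > 1:
            problems.append(f"{n} queries for group {key} in cell line {cell}")
    return problems


# Hypothetical metadata rows; the identifiers are made up for illustration.
rows = [
    {"pert_id": "BRD-X", "cell_id": "MCF7", "pert_idose": "10 uM", "pert_itime": "24 h"},
    {"pert_id": "BRD-X", "cell_id": "PC3", "pert_idose": "10 uM", "pert_itime": "24 h"},
    {"pert_id": "BRD-X", "cell_id": "PC3", "pert_idose": "10 uM", "pert_itime": "24 h"},
]
for problem in check_matched_meta(rows):
    print(problem)
```

The check does not verify that every group covers all nine cell lines, since the text above does not state whether that is required.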

Examples

Run queries and apply GUTC

sig_gutc_tool --up 'up.gmt' --down 'down.gmt'

Apply GUTC on pre-computed query results

sig_gutc_tool --query_result '/path/to/sig_query/results/wtcs.gctx'

Run GUTC in cell-line matched mode

sig_gutc_tool --query_result '/path/to/sig_query/results/wtcs.gctx' --query_meta '/path/to/query_info.txt' --is_matched true

Run GUTC using a custom dataset, Expects that

sig_gutc_tool --bkg_path '/path/to/gutc_background' --score '/path/to/modzs.gctx' --rank '/path/to/rank.gctx' --up 'up.gmt' --down 'down.gmt'