Running SigTools in MATLAB
SigTools are command-line programs that allow easy execution of core CMap algorithms and methods via a standardized user interface.
This tutorial demonstrates running connectivity analyses on custom datasets in MATLAB using the source code available in the cmapM GitHub repository.
We also provide Docker images of the SigTools via Docker Hub. These images come pre-configured with all the software dependencies needed to run the tools in a local environment, without requiring a commercial MATLAB license. To learn how to use the Docker images see here.
Prerequisites for the demo
- Clone the cmapM code repository
git clone https://github.com/cmap/cmapM
- Configure the MATLAB environment and download test data for the demo
% Within a MATLAB session, type:
cd cmapM
setup
- Run all the demos, described in more detail below
sig_tool_demos
Connectivity analysis using SigTools
1. Running a CMap query against an L1000 dataset using the QueryL1k tool
The QueryL1k tool computes a set-based enrichment similarity between input genesets (aka queries) and a small subset of L1000 perturbational gene-expression signatures. (Note that while the tool is optimized for datasets generated by the L1000 platform, any perturbational dataset can be used.)
The algorithm operates as follows. First, raw similarity (connectivity) scores between a query and the CMap signatures are computed. While the query methodology is agnostic to the specific similarity metric, the default choice is a non-parametric, two-tailed weighted gene-set enrichment score (Subramanian, A. et al. Cell 2017).
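To make the scoring step concrete, here is a minimal sketch of a two-tailed weighted enrichment score on a toy signature. It follows the general recipe of the cited paper as we understand it: a weighted running-sum enrichment score is computed for the up and down sets separately, and the two are combined only when they point in opposite directions. The gene names, scores, and variable names are hypothetical, and this is not the cmapM implementation.

```matlab
% Toy signature: differential expression scores for 10 hypothetical genes.
genes = {'g1','g2','g3','g4','g5','g6','g7','g8','g9','g10'};
score = [3.2 2.5 1.9 1.1 0.4 -0.3 -0.9 -1.6 -2.2 -3.0];

% Hypothetical query: genes expected to go up and to go down.
up_set = {'g1','g3'};
dn_set = {'g9','g10'};

% Rank genes from most up-regulated to most down-regulated.
[sorted_score, order] = sort(score, 'descend');
sorted_genes = genes(order);
n = numel(sorted_genes);

sets = {up_set, dn_set};
es = zeros(1, 2);
for k = 1:2
    in_set = ismember(sorted_genes, sets{k});
    % Hits step up in proportion to |score| (weight exponent of 1, an
    % assumption of this sketch); misses step down uniformly.
    hit = abs(sorted_score) .* in_set;
    hit = hit / sum(hit);
    miss = ~in_set / (n - nnz(in_set));
    running = cumsum(hit - miss);
    % Enrichment score: the running-sum value with the largest magnitude.
    [~, imax] = max(abs(running));
    es(k) = running(imax);
end

% Two-tailed combination: nonzero only when the up and down sets are
% enriched at opposite ends of the ranking.
if sign(es(1)) ~= sign(es(2))
    wtcs = (es(1) - es(2)) / 2;
else
    wtcs = 0;
end
fprintf('ES_up = %.3f, ES_dn = %.3f, score = %.3f\n', es(1), es(2), wtcs);
```

In this toy case the up set sits at the top of the ranking and the down set at the bottom, so the combined score is strongly positive. Swapping the two sets flips its sign.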
The raw scores are then scaled (normalized) by the signed means to allow for comparisons across different queries.
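The normalization step can be sketched as follows. This is an illustration only: it assumes that positive and negative raw scores are each divided by the magnitude of the mean of the scores sharing their sign, so that signs are preserved, and it normalizes globally rather than within groups. The score values are made up.

```matlab
% Hypothetical raw connectivity scores of one query against 8 signatures.
raw_cs = [0.8 0.6 0.4 0 -0.2 -0.5 -0.3 0.2];

% Signed means: average the positive and the negative scores separately.
pos = raw_cs > 0;
neg = raw_cs < 0;
mu_pos = mean(raw_cs(pos));
mu_neg = mean(raw_cs(neg));

% Divide each score by the magnitude of the mean that matches its sign
% (assumption of this sketch); zeros are left at zero.
norm_cs = zeros(size(raw_cs));
norm_cs(pos) = raw_cs(pos) / mu_pos;
norm_cs(neg) = raw_cs(neg) / abs(mu_neg);
disp(norm_cs);
```

After this scaling the positive scores average to 1 and the negative scores to -1, which is what puts scores from different queries on a comparable scale.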
Finally, the statistical significance of the connections, adjusted for multiple hypothesis testing, is estimated. FDR q-values are estimated by comparing the score distribution of treatment signatures to that of null signatures in the dataset.
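One simple way to turn a treatment-versus-null comparison into q-values is an empirical false discovery rate. The sketch below illustrates that idea and is not necessarily the estimator used by the tool: for each treatment score it takes the fraction of null scores at least as extreme, divided by the fraction of treatment scores at least as extreme. All values and variable names are hypothetical.

```matlab
% Hypothetical normalized scores for treatment signatures and for
% signatures flagged as null.
trt_ncs  = [2.4 1.9 1.1 0.7 -0.4 -1.3 -2.1 0.2];
null_ncs = [0.5 -0.6 0.9 -0.3 0.1 -1.0 0.4 -0.2 0.7 -0.8];

q = zeros(size(trt_ncs));
for i = 1:numel(trt_ncs)
    t = abs(trt_ncs(i));
    % Fraction of null and of treatment scores at least as extreme as t.
    frac_null = mean(abs(null_ncs) >= t);
    frac_trt  = mean(abs(trt_ncs)  >= t);
    % Empirical FDR estimate, capped at 1. frac_trt is never zero because
    % the score being tested always counts itself.
    q(i) = min(1, frac_null / frac_trt);
end
disp([trt_ncs; q]);
```

Scores that no null signature matches get a q-value of 0 here, while scores well inside the null range get q-values near 1.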
Outputs: the tool produces the following output (in the results folder):
- arfs/: Per-query analysis report files (ARFs)
  - <QUERY_NAME>/query_result.gct: a GCT format text file listing the annotations, connectivity scores and q-values for each signature in the dataset. The following fields are computed by the query tool:
    - raw_cs: Raw connectivity scores
    - norm_cs: Normalized connectivity score computed by dividing the raw connectivity scores by the signed-mean scores of signatures (specified by the is_ncs_sig field in the signature metadata file). If the ncs_group field is not empty the scores are normalized within each group, otherwise the scores are normalized using the global means across all signatures.
    - fdr_q_nlog10: Negative log10 transformed FDR q-values estimated relative to the null signatures (specified by the is_null_sig field in the signature annotation file).
- matrices/query: Query parameters and result matrices in GCTx format for all queries:
  - up.gmt, dn.gmt: query genesets in GMT format
  - cs.gctx: Raw connectivity scores matrix [signatures x queries]
  - ncs.gctx: Normalized connectivity score matrix [signatures x queries]
  - fdr_qvalue.gctx: Estimated false discovery rate q-values [signatures x queries]
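For readers unfamiliar with the GMT format used for the query genesets, the sketch below writes and reads back a pair of small set files. GMT is a tab-delimited text format with one set per line: the set name, a description, and then the set members. The query name, gene identifiers, and scratch folder are placeholders chosen for illustration.

```matlab
% Hypothetical query with one up set and one down set.
query_name = 'my_query';
up_genes = {'GENE_A','GENE_B','GENE_C'};   % placeholder identifiers
dn_genes = {'GENE_X','GENE_Y'};

out_dir = tempname;                        % scratch folder for the example
mkdir(out_dir);
tab = sprintf('\t');

files = {'up.gmt', 'dn.gmt'};
members = {up_genes, dn_genes};
for k = 1:2
    fid = fopen(fullfile(out_dir, files{k}), 'w');
    % One set per line: name <TAB> description <TAB> member1 <TAB> member2 ...
    fprintf(fid, '%s\t%s\t%s\n', query_name, 'example set', ...
        strjoin(members{k}, tab));
    fclose(fid);
end

% Read the up set back to check the layout.
line = strtrim(fileread(fullfile(out_dir, 'up.gmt')));
fields = strsplit(line, tab);
disp(fields);
```

The first two fields of each line are the set name and description, and everything after them is a set member.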
2. Testing enrichment of user-defined sets using the GSEA Preranked tool
The GSEA Preranked tool performs set-based enrichment analysis against a user-defined rank-ordered dataset. It determines whether a priori defined sets show statistically significant enrichment at either end of the ranking.
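As a rough illustration of what testing enrichment in a ranked list involves, the sketch below scores one set with a weighted running sum and compares it against random sets of the same size. This is a generic permutation scheme written for this tutorial, not necessarily what the tool does internally. The ranked scores, set positions, and number of permutations are arbitrary.

```matlab
rng(0);                               % fixed seed so the sketch is repeatable
n = 200;
ranked_score = linspace(3, -3, n);    % hypothetical rank-ordered scores
set_idx = [1 4 7 12 20];              % hypothetical set members, by rank

in_set = false(1, n);
in_set(set_idx) = true;

% Observed enrichment score: weighted running sum over the ranking.
hit = abs(ranked_score) .* in_set;
miss = ~in_set / (n - nnz(in_set));
running = cumsum(hit / sum(hit) - miss);
[~, imax] = max(abs(running));
es_obs = running(imax);

% Null distribution from random sets of the same size.
n_perm = 1000;
es_null = zeros(1, n_perm);
for p = 1:n_perm
    rnd = false(1, n);
    rnd(randperm(n, numel(set_idx))) = true;
    h = abs(ranked_score) .* rnd;
    m = ~rnd / (n - nnz(rnd));
    r = cumsum(h / sum(h) - m);
    [~, j] = max(abs(r));
    es_null(p) = r(j);
end

% Two-sided permutation p-value; the +1 terms keep it strictly positive.
pval = (sum(abs(es_null) >= abs(es_obs)) + 1) / (n_perm + 1);
fprintf('ES = %.3f, p = %.4f\n', es_obs, pval);
```

Because the example set is concentrated near the top of the ranking, its score is large and positive and very few random sets match it.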
3. Querying cell viability data with the Curie tool using cell-sets
The Curie tool computes a set-based enrichment similarity between input cell-line sets (aka queries) and a perturbational cell-fitness signature dataset. Note that while the tool is optimized for datasets generated by the PRISM platform, any high-dimensional cell-fitness dataset can be used.