Running SigTools with Docker
SigTools are command-line programs that allow easy execution of core CMap algorithms and methods via a standardized user interface. We provide Docker images of the SigTools via DockerHub. These images come pre-configured with all the software dependencies needed to run the tools in a local environment without requiring a commercial license.
This tutorial demonstrates running connectivity analyses on custom datasets using dockerized sig-tools.
Pre-requisites for the demo
- Install Docker for your platform from docker.com
- Pull the CMap SigTool Runtime Docker image (optional)
  docker pull cmap/sigtool-runtime
- Clone the cmap-sig-tools repository
  git clone https://github.com/cmap/cmap-sig-tools
- Download and extract the tools and test data
  cd cmap-sig-tools/demo
  sh ./get-demo.sh
Docker Demos
1. Running a CMap Query against an L1000 dataset using the QueryL1k tool
The QueryL1k tool computes a set-based enrichment similarity between input genesets (aka queries) and a small subset of L1000 perturbational gene-expression signatures. Note that while the tool is optimized for datasets generated by the L1000 platform, any perturbational dataset can be used.
The algorithm operates as follows. First, raw similarity (connectivity) scores between a query and CMap signatures are computed. While the query methodology is agnostic to the specific similarity metric, the default choice is a non-parametric, two-tailed weighted gene-set enrichment score (Subramanian, A. et al. Cell 2017).
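To make the scoring step concrete, here is a minimal Python sketch of a two-tailed weighted enrichment score. It assumes a GSEA-style weighted running sum for each geneset and combines the up and down scores as half their difference when they have opposite signs (zero otherwise). The function names, the `weight` parameter and the toy data are illustrative assumptions, not the tool's own code.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running-sum enrichment score of one geneset.

    ranked_genes: gene ids sorted by descending signature score
    scores: the signature scores in the same order
    """
    gene_set = set(gene_set)
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    if hit_total == 0:
        return 0.0
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up (weighted) on a set member, step down (uniform) otherwise.
        running += abs(s) ** weight / hit_total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


def two_tailed_score(ranked_genes, scores, up_set, dn_set):
    """Combine up and down enrichment into one connectivity score."""
    es_up = enrichment_score(ranked_genes, scores, up_set)
    es_dn = enrichment_score(ranked_genes, scores, dn_set)
    if es_up * es_dn > 0:  # same sign: treated as no connection (assumption)
        return 0.0
    return (es_up - es_dn) / 2.0


# Toy signature: six hypothetical genes, already sorted by score.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
values = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
score = two_tailed_score(genes, values, {"g1", "g2"}, {"g5", "g6"})
```

In this toy case the up-set sits at the top of the ranking and the down-set at the bottom, so the combined score is at its positive extreme; swapping the two sets would flip the sign.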
The raw scores are then scaled (normalized) by the signed-means to allow for comparisons across different queries.
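The scaling step can be sketched as follows. This is a simplified illustration that assumes the signed means are taken over all supplied raw scores: positive scores are divided by the mean of the positive scores, and negative scores by the absolute mean of the negative scores. The function name is hypothetical, and the tool itself may compute the means over a designated reference subset of signatures.

```python
def normalize_by_signed_means(raw_scores):
    """Scale raw connectivity scores by the signed means of the input."""
    pos = [s for s in raw_scores if s > 0]
    neg = [s for s in raw_scores if s < 0]
    pos_mean = sum(pos) / len(pos) if pos else None
    neg_mean = abs(sum(neg) / len(neg)) if neg else None
    normalized = []
    for s in raw_scores:
        if s > 0:
            normalized.append(s / pos_mean)
        elif s < 0:
            normalized.append(s / neg_mean)  # sign is preserved
        else:
            normalized.append(0.0)
    return normalized


# Toy raw scores for one query across five signatures.
ncs = normalize_by_signed_means([0.2, 0.6, -0.4, -0.8, 0.0])
```

After scaling, a normalized score of 1 corresponds to the average positive connection for that query and -1 to the average negative one, which is what makes scores from different queries comparable.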
Finally, the statistical significance of the connections, adjusted for multiple hypotheses, is estimated. FDR q-values are estimated by comparing the distributions of treatments to null signatures in the dataset.
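One common way to turn such a treatment-versus-null comparison into q-values is an empirical FDR estimate, sketched below. This is an assumption for illustration only: the exact estimator used by the tool is not described here. For each treatment score, the sketch divides the fraction of null scores at least as extreme by the fraction of treatment scores at least as extreme, capped at 1.

```python
def empirical_fdr_q(treatment_scores, null_scores):
    """Empirical FDR q-value for each treatment score (illustrative)."""
    q_values = []
    for s in treatment_scores:
        threshold = abs(s)
        null_frac = sum(abs(x) >= threshold for x in null_scores) / len(null_scores)
        # The score itself always counts, so this fraction is never zero.
        trt_frac = sum(abs(x) >= threshold for x in treatment_scores) / len(treatment_scores)
        q_values.append(min(1.0, null_frac / trt_frac))
    return q_values


# Toy normalized scores for treatments and for null signatures.
treatments = [2.0, 1.0, 0.5, -0.2]
nulls = [0.3, -0.4, 0.1, 0.6]
q = empirical_fdr_q(treatments, nulls)
```

Scores that no null signature reaches get a q-value of 0 in this sketch; a practical implementation would typically floor such values before taking logarithms.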
Outputs: the tool produces the following output (in the results folder):
- arfs/: Per-query analysis report files (ARFs)
  - <QUERY_NAME>/query_result.gct: a GCT format text file listing the annotations, connectivity scores and q-values for each signature in the dataset. The following fields are computed by the query tool:
    - raw_cs: Raw connectivity scores
    - norm_cs: Normalized connectivity score, computed by dividing the raw connectivity scores by the signed-mean scores of signatures (specified by the is_ncs_sig field in the signature metadata file). If the ncs_group field is not empty, the scores are normalized within each group; otherwise the scores are normalized using the global means across all signatures.
    - fdr_q_nlog10: Negative log10 transformed FDR q-values estimated relative to the null signatures (specified by the is_null_sig field in the signature annotation file).
- matrices/query: Query parameters and result matrices in GCTx format for all queries:
  - up.gmt, dn.gmt: query genesets in GMT format
  - cs.gctx: Raw connectivity scores matrix [signatures x queries]
  - ncs.gctx: Normalized connectivity score matrix [signatures x queries]
  - fdr_qvalue.gctx: Estimated false discovery rate q-values [signatures x queries]
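Two of these outputs are easy to post-process with plain Python. The sketch below parses GMT text (one set per line: name, description, then members, tab-separated) and converts an fdr_q_nlog10 value back to a q-value. The function names and sample strings are hypothetical, and reading the GCT/GCTx matrices is left to a dedicated parser.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {set_name: [members]}."""
    sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        # fields[1] is the free-text description; members start at index 2.
        sets[fields[0]] = [member for member in fields[2:] if member]
    return sets


def q_from_nlog10(value):
    """Invert the negative log10 transform used for fdr_q_nlog10."""
    return 10.0 ** (-value)


# Hypothetical two-query GMT content.
example = "q1\tfirst query\tG1\tG2\nq2\t\tG3\n"
queries = parse_gmt(example)
```

For example, an fdr_q_nlog10 of 2 corresponds to a q-value of 0.01, so larger values in that field indicate more significant connections.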
2. Testing enrichment of user-defined sets using the GSEA Preranked tool
The GSEA Preranked tool computes set-based enrichment analysis against a user-defined rank-ordered dataset. It determines whether a priori defined sets show statistically significant enrichment at either end of the ranking.
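As an illustration of how significance can be assessed for a preranked list, the sketch below compares the observed enrichment score of a set against scores of random sets of the same size drawn from the ranking. The permutation count, seed, function names and toy data are assumptions made for the example; the tool's own procedure and defaults may differ.

```python
import random


def enrichment_score(ranked_scores, hit_flags):
    """Weighted running-sum enrichment score for one set over a ranked list."""
    n_miss = hit_flags.count(False)
    hit_total = sum(abs(s) for s, h in zip(ranked_scores, hit_flags) if h)
    if n_miss == 0 or hit_total == 0:
        return 0.0
    running, best = 0.0, 0.0
    for s, h in zip(ranked_scores, hit_flags):
        running += abs(s) / hit_total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, ranked_scores, gene_set, n_perm=200, seed=0):
    """Return (observed score, empirical two-sided p-value)."""
    rng = random.Random(seed)
    members = set(gene_set) & set(ranked_genes)
    observed = enrichment_score(ranked_scores, [g in members for g in ranked_genes])
    exceed = 0
    for _ in range(n_perm):
        # Null model: a random set of the same size from the same ranking.
        random_set = set(rng.sample(ranked_genes, len(members)))
        null_es = enrichment_score(ranked_scores, [g in random_set for g in ranked_genes])
        if abs(null_es) >= abs(observed):
            exceed += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy ranking of 40 hypothetical genes; the test set is the top five.
genes = ["g%d" % i for i in range(40)]
scores = [20.0 - i for i in range(40)]
es, p = permutation_p_value(genes, scores, genes[:5])
```

Because the test set occupies the very top of the ranking, its score is near the maximum and very few random sets match it, giving a small p-value.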
3. Querying cell viability data with the Curie tool using cell-sets
The Curie tool computes a set-based enrichment similarity between input cell-line sets (aka queries) and a perturbational cell-fitness signature dataset. Note that while the tool is optimized for datasets generated by the PRISM platform, any high-dimensional cell-fitness dataset can be used.