sig_tsne_tool¶

Run T-SNE on a dataset

Synopsis¶

sig_tsne_tool [--ds DS] [--ds_meta DS_META] [--is_pairwise IS_PAIRWISE] [--cid CID] [--rid RID] [--row_space ROW_SPACE] [--sample_dim SAMPLE_DIM] [--out_dim OUT_DIM] [--algorithm ALGORITHM] [--initial_dim INITIAL_DIM] [--perplexity PERPLEXITY] [--theta THETA] [--missing_action MISSING_ACTION] [--missing_fill_value MISSING_FILL_VALUE] [--disable_table DISABLE_TABLE]

Arguments¶

--ds DS : Input dataset

--ds_meta DS_META : Optional annotations as a TSV table for the input dataset for the dimension being operated on. The first column must match the corresponding id field in ds

--is_pairwise IS_PAIRWISE : Handles the input dataset as a distance or similarity matrix if true. Expects the the input to be square and symmetric. Assumes the values are similarities if the main diagonal is one or distances if the main diagonal is zero. Skips the initial dimensionality reduction (via PCA) and pairwise euclidean distance computation and uses the tsne_d algorithm to perform the low-dimensional embedding. Default is 0

--cid CID : List of column ids to use specified as a GRP file or cell array. If empty all columns are used.

--rid RID : List of row ids to to use specified as a GRP file or cell array. If empty all rows are used

--row_space ROW_SPACE : Common row-id space definitions to use as an alternative to the rid parameter. Default is all. Options are {all|lm|bing|aig|lm_probeset|bing_probeset|full_probeset|custom}

--sample_dim SAMPLE_DIM : Sample dimension of the dataset. Default is column. Options are {1|2|column|row}

--out_dim OUT_DIM : Output dimensionality. Default is 2

--algorithm ALGORITHM : The t-SNE implemention to use. The standard algorithm is a native matlab implementation that is appropriate for small to moderate sized datasets and if more than 2 output dimensions are required. The Barnes Hut algorithm is a fast C++ implementation suitable for 2D tSNE representation of large datasets (>5000 samples).. Default is auto. Options are {auto|standard|barnes-hut}

--initial_dim INITIAL_DIM : Initial number of PCA dimensions to use. Default is 50

--perplexity PERPLEXITY : Perplexity is a measure for information that is defined as 2 to the power of the Shannon entropy. It may be viewed as a tuning parameter that sets the number of effective nearest neighbors. It is comparable to the number of nearest neighbors k that is employed in many manifold learners. The performance of t-SNE is fairly robust under different settings of the perplexity. The most appropriate value depends on the density of the data. In general a denser dataset requires a larger perplexity. Typical values for the perplexity range between 5 and 50. Default is 30

--theta THETA : Used only in the Barnes-Hut implementation. It's a trade-off parameter to choose between speed and accuracy: theta = 0 corresponds to standard, slow t-SNE, while theta = 1 makes very crude approximations. Appropriate values for theta are between 0.1 and 0.7. Default is 0.5

--missing_action MISSING_ACTION : Action to take if data contains missing values. If 'drop' is specified the entire column (or row if sample_dim='row') is excluded prior to analysis. If 'impute' is specified the missing values are replaced by row means (or column means if sample_dim='row'). If 'fill' is specified, missing values are replaced with 'missing_fill_value'. Default is none. Options are {none|drop|impute|fill}

--missing_fill_value MISSING_FILL_VALUE : Replace missing data with specified value if the 'fill' option is specified for missing_action. Default is 0

--disable_table DISABLE_TABLE : Disable generating annotated text table for first two TSNE components. The table can be generated post-hoc from the saved tsne.gctx matrix if needed.. Default is 0

Description¶

Applies t-distributed stochastic neighbor embedding (t-SNE) to high dimensional datasets and returns a 2-d mapping of datapoints. t-SNE is a dimensionality reduction technique that is particularly well suited for visualization of high dimensional data in 2 or 3 dimensions.

For datasets with <= 5000 samples, the standard t-SNE algorithm is used. For larger datasets the Barnes-HUT algorithm is employed.

For details see http://homepage.tudelft.nl/19j49/t-SNE.html

Examples¶

tSNE with default parameters

sig_tsne_tool --ds 'x.gctx

tSNE along rows of a large dataset with >5000 rows

sig_tsne_tool --ds 'large.gctx' --dim row --algorithm barnes-hut