Ensemble methods

The ensemble methods are based on [Ruyssinck2014]. The three main executables share the same options and work the same way: each resamples the expression data along both samples and genes, which typically reduces the variance of the predictions (a conceptual sketch of a single resampling draw follows the list):

  • el-ensemble: Uses an ensemble of Elastic Net regression predictors.
  • svm-ensemble: Uses an ensemble of Support Vector Machine predictors.
  • llr-ensemble: Uses an ensemble of Support Vector Machine predictors (via LibLinear; see "The difference between SVM and LLR" below).
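
To build intuition, here is a conceptual shell sketch of a single resampling draw. The executables do this internally, so the snippet is purely illustrative, and matrix.tsv is a placeholder name:

# Draw 8 random rows (samples) from a whitespace-separated matrix ...
shuf -n 8 matrix.tsv > subset_rows.tsv
# ... then keep a subset of its columns (genes), here columns 1, 3 and 4.
cut -d ' ' -f 1,3,4 subset_rows.tsv > subset.tsv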

The Elastic Net code uses the GLMNET Fortran backend from [Friedman2010].

Running Ensembles

Each ensemble needs a minimum of two input files:

  • -i, --infile: An expression matrix (genes are columns, samples are rows) without headers.
  • -g, --genes: A file containing gene names that correspond to columns in the expression matrix.

Here is an example matrix containing expression data for five genes in ten samples:

0.4254475 0.0178292 0.9079888 0.4482474 0.1723238
0.4424002 0.0505248 0.8693676 0.4458513 0.1733112
1.0568470 0.2084539 0.4674478 0.5050774 0.2448833
1.1172264 0.0030010 0.3176543 0.3872039 0.2537921
0.9710677 0.0010565 0.3546514 0.4745322 0.2077183
1.1393856 0.1220468 0.4024654 0.3484362 0.1686139
1.0648694 0.1405077 0.4817628 0.4748571 0.1826433
0.8761173 0.0738140 1.0582917 0.7303661 0.0536562
1.2059661 0.1534070 0.7608608 0.6558457 0.1577311
1.0006755 0.0789863 0.8036309 0.8389751 0.0883061

In the genes file, we provide the column names of the expression matrix, in order (a sketch for deriving both files from a headered matrix follows the listing):

G1
G2
G3
G4
G5
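
If your expression data still carries a header row, both input files can be derived from it with standard shell tools. This is only a sketch: expr_with_header.tsv is a placeholder name, and tab-separated data is assumed:

# Turn the header row into one gene name per line ...
head -n 1 expr_with_header.tsv | tr '\t' '\n' > genes.txt
# ... and keep everything below it as the headerless expression matrix.
tail -n +2 expr_with_header.tsv > expr_mat.tsv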

With that, we can run the ensembles:

el-ensemble -i expr_mat.tsv -g genes.txt
svm-ensemble -i expr_mat.tsv -g genes.txt
llr-ensemble -i expr_mat.tsv -g genes.txt

The output is a square matrix of scores in which rows and columns follow the order of the genes file: the entry in row i, column j is the score of gene j as a predictor of target gene i, and the diagonal is zero:

0       0   0.876   0.124   0
0.894   0   0.106   0       0
0.894   0   0       0.106   0
0.894   0   0.106   0       0
0.894   0   0.106   0       0

Optional arguments for the Ensemble methods

  • -s, --scale: Triggers feature scaling of the expression matrix before the regression is computed. This should generally be enabled.
  • -X, --max-experiment-size: In each resampling iteration, sample at most this many rows (experiments) of the dataset.
  • -x, --min-experiment-size: In each resampling iteration, sample at least this many rows (experiments) of the dataset.
  • -P, --max-predictor-size: In each resampling iteration, sample at most this many columns (predictor genes) of the dataset.
  • -p, --min-predictor-size: In each resampling iteration, sample at least this many columns (predictor genes) of the dataset.
  • -e, --ensemble: Perform this many resampling iterations for each gene.

The sampling boundaries -X, -x, -P and -p default to 4/5 of the number of samples/predictors for the upper bound and 1/5 for the lower. In runs with few samples (<50), these should be raised manually to avoid undersampling; in such cases I suggest 90% of the sample count for the upper boundary and 50% for the lower. Note that the options take absolute numbers, not fractions: e.g., if you have 50 samples and want bounds of 50% and 90%, set -x 25 -X 45.
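
To illustrate how those percentages translate into absolute numbers, the following sketch derives the 50%/90% bounds from the row count of the expression matrix (shell integer arithmetic truncates, which is acceptable for a lower bound):

# Count samples (rows) and derive 50%/90% resampling bounds.
NSAMPLES=$(wc -l < expr_mat.tsv)
XMIN=$(( NSAMPLES * 50 / 100 ))
XMAX=$(( NSAMPLES * 90 / 100 ))
llr-ensemble -i expr_mat.tsv -g genes.txt -x "$XMIN" -X "$XMAX"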

Running ensembles for a subset of genes

Often we are interested in only a small number of genes. We can instruct the ensembles to calculate only interactions involving those genes by providing their names in a file passed via -t, --targets:

G3
G4

We then run the executables with the -t, --targets option:

llr-ensemble -i expr_mat.tsv -g genes.txt -t targets.txt
svm-ensemble -i expr_mat.tsv -g genes.txt -t targets.txt
el-ensemble -i expr_mat.tsv -g genes.txt -t targets.txt

In this case we receive an edge list as output, with columns for the target gene, the predictor gene, and the score (a filtering example follows the listing):

G3  G1  0.894
G3  G2  0
G3  G4  0.106
G3  G5  0
G4  G1  0.894
G4  G2  0
G4  G3  0.106
G4  G5  0
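
Because the edge list is plain text, it is easy to post-process with standard tools. For example, assuming the output was saved to a file edges.tsv, the following keeps only edges scoring above 0.5:

awk '$3 > 0.5' edges.tsv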

Running Ensembles in MPI mode

Each ensemble can use parallel processing. For general information on how to run parallel algorithms in seidr, please see "Using multiple processors to infer networks".
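
As a minimal sketch, assuming an MPI-enabled build with mpirun available (the linked section is authoritative for the exact options), a run across four processes could look like this:

mpirun -np 4 llr-ensemble -i expr_mat.tsv -g genes.txt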

The difference between SVM and LLR

LLR and SVM are based on different C implementations of SVMs: LLR uses LibLinear, while SVM uses LibSVM with a linear kernel. Although the two should agree most of the time, they handle coefficients differently. SVM is closer to the reference implementation of [Ruyssinck2014], but LLR is much faster.