ARACNE

ARACNE ([Margolin2006]) is an inference algorithm based on mutual information and applies data processing inequality to delete most indirect edges.

Our implementation differs to the original in that it estimates the initial mutual information using a B-spline approach as described in [Daub2004] .

Running ARACNE

ARACNE needs a minimum of two input files:

  • -i, --infile: An expression matrix (genes are columns, samples are rows) without headers.
  • -g, --genes: A file containing gene names that correspond to columns in the expression matrix.

Here is an example matrix containing expression data for five genes in ten samples:

0.4254475 0.0178292 0.9079888 0.4482474 0.1723238
0.4424002 0.0505248 0.8693676 0.4458513 0.1733112
1.0568470 0.2084539 0.4674478 0.5050774 0.2448833
1.1172264 0.0030010 0.3176543 0.3872039 0.2537921
0.9710677 0.0010565 0.3546514 0.4745322 0.2077183
1.1393856 0.1220468 0.4024654 0.3484362 0.1686139
1.0648694 0.1405077 0.4817628 0.4748571 0.1826433
0.8761173 0.0738140 1.0582917 0.7303661 0.0536562
1.2059661 0.1534070 0.7608608 0.6558457 0.1577311
1.0006755 0.0789863 0.8036309 0.8389751 0.0883061

In the genes files, we provide the column headers for the expression matrix in order:

G1
G2
G3
G4
G5

With that, we can run ARACNE:

mi -m ARACNE -i expr_mat.tsv -g genes.txt

The output is a lower triangular matrix of scores:

0
0.798215    0.874873
0           0.889133    0
0           0           0.860645    0.95965

Tuning the number of bins and spline degree

Estimating mutual infofrmation from discrete data is well defined, but normalized expression data is usually continuous. To estimate the MI from continuous data, each data point is usually assigned to one bin. This can lead to a loss of information.

The B-Spline estimator for MI therefore performs fuzzy assignment of the data to bins. The -s, --spline parameter controls the spline degree (therefore the shape) of the indicator function. For s=1 the indicator function is the same as for simple binning. Improvements in the MI beyond a degree of s=3 are rarely seen, therefore it is a good choice as a default.

The number of bins used in the assignment can be controlled with the -b, --bins option. By default it is automatically inferred from the data, but this can lead to high memory requirements on large datasets. Generally, the number of bins is assumed not to influence the MI much as long as it’s within a reasonable range. A value between 5 and 10 is a good starting point for typically sized datasets from RNA-Seq.

Running ARACNE for a subset of genes

Often we have only a small number of genes of interest. We can instruct ARACNE to only calculate interactions involving those genes by providing a -t, --targets file containing these gene names:

G3
G4

And running it with the -t, --targets options:

mi -m ARACNE -i expr_mat.tsv -g genes.txt -t targets.txt

In this case we will receive an edge list as output:

G3  G1  0.798215
G3  G2  0.874873
G3  G4  0
G3  G5  0.860645
G4  G1  0
G4  G2  0.889133
G4  G3  0
G4  G5  0.95965

Running ARACNE in MPI mode

ARACNE can use parallel processing in the MI estimation step. For general info on how to run parallel algorithms in seidr, please see Using multiple processors to infer networks