Getting Started

Tutorial Data

The tutorial data is a set of 500 salmon pseudo-alignments targeting Saccharomyces cerevisiae. They originate from public SRA data; in fact, the file names are simply their SRA accessions. In this tutorial, we will go through a simple pre-processing step and then create individual and aggregate networks. First, let’s get the data:

mkdir seidr_tutorial
cd seidr_tutorial
wget https://bschiffthaler.s3-eu-west-1.amazonaws.com/SeidrPublic/seidr_tutorial.tar.gz
tar -xvf seidr_tutorial.tar.gz
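
If you want to verify the download, a quick check of the extracted contents could look like this (assuming the archive unpacks into a seidr_tutorial/ sub-directory relative to your working directory, which is what the R code below expects):

ls seidr_tutorial/tx2gene.tsv
ls seidr_tutorial/*_quant.sf | wc -l    # should report 500 quantification files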

Pre-processing

Data pre-processing is done in R. We’ll use tximport to load the counts into R and then DESeq2 to variance-stabilize them.

library(tximport)
library(DESeq2)
library(readr)

# This loads the transcript-to-gene mapping into R so that tximport can
# summarize transcript counts to the gene level
tx2g <- read_tsv('seidr_tutorial/tx2gene.tsv',
                 col_names=c('Transcript', 'Gene'))

# Now let's find all our count files and load them into R using tximport
input_files <- dir('seidr_tutorial', pattern='_quant.sf$', full.names=TRUE)
txi <- tximport(input_files, 'salmon', tx2gene=tx2g)

# In order to make a DESeq2 data set, we need some metadata. For now we'll
# just use some dummy data
dummy_meta <- data.frame(N = seq_along(input_files))
dds <- DESeqDataSetFromTximport(txi, dummy_meta, ~1)

# Now we run variance stabilization and get the stabilized data as a matrix. If
# you have good metadata, you can use the experimental design in the DESeqDataSet
# and set blind=FALSE here
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
vst <- t(assay(vsd))

# Genes that do not vary at all (or whose variance is undefined) create
# problems down the line, so it's better to drop them
vars <- apply(vst, 2, var)
filt_id <- which(is.finite(vars) & vars > 0)
vst <- vst[, filt_id]

# Let's also center samples around their median, which has been shown to
# improve reconstruction accuracy
medians <- apply(vst, 1, median)
vst <- sweep(vst, MARGIN=1, FUN='-', STATS=medians)

# MASS's write.matrix function is a bit faster and better suited for our task
# compared to write.table. Don't forget to unname(), otherwise you will have
# column headers in the output
MASS::write.matrix(x=unname(vst), sep='\t', file='expression.tsv')

# Finally, let's write the column headers (== gene names) as a text file
write(colnames(vst), file="genes.txt")
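
Back in the shell, it's worth a quick sanity check that the two files line up: expression.tsv should have one row per sample and as many columns as genes.txt has lines. Standard Unix tools are enough for this:

wc -l expression.tsv                            # one row per sample (500)
wc -l genes.txt                                 # one line per gene
head -n 1 expression.tsv | awk '{print NF}'     # should match the line count of genes.txt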

All versus all networks

Sub-setting the data

In this section, we will create an all vs. all comparison, meaning we will estimate the connectivity of every gene to every other gene. This approach is the most resource demanding, so we’ll create a smaller subset of the tutorial data first. We’ll take 100 samples and the first 1000 genes. Our output network will therefore contain \(\frac{1000 \cdot 999}{2} = 499500\) possible edges.

tail -n 100 expression.tsv | cut -f 1-1000 > expression_sub.tsv
head -n 1000 genes.txt > genes_sub.txt
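
Optionally, confirm the dimensions of the subset before moving on:

wc -l expression_sub.tsv                           # expect 100 samples
head -n 1 expression_sub.tsv | awk '{print NF}'    # expect 1000 genes
wc -l genes_sub.txt                                # expect 1000 gene names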

Network inference

Now that we have a data subset, we can get started with the inference. In this step, we’ll create 13 different all vs. all networks using algorithms that seidr ships with. If there is an inference algorithm you would like to include that is not yet implemented in seidr, you can run that as well, but make sure its output is in a format seidr can import (Importing text based formats into Seidr). Even though this is a subset, you’ll probably need to set aside an hour for the inference. If you want a quicker run, you can leave out el-ensemble. You’ll see that we use the --scale option a number of times. This instructs seidr to perform feature scaling on the data, which, in general, will improve the results:

# fast
correlation -m pearson -i expression_sub.tsv -g genes_sub.txt --scale -o pearson_scores.tsv
correlation -m spearman -i expression_sub.tsv -g genes_sub.txt -o spearman_scores.tsv
pcor -i expression_sub.tsv -g genes_sub.txt --scale -o pcor_scores.tsv

# medium
mi -m RAW -i expression_sub.tsv -g genes_sub.txt -o mi_scores.tsv
mi -m CLR -i expression_sub.tsv -g genes_sub.txt -M mi_scores.tsv -o clr_scores.tsv
mi -m ARACNE -i expression_sub.tsv -g genes_sub.txt -M mi_scores.tsv -o aracne_scores.tsv

# slow
narromi -m interior-point -i expression_sub.tsv -g genes_sub.txt -o narromi_scores.tsv
plsnet -i expression_sub.tsv -g genes_sub.txt -o plsnet_scores.tsv --scale
llr-ensemble -i expression_sub.tsv -g genes_sub.txt -o llr_scores.tsv --scale
svm-ensemble -k POLY -i expression_sub.tsv -g genes_sub.txt -o svm_scores.tsv --scale
genie3 -i expression_sub.tsv -g genes_sub.txt -o genie3_scores.tsv --scale
tigress -i expression_sub.tsv -g genes_sub.txt -o tigress_scores.tsv --scale

# very slow
el-ensemble -i expression_sub.tsv -g genes_sub.txt -o elnet_scores.tsv --scale
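
Once everything has finished, each of the 13 algorithms should have left behind a tab-separated score file. Assuming you kept the -o file names used above, a simple count confirms that nothing is missing:

ls *_scores.tsv | wc -l    # expect 13 score files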

Network ranking

Different inference algorithms output networks with different metrics for edge weights. A correlation network will assign scores in \([-1, 1]\), whereas mutual information scores lie in \([0, N]\), and many of the regression algorithms score edges in \([0, 1]\). We can therefore not simply sum the weights to get a final, community network. Instead, we convert the scores to ranks. The command seidr import takes care of that. Some useful options:

  • -A: This option computes the rank on the absolute value of the score, so -1 and +1 would get the same rank.
  • -r: This option indicates that higher scores are better. A score of 1 would get a lower (== better) rank than a score of 0.5.
  • -u: This option creates an undirected network. We use this in algorithms where we know the output is symmetric (A->B and B->A are the same), but only have the lower triangular matrix. Examples are all correlation and mutual information based methods.
  • -z: This option drops edges with a score of 0. By default we keep all edges, but this will create sparser networks for methods that output 0-valued edges.

seidr import -A -r -u -n PEARSON -o pearson_scores.sf -F lm -i pearson_scores.tsv -g genes_sub.txt
seidr import -A -r -u -n SPEARMAN -o spearman_scores.sf -F lm -i spearman_scores.tsv -g genes_sub.txt
seidr import -A -r -u -n PCOR -o pcor_scores.sf -F lm -i pcor_scores.tsv -g genes_sub.txt

seidr import -r -u -n MI -o mi_scores.sf -F lm -i mi_scores.tsv -g genes_sub.txt
seidr import -r -u -z -n CLR -o clr_scores.sf -F lm -i clr_scores.tsv -g genes_sub.txt
seidr import -r -u -z -n ARACNE -o aracne_scores.sf -F lm -i aracne_scores.tsv -g genes_sub.txt

seidr import -r -z -n NARROMI -o narromi_scores.sf -F m -i narromi_scores.tsv -g genes_sub.txt
seidr import -r -z -n PLSNET -o plsnet_scores.sf -F m -i plsnet_scores.tsv -g genes_sub.txt
seidr import -r -z -n LLR -o llr_scores.sf -F m -i llr_scores.tsv -g genes_sub.txt
seidr import -r -z -n SVM -o svm_scores.sf -F m -i svm_scores.tsv -g genes_sub.txt
seidr import -r -z -n GENIE3 -o genie3_scores.sf -F m -i genie3_scores.tsv -g genes_sub.txt
seidr import -r -z -n TIGRESS -o tigress_scores.sf -F m -i tigress_scores.tsv -g genes_sub.txt
seidr import -r -z -n ELNET -o elnet_scores.sf -F m -i elnet_scores.tsv -g genes_sub.txt
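
To spot-check that an import worked, you can print a few edges from any of the resulting .sf files with seidr view (the column layout is explained in the post-processing section below), for example:

seidr view mi_scores.sf | head -n 3 | column -t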

Aggregating

Aggregating refers to the construction of a community network from the individual networks created before. Note that there are several aggregation methods available. We will use the “Inverse Rank Product” method described in [Zhong2014].

seidr aggregate -m irp -o aggregated.sf aracne_scores.sf clr_scores.sf elnet_scores.sf genie3_scores.sf llr_scores.sf mi_scores.sf narromi_scores.sf pcor_scores.sf pearson_scores.sf plsnet_scores.sf spearman_scores.sf svm_scores.sf tigress_scores.sf

This creates a community network of all 1000 genes in our sample data. If you don’t want to learn how to create a network for a specific group of genes (e.g. only transcription factors), jump right to Post processing.

Taking a look at the final network

We can have a look at the top three edges in the network:

seidr top -n 3 aggregated.sf | column -t
YDL039C    YDL037C    Undirected  0.777779;83  14.1252;1   0.511;433  3.61084;7   0.138;61223.5  0.777779;204  0.709272;39   0.0787986;93     0.837201;940  1.5595;25      0.780375;1512  nan;nan        1;1.5       0.949186;3
YDL025W-A  YBL006W-A  Undirected  1.05879;2    8.3004;29   0.543;4    3.30464;21  0.496;2775     1.05879;2     0.372718;529  0.0172536;31848  0.909562;219  0.654752;1758  0.847192;364   0.463;4415     0.8677;144  0.967855;2
YAL037C-B  YCR013C    Directed    1.00539;7    9.08174;14  0.519;168  2.82549;92  0.517;407      1.00539;7     0.777048;28   0.032911;3601    0.928263;137  0.991901;297   0.846061;373   0.204;12028.5  0.99325;16  1;1

Most of these involve dubious ORFs (which perhaps should have been filtered out beforehand). The one pair that does not, YDL039C and YDL037C, is definitely a good result, as both genes form the IMI1 protein.

Creating targeted networks

Sometimes we are not interested in the interactions of all genes; we just want to know what our genes of interest look like in the network. We can then run seidr in targeted mode, which computes only what’s necessary to understand that particular group of genes. The slowest of the bunch will probably be the mutual information based algorithms CLR and ARACNe, since they are context dependent and the full mutual information matrix needs to be computed first.

Making targets

Since we need some targets to look at, we select a single transcription factor, FZF1, and store its gene identifier in a file.

echo "YGL254W" > FZF1.txt

Inferring sub-networks

The network inference step is nearly the same, but now we use the full expression set (all ~6500 genes and 500 samples) as well as the FZF1.txt targets file.

# fast
correlation -t FZF1.txt -m pearson -i expression.tsv -g genes.txt --scale -o pearson_fzf1_scores.tsv
correlation -t FZF1.txt -m spearman -i expression.tsv -g genes.txt -o spearman_fzf1_scores.tsv
pcor -t FZF1.txt -i expression.tsv -g genes.txt --scale -o pcor_fzf1_scores.tsv

# medium
mi -t FZF1.txt -m RAW -i expression.tsv -g genes.txt -M mi_full_scores.tsv -o mi_fzf1_scores.tsv
mi -t FZF1.txt -m CLR -i expression.tsv -g genes.txt -M mi_full_scores.tsv -o clr_fzf1_scores.tsv
mi -t FZF1.txt -m ARACNE -i expression.tsv -g genes.txt -M mi_full_scores.tsv -o aracne_fzf1_scores.tsv

# slow
narromi -t FZF1.txt -m interior-point -i expression.tsv -g genes.txt -o narromi_fzf1_scores.tsv
plsnet -t FZF1.txt -i expression.tsv -g genes.txt -o plsnet_fzf1_scores.tsv --scale
llr-ensemble -t FZF1.txt -i expression.tsv -g genes.txt -o llr_fzf1_scores.tsv --scale
svm-ensemble -t FZF1.txt -k POLY -i expression.tsv -g genes.txt -o svm_fzf1_scores.tsv --scale
genie3 -t FZF1.txt -i expression.tsv -g genes.txt -o genie3_fzf1_scores.tsv --scale
tigress -t FZF1.txt -i expression.tsv -g genes.txt -o tigress_fzf1_scores.tsv --scale

# very slow
el-ensemble -t FZF1.txt -i expression.tsv -g genes.txt -o elnet_fzf1_scores.tsv --scale

Importing

Targeted mode outputs results in edge list format, so all our imports now use -F el instead of -F lm or -F m.

seidr import -A -r -u -n PEARSON -o pearson_fzf1_scores.sf -F el -i pearson_fzf1_scores.tsv -g genes.txt
seidr import -A -r -u -n SPEARMAN -o spearman_fzf1_scores.sf -F el -i spearman_fzf1_scores.tsv -g genes.txt
seidr import -A -r -u -n PCOR -o pcor_fzf1_scores.sf -F el -i pcor_fzf1_scores.tsv -g genes.txt

seidr import -r -u -n MI -o mi_fzf1_scores.sf -F el -i mi_fzf1_scores.tsv -g genes.txt
seidr import -r -u -z -n CLR -o clr_fzf1_scores.sf -F el -i clr_fzf1_scores.tsv -g genes.txt
seidr import -r -u -z -n ARACNE -o aracne_fzf1_scores.sf -F el -i aracne_fzf1_scores.tsv -g genes.txt

seidr import -r -z -n NARROMI -o narromi_fzf1_scores.sf -F el -i narromi_fzf1_scores.tsv -g genes.txt
seidr import -r -z -n PLSNET -o plsnet_fzf1_scores.sf -F el -i plsnet_fzf1_scores.tsv -g genes.txt
seidr import -r -z -n LLR -o llr_fzf1_scores.sf -F el -i llr_fzf1_scores.tsv -g genes.txt
seidr import -r -z -n SVM -o svm_fzf1_scores.sf -F el -i svm_fzf1_scores.tsv -g genes.txt
seidr import -r -z -n GENIE3 -o genie3_fzf1_scores.sf -F el -i genie3_fzf1_scores.tsv -g genes.txt
seidr import -r -z -n TIGRESS -o tigress_fzf1_scores.sf -F el -i tigress_fzf1_scores.tsv -g genes.txt
seidr import -r -z -n ELNET -o elnet_fzf1_scores.sf -F el -i elnet_fzf1_scores.tsv -g genes.txt

Aggregating

This is exactly the same as for the full network.

seidr aggregate -m irp -o aggregated_fzf1.sf aracne_fzf1_scores.sf clr_fzf1_scores.sf elnet_fzf1_scores.sf genie3_fzf1_scores.sf llr_fzf1_scores.sf mi_fzf1_scores.sf narromi_fzf1_scores.sf pcor_fzf1_scores.sf pearson_fzf1_scores.sf plsnet_fzf1_scores.sf spearman_fzf1_scores.sf svm_fzf1_scores.sf tigress_fzf1_scores.sf

Taking a look at the final network

Just as before, let’s look at the top three connections of our TF.

seidr top -n 3 aggregated_fzf1.sf
YGL254W  YEL051W  Directed  nan;nan  3.11877;43     0.458;2    1.81032;4     0.004;3226  0.387233;48    nan;nan       -0.00566415;1138  -0.57406;1    0.035306;28   -0.462468;6  0.001;3050.5  0.74525;2  0.874875;3
YGL254W  YGL128C  Directed  nan;nan  0.886213;1151  0.443;4.5  0.266012;279  0.248;86    0.265754;1222  0.0648351;43  0.010073;254      0.453637;131  0.0382971;21  0.404141;49  0.207;210.5   0.19945;7  0.888152;2
YGL254W  YFL044C  Directed  nan;nan  3.24149;31     0.451;3    0.581017;113  0.352;17    0.383645;52    0.1711;4      0.00935282;329    0.477231;78   0.0320321;45  0.444953;11  0.041;1019    0.25265;4  1;1

The top connections are OTU1 (YFL044C), CWC23 (YGL128C), and VMA8 (YEL051W).

Post processing

Pruning noisy edges

In most cases, the community network will be fully dense, meaning every gene is connected to every other gene with some score. Many of these edges are just noise and we would like to prune them. [Coscia2017] have developed a smart approach to pruning noisy edges called “Network backboning”, which we can apply to our community network as follows:

seidr backbone -F 1.28 aggregated.sf
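
The backboned network ends up in aggregated.bb.sf, which the commands below read. To get a rough idea of how much the filter pruned, you can compare edge counts before and after (seidr view prints one edge per line):

seidr view aggregated.sf | wc -l
seidr view aggregated.bb.sf | wc -l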

Viewing edges in the network

The seidr view command offers an interface to query the seidr output. Let’s look at a few edges.

seidr view --column-headers aggregated.bb.sf | head -n 3 | column -t
Source  Target  Type        ARACNE_score;ARACNE_rank  CLR_score;CLR_rank  ELNET_score;ELNET_rank  GENIE3_score;GENIE3_rank  LLR_score;LLR_rank  MI_score;MI_rank  NARROMI_score;NARROMI_rank  PCOR_score;PCOR_rank  PEARSON_score;PEARSON_rank  PLSNET_score;PLSNET_rank  SPEARMAN_score;SPEARMAN_rank  SVM_score;SVM_rank  TIGRESS_score;TIGRESS_rank  irp_score;irp_rank  NC_Score;NC_SDev;SEC;EBC
Q0017   Q0010   Undirected  nan;nan                   3.58054;6569        0.317;14818             0.587286;20930            0.221;43021         0.255911;100939   0.364063;574                0.0246169;10270       0.644044;14824              1.01683;268               0.509318;33915                0.154;13896         0.17055;3855.5              0.440763;1637       0.602226;0.375774;0.459647;130
Q0032   Q0010   Undirected  nan;nan                   2.38815;27858       0.269;18028             0.905035;9646             0.138;61223.5       0.138116;379828   nan;nan                     0.0481595;843         0.679379;10339              1.19476;122               0.386693;81327                0.287;9567          0.0787;8143.5               0.37659;3358        0.621852;0.388965;0.364875;218
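
Since the underlying output is tab-delimited in the column order shown in the header above, standard tools can pull out individual columns; for instance, source, target and the aggregated irp score (the 17th column here) of the first few edges:

seidr view aggregated.bb.sf | cut -f 1,2,17 | head -n 3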

Querying specific nodes or edges

If the seidr output is indexed with the seidr index command, we can query specific nodes and edges.

seidr index aggregated.bb.sf
# Node
seidr view -n YBR142W aggregated.bb.sf
# Edge
seidr view -n YBR142W:YDL063C aggregated.bb.sf

Graph and centrality statistics

Seidr can compute statistics on the entire graph as well as several node centrality measures. Before we do that, it’s best to make sure there are no disconnected nodes in the graph; we drop those with:

seidr reheader aggregated.bb.sf

Then, we can use seidr graphstats to compute graph summary stats.

seidr graphstats aggregated.bb.sf
Number of Nodes:        974
Number of Edges:        4150
Number of Connected Components: 2
Global clustering coefficient:  0.338051
Scale free fit: 0.0555276
Average degree: 8.52156
Average weighted degree:        3.71159
Network diameter:       4.00379
Average path length:    1.66305

Finally, we can compute node centrality statistics with seidr stats:

seidr stats --exact aggregated.bb.sf
seidr view --centrality aggregated.bb.sf | sort -k2g | tail -n 5 | column -t
YBL006W-A  0.002748    1344   14.8438  0.140839     0.0339691
YBL039C    0.00279961  1322   14.4697  0.000898011  0.0339036
YBR142W    0.00282644  6460   14.3319  0.00113027   0.03388
YBL012C    0.00296141  10198  16.4288  0.156134     0.0342437
YAL045C    0.00306967  4364   17.8655  0.234332     0.0344952