SeidrFiles¶
Introduction¶
Seidr employs it’s own file format (called SeidrFile) to store network data. This is done to increase performance, as SeidrFiles are:
- Losslessly compressed using bgzip (to save space)
- Ordered in a lower triangular to enable faster algorithms
- Ranked, so that scores can be rank-aggregated
The SeidrFile header¶
A SeidrFile has a header that keeps information such as the number of edges, nodes, the node names etc. You can view the header of a SeidrFile with the command:
seidr view -H <SeidrFile>
The output might look something like this:
# [G] Nodes: 50
# [G] Edges: 1225
# [G] Storage: Dense
# [G] Algorithms #: 14
# [G] Supplementary data #: 13
# [A] ARACNE
# [A] CLR
# [A] ELNET
# [A] MI
# [A] GENIE3
# [A] LLR
# [A] NARROMI
# [A] PCOR
# [A] PEARSON
# [A] PLSNET
# [A] SPEARMAN
# [A] SVM
# [A] TIGRESS
# [A] irp
# [S] D1
# [S] D2
# [S] D3
# [S] D4
# [S] D5
# [S] D6
# [S] D7
# [S] D8
# [S] D9
# [S] D10
# [S] D11
# [S] D12
# [S] D13
# [R] Version: 0.10.0
# [R] Cmd: seidr aggregate -f -k -m irp aracne.sf clr.sf elnet.sf elranks.sf genie3.sf llr.sf narromi.sf pcor.sf pearson.sf plsnet.sf spearman.sf svm.sf tigress.sf
# [N] G1
# [N] G2
# [N] G3
# [N] G4
# [N] G5
# [N] G6
The SeidrFile body¶
In the main body of a SeidrFile, we store the edges of a network. Specifically, for each edge, we have at least four columns:
- Source: For directed edges, this is the originating node, for undirected edges, this is simply one of the partners
- Target: For directed edges, this is the destination node, for undirected edges, this is simply the other partner
- Type: Undirected if the node is undirected, Directed otherwise
- X_score;X_rank: This column holds the original score for algorithm “X” as well as its computed rank.
Besides these four mandatory columns, a SeidrFile can hold any number of additional score/rank columns if it is an aggregated or otherwise processed file and and additional supplementary column that annotates the edge with extra information. To view the body of a SeidrFile you can use:
seidr view <SeidrFile>
Here is the output of a simple imported network:
G1 G2 Directed 0.004;334084
G3 G1 Directed 0.334;22729.5
G1 G4 Directed 0.071;89307
G4 G2 Directed 0.053;104778
G3 G4 Directed 0.006;282776
And one that is a little more complex, with 14 score/rank columns and a supplementary column at the end. In aggregated SeidrFiles, the representative score/rank is always the rightmost (last) score/rank column:
G2 G1 Directed 0.288087;1.30856e+06 nan;nan 1.87357;106802 0.004;334084 -0.018736;243746 0.0904447;42007 0.244;37455.5 0.0128741;202752 -0.159435;202751 1.07712e-05;360264 -0.00225177;1.32058e+06 0.152;26168 nan;nan 0.978291;117022 11
You might notice the columns with nan:nan as score/rank. Seidr uses nan as a placeholder to denote a missing edge. That means this particular edge (G2 -> G1) was not found in the second and thirteenth algorithms.
The SeidrFile index¶
SeidrFiles can be indexed with the command:
seidr index <SeidrFile>
This will create an index file with the extension .sfi. The index allows us to access edges quickly in a SeidrFile without having to decompress unnecessary data. Some seidr commands therefore need the index. As an example, let’s see what happens if we try to pull out a specific edge from a SeidrFile without an index:
seidr view -n G1000:G3 <SeidrFile>
[ ERROR ][ 2018-05-02T21:35:45 ][ seidr ]: SeidrFile index <SeidrFile.sfi> must exist when using --nodelist
Otherwise, if the index exists:
seidr view -n G1000:G3 ../dream_net1/aggregate/aggregated.sf
G1000 G3 Undirected 0.388607;611152 nan;nan nan;nan 0.001;581639 -0.0200038;209560 0.00623208;1.16541e+06 0.057;174410 0.00177422;752791 -0.0595161;752789 2.76065e-06;1.11154e+06 -0.0432047;834369 0.031;315583 0.0006;123144 0.507107;458113