Using multiple processors to infer networks
A number of computationally intensive network inference algorithms in seidr
are written using a hybrid MPI/OpenMP approach. This allows for shared-memory
parallelism on a single computer and distributed parallelism across many nodes
in a cluster. Some inference algorithms in seidr have been run on hundreds of
CPUs across many nodes in a high performance compute cluster.
Running in OMP mode
By default, if your computer has multiple CPU cores available, seidr will use
as many as it can. If the subprogram has parallel processing support, you can
control the extent of the parallelization with the -O,--threads option.
Example:
# Use all available threads by default:
seidr import ...
# Use two threads
seidr import -O 2 ...
# Use environment variables to control the number of threads
export OMP_NUM_THREADS=2
seidr import ...
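As a minimal sketch of the environment variable approach, the snippet below derives a thread count from the number of online cores and exports it. The choice to leave one core free is only an illustrative assumption, not a seidr recommendation, and `getconf _NPROCESSORS_ONLN` is assumed to be available (it is on most Linux and macOS systems).

```shell
# Count the online CPU cores (nproc is a common alternative on Linux)
n_cores=$(getconf _NPROCESSORS_ONLN)

# Assumption for illustration: leave one core free for the rest of the
# system, but never go below a single thread
n_threads=$(( n_cores > 1 ? n_cores - 1 : 1 ))

export OMP_NUM_THREADS="${n_threads}"
echo "Using ${OMP_NUM_THREADS} of ${n_cores} cores"

# Subsequent calls in this shell are then limited accordingly, e.g.:
# seidr import ...
```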
Running in MPI mode
By default, all inference algorithms will use all available cores to process
data. Let’s use CLR as an example:
mi -m CLR -i expr_mat.tsv -g genes.txt
This will spawn eight compute threads (on my laptop) to process the data.
In order to control the allocated number of CPUs, we can use the -O flag of
the mi program:
mi -O 4 -m CLR -i expr_mat.tsv -g genes.txt
This will use 4 compute threads.
If we want to use multiple nodes, we can run the same command as a child of the mpirun
program. You should first define a hostfile:
mpirun -hostfile myhostfile.cfg mi -m CLR -i expr_mat.tsv -g genes.txt
This will spawn a distributed version of the MI inference, running the maximum
number of OpenMP threads. You can combine mpirun and the program’s -O
argument to control the number of compute threads each MPI worker spawns.
A special note on MPI rank order: the node with the most memory on the cluster you are using should always be rank 0. If there are any high-memory tasks, seidr will assign them to this MPI worker.
For more info on running MPI jobs (including running them on several nodes), please refer to the OpenMPI webpage.
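The hostfile and rank-order advice above could be sketched roughly as follows. The host names (`bigmem01`, `node02`, `node03`) are hypothetical, and the sketch assumes Open MPI's `hostname slots=N` hostfile syntax and its usual behaviour of filling hosts in the order listed, so that the first host receives rank 0. Check this against your own MPI installation before relying on it.

```shell
# Hypothetical host names; replace them with the nodes of your own cluster.
# The high-memory node is listed first so that it is expected to get rank 0.
# slots=1 asks for one MPI worker per node.
cat > myhostfile.cfg <<'EOF'
bigmem01 slots=1
node02 slots=1
node03 slots=1
EOF

# The first listed host is the one we expect to become rank 0
rank0_host=$(head -n 1 myhostfile.cfg | cut -d ' ' -f 1)
echo "Expected rank 0 host: ${rank0_host}"

# Dry run: print the launch command, with four OpenMP threads per MPI worker.
# Remove the leading "echo" to actually start the job.
echo mpirun -hostfile myhostfile.cfg mi -O 4 -m CLR -i expr_mat.tsv -g genes.txt
```

With three nodes and `-O 4`, this layout would use twelve compute threads in total.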
The batchsize argument
All MPI-enabled inference algorithms in seidr have a --batch-size argument.
This is the number of genes a compute thread will process at once before requesting
more from the master thread. Lower batch sizes will lead to more time spent in I/O
operations and more temporary files, but setting it too high might leave compute
threads without work for portions of the run. A good rule of thumb is to set this
to \(\frac{n_{genes}}{n_{nodes}}\). As an example, if I am estimating the
network for 25,000 genes using five nodes, I set --batch-size
to \(\frac{25000}{5} = 5000\). In general, it is safe to let seidr
decide on the batch size.
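The rule of thumb can be computed directly in the shell. This is only a sketch: the variable names are illustrative, the hostfile name is reused from the MPI example, and rounding up when the gene count does not divide evenly is an assumption rather than documented seidr behaviour.

```shell
n_genes=25000   # number of genes in the expression matrix
n_nodes=5       # number of nodes the job will run on

# Ceiling division, so that a remainder does not produce an extra tiny batch
batch_size=$(( (n_genes + n_nodes - 1) / n_nodes ))
echo "batch size: ${batch_size}"

# Dry run: print the command that would pass the value to mi.
# Remove the leading "echo" to actually start the job.
echo mpirun -hostfile myhostfile.cfg mi --batch-size "${batch_size}" -m CLR -i expr_mat.tsv -g genes.txt
```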