Embarrassingly parallel problems
An embarrassingly parallel problem is a problem for which little or no effort is required to separate the problem into a number of parallel tasks [11]
chdb, which stands for Calcul à Haut débit using a database, is a very convenient application for distributing those parallel tasks among several processors. It is a generalist tool, which is able to present the input files taken from a hierarchy and call the same command line for every file.
chdb is written in C++ and uses the MPI library : MPI has a standardized API. There are several MPI implementations, some of them opensource, it is thus a defacto standard and may now be installed on any parallel computer.
Input and output data
The input data can be a set of files read from a single hierarchy. The output data can be also a set of files, written to an analogous hierachy (intermediate directories are created by chdb).
But it will be possible in the near future to read the input data from a specialized database, and to write the output data to some other similar database.
Distribution of jobs and load balancing
The jobs are currently distributed using MPI, using a client-server paradigm : the master process is the server, the other processes (the slaves) are the clients. Thus, you must have at least 2 processes running. More generally speaking if you want N slaves working you must ask for N+1 MPI processes.
one input file per job in a first-in, first-out basis. However, files may be sorted in alphabetical order, or sorted in size : doing the hypothesis that the biggest the file, the longest the job (which is true for certain applications), files are presented to the processors from biggest to smallest, which can lead to some load balancing improvement.
We are currently working on a more sophisticated distribution method, taking advantage of the architecture of the target machine.
Future improvements
Many improvements are planned :
- Storage of the input files and output files in a database
- Interrupt and restart of the processing (checkpointing)
- Read two input directories and combine each pair of files as data input
- Distribution of jobs on the processors using MPI for the internode communication, and forking several processes inside the nodes
Reading the documentation
The chdb documentation for use in Calmip is available here.
Licensing and download
chdb is currently only available for the CALMIP users, however its distribution under a free license is planned in the next months.