In Bash you can start new processes in the background simply by appending an ampersand (&) to a command. The wait command can be used to wait until all background processes have finished (to wait for a particular process, do wait PID, where PID is a process ID). So here's a simple sketch for parallel processing:
```shell
NPROC=0
for ARG in "$@"; do
    command "$ARG" &
    NPROC=$((NPROC+1))
    if [ "$NPROC" -ge 4 ]; then
        wait        # wait for the whole batch of 4 to finish
        NPROC=0
    fi
done
wait  # wait for the final, possibly partial batch
```
I.e. you run four processes at a time and wait until all of them have finished before starting the next four. This is a sufficient solution if all of the processes take equally long to finish. However, it is suboptimal if the running times of the processes vary a lot.
A better solution is to track the process IDs and poll whether they are still running. In Bash, $! returns the ID of the last initiated background process, and on Linux a process is running exactly when the directory /proc/PID exists.
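To make this concrete, here is a minimal demonstration of $! and the /proc check (Linux-specific, since it relies on the /proc filesystem):

```shell
#!/bin/bash
# Start a background process and capture its PID via $!.
sleep 1 &
PID=$!

# While the process runs, /proc/$PID exists as a directory.
if [ -d "/proc/$PID" ]; then
    echo "process $PID is running"
fi

# wait PID blocks until that specific process has finished.
wait "$PID"
STATUS=$?
echo "process $PID finished with status $STATUS"
```

After the wait call returns, the child has been reaped and /proc/$PID is gone, which is exactly the condition the script below polls for.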
Based on ideas from an Ubuntu forum thread and a template on command line parsing, I wrote a simple script, "parallel", that allows you to run virtually any simple command concurrently.
Assume that you have a program proc and you want to run something like proc *.jpg using three concurrent processes. Then simply do
parallel -j 3 proc *.jpg
The script takes care of dividing the task. Obviously -j 3
stands for three simultaneous jobs.
If you need command line options, use quotes to separate the command from the variable arguments, e.g.
parallel -j 3 "proc -r -A=40" *.jpg
Furthermore, -r allows even more sophisticated commands by replacing each asterisk in the command string with the argument:
parallel -j 6 -r "convert -scale 50% * small/small_*" *.jpg
I.e. this executes convert -scale 50% file1.jpg small/small_file1.jpg
for all the jpg files. This is a real-life example of scaling down images by 50% (requires ImageMagick).
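Under the hood, the -r option boils down to a single Bash pattern substitution: every asterisk in the command string is replaced by the current argument. The variable names below mirror those used in the script; the snippet only builds and prints the command, it does not run convert:

```shell
#!/bin/bash
# Replace every literal * in the command template with the argument.
COMMAND="convert -scale 50% * small/small_*"
INS="file1.jpg"
CMD=${COMMAND//"*"/$INS}
echo "$CMD"   # convert -scale 50% file1.jpg small/small_file1.jpg
```

Note that ${COMMAND//pattern/replacement} replaces all occurrences, which is why both asterisks are substituted in one go.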
Finally, here’s the script. It can easily be modified to handle different jobs, too: just write your command between # DEFINE COMMAND and # DEFINE COMMAND END.
```shell
#!/bin/bash

NUM=0
QUEUE=""
MAX_NPROC=2    # default
REPLACE_CMD=0  # no replacement by default
USAGE="A simple wrapper for running processes in parallel.
Usage: `basename $0` [-h] [-r] [-j nb_jobs] command arg_list
    -h          Shows this help
    -r          Replace asterisk * in the command string with argument
    -j nb_jobs  Set number of simultaneous jobs [2]
Examples:
    `basename $0` somecommand arg1 arg2 arg3
    `basename $0` -j 3 \"somecommand -r -p\" arg1 arg2 arg3
    `basename $0` -j 6 -r \"convert -scale 50% * small/small_*\" *.jpg"

function queue {
    QUEUE="$QUEUE $1"
    NUM=$(($NUM+1))
}

function regeneratequeue {
    # rebuild the queue, keeping only PIDs that are still running
    OLDREQUEUE=$QUEUE
    QUEUE=""
    NUM=0
    for PID in $OLDREQUEUE; do
        if [ -d /proc/$PID ]; then
            QUEUE="$QUEUE $PID"
            NUM=$(($NUM+1))
        fi
    done
}

function checkqueue {
    OLDCHQUEUE=$QUEUE
    for PID in $OLDCHQUEUE; do
        if [ ! -d /proc/$PID ]; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}

# parse command line
if [ $# -eq 0 ]; then # must be at least one arg
    echo "$USAGE" >&2
    exit 1
fi

while getopts j:rh OPT; do # "j:" waits for an argument, "h" doesn't
    case $OPT in
    h)  echo "$USAGE"
        exit 0 ;;
    j)  MAX_NPROC=$OPTARG ;;
    r)  REPLACE_CMD=1 ;;
    \?) # getopts issues an error message
        echo "$USAGE" >&2
        exit 1 ;;
    esac
done

# Main program
echo "Using $MAX_NPROC parallel threads"
shift `expr $OPTIND - 1` # shift input args, ignore processed args

COMMAND=$1
shift

for INS in "$@"; do # for the rest of the arguments
    # DEFINE COMMAND
    if [ $REPLACE_CMD -eq 1 ]; then
        CMD=${COMMAND//"*"/$INS}
    else
        CMD="$COMMAND $INS" # append args
    fi
    echo "Running $CMD"
    $CMD &
    # DEFINE COMMAND END

    PID=$!
    queue $PID

    while [ $NUM -ge $MAX_NPROC ]; do
        checkqueue
        sleep 0.4
    done
done
wait # wait for all processes to finish before exit
```