In Bash you can start new processes in the background simply by appending an ampersand & to a command. The wait command waits until all background processes have finished (to wait for a particular process, do wait PID, where PID is its process ID). So here is simple pseudocode for parallel processing:
NPROC=0
for ARG in "$@"; do
    command "$ARG" &
    NPROC=$((NPROC+1))
    if [ "$NPROC" -ge 4 ]; then
        wait
        NPROC=0
    fi
done
I.e. you run four processes at a time and wait until all of them have finished before starting the next four. This is a sufficient solution if all of the processes take equally long to finish, but it is suboptimal if the running times of the processes vary a lot.
A better solution is to track the process IDs and poll whether they are still running. In Bash, $! returns the ID of the last background process that was started, and a running process has a corresponding directory /proc/PID.
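As a minimal illustration of the idea (sleep here is just a stand-in for real work):

sleep 10 &          # start some work in the background
PID=$!              # PID of the last background process
if [ -d /proc/$PID ]; then
    echo "process $PID is still running"
else
    echo "process $PID has finished"
fi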
Based on the ideas given in an Ubuntu forum thread and a template for command line parsing, I wrote a simple script “parallel” that allows you to run virtually any simple command concurrently.
Assume that you have a program proc and you want to run something like proc *.jpg using three concurrent processes. Then simply do
parallel -j 3 proc *.jpg
The script takes care of dividing the task. Obviously, -j 3 stands for three simultaneous jobs.
If you need command line options, use quotes to separate the command from the variable arguments, e.g.
parallel -j 3 "proc -r -A=40" *.jpg
Furthermore, -r allows even more sophisticated commands by replacing each asterisk in the command string with the argument:
parallel -j 6 -r "convert -scale 50% * small/small_*" *.jpg
I.e. this executes convert -scale 50% file1.jpg small/small_file1.jpg for each of the jpg files. This is a real-life example of scaling down images by 50% (requires ImageMagick).
Finally, here’s the script. It can easily be adapted to handle different jobs, too: just write your command between # DEFINE COMMAND and # DEFINE COMMAND END.
#!/bin/bash

NUM=0
QUEUE=""
MAX_NPROC=2 # default
REPLACE_CMD=0 # no replacement by default
USAGE="A simple wrapper for running processes in parallel.
Usage: `basename $0` [-h] [-r] [-j nb_jobs] command arg_list
    -h              Shows this help
    -r              Replace asterisk * in the command string with the argument
    -j nb_jobs      Set number of simultaneous jobs [2]
Examples:
    `basename $0` somecommand arg1 arg2 arg3
    `basename $0` -j 3 \"somecommand -r -p\" arg1 arg2 arg3
    `basename $0` -j 6 -r \"convert -scale 50% * small/small_*\" *.jpg"

function queue {
    QUEUE="$QUEUE $1"
    NUM=$(($NUM+1))
}

function regeneratequeue {
    OLDREQUEUE=$QUEUE
    QUEUE=""
    NUM=0
    for PID in $OLDREQUEUE
    do
        if [ -d /proc/$PID ] ; then
            QUEUE="$QUEUE $PID"
            NUM=$(($NUM+1))
        fi
    done
}

function checkqueue {
    OLDCHQUEUE=$QUEUE
    for PID in $OLDCHQUEUE
    do
        if [ ! -d /proc/$PID ] ; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}

# parse command line
if [ $# -eq 0 ]; then # must be at least one arg
    echo "$USAGE" >&2
    exit 1
fi

while getopts j:rh OPT; do # "j:" expects an argument, "h" doesn't
    case $OPT in
    h)  echo "$USAGE"
        exit 0 ;;
    j)  MAX_NPROC=$OPTARG ;;
    r)  REPLACE_CMD=1 ;;
    \?) # getopts issues an error message
        echo "$USAGE" >&2
        exit 1 ;;
    esac
done

# Main program
echo Using $MAX_NPROC parallel threads
shift `expr $OPTIND - 1` # shift input args, ignore processed args
COMMAND=$1
shift

for INS in $* # for the rest of the arguments
do
    # DEFINE COMMAND
    if [ $REPLACE_CMD -eq 1 ]; then
        CMD=${COMMAND//"*"/$INS}
    else
        CMD="$COMMAND $INS" # append args
    fi
    echo "Running $CMD"
    $CMD &
    # DEFINE COMMAND END

    PID=$!
    queue $PID

    while [ $NUM -ge $MAX_NPROC ]; do
        checkqueue
        sleep 0.4
    done
done
wait # wait for all processes to finish before exit
June 3, 2008 at 21:49
Change:
$CMD &
To:
eval "$CMD &"
If you want to do things like:
par.sh 'tr -d " " * > $(basename * .txt)-stripped.txt' *.txt
Without the eval it’ll treat > and $(basename…) as arguments to tr.
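For example (out.txt here is just an illustrative placeholder):

CMD='echo hello > out.txt'
$CMD &            # without eval: ">" and "out.txt" are passed to echo as literal arguments
eval "$CMD &"     # with eval: the redirection is interpreted and out.txt is created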
October 18, 2008 at 13:53
Great script. Curiously when I use it to batch compress a folder of .wav files to .mp3 it doesn’t always take the same amount of time, sometimes finishing around 1m20s, sometimes 1m40s.
October 20, 2008 at 16:30
> Paul
Good point, never thought of that.
> Leon Roy
Hmm, maybe you have some other processes running that occasionally steal your CPU time. There’s a command line utility called htop that you can use to monitor what your CPUs are actually doing…
December 1, 2008 at 12:03
Thank you, great script!
Using it to spawn some server instances.
Any idea how to keep track (in Bash) of the spawned processes and kill them after N seconds?
Joe
December 2, 2008 at 9:40
Thanks!
I don’t know, I never tried that. But since you have the PIDs, you could poll the run times in the checkqueue routine and terminate processes if necessary. I suppose there is a way to get run times in Bash.
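For instance, something along these lines could run alongside checkqueue in the main loop (a sketch only; it assumes a ps that supports the etimes field for elapsed seconds, and MAX_SECONDS is just an illustrative name):

MAX_SECONDS=30
for PID in $QUEUE; do
    if [ -d /proc/$PID ]; then
        AGE=$(ps -o etimes= -p $PID | tr -d ' ')   # elapsed run time in seconds
        if [ "$AGE" -ge "$MAX_SECONDS" ]; then
            kill $PID                              # terminate jobs that exceed the limit
        fi
    fi
done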
June 22, 2010 at 22:47
GNU Parallel http://www.gnu.org/software/parallel makes it possible to distribute the jobs to computers you have ssh access to.
Watch the intro video http://www.youtube.com/watch?v=OpaiGYxkSuQ
September 18, 2010 at 18:32
This is very impressive. I was trying to line up a couple of tasks uploading files to hosting sites, but a single process could not reach the network limit. So I would like to try this script anyway. Thx a lot~
October 22, 2010 at 0:12
Hmmm, I wish there was a way to not have to wait for all the jobs to finish… I want to be able to start a new job as soon as one finishes.
March 17, 2011 at 16:10
The script below is designed to do precisely that.
July 18, 2011 at 22:15
…or you might use parallelstarter, it does exactly that too:
ftp://ente.limmat.ch/pub/software/bash/badi/public_scripts/parallelstarter/
September 27, 2011 at 14:53
Check out sem from GNU Parallel if you want to start jobs in the background but only want a certain number running at the same time:
http://www.gnu.org/s/parallel/sem.html
Together with GNU Parallel it should cover most of the needs for running parallel jobs in bash.
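For example, something roughly along these lines would throttle the convert jobs from the post above (a sketch only; see the sem documentation for the exact options):

for f in *.jpg; do
    sem -j 3 convert -scale 50% "$f" "small/small_$f"   # at most 3 jobs at a time
done
sem --wait                                              # wait for all queued jobs to finish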
January 3, 2011 at 1:15
Very, very useful.
Since I’m using a set of fairly heterogeneous commands, I slightly modified the script so that it takes as input a text file with a set of commands (one per line) to be executed; said file is compiled a priori via a dedicated script.
A general purpose modification: I suggest replacing
$CMD &
with
eval $CMD &
Without eval, any output redirection present in the command being executed (e.g., echo "foo" > goo.text) is not correctly interpreted.
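A minimal sketch of that file-driven variant (the file name commands.txt and the job limit are illustrative assumptions, not part of the original script):

MAX_NPROC=4
while IFS= read -r CMD; do
    [ -z "$CMD" ] && continue                      # skip empty lines
    echo "Running $CMD"
    eval "$CMD" &                                  # eval so redirections and substitutions work
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_NPROC" ]; do
        sleep 0.4                                  # poll until a job slot frees up
    done
done < commands.txt
wait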
April 20, 2011 at 23:22
[…] From my testing it seems to produce varying results, you can use the techniques outlined here: https://pebblesinthesand.wordpress.com/2008/05/22/a-srcipt-for-running-processes-in-parallel-in-bash/ to see if your scripts can benefit from parallel […]
May 20, 2011 at 13:01
[…] processes, and I know that someone out there must have long beat me to the solution. There it was: parallel. It’s a script you can download and make executable in your path, and then run it with a few […]
October 15, 2011 at 2:13
[…] parallel comes to rescue. I first ran into this post https://pebblesinthesand.wordpress.com/2008/05/22/a-srcipt-for-running-processes-in-parallel-in-bash/. But I quickly found GNU parallel was a better choice. Fully loaded with detailed documentation. […]
October 29, 2011 at 18:08
[…] we’re exploring options for getting better build times. We came across this script found here for executing commands and running the processes in […]
December 25, 2012 at 10:26
Really helpful script for supporting multiprocessing systems.
March 30, 2013 at 18:24
Thank you for this nice piece of code!
I know this thread is quite old but two questions come to my mind if somebody could kindly answer:
1) Is copying $QUEUE to $OLDCHQUEUE really needed in line 36?
2) Even more, is the function “checkqueue” worthwhile at all? Couldn’t we just call “regeneratequeue”? I see we save a couple of assignments by not going directly to “regeneratequeue”, but I don’t think that represents a substantial gain. Is there any extra reason to justify having the “checkqueue” function, or is it just a matter of clean design, so that the queue is only regenerated when it’s really necessary?
Thank you!
September 1, 2014 at 23:24
If you want to keep your process pool full, consider doing something like this inside the for loop: while [ $(pgrep -c command) -gt $MAX_PROCESS ]; do sleep $SLEEP_TIME; done. Otherwise, you’ll be waiting on your slowest process to complete before spawning off NPROC threads again.
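In context, the suggestion looks roughly like this (command, MAX_PROCESS and SLEEP_TIME are placeholders from the comment above):

MAX_PROCESS=4
SLEEP_TIME=0.5
for ARG in "$@"; do
    command "$ARG" &
    # throttle: sleep while too many copies of "command" are running
    while [ "$(pgrep -c command)" -gt "$MAX_PROCESS" ]; do
        sleep "$SLEEP_TIME"
    done
done
wait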
November 6, 2014 at 13:05
did someone notice that you misspelled “script” in the heading? ;)
thanks for the nice counting script anyway, cheers!
November 20, 2014 at 3:43
Ha! No. At least I haven’t noticed :)
December 25, 2014 at 3:39
[…] a bash script that looks as if it does something close to what you want to do — it starts up a number of […]
October 6, 2016 at 9:16
[…] A srcipt for running processes in parallel in Bash […]
July 7, 2017 at 16:49
This is a great article! Tried a lot of different ways to throttle my parallel executions. I ended up using wait. Helped me immensely!
Thanks!!!
+ Joe