Running comet on a cluster


We highly recommend running 'comet' on a cluster of machines rather than on just one. The program takes some time to run: an hour or two for early events like symmetry breaking, and overnight if you want to look at later behavior such as motility or pulsatile motion. Running the program concurrently on at least 5 machines (I used 9) lets you efficiently test the effect of changing one parameter through a range of values. You can set it going, come back the following day, and have the whole thing laid out for you. I've found this very useful, and have included a set of scripts to automate the process. These include automatically generating the montage of images seen in the robustness section, so you can quickly scan the effect of varying the parameter.


I recommend putting a default cometparams.ini into a main directory for the data, e.g. ~/runs. In that directory, run the varyset script (included with the source):

varyset <parameter> <startval> <endval> <number of steps>

This will create a subdirectory within the main directory that contains subdirectories numbered 1, 2, 3, etc., each containing a version of the cometparams.ini file with the <parameter> value varying in linear steps between <startval> and <endval>. It will also add the information needed to run the individual comet jobs to ~/joblist.
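To make the "linear steps" concrete, here is a minimal Python sketch of how the parameter values could be spread across the numbered subdirectories. It is not the varyset script itself: the function name linear_steps is hypothetical, and it assumes that both <startval> and <endval> are included in the set of values, which the description above does not state explicitly.

```python
def linear_steps(startval, endval, steps):
    """Return `steps` evenly spaced values from startval to endval.

    Assumption: both endpoints are included (not confirmed for varyset).
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if steps == 1:
        return [startval]
    width = (endval - startval) / (steps - 1)
    return [startval + i * width for i in range(steps)]


# Map each numbered subdirectory (1, 2, 3, ...) to its parameter value.
values = linear_steps(0.0, 1.0, 5)
for run_number, value in enumerate(values, start=1):
    print(run_number, value)
```

Under that assumption, a call such as varyset <parameter> 0 1 5 would give runs 1 to 5 the values 0, 0.25, 0.5, 0.75 and 1.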

If you have access to a cluster with a working job control system, you might want to use that. I had trouble with the job control system on the cluster I was using, and ended up writing my own:

On the head node, I have the

startnewjobs

script running as a cron job every 15 minutes. This checks whether the worker nodes are idle (5-minute load average below a certain threshold) and starts the next job if they are.
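The idle test can be sketched in a few lines of Python. This is only an illustration of the idea, not the startnewjobs script: the function names and the threshold value of 1.0 are hypothetical, and how the loads are collected from the worker nodes is left out. On a single Unix machine, os.getloadavg() returns the 1-, 5- and 15-minute load averages.

```python
import os


def local_five_minute_load():
    """5-minute load average of this machine (Unix only)."""
    return os.getloadavg()[1]


def idle_nodes(five_min_loads, threshold):
    """Names of nodes whose 5-minute load average is below the threshold.

    five_min_loads: dict mapping node name -> 5-minute load average.
    threshold: hypothetical cut-off; the real value is site-specific.
    """
    return sorted(name for name, load in five_min_loads.items()
                  if load < threshold)


# Four busy nodes: none is below the (assumed) threshold of 1.0.
loads = {"b1": 3.15, "b2": 2.93, "b3": 2.87, "b4": 3.07}
print(idle_nodes(loads, threshold=1.0))
```

Using the 5-minute figure rather than the 1-minute one avoids treating a node as free during a brief pause in a running job.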

The jobstat script lists the progress of the running jobs, e.g.

mark@biostar01:/cluster/comet$ jobstat

  Machine  1mld  5mld       ID      R             Frame
       b1  3.69  3.15 05-14-09_0017 1 |T   16|S 109/700 *
       b2  3.39  2.93 05-14-09_0017 8 |T   15|S 116/700 *
       b3  3.38  2.87 05-14-09_0017 4 |T   15|S 112/700 *
       b4  3.55  3.07 05-14-09_0017 6 |T   15|S 114/700 *

0 nodes free
4 jobs waiting
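If you want to post-process this listing, for instance to estimate how far along each run is, a small parser is enough. The sketch below is an assumption-laden illustration: it treats the "109/700" style field under the Frame heading as current frame over total frames, and the function name frame_progress is hypothetical.

```python
import re

# Matches a "current/total" field such as 109/700.
FRAME_FIELD = re.compile(r"(\d+)/(\d+)")


def frame_progress(line):
    """Return current/total as a fraction, or None if the line has no such field."""
    match = FRAME_FIELD.search(line)
    if match is None:
        return None
    current, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None
    return current / total


line = "       b1  3.69  3.15 05-14-09_0017 1 |T   16|S 109/700 *"
print("{:.0%} done".format(frame_progress(line)))
```

Header lines and the "nodes free" / "jobs waiting" summary lines contain no such field, so the function returns None for them.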


If you don't have access to a cluster, there is a single-computer version of startnewjobs,

startjobsloop

which will check ~/joblist for new jobs and run them sequentially.
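The sequential behaviour can be pictured with the following Python sketch. It is not the startjobsloop script: it makes a single pass rather than repeatedly checking for new jobs, the function names are hypothetical, and it assumes the job list holds one shell command per line, which may not match the real ~/joblist format.

```python
import subprocess


def read_jobs(path):
    """Read non-empty lines from a job list file (assumed: one command per line)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def run_jobs_sequentially(jobs, run=None):
    """Run each job in order, waiting for it to finish before starting the next.

    Returns a list of (command, exit_code) pairs. The `run` argument lets
    you substitute a different runner, e.g. for a dry run.
    """
    if run is None:
        def run(command):
            return subprocess.run(command, shell=True).returncode
    return [(command, run(command)) for command in jobs]


# Dry run: report what would be executed without starting anything.
def dry_run(command):
    print("would run:", command)
    return 0


run_jobs_sequentially(["echo first", "echo second"], run=dry_run)
```

Because each job blocks until it exits, only one comet run is active at a time, which is what you want on a single computer.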


The script

makematrix

pulls together an image matrix (as seen in the robustness section) to summarize the effect of varying the parameter. The directory name, time, computer and main section of the cometparams.ini file are converted into an image and included on the left-hand side of the summary, to keep track of the details of the run.