Monitoring and Managing Your Jobs, etc.

Seeing what jobs are running/queued
When will my job start?
Detailed information about your jobs
1. Detailed information about running/queued jobs
2. Detailed information about finished jobs
Viewing output of jobs in progress
Cancelling your jobs
Monitoring the cluster

Seeing what jobs are running/queued

The Slurm command to list what jobs are running or queued is squeue, e.g.

login-1: squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1243530 standard  test2.sh  payerle  R   18:47:23      2 compute-b18-[2-3]
           1244127 standard  slurm.sh    kevin  R    1:15:47      1 compute-b18-4
           1230562 standard  test1.sh  payerle PD       0:00      1 (Resources)
           1244242 standard  test1.sh  payerle PD       0:00      2 (Resources)
           1244095 standard  slurm2.sh   kevin PD       0:00      1 (ReqNodeNotAvail)

The ST column gives the state of the job, with the following codes:

R for Running
PD for PenDing
TO for TimedOut
PR for PReempted
S for Suspended
CD for CompleteD
CA for CAncelled
F for Failed
NF for jobs terminated due to Node Failure
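
As a rough illustration of reading the ST column, the sketch below counts jobs per state in a saved copy of the squeue listing shown above. The names STATE_NAMES, SAMPLE and count_states are hypothetical helpers, not part of Slurm, and the parsing assumes the default column layout with no spaces inside job names.

```python
from collections import Counter

# State codes as listed above (squeue's ST column).
STATE_NAMES = {
    "R": "Running",
    "PD": "Pending",
    "TO": "TimedOut",
    "PR": "Preempted",
    "S": "Suspended",
    "CD": "Completed",
    "CA": "Cancelled",
    "F": "Failed",
    "NF": "Node Failure",
}

# Sample text copied from the squeue listing above.
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1243530 standard  test2.sh  payerle  R   18:47:23      2 compute-b18-[2-3]
           1244127 standard  slurm.sh    kevin  R    1:15:47      1 compute-b18-4
           1230562 standard  test1.sh  payerle PD       0:00      1 (Resources)
           1244242 standard  test1.sh  payerle PD       0:00      2 (Resources)
           1244095 standard  slurm2.sh   kevin PD       0:00      1 (ReqNodeNotAvail)
"""


def count_states(squeue_text):
    """Count jobs per state, using the header row to locate the ST column."""
    lines = [ln for ln in squeue_text.splitlines() if ln.strip()]
    st_index = lines[0].split().index("ST")
    counts = Counter()
    for line in lines[1:]:
        code = line.split()[st_index]
        # Fall back to the raw code if it is not in the table above.
        counts[STATE_NAMES.get(code, code)] += 1
    return dict(counts)


print(count_states(SAMPLE))  # {'Running': 2, 'Pending': 3}
```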

The NODELIST(REASON) field shows which nodes a running job is using. If the job is pending (i.e. not running), it instead gives a short explanation of why the job is not running (as of the last time the scheduler examined the job). Typically one might see something like:

(Resources) if the scheduler is unable to find sufficient idle resources to run your job (i.e. the cluster is too busy to run your job at this time). The job should run once resources become available (i.e. once some currently running jobs complete, freeing resources).
(Priority) if there are other jobs with higher priority ahead of yours in the queue. The job should run once the jobs ahead of it get scheduled.
(AssocGrpCPUMinsLimit) or (AssociationJobLimit): these generally mean that your allocation account has insufficient funds available to complete this job and all currently running jobs charging against that allocation account. See the relevant FAQ entry for more information. This job will only run if the currently running jobs complete using far fewer SUs than predicted (based on their wall time limit) and/or if the allocation account gets replenished.
(QOSResourceLimit) generally occurs only if you have submitted a large number of jobs. Some of those jobs will be held in a pending state to prevent adverse impact on the rest of the cluster. These jobs will typically run once the job count is reduced (by currently running jobs completing). See the relevant FAQ entry for more information.

Typically, if you see something not in the above list, there is a problem and you will want to contact systems staff for assistance.
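
The rule of thumb above can be sketched as a small lookup. This is only an illustration: KNOWN_REASONS and explain_reason are hypothetical names, and the summaries simply paraphrase the list above.

```python
import re

# Reasons described in the list above, with a one-line summary of each.
KNOWN_REASONS = {
    "Resources": "waiting for enough idle resources",
    "Priority": "higher-priority jobs are ahead in the queue",
    "AssocGrpCPUMinsLimit": "allocation account has insufficient funds",
    "AssociationJobLimit": "allocation account has insufficient funds",
    "QOSResourceLimit": "too many jobs submitted; some are held back",
}


def explain_reason(field):
    """Interpret one NODELIST(REASON) value from squeue."""
    text = field.strip()
    match = re.fullmatch(r"\((.+)\)", text)
    if match is None:
        # Not wrapped in parentheses, so it is a node list: the job is running.
        return "running on " + text
    reason = match.group(1)
    return KNOWN_REASONS.get(
        reason, "not in the list above; consider contacting systems staff"
    )


print(explain_reason("(Priority)"))
print(explain_reason("compute-b18-[2-3]"))
print(explain_reason("(ReqNodeNotAvail)"))
```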

The squeue command also takes a wide range of options, including options to control what is output and how. See the man page (man squeue) for more information.

For example: if you add the following to your ~/.aliases file (assuming you are using a C-shell variant):

alias sqp 'squeue -S -Q -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %Q %R"'

then the next time you log in, the sqp command will list jobs in the queue in order of descending priority.

When will my job start?

The scheduler tries to schedule all jobs as quickly as possible, subject to cluster policies, available hardware, allocation priority (contributors to the cluster get higher priority allocations), etc. Typically jobs run within a day or so, but this can vary, since usage of the cluster fluctuates widely at times.

The squeue command, with the appropriate arguments, can show you the scheduler's estimate of when a pending/idle job will start running. It is, of course, just the scheduler's best estimate given current conditions, and the actual time a job starts might be earlier or later than that, depending on factors such as the behavior of currently running jobs, the submission of new jobs, and hardware issues.

To see this, you need to request that squeue show the %S field in the output format option, e.g.

login-1> squeue -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S"
    JOBID PARTITION     NAME     USER ST       TIME  NODES START_TIME
      473  standard test1.sh  payerle PD       0:00      4 2014-05-08T12:44:34
      479  standard test1.sh    kevin PD       0:00      4 N/A
      489  standard tptest1.  payerle PD       0:00      2 N/A

Obviously, the times given are estimates. The job could start earlier if other jobs ahead of it in the queue do not use their full walltime, or could get delayed if jobs with a higher priority than yours are submitted before your start time.
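
To turn the START_TIME column into an approximate wait, one could parse the listing as in the sketch below. SAMPLE is the output shown above; estimated_waits and the fixed "now" value are hypothetical, chosen only so the example is reproducible.

```python
from datetime import datetime

# Sample text copied from the squeue listing above.
SAMPLE = """\
    JOBID PARTITION     NAME     USER ST       TIME  NODES START_TIME
      473  standard test1.sh  payerle PD       0:00      4 2014-05-08T12:44:34
      479  standard test1.sh    kevin PD       0:00      4 N/A
      489  standard tptest1.  payerle PD       0:00      2 N/A
"""


def estimated_waits(squeue_text, now):
    """Map job id -> estimated wait (timedelta), or None if no estimate yet."""
    rows = [ln.split() for ln in squeue_text.splitlines() if ln.strip()]
    waits = {}
    for fields in rows[1:]:  # skip the header row
        job_id, start = fields[0], fields[-1]
        if start == "N/A":
            waits[job_id] = None
        else:
            waits[job_id] = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S") - now
    return waits


now = datetime(2014, 5, 8, 9, 0, 0)  # hypothetical current time for the example
for job_id, wait in estimated_waits(SAMPLE, now).items():
    print(job_id, "no estimate yet" if wait is None else f"starts in about {wait}")
# 473 starts in about 3:44:34
# 479 no estimate yet
# 489 no estimate yet
```

As with the squeue output itself, the result is only as good as the scheduler's estimate at the time the listing was taken.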

Detailed information about your jobs

Detailed information about running/queued jobs

To get more detailed information about a job that is currently running or in the queue, you can use the scontrol show job JOBNUMBER command. This command provides a great deal of detail about your job, e.g.

login-2> scontrol show job 486
JobId=486 Name=test1.sh
   UserId=payerle(34676) GroupId=glue-staff(8675)
   Priority=33 Account=test QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   SubmitTime=2014-05-06T11:20:20 EligibleTime=2014-05-06T11:20:20
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=standard AllocNode:Sid=pippin:31236
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2 NumCPUs=8 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/export/home/pippin/payerle/slurm-tests/test1.sh
   WorkDir=/home/pippin/payerle/slurm-tests
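
Because this output is a series of KEY=VALUE pairs, it is easy to pick out individual fields in a script. The following is a minimal sketch, assuming no value contains whitespace (a job name or command path with spaces would need more careful handling). parse_scontrol is a hypothetical helper, and SAMPLE is an excerpt of the output above.

```python
# Excerpt of the scontrol show job output shown above.
SAMPLE = """\
JobId=486 Name=test1.sh
   UserId=payerle(34676) GroupId=glue-staff(8675)
   Priority=33 Account=test QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   NumNodes=2 NumCPUs=8 CPUs/Task=1 ReqS:C:T=*:*:*
"""


def parse_scontrol(text):
    """Split whitespace-separated KEY=VALUE tokens into a dictionary."""
    info = {}
    for token in text.split():
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = token.partition("=")
        if sep:
            info[key] = value
    return info


job = parse_scontrol(SAMPLE)
print(job["JobId"], job["JobState"], job["Reason"], job["TimeLimit"])
# 486 PENDING Priority 00:03:00
```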

Detailed information about finished jobs

The scontrol command above will only display information about jobs which are either running or in the queue. Often it is useful to get information about jobs which have already finished. The sacct and seff commands allow one to inspect jobs which already completed (although you can look at queued/running jobs with the sacct command as well, most of the interesting information is not available until after the job completes).

The seff command is easier to use, but is rather limited in the information returned. However, as the information it returns is often what we are most interested in, it is a handy command. It takes a single argument, the job number of the job to report on, and returns some basic status information as well as how well the job utilized the CPU and memory that it requested or was allocated. This can be used to fine-tune your job submission parameters to make more efficient use of the cluster, which is something the AAC looks at when reviewing requests for more resources. For example:

login-2:~$ seff 10603094
Job ID: 10603094
Cluster: zaratan
User/Group: payerle/glue-staff
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:51:13
CPU Efficiency: 99.35% of 00:51:33 core-walltime
Job Wall-clock time: 00:51:33
Memory Utilized: 432.00 KB
Memory Efficiency: 0.01% of 3.91 GB

In this example, we see that job 10603094 requested a single CPU core and the default 4 GB per core of memory. It ran for 51 minutes, and made efficient use of the CPU. However, it did not use very much of the memory allocated to it. (For a job requesting the default 4 GB/core this does not matter too much, but it would be a problem if the job were requesting additional memory beyond the standard 4 GB/core.)
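
The two efficiency figures can be reproduced by hand from the other lines of the report. The sketch below does so for the numbers above; the function names are hypothetical, the time parser only handles the HH:MM:SS form shown here, and it assumes that seff's KB and GB are binary units (1 GB = 1024 * 1024 KB).

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """CPU time actually used, as a percentage of cores * wall-clock time."""
    core_walltime = hms_to_seconds(wall_clock) * cores
    return 100.0 * hms_to_seconds(cpu_utilized) / core_walltime


def memory_efficiency(used_kb, allocated_gb):
    """Memory used as a percentage of memory allocated (binary units assumed)."""
    return 100.0 * used_kb / (allocated_gb * 1024 ** 2)


print(f"{cpu_efficiency('00:51:13', '00:51:33', 1):.2f}%")  # 99.35%
print(f"{memory_efficiency(432.00, 3.91):.2f}%")            # 0.01%
```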

A full listing of the options to the sacct command can be found using the man sacct command. We discuss only a small subset of the options here, but hopefully the more commonly used ones. Options can be divided into two broad categories: filtering which jobs/job steps to display, and controlling what information is displayed.

sacct options for filtering jobs

The following options are useful for filtering jobs with sacct:

-A ALLOCATION_ACCOUNT or --accounts=ALLOCATION_ACCOUNT: limit jobs/tasks to those charging against the specified ALLOCATION_ACCOUNT
-a or --allusers: Normally, sacct limits the jobs displayed to those owned by the user running the command. This flag allows viewing jobs for all users.
-E END_TIME or --endtime END_TIME: Restrict display to jobs from before END_TIME. END_TIME should be specified as YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS.
-j JOBLIST or --jobs=JOBLIST: Limit display to the specified job list. JOBLIST should be a comma delimited list of job ids.
-p or --parsable: The default output has fixed field sizes and sometimes truncates useful information. If you give this flag, the sacct command produces machine-parsable output which does not have fixed field lengths but instead delimits fields with the pipe ('|') character, which is useful to see the full fields.
-S START_TIME or --starttime START_TIME: Restrict display to jobs from after START_TIME. See --endtime for time format.
-s STATE or --state=STATE: Limit the jobs displayed to those in the given STATE between the specified start and end times. STATE can be abbreviated, e.g. R for running, CD for completed, PD for pending, etc.
-u USERID or --user USERID: Limit the jobs displayed to those owned by the specified user.
-X or --allocations: Normally the list produced by sacct consists of several lines per job (one for each job step, of which there are typically several per job). If this flag is given, it will display only one line per job, for the main step of the job. Note: information in the main step might or might not be representative of the whole job; in particular the MaxRSS in the main step might be less than that in some of the child steps.
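
When calling sacct from a script, the filters above can be assembled into an argument list. The sketch below only builds the command line and does not run it; sacct_command and its parameter names are hypothetical, and it uses only the flags described in this section plus the -o field list option.

```python
import shlex


def sacct_command(jobs=None, user=None, account=None, start=None, end=None,
                  state=None, fields=("JobID", "State", "Elapsed"),
                  parsable=True, allocations_only=False):
    """Build an argument list for sacct from the filters described above."""
    cmd = ["sacct"]
    if jobs:
        cmd += ["-j", ",".join(str(job) for job in jobs)]
    if user:
        cmd += ["-u", user]
    if account:
        cmd += ["-A", account]
    if start:
        cmd += ["-S", start]
    if end:
        cmd += ["-E", end]
    if state:
        cmd += ["-s", state]
    if allocations_only:
        cmd.append("-X")
    if parsable:
        cmd.append("-p")
    cmd += ["-o", ",".join(fields)]
    return cmd


cmd = sacct_command(user="payerle", start="2014-05-01", end="2014-05-08",
                    state="CD", allocations_only=True)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a login node where sacct is available.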

sacct options for controlling information displayed

There are a lot of fields that the sacct command is able to display for jobs. Generally you will wish to specify what is displayed. This is done with the flag -o FIELDS or --format FIELDS, where FIELDS is a comma delimited list of field names. The following field names are commonly used:

Account: The allocation account the job is charged against
AllocTRES: The TRES (trackable resources) allocated to the job. It includes information on the number of GPUs, CPU cores, nodes, and memory allocated to the job. The field billing is also of note; it indicates the hourly SU cost of the job.
Elapsed: This indicates how long (in walltime) the job has run.
JobID: This contains the ID of the job.
MaxRSS: This is the maximum resident set size (i.e. the maximum amount of memory the job actually used on any node). This represents the amount of memory actually used on the node, as opposed to the value in AllocTRES which represents the amount of memory requested.
MaxRSSNode: This is the name of the node on which the maximum resident set size occurred (see MaxRSS).
Nodelist: This is a list of the node names the job ran on.
Partition: The name of the partition the job ran on.
State: The (final) state of the job.
User: The username of the user the job belongs to.

You can see the maximum memory needed by the job on any of the nodes assigned to the job by looking at the MaxRSS field. You will need to look for the largest value among all of the records displayed for the job. (Note: You should not use the -X or --allocations flag in the sacct command for this, since the maximum memory usage typically occurs in one of the child job steps and is larger than the number displayed for the main job step.) For example:

login-1: sacct -j 1004702 -o JobID,MaxRSS -p
JobID|MaxRSS|
1004702||
1004702.batch|1021268K|
1004702.extern|0|

We see that the maximum value of MaxRSS for job 1004702 is 1021268 KiB, or 1021268 KiB * 1 MiB/1024 KiB * 1 GiB/1024 MiB = 0.97 GiB.
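
The same lookup can be scripted against the pipe-delimited output. This is a sketch only: max_rss and rss_to_kib are hypothetical helpers, SAMPLE is the output shown above, and treating a bare number with no unit suffix as a byte count is an assumption.

```python
# Sample text copied from the sacct -p output above.
SAMPLE = """\
JobID|MaxRSS|
1004702||
1004702.batch|1021268K|
1004702.extern|0|
"""

# Multipliers to convert a suffixed value to KiB.
UNIT_KIB = {"K": 1, "M": 1024, "G": 1024 ** 2, "T": 1024 ** 3}


def rss_to_kib(value):
    """Convert a MaxRSS string such as '1021268K' to KiB; empty means 0."""
    value = value.strip()
    if not value:
        return 0.0
    if value[-1] in UNIT_KIB:
        return float(value[:-1]) * UNIT_KIB[value[-1]]
    return float(value) / 1024  # assumption: a bare number is in bytes


def max_rss(sacct_text):
    """Return (job step, MaxRSS in KiB) for the step with the largest MaxRSS."""
    lines = [ln for ln in sacct_text.splitlines() if ln.strip()]
    col = lines[0].split("|").index("MaxRSS")
    best_step, best_kib = None, -1.0
    for line in lines[1:]:
        fields = line.split("|")
        kib = rss_to_kib(fields[col])
        if kib > best_kib:
            best_step, best_kib = fields[0], kib
    return best_step, best_kib


step, kib = max_rss(SAMPLE)
print(step, f"{kib / 1024 ** 2:.2f} GiB")  # 1004702.batch 0.97 GiB
```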

You can compute the SU cost for the job with the Elapsed and AllocTRES fields. In the AllocTRES field, look for the value associated with the TRES named billing; this is the SU cost per hour of walltime for the job. Then multiply by the elapsed walltime as given in the Elapsed field. You do not need to do this for every line/job step for the job but just for the main job step, as the main step includes the resources and walltime for the other job steps (i.e., for this purpose you can use the -X flag). E.g., if Elapsed is 2:45:00 and billing shows 16, the SU cost of the job is 2.75 hours * 16 SU/hour = 44 SU.
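
The arithmetic above can be sketched as follows. The function names are hypothetical, and the AllocTRES string in the example is invented for illustration (only its billing=16 entry corresponds to the worked example). The parser assumes Elapsed has the form [D-]HH:MM:SS.

```python
def elapsed_hours(elapsed):
    """Convert an Elapsed value such as '2:45:00' or '1-02:00:00' to hours."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    parts = [int(part) for part in elapsed.split(":")]
    while len(parts) < 3:  # tolerate MM:SS by padding missing leading fields
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return days * 24 + hours + minutes / 60 + seconds / 3600


def billing_rate(alloc_tres):
    """Extract the hourly SU rate (the 'billing' TRES) from AllocTRES."""
    for item in alloc_tres.split(","):
        name, _, value = item.partition("=")
        if name == "billing":
            return float(value)
    raise ValueError("no billing entry in AllocTRES")


def su_cost(elapsed, alloc_tres):
    """SU cost = elapsed walltime in hours * billing rate."""
    return elapsed_hours(elapsed) * billing_rate(alloc_tres)


# Hypothetical AllocTRES value; only billing=16 comes from the example above.
print(su_cost("2:45:00", "billing=16,cpu=16,mem=64G,node=1"))  # 44.0
```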

Viewing output of jobs in progress

Slurm outputs the stdout and stderr streams for your job to the files you specified on the shared filesystem in real time. There is no need for an extra command like qpeek under the PBS/Moab/Torque environment.

Cancelling Your Jobs

Sometimes one needs to kill a job. To kill/cancel a job that is waiting in the queue, or is already running, use the scancel command:

login-1> scancel -i 122488
Cancel job_id=122488 name=test1.sh partition=standard [y/n]? y
login-1>

Monitoring the Cluster

Notices of scheduled and unscheduled outages, issues, etc. on the clusters will be announced on the appropriate mailing lists (e.g. HPCC Announce for the Deepthought* clusters); users are automatically subscribed to these lists when they get access to the cluster.

Sometimes you want a broader overview of the cluster. The squeue command can give you information on what jobs are running on the cluster. The sinfo -N command can show you attributes of the nodes on the cluster. But both of these use a text-oriented display which, while providing a fairly dense amount of information, is often difficult to digest.

The sview command uses real (not text mode) graphics to show the status of the cluster. As such it requires an X server running on the computer you are sitting at. This will present a graphical overview of the nodes in the cluster and their state, as well as the job queue.

PLEASE SET THE REFRESH INTERVAL to something like 300 seconds (5 minutes). Select Options | Set Refresh Interval. The application default is far too frequent and causes performance issues.

For an even prettier view, there are online monitoring pages for the clusters:

Online Monitoring of Zaratan: This main page lists a number of dashboards providing summary and detailed information about the current and recent utilization, and other metrics, etc. of the Zaratan cluster. This is primarily based on hardware metrics. The Kiosk View provides a good summary of how much the cluster is being used at this moment, and the Host Overview, GPU Metrics, and Filesystems allow you to see recent utilization of specific hosts, GPUs, and filesystems.
XDMoD Reports Zaratan: The XDMoD portal allows one to view metrics on the Zaratan cluster more directly related to jobs, users, and allocations.