Allocations and Job Accounting

Table of Contents

  1. The Basics
  2. Choosing the Allocation to Use
  3. The Replenishing Process
  4. Monitoring Allocations

The Basics of Allocations and Job Accounting

As a user of the cluster, you have access to at least one allocation account: the one belonging to the group that requested your access to the HPC cluster. Some groups have both normal and high priority allocations, and some users have access to allocations from more than one group. You can see which allocations you have access to with the mybalance command.

All jobs that are submitted are associated with an allocation. You can specify the allocation with the -A flag when you submit the job; otherwise the job uses your default allocation (typically your normal priority allocation if you have more than one). With the exception of jobs in the serial queue, the time your job runs, multiplied by the number of processor cores consumed, will be charged to that allocation. (Because serial queue jobs are ultra-low priority and can be preempted, we do not charge for CPU time on that queue.) The allocation is debited at the start of the run for the estimated cost of the job (based on the maximum walltime specified for the job); when the job completes, the charge is adjusted to reflect actual usage. If there are insufficient funds in the specified allocation to cover the job when the job is about to be scheduled, the job will not be scheduled and will be deferred until funds are available. NOTE that the scheduler is NOT smart enough to try another allocation you have access to if the specified allocation is depleted.
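The charging rule above can be illustrated with a little arithmetic. The following is only a sketch of the rule as described here, not the scheduler's actual code; the function names and the example job (16 cores, 12 hour walltime limit, 9 hours actually used) are hypothetical, and amounts are in CPU-seconds.

```python
SECONDS_PER_HOUR = 3600


def estimated_charge(cores, max_walltime_s):
    """CPU-seconds debited when the job starts (cores x maximum walltime)."""
    return cores * max_walltime_s


def actual_charge(cores, actual_walltime_s):
    """CPU-seconds the job really costs once it has completed."""
    return cores * actual_walltime_s


def will_be_deferred(balance, cores, max_walltime_s, queue="default"):
    """True if the allocation cannot cover the job's estimated cost.

    Serial queue jobs are not charged, so they are never deferred for
    lack of funds in this sketch.
    """
    if queue == "serial":
        return False
    return balance < estimated_charge(cores, max_walltime_s)


# Hypothetical job: 16 cores, 12 hour walltime limit, finishes after 9 hours.
reserved = estimated_charge(16, 12 * SECONDS_PER_HOUR)
used = actual_charge(16, 9 * SECONDS_PER_HOUR)
returned = reserved - used  # adjustment made when the job completes

print(reserved, used, returned)
print(will_be_deferred(500000, 16, 12 * SECONDS_PER_HOUR))
```

Because the up-front debit is based on the walltime limit rather than the expected runtime, requesting a realistic walltime keeps more of the allocation available to other jobs while yours is running.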

Groups can get allocations in one of two ways:

  1. If the group contributed equipment to the cluster, it gets a pair of paid allocations, one normal priority and one high priority.
  2. If the group did not contribute equipment to the cluster, it has a one-time unpaid allocation granted by the HPC Allocations and Advisory Committee.

Paid allocations come in pairs; for a group named GROUP there will be a normal priority allocation GROUP and a high priority allocation GROUP-hi. As the name implies, jobs submitted with the high priority allocation will be preferentially scheduled over jobs using the normal priority allocation (unless they use the serial queue, which ignores the allocation and runs all jobs at ultra-low priority). Certain queues which can potentially tie up much of the cluster will only accept jobs submitted with high priority allocations.

The HPC Allocations and Advisory Committee can grant one-time unpaid allocations to faculty and students for small projects, classes, feasibility tests, etc. Jobs submitted with these allocations run at normal priority (unless submitted to the serial queue, which ignores allocations and runs all jobs at ultra-low priority).

Choosing the Allocation to Use

If you only have a single allocation (check with the mybalance command), you can skip this section. You only have the one allocation, so there is nothing to choose.

If you have multiple allocations due to your membership in multiple groups, you may wish to choose which allocation you use based on your job. For example, if the job is doing something for group A, you probably should only submit it using one of the group A allocations, even if you also have access to group B allocations. If the research areas of the two groups overlap, you will need to follow whatever group-specific policies may exist (contact your colleagues).

If you have access to normal and high priority allocations, you probably want to submit the job using the high priority allocation. High priority allocations are replenished monthly and unused funds do not carry over, so you might as well use them.

Of course, you need to ensure that the allocation you choose has sufficient funds for your job. If there are not sufficient funds to cover the job's expected cost (based on the specified or queue-specific maximum walltime and the number of cores requested) when it is about to start running, your job will not run and will instead be deferred until such funds are available. Note that the queuing system will NOT automatically select another allocation if, for example, your high priority allocation is depleted but funds exist in your normal priority allocation. The job will just get deferred.

Note also that others in your group may have access to the same allocation. Even if the funds were there when you submitted a job, someone else's jobs may have started since then and reduced the funds in the allocation.

To specify the allocation to be used by a job, use the -A option with qsub. E.g., if you have access to the clfshpc-hi allocation and wish to submit a job myjob.csh using that allocation, the command would look something like qsub -A clfshpc-hi myjob.csh. Of course, you may need to include additional arguments as well. You can also add the line

#PBS -A clfshpc-hi
near the top of your myjob.csh script instead of giving the -A option on the command line.

The Replenishing Process

Unpaid allocations do not get automatically replenished. Jobs deplete the funds in the allocation until either the allocation runs out of funds, or the time limit on the project (or class, etc.) for which the HPC Allocations and Advisory Committee granted the allocation expires and the allocation is deleted.

Paid allocations get refreshed periodically. For each group that contributed equipment to the cluster, a raw quarterly value is computed, equal to the amount of computation that can be done on that equipment in a quarter (currently just the number of cores times the number of hours in a quarter; no adjustment for CPU speed is currently made). From this, 20% is removed for OIT overhead; this covers administrative and other downtime, and some may be used for unpaid allocations. The remainder is the group's quarterly allotment.
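As a worked example of this calculation, the sketch below computes the allotment for a hypothetical group that contributed 64 cores, for the quarter beginning July 2010. The function names and the core count are illustrative assumptions only; the 20% overhead and the cores times hours rule come from the description above.

```python
import calendar

OVERHEAD_FRACTION = 0.20  # portion removed for OIT overhead


def hours_in_quarter(year, first_month):
    """Hours in the three-month quarter starting at first_month (1, 4, 7 or 10)."""
    days = sum(calendar.monthrange(year, month)[1]
               for month in range(first_month, first_month + 3))
    return days * 24


def quarterly_allotment(cores, year, first_month):
    """Quarterly allotment in CPU-hours: raw core-hours minus the overhead."""
    raw = cores * hours_in_quarter(year, first_month)
    return raw * (1 - OVERHEAD_FRACTION)


# Hypothetical group with 64 contributed cores, quarter beginning July 2010.
hours = hours_in_quarter(2010, 7)            # 92 days * 24 = 2208 hours
quarterly = quarterly_allotment(64, 2010, 7)
monthly = quarterly / 3                       # one third, the monthly allotment

print(f"{hours} hours, {quarterly:.1f} CPU-hours per quarter, "
      f"{monthly:.1f} CPU-hours per month")
```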

Every quarter, on the first day of the quarter (i.e. 1 Jan, 1 Apr, 1 Jul, 1 Oct), the normal priority allocation for each group is reset to the quarterly allotment. Any amount left over from the previous quarter is lost.

On the first of every month (after the quarterly reset is done, if it is also the start of a quarter), the high priority allocation for the group is replenished by transferring funds from the normal priority allocation. It will be brought up to one third of the quarterly allotment (i.e. a monthly allotment), provided that there are sufficient funds in the normal priority allocation. If there are not sufficient funds in the normal priority allocation, whatever amount is left there is moved to the high priority allocation.

If your group uses up exactly its high priority allocation every month and never charges jobs directly to its normal priority allocation, then at the beginning of each month in a quarter (after replenishment) you should see:

  Month of quarter   Normal priority   High priority
  ----------------   ---------------   -------------
  1                  2X                X
  2                  X                 X
  3                  0                 X

where X is your monthly allotment (one third of the quarterly allotment). I.e., at the start of the quarter, the normal priority allocation is reset to the quarterly allotment, and a monthly allotment is transferred immediately to the high priority allocation. At the start of the second and third months, a monthly allotment is again transferred to the high priority allocation, leaving the normal priority allocation depleted at the start of month 3.

In practice, you will see some variation. The high priority allocation may not be completely depleted at the end of the month (so less than a full monthly allotment is transferred out of the normal priority allocation, leaving it with more funds), and jobs may run against the normal priority allocation, reducing its funds. Note: there is no rollover of unused funds from quarter to quarter in the normal priority allocation, or from month to month in the high priority allocation. (Unused funds in the high priority allocation do mean that fewer funds are transferred out of the normal priority allocation to refresh it, resulting in extra normal priority funds.)
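The replenishment rules in this section can be sketched as a small simulation. This is only an illustration of the rules as described here, not the script OIT actually runs; the function name and the balance of 300 units (so X = 100) are hypothetical.

```python
def start_of_month(normal, high, quarterly_allotment, month_in_quarter):
    """Return (normal, high) balances after start-of-month processing.

    month_in_quarter is 1, 2 or 3. Balances are assumed to be non-negative.
    """
    if month_in_quarter == 1:
        # Quarterly reset: anything left from the previous quarter is lost.
        normal = quarterly_allotment
    monthly_allotment = quarterly_allotment / 3
    # Top the high priority allocation up to a monthly allotment, limited
    # by whatever is left in the normal priority allocation.
    needed = max(monthly_allotment - high, 0)
    transfer = min(needed, normal)
    return normal - transfer, high + transfer


# Ideal case: the group spends exactly its high priority funds every month
# and never charges jobs to the normal priority allocation.
normal, high = 0, 0
history = []
for month in (1, 2, 3):
    normal, high = start_of_month(normal, high, 300, month)
    history.append((normal, high))
    high = 0  # all high priority funds are used during the month

print(history)
```

With a quarterly allotment of 3X this reproduces the 2X, X, 0 pattern for the normal priority allocation. Calling start_of_month(200, 50, 300, 2) shows the variation case: only 50 units are transferred, leaving 150 in the normal priority allocation.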

Monitoring Allocations

You and your research group are responsible for ensuring proper rationing of your allocations. Excessive use of funds in the first month of a quarter could result in no funds at all in either allocation for the next two months. This may be what you want: if you have important deadlines at the end of the first month of the quarter, an advantage of the Deepthought HPCC model is that you can use nearly 3 times the power of the computers you purchased in a single month to rush out computations, at the cost of having very limited usage the following two months (which may not matter, since that is after the deadlines). But if it happens because some junior member of the group is submitting an excessive number of very expensive jobs, it can be quite problematic, especially as you might not notice the impact of the errant user until too late.

OIT cannot tell which jobs are important and which are not, or what is good usage of your allocation funds and what is not. If we notice seriously problematic usage (e.g. a job reserving 10 nodes but only running processes on 1 node), we will do our best to notify and instruct the relevant users. But you are responsible for monitoring your own jobs, and it behooves you to monitor the jobs of other users of your allocations. We will provide the necessary tools to do so, but we strongly advise all research groups to have at least one person monitor the usage of their allocations' funds regularly to ensure there are no problems, or at least to catch any problems early.

The first level of monitoring of your allocations is with the mybalance command, or the very similar gbalance command. E.g.

payerle:f20-l1:~>mybalance
Project   Machines Balance      
--------- -------- ------------ 
test      ANY          72000000 
test-hi   ANY          14399061 

payerle:f20-l1:~>gbalance -u payerle
Id Name      Amount       Reserved Balance      CreditLimit Available  
-- --------- ------------ -------- ------------ ----------- ---------- 
33 test          72000000        0     72000000           0   72000000 
34 test-hi       14399061        0     14399061           0   14399061 

By default, both commands return balances in CPU-seconds. You can give either command the -h flag to return balances in more tractable CPU-hours. All allocations you have access to are listed with their balances. The numbers listed under Reserved, if any, are for jobs currently running: an amount equal to the expected cost (based on the specified or queue-limit walltime) is reserved when a job starts, and when the job finishes the reservation is lifted and the actual usage is charged.
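To get a feel for the default units, the sample balances above can be converted by hand. The snippet below is just that arithmetic (3600 CPU-seconds per CPU-hour); the helper name is made up for illustration and is not part of mybalance or gbalance.

```python
def cpu_seconds_to_hours(cpu_seconds):
    """Convert a balance in CPU-seconds to CPU-hours."""
    return cpu_seconds / 3600


# Balances copied from the sample mybalance output above.
balances = {"test": 72000000, "test-hi": 14399061}

for name, seconds in balances.items():
    print(f"{name:8s} {cpu_seconds_to_hours(seconds):10.2f} CPU-hours")
```

So the 72000000 CPU-seconds in the test allocation correspond to 20000 CPU-hours, and test-hi holds just under 4000 CPU-hours.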

For a history of usage by you and others in your group, in either tabular or graphical form, there is a web form you can use to query the jobs database; you can access it via http://deepthought.umd.edu/stats. There are many options available.

To view the combined normal/high priority usage for the quarter for a group, the script quarterly_project_usage is available, e.g.

payerle:f20-l1:~>quarterly_project_usage -p myproject
Quarterly usage summary for allocations myproject/myproject-hi
For quarter beginning Jul-2010
Quarterly allocation is    616.70 kSU or    205.57 kSU per month

User       kSU used (number of jobs)                                             
           Quarterly Total      Jul-2010             Aug-2010             Sep-2010             
user001    294.30 kSU (4308)    160.05 kSU (2529)    121.08 kSU (1499)    13.17 kSU (280)      
user002    27.88 kSU (874)       5.04 kSU (181)      21.31 kSU (636)       1.53 kSU (57)       
user003    21.20 kSU (620)       7.66 kSU (193)      13.55 kSU (427)       0.00 kSU (0)        
user004    14.27 kSU (1585)      0.00 kSU (0)         5.09 kSU (784)       9.17 kSU (801)      
user005    12.13 kSU (720)       9.00 kSU (517)       2.49 kSU (182)       0.64 kSU (21)       
user006     9.34 kSU (254)       0.00 kSU (0)         9.34 kSU (254)       0.00 kSU (0)        
user007     1.74 kSU (207)       1.42 kSU (196)       0.00 kSU (0)         0.32 kSU (11)       
user008     0.54 kSU (9)         0.54 kSU (9)         0.00 kSU (0)         0.00 kSU (0)        
TOTALS     381.41 kSU (8577)    183.71 kSU (3625)    172.87 kSU (3782)    24.83 kSU (1170)     
% of alloc 61.85 %              89.37 %              84.09 %              12.08 %        

As indicated in the sample output, usage is reported in kSU (1000 CPU-hour) units and is compared to the monthly and quarterly allocations. You should be concerned if the percent used for the month is significantly in excess of the portion of the month that has passed, e.g. if your monthly allocation is 40% used and it is only 1 week into the month. Similarly, if the percent of the quarterly allocation consumed is significantly in excess of the fraction of the quarter that has passed, e.g. if the quarterly allocation is 40% consumed only halfway through the first month of the quarter, there are likely to be problems.
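This rule of thumb is easy to turn into a quick check. The sketch below is one possible way to do it; the function name and the tolerance factor of 1.5 (our stand-in for "significantly in excess") are assumptions, not an official threshold.

```python
def usage_warning(percent_used, fraction_elapsed, tolerance=1.5):
    """True if usage is running well ahead of the elapsed part of the period.

    percent_used     -- percent of the monthly or quarterly allocation consumed
    fraction_elapsed -- fraction (0 to 1) of the month or quarter that has passed
    tolerance        -- assumed factor by which usage may exceed the elapsed share
    """
    expected_percent = 100 * fraction_elapsed
    return percent_used > tolerance * expected_percent


# 40% of the monthly allocation used, 1 week into a 31 day month.
print(usage_warning(40, 7 / 31))

# 40% of the quarterly allocation used, halfway through the first month.
print(usage_warning(40, 0.5 / 3))

# 25% of the monthly allocation used, 1 week into a 31 day month.
print(usage_warning(25, 7 / 31))
```

The first two cases are flagged and the third is not. Pick a tolerance that suits how bursty your group's workload normally is.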

For generating reports of which members of your group used how much of the allocation, the script usage_report may prove more useful. Usage is:

usage_report [-p ] [-s ] [-e ] [-h]

Times should be given as YYYY-MM-DD. This will give summary of fund usage during the time period, e.g.

f20-l1:~: usage_report -p myProj -s 2010-02-01 -e 2010-03-01
# Statement for project myProj
# Generated on Thu Feb 25 11:00:19 2010.
# Reporting account activity from 2010-02-01 to now.

############################### Debit Summary ##################################

Object  Action   Project  User    Machine     Amount    Count 
------- -------- -------- ------- ----------- --------- ----- 
Job     Charge   myProj   user1   deepthought -21142.45 6     
Job     Charge   myProj   user2   deepthought     -0.09 2     
Job     Charge   myProj   user3   deepthought   -964.37 4     

Total Debits: -22106.91
Total Jobs:   12
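If you want to post-process such a report, for example to see each user's share of the charges, the Debit Summary rows can be parsed with a few lines of Python. This is a minimal sketch that assumes the column layout shown in the sample above (seven whitespace-separated fields per charge row); the parse_debits helper is hypothetical and not part of usage_report.

```python
from decimal import Decimal

# Debit Summary rows copied from the sample report above.
REPORT = """\
Object  Action   Project  User    Machine     Amount    Count
------- -------- -------- ------- ----------- --------- -----
Job     Charge   myProj   user1   deepthought -21142.45 6
Job     Charge   myProj   user2   deepthought     -0.09 2
Job     Charge   myProj   user3   deepthought   -964.37 4
"""


def parse_debits(text):
    """Return one dict per 'Job Charge' row, skipping headers and rules."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[:2] == ["Job", "Charge"]:
            rows.append({
                "user": fields[3],
                "amount": Decimal(fields[5]),  # Decimal avoids rounding drift
                "jobs": int(fields[6]),
            })
    return rows


rows = parse_debits(REPORT)
total = sum(row["amount"] for row in rows)
total_jobs = sum(row["jobs"] for row in rows)

print("Total Debits:", total)
print("Total Jobs:  ", total_jobs)
for row in rows:
    share = 100 * row["amount"] / total
    print(f"{row['user']}: {share:.1f}% of charges in {row['jobs']} jobs")
```

The totals agree with the Total Debits and Total Jobs lines printed by the report itself.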

If even more detail is desired, the gstatement command can be used:

gstatement [[-a] ] [-p ] [-u ] \\
	[-m ] [-s ] [-e ] [--summarize] \\
	[-h, --hours] [--debug] [-?, --help] [--man] [-V, --version]