As a user of the cluster you belong to at least one project, and each project contains one or more allocations, at least one of which you must also be a member of. Each allocation represents an allotment of resources on an HPC cluster. These resources include compute time (measured in SUs, or more commonly kSUs) as well as storage space (measured in TB) on both the scratch and the SHELL tiers.
The reason that there may be multiple allocations within a project is that the allocations can come from different resource pools and have different expiration dates and replenishment schedules. Allocations from the Allocations and Advisory Committee (AAC) typically have a duration of one year, although they may be renewed via an application showing reasonable past use. In most cases such allocations are awarded a fixed amount of resources for the duration of the allocation, i.e. for the year. Allocations coming from college or departmental pools are subject to the policies of the college/department granting the allocation; usually these are also for one-year terms, but allocated and replenished quarterly. Allocations purchased from DIT are governed by the MOU signed at the time of purchase, but typically are for one year, allocated and replenished quarterly.
The storage allotments for all of the different allocations within a project are typically summed, individually for each tier, to get the effective storage limits for the entire project (the group of members in the project) on that storage tier. I.e., typically the storage limits apply across all allocations, so you do not have to assign specific files to specific allocations. Storage allotments are for the duration of the allocation; they do not increase automatically with time. Note that if an allocation expires, the effective limits on storage for the members of the project may be reduced, which could potentially lead to the disk usage exceeding the limit. In this case, members will be notified of the issue, and given a week to resolve the situation, either by reducing the amount of disk storage used (deleting unneeded files, moving files off the cluster, etc.), or by increasing the storage limit (by renewing the expired allocation, obtaining an increase in the storage allotment on a remaining allocation, or obtaining (perhaps purchasing) a new allocation with additional storage).
Compute allotments are distinct across allocations. Each allocation with a compute allotment has its own Slurm allocation account, and when submitting a job you can specify which allocation account the job should charge against.
Allocations awarded by the AAC typically provide a fixed amount of compute resources for the duration of the allocation (e.g. one year). Allocations from college or departmental pools will typically have their SUs allotted quarterly; e.g. if an allocation is granted 800 kSU/year, this will be meted out at 200 kSU/quarter for each of four quarters.
All jobs that are submitted are associated with an account; this can be specified with the -A flag when the job is submitted, or the job will charge against the submitter's default account. See the documentation on submitting jobs for more information on specifying the account or changing your default account.
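For example, to submit a job script charging against a particular allocation account (the account name here is illustrative):

login.zaratan.umd.edu:~$ sbatch -A smith-prj-eng myjob.sh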
Generally, the allocation account will get charged a number of SUs for the amount of resources used while the job is running. This SU cost is based on the amount of time the job ran in hours (the walltime of the job), multiplied by an hourly SU rate. The hourly SU cost for a job is the maximum of several factors, reflecting the different resources (e.g. cores, memory, GPUs) allocated to the job.
NOTE: The SU costs above are based on the amount of resources allocated to a job, not what is actually used by the job, as the requested and allocated resources are not available to any other jobs while your job is running. So if you request 100 GB of RAM for a job that only uses 40 GB of RAM, you are wasting both cluster resources and the SUs in your allocation. Similarly, requesting nodes in exclusive mode will cause you to be charged for all of the resources on the allocated nodes.
However, you are charged based on the actual time the job ran, not the requested walltime. So if you submit a job with a requested walltime of 1 day and it terminates after only 30 minutes, your job is only charged for 0.5 hour. So you should always set the walltime to the longest time you expect the job to run, perhaps with a little padding. But you should not set the requested walltime excessively long, as that could penalize your job in scheduling (plus there can be situations wherein the job stops running usefully but does not terminate --- in those situations, you will be charged for the full walltime until the job terminates).
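To make that concrete, a worked example with illustrative numbers (the 96 SU/hour rate here is hypothetical):

charge = (hourly SU rate) x (actual hours run)
       = 96 SU/hour x 0.5 hour
       = 48 SU

The requested walltime of 1 day affects scheduling (including the scheduler's SU estimates discussed below), but not the final charge.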
Jobs in the scavenger partition are an exception --- jobs in this low-priority partition are not charged, but are subject to preemption.
NOTE: You are charged for cores allocated to your job, not just the cores actually used. I.e., if you request 1 core on a node, but also request that no other jobs be run on the node, you will be charged for ALL cores on the node assigned, since no one else can use them while your job is running.
The scheduler keeps track of all jobs running against a given account, and of how many SUs are required to complete these jobs (using the walltime requested when each job was submitted). Before a new job charging against that account is started, the scheduler makes sure that there are sufficient funds to complete it AND all currently running jobs charging against that account. If there are, the job can be started; otherwise, it is left in the pending state with the reason code AssociationJobLimit.
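You can check the reason a pending job is not starting with squeue; with a hypothetical job ID, you might see something like:

login.zaratan.umd.edu:~$ squeue -j 1234567 --format="%i %T %r"
JOBID STATE REASON
1234567 PENDING AssociationJobLimit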
Research groups can get allocations in one of several ways: as a grant from the Allocations and Advisory Committee (AAC), from college or departmental resource pools, or by purchasing resources from DIT.
This section discusses some general concepts related to allocations which were purchased from DIT, which includes allocations awarded by departments and/or colleges from pools of HPC resources which they have purchased from DIT. Allocations purchased directly from DIT are governed by the MOU between the purchaser and DIT which was signed at the time of purchase. Allocations which were granted from departmental and/or college pools are subject to whatever policies the department and/or college wish to impose. The information below generally holds for such allocations, but the aforementioned MOUs and departmental/college policies have precedence.
These allocations have an expiration date, typically one year from the start date, but this is negotiable in the MOU, and shorter terms (down to even a single quarter) are available on request. For departmental/college allocations, the expiration date is still nominally set to one year, but the allocation will persist until the contact for the department/college tells us otherwise.
SUs are meted out quarterly; if your allocation is for 800 kSU per year, you will normally get 200 kSU per quarter for 4 quarters. This can be modified beforehand if needed; if you know you will need more compute time in the first two quarters, you could set up the same 800 kSU per year as e.g. 250 kSU/quarter for the first two quarters and 150 kSU/quarter for the next two quarters. SUs which are not used in one quarter do not roll over to future quarters --- at the end of a quarter, all unused SUs simply disappear, and (if the allocation has not expired) the allocation will be replenished with the SUs for the next quarter.
Storage (both scratch and SHELL tiers) is also allocated quarterly, but as files generally persist across quarters, this is usually not as noticeable. If the storage quota for your project decreases at some point (e.g. an allocation expires or is revoked, causing the loss of the contribution of that allocation to your storage quotas), resulting in your project being over quota on one or more storage tiers, the PI of the allocation will receive email from HPC staff informing them of the issue, and requesting that the storage be brought under quota in a timely fashion. This can be done either by reducing the storage footprint, or by increasing the storage quota (e.g. getting more quota from the AAC, department/college pools, and/or purchasing from DIT). If this is not done in a reasonable time period, we may be forced to bill the PI for the excessive storage used.
The HPC Allocations and Advisory Committee (AAC) can grant one-time unpaid allocations to faculty and students for small projects, classes, feasibility tests, etc. These allocations are granted out of computing resources purchased by funds from the Provost's Office and the Division of Information Technology.
Faculty members can apply for such allocations by submitting an application to the AAC. This can be submitted by the faculty member themselves, or by a post-doc or student on behalf of the faculty member (in the latter case, the faculty member will be required to consent to assuming "ownership" of any resulting allocation). The review of the application by the AAC is more rigorous as more resources are requested; faculty members are eligible for up to 50 kSU per year with very little information. With proper justification and benchmarking and approval by the AAC, faculty members can get up to 550 kSU per year (including the aforementioned 50 kSU) for free from the AAC.
Generally, the AAC will not grant more than 50 kSU/year to a faculty member unless the faculty member has run jobs on the cluster to collect benchmarking data supporting the request (and has summarized such in the application).
If the above criteria have not been met, the AAC will generally limit a researcher to an initial grant of 50 kSU/year. That allocation can be used to start the research, and while doing so collect the aforementioned benchmarks which can be used in a renewal request for additional resources (which can be submitted anytime; you do not have to wait for the initial allocation to expire).
The 50 kSU/year allocation size is also useful if you are new to HPC or the cluster. While the form is the same for all allocation sizes, the review of the application for the first 50 kSU/year for a faculty member is relaxed.
In general, when applying, answer all fields to the best of your ability, and HPC staff will get back to you with questions if more information is needed.
If you only have a single account (check with the sbalance command), you can skip this section. You only have the one account, so there is nothing to choose.
If you have multiple accounts due to your membership in multiple research groups and/or projects, you may wish to choose which account you use based on your job. I.e., if the job is doing something for group A, you probably should only submit it using one of the group A accounts, even if you also have access to group B accounts. If the research areas of the two groups overlap, you will need to follow whatever group-specific policies may exist (contact your colleagues).
If you have access to multiple allocation accounts within the same research group/project, then there is a choice to be made. If your research group has group-specific policies about which allocation to use, follow those. Otherwise, you will normally see an allocation from the Allocations and Advisory Committee (AAC) plus one or more allocations from college and/or departmental resource pools, and maybe an allocation purchased from the Division of IT. Again, the sbalance command is an easy way to see this, e.g.
login.zaratan.umd.edu:~$ sbalance
Account: smith-prj-aac (DEFAULT)
Limit: 250.00 kSU
Unused: 126.50 kSU
Used: 123.50 kSU (49.4 % of limit)
Account: smith-prj-eng
Limit: 200.00 kSU
Unused: 110.00 kSU
Used: 90.00 kSU (45.0 % of limit)
Account: smith-prj-ipst
Limit: 175.00 kSU
Unused: 160.00 kSU
Used: 15.00 kSU (8.6 % of limit)
Account: smith-prj-paid
Limit: 275.00 kSU
Unused: 180.40 kSU
Used: 94.60 kSU (34.4 % of limit)
login.zaratan.umd.edu:~$
In the above example, the user is a member of 4 allocations in the smith-prj research group/project; the first being awarded from the AAC, the second from the School of Engineering, the third from the IPST department, and the last was purchased from DIT.
In such a case, we generally encourage users to use the AAC allocation as a last resort. Paid and college/departmental allocations are typically awarded quarterly, meaning that at the start of each new quarter (1 Jan, 1 Apr, 1 Jul, 1 Oct), any unused SUs in such an allocation disappear, and the allocation is replenished at its nominal quarterly level. Since the SUs in these allocations typically have the shortest lifespan, you generally want to use those SUs first.
Allocations granted from the AAC, on the other hand, represent a one-time grant of resources, and although these SUs will also expire and vanish at the end of the term of the AAC allocation, this is typically on a timescale of about one year. Also, AAC allocations do not automatically renew --- you (or your advisor) must apply to the AAC for any renewals, etc.
Thus, we normally recommend that you set your default allocation to a paid and/or departmental allocation, and normally charge jobs against those allocations. If and when you encounter a situation wherein your workload for a given quarter is exceeding the quarterly allotment from your paid and/or college/departmental allocations, then you can dip into the AAC allocations to make up the difference. Although exceptions can arise, we find that this type of arrangement is likely to maximize your benefit from the allocations.
Another consideration in this decision is the number of SUs left in the allocation, and the number of running and pending jobs charging against the allocation. The scheduler will not start a job unless it determines that there are sufficient SUs in the allocation to complete the job in question along with all currently running jobs charging against that allocation. In order to estimate the number of SUs needed to complete a job, the scheduler uses the maximum walltime requested for the job.
For example, if you have 5 kSU unused in your allocA allocation, and the job you wish to submit has a walltime of 50 hours and an SU billing rate of 96 SU/hour, the scheduler will assume the job needs 4.8 kSU to complete. If there are no other running jobs charging against the allocA allocation (this includes jobs by other people in your group), the scheduler will consider the job able to start if sufficient compute resources are available. However, if there are 5 jobs already running and charging against the allocA allocation which have an SU billing rate of 50 SU/hour and are halfway through their requested 6 hours of walltime, the scheduler will estimate that each job will run for 3 more hours and so consume 150 SU each, or 0.75 kSU for all 5 such jobs. In that case, the new job will not start because 0.75 kSU for the running jobs plus 4.8 kSU for the new job will exceed the 5 kSU unused in the allocation. This calculation will change over time; e.g. if 4 of those jobs finish right after this calculation (so effectively do not consume any of the 5 kSU unused), the next time the scheduler looks at the job it will see 0.15 kSU needed to finish the remaining job, and 4.8 kSU to finish the new job, or 4.95 kSU total, and the job will be able to start.
Unfortunately, the scheduler cannot handle things like "charge this job to allocA, unless there are not enough SUs in which case charge it against allocB". So once the allocations are nearing depletion, you will need to more closely monitor the usage and make such determinations as to which allocation to schedule a job against. But this is generally only an issue when the allocation is close to being depleted.
NOTE: The queuing system will NOT automatically select another account if there are insufficient funds in the account specified for the job. E.g., if you have access to both allocA and allocB and you specify that a job should charge against allocA (either explicitly or via the default account), the scheduler will not change that to allocB if allocA is depleted. The job will just wait in the queue until such time that additional SUs are available in allocA.
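If a pending job is stuck against a depleted account and you have access to another account with funds remaining, you may be able to move the job yourself rather than waiting; a sketch, with hypothetical job ID and account name:

login.zaratan.umd.edu:~$ scontrol update JobId=1234567 Account=allocB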
Note also that others in your group may have access to the same account; even if funds were there when you submitted a job, someone else's jobs may have started since then, reducing the funds in the account.
See the documentation on submitting jobs for more information about specifying the account to be charged when submitting a job.
This section discusses various topics related to the management of allocations. It is broken into two parts, depending on the level of management: management of individual allocations by PIs, and management of resource pools by pool managers.
Most of the tools for PIs to manage allocations are available through the ColdFront allocation management portal. A full list of these tools can be found at the aforementioned pages.
You can also get a bit of useful information from the following commands, discussed in the section on monitoring usage below: sbalance, scratch_quota, shell_quota, home_quota, and sacct.
Certain units may have pools of resources on the cluster that they can allocate to researchers within their units. These pools are typically granted in return for contributions of hardware to the cluster. Whereas on the Deepthought2 cluster these pools were often arranged as one large allocation account, that proved to have several problems. Such an arrangement made it difficult for different members of the same research group to share files without sharing them with the entire department or college. It also made it difficult for system administrators to contact faculty members regarding student accounts, forcing the departmental/college contact to act as the "middleman" in all such communications. It also meant that the departmental contacts had to handle all of the requests from users in the department wishing to get access to the cluster, and conversely to handle the removal of all users who should no longer have access (and unfortunately, that typically was ignored). This is all made more complicated because the departmental contact usually needs to contact the actual PI/faculty advisor to determine eligibility.
On Zaratan, the addition of storage tiers as allocated resources will just make that even more problematic, especially when usage exceeds the quota. We are hoping to convert such large departmental/college allocations into pools of resources which can be suballocated to researchers in the unit. Individual PIs in the unit can be granted allocations of resources from the pool. The PI can then manage which users have access to the allocation, which removes some of the burden from the departmental manager and places it in the research group, which has more knowledge of the situation.
From the PI's perspective, they will typically already have a project , containing all of the allocations for the research group. This will typically include an allocation from the campus Allocation and Advisory Committee, and if you grant them an allocation from your pool, that will appear as another allocation under the project. There could be additional allocations as well, if the PI has allocations from another department or unit, or if they have purchased resources. The PI could also have multiple projects --- this could be because they have multiple research groups, or more typically if they have a project for a class they are teaching. If they are also a pool manager, the pool will also appear as a separate project. Generally, all users belonging to allocations within the same project belong to the same group, and scratch and SHELL storage are organized by project.
The compute resources for each allocation in the project will appear as distinct Slurm allocation accounts , and when a job is submitted it will need to specify which account to charge against (or charge against the default allocation account). Storage resources are handled differently --- because it is difficult to classify a file as belonging to one allocation or another, and even messier to have to assign it in such a fashion, we sum up the storage allotments (separately for each storage tier) for all allocations in a project, and use that to set the storage limit for the project's storage directory on that tier.
Pool managers are responsible for allocating the resources in the pool to the individual researchers. Unfortunately, system administrators are not at this time (or in the foreseeable future) able to delegate the actual ability to create/modify allocations to the pool managers, so pool managers will need to send email to hpcc-help@umd.edu to request changes. These requests are handled by people, so you can send multiple requests in a single email.
You are allowed to oversubscribe the compute and/or storage resources in your pool; that means it is permissible for the sum of the compute and/or storage limits (separately for compute and each tier of storage) allotted to the suballocations from this pool to exceed the size of these resources in the pool. This was not initially allowed on Deepthought2, which is one reason many units adopted the large departmental/etc. pools --- there were some units that had a fair number of HPC users with a modest average quarterly usage of compute resources, but who might need double their average usage in occasional quarters. Without oversubscription, one would need to set the suballocation size to double their average usage, which means on average the allocation would only be half utilized, and (since we are not oversubscribing) the other half of their allocation could not be used by anyone else. With oversubscription, the unused half of that suballocation could be doubly (or more) allocated, increasing the effective utilization.
While this is certainly advantageous, it needs to be used carefully, because to be fair to other users on the cluster, the total usage from suballocations from your pool is restricted to the available compute resources in the pool (on a quarterly basis). E.g., if you have a pool of 100 kSU/quarter and you have two suballocations A and B to which you assign 75 kSU/quarter each, then if A uses 75 kSU in a given quarter, B is limited to 25 kSU. So this will work if both A and B only use 50 kSU/quarter on average, and when one uses more than average, the other uses correspondingly less. But clearly there will be complaints if there is a quarter in which users of both the A and B suballocations want/need to use more than 50 kSU.
You can also oversubscribe storage. This is more problematic: while SUs are somewhat ephemeral (every quarter the quarterly SU usage is reset), files tend to be more permanent --- once created, they remain until someone deletes them. Furthermore, there are various mechanisms by which individual projects can end up going over their filesystem limits. For instance, the project limit is the sum of contributions from various allocations, which can expire. E.g., consider a project with a 3 TB limit, with 1 TB coming from an AAC allocation, one from your pool, and one from a project-level purchase agreement with DIT, and assume that the project is using 2.9 TB of storage. If one of these allocations expires, suddenly the project is consuming 2.9 TB but only has a 2 TB limit. Assuming the allocation from your pool has not expired, your pool will now be considered to be consuming 1.45 TB from that suballocation (despite your only authorizing 1 TB to that suballocation).

It is even more complicated than that, as the enforcement of storage limits has some technical limitations. The Zaratan scratch storage has some delays in the quota enforcement stage, so users can in some cases continue to write data over the limit for a fraction of an hour after exceeding the limit. And the SHELL storage for each project is divided into multiple volumes (at least one per user, and perhaps more), each of which has a size limit, but does not have quotas as such. The sum of these maximum volume sizes can exceed the SHELL storage limit for the project. So it is very possible for individual projects to exceed their storage limits, at which point they will be notified and instructed to rectify the situation. But such overages can also impact the usage of your pool.
It is recommended that pool managers use care if/when oversubscribing, especially for storage. Ideally, you should avoid oversubscribing initially if possible, and wait until you have several quarters worth of data showing actual utilization of the resources in the pool. If there is a consistent history of under-utilization, then it might be reasonable to allow for some oversubscription if needed, but even then you probably should be a bit conservative and allow some room for fluctuations in the average usage. For storage, you should also remember that storage usage, especially on the SHELL tier, is likely to monotonically increase.
There are several things that pool managers can do in ColdFront. One useful display is similar to the Storage Quota (TB) - HPFS field, but for SHELL (medium-term) storage. Basically, it displays the number of TB of SHELL storage that users in the various suballocations are currently consuming, versus the total amount available for suballocations of the pool. It is useful if you oversubscribe the SHELL storage in your pool; if the used SHELL storage exceeds the limit, you will be requested to quickly rectify the situation.
You may need to use the Show all projects button to see all of these projects.
You and your research group are responsible for ensuring proper rationing of the funds in your account(s). Excessive use of funds from a non-AAC allocation at the beginning of a quarter could result in no funds being available for jobs at the end of the quarter.
This can be deliberate and beneficial, e.g. if you have important deadlines in the middle of the quarter and are willing to "borrow ahead" to get the computations for that completed before the deadlines. This is an advantage of the model used by the UMD HPC clusters; you can use nearly 3 times the power of the computers you purchased in a single month to rush out computations, at the cost of having very limited usage for the following two months (but since that is after the deadlines, it might not be important).
But if such a shortage of funds occurs because some junior member of the group is sending an excessive number of very expensive jobs, this can be quite problematic, especially as you might not notice the impact of the errant user until too late.
The Division of Information Technology cannot tell which jobs are important and which are not, nor what is good usage of your allocation funds and what is not. If we notice seriously problematic usage (e.g. a job reserving 10 nodes but only running processes on 1 node), we will do our best to notify and instruct the relevant users. But you are responsible for monitoring your own jobs, and it behooves you to monitor the jobs of other users of your allocations. We will provide the necessary tools to do so, but we strongly advise all research groups to have at least one person monitor the usage of their allocations' funds regularly to ensure there are no problems, or at least to catch any problems early.
The first level of monitoring of your allocations is with the sbalance command. E.g.
payerle:login-1:~>sbalance
Account: test-hi (dt)
Limit: 163.52 kSU
Available: 163.47 kSU
Used: 0.05 kSU (0.0 % of limit)
Account: test (dt)
Limit: 327.04 kSU
Available: 325.33 kSU
Used: 1.71 kSU (0.5 % of limit)
Without any arguments, it will list usage metrics for all accounts to which you have access. The above listing is from early in the quarter for a co-op type project; note that both accounts are nearly full, and that the test account has nearly double the amount of the test-hi account. The line starting with "Used" not only gives the number of kSU used, but also the usage as a percentage of the limit. If this percentage is significantly higher than the percentage of the month (for your high-priority account) or of the quarter (for normal-priority accounts) which has elapsed, you might need to get concerned. I.e., if at one week into the month you see the usage on your high-priority account is over 30% of the limit, your group is burning your SUs faster than they will be renewed, and you might have some time at the end of the month with nothing in your high-priority account.
For AAC grant type accounts, there is no monthly or quarterly replenishment. The "Limit" should reflect the amount of compute time the AAC granted you, and the percentage is how much of that you have used. If the percentage used is significantly greater than the percent of your work which is complete, you should consider working on an update to your proposal to request more time.
If you are tasked with monitoring the usage of the accounts by your colleagues in the project (or have taken said task upon yourself), you can use the -all flag to sbalance to see who is using the funds in the account. You might also wish to use the -account flag to limit the output to a single allocation account, e.g.:
login-1: sbalance -account smith-prj-aac -all
Account: smith-prj-aac (DEFAULT)
Limit: 163.52 kSU
Available: 102.07 kSU
Used: 61.45 kSU (37.6 % of limit)
User jtl used 17.6044 kSU (28.6 % of total usage)
User kevin used 13.3456 kSU (21.7 % of total usage)
User payerle used 30.5000 kSU (49.6 % of total usage)
This lists the same information as before, with the addition of every user who has used the account in the current quarter, showing not only the number of kSU they consumed, but also their percentage of the total usage for the account. E.g., in the example above, you can see that user payerle is using almost as much as users kevin and jtl combined. You can add the flag --nosuppress0 if you want to also see lines for everyone with access to the allocation but who did not consume any time since the start of the quarter.
The --help option to sbalance will display usage options, most of which were discussed above.
The scratch_quota command will show the quota and usage for the high-performance or scratch tier of storage on the cluster. Without any arguments, it will display the usage and quotas on the scratch filesystem for all groups to which you belong. This should include a group named zt-PROJECTNAME for each project to which you belong. Currently, the list might also include other groups which do not correspond to projects; these will be shown with no usage or quota. After that, it will show your total usage (across all projects/groups to which you belong).
You can use the --group flag to specify one or more groups to display usage and quotas for; if you specify one or more groups, the usage and quotas for those groups will be displayed. You can also use the --user flag to specify users. If no groups are explicitly provided, the script will list usages and quotas for all groups the named users belong to (if neither users nor groups are explicitly given, the code will act as if your username was explicitly given).
If any users were given (explicitly, or if your username was added by default),
after the usages and quotas for all groups are listed, the script will then list the
total scratch usage by each of the specified users. If the flag --all-users
is provided, this will be done for all members of the groups being displayed.
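Putting this together, some typical invocations (the group name here is illustrative):

login.zaratan.umd.edu:~$ scratch_quota
login.zaratan.umd.edu:~$ scratch_quota --group zt-smith-prj
login.zaratan.umd.edu:~$ scratch_quota --group zt-smith-prj --all-users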
NOTE: when listing usages for users, it is always the total usage for that user across all projects to which they belong (or once belonged).
Other options are documented in the help option of the script; use --help
to see them.
The shell_quota command will show the quota and usage for the medium-term SHELL tier of storage on the cluster. Without any arguments, it will display the usage and quotas on the SHELL filesystem for all projects to which you belong. You can use the --project flag to specify a project to display usage and quotas for. Projects should be specified by the name of their root directory under /afs/shell.umd.edu/project.
Remember that SHELL storage is volume based and individual volumes are subject to per-volume caps. You can use the flag --show_volumes to see a list of all volumes belonging to the project and their respective usages and volume caps.
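For example (the project name here is illustrative):

login.zaratan.umd.edu:~$ shell_quota
login.zaratan.umd.edu:~$ shell_quota --project smith-prj --show_volumes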
NOTE: the information returned by this command is updated in a cron job which runs about every six hours or so. Therefore it will take time for changes to be visible.
Other options are documented in the help option of the script; use --help
to see them.
The home_quota command will show your usage and quota on the home space filesystem. It will also show information on the "grace" period; we normally allow users to exceed their 10 GB homespace quota by up to 10 GB for about 1 week. If you see something other than "[n/a]" in the "Grace Ends" field, you have exceeded your homespace quota, and you will be unable to save any data to your home directory once the grace period ends (or if you go over the 20 GB hard limit).
The sacct command can be used to view the accounting records of jobs, both past and currently running. It takes some time to run, and can display a fair amount of information (which is documented in its man page). You will almost always wish to restrict it to a time range; so to see the usage of account foo for the month of November 2014, one could use
login-1> sacct --format=JobID,User,Account,ReqCPUs,AllocCPUS,Elapsed,CPUTime \
-a -X -S 2014-11-01 -E 2014-11-30 -A foo
JobID User Account ReqCPUS AllocCPUS Elapsed CPUTime
------------ --------- ---------- -------- ---------- ---------- ----------
2717747 payerle foo 16 20 1-00:00:09 20-00:03:00
2717748 payerle foo 16 20 1-00:00:09 20-00:03:00
2717749 payerle foo 16 20 1-00:00:09 20-00:03:00
2717750 payerle foo 16 20 1-00:00:08 20-00:02:40
2717751 payerle foo 16 20 1-00:00:08 20-00:02:40
2717752 payerle foo 16 20 1-00:00:08 20-00:02:40
2717753 payerle foo 16 20 1-00:00:17 20-00:05:40
2717754 payerle foo 16 20 1-00:00:17 20-00:05:40
2717755 payerle foo 16 20 1-00:00:17 20-00:05:40
2717756 payerle foo 16 20 1-00:00:12 20-00:04:00
2718384 payerle foo 10 0 00:00:00 00:00:00
2718385 payerle foo 10 0 00:00:00 00:00:00
2718386 payerle foo 10 0 00:00:00 00:00:00
Here, all of the jobs listed were charged against the foo account during the specified period.
The command man sacct will give a very complete manual for the sacct command. A set of flags which are often useful is --format=JobID,User,State,AllocTRES,Elapsed -P -X. The -P flag causes the output to be in pipe (|) delimited fields, and the -X flag causes only allocations (and not job steps) to be displayed. The --format flag controls which fields are output. With these flags (and others to filter which jobs are displayed) you might get something like:
login-1> sacct --format=JobID,User,State,AllocTRES,Elapsed -S 2024-11-01 -u payerle -P -X
JobID|User|State|AllocTRES|Elapsed
8517724|payerle|COMPLETED|billing=1,cpu=1,energy=45108,mem=4000M,node=1|00:01:05
8533735|payerle|COMPLETED|billing=7,cpu=1,energy=63180,gres/gpu:a100_1g.5gb=1,gres/gpu=1,mem=4000M,node=1|00:01:00
8533736|payerle|COMPLETED|billing=7,cpu=1,energy=42900,gres/gpu:a100_1g.5gb=1,gres/gpu=1,mem=4000M,node=1|00:01:00
8533737|payerle|COMPLETED|billing=7,cpu=1,energy=53040,gres/gpu:a100_1g.5gb=1,gres/gpu=1,mem=4000M,node=1|00:01:01
8533741|payerle|COMPLETED|billing=7,cpu=6,energy=46800,gres/gpu:a100_1g.5gb=1,gres/gpu=1,mem=24000M,node=1|00:01:02
8569479|payerle|COMPLETED|billing=7,cpu=6,energy=51642,gres/gpu:a100_1g.5gb=1,gres/gpu=1,mem=24000M,node=1|00:01:01
8653934|payerle|COMPLETED|billing=144,cpu=6,energy=47945170,gres/gpu:h100=1,gres/gpu=1,mem=24000M,node=1|00:01:00
8659518|payerle|COMPLETED|billing=7,cpu=6,energy=64057,gres/gpu:a100_1g.5gb=1,gres/gpu=1,mem=24000M,node=1|00:01:00
8659521|payerle|COMPLETED|billing=7,cpu=6,energy=60840,gres/gpu:a100_1g.5gb=1,gres/gpu=1,mem=24000M,node=1|00:01:00
8661531|payerle|COMPLETED|billing=6,cpu=6,energy=48454,mem=24000M,node=1|00:01:00
8708751|payerle|TIMEOUT|billing=12,cpu=12,energy=1172654,mem=48000M,node=1|01:00:21
9244826|payerle|TIMEOUT|billing=4,cpu=4,energy=131026,mem=16G,node=1|00:05:18
9299430|payerle|COMPLETED|billing=144,cpu=1,energy=162727,gres/gpu:h100=1,gres/gpu=1,mem=4000M,node=1|00:01:00
In particular, you can use the information above to calculate the SU cost per job: the
value of the billing
trackable resource (TRES) in the AllocTRES field represents
the hourly SU cost for the job. Multiplying that by the elapsed time in hours yields the SU
cost of the job. E.g., job 8653934 in the above example has an hourly billing rate of 144 SU/hour
(due to the use of an H100 GPU), and it ran for 1 minute = 0.0166 hour, so the SU cost is 2.4 SU.
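If you want to tabulate approximate SU costs for many jobs at once, a small pipeline along these lines can do the arithmetic (a sketch, not an official tool; it uses the pipe-delimited output described above and the ElapsedRaw field, which reports elapsed time in seconds):

login.zaratan.umd.edu:~$ sacct -X -P --noheader --format=JobID,AllocTRES,ElapsedRaw -S 2024-11-01 -u payerle \
    | awk -F'|' 'match($2, /billing=[0-9]+/) {
        rate = substr($2, RSTART + 8, RLENGTH - 8)      # hourly SU billing rate
        printf "%s: %.2f SU\n", $1, rate * $3 / 3600    # rate (SU/hr) x elapsed (hr)
      }'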
The sbalance command returns the usage for the current quarter, and the scratch_quota, shell_quota, and home_quota commands all report the current usage status. This is probably what most users are concerned with most of the time (e.g., if I want to figure out whether there are enough kSUs to run my job now, usage from previous quarters is irrelevant), but sometimes one needs information regarding usage over longer time scales. This is especially useful for people who manage "pools" of resources for departments or colleges.
There are a couple of tools available to get more historic information regarding allocation use:
The Zaratan XDMoD website is a web page running the Open XDMoD (Open XD Metrics on Demand) web application. This can present in graphical form many metrics pertaining to the Zaratan cluster. One can see how many kSUs were consumed by a given allocation as a function of time, or what the average job length for an allocation was over the past year. Although some features are available without logging in, for the best experience we recommend that you log into the site using your standard UMD username and password.
The ColdFront allocation management portal provides some historical information as well. There is also a command which runs from the login nodes of the cluster; it examines all the job records related to the allocation account(s) specified, and provides summaries. (As opposed to the sacct command, which lists details for each job but does not summarize.) Because it has to go through all the job records, it does tend to be a bit slow.
We only discuss some of the more commonly used options below; the command supports a --help or -h option which provides more information on its usage (including some options to provide even more usage information). The commonly used options are: