bchkpnt - checkpoint one or more unfinished batch jobs
bchkpnt [ -h ] [ -V ] [ -f ] [ -k ] [ -q queue_name ] [ -m host_name ]
[ -u user_name | all ] [ -J job_name ] [ -p period ] [ jobId ... ]
Checkpoint one or more started jobs (that is, running or suspended jobs). bchkpnt can operate only on checkpointable jobs (see bsub(1) and brestart(1)). Users can checkpoint only their own jobs; only root or the LSF administrator can checkpoint jobs submitted by other users.
Root or the LSF administrator can issue the command `bchkpnt -m host_name -k -u all 0' to checkpoint all the jobs on the host host_name before shutting the host down, and then use brestart to continue execution from where the jobs were checkpointed once the server host is available again (see brestart(1)).
User-level checkpointing requires linking programs with a special library (see libckpt.a(3)).
If a checkpoint request fails to reach the job execution host, lsbatch retries the operation later, when the host becomes reachable. Only the most recent checkpoint request is retried.
A bmig request is cancelled if bchkpnt is issued before lsbatch is able to send the migration checkpoint request to the execution host (see bmig(1) ).
For a description of the possible states for batch jobs, see bjobs(1) and the LSF User's Guide.
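As an illustration of the host shutdown procedure described above, the following sketch only prints the checkpoint-and-kill command for a given host instead of running it, so it can be reviewed first. The helper name drain_host_cmd and the host name hostA are assumptions made for this example and are not part of LSF; the command line itself is the one quoted in this page.

```shell
#!/bin/sh
# Sketch only: drain_host_cmd is a hypothetical helper, not an LSF command.
# It prints the bchkpnt command line quoted in this page for one host.
drain_host_cmd() {
    host=$1
    if [ -z "$host" ]; then
        echo "usage: drain_host_cmd host_name" >&2
        return 1
    fi
    # -k kills each job only if its checkpoint succeeds;
    # -u all with job ID 0 selects every job dispatched to the host.
    printf 'bchkpnt -m %s -k -u all 0\n' "$host"
}

drain_host_cmd hostA
# prints: bchkpnt -m hostA -k -u all 0
```

The printed line must be run as root or as the LSF administrator, because it selects jobs submitted by other users.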
-h Print command usage to stderr and exit.
-V Print LSF release version to stderr and exit.
-f Checkpoint a job even if non-checkpointable conditions exist (non-checkpointable conditions are operating system-specific).
-k The job is checkpointed and killed atomically by the system. The job is not killed if the checkpoint is unsuccessful. By default (without -k), the job continues executing from the point at which it was checkpointed.
-q queue_name
Checkpoint only those jobs in the queue specified by queue_name (see
bqueues(1)). If jobId is not specified, only the most recently submitted
qualifying job is checkpointed. The -q option is ignored if a
job ID other than 0 is specified in the jobId option.
-m host_name
Checkpoint only those jobs dispatched to the host or host group that
is specified by host_name (see bhosts(1) and bmgroup(1)). If jobId is
not specified, only the most recently submitted qualifying job is
checkpointed. The -m option is ignored if a job ID other than 0 is
specified in the jobId option.
-u user_name | -u all
Checkpoint the jobs submitted either by the user specified by
user_name, or by all users if the reserved user name all is given. If
jobId is not specified, only the most recently submitted qualifying
job is checkpointed. The -u option is ignored if a job ID other than
0 is specified in the jobId option.
-J job_name
Checkpoint the jobs that have the specified job_name. If jobId is not
specified, only the most recently submitted qualifying job is checkpointed.
The -J option is ignored if a job ID other than 0 is specified
in the jobId option.
-p period
Checkpoint the job and change the checkpoint period of the job to
period minutes. A period of 0 disables periodic checkpointing.
Because checkpointing is a heavyweight operation, it is suggested
that the checkpoint period be greater than half an hour. If this
option is not specified, the job is checkpointed and its checkpoint
period is not changed (see also bsub(1) and bqueues(1)).
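The guidance on -p above can be turned into a small pre-check before calling bchkpnt. This is a minimal sketch: check_period is a hypothetical helper that is not part of LSF, and its exit codes and the use of 30 minutes as the threshold are this example's reading of "greater than half an hour".

```shell
#!/bin/sh
# Sketch only: check_period is a hypothetical helper, not an LSF command.
# Exit status: 0 = acceptable, 1 = not greater than half an hour,
# 2 = not a non-negative integer.
check_period() {
    period=$1
    case $period in
        ''|*[!0-9]*)
            echo "period must be a non-negative integer (minutes)" >&2
            return 2
            ;;
    esac
    if [ "$period" -eq 0 ]; then
        # A period of 0 disables periodic checkpointing.
        echo "periodic checkpointing will be disabled"
        return 0
    fi
    if [ "$period" -le 30 ]; then
        echo "warning: period of $period minutes is not greater than half an hour" >&2
        return 1
    fi
    echo "period of $period minutes is within the suggested range"
    return 0
}

check_period 45
# prints: period of 45 minutes is within the suggested range
```

A wrapper script could call check_period first and run `bchkpnt -p period` only when the status is 0, or ask for confirmation when it is 1.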
jobId ...
Checkpoint only those jobs that are specified by jobId. Jobs submitted
by any user can be specified here without using the -u option. If
you use the reserved job ID 0, the operation is applied to all the
jobs that satisfy other options (that is, -m, -q, -u and -J, see previous
sections), and all other job IDs are ignored. The options -u,
-q, -m and -J have no effect if a job ID other than 0 is specified.
If no jobId is specified, the job that satisfies all the other
options, and was submitted most recently, is checkpointed. Job IDs
are returned at job submission time (see bsub(1)) and may be obtained
with the bjobs command (see bjobs(1)).
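The interaction between jobId and the -q, -m, -u and -J options described above can be summarized as a three-way decision. The sketch below models that decision only; it does not call LSF. The function name selection_mode and the three labels it prints are assumptions made for this illustration.

```shell
#!/bin/sh
# Sketch only: selection_mode is a hypothetical helper that models the
# jobId rules in this page. Its arguments are the jobId values that would
# be passed to bchkpnt.
selection_mode() {
    if [ "$#" -eq 0 ]; then
        # No jobId: the most recently submitted job matching the
        # other options is checkpointed.
        echo "most-recent-matching"
        return 0
    fi
    for id in "$@"; do
        if [ "$id" = "0" ]; then
            # Job ID 0: every job matching the other options is
            # checkpointed and all other job IDs are ignored.
            echo "all-matching"
            return 0
        fi
    done
    # Only non-zero job IDs: -q, -m, -u and -J have no effect.
    echo "explicit-ids"
}

selection_mode            # prints: most-recent-matching
selection_mode 0          # prints: all-matching
selection_mode 1234 0     # prints: all-matching
selection_mode 1234 1235  # prints: explicit-ids
```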
% bchkpnt -q night 0
Checkpoint all of the invoker's jobs that were submitted to queue
night.
% bchkpnt -u all
Checkpoint the most recently submitted job in the lsbatch system,
regardless of which user submitted it.
% bchkpnt -u all 0
Checkpoint all the batch jobs in the lsbatch system.
% bchkpnt -p 10
Checkpoint the last job submitted by the invoker and change its checkpoint
period to 10 minutes.
% bchkpnt -p 0
Checkpoint the last job submitted by the invoker and disable periodic
checkpointing.
bkill(1), bsub(1), brestart(1), bjobs(1), bqueues(1), bhosts(1), libckpt.a(3), kill(1), signal(2), mbatchd(8)