Developer.TACC.cloud

Developer documentation for Agave, Abaco, and other TACC APIs

Job Lifecycle Management

Agave handles all of the end-to-end details involved with managing a job lifecycle for you. This can seem like black magic at times, so here we detail the overall lifecycle process every job goes through.

  1. Job request is made, validated, and saved.
  2. Job is queued up for execution. The job stays in a pending state until there are resources to run it. This means that the target execution system is online, the storage system with the app assets is online, and neither the user nor the system is over quota. a) If resources do not become available within 7 days, the job is killed. b) If resources are available, the job moves on.
  3. When resources are available to run the job on the execution system, a work directory is created on the execution system. The job work directory is created based on the following logic:
if (executionSystem.scratchDir exists) 
then
    $jobDir = executionSystem.scratchDir
else if (executionSystem.workDir exists)
then
    $jobDir = executionSystem.workDir
else 
    $jobDir = executionSystem.storage.homeDir
endif

$jobDir = $jobDir + "/" + job.owner + "/job-" + job.uuid
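
The selection logic above can be sketched in Python as follows. This is only an illustration, not Agave's implementation: the function name `resolve_job_dir` and its parameters are hypothetical, and it assumes that a directory "exists" when the corresponding setting is a non-empty string.

```python
from typing import Optional


def resolve_job_dir(scratch_dir: Optional[str],
                    work_dir: Optional[str],
                    home_dir: str,
                    owner: str,
                    job_uuid: str) -> str:
    """Pick the base directory by precedence, then append owner and job id.

    Assumption: a missing setting is passed as None or an empty string.
    """
    if scratch_dir:        # executionSystem.scratchDir is set
        base = scratch_dir
    elif work_dir:         # executionSystem.workDir is set
        base = work_dir
    else:                  # fall back to the storage home directory
        base = home_dir
    # Plain concatenation, mirroring the pseudocode above.
    return base + "/" + owner + "/job-" + job_uuid
```

For example, `resolve_job_dir(None, "/work", "/home", "alice", "0001")` returns `/work/alice/job-0001`, because no scratch directory is defined.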
  1. The job inputs are staged to the job work directory, and the job status is updated to "INPUTS_STAGING". a) All inputs succeed and the job is updated to "STAGED". b) One or more inputs fail to transfer. The job status is set back to "PENDING" and staging will be attempted up to 2 more times. c) The user does not have permission to access one or more inputs. The job is set to "FAILED" and exits.
  2. The job again waits until the resources are available to run the job. Usually this is immediately after the inputs finish staging. a) If resources do not become available within 7 days, the job is killed. b) If resources are available, the job moves on.
  3. The app deploymentPath is copied from the app.deploymentSystem to a temp dir on the API server. The jobs API then processes the app.deploymentPath + "/" + app.templatePath file to create the .ipcexe file. The process goes as follows:
    1. Script headers are written. This includes scheduler directives for a batch system, or a shebang line for a forked app.
    2. Additional executionSystem[job.batchQueue].customDirectives are written
    3. "RUNNING" callback written
    4. Module commands are written
    5. executionSystem.environment is written
    6. wrapper script is filtered
      1. blacklisted commands are removed
      2. app parameter template variables are resolved against job parameter values.
      3. app input template variables are resolved against job input values
      4. blacklisted commands are removed again
    7. "CLEANING_UP" callback written
    8. All template macros are resolved.
    9. job.name.slugify + ".ipcexe" file written to temp directory
  4. App assets with wrapper template are copied to remote job work directory.
  5. Directory listing of job work directory is written to a .agave.archive manifest file in the remote job work directory.
  6. Command line is generated to invoke the *.ipcexe file by the appropriate method for the execution system.
  7. Command line is run on the remote system. a) The command succeeds and the scheduler/process/job id is captured and stored with the job record. b) The command fails. The job is returned to "STAGED" status and the command is retried up to 2 more times.
  8. Job is updated to "QUEUED"
  9. Job waits for a "RUNNING" callback and adds a background process to monitor the job in case the callback never comes.
  10. The monitoring process checks the job status according to the following schedule:
* every 30 seconds for the first 5 minutes
* every minute for the next 30 minutes
* every 5 minutes for the next hour
* every 15 minutes for the next 12 hours
* every 30 minutes for the next 24 hours
* every hour for the next 14 days 
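
To make the back-off concrete, the schedule above can be expressed as a list of (interval, window) pairs and expanded into the times at which checks occur. This is a minimal sketch under stated assumptions, not the actual monitor: the names `SCHEDULE` and `check_offsets` are hypothetical, each window is assumed to start when the previous one ends, and the first check is assumed to happen one interval after monitoring begins.

```python
from typing import Iterator, List, Tuple

# (interval_seconds, window_seconds), in the order listed above.
SCHEDULE: List[Tuple[int, int]] = [
    (30, 5 * 60),               # every 30 seconds for the first 5 minutes
    (60, 30 * 60),              # every minute for the next 30 minutes
    (5 * 60, 60 * 60),          # every 5 minutes for the next hour
    (15 * 60, 12 * 3600),       # every 15 minutes for the next 12 hours
    (30 * 60, 24 * 3600),       # every 30 minutes for the next 24 hours
    (3600, 14 * 24 * 3600),     # every hour for the next 14 days
]


def check_offsets() -> Iterator[int]:
    """Yield the seconds since monitoring began at which each check occurs."""
    elapsed = 0
    for interval, window in SCHEDULE:
        # Each window holds a whole number of checks at its interval.
        for _ in range(window // interval):
            elapsed += interval
            yield elapsed
```

Under these assumptions the schedule produces 484 checks spread over roughly 15.6 days, with most of them concentrated in the first few hours when a short job is most likely to finish.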
  1. Job either calls back with a "CLEANING_UP" status update or the monitoring process discovers the job no longer exists on the remote system.

  2. If job.archive is true, send job to archiving queue to stage outputs to job.archiveSystem
    1. If resources do not become available within 7 days, the job is killed.
    2. Resources are available, the job moves on.
      1. Read the .agave.archive manifest file from the job work directory
      2. Begin a breadth first directory traversal of the job work directory
      3. If a file/folder is not in the .agave.archive manifest, copy it to the job.archivePath on the job.archiveSystem
      4. Delete the job work directory
  3. Update job status to "FINISHED"