sge_status(5) Grid Engine job status values

DESCRIPTION

Job state

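The following table lists the job states shown by and returned by
The DRMAA state column gives the corresponding DRMAA_PS_* job status constant without its prefix (for example, QUEUED_ACTIVE stands for DRMAA_PS_QUEUED_ACTIVE); this is the value that may be returned by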

Category / state                            SGE code                  DRMAA state
Pending:
  pending                                   qw, Rq                    QUEUED_ACTIVE
  pending, user hold                        hqw                       USER_ON_HOLD
  pending, system hold                      hqw                       SYSTEM_ON_HOLD
  pending, user and system hold             hqw                       USER_SYSTEM_ON_HOLD
  pending, user hold, re-queue              hRwq                      USER_ON_HOLD
  pending, system hold, re-queue            hRwq                      SYSTEM_ON_HOLD
  pending, user and system hold, re-queue   hRwq                      USER_SYSTEM_ON_HOLD
Running / transferring:
  running / transferring                    r, hr, t                  RUNNING
  running / transferring, re-run            Rr, Rt                    RUNNING
Suspended:
  job suspended                             s, ts                     USER_SUSPENDED
  queue suspended                           S, tS                     SYSTEM_SUSPENDED
  queue suspended by alarm                  T, tT                     SYSTEM_SUSPENDED
  all suspended with re-run                 Rs, Rts, RS, RtS, RT, RtT SYSTEM_SUSPENDED
Error:
  all pending states with error             Eqw, Ehqw, EhRqw          FAILED
Deleting:
  all running and suspended states          dr, dt, dRr, dRt, ds,     same as for the
    with deletion                           dS, dT, dRs, dRS, dRT     code without "d"
Finished:
  job finished normally                     z                         DONE
Unknown:
  status cannot be determined               -                         UNDETERMINED
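The table above can be read as a lookup from SGE letter code to DRMAA state. The following is a minimal Python sketch of that lookup, assuming the code is passed exactly as it appears in the table; STATE_TABLE, DELETING and drmaa_states are hypothetical names, not part of Grid Engine or of any DRMAA binding. Because the letter code alone does not distinguish a user hold from a system hold, the pending hold codes return all three *_ON_HOLD candidates.

```python
from typing import Dict, Tuple

# Hypothetical helper (not part of Grid Engine or any DRMAA binding):
# maps an SGE letter code, written exactly as in the sge_status(5) table,
# to the DRMAA state name(s) listed there (without the DRMAA_PS_ prefix).

# The letter code cannot tell a user hold from a system hold, so every
# pending hold code maps to all three hold states.
HOLD_STATES = ("USER_ON_HOLD", "SYSTEM_ON_HOLD", "USER_SYSTEM_ON_HOLD")

STATE_TABLE: Dict[str, Tuple[str, ...]] = {
    # Pending
    "qw": ("QUEUED_ACTIVE",),
    "Rq": ("QUEUED_ACTIVE",),
    "hqw": HOLD_STATES,
    "hRwq": HOLD_STATES,
    # Running / transferring
    "r": ("RUNNING",),
    "hr": ("RUNNING",),
    "t": ("RUNNING",),
    "Rr": ("RUNNING",),
    "Rt": ("RUNNING",),
    # Suspended
    "s": ("USER_SUSPENDED",),
    "ts": ("USER_SUSPENDED",),
    "S": ("SYSTEM_SUSPENDED",),
    "tS": ("SYSTEM_SUSPENDED",),
    "T": ("SYSTEM_SUSPENDED",),
    "tT": ("SYSTEM_SUSPENDED",),
    # Suspended with re-run
    "Rs": ("SYSTEM_SUSPENDED",),
    "Rts": ("SYSTEM_SUSPENDED",),
    "RS": ("SYSTEM_SUSPENDED",),
    "RtS": ("SYSTEM_SUSPENDED",),
    "RT": ("SYSTEM_SUSPENDED",),
    "RtT": ("SYSTEM_SUSPENDED",),
    # Error
    "Eqw": ("FAILED",),
    "Ehqw": ("FAILED",),
    "EhRqw": ("FAILED",),
    # Finished
    "z": ("DONE",),
}

# Deleting: these running/suspended codes map like the code without the "d".
DELETING = {"dr", "dt", "dRr", "dRt", "ds", "dS", "dT", "dRs", "dRS", "dRT"}


def drmaa_states(code: str) -> Tuple[str, ...]:
    """Return the DRMAA state candidates for an SGE letter code."""
    if code in DELETING:
        code = code[1:]  # strip the deletion marker
    # Anything not listed in the table is treated as undeterminable.
    return STATE_TABLE.get(code, ("UNDETERMINED",))
```

For example, drmaa_states("dRs") returns ("SYSTEM_SUSPENDED",), the same result as for "Rs". Only the exact deletion codes from the table are stripped, so an unlisted code such as "dqw" falls through to UNDETERMINED rather than being guessed.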

The following table lists the "failed" values reported by

For each code, the table gives its description (also reported by qacct), whether the resource usage accounting data for the job are valid (column "OK"), and an explanation. The host's messages file or the shepherd trace file (preserved when KEEP_ACTIVE is set in execd_params) may provide more information about errors.
Code Description                          OK  Explanation
0    no failure                           Y   ran and exited normally
1    assumedly before job                 N   failed early in execd
3    before writing config                N   failed before execd set up local spool
4    before writing PID                   N   shepherd failed to record its PID
6    setting processor set                N   failed setting up processor set
7    before prolog                        N   failed before prolog
8    in prolog                            N   failed in prolog
9    before pestart                       N   failed before starting PE
10   in pestart                           N   failed in PE starter
11   before job                           N   failed in shepherd before starting job
12   before pestop                        Y   ran, but failed before calling PE stop
                                              procedure
13   in pestop                            Y   ran, but PE stop procedure failed
14   before epilog                        Y   ran, but failed before calling epilog
                                              script
15   in epilog                            Y   ran, but failed in epilog script
16   releasing processor set              Y   ran, but processor set could not be
                                              released
17   through signal                       Y   job killed by signal (possibly qdel)
18   shepherd returned error              N   shepherd died
19   before writing exit_status           N   shepherd did not write reports correctly
20   found unexpected error file          ?   shepherd encountered a problem
21   in recognizing job                   N   qmaster asked about an unknown job
                                              (not in accounting?)
24   migrating (checkpointing jobs)       Y   ran, will be migrated
25   rescheduling                         Y   ran, will be rescheduled
26   opening output file                  N   failed opening stderr/stdout file
27   searching requested shell            N   failed finding specified shell
28   changing to working directory        N   failed changing to start directory
29   AFS setup                            N   failed setting up AFS security
30   application error returned           Y   ran and exited with status 100; may be
                                              rescheduled
31   accessing sgepasswd file             N   failed because sgepasswd is not readable
                                              (MS Windows)
32   entry is missing in password file    N   failed because user is not in sgepasswd
                                              (MS Windows)
33   wrong password                       N   failed because of wrong password against
                                              sgepasswd (MS Windows)
34   communicating with Grid Engine       N   failed because of a helper service
     Helper Service                           failure (MS Windows)
35   before job in Grid Engine Helper     N   failed running helper service
     Service                                  (MS Windows)
36   checking configured daemons          N   failed because of configured remote
                                              startup daemon
37   qmaster enforced h_rt, h_cpu, or     Y   ran, but killed for exceeding the limit
     h_vmem limit
38   adding supplementary group           N   failed adding supplementary gid to job
100  assumedly after job                  Y   ran, but killed by a signal (perhaps
                                              for exceeding resources), task died,
                                              shepherd died (e.g. node crash), etc.
See for the effect of non-zero return codes from the various methods (prolog etc.) executed by the shepherd.
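When post-processing accounting output, the failed code can be mapped to its description and to the OK column with a plain lookup. The sketch below is a minimal example, assuming the input is a line of the form "failed <code> ..." like those printed by qacct -j (for instance "failed       100 : assumedly after job"); only the leading code is parsed, and the exact spacing is not relied upon. FAILED_CODES, FailedInfo and parse_failed are hypothetical names, not a Grid Engine API.

```python
import re
from typing import Dict, NamedTuple, Optional, Tuple

# Hypothetical helper: interpret the "failed" code of a finished job using
# the sge_status(5) table. Assumption: the input line starts with "failed"
# followed by whitespace and the numeric code; any text after it is ignored.

# code -> (description, accounting data valid?); None marks the "?" entry.
FAILED_CODES: Dict[int, Tuple[str, Optional[bool]]] = {
    0: ("no failure", True),
    1: ("assumedly before job", False),
    3: ("before writing config", False),
    4: ("before writing PID", False),
    6: ("setting processor set", False),
    7: ("before prolog", False),
    8: ("in prolog", False),
    9: ("before pestart", False),
    10: ("in pestart", False),
    11: ("before job", False),
    12: ("before pestop", True),
    13: ("in pestop", True),
    14: ("before epilog", True),
    15: ("in epilog", True),
    16: ("releasing processor set", True),
    17: ("through signal", True),
    18: ("shepherd returned error", False),
    19: ("before writing exit_status", False),
    20: ("found unexpected error file", None),
    21: ("in recognizing job", False),
    24: ("migrating (checkpointing jobs)", True),
    25: ("rescheduling", True),
    26: ("opening output file", False),
    27: ("searching requested shell", False),
    28: ("changing to working directory", False),
    29: ("AFS setup", False),
    30: ("application error returned", True),
    31: ("accessing sgepasswd file", False),
    32: ("entry is missing in password file", False),
    33: ("wrong password", False),
    34: ("communicating with Grid Engine Helper Service", False),
    35: ("before job in Grid Engine Helper Service", False),
    36: ("checking configured daemons", False),
    37: ("qmaster enforced h_rt, h_cpu, or h_vmem limit", True),
    38: ("adding supplementary group", False),
    100: ("assumedly after job", True),
}

# re.match anchors at the start, so this only accepts lines beginning with "failed".
FAILED_LINE = re.compile(r"failed\s+(\d+)")


class FailedInfo(NamedTuple):
    code: int
    description: str
    accounting_valid: Optional[bool]  # None: "?" in the table, or code not listed


def parse_failed(line: str) -> FailedInfo:
    """Parse a 'failed <code> ...' line and look the code up in FAILED_CODES."""
    match = FAILED_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a 'failed' line: {line!r}")
    code = int(match.group(1))
    description, valid = FAILED_CODES.get(code, ("unknown code", None))
    return FailedInfo(code, description, valid)
```

Codes that are not in the table, such as 99, come back as "unknown code" with accounting_valid set to None, so a caller that treats None like N will not trust usage data it cannot vouch for.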