,
whether the resource usage accounting data are valid for the job
("OK"), and an explanation. The host's messages file or the shepherd
trace file (preserved with
in
may provide more information about errors.
Code | Description | OK | Explanation
0 | no failure | Y | ran and exited normally
1 | assumedly before job | N | failed early in execd
3 | before writing config | N | failed before execd set up local spool
4 | before writing PID | N | shepherd failed to record its pid
6 | setting processor set | N | failed setting up processor set
7 | before prolog | N | failed before prolog
8 | in prolog | N | failed in prolog
9 | before pestart | N | failed before starting PE
10 | in pestart | N | failed in PE starter
11 | before job | N | failed in shepherd before starting job
12 | before pestop | Y | ran, but failed before calling PE stop procedure
13 | in pestop | Y | ran, but PE stop procedure failed
14 | before epilog | Y | ran, but failed before calling epilog script
15 | in epilog | Y | ran, but failed in epilog script
16 | releasing processor set | Y | ran, but processor set could not be released
17 | through signal | Y | job killed by signal (possibly qdel)
18 | shepherd returned error | N | shepherd died
19 | before writing exit_status | N | shepherd didn't write reports correctly
20 | found unexpected error file | ? | shepherd encountered a problem
21 | in recognizing job | N | qmaster asked about an unknown job (not in accounting?)
24 | migrating (checkpointing jobs) | Y | ran, will be migrated
25 | rescheduling | Y | ran, will be rescheduled
26 | opening output file | N | failed opening stderr/stdout file
27 | searching requested shell | N | failed finding specified shell
28 | changing to working directory | N | failed changing to start directory
29 | AFS setup | N | failed setting up AFS security
30 | application error returned | Y | ran and exited 100 - maybe re-scheduled
31 | accessing sgepasswd file | N | failed because sgepasswd not readable (MS Windows)
32 | entry is missing in password file | N | failed because user not in sgepasswd (MS Windows)
33 | wrong password | N | failed because of wrong password against sgepasswd (MS Windows)
34 | communicating with Grid Engine Helper Service | N | failed because of failure of helper service (MS Windows)
35 | before job in Grid Engine Helper Service | N | failed because of failure running helper service (MS Windows)
36 | checking configured daemons | N | failed because of configured remote startup daemon
37 | qmaster enforced h_rt, h_cpu, or h_vmem limit | Y | ran, but killed due to exceeding run time limit
38 | adding supplementary group | N | failed adding supplementary gid to job
100 | assumedly after job | Y | ran, but killed by a signal (perhaps due to exceeding resources), task died, shepherd died (e.g. node crash), etc.
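
When post-processing accounting records, the "OK" column can be turned into a small lookup so that records with invalid usage data are skipped. The following Python sketch is illustrative only and is not part of Grid Engine: the names USAGE_VALID and usage_is_valid are hypothetical, and it assumes the caller has already extracted the failure code from the record.

```python
# Hypothetical helper, not part of Grid Engine.
# Values mirror the "OK" column of the table above:
# True = "Y", False = "N", None = "?" (undetermined).
USAGE_VALID = {
    0: True, 1: False, 3: False, 4: False, 6: False, 7: False, 8: False,
    9: False, 10: False, 11: False, 12: True, 13: True, 14: True, 15: True,
    16: True, 17: True, 18: False, 19: False, 20: None, 21: False,
    24: True, 25: True, 26: False, 27: False, 28: False, 29: False,
    30: True, 31: False, 32: False, 33: False, 34: False, 35: False,
    36: False, 37: True, 38: False, 100: True,
}


def usage_is_valid(failed_code):
    """Return True or False per the OK column.

    Returns None for code 20 ("?") and for any code not listed in the table.
    Accepts an int or a numeric string, as read from a text record.
    """
    return USAGE_VALID.get(int(failed_code))
```

For example, usage_is_valid(37) returns True because the job ran before the limit was enforced, while usage_is_valid(11) returns False because the job never started.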
See
for the effect of non-zero return codes from the various methods
(prolog etc.) executed by the shepherd.