Troubleshooting jobs on an SGE cluster
Oracle Grid Engine, previously Sun Grid Engine (SGE), is a software system used on a compute farm or high-performance computing (HPC) cluster to accept, schedule, dispatch, and manage the remote and distributed execution of large numbers of standalone, parallel, or interactive user jobs.
Submitting jobs to SGE
- SGE submission script
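As a concrete sketch, a minimal submission script could look like the following. The job name, queue-independent resource values, and log file names are all illustrative; the `#$` lines are directives read by `qsub`, and plain comments to bash:

```shell
#!/bin/bash
# Lines starting with "#$" are SGE directives read by qsub;
# to bash they are ordinary comments, so the script also runs standalone.
#$ -S /bin/bash          # interpreter used to run the job
#$ -cwd                  # run in the directory qsub was called from
#$ -N demo_job           # job name shown by qstat (illustrative)
#$ -l h_vmem=4G          # hard memory limit per slot (illustrative)
#$ -l h_rt=01:00:00      # hard run-time limit (illustrative)
#$ -o demo_job.o         # STDOUT log file
#$ -e demo_job.e         # STDERR log file

echo "Job started on host: $(hostname)"
```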
The script can then be submitted using the `qsub` command, which has several useful arguments:
qsub arguments
- `-q <queue>`: set the queue. Often you will use the standard queue.
- `-S /bin/bash`: use bash as the shell/interpreter to run the script
- `-V`: pass all environment variables to the job
- `-v var[=value]`: pass the specific environment variable `var` to the job
- `-wd <dir>`: set the working directory for this job to `<dir>`
- `-cwd`: run in the current working directory (in which the `qsub` command was issued)
- `-o <output_logfile>`: name of the output log file
- `-e <error_logfile>`: name of the error log file
- `-j y`: send STDOUT and STDERR to the same file
- `-l h_vmem=<size>`: the maximum amount of memory required (e.g. 4G)¹
- `-l h_rt=<hh:mm:ss>`: specify the maximum (hard) run time
- `-l s_rt=<hh:mm:ss>`: specify the soft run time limit
- `-b y`: allow the command to be a binary file instead of a script
- `-w e`: verify options and abort if there is an error
- `-N <jobname>`: name of the job; this is what you will see when you use `qstat`
- `-l hostname=<hostname>`: request a specific host
- `-pe cores <n_slots>`: specify the parallel environment and the number of slots²
- `-m ea`: send email when the job ends or aborts
- `-P <projectName>`: set the job's project
- `-M <emailaddress>`: email address to send notifications to

¹ This is memory per processor slot, e.g. for 2 slots the total memory will be 2 * h_vmem.
² The name of the parallel environment (`cores` here) differs from one site to another.
You can see the full list of arguments and explanations in the `qsub` man page.
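Put together, a typical submission combining several of these flags might look like this. The queue name, job name, email address, script name, and the `cores` parallel environment are all illustrative and site-specific:

```shell
qsub -q standard.q -S /bin/bash -cwd -V \
     -N myjob -o myjob.o -e myjob.e \
     -l h_vmem=4G -l h_rt=08:00:00 \
     -pe cores 4 \
     -m ea -M user@example.com \
     run_analysis.sh
```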
How to diagnose a failed SGE job
- The job died instantly
  - Check the log files of the specific job:
    - Check the `.o` and `.e` files in the job directory
    - Check the `.po` and `.pe` files for parallel MPI jobs
  - Check the `qmaster` spool messages and the node `execd` messages
  - Run the job using `qsub -w v <full job request>`, which will ignore all load values
- The job is pending forever
  - Check `qstat -j <job_id>`
  - Check `$SGE_ROOT/default/spool/qmaster/schedd/messages`
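In practice the first checks above often come down to commands like these (the job id and log file name are illustrative):

```shell
# Ask the scheduler why the job failed or is stuck (job id is illustrative):
qstat -j 3223559

# Look at the tail of the job's error log for the last messages before it died:
tail -n 20 myjob.e3223559

# Scheduler-side hints often land in the qmaster spool messages:
tail -n 50 $SGE_ROOT/default/spool/qmaster/schedd/messages
```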
SGE error codes
- `9`: ran out of CPU time.
- `64`: the job ended nicely, but it was running out of CPU time. The solution is to submit the job to a queue with more resources (a bigger CPU time limit).
- `125`: an ErrMsg(severe) was reached in your job.
- `127`: something wrong with the machine?
- `130`: the job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.
- `131`: the job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.
- `134`: the job was killed with an abort signal, and you probably got a core dump. Often this is caused by an assert() or an ErrMsg(fatal) being hit in your job. There may be a run-time bug in your code. Use a debugger like gdb or TotalView to find out what's wrong.
- `137`: the job was killed because it exceeded the time limit.
- `139`: segmentation violation. Usually indicates a pointer error.
- `140`: the job exceeded the "wall clock" time limit (as opposed to the CPU time limit).

Typically:
- exit code `0` means successful completion
- `1-127` are generated by the job calling exit() with a non-zero value to indicate an error
- `129-255` represent jobs terminated by Unix signals: add the signal number, e.g. `12`, to the base `128` to get the exit code, here `140`
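The signal arithmetic can be checked directly in the shell; for example, signal 12 plus the base 128 gives exit code 140:

```shell
signal=12                      # the signal that killed the job
exit_code=$((signal + 128))    # shells report signal deaths as 128 + signal
echo "signal $signal -> exit code $exit_code"   # prints: signal 12 -> exit code 140
```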
Use `qstat -j 3223559` to get information about a running/pending job.
Use `qstat -f` for a list of jobs including their status and host.
Use `qstat -F` for a detailed report about each host.
Use `qstat -t` for info on array jobs.
Use `qstat -ext` to get extended job information.
In the output, `job-ID` identifies the job, which here runs under the `all.q` queue on node `node-1-2`. Other columns:
- `ntckts`: total number of tickets, in a normalised fashion
- `cpu`: current accumulated CPU usage of the job, in seconds
- `mem`: current accumulated memory usage of the job, in Gbyte seconds
- `io`: current accumulated IO usage of the job
Use `qacct -j 3223559` for information about past jobs.
- `maxvmem`: the absolute peak of virtual memory the job needed; it is the virtual memory summed over all processes belonging to the job. This is useful, as it tells us how much memory to request for similar jobs in the future.
Use `qhost -h node-0-6.local -F h_vmem` for the current `h_vmem` value.
- `mem_free` defines how much free memory the host needs to have in order to accept the job.
- `h_vmem` is the limit on the peak memory a job can consume; the job will crash if it goes beyond it.
- One should request the maximum memory using `qsub -l h_vmem=<size>`, and the resource name has to match the memory label (here `h_vmem`) in the `complex_values`.
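For instance, to request 4G per slot across 2 slots (8G in total, following the per-slot note above; the `cores` parallel environment name and the script name are illustrative):

```shell
qsub -l h_vmem=4G -pe cores 2 myscript.sh
```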
Use `qconf -sc | grep -e '^#' -e _rt` to check for hard/soft time limits.

If `default` is not `0:0:0`, then a limit is enforced on the run time, which can be overridden with a `qsub -l s_rt=8:0:0` command.
- Checking the status of an SGE job
  - Job status list:
    - `t`: transferring
    - `r`: running
    - `d`: deleted
    - `R`: restarted
    - `s`: suspended
    - `S`: suspended by the queue
    - `T`: suspended, queue threshold reached
    - `w`: waiting
    - `h`: hold
    - `e`: error
Use `qconf -sconf` for the general configuration.
Use `qconf -ssconf` to list the scheduler configuration.
Use `qconf -shgrpl` to list available host groups.
Use `qconf -shgrp @allhosts` to show the hosts in a specific host group.
Use `qhost` to check all hosts.
Use `qconf -sql` to list all queues.
Use `qconf -sq all.q` to list a specific queue's configuration.
Use `qalter` to alter SGE jobs:
- `qalter -h u <JOB-ID>` to hold a job
- `qalter -h U <JOB-ID>` to release the held job