
Troubleshooting jobs on an SGE cluster

Oracle Grid Engine, previously Sun Grid Engine (SGE), is a software system used on a computer farm or high-performance computing (HPC) cluster to accept, schedule, dispatch, and manage the remote and distributed execution of large numbers of standalone, parallel, or interactive user jobs.

Submitting jobs to SGE

  • SGE submission script
#!/bin/sh 
# -- SGE ARGUMENTS --
#$ -S /bin/sh
#$ -o %DIR%/.%JOBID%.qlog.out
#$ -e %DIR%/.%JOBID%.qlog.err

# your commands go here ...

The script can then be submitted using the qsub command which has useful arguments:

  • qsub arguments
    • -q <queue> : set the queue. Often you will use the standard queue.
    • -S /bin/bash : use /bin/bash as the shell/interpreter to run the script
    • -V : will pass all environment variables to the job
    • -v var[=value] : pass specific environment variable ‘var’ to the job
    • -wd <dir> : set the working directory for this job to <dir>
    • -cwd : run in the current working directory (in which “qsub” command was issued)
    • -o <output_logfile> : name of the output log file
    • -e <error_logfile> : name of the error log file
    • -j y : merge STDERR into STDOUT so both go to the same file
    • -l h_vmem=<size> : the maximum amount of memory required (e.g. 4G) [1]
    • -l h_rt=<hh:mm:ss> : specify the maximum run time
    • -l s_rt=hh:mm:ss : specify the soft run time limit
    • -b y : allow command to be a binary file instead of a script.
    • -w e : verify options and abort if there is an error
    • -N <jobname> : name of the job; this is what you will see when you use qstat
    • -l hostname=HOSTNAME : request a specific host
    • -pe cores <n_slots> : specify the parallel environment and the number of slots [2]
    • -m ea : Will send email when job ends or aborts
    • -P <projectName> : set the job’s project
    • -M <emailaddress> : Email address to send email to

    • [1] This is memory per processor slot, e.g. for 2 slots the total memory reserved will be 2 * h_vmem.
    • [2] The name of the parallel environment (cores here) differs from one cluster configuration to another.

    You can see the full list of arguments and explanations here. A minimal submission script combining several of these options is sketched below.
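
Putting several of these options together, a minimal submission script might look like the following sketch. The queue name (standard), parallel environment name (cores), project, and email address are placeholders that depend on your site's configuration.

#!/bin/bash
# -- SGE ARGUMENTS --
#$ -S /bin/bash              # interpreter used to run the script
#$ -N my_analysis            # job name shown by qstat
#$ -q standard               # queue name (placeholder, site-specific)
#$ -cwd                      # run in the directory where qsub was issued
#$ -V                        # export the submission environment to the job
#$ -pe cores 4               # parallel environment and slot count (name is site-specific)
#$ -l h_vmem=4G              # maximum memory per slot (4 slots x 4G = 16G in total)
#$ -l h_rt=08:00:00          # hard run-time limit
#$ -o my_analysis.qlog.out   # STDOUT log file
#$ -e my_analysis.qlog.err   # STDERR log file
#$ -m ea                     # email when the job ends or aborts
#$ -M user@example.org       # address the email is sent to

# your commands go here, e.g.
echo "Running on $(hostname) with $NSLOTS slots"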

How to diagnose a failed SGE job

  • The job died instantly
    • Check log files of the specific job
      • Check the .o and .e files in the job directory
      • Check .po and .pe files for parallel MPI jobs
    • Check qmaster spool messages and node execd messages
    • Validate the job request with qsub -w v <full job request>
      • verifies whether the job could be scheduled, ignoring all load values
  • The job is pending forever (see the combined sketch after this list)
    • Check qstat -j <job_id>
    • Check $SGE_ROOT/default/spool/qmaster/schedd/messages
cat /opt/gridengine/default/spool/qmaster/messages
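
A combined sketch of the checks above; the job id, node name, and the /opt/gridengine spool path come from the examples on this page and may differ on your cluster.

JOB_ID=3223559

# log files written for the job (names depend on the -o/-e options that were used)
ls -l *.o${JOB_ID} *.e${JOB_ID} *.po${JOB_ID} *.pe${JOB_ID} 2>/dev/null
tail -n 50 *.e${JOB_ID} 2>/dev/null

# the scheduler's own explanation for a pending job
qstat -j ${JOB_ID}

# qmaster and execution daemon messages (default spool layout assumed)
tail -n 100 /opt/gridengine/default/spool/qmaster/messages
tail -n 100 /opt/gridengine/default/spool/node-0-6/messages

# verify the job request, ignoring all load values
qsub -w v my_job_script.sh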

SGE error codes

  • 9: Ran out of CPU time.
  • 64: The job ended nicely, but it was running out of CPU time. The solution is to submit the job to a queue with more resources (a larger CPU time limit).
  • 125: An ErrMsg(severe) was reached in your job.
  • 127: Something wrong with the machine?
  • 130: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.
  • 131: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.
  • 134: The job is killed with an abort signal, and you probably got core dumped. Often this is caused either by an assert() or an ErrMsg(fatal) being hit in your job. There may be a run-time bug in your code. Use a debugger like gdb or Totalview to find out what’s wrong.
  • 137 : The job was killed because it exceeded the time limit.
  • 139 : Segmentation violation. Usually indicates a pointer error.
  • 140: The job exceeded the “wall clock” time limit (as opposed to the CPU time limit).

Typically,

  • exit code 0 means successful completion.
  • 1-127 are generated from the job calling exit() with a non-zero value to indicate an error.
  • 129-255 represent jobs terminated by Unix signals.
  • You add the signal number to the base 128 to get the exit code, e.g. 128 + 12 = 140; the snippet below shows how to decode such a code.
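
As a quick worked example, exit code 140 minus the base 128 gives signal 12, and bash's kill -l can translate that number back into a signal name:

exit_code=140
signal_number=$((exit_code - 128))   # 140 - 128 = 12
kill -l "$signal_number"             # prints the signal name for number 12 (USR2 on Linux)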

Use qstat -j 3223559 to get info about a running/pending job

==============================================================
job_number:     3223559
exec_file:      job_scripts/3223559
submission_time: Sun Oct 21 09:26:54 2018
owner:          userNAME
uid:            14163
group:          groupNAME
gid:            15021
sge_o_home:     /cluster/home/userNAME
sge_o_log_name: userNAME
sge_o_path:     /cluster/home/userNAME/workDir/Apps/nextflow:/cluster/home/userNAME/.local/bin:/cluster/home/userNAME/workDir/apps/ataqv-1.0.0/bin:/cluster/home/userNAME/anaconda3/bin:/cluster/home/userNAME/workDir/Apps/nextflow:/opt/openmpi/bin:/cluster/apps/clusterflow/0.4_devel:/cluster/home/userNAME/bin:/usr/bin:/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/gridengine/bin/lx26-amd64
sge_o_shell:    /cluster/home/userNAME/bin/zsh
sge_o_workdir:  /cluster/group/groupNAME/workDir/workDir_Pipelines/pipelines/atacPipeline/work/91/e8a79a73303f58258874437b02a535
sge_o_host:     rocks1
account:        sge
cwd:            /cluster/group/groupNAME/workDir/workDir_Pipelines/pipelines/atacPipeline/work/91/e8a79a73303f58258874437b02a535
merge:          y
hard resource_list:         virtual_free=100G,h_vmem=10G
mail_list:      userNAME@rocks1.local
notify:         TRUE
job_name:       nf-deeptools
stdout_path_list:           NONE:NONE:/cluster/group/groupNAME/workDir/workDir_Pipelines/pipelines/atacPipeline/work/91/e8a79a73303f58258874437b02a535/.command.log
jobshare:       0
shell_list:     NONE:/bin/bash
env_list:       
script_file:    .command.run
parallel environment:  cores range: 10
usage    1:     cpu=00:21:20, mem=1064.38683 GBs, io=84.15673, vmem=1.564G, maxvmem=1.674G

Use qstat -f for a list of jobs including their status and host


queuename          qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node-0-0.local        BP    0/58/64        38.47    lx26-amd64    
3223558 0.56500 nf-ngsplot userNAME     r     10/21/2018 09:26:54    10        
3223566 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:55    10        
3223572 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:04    10        
3223578 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:24    10        
---------------------------------------------------------------------------------
all.q@node-0-1.local        BP    0/44/64        17.55    lx26-amd64    
3223560 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:54    10        
3223574 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:04    10        
3223579 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:24    10        
---------------------------------------------------------------------------------
all.q@node-0-10.local       BP    0/55/64        19.26    lx26-amd64    
3223561 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:54    10        
3223577 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:22    10        
3223581 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:25    10        
3223582 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:25    10        
---------------------------------------------------------------------------------
all.q@node-0-2.local        BP    0/17/64        17.10    lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-0-3.local        BP    0/16/64        16.00    lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-0-4.local        BP    0/14/64        14.04    lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-0-5.local        BP    0/26/64        17.43    lx26-amd64    
3223563 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:54    10        
---------------------------------------------------------------------------------
all.q@node-0-6.local        BP    0/56/64        20.56    lx26-amd64    
3223559 0.56500 nf-deeptoo userNAME     r     10/21/2018 09:26:54    10        
3223570 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:59    10        
3223571 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:59    10        
3223576 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:22    10        
---------------------------------------------------------------------------------
all.q@node-0-7.local        BP    0/55/64        19.45    lx26-amd64    
3223564 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:54    10        
3223565 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:54    10        
3223569 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:59    10        
3223575 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:22    10        
---------------------------------------------------------------------------------
all.q@node-0-8.local        BP    0/36/64        18.28    lx26-amd64    
3223562 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:54    10        
3223580 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:25    10        
---------------------------------------------------------------------------------
all.q@node-0-9.local        BP    0/53/64        19.62    lx26-amd64    
3223557 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:53    10        
3223567 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:55    10        
3223568 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:26:55    10        
3223573 0.56500 nf-nucleoa userNAME     r     10/21/2018 09:27:04    10        
---------------------------------------------------------------------------------
all.q@node-1-0.local        I     0/2/8          0.03     lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-1-1.local        I     0/1/8          1.75     lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-1-2.local        I     0/1/8          0.00     lx26-amd64    
3223282 0.50500 QRLOGIN    userNAME     r     10/21/2018 07:48:08     1        
---------------------------------------------------------------------------------
all.q@node-1-3.local        I     0/2/8          0.00     lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-1-4.local        I     0/1/8          0.01     lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-1-5.local        I     0/0/8          1.00     lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-1-6.local        I     0/1/8          0.02     lx26-amd64    
---------------------------------------------------------------------------------
all.q@node-1-7.local        I     0/0/8          0.02     lx26-amd64    

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
3223583 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223584 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223585 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223586 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223587 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223588 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223589 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223590 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223591 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223592 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223593 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223594 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:55    10        
3223595 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:56    10        
3223596 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:56    10        
3223597 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:56    10        
3223598 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:56    10        
3223599 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:56    10        
3223600 0.56500 nf-nucleoa userNAME     qw    10/21/2018 09:26:56    10

Use qstat -F for a detailed report about each host

Only one host is listed here as an example

all.q@node-0-6.local BP 0/56/64 20.54 lx26-amd64
hl:arch=lx26-amd64
hl:num_proc=64
hl:mem_total=504.866G
hl:swap_total=999.996M
hl:virtual_total=505.843G
hl:load_avg=20.540000
hl:load_short=20.400000
hl:load_medium=20.540000
hl:load_long=19.620000
hl:mem_free=498.202G
hl:swap_free=980.570M
hl:virtual_free=499.160G
hl:mem_used=6.664G
hl:swap_used=19.426M
hl:virtual_used=6.683G
hl:cpu=32.100000
hl:m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT
hl:m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT
hl:m_socket=4
hl:m_core=32
hl:np_load_avg=0.320937
hl:np_load_short=0.318750
hl:np_load_medium=0.320937
hl:np_load_long=0.306563
hc:h_vmem=64.000G
qf:qname=all.q
qf:hostname=node-0-6.local
qc:slots=8
qf:tmpdir=/tmp
qf:seq_no=0
qf:rerun=0.000000
qf:calendar=NONE
qf:s_rt=infinity
qf:h_rt=infinity
qf:s_cpu=infinity
qf:h_cpu=infinity
qf:s_fsize=infinity
qf:h_fsize=infinity
qf:s_data=infinity
qf:h_data=infinity
qf:s_stack=infinity
qf:h_stack=infinity
qf:s_core=infinity
qf:h_core=infinity
qf:s_rss=infinity
qf:h_rss=infinity
qf:s_vmem=infinity
qf:min_cpu_interval=00:05:00
3223559 0.56500 nf-deeptoo userNAME r 10/21/2018 09:26:54 10
3223570 0.56500 nf-nucleoa userNAME r 10/21/2018 09:26:59 10
3223571 0.56500 nf-nucleoa userNAME r 10/21/2018 09:26:59 10
3223576 0.56500 nf-nucleoa userNAME r 10/21/2018 09:27:22 10

Use qstat -t for info on the sub-tasks of parallel and array jobs

job-ID  prior   name       user         state submit/start at     queue  master ja-task-ID task-ID state cpu        mem     io      stat failed 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
3223282 0.50500 QRLOGIN    userNAME     r     10/21/2018 07:48:08 all.q@node-1-2.local        MASTER        
3223611 0.56500 nf-nucleoa userNAME     r     10/21/2018 11:27:50 all.q@node-0-8.local        MASTER        
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
      all.q@node-0-8.local        SLAVE         
3223612 0.56500 nf-nucleoa userNAME     r     10/21/2018 11:27:50 all.q@node-0-5.local        MASTER        
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
      all.q@node-0-5.local        SLAVE         
3223613 0.56500 nf-nucleoa userNAME     r     10/21/2018 11:27:50 all.q@node-0-7.local        MASTER        
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE         
      all.q@node-0-7.local        SLAVE

Use qstat -ext to get extended job information

job-ID  prior   ntckts  name       user         project          department state cpu        mem     io      tckts ovrts otckt ftckt stckt share queue  slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3223282 0.50500 0.50000 QRLOGIN    userNAME     NA   defaultdep r     0:00:00:00 0.00000 0.00000     0     0     0     0     0 0.00  all.q@node-1-2.local1        
3223557 0.56500 0.50000 nf-nucleoa userNAME     NA   defaultdep r     0:00:33:42 1394.14120 9.40318     0     0     0     0     0 0.00  all.q@node-0-9.local           10        
3223558 0.56500 0.50000 nf-ngsplot userNAME     NA   defaultdep r     0:07:06:20 22919.46127 36.92374     0     0     0     0     0 0.00  all.q@node-0-0.local           10        
3223559 0.56500 0.50000 nf-deeptoo userNAME     NA   defaultdep r     0:00:30:43 1547.71585 115.01620     0     0     0     0     0 0.00  all.q@node-0-6.local           10

where

  • job-ID identifies the job, e.g. 3223282 running under the all.q queue on node-1-2
  • ntckts total number of tickets in a normalised fashion
  • cpu current accumulated CPU usage of the job in seconds
  • mem current accumulated memory usage of the job in Gbyte seconds
  • io current accumulated IO usage of the job

Use qacct -j 3223559 for information about past jobs

==============================================================
qname        all.q   
hostname     node-0-6.local   
group        groupNAME        
owner        userNAME
project      NONE    
department   defaultdepartment   
jobname      nf-deeptools        
jobnumber    3223559 
taskid       undefined
account      sge     
priority     0       
qsub_time    Sun Oct 21 09:26:54 2018
start_time   Sun Oct 21 09:26:54 2018
end_time     Sun Oct 21 11:20:10 2018
granted_pe   cores   
slots        10      
failed       0    
exit_status  2       
ru_wallclock 6796         
ru_utime     28393.550    
ru_stime     769.661      
ru_maxrss    1157344 
ru_ixrss     0       
ru_ismrss    0       
ru_idrss     0       
ru_isrss     0       
ru_minflt    94041146
ru_majflt    3       
ru_nswap     0       
ru_inblock   15534552
ru_oublock   18351592
ru_msgsnd    0       
ru_msgrcv    0       
ru_nsignals  0       
ru_nvcsw     1230690 
ru_nivcsw    663995  
cpu          29163.211    
mem          12694.826         
io           2279.955          
iow          0.000 
maxvmem      5.371G
arid         undefined

maxvmem is the peak virtual memory the job needed, i.e. the sum of the virtual memory of all processes belonging to the job.

This is useful, as it allows us to decide how much memory to request for similar jobs in the future, for example:

#!/bin/bash -l
#$ -P my_project
#$ -N my_job
#$ -l h_vmem=2G
#$ -pe cores 1
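
To pull maxvmem out of the accounting record programmatically, something like the sketch below works (the job id is the one from the example above):

# peak virtual memory of the finished job, summed over all of its processes
qacct -j 3223559 | awk '$1 == "maxvmem" {print $2}'
# prints 5.371G here; since h_vmem is requested per slot, divide by the
# slot count (10 in this example) and add some headroom before setting -l h_vmem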

Use qhost -h node-0-6.local -F h_vmem for the current h_vmem value

HOSTNAME    ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global      -   -     -       -       -       -       -
node-0-6 lx26-amd64     64 20.41  504.9G    6.6G 1000.0M   19.4M
    Host Resource(s):      hc:h_vmem=64.000G

Use qconf -se node-0-6.local to show the host configuration; the h_vmem entry in complex_values is the allocated h_vmem value.

hostname  node-0-6.local
load_scaling          NONE
complex_values        h_vmem=512G
load_values           arch=lx26-amd64,num_proc=64,mem_total=516982.832031M, \
          swap_total=999.996094M,virtual_total=517982.828125M, \
          load_avg=20.360000,load_short=20.620000, \
          load_medium=20.360000,load_long=18.700000, \
          mem_free=510212.945312M,swap_free=980.570312M, \
          virtual_free=511193.515625M,mem_used=6769.886719M, \
          swap_used=19.425781M,virtual_used=6789.312500M, \
          cpu=31.900000, \
          m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
          m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
          m_socket=4,m_core=32,np_load_avg=0.318125, \
          np_load_short=0.322188,np_load_medium=0.318125, \
          np_load_long=0.292187
processors            64
user_lists            NONE
xuser_lists           NONE
projects  NONE
xprojects NONE
usage_scaling         NONE
report_variables      NONE
  • mem_free defines how much free memory the host needs to have in order to accept the job.
  • h_vmem is the limit on the peak memory a job may consume; the job is killed if it goes beyond it.
  • You should request the maximum memory with qsub -l h_vmem=<size>, and the resource name has to match the memory label (here h_vmem) defined in complex_values (see the example request below).
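
A request matching this setup might look like the following sketch (the parallel environment name cores and the script name are placeholders):

# 4 slots with 10G per slot: the scheduler reserves 4 x 10G = 40G on the node;
# the resource name must match the label defined in complex_values (here h_vmem)
qsub -pe cores 4 -l h_vmem=10G my_job_script.sh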

Use qconf -sc | grep -e '^#' -e _rt to check for hard/soft time limit

#name shortcut type relop requestable consumable default urgency
#----------------------------------------------------------------------------------------
h_rt h_rt TIME <= YES NO 0:0:0 0
s_rt s_rt TIME <= YES NO 0:0:0 0

If default is not 0:0:0, then a limit is enforced on run time, which can be overridden with a command like qsub -l s_rt=8:0:0
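
Grid Engine warns a job via SIGUSR1 when the soft limit s_rt is exceeded, before the hard limit h_rt kills it, so a script can trap that signal and checkpoint. A minimal sketch (the computation and checkpoint steps are placeholders):

#!/bin/bash
#$ -S /bin/bash
#$ -l s_rt=07:50:00   # soft run-time limit: the job is warned via SIGUSR1
#$ -l h_rt=08:00:00   # hard run-time limit: the job is killed

# checkpoint and exit cleanly when the soft-limit warning arrives
trap 'echo "soft run-time limit reached, checkpointing"; touch CHECKPOINT; exit 99' USR1

# run the real work in the background so the trap can fire while it runs
./my_long_computation &
wait $!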

  • Checking the status of an SGE job (a sketch for filtering jobs by state follows this list)
    • Job status list
      • t transferring
      • r running
      • d deleted
      • R restarted
      • s suspended
      • S suspended by the queue
      • T suspended queue threshold reached
      • w waiting
      • h hold
      • e error
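
For example, to list your jobs whose state contains the error flag and, after fixing the cause, clear that state so the job can be rescheduled (qmod -cj is assumed to be available on your Grid Engine version; the state column position matches the qstat output above):

# jobs of the current user whose state column contains an error flag (e.g. Eqw)
qstat -u "$USER" | awk '$5 ~ /E/ {print $1, $3, $5}'

# clear the error state of a specific job once the underlying problem is fixed
qmod -cj <JOB-ID>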

Use qconf -sconf for general configuration

#global:
execd_spool_dir  /opt/gridengine/default/spool
mailer           /bin/mail
xterm            /usr/bin/xterm
load_sensor      none
prolog           none
epilog           none
shell_start_mode posix_compliant
login_shells     sh,ksh,csh,tcsh
min_uid          0
min_gid          0
user_lists       none
xuser_lists      none
projects         none
xprojects        none
enforce_project  false
enforce_user     auto
load_report_time 00:00:40
max_unheard      00:05:00
reschedule_unknown           00:00:00
loglevel         log_warning
administrator_mail           none
set_token_cmd    none
pag_cmd          none
token_extend_time none
shepherd_cmd     none
qmaster_params   none
execd_params     none
reporting_params accounting=true reporting=true \
     flush_time=00:00:15 joblog=true sharelog=00:00:00
finished_jobs    100
gid_range        20000-20100
qlogin_command   /cluster/apps/ssh_wrapper/qlogin_wrapper
qlogin_daemon    /usr/sbin/sshd -i
rlogin_command   /usr/bin/ssh
rlogin_daemon    /usr/sbin/sshd -i
rsh_command      /usr/bin/ssh
rsh_daemon       /usr/sbin/sshd -i
max_aj_instances 2000
max_aj_tasks     75000
max_u_jobs       0
max_jobs         0
max_advance_reservations     0
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize     false
jsv_url          script:/opt/gridengine/util/resources/jsv/h_vem_jsv.pl
jsv_allowed_mod  ac,h,i,e,o,j,M,N,p,w

Use qconf -ssconf to list server configuration

algorithm default
schedule_interval 0:0:15
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info true
flush_submit_sec 1
flush_finish_sec 1
params none
reprioritize_interval 0:0:0
halftime 168
usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor 5.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 0
weight_tickets_share 0
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list none
policy_hierarchy OFS
weight_ticket 0.010000
weight_waiting_time 0.000000
weight_deadline 3600000.000000
weight_urgency 0.100000
weight_priority 1.000000
max_reservation 0
default_duration INFINITY

Use qconf -shgrpl to list available host groups

@allhosts
@compute
@interactive

Use qconf -shgrp @allhosts to show hosts in a specific host group

group_name @allhosts
hostlist node-1-0.local node-1-1.local node-1-2.local \
node-1-3.local node-0-10.local node-1-4.local \
node-1-5.local node-1-6.local node-1-7.local \
node-0-9.local node-0-8.local node-0-7.local \
node-0-4.local node-0-3.local node-0-5.local \
node-0-6.local node-0-2.local node-0-1.local node-0-0.local

Use qhost to check all hosts


HOSTNAME    ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global      -   -     -       -       -       -       -
node-0-0 lx26-amd64     64 27.48  504.9G   15.9G 1000.0M   28.1M
node-0-1 lx26-amd64     64 21.03  504.9G    6.4G 1000.0M  392.1M
node-0-10 lx26-amd64    64 29.29  504.9G    7.1G 1000.0M  757.7M
node-0-2 lx26-amd64     64 17.00  126.1G    2.8G 1000.0M   25.8M
node-0-3 lx26-amd64     64 17.00  126.1G    2.7G 1000.0M   23.9M
node-0-4 lx26-amd64     64 13.02  252.4G    3.4G 1000.0M   21.6M
node-0-5 lx26-amd64     64 18.37  504.9G    6.1G 1000.0M   21.8M
node-0-6 lx26-amd64     64 30.83  504.9G    6.9G 1000.0M   19.4M
node-0-7 lx26-amd64     64 15.44  504.9G    6.2G 1000.0M  339.7M
node-0-8 lx26-amd64     64 25.54  504.9G   52.0G 1000.0M  978.5M
node-0-9 lx26-amd64     64 19.67  504.9G   40.6G 1000.0M  670.6M
node-1-0 lx26-amd64      8  0.00  126.1G   74.1G 1000.0M 1000.0M
node-1-1 lx26-amd64      8  1.70  126.1G   54.0G 1000.0M 1000.0M
node-1-2 lx26-amd64      8  0.00  126.1G   27.1G 1000.0M 1000.0M
node-1-3 lx26-amd64      8  0.00  126.1G   51.8G 1000.0M 1000.0M
node-1-4 lx26-amd64      8  0.01  126.1G   28.3G 1000.0M 1000.0M
node-1-5 lx26-amd64      8  1.00  126.1G   78.5G 1000.0M 1000.0M
node-1-6 lx26-amd64      8  0.02  126.1G   87.9G 1000.0M 1000.0M
node-1-7 lx26-amd64      8  0.00  126.1G   40.8G 1000.0M 1000.0M

Use qconf -sql to list all queues

all.q

Use qconf -sq all.q to list specific queue configuration

qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE,[node-0-3.local=BATCH], \
[node-0-0.local=BATCH],[node-0-1.local=BATCH], \
[node-0-2.local=BATCH],[node-0-4.local=BATCH], \
[node-1-0.local=INTERACTIVE], \
[node-1-1.local=INTERACTIVE], \
[node-1-2.local=INTERACTIVE], \
[node-1-3.local=INTERACTIVE], \
[node-0-5.local=BATCH],[node-0-6.local=BATCH], \
[node-1-4.local=INTERACTIVE], \
[node-1-5.local=INTERACTIVE], \
[node-1-6.local=INTERACTIVE], \
[node-1-7.local=INTERACTIVE], \
[node-0-7.local=BATCH],[node-0-8.local=BATCH], \
[node-0-9.local=BATCH],[node-0-10.local=BATCH]
ckpt_list NONE
pe_list cores orte,[node-1-0.local=NONE], \
[node-1-1.local=NONE],[node-1-2.local=NONE], \
[node-1-3.local=NONE],[node-1-4.local=NONE], \
[node-1-5.local=NONE],[node-1-6.local=NONE], \
[node-1-7.local=NONE]
rerun FALSE
slots 1,[node-0-0.local=64],[node-0-1.local=64], \
[node-0-2.local=64],[node-0-3.local=64], \
[node-0-4.local=64],[node-1-0.local=8], \
[node-1-2.local=8],[node-1-3.local=8], \
[node-1-1.local=8],[node-0-5.local=64], \
[node-0-6.local=64],[node-1-4.local=8], \
[node-1-5.local=8],[node-1-6.local=8], \
[node-1-7.local=8],[node-0-7.local=64], \
[node-0-8.local=64],[node-0-9.local=64], \
[node-0-10.local=64]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY

Use qalter to alter SGE jobs

  • qalter -h u <JOB-ID> to put a user hold on the job
  • qalter -h U <JOB-ID> to release the held job (further qalter examples are sketched below)
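
qalter can also modify the requests of a job that is still pending, which helps when a job waits forever because of an impossible resource request; a short sketch (the values are placeholders, and -l replaces the job's previous hard resource list):

# lower the memory request of a pending job so it can be scheduled
qalter -l h_vmem=4G <JOB-ID>

# change the run-time limit and the slot count of a pending job
qalter -l h_rt=04:00:00 -pe cores 4 <JOB-ID>

# rename the job and redirect its STDOUT log
qalter -N new_name -o new_logfile.out <JOB-ID>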