Job Troubleshooting & FAQs

This page lists common questions and issues that users experience when using the HPC cluster.

Network issues#

HPC compute nodes do not have access to the Internet by default. However, we do have an environment module that you can load to enable Internet access:

`$ module load webproxy`

If you need Internet access for your jobs, don't forget to add the module load webproxy line to your submit scripts.

Using Firefox in Open OnDemand?

See our guide to making Firefox work in Open OnDemand.

Common reasons that jobs take longer to start#

When you submit a job to the HPC cluster, the Slurm scheduler assigns it a job priority number that determines how soon the system will attempt to start the job. Many factors affect the job priority, including the resources requested, how those resources are distributed on the cluster, and how many jobs you have submitted recently¹.

There are several reasons that your job may not start running as soon as you would expect. Typically, you can solve these issues by tuning your submission script parameters. In other cases, you will need to wait for other jobs to finish before yours will start.

Slurm Account/Queue congestion#

Each Slurm Account (e.g., genacc_q, backfill) uses a queuing system to manage access to compute resources on the cluster. A certain number of jobs can run concurrently; after that, jobs wait in the queue to start. If your job doesn't start immediately, it is placed in line with other jobs and put into pending status. How quickly your job starts depends on a number of factors, some of which are discussed above.

If you run squeue --me on the terminal, you may see something like this:

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           6248227 backfill2 my_job__   abc12a PD       0:00      1 (Priority)
           6248226 backfill2 my_job__   abc12a PD       0:00      1 (Priority)
           6248225 backfill2 my_job__   abc12a PD       0:00      1 (Priority)
           6248223 backfill2 my_job__   abc12a  R       0:03      1 hpc-i36-3
           6248224 backfill2 my_job__   abc12a  R       0:03      1 hpc-i36-3
           6248222 backfill2 my_job__   abc12a  R       0:16      1 hpc-i36-3

Notice the NODELIST(REASON) column. If your job is in the running state (R), this column shows the node(s) that your job is currently running on.

If your job is in the pending state (PD), it shows the reason that your job is waiting in-queue to start:

Common pending states and reasons#

Reason	Explanation
`(Priority)`	Your job is not high enough priority to start yet².
`(Resources)`	Your job is waiting for resources to become available, and will immediately start when they are.
`AssocMaxJobsLimit`	The configured maximum number of concurrent jobs for the Slurm Account/partition that you selected has been reached; your job will start after other jobs finish.
`AssocGrpCpuLimit`	The Slurm Account/partition that you submitted your job to has reached its purchased core limit. Your job will start after other jobs finish.
`ReqNodeNotAvail`	Some node(s) specifically requested are in `DOWN`, `DRAIN`, or non-responsive state³. If you did not request specific nodes for your job, we recommend cancelling and resubmitting your job(s).

For a full list of Slurm job state reason codes, refer to the Slurm documentation.
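To see at a glance why your pending jobs are waiting, you can tally the reasons in the last column. The following is a minimal sketch rather than an official tool: `count_pending_reasons` is a hypothetical helper name, and it assumes the default `squeue` column layout shown above, with no spaces in the job name.

```shell
# Hypothetical helper: tally pending (PD) jobs by reason.
# Reads default-format `squeue` output on stdin; assumes the job state is
# the fifth column and the reason is the parenthesized text ending the line.
count_pending_reasons() {
    awk '$5 == "PD" && match($0, /\(.*\)$/) {
             reason = substr($0, RSTART + 1, RLENGTH - 2)  # drop the parentheses
             count[reason]++
         }
         END { for (r in count) print count[r], r }' | sort -rn
}

# Demonstration using part of the sample output from this page
count_pending_reasons <<'EOF'
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           6248227 backfill2 my_job__   abc12a PD       0:00      1 (Priority)
           6248226 backfill2 my_job__   abc12a PD       0:00      1 (Priority)
           6248225 backfill2 my_job__   abc12a PD       0:00      1 (Priority)
           6248223 backfill2 my_job__   abc12a  R       0:03      1 hpc-i36-3
EOF
```

On the cluster, the same helper could be fed live data with `squeue --me | count_pending_reasons`.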

Requesting too many cores on a single node#

This usually happens when you specify both the number of tasks (-n/--ntasks) and the number of nodes (-N/--nodes) in your submit script.

Most nodes in the HPC cluster contain between 16 and 64 cores. If you request more cores than a single node has, the job will not fail immediately; instead, it will remain in pending (PD) status until it is cancelled.

Similarly, if you request a large number of cores on a single node, even if the job is able to run, it may take a very long time to allocate resources for it.
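Before asking for many cores on one node, it can help to compare the request against the largest node available. The sketch below assumes its input is one CPU count per line, such as the output of `sinfo -N -h -o "%c"`; the helper names and the sample counts are illustrative and are not taken from the actual cluster.

```shell
# Hypothetical helper: largest value from a list of per-node CPU counts
# (one count per line on stdin).
largest_node_cores() {
    sort -n | tail -n 1
}

# Hypothetical helper: succeeds only if a request for $1 cores can fit
# on a node that has $2 cores.
fits_on_single_node() {
    [ "$1" -le "$2" ]
}

# Illustrative counts only (the text above mentions nodes with 16 to 64 cores)
max=$(printf '%s\n' 16 32 64 | largest_node_cores)

for request in 48 128; do
    if fits_on_single_node "$request" "$max"; then
        echo "$request cores: can fit on one node (largest has $max)"
    else
        echo "$request cores: exceeds the largest node ($max), so the job would stay pending"
    fi
done
```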

Requesting a single core on too many nodes#

The more nodes your job requires, the harder time the Slurm scheduler will have finding resources to run it. For example, refer to the following code block:

`#SBATCH --ntasks-per-node=1 # One task per node...`
`#SBATCH --nodes=50 # ...times 50 nodes`

It is far more efficient to run your jobs on an arbitrary number of cores, and not specify the node parameter at all:

`#SBATCH -n 50 # 50 tasks distributed as available across the cluster`

Not optimally configuring memory parameters#

Typically, each processor bus in a node has memory (RAM) in multiples of 4GB. However, a small portion of this RAM is used for overhead processes, so a single job can never occupy all of it without crashing the node⁴. The Slurm scheduler has been tuned to take this into account.

If you need to explicitly specify how much memory your job needs using the --mem or --mem-per-cpu parameters, it is best to use multiples of 3.9GB. This ensures that your job doesn't request more memory than the nodes are able to allocate.
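As a worked example of this arithmetic, the sketch below converts a number of 3.9GB units into a whole number of megabytes that could be passed to `--mem` or `--mem-per-cpu`. It assumes 1GB = 1024MB and rounds down; `mem_for_units` is a hypothetical helper name.

```shell
# Hypothetical helper: memory in MB for $1 units of 3.9GB, rounded down.
# Integer arithmetic: 3.9GB = 39 * 1024 / 10 MB (assuming 1GB = 1024MB).
mem_for_units() {
    echo $(( $1 * 39 * 1024 / 10 ))
}

# Example directives: one unit per CPU, or four units for the whole job
echo "#SBATCH --mem-per-cpu=$(mem_for_units 1)M"
echo "#SBATCH --mem=$(mem_for_units 4)M"
```

With these assumptions, one unit works out to 3993M and four units to 15974M.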

Explicitly specifying the number of nodes using the `-N`/`--nodes` parameter for MPI jobs#

It is far more efficient to let the Slurm scheduler allocate cores for you across an arbitrary number of nodes than to wait for the specific number of nodes you specify to become available.

Some jobs have no choice but to tune the number of nodes due to software or algorithmic performance requirements, but for all other MPI jobs, it is best to omit this parameter in your submit scripts.

Jobs from owner Slurm Accounts/queues may be occupying a node#

Most compute nodes in the HPC cluster are shared between free, general access Slurm Accounts (genacc, condor, etc.) and owner-based Slurm accounts. Owner-based Slurm accounts belong to research groups that have purchased HPC resources. These groups get priority access to nodes in our cluster, and may cause jobs in general access accounts to be delayed.

Additionally, jobs submitted to the backfill2 Slurm Account may occasionally be cancelled due to preemption. This is because jobs in backfill2 run in free time slots on owner nodes⁵. When an owner job is submitted, the job running in the backfill2 Slurm account will be terminated with the status PREEMPTED. If you want to avoid preemption, you can submit your jobs to the backfill (not backfill2) Slurm account. Jobs submitted to this account may take a bit longer to start, but they are not subject to preemption.
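If you suspect that some of your backfill2 jobs were preempted, you can check their final states in the accounting records. The sketch below filters pipe-delimited `JobID|State` lines, such as those produced by `sacct -X -n -P -o JobID,State`; the helper name and the sample job records are made up for illustration.

```shell
# Hypothetical helper: print the IDs of jobs whose state is PREEMPTED.
# Expects "JobID|State" lines on stdin.
list_preempted() {
    awk -F'|' '$2 ~ /^PREEMPTED/ { print $1 }'
}

# Demonstration with made-up records
list_preempted <<'EOF'
6248225|COMPLETED
6248226|PREEMPTED
6248227|RUNNING
EOF
```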

Running too many small jobs#

Generally speaking, the HPC cluster is tuned to optimize start times for larger jobs, rather than a large number of smaller jobs. This varies depending on which Slurm account you submit your job(s) to and on a number of other factors.

In addition, we use a fairshare algorithm to determine job priority. Your priority score for a given job depends on the number of jobs you submit in a given time period. This helps to ensure that a single HPC user running thousands of jobs doesn't crowd out users who submit fewer jobs.

Why is my job using more cores than I requested?#

As of Slurm v20.11, the scheduler attempts to auto-provision resources even if you have requested a higher --mem-per-cpu value than the queue to which you are submitting your job allows⁶. Most queues have a limit of 4GB per CPU. There are some exceptions; to see details for all queues, refer to the list in our self-service portal.

If you exceed the --mem-per-cpu limit on the queue to which you submit a job, Slurm automatically updates the --cpus-per-task parameter to accommodate the increased memory requirements. This can result in the scheduler provisioning a greater number of CPUs than you requested.

The remedy is to explicitly request more CPUs (--ntasks) and decrease --mem-per-cpu so that it no longer exceeds the maximum for your queue.
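The adjustment can be illustrated with a little arithmetic. The sketch below assumes the scheduler raises the CPU count until the memory per CPU fits within the queue limit, which is a simplification of the behavior described above; `ceil_div` is a hypothetical helper, and all values are in MB.

```shell
# Hypothetical helper: integer ceiling of $1 / $2.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

requested_mb=10240   # e.g. a single task asking for --mem-per-cpu=10G
limit_mb=4096        # the 4GB-per-CPU queue limit mentioned above

# CPUs needed so that the memory per CPU stays within the limit
cpus=$(ceil_div "$requested_mb" "$limit_mb")
# Memory per CPU when the same total is spread across those CPUs
per_cpu=$(ceil_div "$requested_mb" "$cpus")

echo "CPUs provisioned for this request: $cpus"
echo "Equivalent explicit request: --ntasks=$cpus --mem-per-cpu=${per_cpu}M"
```

Under these assumptions the 10GB request ends up on 3 CPUs, and asking for 3 tasks at 3414M each covers the same memory without exceeding the limit.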

Avoiding Infinite Recursion in Slurm Submit Scripts#

When writing Slurm submit scripts, it's essential to avoid structures that can lead to infinite recursion. One common pitfall is adding loops around sbatch or srun statements directly within the submit script. This can cause your script to submit itself recursively, resulting in an overwhelming number of jobs and potential system overload.

Do not add loops that surround sbatch or srun commands within your Slurm submit script. An example of this is shown below:

×  Don't do this
#!/bin/bash
#SBATCH --job-name=infinite_loop_example
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

# DO NOT DO THIS!
# Each copy of this script submits 10 more copies of itself, without end
for i in {1..10}; do
    sbatch my_slurm_script.sh
done

In the above example, the Slurm submit script (my_slurm_script.sh) is attempting to submit itself multiple times using a loop. This will create a recursive submission that can quickly get out of control.

Instead, use job arrays per the example below:

✔  Do this instead
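#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --array=1-10

# Slurm runs this script once per array index (1 through 10), so do the
# work itself here rather than calling sbatch again.
# my_program is a placeholder for your own executable.
./my_program "$SLURM_ARRAY_TASK_ID"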
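Within an array job, Slurm sets the `SLURM_ARRAY_TASK_ID` environment variable to the index of each task, which lets every task pick its own input. The sketch below is illustrative only: `input_for_task` and the file names are hypothetical, and the task ID falls back to a fixed value so that the snippet can also run outside of Slurm.

```shell
# Hypothetical helper: print line number $1 of the list file $2.
input_for_task() {
    sed -n "${1}p" "$2"
}

# Build a made-up list of input files, one per line
list=$(mktemp)
printf '%s\n' sample_a.dat sample_b.dat sample_c.dat > "$list"

# Inside a real array job Slurm provides SLURM_ARRAY_TASK_ID;
# default to 2 here so the example also works outside a job.
task_id=${SLURM_ARRAY_TASK_ID:-2}
echo "Task $task_id would process: $(input_for_task "$task_id" "$list")"

rm -f "$list"
```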

Please refer to Submitting jobs to the HPC for specific details on job submission.
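As an extra safeguard against accidental recursion, `sbatch` can be wrapped in a guard that refuses to run from inside a job. This is a minimal sketch, not a site-provided tool: `safe_submit` is a hypothetical function name, and it relies on Slurm setting the `SLURM_JOB_ID` environment variable inside running jobs.

```shell
# Hypothetical guard: only call sbatch when we are NOT already inside a job.
safe_submit() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "Refusing to call sbatch from inside job ${SLURM_JOB_ID}" >&2
        return 1
    fi
    sbatch "$@"
}

# Simulate being inside a job (the ID is made up); the submission is
# blocked before sbatch is ever invoked.
( SLURM_JOB_ID=12345; safe_submit my_slurm_script.sh ) || echo "submission blocked"
```

From a login shell, where `SLURM_JOB_ID` is normally unset, `safe_submit my_slurm_script.sh` would pass its arguments straight through to `sbatch`.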

¹ Note: This algorithm is called fairshare, and it is used heavily on the free, general access resources and less on the owner resources.
² For a deep dive into how Slurm calculates priority for pending jobs, refer to this excellent guide by Uppsala University.
³ Because node names periodically change, we discourage requesting specific nodes (i.e., using the --nodelist parameter) for your jobs. Instead, we recommend using constraints.
⁴ For more details, see our resource planning guide.
⁵ Technically, backfill2 has access to ALL nodes in the cluster, not just owner ones.
⁶ In past versions, Slurm would reject the job submission.