Job Troubleshooting & FAQs
This page lists common questions and issues that users experience when using the HPC cluster.
HPC compute nodes do not have access to the Internet by default. However, we provide an environment module, webproxy, that you can load to enable Internet access. If your jobs need Internet access, don't forget to add the module load webproxy line to your submit script.
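As a minimal sketch, a submit script that needs Internet access might look like the following. Only the module load webproxy line comes from this page; the job name, time limit, and URL are placeholders chosen for illustration.

```shell
#!/bin/bash
# Hypothetical job name and time limit; adjust to your own job.
#SBATCH --job-name=webproxy-demo
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Enable outbound Internet access for this job (module named on this page).
module load webproxy

# Placeholder request: print the first response header line of an example URL.
curl --silent --head https://www.example.com | head -n 1
```

The module has to be loaded inside the submit script itself, before the first command that needs the network.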
Common reasons that jobs take longer to start
When you submit a job to the HPC cluster, the Slurm scheduler assigns it a job priority number that determines how soon the system will attempt to start the job. Many factors affect the job priority, including the resources requested, how those resources are distributed on the cluster, and how many jobs you have submitted recently [1].
There are several reasons that your job may not start running as soon as you would expect. Typically, you can solve these issues by tuning your submission script parameters. In other cases, you will need to wait for other jobs to finish before yours will start.
Slurm Account/Queue congestion
Each Slurm Account (e.g., backfill) uses a queuing system to manage access to compute resources on the cluster. A certain number of jobs can run concurrently; after that, jobs wait in the queue to start. If your job doesn't start immediately, it is placed in line with other jobs and put into pending status. How quickly your job starts depends on a number of factors, some of which are discussed above.
If you run squeue --me in the terminal, the output includes a NODELIST(REASON) column. If your job is in the running state (R), this column shows the node(s) that your job is currently running on. If your job is in the pending state (PD), it shows the reason that your job is waiting in the queue to start:
Common pending states and reasons
||Your job's priority is not yet high enough for it to start [2].|
||Your job is waiting for resources to become available, and will immediately start when they are.|
||The configured maximum concurrent jobs for the Slurm Account/partition that you selected are in use; your job will start after other jobs finish.|
||The Slurm Account/partition that you submitted your job to has reached its purchased core limit. Your job will start after other jobs finish.|
||Some node(s) specifically requested are in
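If you have many pending jobs, it can help to summarise the reasons instead of reading the full squeue listing. The sketch below is one way to do that; count_pending_reasons is a hypothetical helper name, and it assumes each input line has the form "<jobid> <reason>" with no spaces inside the reason.

```shell
#!/bin/bash
# Count pending jobs per reason.
# Expects one "<jobid> <reason>" pair per line on stdin.
count_pending_reasons() {
  awk '{ count[$2]++ } END { for (reason in count) print reason, count[reason] }' | sort
}

# On the cluster, feed it your pending jobs (%i = job ID, %r = reason):
#   squeue --me --noheader --states=PD --format="%i %r" | count_pending_reasons
```

Each output line is a reason followed by the number of your pending jobs that report it.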
Requesting too many cores on a single node
This usually happens when you specify both the number of tasks (--ntasks) and the number of nodes (--nodes) in your submit script.
Most nodes in the HPC cluster contain between 16 and 64 cores. If you request more cores than a single node has, the job will not fail immediately, but instead remain in pending (PD) status indefinitely until cancelled.
Similarly, if you request a large number of cores on a single node, even if the job is able to run, it may take a very long time to allocate resources for it.
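Before submitting, you can sanity-check a request against the node size. The following is a rough sketch, not a cluster tool: fits_per_node is a hypothetical helper, it assumes one core per task and an even spread of tasks over nodes, and the 64-core figure is simply the upper end of the range quoted above.

```shell
#!/bin/bash
# Succeeds (exit status 0) when the tasks, spread evenly over the requested
# nodes, need no more cores per node than a node provides.
# Assumes one core per task.
fits_per_node() {
  local ntasks=$1 nodes=$2 cores_per_node=$3
  # Ceiling division: the busiest node receives this many tasks.
  local per_node=$(( (ntasks + nodes - 1) / nodes ))
  [ "$per_node" -le "$cores_per_node" ]
}

# 128 tasks on one 64-core node can never be satisfied; two nodes can.
fits_per_node 128 1 64 || echo "128 tasks do not fit on 1 node with 64 cores"
fits_per_node 128 2 64 && echo "128 tasks fit on 2 nodes with 64 cores"

# To list the core count of each node on the cluster:
#   sinfo --Node --format="%N %c"
```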
Requesting a single core on too many nodes
The more nodes your job requires, the harder it is for the Slurm scheduler to find resources to run it. For example, a job that requests one core on each of many nodes cannot start until every one of those nodes has a free core at the same time.
It is far more efficient to run your jobs on an arbitrary number of cores and not specify the node parameter at all.
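The contrast can be sketched with two sets of submit script directives. The task count of 32 is an arbitrary illustration, not a recommendation; the first pair is disabled with a leading "##" so that only the second request takes effect.

```shell
#!/bin/bash
# Harder to schedule: one task on each of 32 distinct nodes.
# (The leading "##" keeps Slurm from reading these two lines.)
##SBATCH --nodes=32
##SBATCH --ntasks-per-node=1

# Easier to schedule: 32 tasks, placed on however many nodes have free cores.
#SBATCH --ntasks=32
```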
Not optimally configuring memory parameters
Typically, each processor bus in a node has memory (RAM) in multiples of 4GB. However, a small portion of this RAM is used for overhead processes, so a single job can never occupy all of it without crashing the node [4]. The Slurm scheduler has been tuned to take this into account.
If you need to explicitly specify how much memory your job needs using the --mem-per-cpu parameter, it is best to use multiples of 3.9GB. This ensures that your job doesn't request more memory than the nodes are able to allocate.
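As a small worked example of the 3.9GB rule, the sketch below builds a --mem-per-cpu flag from a whole multiple. mem_per_cpu_flag is a hypothetical helper, and it assumes that 3.9GB is written as 3900M because the scheduler expects a whole-number size; that is a slightly conservative rounding.

```shell
#!/bin/bash
# Print a --mem-per-cpu flag for a whole multiple of 3.9GB.
# Assumption: 3.9GB is expressed as 3900M (whole-number megabytes).
mem_per_cpu_flag() {
  local multiples=$1
  echo "--mem-per-cpu=$(( multiples * 3900 ))M"
}

mem_per_cpu_flag 2   # prints: --mem-per-cpu=7800M
```

The same value can be written directly in a submit script, for example as #SBATCH --mem-per-cpu=7800M.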
Explicitly specifying the number of nodes using the --nodes parameter for MPI jobs
It is far more efficient to let the Slurm scheduler allocate cores for you across an arbitrary number of nodes than to wait for the specific number of nodes you specify to become available.
Some jobs have no choice but to tune the number of nodes due to software or algorithmic performance requirements, but for all other MPI jobs, it is best to omit this parameter in your submit scripts.
Jobs from owner Slurm Accounts/queues may be occupying a node
Most compute nodes in the HPC cluster are shared between free, general access Slurm Accounts (genacc, condor, etc.) and owner-based Slurm Accounts. Owner-based Slurm Accounts belong to research groups that have purchased HPC resources. These groups get priority access to nodes in our cluster, and may cause jobs in general access accounts to be delayed.
Additionally, jobs submitted to the backfill2 Slurm Account may occasionally be cancelled due to preemption. This is because jobs in backfill2 run in free time slots on owner nodes [5]. When an owner job is submitted, the job running in the backfill2 Slurm Account will be terminated with the status PREEMPTED. If you want to avoid preemption, you can submit your jobs to the backfill (not backfill2) Slurm Account. Jobs submitted to this account may take a bit longer to start, but they are not subject to preemption.
Running too many small jobs
Generally speaking, the HPC cluster is tuned to optimize start times for larger jobs, rather than a large number of smaller jobs. This varies depending on which Slurm account you submit your job(s) to and on a number of other factors.
In addition, we use a fair share algorithm to determine job priority. Your priority score for a given job depends on the number of jobs you have submitted in a given time period. This helps to ensure that a single HPC user running thousands of jobs doesn't crowd out users who submit fewer jobs.
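To see how fair share is affecting your own jobs, Slurm's sprio and sshare commands can be combined as in the sketch below. show_priority_factors is a hypothetical wrapper name, and the exact columns printed depend on how the cluster is configured.

```shell
#!/bin/bash
# Show the priority factors of a user's pending jobs and that user's
# fair-share standing. Defaults to the current user.
show_priority_factors() {
  local user="${1:-$USER}"
  sprio -u "$user" -l    # per-job priority, broken down by factor
  sshare -u "$user"      # usage and fair-share values for the user
}

# On a login node:
#   show_priority_factors
```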
[1] Note: This algorithm is called fairshare, and it is used heavily on the free, general access resources and less on the owner resources.
[2] For a deep dive into how Slurm calculates priority for pending jobs, refer to this excellent guide by Uppsala University.
[3] Because node names periodically change, we discourage requesting specific nodes (i.e., using the --nodelist parameter) for your jobs. Instead, we recommend using constraints.
[4] For more details, see our resource planning guide.
[5] backfill2 has access to ALL nodes in the cluster, not just owner ones.