Usage Guide

Typical Workflow

  1. Run a Jupyter notebook or an interactive job to develop and test your job script, using only a limited number of GPUs.
  2. Create a Job Request and submit your job(s). These jobs can then use more GPUs, CPUs and memory.
  3. Check your results and repeat as needed.

Store your scripts, data and logs on the shared project space.
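
The same job script can serve both the small development runs and the full job if it adapts to whatever resources it was given. The Python sketch below illustrates that idea; the CUDA_VISIBLE_DEVICES variable and the /project/demo paths are assumptions for illustration only, so adjust them to your own container environment and shared project space.

  import os

  # GPUs assigned to this job; this assumes the container exposes them via
  # CUDA_VISIBLE_DEVICES (check your own environment if it does not).
  gpu_ids = [g for g in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if g]
  print(f"Visible GPUs: {gpu_ids if gpu_ids else 'none'}")

  # Hypothetical locations on the shared project space; replace "demo" with
  # the name of your own project.
  DATA_DIR = "/project/demo/data"
  OUTPUT_DIR = "/project/demo/results"
  os.makedirs(OUTPUT_DIR, exist_ok=True)

  # ... load data from DATA_DIR, run your experiment on the visible GPUs,
  # and write results to OUTPUT_DIR so they outlive the job container.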

Do’s

  1. As GPULab is a shared system, it is in everyone’s best interest to use the GPU and CPU resources efficiently:
    • Make sure you do not have idle jobs running. Cancel jobs if you are not currently using the GPUs and CPUs.
    • Make sure you actually use the resources you requested. (For now, you need to check this manually inside your job; GPULab will collect these statistics itself later.)
    • Be economical with the disk space on the shared project space. (At the moment, the disk is too full to ignore disk space usage. This should change in the future.)
  2. For long-running jobs, we strongly advise you to use checkpoints so that your job can be resumed. (See the sketch after this list.)
  3. GPULab only stores the first few MB of your job’s output, so make sure you log to the shared storage if you need more extensive logging.
  4. Contact the GPULab admins if there are not enough resources, if you want to reserve a large amount of resources in advance, if you spot bugs, or if you have any questions.
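
The sketch below shows a minimal way to combine the two pieces of advice above: log to the shared project space and checkpoint regularly so a cancelled or crashed job can resume. The /project/demo paths, the pickle-based checkpoint format and the train_one_epoch placeholder are assumptions for illustration; frameworks such as PyTorch or TensorFlow provide their own checkpoint utilities that you would normally use instead.

  import logging
  import os
  import pickle

  # Hypothetical locations on the shared project space (replace "demo"
  # with your own project name).
  LOG_FILE = "/project/demo/logs/train.log"
  CHECKPOINT = "/project/demo/checkpoints/state.pkl"

  os.makedirs(os.path.dirname(LOG_FILE), exist_ok=True)
  os.makedirs(os.path.dirname(CHECKPOINT), exist_ok=True)

  # Log to the shared storage instead of relying on GPULab's size-limited
  # job output.
  logging.basicConfig(filename=LOG_FILE, level=logging.INFO,
                      format="%(asctime)s %(levelname)s %(message)s")

  def train_one_epoch(model):
      # Placeholder for your real training step (one pass over the data).
      return model

  # Resume from the last checkpoint if one exists, otherwise start fresh.
  if os.path.exists(CHECKPOINT):
      with open(CHECKPOINT, "rb") as f:
          state = pickle.load(f)
      logging.info("Resumed from epoch %d", state["epoch"])
  else:
      state = {"epoch": 0, "model": None}

  for epoch in range(state["epoch"], 100):
      state["model"] = train_one_epoch(state["model"])
      state["epoch"] = epoch + 1
      # Write atomically so a cancelled job never leaves a half-written
      # checkpoint behind.
      tmp = CHECKPOINT + ".tmp"
      with open(tmp, "wb") as f:
          pickle.dump(state, f)
      os.replace(tmp, CHECKPOINT)
      logging.info("Finished epoch %d", state["epoch"])

When this script is resubmitted after a cancellation, it picks up from the last completed epoch instead of starting over.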

Maximum Simultaneous Jobs

To ensure a fair distribution of the available resources on GPULab, there is a cap on the number of simultaneous jobs that can run system-wide:

  • 10 jobs per user
  • 20 jobs per project

Additionally, on some smaller clusters the number of concurrent jobs per user/project is reduced even further to ensure fair access for everyone.

These limits enable you and your project partners to submit many jobs, while not blocking all resources for other users.

These limits can be changed via a motivated request to the GPULab admins. They can be lowered (if you want fewer simultaneous jobs to run), or increased (in which case you’ll need to take care not to block other users).

CPU and GPU usage monitoring

It is useful to check whether your jobs are fully using the resources (CPU/GPU) you have requested. Otherwise, you are both blocking resources for others and waiting much longer for your jobs than needed.

There are multiple methods to check the resource usage of your jobs.

  1. First of all, the GPULab website has a “Usage Graphs” tab for each job. The CPU, GPU and memory usage are plotted over time there.

    Be warned that there is currently a bug that incorrectly sets all GPU statistics to 0 for some jobs. If your GPU usage is 0, you can check for this bug by looking at the “GPU Mem Limit” statistic. This is the memory available on the GPU, and it should be a fixed number of GB (depending on the GPU) at all times. If it is 0, that means all GPU statistics have been incorrectly set to 0. (We hope to fix this bug soon.)

  2. When a job ends, GPULab puts aggregated statistics in the “General Info” tab on the site. These statistics do not suffer from the same bug, but they are only available after the job has ended.

  3. You can also execute “nvidia-smi” manually in the job container to verify the current usage. To do this, log into the job using gpulab-cli ssh <job_id>, or, if it is a JupyterHub job, use the Jupyter notebook interface to start a new terminal. A polling sketch is shown below.
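
If you prefer to sample the usage from inside your own code, the Python sketch below polls nvidia-smi and the system load average once per minute. It assumes nvidia-smi is on the PATH inside your container and that the --query-gpu fields used here are supported by your driver version; the sampling interval is just an example.

  import os
  import subprocess
  import time

  def sample_usage():
      # Per-GPU summary: utilization %, used and total memory (one CSV line per GPU).
      gpu = subprocess.run(
          ["nvidia-smi",
           "--query-gpu=utilization.gpu,memory.used,memory.total",
           "--format=csv,noheader"],
          capture_output=True, text=True, check=True).stdout.strip()
      # 1-minute load average as a rough CPU-usage indicator (Linux only).
      load1, _, _ = os.getloadavg()
      return gpu, load1

  while True:
      gpu, load1 = sample_usage()
      print(f"load average (1 min): {load1:.1f}", flush=True)
      print(gpu, flush=True)
      time.sleep(60)  # sample once per minute

You can run this in a second terminal next to your job, or redirect its output to a file on the shared project space to keep a usage trace.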