The Job Request

Note about Backward Compatibility

This page uses version 2 of the Job Definition format (current version).

Version 2 can be identified by the presence of the "request" field. Version 1 can be identified by the presence of the "jobDefinition" field.

Version 1 of the Job Definition is still supported by GPULab: GPULab will automatically convert version 1 to version 2 when needed.

You can also ask GPULab to convert a version 1 Job Definition to version 2 explicitly. This can be done both on the website (using “Create Job”) and with the CLI (using gpulab-cli convert).
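As an illustration of the identification rule above, here is a minimal Python sketch that tells the two versions apart by looking for the "request" or "jobDefinition" field. The function name job_definition_version is hypothetical and not part of GPULab; the sketch only inspects the top-level keys and does not convert anything.

```python
import json


def job_definition_version(doc: dict) -> int:
    """Return 2 if "request" is present, 1 if "jobDefinition" is present."""
    if "request" in doc:
        return 2
    if "jobDefinition" in doc:
        return 1
    raise ValueError("neither 'request' nor 'jobDefinition' found")


# Two tiny hand-written documents, used only for illustration.
v2_doc = json.loads('{"name": "HelloWorld", "request": {"resources": {"gpus": 1}}}')
v1_doc = json.loads('{"jobDefinition": {}}')

print(job_definition_version(v2_doc))
print(job_definition_version(v1_doc))
```

For an actual conversion, use the website or gpulab-cli convert as described above.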

This page gives detailed information on some of the fields available in the Job Request JSON.

You can find full examples in the tutorials section.

{
    "name": "HelloWorld",
    "description": "Hello world!",
    "request": {
        "resources": {
            "clusterId": 1,
            "gpus": 1,
            "cpus": 2,
            "cpuMemoryGb": 1
        },
        "docker": {
            "image": "debian:stable",
            "command": "echo 'Hello World!'",
            "environment": { },
            "storage": [ ],
            "portMappings": [ ]
        },
        "scheduling": {
            "interactive": true
        }
    }
}

Running your job on a specific cluster or slave

clusterId

Example

{
    "request": {
        "resources": {
            "clusterId": 1,
            ...
        }
    }
}

In request resources, you can optionally specify a clusterId. A “cluster” corresponds to one or more nodes used by GPULab to execute the jobs.

To request info about the available worker nodes and clusters, use the following command:

$ gpulab-cli clusters --short
+---------+---------+----------------------+---------+---------+---------+-----------------+
| Cluster | Version | Host                 | Workers | GPUs    | CPUs    | Memory (GB)     |
+---------+---------+----------------------+---------+---------+---------+-----------------+
| 1       | stable  | gpu2                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
| 1       | stable  | gpu1                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
+---------+---------+----------------------+---------+---------+---------+-----------------+

Omitting --short gives more detail, including the GPU model.

When you do not specify a clusterId, GPULab will schedule your job on any available worker node which has the requested resources available. You typically want to specify the gpuModel in this case.
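If you want to pick a clusterId from a script, the table printed by gpulab-cli clusters --short can be turned into a list of dictionaries. The sketch below is not part of GPULab: parse_clusters_table is a hypothetical helper, and it assumes the output keeps exactly the column layout shown above, which may change between versions.

```python
def parse_clusters_table(text: str) -> list[dict]:
    """Parse an ASCII table with '|' separated cells into a list of row dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Sample output copied from the command shown above.
SAMPLE = """\
+---------+---------+----------------------+---------+---------+---------+-----------------+
| Cluster | Version | Host                 | Workers | GPUs    | CPUs    | Memory (GB)     |
+---------+---------+----------------------+---------+---------+---------+-----------------+
| 1       | stable  | gpu2                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
| 1       | stable  | gpu1                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
+---------+---------+----------------------+---------+---------+---------+-----------------+
"""

rows = parse_clusters_table(SAMPLE)
hosts_in_cluster_1 = [row["Host"] for row in rows if row["Cluster"] == "1"]
print(hosts_in_cluster_1)
```

All values are kept as strings (for example "2/2"), since the document does not define how the fractions should be interpreted.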

slaveName

Example

{
    "request": {
        "resources": {
            "slaveName": "slave7A",
            ...
        }
    }
}

In request resources, you can optionally specify a slaveName. A “slave” corresponds to exactly one machine. You can find a list of the current GPULab slaves on the GPULab website under Live > Slaves.

You typically only need to bind a job to a specific slave if you want to retrieve files from the slave-specific /project_scratch storage.

Specifying the necessary resources (CPUs, GPUs)

Example

{
    "request": {
       "resources": {
           "gpus": 1,
           "cpus": 1,
           "cpuMemoryGb": 2,
           "gpuModel": [ "V100" ],
           "minCudaVersion": 10
       },
       ...
    }
}

The resources part of the job request contains some required and optional fields.

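Required fields:

  • "gpus": the number of GPUs requested.
  • "cpus": the number of CPUs requested.
  • "cpuMemoryGb": the amount of memory requested, in GB.

Jobs will not run if the requested number of GPUs, CPUs and amount of memory are not available. They will stay QUEUED until your request can be fulfilled.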

Optional fields:

  • "minCudaVersion": the minimum CUDA version installed on the GPULab slave machine, specified as an integer. For example:

    "resources": {
        "gpus": 2,
        "cpus": 1,
        "cpuMemoryGb": 4,
        "minCudaVersion": 10
    }
    

    Will match CUDA version 10.1.105 and 11.0.5, but not 9.1.85.

  • "gpuModel": (partial) name of the required GPU model. This is matched against the GPU models known to GPULab, which can be seen in the output of gpulab-cli clusters. A partial name is enough, so:

    "resources": {
        "cpus": 1,
        "gpus": 1,
        "cpuMemoryGb": 2,
        "gpuModel": [ "V100" ]
    }
    

    will match a GPU with model Tesla V100-PCIE-32GB.

    You can specify multiple filters, of which only one needs to match. This is useful if you want to allow your job to run on any of a number of specific GPUs, but not on all others. For example:

    "resources": {
        "cpus": 1,
        "gpus": 1,
        "cpuMemoryGb": 2,
        "gpuModel": [ "V100", "1080" ]
    }
    

    will match for both a Tesla V100 and a GeForce GTX 1080 GPU.

As minCudaVersion and gpuModel work in addition to clusterId, you typically use them when omitting clusterId, to pick any matching cluster.
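The matching rules for minCudaVersion and gpuModel described above can be sketched in a few lines of Python. This is only an illustration consistent with the examples given, not GPULab's actual implementation: the function names are hypothetical, the CUDA check assumes that only the major version number is compared, and the model check assumes a case-sensitive substring match.

```python
def cuda_version_ok(installed: str, min_cuda_version: int) -> bool:
    """Assumption: only the major version (before the first dot) is compared."""
    major = int(installed.split(".")[0])
    return major >= min_cuda_version


def gpu_model_ok(model: str, filters: list[str]) -> bool:
    """Assumption: a filter matches when it occurs anywhere in the model name.

    Only one of the filters needs to match.
    """
    return any(f in model for f in filters)


# The CUDA versions and GPU models used in the examples above.
for version in ("10.1.105", "11.0.5", "9.1.85"):
    print(version, cuda_version_ok(version, 10))

print(gpu_model_ok("Tesla V100-PCIE-32GB", ["V100"]))
print(gpu_model_ok("GeForce GTX 1080", ["V100", "1080"]))
```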

Specifying which Docker image and command must be run

image

Example

{
    "request": {
        "docker": {
            "image": "debian:stable",
            ...
        }
    }
}

You can specify the Docker image that needs to be executed here.

This image can be specified in one of 3 formats:

  1. For Docker Hub images: Use <image_name>:<tag>

    Example: ubuntu:bionic, debian:stable, nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 or osrf/ros:melodic-desktop-full-bionic

  2. For images in public docker registries: Use <reg_url>:<reg_port>/<image_name>:<tag>

    Example: gitlab.ilabt.imec.be:4567/ilabt/gpulab-examples/nvidia/sample:nvidia-smi

  3. For images in private docker registries: Use <username>:<password>@<reg_url>:<reg_port>/<image_name>:<tag>.

    Example: gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/ilabt/gpulab/sample:nvidia-smi

    Note: this 3rd format is not a standard docker format: it’s a GPULab extension. The 1st and 2nd format are default docker formats.

Warning

Never use your GitLab username/password combination directly. Always create a Deploy Token in your repository to allow GPULab to fetch your private image.
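To make the difference between the three formats concrete, the following Python sketch splits an image string into optional credentials and the plain docker image reference. It is a hypothetical helper written for this page, not the parser GPULab uses, and it assumes that neither the username nor the password contains "@", "/" or (for the username) ":".

```python
from typing import Optional, Tuple


def split_image_credentials(image: str) -> Tuple[Optional[str], Optional[str], str]:
    """Return (username, password, plain_image) for the formats listed above."""
    prefix, sep, rest = image.partition("@")
    # Credentials look like "user:password" and are followed by a registry
    # path containing "/". A digest such as "ubuntu@sha256:..." has no "/"
    # after the "@", so it is left untouched.
    if sep and ":" in prefix and "/" not in prefix and "/" in rest:
        username, _, password = prefix.partition(":")
        return username, password, rest
    return None, None, image


examples = [
    "debian:stable",
    "gitlab.ilabt.imec.be:4567/ilabt/gpulab-examples/nvidia/sample:nvidia-smi",
    "gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/ilabt/gpulab/sample:nvidia-smi",
]
for example in examples:
    username, _password, plain = split_image_credentials(example)
    print(plain, "(credentials)" if username else "(no credentials)")
```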

Where to store your custom Docker Images

We advise you to use a private docker registry.

Here are some free options:

  • If you work at IDLab, use the private repository that is available to you on the iLab.t GitLab for each repository.
  • If you only need 1 repository (that can house multiple images), you can use a free dockerhub account.
  • You can use the global GitLab to store docker images. It offers far more than just storage for your containers. For each repository, container storage is available.
  • Canister.io offers a limited number of free, private docker repositories.

If you use GitLab, you can find instructions below. Instructions are similar for the other platforms.

You can find all instructions on how to use the registry in the “Registry” section, accessible from the left toolbar in GitLab. If this section is missing for your project, you first need to enable it: in your GitLab project, go to Settings - General - Permissions and enable “Container registry”.

Typically, your project and repository are private on GitLab. To use images that you push to this private registry in GPULab, you’ll need to set up a read-only deploy token for the registry (in Settings - Repository - Deploy Token). Use the 3rd format described above to pass the image and the deploy token to GPULab, for example: gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/my-private-proj/sample:v1

If your GitLab project and repository are public, you do not need to specify a username and password when using the image in GPULab: gitlab.ilabt.imec.be:4567/my-public-proj/sample:v1

Pushing images to a private docker registry

You need to provide a username and password to docker when pushing your images. This is done using docker login.

Example of building an image and pushing it to this repository:

docker build -t gitlab.ilabt.imec.be:4567/myproj/sample:v1 .
docker login gitlab.ilabt.imec.be:4567
docker push gitlab.ilabt.imec.be:4567/myproj/sample:v1

Deprecated shared registry

GPULab used to have a shared docker registry (gpulab.ilabt.imec.be:5000). This registry is now read-only, and will be taken offline at a later date.

We are moving away from this for security reasons. We strongly advise you not to use it anymore. If you are using it, please move your images.

The reasons not to use this repository:

  • This is a shared repository: anyone can access the images stored in it (full read and write access!). So you should never store sensitive data inside images on this repository.
  • There are no backups for this repository. You are responsible for keeping your docker images backed up.

command

Example

{
    "request": {
        "docker": {
            "command": "bash -c 'nvidia-smi; for i in `seq 1 5`; do echo $i; sleep 1; done;'",
            ...
        }
    }
}

Example

{
    "request": {
        "docker": {
            "command": "/root/run-my-job.sh > /project_ghent/job-log-${GPULAB_JOB_ID}.log 2>&1",
            ...
        }
    }
}

Example

{
    "request": {
        "docker": {
            "command": [ "/project_ghent/experiment-executable", "--data", "/project_scratch/data/", "--log", "/project_scratch/logs/" ],
            ...
        }
    }
}

This command is passed to the docker container to run. When empty, the CMD specified in the Dockerfile used for building the specified docker image will run.

Note that docker does not like complex commands. To run these, call bash and pass the complex commands to it (as in the first example).

You can either specify the command in a single string, or you can specify it as an array of strings.
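The two ways of writing a command can be generated from the same data. The Python sketch below wraps a shell snippet in a bash -c call, as recommended above for complex commands, and prints it in both the array and the single-string form. The helper name bash_command is hypothetical. The sketch assumes that the array form is handed to the container as separate arguments (as with Docker's exec form); the document does not spell this out.

```python
import json
import shlex


def bash_command(script: str) -> list[str]:
    """Wrap a shell snippet so that bash interprets it inside the container."""
    return ["bash", "-c", script]


script = "nvidia-smi; for i in `seq 1 5`; do echo $i; sleep 1; done;"

as_array = bash_command(script)
# shlex.join quotes each element, giving the equivalent single-string form.
as_string = shlex.join(as_array)

print(json.dumps({"request": {"docker": {"command": as_array}}}, indent=4))
print(json.dumps({"request": {"docker": {"command": as_string}}}, indent=4))
```

Letting json.dumps produce the final text avoids mistakes when the command itself contains quotes or backslashes that must be escaped in JSON.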

Storage

Example

{
    "request": {
        "docker": {
           "storage": [ { "containerPath": "/project_ghent" } ],
           ...
        }
    }
}

storage allows you to specify which volumes must be attached to the Docker container. All files which are not saved within one of the attached volumes are ephemeral, and will thus disappear when the job stops.

The volumes which are attached to your job are specific to the project within which they are run, i.e. all jobs run within one project will see the same files in a specific volume.

The hostPath specifies which location on the host must be mounted, the containerPath specifies on which dir inside the container it must be mounted. Each storage entry normally requires both a containerPath and a hostPath. However, when specifying the root of a storage location (like /project_scratch), the hostPath can be omitted.

"storage": [
    {
       "containerPath": "/project_ghent"
    }
],

This will cause a directory /project_ghent to be bound inside your docker container.

You can mount /project_ghent to the “legacy” /project dir if needed:

"storage": [
    {
       "hostPath": "/project_ghent/",
       "containerPath": "/project/"
    }
],

You can also mount only subdirectories of the /project_ghent dir this way:

"storage": [
    {
       "hostPath": "/project_ghent/mycode/",
       "containerPath": "/work/code/"
    }
],

To learn which storages are available on GPULab, please refer to the Storage page.
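The three storage examples above differ only in whether a hostPath is given. A small Python sketch of a helper that builds such entries, leaving out hostPath when it is not supplied, could look as follows. The name storage_entry is hypothetical and the helper does not check whether a path is really the root of a storage location.

```python
import json
from typing import Optional


def storage_entry(container_path: str, host_path: Optional[str] = None) -> dict:
    """Build one "storage" entry; hostPath is omitted when not given."""
    entry = {"containerPath": container_path}
    if host_path is not None:
        entry["hostPath"] = host_path
    return entry


storage = [
    storage_entry("/project_ghent"),                         # root of a storage location
    storage_entry("/work/code/", "/project_ghent/mycode/"),  # only a subdirectory
]
print(json.dumps({"storage": storage}, indent=4))
```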

.ssh

If you need access to the ~/.ssh dir used by the SSH server that gives you access to the container (for example, to manually change the authorized_keys file), you need to mount it like this:

"storage": [
    {
       "hostPath": ".ssh",
       "containerPath": "/root/.ssh/"
    }
],

Opening TCP ports with portMappings

Example

{
    "request": {
        "docker": {
           "portMappings": [ { "containerPort": 80 } ],
           ...
        }
    }
}

Sometimes, you want to access network services that run on the docker container. For example, a webserver showing status info might be running. To access web services, the ports need to be “exposed” by docker. You need to specify this in the job definition.

You can specify zero, one or more port mappings. An example:

"portMappings": [ { "containerPort": 80 }, { "containerPort": 21 } ]

This will map ports 80 and 21 of the container to ports on the host machine. The output of the gpulab-cli jobs <job id> command will show which ports.
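When port mappings are generated by a script, a small helper can build the entries and catch obviously invalid port numbers early. The Python sketch below is an illustration only: port_mapping is a hypothetical name, the range check (1 to 65535) is a general TCP constraint rather than something GPULab documents, and the optional host_port argument corresponds to the hostPort field discussed just below.

```python
import json
from typing import Optional


def port_mapping(container_port: int, host_port: Optional[int] = None) -> dict:
    """Build one "portMappings" entry, validating the TCP port range."""
    for port in (container_port, host_port):
        if port is not None and not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
    mapping = {"containerPort": container_port}
    if host_port is not None:
        mapping["hostPort"] = host_port
    return mapping


mappings = [port_mapping(80), port_mapping(21)]
print(json.dumps({"portMappings": mappings}))
```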

You can also choose the port of the host machine, but this might cause the job to fail if the port is already in use:

"portMappings": [ { "containerPort": 80, "hostPort" : 8080 } ]

On connectivity

The GPULab slaves in iGent have no public IPv4 addresses. To reach the exposed ports, use one of these options:

  • If you have IPv6 connectivity, you will automatically access them using IPv6.
  • If your IDLab iGent VPN is active, you will automatically access them that way.
  • Otherwise, use the IDLab Bastion Proxy

The Antwerp DGX-2 (cluster 7) is situated in a different datacenter than the other GPULab slaves. IPv6 is not available there. To reach the exposed ports, use one of these options:

  • If your IDLab Antwerpen VPN is active, you will automatically access them that way.
  • Otherwise, use the IDLab Bastion Proxy

Environment variables

Example

{
    "request": {
        "docker": {
           "environment" : {
               "DEMO_A": 1,
               "DEMO_B": "two"
           },
           "projectGidVariableName": "DEMO_PROJECT_GID",
           ...
        }
    }
}

The environment variables inside the container can be set from the job request.

The "environment" field expects a map of key/value pairs that will be added to the container environment.

Additionally, "projectGidVariableName" can be used to specify the name of the environment variable that will be set to the project GID (the unix group ID used on the NFS shared storage for the job's project).

GPULab also automatically sets a lot of environment variables, which can be used to find info about the running job: GPULAB_CLUSTER_ID, GPULAB_CONTAINER_NAME, GPULAB_CPUS_RESERVED, GPULAB_DEPLOYMENT_ENVIRONMENT, GPULAB_DOCKER_IMAGE, GPULAB_GPUS_RESERVED, GPULAB_JOB_ID, GPULAB_MEM_RESERVED_MB, GPULAB_MEM_RESERVED_PROCESSES_MB, GPULAB_MEM_RESERVED_TMPFS_MB, GPULAB_PROJECT_NAME, GPULAB_PROJECT_URN, GPULAB_RESTART_COUNT, GPULAB_RESTART_INITIAL_JOB_ID, GPULAB_SLAVE_DNSNAME, GPULAB_SLAVE_HOSTNAME, GPULAB_SLAVE_INSTANCE_ID, GPULAB_SLAVE_PID, GPULAB_USERURN_AUTH, GPULAB_USERURN_NAME, GPULAB_USER_EMAIL, GPULAB_USER_MINI_ID, GPULAB_USER_URN and GPULAB_WORKER_ID.

Here is an example of the environment variables for the job request example at the start of this section:

GPULAB_CLUSTER_ID="7"
GPULAB_CONTAINER_NAME="twalcari-ilabt_d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_CPUS_RESERVED="20"
GPULAB_DEPLOYMENT_ENVIRONMENT="production"
GPULAB_DOCKER_IMAGE="sha256:cf0be059e923d77c308dd7ad33328de2677be7d43b796aaaf15f4daf8811b37b"
GPULAB_GPUS_RESERVED=""
GPULAB_JOB_ID="d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_MEM_RESERVED_MB="4096"
GPULAB_MEM_RESERVED_PROCESSES_MB="4096"
GPULAB_MEM_RESERVED_TMPFS_MB="0"
GPULAB_PROJECT_NAME="ilabt-dev"
GPULAB_PROJECT_URN="urn:publicid:IDN+ilabt.imec.be+project+ilabt-dev"
GPULAB_RESTART_COUNT="0"
GPULAB_RESTART_INITIAL_JOB_ID="d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_SLAVE_DNSNAME="dgx2.idlab.uantwerpen.be"
GPULAB_SLAVE_HOSTNAME="slave7A"
GPULAB_SLAVE_INSTANCE_ID="inst-30"
GPULAB_SLAVE_PID="91788"
GPULAB_USERURN_AUTH="ilabt.imec.be"
GPULAB_USERURN_NAME="twalcari"
GPULAB_USER_EMAIL="Thijs.Walcarius@UGent.be"
GPULAB_USER_MINI_ID="twalcari@ilabt"
GPULAB_USER_URN="urn:publicid:IDN+ilabt.imec.be+user+twalcari"
GPULAB_WORKER_ID="28"
DEMO_A=1
DEMO_B=two
DEMO_PROJECT_GID=6978
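Inside a running job, these variables can be read like any other environment variable. The following Python sketch collects everything that starts with GPULAB_ and falls back to a placeholder when run outside GPULab. The helper name gpulab_environment and the placeholder text are hypothetical.

```python
import os
from typing import Mapping


def gpulab_environment(environ: Mapping[str, str] = os.environ) -> dict:
    """Return only the variables whose name starts with GPULAB_."""
    return {key: value for key, value in environ.items() if key.startswith("GPULAB_")}


info = gpulab_environment()
# Outside a GPULab job these variables are normally absent, so use a default.
job_id = info.get("GPULAB_JOB_ID", "<not running in GPULab>")
print(f"job id: {job_id}")
for key in sorted(info):
    print(f"{key}={info[key]}")
```

Note that all values arrive as strings, so numeric values such as GPULAB_CPUS_RESERVED need an explicit int() conversion before use.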