The Job Request¶
Note about Backward Compatibility
This page uses version 2 of the Job Definition format (current version).
Version 2 can be identified by the presence of the "request" field.
Version 1 can be identified by the presence of the "jobDefinition" field.
Version 1 of the Job Definition is still supported by GPULab: GPULab will automatically convert version 1 to version 2 when needed.
You can also explicitly ask GPULab to convert a version 1 Job Definition. This can be done both on the website (using “Create Job”) and with the CLI (using gpulab-cli convert).
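For reference, here is a minimal sketch of both top-level shapes (all contents abbreviated). A version 1 definition:

```json
{ "jobDefinition": { } }
```

versus a version 2 definition:

```json
{ "name": "", "request": { } }
```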
This page gives detailed information on some of the fields available in the Job Request JSON.
You can find full examples in the tutorials section.
{
"name": "HelloWorld",
"description": "Hello world!",
"request": {
"resources": {
"clusterId": 1,
"gpus": 1,
"cpus": 2,
"cpuMemoryGb": 1
},
"docker": {
"image": "debian:stable",
"command": "echo 'Hello World!'",
"environment": { },
"storage": [ ],
"portMappings": [ ]
},
"scheduling": {
"interactive": true
}
}
}
Running your job on a specific cluster or slave¶
clusterId¶
Example
{
"request": {
"resources": {
"clusterId": 1,
...
}
}
}
In request resources, you can optionally specify a clusterId.
A “cluster” corresponds to one or more nodes used by GPULab to execute the jobs.
To request info about the available worker nodes and clusters, use the following command:
$ gpulab-cli clusters --short
+---------+---------+----------------------+---------+---------+---------+-----------------+
| Cluster | Version | Host | Workers | GPUs | CPUs | Memory (GB) |
+---------+---------+----------------------+---------+---------+---------+-----------------+
| 1 | stable | gpu2 | 2/2 | 2/2 | 16/16 | 31.18/31.18 |
| 1 | stable | gpu1 | 2/2 | 2/2 | 16/16 | 31.18/31.18 |
+---------+---------+----------------------+---------+---------+---------+-----------------+
Omitting --short results in more info, including the GPU model etc.
When you do not specify a clusterId, GPULab will schedule your job on any available worker node that has the
requested resources. You typically want to specify the gpuModel in this case.
slaveName¶
Example
{
"request": {
"resources": {
"slaveName": "slave7A",
...
}
}
}
In request resources, you can optionally specify a slaveName.
A “slave” corresponds to exactly one machine. You can find a list of the current GPULab slaves on the GPULab website
under Live > Slaves.
You typically only need to bind a job to a specific slave if you want to retrieve files from the slave-specific /project_scratch storage.
Specifying the necessary resources (CPUs, GPUs)¶
Example
{
"request": {
"resources": {
"gpus": 1,
"cpus": 1,
"cpuMemoryGb": 2,
"gpuModel": [ "V100" ],
"minCudaVersion": 10
},
...
}
}
The resources-part of the job request contains some required and optional fields.
Required fields:
"gpus": allows you to specify the amount of GPU’s needed (0 or more)"cpus": allows you to specify the amount of logical CPUs needed. (1 or more) [What are logical CPU’s and what is HyperThreading?]"cpuMemoryGb": amount of system memory (= “CPU memory”) needed, specified in GB
Jobs will not run if the requested amount of GPUs, CPUs and memory is not available. They will stay QUEUED until your request can be fulfilled.
Optional fields:
"minCudaVersion": the minimum CUDA version installed on the GPULab slave machine, specified as an integer. For example:"resources": { "gpus": 2, "cpus": 1, "cpuMemoryGb": 4, "minCudaVersion": 10 }
Will match CUDA version
10.1.105and11.0.5, but not9.1.85.
"gpuModel": (partial) name of the required GPU model type. This is matched against the GPULab models, which can be seen in the output ofgpulab-cli clusters. Partial matches also match, so:"resources": { "cpus": 1, "gpus": 1, "cpuMemoryGb": 2, "gpuModel": [ "V100" ] }
will match a GPU with model
Tesla V100-PCIE-32GB.You can specify multiple filters, of which only a single is required to match. This is usefull if you want to allow your job to run on any of a number of spcific GPUs, but not on all others. For example:
"resources": { "cpus": 1, "gpus": 1, "cpuMemoryGb": 2, "gpuModel": [ "V100", "1080" ] }
will match for both a
Tesla V100and aGeForce GTX 1080GPU.
As minCudaVersion and gpuModel work in addition to clusterId, you typicaly use them when omiting clusterId, to pick any matching cluster.
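Putting the optional fields together: a resources section that omits clusterId and instead filters on both GPU model and CUDA version (the concrete values here are only illustrative) could look like this:

```json
"resources": {
    "gpus": 1,
    "cpus": 2,
    "cpuMemoryGb": 4,
    "gpuModel": [ "V100", "1080" ],
    "minCudaVersion": 10
}
```

GPULab will then schedule the job on any worker that offers a matching GPU and at least CUDA version 10.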
Specifying which Docker image and command must be run¶
image¶
Example
{
"request": {
"docker": {
"image": "debian:stable",
...
}
}
}
You can specify the Docker image that needs to be executed here.
This image must:
- be of the “amd64” architecture (see note below)
- be hosted on a public or private docker registry that GPULab can access (see below for more info)
- be CUDA-capable if you request GPUs (see GPU Software Compatibility for more info on CUDA compatibility)
This image can be specified in one of 3 formats:
- For Docker Hub images, use <image_name>:<tag>. Examples: ubuntu:bionic, debian:stable, nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 or osrf/ros:melodic-desktop-full-bionic
- For images in public docker registries, use <reg_url>:<reg_port>/<image_name>:<tag>. Example: gitlab.ilabt.imec.be:4567/ilabt/gpulab-examples/nvidia/sample:nvidia-smi
- For images in private docker registries, use <username>:<password>@<reg_url>:<reg_port>/<image_name>:<tag>. Example: gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/ilabt/gpulab/sample:nvidia-smi
Note: this 3rd format is not a standard docker format: it’s a GPULab extension. The 1st and 2nd formats are standard docker formats.
Warning
Never use your GitLab username/password combination directly; instead, always create a Deploy Token in your repository to allow GPULab to fetch your private image.
Docker image compatibility
As GPULab runs on x64 machines, only Docker images for the “amd64” architecture are supported.
When building images on an ARM64 machine (like Macbooks based on the Apple M1/M2 chipset), make sure to build images that (also) support “amd64”.
This can be achieved by using the --platform parameter of Docker Buildx to build multi-architecture images.
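As a sketch, building an amd64 image on an ARM64 machine and pushing it straight to a registry could look like this (the image name and tag are placeholders):

```shell
# Build for linux/amd64 (the architecture GPULab requires), even on an ARM64 host,
# and push the result directly to the registry.
docker buildx build --platform linux/amd64 \
  -t gitlab.ilabt.imec.be:4567/myproj/sample:v1 \
  --push .
```

Use --platform linux/amd64,linux/arm64 instead to build a multi-architecture image that also runs natively on your own machine.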
A full tutorial on how to build and use your own Docker images for GPULab can be found here:
Where to store your custom Docker Images¶
We advise you to use a private docker registry.
Here are some free options:
- If you work at IDLab, use the private registry that is available to you on the iLab.t GitLab for each repository.
- You can use the global GitLab or GitHub to store docker images. These platforms offer far more than just storage for your containers: for each repository, container storage is available. More information: GitHub: Working with the Container registry and GitLab: Container Registry.
- Canister.io offers a limited number of free, private docker repositories.
- We advise against using DockerHub to store your images: DockerHub has aggressive rate limits on pulling images, which we work around with aggressive caching. This might cause us to run outdated images from time to time.
If you use GitLab, you can find instructions below. Instructions are similar for the other platforms.
You can find all instructions on how to use this in the “Registry” section accessible from the left toolbar in GitLab. If this is missing for your project, you first need to enable it: in your GitLab project, go to Settings - General - Permissions and enable “Container registry”.
Typically, your project and repository are private on GitLab. To use images that you push to this private registry in
GPULab, you’ll need to set up a read-only deploy token for the registry (in Settings - Repository - Deploy Token).
Use the 3rd format described above to pass the image and the deploy key to GPULab, for example:
gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/my-private-proj/sample:v1
If your Gitlab project and repository are public, you do not need to specify username and password when using the image
in GPULab: gitlab.ilabt.imec.be:4567/my-public-proj/sample:v1
Pushing images to a private docker registry
You need to specify a username and password to docker when pushing your images. This is done using docker login.
Example of building an image and pushing it to this repository:
docker build -t gitlab.ilabt.imec.be:4567/myproj/sample:v1 .
docker login gitlab.ilabt.imec.be:4567
docker push gitlab.ilabt.imec.be:4567/myproj/sample:v1
command¶
Example
{
"request": {
"docker" {
"command": "bash -c 'nvidia-smi; for i in `seq 1 5`; do echo $i; sleep 1; done;'",
...
}
}
}
Example
{
"request": {
"docker" {
"command": "/root/run-my-job.sh > /project_ghent/job-log-${GPULAB_JOB_ID}.log 2>&1",
...
}
}
}
Example
{
"request": {
"docker" {
"command": [ "/project_ghent/experiment-executable", "--data", "/project_scratch/data/", "--log", "/project_scratch/logs/" ],
...
}
}
}
This command is passed to the docker container to run. When empty, the CMD specified in the Dockerfile used for
building the specified docker image will run.
Note that Docker cannot run complex shell constructs (pipes, loops, variable expansion) directly. To run these, call bash and pass the complex command to it via -c (as in the first example).
You can either specify the command in a single string, or you can specify it as an array of strings.
Storage¶
Example
{
"request": {
"docker" {
"storage": [ { "containerPath": "/project_ghent" } ],
...
}
}
}
storage allows you to specify which volumes must be attached to the Docker container. All files which are not saved
within one of the attached volumes are ephemeral, and will thus disappear when the job stops.
The volumes which are attached to your job are specific to the project within which it is run, i.e. all jobs run within one project will see the same files in a specific volume.
The hostPath specifies which location on the host must be mounted, the containerPath specifies on which dir inside the container it must be mounted.
Each storage entry normally requires both a containerPath and a hostPath. However, when specifying the root of
a storage location (like /project_scratch), the hostPath can be omitted.
"storage": [
{
"containerPath": "/project_ghent"
}
],
This will cause a directory /project_ghent to be bound inside your docker container.
You can mount /project_ghent to the “legacy” /project dir if needed:
"storage": [
{
"hostPath": "/project_ghent/",
"containerPath": "/project/"
}
],
You can also mount only subdirectories of the /project_ghent dir this way:
"storage": [
{
"hostPath": "/project_ghent/mycode/",
"containerPath": "/work/code/"
}
],
To learn which storage locations are available on GPULab, please refer to the Storage page.
.ssh¶
If you need access to the ~/.ssh dir used by the SSH server that gives you access to the container
(for example, to manually change the authorized_keys file), you need to mount it like this:
"storage": [
{
"hostPath": ".ssh",
"containerPath": "/root/.ssh/"
}
],
Opening TCP ports with portMappings¶
Example
{
"request": {
"docker" {
"portMappings": [ { "containerPort": 80 } ]
...
}
}
}
Sometimes, you want to access network services that run on the docker container. For example, a webserver showing status info might be running. To access web services, the ports need to be “exposed” by docker. You need to specify this in the job definition.
You can specify zero, one or more port mappings. An example:
"portMappings": [ { "containerPort": 80 }, { "containerPort": 21 } ]
This will map port 80 to a port on the host machine. The output of the
jobs <job id> command will show which port.
You can also choose the port of the host machine, but this might cause the job to fail if the port is already in use:
"portMappings": [ { "containerPort": 80, "hostPort" : 8080 } ]
On connectivity
The GPULab-slaves in iGent have no public IPv4 addresses. To access the exposed ports you need to access them in one of these ways:
- If you have IPv6 connectivity, you will automatically access them using IPv6.
- If your IDLab iGent VPN is active, you will automatically access them that way.
- Otherwise, use the IDLab Bastion Proxy
The Antwerp DGX-2 (cluster 7) is situated in a different datacenter than the other GPULab slaves. IPv6 is not available. To access the exposed ports you need to access them in one of these ways:
- If your IDLab Antwerpen VPN is active, you will automatically access them that way.
- Otherwise, use the IDLab Bastion Proxy
Environment variables¶
Example
{
"request": {
"docker" {
"environment" : {
"DEMO_A": 1,
"DEMO_B": "two"
},
"projectGidVariableName": "DEMO_PROJECT_GID",
...
}
}
}
The environment variables inside the container can be set from the job request.
The "environment" expects a map of key/value pairs that will be added.
Additionally, "projectGidVariableName" can be used to specify the name of the environment variable that will be set to the project GID (the unix group ID used on the NFS shared storage for the job’s project).
GPULab also automatically sets a lot of environment variables, which can be used to find info about the running job:
GPULAB_CLUSTER_ID, GPULAB_CONTAINER_NAME, GPULAB_CPUS_RESERVED, GPULAB_DEPLOYMENT_ENVIRONMENT, GPULAB_DOCKER_IMAGE,
GPULAB_GPUS_RESERVED, GPULAB_JOB_ID, GPULAB_MEM_RESERVED_MB, GPULAB_MEM_RESERVED_PROCESSES_MB, GPULAB_MEM_RESERVED_TMPFS_MB,
GPULAB_PROJECT_NAME, GPULAB_PROJECT_URN, GPULAB_RESTART_COUNT, GPULAB_RESTART_INITIAL_JOB_ID, GPULAB_SLAVE_DNSNAME,
GPULAB_SLAVE_HOSTNAME, GPULAB_SLAVE_INSTANCE_ID, GPULAB_SLAVE_PID, GPULAB_USERURN_AUTH, GPULAB_USERURN_NAME,
GPULAB_USER_EMAIL, GPULAB_USER_MINI_ID, GPULAB_USER_URN and GPULAB_WORKER_ID.
Here is an example of the environment variables for the job request example at the start of this section:
GPULAB_CLUSTER_ID="7"
GPULAB_CONTAINER_NAME="twalcari-ilabt_d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_CPUS_RESERVED="20"
GPULAB_DEPLOYMENT_ENVIRONMENT="production"
GPULAB_DOCKER_IMAGE="sha256:cf0be059e923d77c308dd7ad33328de2677be7d43b796aaaf15f4daf8811b37b"
GPULAB_GPUS_RESERVED=""
GPULAB_JOB_ID="d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_MEM_RESERVED_MB="4096"
GPULAB_MEM_RESERVED_PROCESSES_MB="4096"
GPULAB_MEM_RESERVED_TMPFS_MB="0"
GPULAB_PROJECT_NAME="ilabt-dev"
GPULAB_PROJECT_URN="urn:publicid:IDN+ilabt.imec.be+project+ilabt-dev"
GPULAB_RESTART_COUNT="0"
GPULAB_RESTART_INITIAL_JOB_ID="d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_SLAVE_DNSNAME="dgx2.idlab.uantwerpen.be"
GPULAB_SLAVE_HOSTNAME="slave7A"
GPULAB_SLAVE_INSTANCE_ID="inst-30"
GPULAB_SLAVE_PID="91788"
GPULAB_USERURN_AUTH="ilabt.imec.be"
GPULAB_USERURN_NAME="twalcari"
GPULAB_USER_EMAIL="Thijs.Walcarius@UGent.be"
GPULAB_USER_MINI_ID="twalcari@ilabt"
GPULAB_USER_URN="urn:publicid:IDN+ilabt.imec.be+user+twalcari"
GPULAB_WORKER_ID="28"
DEMO_A=1
DEMO_B=two
DEMO_PROJECT_GID=6978