The Job Request¶
Note about Backward Compatibility
This page uses version 2 of the Job Definition format (current version).
Version 2 can be identified by the presence of the "request" field.
Version 1 can be identified by the presence of the "jobDefinition" field.
Version 1 of the Job Definition is still supported by GPULab: GPULab will automatically convert version 1 to version 2 when needed.
You can also make GPULab convert a version 1 Job Definition yourself. This can be done both on the website (using “Create Job”) and with the CLI (using gpulab-cli convert).
This page gives detailed information on some of the fields available in the Job Request JSON.
You can find full examples in the tutorials section.
{ "name": "HelloWorld", "description": "Hello world!", "request": { "resources": { "clusterId": 1, "gpus": 1, "cpus": 2, "cpuMemoryGb": 1 }, "docker": { "image": "debian:stable", "command": "echo 'Hello World!'", "environment": { }, "storage": [ ], "portMappings": [ ] }, "scheduling": { "interactive": true } } }
Running your job on a specific cluster or slave¶
clusterId¶
Example
{
"request": {
"resources": {
"clusterId": 1,
...
}
}
}
In the resources section of the request, you can optionally specify a clusterId.
A “cluster” corresponds to one or more nodes used by GPULab to execute the jobs.
To request info about the available worker nodes and clusters, use the following command:
$ gpulab-cli clusters --short
+---------+---------+----------------------+---------+---------+---------+-----------------+
| Cluster | Version | Host | Workers | GPUs | CPUs | Memory (GB) |
+---------+---------+----------------------+---------+---------+---------+-----------------+
| 1 | stable | gpu2 | 2/2 | 2/2 | 16/16 | 31.18/31.18 |
| 1 | stable | gpu1 | 2/2 | 2/2 | 16/16 | 31.18/31.18 |
+---------+---------+----------------------+---------+---------+---------+-----------------+
Omitting --short results in more info, including the GPU model, etc.
When you do not specify a clusterId, GPULab will schedule your job on any available worker node that has the requested resources. You typically want to specify the gpuModel in this case.
slaveName¶
Example
{
"request": {
"resources": {
"slaveName": "slave7A",
...
}
}
}
In the resources section of the request, you can optionally specify a slaveName.
A “slave” corresponds to exactly one machine. You can find a list of the current GPULab slaves on the GPULab website under Live > Slaves.
You typically only need to bind a job to a specific slave if you want to retrieve files from the slave-specific /project_scratch storage.
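For example, a job request could combine slaveName with a mount of that slave's /project_scratch storage. A minimal sketch (the slave name is illustrative):
{
    "request": {
        "resources": {
            "slaveName": "slave7A",
            ...
        },
        "docker": {
            "storage": [ { "containerPath": "/project_scratch" } ],
            ...
        }
    }
}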
Specifying the necessary resources (CPUs, GPUs)¶
Example
{
"request": {
"resources": {
"gpus": 1,
"cpus": 1,
"cpuMemoryGb": 2,
"gpuModel": [ "V100" ],
"minCudaVersion": 10
},
...
}
}
The resources part of the job request contains both required and optional fields.
Required fields:
- "gpus": the number of GPUs needed (0 or more)
- "cpus": the number of logical CPUs needed (1 or more) [What are logical CPUs and what is HyperThreading?]
- "cpuMemoryGb": the amount of system memory (= “CPU memory”) needed, specified in GB
Jobs will not run if the requested GPUs, CPUs and memory are not available; they will stay QUEUED until your request can be fulfilled.
Optional fields:
"minCudaVersion"
: the minimum CUDA version installed on the GPULab slave machine, specified as an integer. For example:"resources": { "gpus": 2, "cpus": 1, "cpuMemoryGb": 4, "minCudaVersion": 10 }
Will match CUDA version
10.1.105
and11.0.5
, but not9.1.85
.
"gpuModel"
: (partial) name of the required GPU model type. This is matched against the GPULab models, which can be seen in the output ofgpulab-cli clusters
. Partial matches also match, so:"resources": { "cpus": 1, "gpus": 1, "cpuMemoryGb": 2, "gpuModel": [ "V100" ] }
will match a GPU with model
Tesla V100-PCIE-32GB
.You can specify multiple filters, of which only a single is required to match. This is usefull if you want to allow your job to run on any of a number of spcific GPUs, but not on all others. For example:
"resources": { "cpus": 1, "gpus": 1, "cpuMemoryGb": 2, "gpuModel": [ "V100", "1080" ] }
will match for both a
Tesla V100
and aGeForce GTX 1080
GPU.
As minCudaVersion and gpuModel work in addition to clusterId, you typically use them when omitting clusterId, to let GPULab pick any matching cluster.
Specifying which Docker image and command must be run¶
image¶
Example
{
"request": {
"docker": {
"image": "debian:stable",
...
}
}
}
You can specify the Docker image that needs to be executed here.
This image can be specified in one of three formats:
- For Docker Hub images, use <image_name>:<tag>. Examples: ubuntu:bionic, debian:stable, nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 or osrf/ros:melodic-desktop-full-bionic
- For images in public Docker registries, use <reg_url>:<reg_port>/<image_name>:<tag>. Example: gitlab.ilabt.imec.be:4567/ilabt/gpulab-examples/nvidia/sample:nvidia-smi
- For images in private Docker registries, use <username>:<password>@<reg_url>:<reg_port>/<image_name>:<tag>. Example: gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/ilabt/gpulab/sample:nvidia-smi
Note: the third format is not a standard Docker format; it is a GPULab extension. The first and second formats are standard Docker formats.
Warning
Never use your GitLab username/password combination directly. Always create a deploy token in your repository to allow GPULab to fetch your private image.
Where to store your custom Docker Images¶
We advise you to use a private Docker registry.
Here are some free options:
- If you work at IDLab, use the private container registry that is available for each repository on the iLab.t GitLab.
- If you only need one repository (which can house multiple images), you can use a free Docker Hub account.
- You can use the public GitLab (gitlab.com) to store Docker images. It offers far more than just storage for your containers; container registry space is available for each repository.
- Canister.io offers a limited number of free, private Docker repositories.
If you use GitLab, you can find instructions below. Instructions are similar for the other platforms.
You can find all instructions on how to use the registry in the “Registry” section, accessible from the left toolbar in GitLab. If this section is missing for your project, you first need to enable it: in your GitLab project, go to Settings - General - Permissions and enable “Container registry”.
Typically, your project and repository are private on GitLab. To use images that you push to this private registry in GPULab, you'll need to set up a read-only deploy token for the registry (in Settings - Repository - Deploy Token).
Use the 3rd format described above to pass the image and the deploy key to GPULab, for example:
gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/my-private-proj/sample:v1
If your GitLab project and repository are public, you do not need to specify a username and password when using the image in GPULab: gitlab.ilabt.imec.be:4567/my-public-proj/sample:v1
Pushing images to a private docker registry
You need to provide a username and password to Docker when pushing your images. This is done using docker login.
Example of building an image and pushing it to this repository:
docker build -t gitlab.ilabt.imec.be:4567/myproj/sample:v1 .
docker login gitlab.ilabt.imec.be:4567
docker push gitlab.ilabt.imec.be:4567/myproj/sample:v1
Deprecated shared registry
GPULab used to have a shared Docker registry (gpulab.ilabt.imec.be:5000). This registry has been made read-only at this stage, and will be taken offline at a later date.
We are moving away from it for security reasons. We strongly advise you not to use it anymore. If you are using it, please move your images.
The reasons not to use this registry:
- It is a shared registry: anyone can access the images stored in it (full read and write access!). So you should never store sensitive data inside images on this registry.
- There are no backups for this registry. You are responsible for keeping your Docker images backed up.
command¶
Example
{
"request": {
"docker" {
"command": "bash -c 'nvidia-smi; for i in `seq 1 5`; do echo $i; sleep 1; done;'",
...
}
}
}
Example
{
"request": {
"docker" {
"command": "/root/run-my-job.sh > /project_ghent/job-log-${GPULAB_JOB_ID}.log 2>&1",
...
}
}
}
Example
{
"request": {
"docker" {
"command": [ "/project_ghent/experiment-executable", "--data", "/project_scratch/data/", "--log", "/project_scratch/logs/" ],
...
}
}
}
This command is passed to the Docker container to run. When empty, the CMD specified in the Dockerfile used for building the specified Docker image will run.
Note that Docker cannot run complex shell commands directly. To run these, call bash and pass the complex command to it (as in the first example).
You can either specify the command as a single string, or as an array of strings.
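For instance, the bash -c command from the first example above could equivalently be written in array form; a sketch:
"command": [ "bash", "-c", "nvidia-smi; for i in `seq 1 5`; do echo $i; sleep 1; done;" ]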
Storage¶
Example
{
"request": {
"docker" {
"storage": [ { "containerPath": "/project_ghent" } ],
...
}
}
}
storage allows you to specify which volumes must be attached to the Docker container. All files which are not saved within one of the attached volumes are ephemeral, and will thus disappear when the job stops.
The volumes attached to your job are specific to the project within which it runs, i.e. all jobs run within one project will see the same files in a specific volume.
The hostPath specifies which location on the host must be mounted; the containerPath specifies on which directory inside the container it must be mounted.
Each storage entry normally requires both a containerPath and a hostPath. However, when specifying the root of a storage location (like /project_scratch), the hostPath can be omitted.
"storage": [
{
"containerPath": "/project_ghent"
}
],
This will cause a directory /project_ghent to be bound inside your Docker container.
You can mount /project_ghent to the “legacy” /project dir if needed:
"storage": [
{
"hostPath": "/project_ghent/",
"containerPath": "/project/"
}
],
You can also mount only subdirectories of the /project_ghent dir this way:
"storage": [
{
"hostPath": "/project_ghent/mycode/",
"containerPath": "/work/code/"
}
],
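You can also attach multiple volumes at once by adding several entries to the storage array. A sketch combining the root mount with the subdirectory mount from the examples above:
"storage": [
    {
        "containerPath": "/project_ghent"
    },
    {
        "hostPath": "/project_ghent/mycode/",
        "containerPath": "/work/code/"
    }
],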
To learn which storages are available on GPULab, please refer to the Storage page.
.ssh¶
If you need access to the ~/.ssh dir used by the SSH server that gives you access to the container (for example, to manually change the authorized_keys file), you need to mount it like this:
"storage": [
{
"hostPath": ".ssh",
"containerPath": "/root/.ssh/"
}
],
Opening TCP ports with portMappings¶
Example
{
"request": {
"docker" {
"portMappings": [ { "containerPort": 80 } ]
...
}
}
}
Sometimes, you want to access network services that run in the Docker container. For example, a web server showing status info might be running. To access such services, the ports need to be “exposed” by Docker. You need to specify this in the job definition.
You can specify zero, one or more port mappings. An example:
"portMappings": [ { "containerPort": 80 }, { "containerPort": 21 } ]
This will map port 80 to a port on the host machine. The output of the jobs <job id> command will show which port.
You can also choose the port of the host machine, but this might cause the job to fail if the port is already in use:
"portMappings": [ { "containerPort": 80, "hostPort" : 8080 } ]
On connectivity
The GPULab slaves in iGent have no public IPv4 addresses. To access the exposed ports, you need to use one of these methods:
- If you have IPv6 connectivity, you will automatically access them using IPv6.
- If your IDLab iGent VPN is active, you will automatically access them that way.
- Otherwise, use the IDLab Bastion Proxy
The Antwerp DGX-2 (cluster 7) is situated in a different datacenter than the other GPULab slaves, and IPv6 is not available there. To access the exposed ports, you need to use one of these methods:
- If your IDLab Antwerpen VPN is active, you will automatically access them that way.
- Otherwise, use the IDLab Bastion Proxy
Environment variables¶
Example
{
"request": {
"docker" {
"environment" : {
"DEMO_A": 1,
"DEMO_B": "two"
},
"projectGidVariableName": "DEMO_PROJECT_GID",
...
}
}
}
The environment variables inside the container can be set from the job request.
The "environment" field expects a map of key/value pairs that will be added to the container environment.
Additionally, "projectGidVariableName" can be used to specify the name of the environment variable that will be set to the project GID (the unix group ID used on the NFS shared storage for the job's project).
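As an illustrative sketch of why this GID is useful (DEMO_PROJECT_GID refers to the example above; the chgrp step is an assumed typical workflow, not a GPULab requirement), a job command could give generated files the project group:
"command": "bash -c 'touch /project_ghent/output.txt && chgrp $DEMO_PROJECT_GID /project_ghent/output.txt'"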
GPULab also automatically sets a lot of environment variables, which can be used to find info about the running job: GPULAB_CLUSTER_ID, GPULAB_CONTAINER_NAME, GPULAB_CPUS_RESERVED, GPULAB_DEPLOYMENT_ENVIRONMENT, GPULAB_DOCKER_IMAGE, GPULAB_GPUS_RESERVED, GPULAB_JOB_ID, GPULAB_MEM_RESERVED_MB, GPULAB_MEM_RESERVED_PROCESSES_MB, GPULAB_MEM_RESERVED_TMPFS_MB, GPULAB_PROJECT_NAME, GPULAB_PROJECT_URN, GPULAB_RESTART_COUNT, GPULAB_RESTART_INITIAL_JOB_ID, GPULAB_SLAVE_DNSNAME, GPULAB_SLAVE_HOSTNAME, GPULAB_SLAVE_INSTANCE_ID, GPULAB_SLAVE_PID, GPULAB_USERURN_AUTH, GPULAB_USERURN_NAME, GPULAB_USER_EMAIL, GPULAB_USER_MINI_ID, GPULAB_USER_URN and GPULAB_WORKER_ID.
Here is an example of the environment variables for the job request example at the start of this section:
GPULAB_CLUSTER_ID="7"
GPULAB_CONTAINER_NAME="twalcari-ilabt_d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_CPUS_RESERVED="20"
GPULAB_DEPLOYMENT_ENVIRONMENT="production"
GPULAB_DOCKER_IMAGE="sha256:cf0be059e923d77c308dd7ad33328de2677be7d43b796aaaf15f4daf8811b37b"
GPULAB_GPUS_RESERVED=""
GPULAB_JOB_ID="d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_MEM_RESERVED_MB="4096"
GPULAB_MEM_RESERVED_PROCESSES_MB="4096"
GPULAB_MEM_RESERVED_TMPFS_MB="0"
GPULAB_PROJECT_NAME="ilabt-dev"
GPULAB_PROJECT_URN="urn:publicid:IDN+ilabt.imec.be+project+ilabt-dev"
GPULAB_RESTART_COUNT="0"
GPULAB_RESTART_INITIAL_JOB_ID="d97ab0e0-4ff0-4549-ac99-a62a88781fa3"
GPULAB_SLAVE_DNSNAME="dgx2.idlab.uantwerpen.be"
GPULAB_SLAVE_HOSTNAME="slave7A"
GPULAB_SLAVE_INSTANCE_ID="inst-30"
GPULAB_SLAVE_PID="91788"
GPULAB_USERURN_AUTH="ilabt.imec.be"
GPULAB_USERURN_NAME="twalcari"
GPULAB_USER_EMAIL="Thijs.Walcarius@UGent.be"
GPULAB_USER_MINI_ID="twalcari@ilabt"
GPULAB_USER_URN="urn:publicid:IDN+ilabt.imec.be+user+twalcari"
GPULAB_WORKER_ID="28"
DEMO_A=1
DEMO_B=two
DEMO_PROJECT_GID=6978