The Job Definition

This page gives detailed information on some of the fields available in the Job Definition JSON.

You can find full examples in the tutorials section.

{
  "jobDefinition": {
    "name": "helloworld",
    "description": "Hello world!",

    "clusterId": 1,
    "resources": {
      "gpus": 1,
      "systemMemory": 2000,
      "cpuCores": 2
    },

    "dockerImage": "gpulab.ilabt.imec.be:5000/sample:nvidia-smi",
    "command": "",

    "jobDataLocations": [ ],
    "portMappings": [ ],

    "environment": { },
    "userUidVariableName": "USER_UID",
    "projectGidVariableName": "PROJ_GID"
  }
}

clusterId

Example

{
    "jobDefinition": {
        "clusterId": 1,
        ...
    }
}

The jobDefinition allows you to optionally specify a clusterId. A cluster corresponds to one or more worker (slave) nodes used by GPULab to execute jobs.

To request info about the available worker nodes and clusters, use the following command:

$ gpulab-cli clusters --short
+---------+---------+----------------------+---------+---------+---------+-----------------+
| Cluster | Version | Host                 | Workers | GPUs    | CPUs    | Memory (GB)     |
+---------+---------+----------------------+---------+---------+---------+-----------------+
| 1       | stable  | gpu2                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
| 1       | stable  | gpu1                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
+---------+---------+----------------------+---------+---------+---------+-----------------+

Omitting --short results in more info, including the GPU model etc.

When you do not specify a clusterId, GPULab will schedule your job on any available worker node that has the requested resources available. You typically want to specify the gpuModel in this case.
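
For example, to let GPULab pick any suitable cluster, you can leave out clusterId entirely and constrain the GPU model instead (a minimal sketch; the resources fields are detailed in the next section):

{
    "jobDefinition": {
        "resources": {
            "gpus": 1,
            "cpuCores": 2,
            "systemMemory": 2000,
            "gpuModel": "V100"
        },
        ...
    }
}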

resources

Example

{
    "jobDefinition": {
       "resources": {
           "cpuCores": 1,
           "gpus": 1,
           "systemMemory": 2000,
           "gpuModel": "V100",
           "minCudaVersion": 10
       },
       ...
    }
}

The resources part of the jobDefinition contains required and optional fields.

Required fields:

  • "gpus": allows you to specify the amount of GPU’s needed (0 or more)
  • "cpuCores": allows you to specify the amount of dedicated CPU cores needed (use 0 for a shared CPU core)
  • "systemMemory": amount of system memory needed, specified in MB

If more than one GPU or CPU core is requested, the job will not run while fewer than the requested number are available.

The amount of systemMemory needs to be specified in MB (so systemMemory: 2000 means 2GB). If more memory is requested than is available, the job will not run.

Optional fields:

  • "minCudaVersion": the minimum CUDA version installed on the GPULab slave machine, specified as an integer. For example:

    "resources": {
        "cpuCores": 1,
        "gpus": 1,
        "systemMemory": 2000,
        "minCudaVersion": 10
    }
    

    will match CUDA version 10.1.105, but not 9.1.85.

  • "gpuModel": (partial) name of the required GPU model type. This is matched against the GPULab models, which can be seen in the output of gpulab-cli clusters. Partial matches match, so:

    "resources": {
        "cpuCores": 1,
        "gpus": 1,
        "systemMemory": 2000,
        "gpuModel": "V100"
    }
    

    will match a GPU with model Tesla V100-PCIE-32GB.

As minCudaVersion and gpuModel work in addition to clusterId, you typically use them when omitting clusterId, to let GPULab pick any matching cluster.

Advanced: Implementation information

resources.cpuCores in the jobDefinition refers to “logical” CPU cores, as returned by /proc/cpuinfo on Linux. This means that hyper-threading doubles the number of cores. For example, a system with 2 CPUs that each have 2 cores and hyper-threading has 8 “logical processors”. So in GPULab, this system has 8 cpuCores available.

GPULab will use the IDs found in /proc/cpuinfo and manage them so there are no conflicts between jobs. It passes these IDs to docker when starting the containers for the job, using the docker --cpuset-cpus option. This uses the cpuset functionality of the Linux kernel.
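
To illustrate, the effect is comparable to starting a container with an explicit CPU set yourself (a simplified sketch, not the actual GPULab invocation):

# Restrict the container to logical CPU cores 1 and 2
docker run --cpuset-cpus="1,2" gpulab.ilabt.imec.be:5000/sample:nvidia-smi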

Inside a container, you’ll see that /proc/cpuinfo and lscpu return all CPUs on the host. If you need to know the IDs of the CPUs your job is restricted to, check either /sys/fs/cgroup/cpuset/cpuset.cpus or $GPULAB_CPUS_RESERVED. An example of this:

$ gpulab-cli --dev interactive --project ilabt-dev --cluster-id 1 --duration-minutes 10 --docker-image gpulab.ilabt.imec.be:5000/sample:nvidia-smi --cpus 2
cf137c7c-a89f-11e9-93a1-db841704c121
2019-07-17 16:33:14 +0200 -            - Waiting for Job to start running...
2019-07-17 16:33:16 +0200 - 2 seconds  - Job is in state RUNNING
2019-07-17 16:33:16 +0200 - 2 seconds  - Job is now running
root@6d902d7ac334:/# cat /sys/fs/cgroup/cpuset/cpuset.cpus
1-2
root@6d902d7ac334:/# echo $GPULAB_CPUS_RESERVED
1,2

Tip: Application Threads

How many threads should the application in a job use, if it has X cores?

This depends on the application. Normally, if there is no I/O or synchronisation, but just pure calculation, X threads is optimal. If there is I/O involved, you will gain by using more threads: with only very light I/O, X+1 threads is probably good, while a heavy I/O bottleneck might require many more threads for optimal speed.

In the end, you’ll need to benchmark this if you really need to know for your application.
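
Since /proc/cpuinfo and lscpu inside the container report all host CPUs, one way to size a thread pool is to count the CPU IDs reserved for your job (a sketch; my-application and its --threads option are hypothetical placeholders):

# $GPULAB_CPUS_RESERVED is a comma-separated list of CPU IDs, e.g. "1,2"
NUM_CORES=$(echo "$GPULAB_CPUS_RESERVED" | tr ',' '\n' | wc -l)
my-application --threads "$NUM_CORES"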

dockerImage

Example

{
    "jobDefinition": {
        "dockerImage": "gpulab.ilabt.imec.be:5000/sample:nvidia-smi",
        ...
    }
}

You can specify the Docker image that needs to be executed here.

This image can be specified in one of 3 formats:

  1. For Docker Hub images: Use <image_name>:<tag>

    Example: ubuntu:bionic or osrf/ros:melodic-desktop-full-bionic

  2. For images in public docker registries: Use <reg_url>:<reg_port>/<image_name>:<tag>

    Example: gpulab.ilabt.imec.be:5000/jupyter/tensorflow-notebook:latest

  3. For images in private docker registries: Use <username>:<password>@<reg_url>:<reg_port>/<image_name>:<tag>.

    Example: gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/ilabt/gpulab/sample:nvidia-smi

    Note: this 3rd format is not a standard docker format: it’s a GPULab extension. The 1st and 2nd formats are standard docker formats.
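
    Used in a jobDefinition, this private-registry format could look like this (the deploy token value is a placeholder, as above):

    {
        "jobDefinition": {
            "dockerImage": "gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/ilabt/gpulab/sample:nvidia-smi",
            ...
        }
    }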

iLab.t offers 2 options to store your docker images:

  • The iLab.t GitLab provides a private docker registry for each project.

    You can find all instructions on how to use this in the “Registry” section accessible from the left toolbar in GitLab. If this is missing for your project, you first need to enable it: in your GitLab project, go to Settings - General - Permissions and enable “Container registry”.

    To use the images you push to this registry in GPULab, you’ll need to set up a read-only deploy token for the registry (in Settings - Repository - Deploy Tokens). Use the 3rd format described above to pass the image and the deploy token to GPULab, for example: gitlab+deploy-token-3:XXXXXXXXXXXXX@gitlab.ilabt.imec.be:4567/ilabt/gpulab/sample:nvidia-smi

    Example of building an image and pushing it to this repository:

    docker build -t gitlab.ilabt.imec.be:4567/ilabt/myproj/sample:v1 .
    docker login gitlab.ilabt.imec.be:4567
    docker push gitlab.ilabt.imec.be:4567/ilabt/myproj/sample:v1
    
  • GPULab has a shared docker repository gpulab.ilabt.imec.be:5000. You can freely use it to store your custom docker images.

    Be aware that as this is a shared repository, anyone can access the images stored in it (full read and write access!). So do not store sensitive data inside images on this repository. Also note that there are no backups for this repository: you are responsible for keeping your docker images backed up.

    Example of building an image and pushing it to this repository:

    docker build -t gpulab.ilabt.imec.be:5000/myname/sample:v1 .
    docker push gpulab.ilabt.imec.be:5000/myname/sample:v1
    

command

Example

{
    "jobDefinition": {
        "command": "bash -c 'nvidia-smi; for i in `seq 1 5`; do echo $i; sleep 1; done;'",
        ...
    }
}
{
    "jobDefinition": {
        "command": "/root/run-my-job.sh > /project/job-log-${GPULAB_JOB_ID}.log 2>&1",
        ...
    }
}

This command is passed to the docker container to run. When empty, the CMD specified in the Dockerfile of the specified docker image will run.

jobDataLocations

Example

{
    "jobDefinition": {
        "jobDataLocations": [ { "mountPoint": "/project" } ],
        ...
    }
}

jobDataLocations allows you to specify which volumes must be attached to the Docker container.

The sharePath specifies which location on the host must be mounted; the mountPoint specifies the dir inside the container on which it must be mounted. Each jobDataLocation normally requires both a mountPoint and a sharePath. However, you may specify only one of them for the special locations .ssh and /project (and its subdirectories).

GPULab containers can access the same storage as Virtual Wall 2 projects. To mount the project folder to /project, you specify it as mountPoint:

"jobDataLocations": [
    {
       "mountPoint": "/project"
    }
],

This will cause a directory /project to be bound inside your docker container.

It will contain the same data as in /groups/wall2-ilabt-iminds-be/MyProject/. As the same NFS share is mounted behind the scenes, the data is instantly shared and never deleted.
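
For example, a file created by a job like this (hello.txt is just an illustration):

touch /project/hello.txt

is immediately visible on Virtual Wall 2 nodes as /groups/wall2-ilabt-iminds-be/MyProject/hello.txt.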

You can also mount only subdirectories of the /project dir this way:

"jobDataLocations": [
    {
       "sharePath": "/project/mycode/",
       "mountPoint": "/work/code/"
    }
],

If you need access to the ~/.ssh dir used by the SSH server that gives you access to the container (for example, to manually change the authorized_keys file), you need to mount it like this:

"jobDataLocations": [
    {
       "sharePath": ".ssh",
       "mountPoint": "/root/.ssh/"
    }
],

portMappings

Example

{
    "jobDefinition": {
        "portMappings": [ { "containerPort": 80 } ]
        ...
    }
}

Sometimes you want to access network services that run on the docker container, for example a webserver showing status info. To access such services, the ports need to be “exposed” by docker. You need to specify this in the job definition.

You can specify zero, one or more port mappings. An example:

"portMappings": [ { "containerPort": 80 }, { "containerPort": 21 } ]

This will map port 80 to a port on the host machine. The output of the gpulab-cli jobs <job id> command will show which port.

You can also choose the port of the host machine, but this might cause the job to fail if the port is already in use:

"portMappings": [ { "containerPort": 80, "hostPort" : 8080 } ]

environment, userUidVariableName and projectGidVariableName

Example

{
    "jobDefinition": {
        "environment" : {
            "DEMO_A": 1,
            "DEMO_B": "two"
        },
        "userUidVariableName": "DEMO_USER_UID",
        "projectGidVariableName": "DEMO_PROJECT_GID",
        ...
    }
}

The environment variables inside the container can be set from the jobDefinition.

The "environment" expects a map of key/value pairs that will be added.

Additionally, "userUidVariableName" and "projectGidVariableName" can be user to specify the name of the enviroment variables that will container the user UId and the project GID respectivly.

GPULab also automatically sets a lot of environment variables, which can be used to find info about the running job: GPULAB_CLUSTER_ID, GPULAB_CPUS_RESERVED, GPULAB_GPUS_RESERVED, GPULAB_JOB_ID, GPULAB_MEM_RESERVED, GPULAB_PROJECT, GPULAB_SLAVE_DNSNAME, GPULAB_SLAVE_HOSTNAME, GPULAB_SLAVE_INSTANCE_ID, GPULAB_SLAVE_PID, GPULAB_USERNAME, GPULAB_VERSION and GPULAB_WORKER_ID.

Here is an example of the environment variables for the jobDefinition example at the start of this section:

GPULAB_CLUSTER_ID=1
GPULAB_CPUS_RESERVED=0
GPULAB_GPUS_RESERVED=0
GPULAB_JOB_ID=a280f7e8-b1e6-11e9-93a1-ffe9aeb472f8
GPULAB_MEM_RESERVED=2000m
GPULAB_PROJECT=ilabt-dev
GPULAB_SLAVE_DNSNAME=n085-02.wall2.ilabt.iminds.be
GPULAB_SLAVE_HOSTNAME=n085-02
GPULAB_SLAVE_INSTANCE_ID=af25579982dc3949040
GPULAB_SLAVE_PID=21924
GPULAB_USERNAME=wvdemeer
GPULAB_VERSION=dev
GPULAB_WORKER_ID=0
DEMO_A=1
DEMO_B=two
DEMO_PROJECT_GID=6978
DEMO_USER_UID=10006