GPULab client

Send bug reports, questions and feedback to: jfedbugreports@ilabt.imec.be

And/or use the Mattermost channel.

You can see known bugs and feature requests at the iLab.t GitLab

Current stable version

The current version of GPULab is 1.5

You can download it here.

Upgrade with:

sudo pip3 install gpulab-client-1.5.tar.gz

New features in this version (available on both the stable and dev versions):

  • SSH access to running containers:

    gpulab-cli ssh <job_id>

  • Get emails when a job starts and/or completes:

    gpulab-cli submit --project <PROJECT> --email-run --email-done < jobDefinition.json

  • See the log as it is added:

    gpulab-cli log --follow <job_id>

  • The gpulab-cli client will now notify you when there’s an update available.

A lot of server-side improvements have been added as well.

These changes are now deployed on both stable and dev.

Please let us know if you encounter any bugs.

Requirements

You need pip for Python 3 to install the gpulab-client. To install it on Debian/Ubuntu, try:

sudo apt-get install python3-pip

Make sure you have at least Python 3.4. You can check with:

python3 --version

If your Linux distribution does not have a recent enough Python, try the following procedure using pyenv, which works on (almost) any Linux:

curl -L https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer | bash
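# You may need to restart your shell (or add pyenv to your PATH, as the installer instructs) before pyenv is available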
pyenv update
# optional on debian: sudo apt-get install libbz2-dev libreadline-dev libsqlite3-dev
pyenv install 3.6.2
pyenv local 3.6.2
pyenv versions
python3 --version

The last command should show that you now have a recent enough Python version.

(Note: If you install python locally using this method, you do not need to add “sudo” in front of the installation command in the next section.)

Installation

You can find the download link for the latest gpulab client tarball at the top of this page.

To install the application, all you need to do is run:

sudo pip3 install gpulab-client-1.5.tar.gz

The Python “pip” system will take care of all details. You will end up with a local install of gpulab-cli.
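
To verify the installation, you can ask the client for its version:

gpulab-cli --version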

Stable and dev version

There are two GPULab versions running in parallel: a stable version and a development version.

Since GPULab is still very new, you should, for now, consider the stable version to be in “development” to some degree as well.

Of course, we will try to keep the “stable” version more stable. New features and bugfixes will first be deployed and tested on the “development” GPULab, and “stable” will get them later (unless they are critical bugfixes). We will also put any unstable hardware in “dev”, not in “stable”.

Currently, the GPU nodes used by the two GPULab versions are completely separated, but this could change in the future.
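
To use the development version, pass the --dev option to the client, or set the GPULAB_DEV environment variable described in the “Basic CLI usage” section below. A minimal sketch (the value 'True' is an assumption; the section below only shows 'False'):

# Run a single command against the development GPULab
gpulab-cli --dev jobs
# Or select it via the environment variable (value 'True' assumed here)
export GPULAB_DEV='True'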

Basic CLI usage

After installation, the gpulab-cli command is available:

$ gpulab-cli --help
Usage: gpulab-cli [OPTIONS] COMMAND [ARGS]...

   GPULab client version 1.5

   Send bugreports, questions and feedback to: jfedbugreports@ilabt.imec.be

   Documentation: https://doc.ilabt.imec.be/ilabt-documentation/gpulab.html

Options:
  --cert PATH          Login certificate  [required]
  -p, --password TEXT  Password associated with the login certificate
  --dev                Use the GPULab development environment
  --servercert PATH    The file containing the server's (self-signed)
                       certificate. Only required when the server uses a
                       self-signed certificate.
  --version            Print the GPULab client version number and exit.
  -h, --help           Show this message and exit.

Commands:
  cancel    Cancel running job
  clusters  Retrieve info about the available clusters
  debug     Retrieve a job's debug info. (Do not rely on the presence or
            format of this info. It will never be stable between versions. If
            this is the only source of info you need, ask the developers to
            expose that info in a different way!)
  hold      Hold queued job(s). Status will change from QUEUED to ONHOLD
  jobs      Get info about one or more jobs
  log       Retrieve a job's log
  release   Release held job(s). Status will change from ONHOLD to QUEUED
  rm        Remove job
  submit    Submit a jobDefinition to run
  wait      Wait for a job to change state

To get a list of currently running jobs:

$ gpulab-cli --cert /home/me/my_wall2_login.pem jobs
TASK ID                             NAME                      COMMAND                   CREATED              USER            PROJECT         STATUS

This command is quite long; you can store some of this info in environment variables so you don't have to type it each time.

export GPULAB_CERT='/home/me/my_wall2_login.pem'
export GPULAB_DEV='False'

Recommendation: If you append these exports to ~/.bashrc you’ll never have to type them again!
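
For example, assuming your certificate is stored at /home/me/my_wall2_login.pem, you can append the exports like this:

echo "export GPULAB_CERT='/home/me/my_wall2_login.pem'" >> ~/.bashrc
echo "export GPULAB_DEV='False'" >> ~/.bashrc
source ~/.bashrc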

The same command to get a list of currently running jobs is now much shorter:

gpulab-cli jobs

To request info about the available worker nodes and clusters, use the following command:

$ gpulab-cli clusters --short
+---------+---------+----------------------+---------+---------+---------+-----------------+
| Cluster | Version | Host                 | Workers | GPUs    | CPUs    | Memory (GB)     |
+---------+---------+----------------------+---------+---------+---------+-----------------+
| 1       | stable  | gpu2                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
| 1       | stable  | gpu1                 |   2/2   |   2/2   |  16/16  |   31.18/31.18   |
+---------+---------+----------------------+---------+---------+---------+-----------------+

Omitting --short results in more info, including the GPU model etc.

Next, create a test job definition (you can of course use any editor you like):

cat > my-first-jobDefinition.json <<EOF
 {
     "jobDefinition": {
         "name": "Iterativenet",
         "description": "hello world",
         "clusterId": 1,
         "dockerImage": "gpulab.ilabt.imec.be:5000/sample:nvidia-smi",
         "jobType": "BATCH",
         "command": "",
         "resources": {
             "gpus": 1,
             "systemMemory": 32768,
             "cpuCores": 2
         },
         "jobDataLocations": [ ],
         "portMappings": [ ]
     }
 }
EOF

To submit the job, you'll have to specify the name of the project on the wall2 authority in which you want it to run. Be aware that this name is case sensitive. For now, only 1 GPU per job can be used, so the gpus field is currently not taken into account. The command is:

$ gpulab-cli submit --project=myproject < my-first-jobDefinition.json
78125766-0b45-11e8-be1c-0fbd357c0b05

The job ID is returned.

You can now query the status of this job using this job ID, or the first part of it (if it is unique enough):

$ gpulab-cli jobs 7812
          Job ID: 78125766-0b45-11e8-be1c-0fbd357c0b05
           Name: no name
        Project: fed4fire
       Username: wvdemeer
   Docker image: gpulab.ilabt.imec.be:5000/sample:nvidia-smi
        Command: -
         Status: FINISHED
        Created: 2018-02-06T13:56:02-07:00
      Worker ID: -
Worker hostname: 192.168.0.1
        Started: 2018-02-06T14:09:44-07:00
       Duration: 1 second
       Finished: 2018-02-06T14:09:45-07:00
       Deadline: 2018-02-07T00:09:44-07:00

You can also view the command line output of the job:

$ gpulab-cli log 7812
2018-02-06T14:09:45.185400608Z
2018-02-06T14:09:45.185451167Z ==============NVSMI LOG==============
2018-02-06T14:09:45.185459009Z
2018-02-06T14:09:45.185466771Z Timestamp                           : Tue Feb  6 14:09:45 2018
2018-02-06T14:09:45.185471972Z Driver Version                      : 390.12
2018-02-06T14:09:45.185477068Z
2018-02-06T14:09:45.185490896Z Attached GPUs                       : 1
2018-02-06T14:09:45.185612743Z GPU 00000000:02:00.0
2018-02-06T14:09:45.185713708Z     Product Name                    : GeForce GTX 580
2018-02-06T14:09:45.186226030Z     Product Brand                   : GeForc
...

You can also view the internal event log of GPULab. This is mostly useful for debugging purposes:

$ gpulab-cli debug 7812

Requesting resources

Each jobDefinition includes a "clusterId". This corresponds to one or more slave nodes used by GPULab to execute the jobs.

Currently, these are the available clusters:

  • "clusterId"=1: 2 slaves nodes with each: 2 GPUs (GeForce GTX 1080 Ti), one CPU (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz - 8 cores - 16 threads) and 32GB of memory.

Note: you can also query this info with the gpulab-cli clusters command (see above).

Each jobDefinition also includes "resources", which specifies the number of GPUs ("gpus") and CPU cores ("cpuCores") and the amount of memory ("systemMemory") needed.

If more than one GPU or CPU core is requested, the job will not run as long as fewer than the requested number are available.

The amount of systemMemory needs to be specified in MB (for example, "systemMemory": 8192 requests 8 GB). If more memory is requested than is available, the job will run with reduced memory! (This behaviour might change in future versions.)

Shared docker repository

At gpulab.ilabt.imec.be:5000 a shared docker repository is available. You can use it to store your custom docker images.

Be aware that this is a shared repository: anyone can access the images stored in it. So do not store sensitive data inside images on this repository.

Also note that there are no backups for this repository. You are responsible for keeping your docker images backed up.
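
To store an image on the shared repository, tag your locally built image with the registry host and push it. A minimal sketch (the image name my-image is just an example):

docker tag my-image:latest gpulab.ilabt.imec.be:5000/my-image:latest
docker push gpulab.ilabt.imec.be:5000/my-image:latest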

Storage: Accessing your project dir

Shared project dir

When you start an experiment with wall2 resources in the project MyProject, all your nodes will have this shared directory:

/groups/wall2-ilabt-iminds-be/MyProject/

You can use this dir to share data between nodes in the project, and between project members. The data is stored permanently (it is not deleted after your experiment is terminated).

You can also make this shared dir available to gpulab jobs.

(Important disclaimer: There are no automatic backups for this storage! You need to keep backups of important files yourself!)

Access data locally

To access this data share from your own machine, you’ll need to use the jFed experimenter GUI to reserve a resource, and access the data from that resource. You can find a detailed tutorial on how to do this in the Fed4Fire first experiment tutorial. Note that jFed has basic scp functionality, to make transferring files easier.

Access data from gpulab

You can access this same storage from within gpulab jobs. You just need to add the /project mountpoint to the jobDefinition:

"jobDataLocations": [
    {
       "mountPoint": "/project"
    }
],

This will cause a directory /project to be bound inside your docker container.

It will contain the same data as in /groups/wall2-ilabt-iminds-be/MyProject/. As the same NFS share is mounted behind the scenes, the data is instantly shared and never deleted.

You can also mount only subdirectories of the /project dir this way:

"jobDataLocations": [
    {
       "sharePath": "/project/mycode/",
       "mountPoint": "/work/code/"
    }
],

A second example: using the project dir

First, create a jobDefinition which uses the shared project storage:

cat > my-second-jobDefinition.json <<EOF
 {
     "jobDefinition": {
         "name": "my-2nd-gpulab-job",
         "description": "hello again world",
         "clusterId": 1,
         "dockerImage": "gpulab.ilabt.imec.be:5000/sample:nvidia-smi",
         "jobType": "BATCH",
         "command": "/project/start-my-gpulab-app.sh",
         "resources": {
             "gpus": 1,
             "systemMemory": 32768,
             "cpuCores": 1
         },
         "jobDataLocations": [
             {
                 "mountPoint": "/project"
             }
         ],
         "portMappings": []
     }
 }
EOF

Now create start-my-gpulab-app.sh, which is the start command in the jobDefinition. This will override any start command in the docker image.

$ cat > start-my-gpulab-app.sh <<EOF
#!/bin/bash
echo 'This is my test gpulab app.'
sleep 30
echo 'Ok, all the hard work is done now'
EOF

Upload this script to your wall2 shared project dir and make sure it is executable (see above; a sketch is shown below).
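
For example, assuming you have a reserved wall2 node in your project (the node name and username below are hypothetical), you could copy the script and make it executable like this:

scp start-my-gpulab-app.sh myuser@n083-05.wall2.ilabt.iminds.be:/groups/wall2-ilabt-iminds-be/MyProject/
ssh myuser@n083-05.wall2.ilabt.iminds.be "chmod +x /groups/wall2-ilabt-iminds-be/MyProject/start-my-gpulab-app.sh"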

Submit the job:

$ gpulab-cli submit --project=MyProject < my-second-jobDefinition.json
db56c279f4d5499585856a76caf28ef2

Check its status and logs:

$ gpulab-cli jobs db56c279f4d5499585856a76caf28ef2

$ gpulab-cli log db56c279f4d5499585856a76caf28ef2

Third example: real world example

Suppose the following typical use case: we have software on our laptop to crunch data, and we want to run this on GPULab.

As the software, we take gpu_burn (http://wili.cc/blog/gpu-burn.html): we can download the source, but it needs to be compiled. For this, you need a Linux machine with the CUDA SDK. An important point: the current docker containers only provide v8 of the CUDA libraries, not v9, so download CUDA 8 from https://developer.nvidia.com/cuda-80-ga2-download-archive. Then compile the software and move it to the Virtual Wall project dir /groups/wall2-ilabt-iminds-be/projectname. (Much easier: use a Virtual Wall machine to download CUDA and compile the software.)

As a second step, create the script to execute and also put it in the NFS project dir:

cat > start-my-gpulab-app.sh <<'EOF'
#!/bin/bash
ls /project
#ls /usr/local/cuda/
#ls /usr/local/cuda/lib64

date
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
cd /project/gpuburn/
./gpu_burn 3600 > /project/gpuburn/output_`date +%s`_$RANDOM.log 2>&1
sleep 30
echo 'ok'
date
EOF

A couple of important points here:

  • use /bin/bash on the first line to make sure that e.g. ls works
  • watch the export of LD_LIBRARY_PATH
  • because gpu_burn also needs the file compare.ptx, you need to cd into that directory (this might be true for other software as well)
  • gpulab is a bit sensitive to too much logging, so redirect your logging. If you launch multiple instances of the same job/script, be careful that the logfiles are unique per job (the example uses the linux timestamp and a random number)
  • the heredoc delimiter is quoted (<<'EOF') so that LD_LIBRARY_PATH, the timestamp and $RANDOM are expanded when the job runs, not when the file is created
  • date comes in handy to know the start and stop times

Then define the job definition for this job (this file does not need to be on the project dir; it can be on your local laptop or e.g. agnesi.intec.ugent.be):

cat > my-third-jobDefinition.json <<EOF
{
 "jobDefinition": {
     "name": "projectdir",
     "description": "hello again world",
     "clusterId": 1,
     "dockerImage": "gpulab.ilabt.imec.be:5000/sample:nvidia-smi",
     "jobType": "BATCH",
     "command": "/project/start-my-gpulab-app.sh",
     "resources": {
         "gpus": 1,
         "systemMemory": 32768,
         "cpuCores": 1
     },
     "jobDataLocations": [
         {
             "mountPoint": "/project"
         }
     ],
     "portMappings": []
 }

}
EOF

Note that we can use the same nvidia-smi docker image, which has the CUDA 8 libs. Then launch this job, and verify that the output appears in a logfile in your project dir.

gpulab-cli --cert bvermeu2_decrypted.pem  --dev  submit  --project=bvermeul9 < my-third-jobDefinition.json

You can launch this 12 times to fully load dev gpulab :-).
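
For example, a simple shell loop can submit the same job definition several times (this just repeats the submit command above):

for i in $(seq 1 12); do
    gpulab-cli --cert bvermeu2_decrypted.pem --dev submit --project=bvermeul9 < my-third-jobDefinition.json
done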

Fourth example: using a custom docker image

This example features a custom docker image.

The docker registry is initially not secured and it is shared by all users of gpulab. This will be changed later.

First let’s look at the images available on the docker registry:

$ curl -X GET https://gpulab.ilabt.imec.be:5000/v2/_catalog
{"repositories":["alpine","sample"]}
$ curl -X GET https://gpulab.ilabt.imec.be:5000/v2/alpine/tags/list
{"name":"alpine","tags":["latest"]}
$ curl -X GET https://gpulab.ilabt.imec.be:5000/v2/sample/tags/list
{"name":"sample","tags":["nbody","matrixMulCUBLAS","bandwidthTest","deviceQuery","nvidia-smi","vectorAdd"]}

(Note: there might be a better way to query this info.)

Fetch an image locally:

docker pull gpulab.ilabt.imec.be:5000/sample:nvidia-smi

You can investigate the image if you like, but for this tutorial, that is not required:

$ docker run -t -i --entrypoint bash gpulab.ilabt.imec.be:5000/sample:nvidia-smi
root@ea1bb2a5bae1:/# ls /usr/local/cuda/bin
bin2c        crt       cuda-gdbserver  cudafe    cuobjdump  gpu-library-advisor  nvcc.profile  nvlink  nvprune
computeprof  cuda-gdb  cuda-memcheck   cudafe++  fatbinary  nvcc                 nvdisasm      nvprof  ptxas
root@7f2e1c1ded19:/# exit

Next, we’ll create a custom image to run our application.

mkdir my-first-gpulab-app
cd my-first-gpulab-app
cat > Dockerfile <<EOF
# Start from a sample image from nvidia
FROM gpulab.ilabt.imec.be:5000/sample:nvidia-smi

# Set the working directory to /project
#  (which will be linked to your project in the jobDefinition)
WORKDIR /project

# Install packages inside the container (if needed, openssl used as an example here)
RUN apt-get update
RUN apt-get install -y openssl

# Start your applications startup script in the /project dir
#  (which will be linked to your project in the jobDefinition)
CMD ["/project/start-my-gpulab-app.sh"]
EOF

Now build the image:

$ docker build -t gpulab.ilabt.imec.be:5000/my-first-gpulab-app .
Sending build context to Docker daemon  3.072kB
Step 1/5 : FROM gpulab.ilabt.imec.be:5000/sample:nvidia-smi
 ---> 3c8f5c1a3ca0
Step 2/5 : WORKDIR /project
 ---> Using cache
 ---> 2f16c8f05768
Step 3/5 : RUN apt-get update
 ---> Running in a6a7c73e4c60
Get:1 http://security.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
Ign:2 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64  InRelease

... etc ...

Processing triggers for libc-bin (2.23-0ubuntu9) ...
 ---> ebcb9595c0ae
Removing intermediate container 95b110fd232b
Step 5/5 : CMD /project/start-my-gpulab-app.sh
 ---> Running in 305d2e446862
 ---> 6aa40ec0a05e
Removing intermediate container 305d2e446862
Successfully built 6aa40ec0a05e
Successfully tagged gpulab.ilabt.imec.be:5000/my-first-gpulab-app:latest

And upload it to the repository:

docker push gpulab.ilabt.imec.be:5000/my-first-gpulab-app:latest

Next, create a job that uses this image:

cat > my-fourth-jobDefinition.json <<EOF
 {
     "jobDefinition": {
         "name": "my-2nd-gpulab-job",
         "description": "hello again world",
         "clusterId": 1,
         "dockerImage": "gpulab.ilabt.imec.be:5000/my-first-gpulab-app:latest",
         "jobType": "BATCH",
         "command": "",
         "resources": {
             "gpus": 1,
             "systemMemory": 32768,
             "cpuCores": 1
         },
         "jobDataLocations": [
             {
                 "mountPoint": "/project"
             }
         ],
         "portMappings": []
     }
 }
EOF

Note that in this case, no command is specified in the jobDefinition, so the command specified in CMD in the Dockerfile will be used.

Now that the preparations are done, submit the job:

$ gpulab-cli submit --project=MyProject < my-fourth-jobDefinition.json
e4f5652f51794cfa924c0503f6ffa2dc

Check its status and logs:

$ gpulab-cli jobs e4f
         Job ID: e4f5652f51794cfa924c0503f6ffa2dc
        Project: MyProject
       Username: username
   Docker image: gpulab.ilabt.imec.be:5000/my-first-gpulab-app:latest
        Command:
         Status: FINISHED (last change: 2017-08-25T16:03:51.128169)
        Created: 1503676999
         Worker: gpu2.gpulab.wall2-ilabt-iminds-be.wall2.ilabt.iminds.be#1
        Started: 2017-08-25T16:03:20
       Duration: 30 seconds
       Finished: 2017-08-25T16:03:51
$ gpulab-cli log e4f
2017-08-25T10:03:20.900523045Z  This is my test gpulab app. It is executed by the docker image if no command is specified
2017-08-25T10:03:50.902423420Z  Ok, all the hard work is done now

Accessing network services running on the container: port mappings

Sometimes you want to access network services that run in the docker container. For example, a webserver showing status info might be running. To access such services, the ports need to be “exposed” by docker. You need to specify this in the job definition.

You can specify zero, one or more port mappings. An example:

"portMappings": [ { "containerPort": 80 } ]

This will map port 80 to a port on the host machine. The output of the jobs <job id> command will show which port.

You can also choose the port of the host machine, but this might cause the job to fail if the port is already in use:

"portMappings": [ { "containerPort": 80, "hostPort" : 8080 } ]

Using the CLI without password

You can use the CLI without password. Be aware that this lowers security.

You need to install openssl to execute the commands below. On Debian, try:

sudo apt-get install openssl

The password protects the PEM file because it is used to encrypt the private RSA key inside it. You can decrypt the RSA key and store it unencrypted to remove the password. Below, we assume that your (password-protected) wall2 PEM file is my_wall2_login.pem. The commands will create the file my_wall2_login_decrypted.pem, which will not be password protected.

Use these commands:

openssl rsa -in my_wall2_login.pem > my_wall2_login_decrypted.pem
openssl x509 -in my_wall2_login.pem >> my_wall2_login_decrypted.pem

(The first command will ask for your password; the second won't.)
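
You can now use the decrypted certificate without being prompted for a password, for example:

gpulab-cli --cert my_wall2_login_decrypted.pem jobs

Or store its path in the GPULAB_CERT environment variable, as described in the “Basic CLI usage” section above.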

GPUlab with jupyter notebook

You can use GPUlab to run an interactive jupyter notebook server.

Docker image

Use this docker image: gpulab.ilabt.imec.be:5000/jupyter-example:v1

Or build a similar one. This is the Dockerfile of the image above:

FROM gpulab.ilabt.imec.be:5000/sample:nvidia-smi

RUN apt-get update && apt-get install -y build-essential libssl-dev libffi-dev python-dev && rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y python-pip python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip install ipywidgets bcolz sympy ujson pandas matplotlib graphviz pydot jupyter
RUN pip3 install ipywidgets bcolz sympy ujson pandas matplotlib graphviz pydot jupyter

EXPOSE 8888
ENTRYPOINT ["/usr/local/bin/jupyter", "notebook", "--allow-root", "--no-browser", "--ip=0.0.0.0", "--port=8888"]
#CMD jupyter notebook --allow-root --no-browser --ip=0.0.0.0 --port=8888

These commands were used to build it and push it to the repository:

docker build -t gpulab.ilabt.imec.be:5000/jupyter-example:v1 .
docker push gpulab.ilabt.imec.be:5000/jupyter-example:v1

Running the jupyter notebook as a job

This is the job definition:

{
    "jobDefinition": {
        "name": "Jupyter-ex",
        "description": "Jupyter Notebook Example",
        "clusterId": 1,
         "dockerImage": "gpulab.ilabt.imec.be:5000/jupyter-example:v1",
        "jobType": "BATCH",
        "command": "",
        "resources": {
             "gpus": 1,
             "systemMemory": 32768,
             "cpuCores": 1
        },
        "jobDataLocations": [
            {
                "mountPoint": "/project"
            }
        ],
        "portMappings": [ { "containerPort": 8888 } ]
    }
}

Save this to a file named jupyterEx-jobDefinition.json

Then submit the job:

$ gpulab-cli submit --project YOURPROJECT --wait-run < jupyterEx-jobDefinition.json
eb8e6578-6251-11e8-b435-8b225f12d88d

Once done, get some info about the job:

$ gpulab-cli jobs eb8e6578

Look for the following lines to get the hostname and port:

Port Mappings: 8888/tcp -> 33043
  Worker Host: n085-01.wall2.ilabt.iminds.be

Also look in the job logs, to find the jupyter token. You can use grep to quickly find it:

$ gpulab-cli --dev log eb8e6578 | grep token=
2018-05-28T08:34:43.368617581Z [I 08:34:43.368 NotebookApp] http://e4807d176222:8888/?token=2469168f238518cc894a0c9349ff2091ffb0d3a123628269
2018-05-28T08:34:43.369187810Z         http://e4807d176222:8888/?token=2469168f238518cc894a0c9349ff2091ffb0d3a123628269

Open your browser and use the corrected URL: replace the container hostname and port in the logged URL with the worker host and mapped port found above. In the example, this is:

http://n085-01.wall2.ilabt.iminds.be:33043/?token=2469168f238518cc894a0c9349ff2091ffb0d3a123628269

Note that you need IPv6 to access some of the worker nodes.

Once inside the jupyter notebook, you can open and save files. Note that the /project dir is available (wall2 shared project dir).

Do not forget to use the “Quit” button at the top left of the window to close the jupyter notebook. This will stop the gpulab job. You can also stop the job using the CLI:

$ gpulab-cli cancel eb8e6578-6251-11e8-b435-8b225f12d88d