GPULab Client CLI

Prerequisites

You need a Fed4FIRE account to use GPULab. You can sign up at the imec User Authority.

To run the CLI, you need pip for python3 to install the gpulab-client. To install it on Debian/Ubuntu, try:

sudo apt-get install python3-pip

Make sure you have at least Python 3.4. You can check with:

python3 --version

Tip: Using pyenv to install GPULab in a separate environment

If your Linux distribution does not have a recent enough Python, try using pyenv which works on (almost) any Linux:

curl -L https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer | bash
pyenv update
#optional on debian:   apt-get install libbz2-dev libreadline-dev libsqlite3-dev
pyenv install 3.6.2
pyenv local 3.6.2
pyenv versions
python3 --version

The last command should show that you now have a recent enough python version.

(Note: If you install python locally using this method, you do not need to add sudo in front of the installation command in the next section.)

Installation

The current version of GPULab is 1.9

You can download it here: gpulab-client-1.9.tar.gz.

All version can be found at the gitlab GPULab tags page. This includes the changelog for each version.

To install, run:

sudo pip3 install gpulab-client-1.9.tar.gz

The python “pip” system will take care of all details. You will end up with a local install of gpulab-cli

Basic CLI usage

After installation, the gpulab-cli command is available:

$ gpulab-cli --help
Usage: gpulab-cli [OPTIONS] COMMAND [ARGS]...

   GPULab client version 1.9

   Send bugreports, questions and feedback to: jfedbugreports@ilabt.imec.be

   Documentation: https://doc.ilabt.imec.be/ilabt-documentation/gpulab.html

Options:
  --cert PATH          Login certificate  [required]
  -p, --password TEXT  Password associated with the login certificate
  --dev                Use the GPULab development environment
  --servercert PATH    The file containing the servers (self-signed)
                       certificate. Only required when the server uses a self
                       signed certificate.
  --version            Print the GPULab client version number and exit.
  -h, --help           Show this message and exit.

Commands:
  cancel    Cancel running job
  clusters  Retrieve info about the available clusters
  debug     Retrieve a job's debug info. (Do not rely on the presence or
            format of this info. It will never be stable between versions. If
            this has the only source o info you need, ask the developrs to
            expose that info in a different way!)
  hold      Hold queued job(s). Status will change from QUEUED to ONHOLD
  jobs      Get info about one or more jobs
  log       Retrieve a job's log
  release   Release held job(s). Status will change from ONHOLD to QUEUED
  rm        Remove job
  submit    Submit a jobDefinition to run
  wait      Wait for a job to change state

To get a list of currently running jobs:

$ gpulab-cli --cert /home/me/my_wall2_login.pem jobs
TASK ID                              NAME                 COMMAND              CREATED                   USER         PROJECT         STATUS
7eaec798-ac49-11e9-93a1-cfd87533270a JupyterHub-singleuse ***                  2019-07-22T08:25:26+02:00 pbonte       Orca            RUNNING
ca38f0c4-ac23-11e9-93a1-1757063cb6eb JupyterHub-singleuse ***                  2019-07-22T03:55:32+02:00 ykuno        DeepBeamforming RUNNING
da112f92-ac1f-11e9-93a1-cbc72e039c23 JupyterHub-singleuse ***                  2019-07-22T03:27:21+02:00 ykuno        DeepBeamforming CANCELLED

Recommendation

The GPULab CLI also supports getting your login certificate information from an environment variable. This allows you to omit the --cert argument from all feature commands.

export GPULAB_CERT='/home/me/my_wall2_login.pem'
export GPULAB_DEV='False'

If you append these exports to ~/.bashrc you’ll never have to type them again!

To same command to get a list of currently running jobs is now much shorter:

gpulab-cli jobs

Note

Using the CLI without password

You can use the CLI without password. Be aware that this lowers security.

You need to install openssl to execute the commands below. On Debian, try:

sudo apt-get install openssl

The password is “stored” in the PEM file, because it is used to encrypt the private RSA key inside the PEM file. You can decrypt the RSA key and store it, to remove the password. Below, we assume that your (password protected) wall2 PEM file is in my_wall2_login.pem. The commands will create the file my_wall2_login_decrypted.pem which will not be password protected.

Use these commands:

openssl rsa -in my_wall2_login.pem > my_wall2_login_decrypted.pem
openssl x509 -in my_wall2_login.pem >> my_wall2_login_decrypted.pem

(The first command will ask your password, the second won’t)

Submitting a GPULab Job

A GPULab job is defined by a JSON job definition, which looks as follows:

my-first-jobDefinition.json
{
  "jobDefinition": {
    "name": "helloworld",
    "description": "Hello world!",
    "clusterId": 1,
    "dockerImage": "gpulab.ilabt.imec.be:5000/sample:nvidia-smi",
    "command": "",
    "resources": {
      "gpus": 1,
      "systemMemory": 2000,
      "cpuCores": 2
    },
    "jobDataLocations": [ ],
    "portMappings": [ ]
  }
}

To submit the job, you’ll have to specify the name of the project on the wall2 authority in which you want it to run. The command is:

$ gpulab-cli submit --project=myproject < my-first-jobDefinition.json
78125766-0b45-11e8-be1c-0fbd357c0b05

A hash representing the job ID is returned.

Getting information on a running job

You can query the status of this job using (an unique prefix of) the job ID:

$ gpulab-cli jobs 7812
          Job ID: 78125766-0b45-11e8-be1c-0fbd357c0b05
           Name: no name
        Project: fed4fire
       Username: wvdemeer
   Docker image: gpulab.ilabt.imec.be:5000/sample:nvidia-smi
        Command: -
         Status: FINISHED
        Created: 2018-02-06T13:56:02-07:00
      Worker ID: -
Worker hostname: 192.168.0.1
        Started: 2018-02-06T14:09:44-07:00
       Duration: 1 second
       Finished: 2018-02-06T14:09:45-07:00
       Deadline: 2018-02-07T00:09:44-07:00

Getting logs of a job

You can view the command line output of the job using log:

$ gpulab-cli log 7812
2018-02-06T14:09:45.185400608Z
2018-02-06T14:09:45.185451167Z ==============NVSMI LOG==============
2018-02-06T14:09:45.185459009Z
2018-02-06T14:09:45.185466771Z Timestamp                           : Tue Feb  6 14:09:45 2018
2018-02-06T14:09:45.185471972Z Driver Version                      : 390.12
2018-02-06T14:09:45.185477068Z
2018-02-06T14:09:45.185490896Z Attached GPUs                       : 1
2018-02-06T14:09:45.185612743Z GPU 00000000:02:00.0
2018-02-06T14:09:45.185713708Z     Product Name                    : GeForce GTX 580
2018-02-06T14:09:45.186226030Z     Product Brand                   : GeForc
...

Getting the GPULab event log of a job

You can view the internal event log of GPULab. This is mostly useful for debugging purposes. It can contain error messages thrown by the GPULab code which allow you to find the error in your jobDefinition, or to submit a bugreport.

$ gpulab-cli debug ffe249

2019-07-22T11:13:41+02:00 STATE CHANGED -> QUEUED
2019-07-22T11:13:45+02:00 STATE CHANGED -> STARTING
2019-07-22T11:13:45+02:00 LOG INFO: Running job on cluster 5 host slave5a (slave5a.wall2.ilabt.iminds.be) worker 0
2019-07-22T11:13:46+02:00 LOG INFO: Fetching latest version of image 'jupyter/minimal-notebook:latest' (no auth)
2019-07-22T11:13:48+02:00 LOG INFO: Fetched image 'jupyter/minimal-notebook:latest' with hash sha256:37c4f3f362331d1107fcd8ea8d16a8b6b8171ad437e8b933bd0d35bd98dea718
2019-07-22T11:13:48+02:00 LOG INFO: Launching JupyterHub-singleuser (image jupyter/minimal-notebook:latest with command ['start-notebook.sh', '--notebook-dir=/project/', '--NotebookApp.default_url=/lab']
2019-07-22T11:13:48+02:00 LOG DEBUG: OK: Project dir "/data/twalcari-test" exists on slave5a-f2333333333333333333@stable-cluster5
2019-07-22T11:13:48+02:00 LOG DEBUG: OK: Project dir "/data/twalcari-test" exists on slave5a-f2333333333333333333@stable-cluster5
2019-07-22T11:13:48+02:00 LOG DEBUG:    Mem_limit=2048m  (job requested 2048m)
2019-07-22T11:13:48+02:00 LOG DEBUG:    CPU ID restriction: None
2019-07-22T11:13:48+02:00 LOG DEBUG: Successfully setup SSH pubkey access step 1. ssh_username=D5MTPCUG
2019-07-22T11:13:49+02:00 LOG INFO: Started container 577de083df9fce36fef3a90573faa27b053641b39dc2421833986463831b385d
2019-07-22T11:13:49+02:00 LOG DEBUG: port_mapping_info={"8888/tcp": [{"HostIp": "0.0.0.0", "HostPort": "32851"}]}
2019-07-22T11:13:49+02:00 LOG DEBUG: ReadLogsThread for stdout stderr started
2019-07-22T11:13:49+02:00 LOG DEBUG: Successfully setup SSH pubkey access step 2. ssh_username=D5MTPCUG
2019-07-22T11:13:49+02:00 LOG DEBUG: Job run_details have been updated
2019-07-22T11:13:49+02:00 STATE CHANGED -> RUNNING
2019-07-22T11:14:08+02:00 LOG DEBUG: Docker container status is now "running". Job status is "JobState.STARTING".
2019-07-22T11:14:27+02:00 LOG DEBUG: Docker container status is now "running". Job status is "JobState.RUNNING".

Getting console access to a GPULab job

You can gain console access to a job by using the ssh-command of the gpulab-cli.

Example:

$ gpulab-cli ssh ffe249

Interactive jobs

When you’re preparing a job, you don’t always know the command you want to run yet, or the script you want to run is not yet ready. In these cases, you want to test things out inside a GPULab container, but don’t want to container to stop when the command stops. You can start a job with sleep 600 as command, and log in with gpulab-cli ssh, GPULab also has a specific command for this: gpulab-cli interactive.

gpulab-cli interactive requires you to specify the docker image, the duration and your project:

gpulab-cli interactive --project my-project --duration-minutes 10 --docker-image gpulab.ilabt.imec.be:5000/sample:nvidia-smi

This will start a job, which waits for 10 minutes before stopping. GPULab also automatically connects over ssh, so that you can run commands. Inside the job, you’ll find a file /running. If you delete it, the job will stop (you can also cancel the job using gpulab-cli cancel from another terminal). An example:

~ $ gpulab-cli --dev interactive --project my-project --duration-minutes 10 --docker-image gpulab.ilabt.imec.be:5000/sample:nvidia-smi
30c1247c-b1f3-11e9-93a1-2f2d78d6eb89
2019-07-29 13:22:46 +0200 -            - Waiting for Job to start running...
2019-07-29 13:22:48 +0200 - 2 seconds  - Job is in state STARTING
2019-07-29 13:22:50 +0200 - 4 seconds  - Job is in state RUNNING
2019-07-29 13:22:50 +0200 - 4 seconds  - Job is now running
root@86b0f7321a8f:/# ls
bin   dev  home  lib64  mnt  proc     root  running  srv  tmp  var
boot  etc  lib   media  opt  project  run   sbin     sys  usr
root@86b0f7321a8f:/# rm /running
root@86b0f7321a8f:/# Connection to n085-02.wall2.ilabt.iminds.be closed.
wim@tolstoy ~ $

If you log out of the ssh session (without removing /running), the job does not stop, and you can reconnect to it using gpulab-cli sh.

Check gpulab-cli interactive --help for options that let you specify the cluster ID, the number of CPUs and GPUs, the amount of memory, and more.