How to: Launching Containers From the Command Line - Data Science/Machine Learning Platform (DSMLP)


Overview


DSMLP jobs run as Docker containers. These are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware and each well isolated from other users’ processes. The Kubernetes container management/orchestration system routes users’ containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate. The dsmlp-login server acts as the front-end/submission node for the cluster; computation is handled elsewhere.

Access to the front-end/submission node


To start a container, SSH to dsmlp-login.ucsd.edu with your UC San Diego username, e.g., ssh username@dsmlp-login.ucsd.edu. 

Warning: Avoid Running Code on DSMLP-Login

dsmlp-login exists primarily to launch Docker images with the launch.sh command. Running launch.sh provisions dedicated nodes, where these images are used to create customized environments suited to various development workflows. Do not run jobs manually on dsmlp-login itself, such as Python scripts, Java projects, or machine learning tasks. Developing on the dedicated nodes instead of on dsmlp-login minimizes the impact on server performance.

Launching a Container


Starting a container

Note: You must be connected to the UCSD VPN to access your container.

SSH to dsmlp-login.ucsd.edu and run either of the following commands:

| Launch Script | Description | #GPU | #CPU | RAM (GB) |
|---|---|---|---|---|
| launch-scipy-ml.sh | See: Standard Datahub/DSMLP Containers | 0 | 2 | 8 |
| launch-scipy-ml.sh -g 1 | | 1 | 4 | 16 |

We encourage you not to use a GPU until your code is fully tested and a simple training run has succeeded. The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.

Example:

[dta001@dsmlp-login ~]$ launch-scipy-ml.sh
Mon Jun 27 15:49:13 PDT 2022 Submitting job dta001-16599
pod/dta001-16599 created
Mon Jun 27 15:49:13 PDT 2022 INFO starting up - pod status: Pending ; Successfully assigned dta001/dta001-16599 to its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:16 PDT 2022 INFO starting up - pod status: Pending ; Started container init-support
Mon Jun 27 15:49:18 PDT 2022 INFO pod assigned to node: its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:18 PDT 2022 INFO ucsdets/datascience-notebook:2020.2-stable is now active.
You may access your Jupyter notebook at:  http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN

Now paste http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN into your web browser. You should see the JupyterHub web app.

Launch script command line flag options

You can provide various command line options to the launch scripts to customize behavior. See "Launch Script Command Line Options" in How to Select and Configure Your Container on the Data Science/Machine Learning Platform (DSMLP). Here are some detailed examples:

Requesting resources for the container

You may request a specific number of CPUs and GPUs and a specific amount of RAM for the container using the -c #, -g #, and -m # flags. For example, to request 8 CPUs, 32 GB of RAM, and 1 GPU, run:

launch-scipy-ml.sh -c 8 -m 32 -g 1

To request a specific type of GPU use the -v flag, e.g. -v 1080ti. Refer to the DataHub status page for a list of GPU types.

Custom images

Use the -i flag to run a custom image, e.g. -i myrepo/myimage.

If the image isn't based on JupyterHub, use -e CMD to specify an alternate entrypoint, e.g. -i myrepo/myimage -e /entrypoint.sh. This runs /entrypoint.sh instead of jupyterhub.

Running in the background

Pods run in the foreground by default. If you disconnect from dsmlp-login the pod is terminated. Pods may be launched in the background to prevent this. Use the -b flag to launch a pod in the background. Background pods run for 6 hours by default.

To connect to the shell of the notebook use kubesh <pod-name>, e.g. kubesh dta001-16599 in the example above. If you forget the pod's name use kubectl get pod to list the names of all of your pods.

When you're finished with the notebook, use kubectl get pod to get the pod's name. Then use kubectl delete pod <pod-name> to terminate the pod.

Extending pod runtime

To support longer training runs, we permit background execution of student containers, up to 6 hours execution time, via the "-b" command line option.

Please be considerate and terminate idle containers: while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis. When attached to a container they become unusable by others even if completely idle.

By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs. This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" configuration variable. Please contact datahub@ucsd.edu if you need to run a pod for more than 12 hours.
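The 12-hour ceiling corresponds to 43200 seconds. The sketch below converts a number of hours into a K8S_TIMEOUT_SECONDS value and rejects requests above that ceiling. The function name hours_to_timeout_seconds is hypothetical, and the commented usage line assumes the launch script reads K8S_TIMEOUT_SECONDS from the environment, which this article does not confirm.

```shell
# Hypothetical helper: convert whole hours into a K8S_TIMEOUT_SECONDS value.
# Rejects anything outside the 1-12 hour range described above.
hours_to_timeout_seconds() {
  local hours="$1"
  case "$hours" in
    ''|*[!0-9]*|0*)
      echo "hours must be a positive whole number without leading zeros" >&2
      return 1
      ;;
  esac
  if [ "$hours" -gt 12 ]; then
    echo "more than 12 hours requires contacting datahub@ucsd.edu" >&2
    return 1
  fi
  echo $((hours * 3600))
}

hours_to_timeout_seconds 12   # prints 43200

# Assumed usage (environment variable, unconfirmed):
# K8S_TIMEOUT_SECONDS=$(hours_to_timeout_seconds 12) launch-scipy-ml.sh -b
```

Keeping the range check in the helper means an over-limit request fails before a pod is submitted.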

Pulling the most recent version of your image

If you are making changes to an image and want to pull its most recent version (e.g., from Docker Hub or ghcr.io), use the command line option "-P Always", e.g.:

launch.sh -i myrepo/myimage -P Always

Running on a specific node

Use -n # to run on a specific node. For example, to run on node 10 use -n 10. Refer to the DataHub status page for a list of nodes.

Workspaces

To start the container in a course-specific workspace (a directory where your course-specific files are stored), use the -W flag, e.g. -W DSC10_FA22_A00.

Using Visual Studio Code (VS Code)


It's not permitted to run VS Code directly on dsmlp-login since it uses a lot of resources. However, you can run VS Code inside of a container.

Generate SSH key pair

To get started, generate an SSH key pair on your computer. This can be done with the ssh-keygen command. Make a note of where the keys are stored.

SSH to dsmlp-login and append the public key to ~/.ssh/authorized_keys.

Check that the key pair works. Run "ssh -i /path/to/private.key USERNAME@dsmlp-login.ucsd.edu". You shouldn't be prompted for a login password, although you may be prompted for the passphrase of your private key.

Setup VS Code

Launch VS Code and install the Remote-SSH extension. For more information, refer to the tips and tricks article, but keep in mind the connection procedure is different. The required modifications to your SSH config file are detailed below.

Go to Remote Explorer in VS Code and select SSH targets. Click the gear icon and edit the config file in your home directory.

Create a new Host entry using the sample below. If you don't know the course ID for the -W flag, run "workspace --list" on dsmlp-login to retrieve it. If you are an independent study user, you may omit the -W flag to use your personal home directory. You must create a separate Host entry for each course; otherwise you may be unable to connect due to mismatched host keys.

# /home/USERNAME/.ssh/config
Host MYCOURSE
User USERNAME
ProxyCommand ssh -i /path/to/private.key USERNAME@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch.sh -W MYCOURSE -H -N vscode-dsmlp

The ProxyCommand is very important. Without this line VS Code will attempt to run directly on dsmlp-login, but this isn't permitted. ProxyCommand instructs SSH to run a script on dsmlp-login to start a container after the SSH connection is made.

The launch.sh script is the same one used to launch a notebook, but two options have been added: -H starts the SSH server in the container, and -N vscode-dsmlp gives the container a unique name, which prevents multiple VS Code servers from running.

You may also add other command line options, e.g. -c 4 -m 8 to start the container with 4 CPUs and 8 GB of RAM.

To use a GPU for a machine learning container, use /opt/launch-sh/bin/launch-scipy-ml.sh (instead of launch.sh) with command line option -g 1. 

If your VS Code connection drops within a minute during a memory-intensive task, make sure the ProxyCommand includes a "-m #" argument, where # is the amount of RAM in GB.
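Because each course needs its own Host entry, generating the entries from one template helps keep them consistent. The sketch below prints an entry in the same shape as the sample above. The function name host_entry and its argument order are hypothetical; the ProxyCommand text and the -c 4 -m 8 options are the ones given in this section.

```shell
# Hypothetical helper: print an SSH config Host entry for one course workspace.
# Arguments: COURSE USERNAME KEY_PATH [extra launch options, e.g. "-c 4 -m 8"]
host_entry() {
  local course="$1"
  local user="$2"
  local key="$3"
  local extra="${4:-}"
  printf 'Host %s\n' "$course"
  printf 'User %s\n' "$user"
  # ${extra:+ $extra} adds a leading space only when extra options are given.
  printf 'ProxyCommand ssh -i %s %s@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch.sh -W %s -H -N vscode-dsmlp%s\n' \
    "$key" "$user" "$course" "${extra:+ $extra}"
}

# Print an entry for the workspace DSC10_FA22_A00 with 4 CPUs and 8 GB of RAM.
host_entry DSC10_FA22_A00 USERNAME /path/to/private.key "-c 4 -m 8"
```

Review the printed entry before appending it to the config file in your home directory.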

Finally, right-click the SSH target and connect to the server. Click "details" in the lower right part of the screen to monitor the connection process. If you have problems, please send these logs, along with your Host entry, to datahub@ucsd.edu.

Public Key Authentication

If you use SSH frequently, you may want to set up an SSH key between your PC and the dsmlp-login server so that you can log in without entering a password each time. To do this, run the ssh-keygen command on your local machine. The command prompts for the name of the key file and a passphrase, then creates a public/private key pair. Press Enter at a prompt to accept the default (shown in parentheses), or type a file location or passphrase if desired.

To copy the public key to dsmlp-login, run the following command on the local machine:

cat ~/.ssh/id_rsa.pub | ssh user@dsmlp-login.ucsd.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"

To verify that the key works, log in to dsmlp-login.ucsd.edu. If you are not prompted for a password during login, the SSH key is valid. If you set a passphrase when creating the key, you will be prompted for that passphrase instead.

Additional Information


Port Forwarding

You can use SSH to forward a local port into a pod. Run this command on your desktop PC.

ssh -N -L LOCALPORT:dsmlp-login.ucsd.edu:REMOTEPORT username@dsmlp-login.ucsd.edu

Example

The launch script provides a URL like the one below. It has forwarded port 14244 on dsmlp-login to the Jupyter notebook.

http://dsmlp-login.ucsd.edu:14244/user/username/?token=TOKEN

Run:

ssh -N -L 8888:dsmlp-login.ucsd.edu:14244 username@dsmlp-login.ucsd.edu

Now you can access JupyterHub at http://localhost:8888.
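The remote port changes with each launch, so it can be convenient to derive the forwarding command from the URL that the launch script prints. The sketch below does this with shell parameter expansion only. The function name forward_cmd is hypothetical, and it prints the ssh command rather than running it.

```shell
# Hypothetical helper: print the ssh port-forwarding command for a launch URL.
# Arguments: URL LOCALPORT USERNAME
forward_cmd() {
  local url="$1"
  local local_port="$2"
  local user="$3"
  local hostport="${url#*://}"        # drop the scheme
  hostport="${hostport%%/*}"          # drop the path, leaving host:port
  local host="${hostport%%:*}"
  local remote_port="${hostport##*:}"
  if [ "$host" = "$hostport" ]; then
    echo "no port found in URL: $url" >&2
    return 1
  fi
  echo "ssh -N -L ${local_port}:${host}:${remote_port} ${user}@${host}"
}

forward_cmd "http://dsmlp-login.ucsd.edu:14244/user/username/?token=TOKEN" 8888 username
# prints: ssh -N -L 8888:dsmlp-login.ucsd.edu:14244 username@dsmlp-login.ucsd.edu
```

Run the printed command on your desktop PC, not on dsmlp-login.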

TensorBoard

To access the TensorBoard dashboard set the shell variable "IDENTITY_PROXY_PORTS=1" before launching the container. 

Example

[user@dsmlp-login]:~:598$ IDENTITY_PROXY_PORTS=1 launch-scipy-ml.sh -g 1

After the container launches look for the line "Identity port map 1: Container port 12345 mapped to dsmlp-login.ucsd.edu:12345". Make a note of the port number and use it to start TensorBoard.

Go to the Jupyter URL and open a terminal with "New -> Terminal" and run "tensorboard --logdir logs --bind_all --port 12345" to start TensorBoard.

Now you should be able to access the TensorBoard dashboard at http://dsmlp-login.ucsd.edu:12345.
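If you save the launch output to a file, the mapped port can be extracted rather than copied by hand. The sketch below assumes the output line has exactly the format quoted above. The function name identity_port is hypothetical.

```shell
# Hypothetical helper: read launch-script output on stdin and print the
# first container port reported on an "Identity port map" line.
identity_port() {
  sed -n 's/.*Identity port map [0-9]*: Container port \([0-9]*\) mapped to .*/\1/p' | head -n 1
}

# Sample line in the format quoted above.
line='Identity port map 1: Container port 12345 mapped to dsmlp-login.ucsd.edu:12345'
port="$(printf '%s\n' "$line" | identity_port)"

# Print the TensorBoard command to run in the Jupyter terminal.
echo "tensorboard --logdir logs --bind_all --port $port"
# prints: tensorboard --logdir logs --bind_all --port 12345
```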

Monitoring Resource Usage within Jupyter/Python Notebooks

Users of the stock containers will find CPU/Memory/GPU utilization noted at the top of the Jupyter notebook screen:

Screenshot: Jupyter Notebook top page

Monitoring Resource Usage within Containers

Users of the bash command line can view the CPU/RAM usage of their pod with the `htop` command. To see GPU usage, run the `nvidia-smi` command.

Monitoring Cluster Status

DSMLP cluster status is available at https://datahub.ucsd.edu/hub/status (requires login). Aggregate cluster status is available from the DSMLP Cluster Status tab inside a running Jupyter notebook.

GPU Wait Times

If no GPUs are available, you may have to wait (5-10 minutes on average) for the launch script to finish and for your container to become usable. If GPU availability is limited and you need to run multiple commands, do not run your launch scripts with the "-f" flag, which kills the container after running the given command. Instead, omit the flag so that you can run your commands in the container's shell.

Container Termination Messages

Containers may occasionally exit with one of the following error messages:

OOMKilled

Container memory (CPU RAM) limit was reached.

DeadlineExceeded

Container time limit (default 6 hours) exceeded - see above.

Error

Unspecified error. Contact ITS/ETS for assistance.

These errors appear in the STATUS column of the "kubectl get pods" output.
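The messages above can be turned into a quick lookup. The sketch below maps a status string to the explanation given in this section. The function name explain_status is hypothetical, and the commented loop assumes the default kubectl column order (NAME, READY, STATUS).

```shell
# Hypothetical helper: explain a container termination status.
explain_status() {
  case "$1" in
    OOMKilled)
      echo "Container memory (CPU RAM) limit was reached." ;;
    DeadlineExceeded)
      echo "Container time limit (default 6 hours) exceeded." ;;
    Error)
      echo "Unspecified error. Contact ITS/ETS for assistance." ;;
    *)
      echo "No termination note for status: $1" ;;
  esac
}

explain_status OOMKilled

# Assumed usage on dsmlp-login, one line per pod:
# kubectl get pods --no-headers | while read -r name ready status rest; do
#   echo "$name: $(explain_status "$status")"
# done
```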

Your instructor or TA will be your best resource for course-specific questions. If you still have questions or need additional assistance, please email dsmlp@ucsd.edu or visit support.ucsd.edu.

Instructors: if your course uses dsmlp-login, ITS/ETS will provide you with login information for Instructor, TA, and student-test accounts for the courses.