How to: Launching Containers From the Command Line - Data Science/Machine Learning Platform (DSMLP)


Overview


DSMLP jobs are executed in the form of Docker containers. These are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Kubernetes container management/orchestration system routes users’ containers onto compute cluster nodes, monitors performance, and applies resource limits/quotas as appropriate. The dsmlp-login.ucsd.edu host acts as the front-end/submission node for the cluster; computation is handled elsewhere.

Access to the front-end/submission node


To start a container from the command line (instead of datahub.ucsd.edu), see "Starting a container" below.

Warning: Avoid Running Code on the dsmlp-login host

dsmlp-login.ucsd.edu lets you run Docker images on dedicated cluster nodes via the launch.sh command, creating customized environments suitable for various development workflows. Running jobs directly on dsmlp-login.ucsd.edu (for example Python scripts, Java projects, or machine learning tasks) is prohibited, to minimize impact on server performance. Please use the launch.sh command instead.

Launching a Container


Starting a container

Note: You must be connected to the UCSD VPN to access your container via a web interface after launching it.

1) Open a command/terminal window on your computer and ssh to dsmlp-login.ucsd.edu with your UC San Diego username:

ssh username@dsmlp-login.ucsd.edu

2) Run either of the following commands to receive the resources noted:

Launch Script | Description | #GPU | #CPU | RAM (GB)
launch.sh | See: Standard Datahub/DSMLP Containers | 0 | 2 | 8
launch.sh -g 1 | | 1 | 4 | 16

(We encourage you not to use a GPU until your code is fully tested and a simple training run is successful. The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.)

Example:

[dta001@dsmlp-login ~]$ launch.sh
Mon Jun 27 15:49:13 PDT 2022 Submitting job dta001-16599
pod/dta001-16599 created
Mon Jun 27 15:49:13 PDT 2022 INFO starting up - pod status: Pending ; Successfully assigned dta001/dta001-16599 to its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:16 PDT 2022 INFO starting up - pod status: Pending ; Started container init-support
Mon Jun 27 15:49:18 PDT 2022 INFO pod assigned to node: its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:18 PDT 2022 INFO ucsdets/datascience-notebook:2020.2-stable is now active.
You may access your Jupyter notebook at:  http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN

Now paste http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN into your web browser. You should see the JupyterHub web app.

Launch script command line flag options

You can provide various command line options to the launch scripts to customize behavior. Run "launch.sh -h", or see "Launch Script Command Line Options" in How to Select and Configure Your Container on the Data Science/Machine Learning Platform (DSMLP). Here are some detailed examples:

Requesting resources for the container

You may request a specific number of CPUs and GPUs, and a specific amount of RAM in GB, for the container using the -c #, -g #, and -m # flags respectively. For example, to request 8 CPUs, 32 GB of RAM, and 1 GPU, run:

launch.sh -c 8 -m 32 -g 1

To request a specific type of GPU use the -v flag, e.g. -v 1080ti. Refer to the DataHub status page for a list of GPU types.
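If you launch with the same resource settings often, a small helper can assemble the command line for you. The following is a minimal sketch, not part of the DSMLP tooling: the function name build_launch_cmd is hypothetical, and it only prints the command so you can review it before running it. It uses only the -c, -m, -g, and -v flags described above.

```shell
# Hypothetical helper: print a launch.sh command line for the given resources.
# $1 = CPUs, $2 = RAM in GB, $3 = GPUs, $4 = GPU type (optional)
build_launch_cmd() {
    cmd="launch.sh -c $1 -m $2 -g $3"
    # Add -v only when a GPU type was requested.
    if [ -n "${4:-}" ]; then
        cmd="$cmd -v $4"
    fi
    echo "$cmd"
}

build_launch_cmd 8 32 1          # prints: launch.sh -c 8 -m 32 -g 1
build_launch_cmd 8 32 1 1080ti   # prints: launch.sh -c 8 -m 32 -g 1 -v 1080ti
```

Printing instead of executing keeps the sketch safe to try anywhere; on dsmlp-login you would copy the printed line and run it.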

Custom images

Use the -i flag to run a custom image, e.g. launch.sh -i myrepo/myimage

If the image isn't based on JupyterHub use -e CMD to use an alternate entrypoint, e.g. -i myrepo/myimage -e /entrypoint.sh. This runs /entrypoint.sh instead of jupyterhub.

Alternatively, run with "-s" to launch only a shell terminal, inhibiting launch of Jupyter.

Running in the background

Pods run in the foreground by default, so if you disconnect from dsmlp-login the pod is terminated. To prevent this, use the -b flag to launch the pod in the background; the launch script monitors launch progress until the pod is scheduled. Background pods run for 6 hours by default.

To connect to the shell of the notebook use kubesh <pod-name>, e.g. kubesh dta001-16599 in the example above. If you forget the pod's name use kubectl get pod to list the names of all of your pods.

When you're finished with the notebook, use kubectl get pod to get the pod's name. Then use kubectl delete pod <pod-name> to terminate the pod.
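The lookup step above can be shortened with a small helper. This is a sketch under assumptions: the function name my_pod_names is hypothetical, and it relies on the standard kubectl option "-o name", which prints each pod as "pod/<pod-name>".

```shell
# Hypothetical helper: print the bare names of your pods, one per line.
# "kubectl get pod -o name" prints entries such as "pod/dta001-16599".
my_pod_names() {
    kubectl get pod -o name | sed 's|^pod/||'
}

# Typical use on dsmlp-login (pod name taken from the example above):
#   my_pod_names                      # find the pod name
#   kubesh dta001-16599               # open a shell in the pod
#   kubectl delete pod dta001-16599   # terminate the pod when finished
```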

Batch or non-interactive jobs

Unattended jobs may be launched via the "-B" command line option; you must specify a script or program to be executed within the container.  Use the "--" option to separate 'launch.sh' options such as "-g 1" from those of your command:

[user@dsmlp-login]:~:611$ launch.sh -g 1 -B -- python ./f2.py
Wed Mar 27 14:45:46 PDT 2024 INFO job was successfully submitted
Please remember to shut down via: "kubectl delete pod user-455" ; "kubectl get pods" to list running pods.
You may retrieve output from your pod via: "kubectl logs user-455"

Pod status ("kubectl get pods") will remain "Pending" until resources become available and the job is scheduled on a node. Run "kubectl describe pod <pod-name>" for more detailed messages regarding scheduling or execution.
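Rather than re-running "kubectl get pods" by hand, you can poll until the pod leaves the Pending state. The following is a minimal sketch: the function name wait_until_scheduled and the 10-second default interval are assumptions, while ".status.phase" is the standard Kubernetes pod field that reports Pending, Running, Succeeded, or Failed.

```shell
# Hypothetical helper: wait until a pod is no longer Pending, then print its phase.
# $1 = pod name, $2 = seconds between checks (optional, default 10)
wait_until_scheduled() {
    local phase
    while :; do
        # Ask Kubernetes for the pod phase; stop if the lookup itself fails.
        phase=$(kubectl get pod "$1" -o jsonpath='{.status.phase}') || return 1
        if [ "$phase" != "Pending" ]; then
            break
        fi
        sleep "${2:-10}"
    done
    echo "$phase"
}

# Usage on dsmlp-login, with the pod name from the example above:
#   wait_until_scheduled user-455
```

If the pod stays Pending for a long time, "kubectl describe pod <pod-name>" remains the place to look for the reason.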

Review job output via "kubectl logs <pod-name>", or redirect your command to a file within your container:

[user@dsmlp-login]:~:616$ launch.sh -g 1 -B -- bash -c 'python ./f2.py > output.txt'
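Note that "> output.txt" captures only standard output; error messages such as Python tracebacks are written to standard error. The sketch below shows the standard shell idiom "2>&1" for sending both streams to the same file. It uses a stand-in command instead of launch.sh so it can be tried anywhere; the file and directory names are illustrative.

```shell
# Work in a scratch directory so the demo file does not clutter your home directory.
cd "$(mktemp -d)"

# Stand-in for a job: one line to standard output, one to standard error.
# "> output.txt 2>&1" sends both streams to output.txt.
bash -c '{ echo "epoch 1 done"; echo "warning: slow" >&2; } > output.txt 2>&1'
cat output.txt

# The same redirection applied to the batch example above would be:
#   launch.sh -g 1 -B -- bash -c 'python ./f2.py > output.txt 2>&1'
```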

Extending pod runtime

To support longer training runs, we permit background execution of student containers, up to 6 hours execution time, via the "-b" command line option.

Please be considerate and terminate idle containers: while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s GPU cards are assigned to users on an exclusive basis. When attached to a container they become unusable by others even if completely idle.

By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs. This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" environment variable prior to launching your job. Please contact datahub@ucsd.edu if you need to run a pod for more than 12 hours.
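As a sketch of raising the limit to the 12-hour maximum, assuming (as the variable name suggests) that the value is given in seconds; the HOURS variable is only for illustration:

```shell
# Assumption: K8S_TIMEOUT_SECONDS is a number of seconds.
HOURS=12
export K8S_TIMEOUT_SECONDS=$((HOURS * 3600))
echo "$K8S_TIMEOUT_SECONDS"   # prints 43200

# Then launch as usual from the same shell session, e.g.:
#   launch.sh -g 1 -b
```

Because the variable is exported, it applies to every launch from that shell session until you change or unset it.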

Pulling the most recent version of your image

If you are making changes to an image and want to pull the most recent version of it (e.g., from dockerhub or ghcr.io) please use the command line option: "-P Always", e.g.:

launch.sh -i myrepo/myimage -P Always

Running on a specific node

Use -n # to run on a specific node. For example, to run on node 10 use -n 10. Refer to the DataHub status page for a list of nodes.

Workspaces

To start the container in a course-specific workspace (a directory where your course-specific files are stored), use the -W flag, e.g. -W DSC10_FA22_A00.

Using Visual Studio Code (VS Code)


It's not permitted to run VS Code directly on dsmlp-login since it consumes significant CPU/RAM resources. However, you can run VS Code inside of a container, and connect to that container from your personal computer.

Generate SSH key pair

To get started, generate an SSH key pair on your computer. This can be done with the ssh-keygen command. Make a note of where the keys are stored.

ssh to dsmlp-login and append the public key to ~/.ssh/authorized_keys.

Check that the key pair works. Run "ssh -i /path/to/private.key USERNAME@dsmlp-login.ucsd.edu". You shouldn't be prompted for a login password; however, you may be prompted to enter the passphrase for your private key.

Setup VS Code

Launch VS Code and install the Remote-SSH extension. For more information, refer to the tips and tricks article, but keep in mind the connection procedure is different. The required modifications to your SSH config file are detailed below.

Go to Remote Explorer in VS Code and select SSH targets. Click the gear icon and edit the config file in your home directory.

Create a new Host entry using the sample below. If you don't know the course ID for the -W flag, run "workspace --list" on dsmlp-login to retrieve it. If you are an independent study user, you may omit the -W flag to use your personal home directory. You must create a separate Host entry for each course; otherwise you may be unable to connect due to mismatched host keys.

# /home/USERNAME/.ssh/config
Host MYCOURSE
User USERNAME
ProxyCommand ssh -i /path/to/private.key USERNAME@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch.sh -W MYCOURSE -H -N vscode-dsmlp

The ProxyCommand is very important. Without this line VS Code will attempt to run directly on dsmlp-login, but this isn't permitted. ProxyCommand instructs SSH to run a script on dsmlp-login to start a container after the SSH connection is made.

You may notice the launch.sh script is the same one used to launch a notebook; however, two options have been added: -H starts the SSH server in the container, and -N vscode-dsmlp gives the container a unique name. This prevents multiple VS Code servers from running.

You may also add other command line options, e.g. -c 4 -m 8 to start the container with 4 CPUs and 8 GB of RAM.

To use a GPU for a machine learning container, use /opt/launch-sh/bin/launch-scipy-ml.sh (instead of launch.sh) with command line option -g 1. 

If your VS Code connection drops within a minute during a memory-intensive task, make sure you are including a "-m #" argument (RAM in GB) in the ProxyCommand.

Finally, right click the SSH target and connect to the server. Click "details" in the lower right part of the screen to monitor the connection process. If you have problems, please send these logs, along with your Host entry, to datahub@ucsd.edu.
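Since each course needs its own Host entry, it can be convenient to generate the entries rather than type them. The following is a minimal sketch that follows the sample entry above: the function name host_entry is hypothetical, and appending extra options after -N vscode-dsmlp is an assumption about option order. It only prints the entry; review it before adding it to your SSH config file.

```shell
# Hypothetical helper: print an SSH config Host entry for one course.
# $1 = course ID, $2 = username, $3 = private key path,
# $4 = extra launch.sh options (optional)
host_entry() {
    cat <<EOF
Host $1
User $2
ProxyCommand ssh -i $3 $2@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch.sh -W $1 -H -N vscode-dsmlp${4:+ $4}
EOF
}

# Example: an entry that requests 4 CPUs and 8 GB of RAM.
host_entry DSC10_FA22_A00 USERNAME /path/to/private.key "-c 4 -m 8"
```

To save an entry, redirect the output, e.g. host_entry ... >> ~/.ssh/config on your own computer.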

Public Key Authentication

If you use SSH frequently, you may want to set up an SSH key between your PC and the dsmlp-login server so that you can log in without entering a password each time. To do this, run the ssh-keygen command on your local machine. It prompts for a file in which to save the key pair and for a passphrase. Press 'Enter' at each prompt to accept the default (shown in parentheses), or enter a file location/passphrase if desired. The command then creates the private and public keys.

To install the public key on dsmlp-login, run the following command on the local machine:

cat ~/.ssh/id_rsa.pub | ssh user@dsmlp-login.ucsd.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"

To verify that the key works, log in to dsmlp-login.ucsd.edu. If you are not prompted for a password during login, the SSH key is valid. If you set a passphrase when creating the key, you will be prompted to enter it.

Additional Information


Port Forwarding

You can use SSH to forward a local port into a pod. Run this command on your desktop PC.

ssh -N -L LOCALPORT:dsmlp-login.ucsd.edu:REMOTEPORT username@dsmlp-login.ucsd.edu

Example

The launch script provides a URL like this. It has forwarded port 14244 on dsmlp-login to the Jupyter notebook.

http://dsmlp-login.ucsd.edu:14244/user/username/?token=TOKEN

Run:

ssh -N -L 8888:dsmlp-login.ucsd.edu:14244 username@dsmlp-login.ucsd.edu

Now you can access JupyterHub at http://localhost:8888.
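The remote port changes from launch to launch, so it can help to derive the forwarding command from the URL the launch script printed. This is a sketch only: the function name forward_cmd is hypothetical, and it prints the ssh command for review instead of running it.

```shell
# Hypothetical helper: print the ssh port-forwarding command for a Jupyter URL.
# $1 = URL printed by the launch script, $2 = local port, $3 = username
forward_cmd() {
    local remote_port
    # Pull the port number out of "http://host:PORT/...".
    remote_port=$(printf '%s\n' "$1" | sed -n 's|^https\{0,1\}://[^/:]*:\([0-9][0-9]*\)/.*$|\1|p')
    if [ -z "$remote_port" ]; then
        echo "no port found in URL: $1" >&2
        return 1
    fi
    echo "ssh -N -L $2:dsmlp-login.ucsd.edu:${remote_port} $3@dsmlp-login.ucsd.edu"
}

forward_cmd 'http://dsmlp-login.ucsd.edu:14244/user/username/?token=TOKEN' 8888 username
# prints: ssh -N -L 8888:dsmlp-login.ucsd.edu:14244 username@dsmlp-login.ucsd.edu
```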

TensorBoard

To access the TensorBoard dashboard set the shell variable "IDENTITY_PROXY_PORTS=1" before launching the container. 

Example

[user@dsmlp-login]:~:598$ IDENTITY_PROXY_PORTS=1 launch-scipy-ml.sh -g 1

After the container launches look for the line "Identity port map 1: Container port 12345 mapped to dsmlp-login.ucsd.edu:12345". Make a note of the port number and use it to start TensorBoard.

Go to the Jupyter URL and open a terminal with "New -> Terminal" and run "tensorboard --logdir logs --bind_all --port 12345" to start TensorBoard.

Now you should be able to access the TensorBoard dashboard at http://dsmlp-login.ucsd.edu:12345.
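If you have saved the launch output, the mapped port can be extracted instead of read by eye. The sketch below assumes the "Identity port map" line has exactly the format quoted above; the function name identity_port is hypothetical.

```shell
# Hypothetical helper: read launch output on standard input and print the
# first mapped container port.
identity_port() {
    sed -n 's/^.*Identity port map [0-9]*: Container port \([0-9][0-9]*\) mapped to .*$/\1/p' | head -n 1
}

# Example using the line quoted above.
line='Identity port map 1: Container port 12345 mapped to dsmlp-login.ucsd.edu:12345'
port=$(printf '%s\n' "$line" | identity_port)
echo "tensorboard --logdir logs --bind_all --port ${port}"
# prints: tensorboard --logdir logs --bind_all --port 12345
```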

Monitoring Resource Usage within Jupyter/Python Notebooks

Users of the stock containers will find CPU/Memory/GPU utilization noted at the top of the Jupyter notebook screen:

Screenshot: Jupyter Notebook top page

Monitoring Resource Usage within Containers

Users of the bash command line can find the CPU/RAM usage of their pod by using the `htop` command. To see GPU usage, run the `nvidia-smi` command.

Monitoring Cluster Status

DSMLP cluster status is available at: https://datahub.ucsd.edu/hub/status (requires login). Aggregate cluster status is available from the DSMLP Cluster Status tab inside a running Jupyter notebook.

GPU Wait Times

If no GPUs are available, you may have to wait some time (on average 5-10 minutes) for the launch script to finish and for your container to become usable. If GPU availability is limited and you need to run multiple commands, do not run your launch scripts with the "-f" flag (which kills the container after running the given command). Instead, omit the flag so you can run your commands in the container's shell.

Container Termination Messages

Containers may occasionally exit with one of the following error messages:

OOMKilled: Container memory (CPU RAM) limit was reached.

DeadlineExceeded: Container time limit (default 6 hours) exceeded - see above.

Error: Unspecified error. Contact ITS/ETS for assistance.

These errors will show up in 'kubectl get pods' in the status column.
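The messages above can be turned into a quick lookup for use at the command line. This is an illustrative sketch: the function name explain_status is hypothetical, and the suggested remedies are inferences from the options described earlier in this article (-m and K8S_TIMEOUT_SECONDS), not official guidance.

```shell
# Hypothetical helper: explain a value from the status column of "kubectl get pods".
explain_status() {
    case "$1" in
        OOMKilled)        echo "memory limit reached; consider a larger -m value at launch" ;;
        DeadlineExceeded) echo "time limit exceeded; see K8S_TIMEOUT_SECONDS" ;;
        Error)            echo "unspecified error; contact ITS/ETS for assistance" ;;
        *)                echo "not a termination message covered here" ;;
    esac
}

explain_status OOMKilled
# prints: memory limit reached; consider a larger -m value at launch
```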

Your instructor or TA will be your best resource for course-specific questions. If you still have questions or need additional assistance, please email datahub@ucsd.edu or visit support.ucsd.edu.