How to: Launching Containers From the Command Line - Data Science/Machine Learning Platform (DSMLP)


Overview


DSMLP jobs are executed in the form of Docker containers. These are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Kubernetes container management/orchestration system routes users’ containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate. The login systems described below act as front-end/submission nodes for our cluster; computation is handled elsewhere.

 

Access to the front-end/submission node


To start a container, SSH to dsmlp-login.ucsd.edu with your UC San Diego username, e.g., ssh username@dsmlp-login.ucsd.edu. 

NOTE: If you are using Visual Studio Code or launching a container with a Jupyter notebook after you log into dsmlp-login, you will need to be on the UC San Diego campus network for this to work.  If you are off-campus, please connect to the VPN before you ssh to dsmlp-login. 

Students should log in to the front-end nodes using either their UCSD email username (e.g. ‘jsmith’) or, in some cases, a course-specific account, e.g. ‘cs253wXX’ for CSE253, Winter 2018. Consult the ITS/ETS Account Lookup Tool for instructions on activating course-specific accounts. UCSD Extension/Concurrent Enrollment students: see Extension for a course account token, then complete the ITS/ETS Concurrent Enrollment Computer Account form.

Public Key Authentication

If you use SSH frequently, you may want to set up an SSH key between your PC and the dsmlp-login server so that you can log in without entering a password each time. To do this, run the ssh-keygen command on your local machine. The command creates a public/private key pair and prompts for the name of the key file and a passphrase. Press 'Enter' at each prompt to accept the default (shown in parentheses), or type a file location/passphrase if desired.

To put the key onto dsmlp-login, enter the following command on the local machine: cat ~/.ssh/id_rsa.pub | ssh user@dsmlp-login.ucsd.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys". To verify that the key works, log in to dsmlp-login.ucsd.edu. If you are not prompted for a password during login, the SSH key is valid. If a passphrase was set when creating the SSH key, you will be prompted to enter the passphrase instead.
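The two steps above can be sketched as a small shell helper. This is a minimal sketch, not an official DSMLP script: the function name print_key_setup and the placeholder username are hypothetical, and the default key path ~/.ssh/id_rsa is assumed from the command above. The helper only prints the commands so they can be reviewed before being run on the local machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the two key-setup commands for review.
# It does not run them. Arguments: username, optional private-key path.
print_key_setup() {
    local user="$1"
    local key="${2:-$HOME/.ssh/id_rsa}"   # assumed default key location
    local host="dsmlp-login.ucsd.edu"

    # Step 1: create the key pair (ssh-keygen prompts for a passphrase).
    printf 'ssh-keygen -f %s\n' "$key"
    # Step 2: append the public key to authorized_keys on the login node.
    printf 'cat %s.pub | ssh %s@%s "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"\n' \
        "$key" "$user" "$host"
}

print_key_setup username
```

Replace username with your own account name, then copy the printed commands into a terminal one at a time.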

 

Launching a Data Science or Machine Learning Container


Starting a container

SSH to dsmlp-login.ucsd.edu and run either of the following commands:

Launch Script           | Description                            | #GPU | #CPU | RAM (GB)
launch-scipy-ml.sh      | See: Standard Datahub/DSMLP Containers | 0    | 2    | 8
launch-scipy-ml-gpu.sh  |                                        | 1    | 4    | 16

We encourage you not to use a GPU until your code is fully tested and a simple training run is successful. The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.

Example:

[dta001@dsmlp-login ~]$ launch-datascience.sh
Mon Jun 27 15:49:13 PDT 2022 Submitting job dta001-16599
pod/dta001-16599 created
Mon Jun 27 15:49:13 PDT 2022 INFO starting up - pod status: Pending ; Successfully assigned dta001/dta001-16599 to its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:16 PDT 2022 INFO starting up - pod status: Pending ; Started container init-support
Mon Jun 27 15:49:18 PDT 2022 INFO pod assigned to node: its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:18 PDT 2022 INFO ucsdets/datascience-notebook:2020.2-stable is now active.
You may access your Jupyter notebook at:  http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN

Now paste http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN into your web browser. You should see the JupyterHub web app.

NOTE: You must be connected to the UCSD VPN to access the Jupyter notebook.
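If the launch output is saved to a file or scrolls past quickly, the notebook URL can be pulled out with a short filter. This is a minimal sketch: the function name extract_notebook_url is hypothetical, and it assumes only that the URL appears in the output as shown in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads launch-script output on stdin and prints the
# first http(s) URL it finds.
extract_notebook_url() {
    grep -Eo 'https?://[^[:space:]]+' | head -n 1
}

# Sample line copied from the example output above.
sample='You may access your Jupyter notebook at:  http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN'
printf '%s\n' "$sample" | extract_notebook_url
```

For the sample line, this prints http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN.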

Launch script command line flag options

You can provide various command line options to the launch scripts to customize behavior.  See "Launch Script Command Line Options" in How to Select and Configure Your Container on the Data Science/Machine Learning Platform (DSMLP).

Requesting resources for the container

You may request a specific number of CPUs and GPUs and a specific amount of RAM (in GB) for the container using the -c #, -g #, and -m # flags, respectively. For example, to use 8 CPUs, 32 GB RAM, and 1 GPU, run launch-scipy-ml.sh -c 8 -m 32 -g 1.

To request a specific type of GPU use the -v flag, e.g. -v 1080ti. Refer to the DataHub status page for a list of GPU types.
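The resource flags can be assembled with a small helper, which may be convenient when the same request is reused across sessions. This is a sketch only: the function name build_launch_cmd is hypothetical, and it uses just the -c, -m, -g, and -v flags described above. It prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a launch command for the requested resources.
# Arguments: script name, CPUs, RAM in GB, GPUs, optional GPU type.
build_launch_cmd() {
    local script="$1"
    local cpus="$2"
    local ram_gb="$3"
    local gpus="$4"
    local gpu_type="${5:-}"
    local cmd="$script -c $cpus -m $ram_gb -g $gpus"
    # Only add -v when a GPU type was given.
    if [ -n "$gpu_type" ]; then
        cmd="$cmd -v $gpu_type"
    fi
    printf '%s\n' "$cmd"
}

build_launch_cmd launch-scipy-ml.sh 8 32 1          # the example above
build_launch_cmd launch-scipy-ml.sh 8 32 1 1080ti   # same request on a 1080ti
```

The first call prints launch-scipy-ml.sh -c 8 -m 32 -g 1, and the second appends -v 1080ti.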

Running in the background

Pods run in the foreground by default. If you disconnect from dsmlp-login the pod is terminated. Pods may be launched in the background to prevent this. Use the -b flag to launch a pod in the background. Background pods run for 6 hours by default.

To connect to the shell of the notebook use kubesh <pod-name>, e.g. kubesh dta001-16599 in the example above. If you forget the pod's name use kubectl get pod to list the names of all of your pods.

When you're finished with the notebook, use kubectl get pod to get the pod's name. Then use kubectl delete pod <pod-name> to terminate the pod.
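The background-pod workflow above can be sketched as the following sequence. This is an illustration, not an official script: the run wrapper and the DRY_RUN variable are hypothetical additions that make the sequence safe to try anywhere, and the pod name is taken from the earlier example. With DRY_RUN=1 (the default here) each command is only printed.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: with DRY_RUN=1 it only prints each command;
# set DRY_RUN=0 on dsmlp-login to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

POD="dta001-16599"               # pod name from the example above; yours will differ

run launch-scipy-ml.sh -b        # start the pod in the background
run kubectl get pod              # list your pods to find the pod name
run kubesh "$POD"                # open a shell inside the pod
run kubectl delete pod "$POD"    # terminate the pod when finished
```

In practice these commands are usually typed one at a time, since the pod name is only known after the launch script has submitted the job.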

Extending pod runtime

To support longer training runs, we permit background execution of student containers, up to 6 hours execution time, via the "-b" command line option.  

Please be considerate and terminate idle containers:  while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis. When attached to a container they become unusable by others even if completely idle.

By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs.  This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" configuration variable.  Please contact datahub@ucsd.edu if you need to run a pod for more than 12 hours.
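The limit is expressed in seconds, so 12 hours corresponds to 12 × 3600 = 43200. The sketch below computes the value and rejects anything above the 12-hour maximum. The function name timeout_seconds is hypothetical, and this article does not say how the variable is applied: exporting it in the shell before running a launch script is an assumption, and it may instead need to be changed in a copy of the launch script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: converts hours to seconds, refusing values above the
# 12-hour maximum that can be set without contacting support.
timeout_seconds() {
    local hours="$1"
    if [ "$hours" -gt 12 ]; then
        echo "more than 12 hours requires contacting datahub@ucsd.edu" >&2
        return 1
    fi
    echo $(( hours * 3600 ))
}

# Assumption: the launch script reads this variable from the environment.
K8S_TIMEOUT_SECONDS="$(timeout_seconds 12)"
export K8S_TIMEOUT_SECONDS
echo "K8S_TIMEOUT_SECONDS=$K8S_TIMEOUT_SECONDS"   # prints K8S_TIMEOUT_SECONDS=43200
```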

 

Custom images

Use the -i flag to run a custom image, e.g. -i myrepo/myimage

If the image isn't based on JupyterHub use -e CMD to use an alternate entrypoint, e.g. -i myrepo/myimage -e /entrypoint.sh. This runs /entrypoint.sh instead of jupyterhub.

Running on a specific node

Use -n # to run on a specific node. For example, to run on node 10 use -n 10. Refer to the DataHub status page for a list of nodes.

Workspaces

To start the container in a course-specific workspace (a directory where your course-specific files are stored), use the -W flag, e.g. -W DSC10_FA22_A00.

Visual Studio (VS) Code

Students may use the 'Remote-SSH' extension in VS Code to access dsmlp-login.ucsd.edu. Follow this guide to set up the extension.

Beginning in Winter 2022, we require that VS Code be launched via the method described on Microsoft's VS Code website. If you're familiar with SSH practices and VS Code's Remote-SSH extension, a more condensed guide is also available.

PostgreSQL Container

For projects that require the use of a postgres database, you can create one using the launch-postgres.sh launch script.

Additional Information

Monitoring Resource Usage within Jupyter/Python Notebooks

Users of the stock containers will find CPU/Memory/GPU utilization noted at the top of the Jupyter notebook screen:

Screenshot: Jupyter Notebook top page

Monitoring Resource Usage within Containers

Users of the bash command line can find the CPU/RAM usage of their pod by running the htop command. To see GPU usage, run the nvidia-smi command.
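For a compact view of GPU usage, nvidia-smi can be asked for specific fields. The sketch below wraps that query in a hypothetical function, gpu_snapshot, which also prints a hint when nvidia-smi is not present. It assumes that a container launched without a GPU may not have the tool available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one snapshot of GPU utilization and memory use,
# or a hint when nvidia-smi is not available in this container.
gpu_snapshot() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; this container may have no GPU attached" >&2
        return 1
    fi
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
}

gpu_snapshot || true   # do not abort a larger script when no GPU is present
```

To refresh the full display periodically instead, watch -n 5 nvidia-smi reruns the command every 5 seconds, provided the watch utility is installed in the container.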

 

Monitoring Cluster Status

DSMLP cluster status is available at: https://datahub.ucsd.edu/hub/status (requires login). Aggregate cluster status is also available from the DSMLP Cluster Status tab inside a running Jupyter notebook.

 

GPU Wait Times

If no GPUs are available, you may have to wait some time (on average 5-10 minutes) for the launch script to resolve and for you to be able to use your container. If there are issues with GPU availability and you need to run multiple commands, please make sure not to run your launch scripts with the "-f" flag (which kills the container after running the given command). Instead, remove the flag so you can run your commands in the container's shell.

 

Container Termination Messages

Containers may occasionally exit with one of the following error messages:

OOMKilled: Container memory (CPU RAM) limit was reached.

DeadlineExceeded: Container time limit (default 6 hours) exceeded - see above.

Error: Unspecified error. Contact ITS/ETS for assistance.

These errors will show up in 'kubectl get pods' in the status column.
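As a quick reference, the messages above can be encoded in a small lookup function. This is a convenience sketch: the function name explain_status is hypothetical, and it covers only the three statuses listed in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a pod status string to the explanation given above.
explain_status() {
    case "$1" in
        OOMKilled)        echo "Container memory (CPU RAM) limit was reached." ;;
        DeadlineExceeded) echo "Container time limit (default 6 hours) was exceeded." ;;
        Error)            echo "Unspecified error. Contact ITS/ETS for assistance." ;;
        *)                echo "No termination message listed for status: $1" ;;
    esac
}

explain_status OOMKilled
```

Pass it the value shown in the status column of kubectl get pods, for example explain_status DeadlineExceeded.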

Notes

Your instructor or TA will be your best resource for course-specific questions. If you still have questions or need additional assistance, please email dsmlp@ucsd.edu or visit support.ucsd.edu.

(Instructors: if your course uses dsmlp-login, ITS/ETS will provide you with login information for Instructor, TA, and student-test accounts for the courses.)