How to: Launching Containers From the Command Line - Data Science/Machine Learning Platform (DSMLP)


Overview


DSMLP jobs are executed in the form of Docker containers: essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users' processes. The Kubernetes container management/orchestration system routes users' containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate. The login nodes described below (dsmlp-login and ieng6) act only as front-end/submission nodes for the cluster; computation itself runs on the compute nodes.


Access to the front-end/submission node


To start a container, ssh to dsmlp-login.ucsd.edu with your UC San Diego username (e.g., ssh username@dsmlp-login.ucsd.edu).

Note: if you are using Visual Studio Code or launching a container with a Jupyter notebook after you log in to dsmlp-login, you will need to be on the UC San Diego campus network for this to work. If you are off campus, please connect to the VPN before you ssh to dsmlp-login.

Students should log in to the front-end nodes using either their UCSD email username (e.g. 'jsmith') or, in some cases, a course-specific account, e.g. 'cs253wXX' for CSE 253, Winter 2018. Consult the ITS/ETS Account Lookup Tool for instructions on activating course-specific accounts. UCSD Extension/Concurrent Enrollment students: see Extension for a course account token, then complete the ITS/ETS Concurrent Enrollment Computer Account form.

You may also ssh to ieng6.ucsd.edu if you have been given an account there.  Students logging in to 'ieng6' with their UCSD username (e.g. 'jsmith') must use the 'prep' command to activate their course environment and gain access to the GPU tools. Select the relevant option from the menu (e.g. cs253w, cs291w). ('prep' is implicit on 'dsmlp-login', or when using a course-specific account on ieng6.)

SSH Key

If you find yourself using SSH frequently, you may want to set up an SSH key between your local machine and the dsmlp-login.ucsd.edu Linux server so that you can log in without entering a password each time. To do so, run the ssh-keygen command on your local machine. The command prompts for a file in which to save the public/private key pair and for an optional passphrase; pressing 'Enter' at each prompt accepts the default (shown in parentheses), or you may supply your own file location and passphrase. It then creates the private and public keys.

To put the key onto dsmlp-login, run the following command on your local machine: cat ~/.ssh/id_rsa.pub | ssh user@dsmlp-login.ucsd.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys". To verify that the key works, log in to dsmlp-login.ucsd.edu; if you are not prompted for a password, the SSH key is working. If you set a passphrase when creating the key, you will be prompted for that passphrase instead.
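
For reference, a minimal end-to-end sketch of the steps above (the username is a placeholder, and the key file name assumes the default RSA key type):

    # On your local machine: generate a key pair (press Enter at each prompt to accept defaults)
    ssh-keygen

    # Copy the public key to dsmlp-login
    cat ~/.ssh/id_rsa.pub | ssh your_username@dsmlp-login.ucsd.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"

    # Test: this should no longer ask for your account password
    ssh your_username@dsmlp-login.ucsd.edu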


Launching a Data Science or Machine Learning Container


After signing on to the front-end node, you may start a Pod/container using either of the following launch scripts:

Launch Script            Description                               #GPU   #CPU   RAM (GB)

launch-scipy-ml.sh       See: Standard Datahub/DSMLP Containers    0      2      8

launch-scipy-ml-gpu.sh   See: Standard Datahub/DSMLP Containers    1      4      16

Docker container image and CPU/GPU/RAM settings are all configurable. See How to Select and Configure Your Container for more information.
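
As an illustrative sketch only (the option letters and image name below are assumptions; see the configuration guide above for the exact options your launch script accepts), a customized launch might look like:

    # Hypothetical: request 1 GPU, 8 CPU cores, 32 GB RAM, and a specific container image
    launch-scipy-ml-gpu.sh -g 1 -c 8 -m 32 -i ucsdets/scipy-ml-notebook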

We encourage you to use non-GPU (CPU-only) containers until your code is fully tested and a simple training run is successful. (The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.)

Once started, containers can provide Bash (shell/command-line), as well as Jupyter/Python Notebook environments.


Bash Shell / Command Line

The predefined launch scripts above initiate an interactive Bash shell similar to ssh; containers terminate when this interactive shell exits. Our image includes the GNU Screen utility, which may be used to manage multiple terminal sessions in a window-like manner.
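
For example, standard GNU Screen usage inside the container shell looks like this (the key bindings below are Screen defaults):

    screen          # start a Screen session inside your container shell
    # Ctrl-a c              create an additional window
    # Ctrl-a n / Ctrl-a p   switch to the next / previous window
    # Ctrl-a d              detach from the session; 'screen -r' reattaches later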


Jupyter/Python Notebooks

The default container configuration creates an interactive web-based Jupyter/Python Notebook, which may be accessed via a TCP proxy URL output by the launch script. Note that access to the TCP proxy URL requires a UCSD IP address: either on-campus wired/wireless, VPN, or port mapping. See http://blink.ucsd.edu/go/vpn for instructions on the campus VPN. Port mapping may also be used to open the notebook in a local browser: take the port number from the URL output by the launch script and forward that port to your local machine.

Example: ssh -N -L localhost:13151:127.0.0.1:13151 <user>@dsmlp-login.ucsd.edu (then open http://localhost:13151 in a local browser, substituting the port shown in your own launch output).


Postgres Database

For projects that require the use of a postgres database, you can create one using the launch-postgres.sh launch script.
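
A rough sketch of connecting afterwards, assuming the launch script reports the database host, user, and database name (all values below are placeholders for whatever your launch output shows):

    # Hypothetical: connect with the standard psql client using the details from your launch output
    psql -h <db-host> -U <db-user> -d <database-name>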


Visual Studio (VS) Code

Students may use the 'Remote-SSH' extension in VS Code to access dsmlp-login.ucsd.edu. Follow this guide to set up the extension.

Beginning in Winter 2022, VS Code must be launched via the Remote-SSH method described on Microsoft's VS Code website. If you are already familiar with SSH and VS Code's Remote-SSH extension, a more condensed guide is also available.
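
As an illustration, the Remote-SSH extension reads host definitions from your local ~/.ssh/config file; a minimal entry (the username is a placeholder) might look like:

    Host dsmlp-login
        HostName dsmlp-login.ucsd.edu
        User your_ucsd_username

With this entry in place, you can select the 'dsmlp-login' host from the Remote-SSH host list.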


Monitoring Resource Usage within Jupyter/Python Notebooks

Users of the stock containers will find CPU/Memory/GPU utilization noted at the top of the Jupyter notebook screen:

Screenshot: Jupyter Notebook top page

Monitoring Resource Usage within Containers

Users of the bash command line can check their pod's CPU/RAM usage with the `htop` command, and GPU usage with the `nvidia-smi` command.
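
For example (the `watch` utility is assumed to be present in the image):

    htop                   # interactive view of your pod's CPU and RAM usage
    nvidia-smi             # one-time snapshot of GPU utilization and GPU memory
    watch -n 5 nvidia-smi  # refresh the GPU snapshot every 5 seconds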


Monitoring Cluster Status

DSMLP cluster status is available at: https://datahub.ucsd.edu/hub/status (requires login). Aggregate cluster status is also available from the DSMLP Cluster Status tab inside a running Jupyter notebook.


GPU Wait Times

If no GPUs are available, you may have to wait some time (on average 5-10 minutes) for the launch script to acquire a GPU and start your container. If GPU availability is limited and you need to run multiple commands, please do not run your launch script with the "-f" flag (which terminates the container after running the given command); instead, omit the flag so you can run your commands from the container's shell.


Background Execution / Long-Running Jobs

To support longer training runs, we permit background execution of student containers, up to 12 hours execution time, via the "-b" command line option.  

Use the ‘kubesh <pod-name>’ command to connect or reconnect to a background container, and ‘kubectl delete pod <pod-name>’ to terminate.
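
Putting these commands together, a typical background workflow looks like the following (the pod name is a placeholder for the name reported by the launch script):

    launch-scipy-ml-gpu.sh -b       # start the container in the background (up to 12 hours)
    kubectl get pods                # list your pods and check their status
    kubesh <pod-name>               # connect or reconnect to the background container
    kubectl delete pod <pod-name>   # terminate the container when finished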

Please be considerate and terminate idle containers:  while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis. When attached to a container they become unusable by others even if completely idle.  


Container Run Time Limits

By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs.  This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" configuration variable. Contact your TA or instructor if you require more than 12 hours.
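
As a sketch, assuming the variable is read from your shell environment at launch time (see How to Select and Configure Your Container for where it is actually set):

    # 43200 seconds = 12 hours, the maximum without instructor approval
    export K8S_TIMEOUT_SECONDS=43200
    launch-scipy-ml.sh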


Container Termination Messages

Containers may occasionally exit with one of the following error messages:

OOMKilled: Container memory (CPU RAM) limit was reached.

DeadlineExceeded: Container time limit (default 6 hours) exceeded; see above.

Error: Unspecified error. Contact ITS/ETS for assistance.

These errors appear in the STATUS column of 'kubectl get pods' output.

Find the original version of this guide.

For more information, see the FAQ.

Your instructor or TA will be your best resource for course-specific questions. If you still have questions or need additional assistance, please email dsmlp@ucsd.edu or visit support.ucsd.edu.

(Instructors: if your course uses dsmlp-login, ITS/ETS will provide you with login information for Instructor, TA, and student-test accounts for the courses.)