DSMLP jobs run as Docker containers: essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware and each well isolated from other users’ processes. The Kubernetes container management/orchestration system routes users’ containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate. The dsmlp-login systems act as front-end/submission nodes for the cluster; computation is handled on the compute nodes.
To start a container, SSH to dsmlp-login.ucsd.edu with your UC San Diego username, e.g., ssh USERNAME@dsmlp-login.ucsd.edu.
dsmlp-login is designed primarily for launching Docker images with the launch.sh command. Running launch.sh provisions dedicated nodes on which these images provide customized environments for various development workflows. Running jobs manually on dsmlp-login itself, such as Python scripts, Java projects, or machine learning tasks, is prohibited. Developing on the dedicated nodes instead of on dsmlp-login minimizes the impact on server performance.
SSH to dsmlp-login.ucsd.edu and run a launch script. For example, to request one GPU:
launch-scipy-ml.sh -g 1
We encourage you not to use a GPU until your code is fully tested and a simple training run is successful. The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.
[dta001@dsmlp-login ~]$ launch-scipy-ml.sh
Mon Jun 27 15:49:13 PDT 2022 Submitting job dta001-16599
Mon Jun 27 15:49:13 PDT 2022 INFO starting up - pod status: Pending ; Successfully assigned dta001/dta001-16599 to its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:16 PDT 2022 INFO starting up - pod status: Pending ; Started container init-support
Mon Jun 27 15:49:18 PDT 2022 INFO pod assigned to node: its-dsmlpdev-n01.ucsd.edu
Mon Jun 27 15:49:18 PDT 2022 INFO ucsdets/datascience-notebook:2020.2-stable is now active.
You may access your Jupyter notebook at: http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN
Paste http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN into your web browser. You should see the JupyterHub web app.
You can provide various command line options to the launch scripts to customize behavior. See "Launch Script Command Line Options" in How to Select and Configure Your Container on the Data Science/Machine Learning Platform (DSMLP). Here are some detailed examples:
You may request a specific number of CPUs and GPUs and a specific amount of RAM for the container using the -c #, -g #, and -m # flags. For example, to use 8 CPUs, 32 GB of RAM, and 1 GPU, run:
launch-scipy-ml.sh -c 8 -m 32 -g 1
Use the -i flag to run a custom image, e.g. -i myrepo/myimage
If the image isn't based on JupyterHub use -e CMD to use an alternate entrypoint, e.g. -i myrepo/myimage -e /entrypoint.sh. This runs /entrypoint.sh instead of jupyterhub.
Pods run in the foreground by default. If you disconnect from dsmlp-login the pod is terminated. Pods may be launched in the background to prevent this. Use the -b flag to launch a pod in the background. Background pods run for 6 hours by default.
To connect to the shell of the notebook use kubesh <pod-name>, e.g. kubesh dta001-16599 in the example above. If you forget the pod's name use kubectl get pod to list the names of all of your pods.
When you're finished with the notebook, use kubectl get pod to get the pod's name. Then use kubectl delete pod <pod-name> to terminate the pod.
To support longer training runs, we permit background execution of student containers, up to 6 hours execution time, via the "-b" command line option.
Please be considerate and terminate idle containers: while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis. When attached to a container they become unusable by others even if completely idle.
By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs. This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" configuration variable. Please contact firstname.lastname@example.org if you need to run a pod for more than 12 hours.
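The timeout is expressed in seconds, so the 12-hour maximum corresponds to 43200. The sketch below does the conversion and then launches a background pod. It assumes that the launch script reads K8S_TIMEOUT_SECONDS from the environment, in the same way IDENTITY_PROXY_PORTS is passed in the TensorBoard example; if your launch script sets the variable internally, edit the value there instead.

```shell
# Convert a run time in hours to the seconds value used by K8S_TIMEOUT_SECONDS.
hours=12                                # 12 hours is the documented maximum
timeout_seconds=$((hours * 60 * 60))    # 12 * 3600
echo "K8S_TIMEOUT_SECONDS=${timeout_seconds}"

# Assumption: the launch script picks the variable up from the environment.
if command -v launch-scipy-ml.sh >/dev/null 2>&1; then
  K8S_TIMEOUT_SECONDS="${timeout_seconds}" launch-scipy-ml.sh -b
else
  echo "launch-scipy-ml.sh not found; run this on dsmlp-login"
fi
```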
If you are making changes to an image and want to pull the most recent version of it (e.g., from Docker Hub or ghcr.io), use the command line option "-P Always", e.g.:
launch.sh -i myrepo/myimage -P Always
Use -n # to run on a specific node. For example, to run on node 10 use -n 10. Refer to the DataHub status page for a list of nodes.
To start the container in a course-specific workspace (a directory where your course-specific files are stored), use the -W flag, e.g. -W DSC10_FA22_A00.
It's not permitted to run VS Code directly on dsmlp-login since it uses a lot of resources. However, you can run VS Code inside of a container.
To get started, generate an SSH key pair on your computer. This can be done with the ssh-keygen command. Make a note of where the keys are stored.
SSH to dsmlp-login and append the public key to ~/.ssh/authorized_keys.
Check that the key pair works. Run "ssh -i /path/to/private.key USERNAME@dsmlp-login.ucsd.edu". You shouldn't be prompted for a login password; however, you may be prompted to enter the passphrase for your private key.
Launch VS Code and install the Remote-SSH extension. For more information, refer to the tips and tricks article, but keep in mind the connection procedure is different. The required modifications to your SSH config file are detailed below.
Go to Remote Explorer in VS Code and select SSH targets. Click the gear icon and edit the config file in your home directory.
Create a new Host entry using the sample below. If you don't know the course id for the -W command please run "workspace --list" on dsmlp-login to retrieve it. If you are an independent study user you may omit the -W flag to use your personal home directory. You must create a separate Host entry for each course, otherwise you may be unable to connect due to mismatched host keys.
ProxyCommand ssh -i /path/to/private.key USERNAME@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch.sh -W MYCOURSE -H -N vscode-dsmlp
The ProxyCommand is very important. Without this line VS Code will attempt to run directly on dsmlp-login, but this isn't permitted. ProxyCommand instructs SSH to run a script on dsmlp-login to start a container after the SSH connection is made.
The launch.sh script is the same one used to launch a notebook, but two options have been added: -H starts the SSH server in the container, and -N vscode-dsmlp gives the container a fixed name, which prevents multiple VS Code servers from running.
You may also add other command line options, e.g. -c 4 -m 8 to start the container with 4 CPUs and 8 GB of RAM.
To use a GPU for a machine learning container, use /opt/launch-sh/bin/launch-scipy-ml.sh (instead of launch.sh) with command line option -g 1.
If your VS Code connection drops within a minute during a memory-intensive task, make sure the ProxyCommand includes a "-m #" argument requesting enough RAM (in GB).
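Putting the pieces above together, a complete Host entry might look like the sketch below. Only the ProxyCommand line comes from the sample; the Host alias (dsmlp-vscode-MYCOURSE) and the User line are assumptions added for illustration, and the uppercase values are placeholders to replace with your own. The snippet writes the entry to a scratch file so that you can review it before copying it into your SSH config file.

```shell
# Write a sample Host entry to a scratch file for review.
# Uppercase values (USERNAME, MYCOURSE) are placeholders.
entry_file="$(mktemp)"
cat > "$entry_file" <<'EOF'
# "dsmlp-vscode-MYCOURSE" is a hypothetical alias; use a separate alias per course.
Host dsmlp-vscode-MYCOURSE
    # Assumption: the container account matches your dsmlp-login username.
    User USERNAME
    ProxyCommand ssh -i /path/to/private.key USERNAME@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch.sh -W MYCOURSE -H -N vscode-dsmlp
EOF

# Print the entry so it can be pasted into the SSH config file.
cat "$entry_file"
```

Using a distinct alias for each course keeps the stored host keys separate, which is the reason given above for creating one Host entry per course.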
Finally, right-click the SSH target and connect to the server. Click "details" in the lower right part of the screen to monitor the connection process. If you have problems, please send these logs, along with your Host entry, to email@example.com.
If you use SSH frequently, you may want to set up an SSH key between your PC and the dsmlp-login server so that you can log in without entering a password each time. To do this, run the ssh-keygen command on your local machine. The command prompts for the name of the key file and a passphrase, then creates a public/private key pair. Press Enter at each prompt to accept the default (shown in parentheses), or type a file location or passphrase if desired.
To copy the key to dsmlp-login, run the following command on the local machine:
cat ~/.ssh/id_rsa.pub | ssh USERNAME@dsmlp-login.ucsd.edu "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"
To verify that the key works, log in to dsmlp-login.ucsd.edu. If you are not prompted for a password, the SSH key is valid. If you set a passphrase when creating the key, you will be prompted for that passphrase instead.
You can use SSH to forward a local port into a pod. Run this command on your desktop PC:
ssh -N -L LOCALPORT:dsmlp-login.ucsd.edu:REMOTEPORT USERNAME@dsmlp-login.ucsd.edu
The launch script prints a URL containing the port it has forwarded on dsmlp-login to the Jupyter notebook; in the example output above, that port is 14244. To forward local port 8888 to it, run:
ssh -N -L 8888:dsmlp-login.ucsd.edu:14244 USERNAME@dsmlp-login.ucsd.edu
Now you can access JupyterHub at http://localhost:8888.
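If you forward ports often, the tunnel command can be derived from the URL that the launch script prints. The sketch below uses the example URL from the launch output and plain shell parameter expansion; USERNAME is a placeholder, and the variable names are illustrative only.

```shell
# URL as printed by the launch script (example values from the output above).
jupyter_url='http://dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN'
local_port=8888

# Strip the scheme and the path to leave "host:port", then split the two.
host_port="${jupyter_url#*://}"    # dsmlp-login.ucsd.edu:14244/user/dta001/?token=TOKEN
host_port="${host_port%%/*}"       # dsmlp-login.ucsd.edu:14244
remote_host="${host_port%%:*}"     # dsmlp-login.ucsd.edu
remote_port="${host_port##*:}"     # 14244

# Build the tunnel command; replace USERNAME before running it on your desktop PC.
tunnel_cmd="ssh -N -L ${local_port}:${remote_host}:${remote_port} USERNAME@dsmlp-login.ucsd.edu"
echo "$tunnel_cmd"
```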
To access the TensorBoard dashboard set the shell variable "IDENTITY_PROXY_PORTS=1" before launching the container.
[user@dsmlp-login]:~:598$ IDENTITY_PROXY_PORTS=1 launch-scipy-ml.sh -g 1
After the container launches look for the line "Identity port map 1: Container port 12345 mapped to dsmlp-login.ucsd.edu:12345". Make a note of the port number and use it to start TensorBoard.
Go to the Jupyter URL and open a terminal with "New -> Terminal" and run "tensorboard --logdir logs --bind_all --port 12345" to start TensorBoard.
Now you should be able to access the TensorBoard dashboard at http://dsmlp-login.ucsd.edu:12345.
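The port number can also be pulled out of the "Identity port map" line rather than copied by hand. This is a minimal sketch that assumes the line has exactly the format quoted above; the variable names are illustrative only.

```shell
# Line as printed after the container launches (example values from the text above).
port_line='Identity port map 1: Container port 12345 mapped to dsmlp-login.ucsd.edu:12345'

# Everything after the last colon is the mapped port.
tb_port="${port_line##*:}"

# Command to run in the Jupyter terminal, and the URL to open afterwards.
tb_cmd="tensorboard --logdir logs --bind_all --port ${tb_port}"
tb_url="http://dsmlp-login.ucsd.edu:${tb_port}"

echo "Run in the Jupyter terminal: ${tb_cmd}"
echo "Then open: ${tb_url}"
```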
Users of the stock containers will find CPU/Memory/GPU utilization noted at the top of the Jupyter notebook screen.
Users of the bash command line can find the CPU/RAM usage of their pod by using the `htop` command. To see GPU usage, run the `nvidia-smi` command periodically.
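One way to watch GPU usage from the container's shell is to take a few readings at a fixed interval. The sketch below relies on nvidia-smi's query options; the sample count and interval are arbitrary values chosen for illustration.

```shell
samples=3     # number of readings to take
interval=5    # seconds between readings

if command -v nvidia-smi >/dev/null 2>&1; then
  i=1
  while [ "$i" -le "$samples" ]; do
    # Print GPU utilization and memory use as CSV.
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
    sleep "$interval"
    i=$((i + 1))
  done
else
  echo "nvidia-smi not found; run this inside a container launched with a GPU"
fi
```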
DSMLP cluster status is available at https://datahub.ucsd.edu/hub/status (requires login). Aggregate cluster status is available from the DSMLP Cluster Status tab inside a running Jupyter notebook.
If no GPUs are available, you may have to wait some time (on average 5-10 minutes) for the launch script to finish before you can use your container. If GPU availability is limited and you need to run multiple commands, do not run your launch scripts with the "-f" flag, which kills the container after running the given command. Instead, omit the flag so that you can run your commands in the container's shell.
Containers may occasionally exit with one of the following error messages:
Container memory (CPU RAM) limit was reached.
Container time limit (default 6 hours) exceeded - see above.
Unspecified error. Contact ITS/ETS for assistance.
These errors will show up in 'kubectl get pods' in the status column.
Your instructor or TA will be your best resource for course-specific questions. If you still have questions or need additional assistance, please email email@example.com or visit support.ucsd.edu.
(Instructors: if your course uses dsmlp-login, ITS/ETS will provide you with login information for Instructor, TA, and student-test accounts for the courses.)