How to: File/Data Transfer - Data Science/Machine Learning Platform (DSMLP/DataHub)


Overview


It's always a good idea to back up your files. This guide will help you transfer files from the DSMLP cluster to your desktop or to a git repository.

Critical Concepts


Steps to Take: Archival


How to download files from the DSMLP cluster to your desktop.

git

Create a repository on the git host of your choice, e.g. GitHub. Then commit and push your files.
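
As a rough sketch of the commit-and-push step, the function below initializes a repository in a directory, commits everything in it, and pushes the current branch to a remote. The function name backup_to_git and the commit message are illustrative and not part of DSMLP. The remote URL is whatever your git host shows after you create the repository, and the sketch assumes your git identity and credentials for that host are already configured.

```shell
# Sketch: commit every file in a directory and push it to a remote repository.
# Usage: backup_to_git <directory> <remote-url>
# backup_to_git is a hypothetical helper name, not a DSMLP command.
backup_to_git() (
  cd "$1" &&
  git init -q &&                          # safe to run in an existing repository
  git add -A &&                           # stage new, modified, and deleted files
  git commit -q -m "Back up DSMLP files" &&   # stops here if nothing changed
  # Add the remote only if 'origin' is not already defined.
  { git remote get-url origin >/dev/null 2>&1 || git remote add origin "$2"; } &&
  git push -q -u origin HEAD              # push the current branch, whatever its name
)
```

Run on the cluster, a call might look like backup_to_git ~/my-project <remote-url>, where <remote-url> is the address of the repository you created.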

Visual Studio Code

Launch a Visual Studio Code pod and copy/paste the files from VS Code to your desktop.

SCP/SFTP

Data may be copied to or from the cluster using the "SCP" or "SFTP" file transfer protocols, either from a Mac or Linux terminal window or, on Windows, with a freely downloadable utility. We recommend this option for most users.

Example using the Mac/Linux 'sftp' command line program:

slithy:Downloads agt$ sftp <username>@dsmlp-login.ucsd.edu
pod agt-4049 up and running; starting sftp
Connected to ieng6.ucsd.edu
sftp> put 2017-11-29-raspbian-stretch-lite.img
Uploading 2017-11-29-raspbian-stretch-lite.img to /datasets/home/08/108/agt/2017-11-29-raspbian-stretch-lite.img
2017-11-29-raspbian-stretch-lite.img             100% 1772MB  76.6MB/s   00:23
sftp> quit
sftp complete; deleting pod agt-4049
slithy:Downloads agt$

On Windows, we recommend the WinSCP utility.
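
The sftp session above uploads a file. For the download direction, a minimal sketch using scp might look like the following. The helper name dsmlp_download and the DRY_RUN switch are assumptions made for illustration. The hostname is the one used in the example above, and a relative remote path is resolved against your home directory on the cluster.

```shell
# Hostname taken from the sftp example in this guide.
DSMLP_HOST="dsmlp-login.ucsd.edu"

# Sketch: copy a file or directory from the cluster to the local machine.
# Usage: dsmlp_download <username> <remote-path> [local-dir]
# Set DRY_RUN=1 to print the scp command instead of running it.
# dsmlp_download and DRY_RUN are hypothetical names, not DSMLP tools.
dsmlp_download() {
  # -r lets the same command handle both single files and directories.
  set -- scp -r "${1}@${DSMLP_HOST}:${2}" "${3:-.}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Printing the command first with DRY_RUN=1 is a cheap way to check the source and destination before a large transfer starts.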

rsync

'rsync' may also be used from a Mac or Linux terminal window to synchronize data sets:

slithy:ME198 agt$ rsync -avr tub_1_17-11-18 <username>@dsmlp-login.ucsd.edu:
pod agt-9924 up and running; starting rsync
building file list ... done
rsync complete; deleting pod agt-9924
sent 557671 bytes  received 20 bytes  53113.43 bytes/sec
size is 41144035  speedup is 73.78
slithy:ME198 agt$
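
The transcript above pushes a directory to the cluster. To pull a directory back to your desktop, the source and destination swap places. The sketch below assumes rsync works in the pull direction through dsmlp-login the same way it does for uploads. The helper name dsmlp_rsync_pull and the DRY_RUN switch are illustrative only.

```shell
# Sketch: synchronize a directory from the cluster to the local machine.
# Usage: dsmlp_rsync_pull <username> <remote-dir> [local-dir]
# Set DRY_RUN=1 to print the rsync command instead of running it.
# dsmlp_rsync_pull and DRY_RUN are hypothetical names, not DSMLP tools.
dsmlp_rsync_pull() {
  # A trailing slash on the source copies the directory's contents into
  # local-dir instead of creating a nested copy of the directory itself.
  set -- rsync -av "${1}@dsmlp-login.ucsd.edu:${2%/}/" "${3:-.}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Because rsync skips files that are already up to date, rerunning the same command after an interrupted transfer only copies what is still missing.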

Workspaces

As of summer 2022, all courses have workspaces, which provide two options for file transfer.

The first option is to download the files from dsmlp-login as described above. However, the location of the files will be under /dsmlp/workspaces-fs0*/COURSE/home/USERNAME, where workspaces-fs0* is a string such as workspaces-fs01, workspaces-fs02, or workspaces-fs03.

You can find a list of all your workspaces with the command workspace -l:

-bash-4.2$ workspace -l
2023-06-29 22:31:23,653 - workspace - INFO - Retrieving course info...
2023-06-29 22:31:23,742 - workspace - INFO - 1/3: Retrieving course info for <COURSE_0>...
2023-06-29 22:31:23,742 - workspace - INFO - 2/3: Retrieving course info for <COURSE_1>...
2023-06-29 22:31:23,742 - workspace - INFO - 3/3: Retrieving course info for <COURSE_2>...


Course_ID, Path to Course Workspace Home Directory
--------------------------------------------------
<COURSE_0> /dsmlp/workspaces-fs04/<COURSE_0>/home/<username>
<COURSE_1> /dsmlp/workspaces-fs03/<COURSE_1>/home/<username>
total = 2
Note: above list may include courses that have ended

Then, to access a given workspace directory:

cd /dsmlp/workspaces-fs04/<COURSE_0>/home/<username>
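
If you would rather not copy the path by hand, the listing can be filtered for a single course. The sketch below assumes the output keeps the two-column layout shown above (course ID, then path). The function name workspace_path is hypothetical.

```shell
# Sketch: print the workspace home directory for one course.
# Usage: workspace -l | workspace_path <COURSE_ID>
# workspace_path is a hypothetical helper, not a DSMLP command.
workspace_path() {
  # Keep only rows whose first column is the course ID and whose second
  # column looks like a workspace path; log and header lines are skipped.
  awk -v course="$1" '$1 == course && $2 ~ /^\/dsmlp\// { print $2 }'
}
```

On dsmlp-login this could be combined with cd, for example cd "$(workspace -l | workspace_path <COURSE_0>)".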

The second option is to use Visual Studio Code. The ProxyCommand line in your ssh_config should include the -W (workspace) option followed by the ID of the workspace, that is, the course ID.

ProxyCommand ssh -i path/to/privatekey username@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch-datascience.sh -p normal -W COURSEID -H -N vscode-dsmlp

The files can be copy/pasted from VS Code to your desktop.
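
To show where the ProxyCommand line sits, the sketch below prints a minimal ssh_config entry for one course. The Host alias (vscode-dsmlp-COURSEID), the User line, and the function name dsmlp_vscode_host are assumptions for illustration. Only the ProxyCommand itself comes from this guide, and your existing DSMLP entry may need other options that are not shown here.

```shell
# Sketch: print an ssh_config entry that opens VS Code in a course workspace.
# Usage: dsmlp_vscode_host <username> <path-to-private-key> <course-id>
# The Host alias and the User line are assumptions; the ProxyCommand
# follows the line given in this guide.
dsmlp_vscode_host() {
  cat <<EOF
Host vscode-dsmlp-$3
    User $1
    ProxyCommand ssh -i $2 $1@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch-datascience.sh -p normal -W $3 -H -N vscode-dsmlp
EOF
}
```

Review the printed entry before adding it to your ssh_config. If you already have a working DSMLP entry, adding -W and the course ID to its ProxyCommand is likely all that is needed.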

Third-Party Datasets


Sometimes you may need to download a dataset onto the cluster. First, consider its size and the number of users. If it is large and used by multiple people, please send a request to datahub@ucsd.edu and we can put it into the /datasets folder so it can be shared.

If it's small, you can use wget or curl to download it. First SSH to dsmlp-login, and then invoke wget or curl.
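
As an illustration of the curl route, the function below downloads one file into a directory of your choice, keeping the file name from the URL. The function name fetch_dataset is hypothetical, and the URL is whatever the dataset provider publishes.

```shell
# Sketch: download a single dataset file into a destination directory.
# Usage: fetch_dataset <url> [dest-dir]
# fetch_dataset is a hypothetical helper, not a DSMLP command.
fetch_dataset() (
  mkdir -p "${2:-.}" &&
  cd "${2:-.}" &&
  # -f: fail on HTTP errors instead of saving an error page
  # -L: follow redirects
  # -O: save under the file name taken from the URL
  curl -fL -O "$1"
)
```

The wget equivalent of the last step is wget followed by the URL, run from the destination directory.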

Find the original version of this guide.

For more information, check the FAQ.

If you still have questions or need additional assistance, please email datahub@ucsd.edu or visit support.ucsd.edu.