A minimal modern deep learning cloud setup
Published at Mar 20, 2024
Overview
This is a setup guide aimed at people new to GCP and a general cloud workflow with GPUs.
Table of contents
GCP VM setup
login and basics
For this we are going to use GCP Compute Engine VMs, specifically ones with GPUs. Start off here and activate the Compute Engine API.
You might have to input your payment information, but you will most likely get around $300 in credits, which equates to roughly 300 hours of VM usage, so assuming this is your first time using GCP you shouldn’t be too worried.
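If you prefer the command line, the API can also be enabled with the gcloud CLI (assuming you have it installed and are authenticated):

```bash
# Enable the Compute Engine API for the currently active project
gcloud services enable compute.googleapis.com

# Sanity check: list VM instances (this will be empty for now)
gcloud compute instances list
```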
Once you have set everything up you should be greeted with a screen similar to this.
It’s a bit overwhelming, but the two most important things for now are CREATE INSTANCE and the External IP (which you use to connect to the VM).
quota check
Before we create the instance - since we want to use GPUs - we have to check that our GPU quota (the number of GPUs we’re allowed) is >0, since for some reason the default is sometimes 0. To check this
- Click on the search bar above CREATE INSTANCE
- Look for `All quotas`
- In the field `Enter property name or value` look for `Name: GPUs (all regions)`
- Click the little checkbox next to the name
- Click `EDIT QUOTAS` (top right of the screen)
- Type 1 (if you want multiple GPUs then type more)
- Hit `SUBMIT REQUEST`
Visually this looks something like this
I’ve seen some varying response times here; they officially say 2 days, but for small requests it seems to be automated and often happens immediately. You will very likely get approved; it’s only for the more expensive GPUs that they decline you, and you (I believe) have to appeal and talk to an actual service agent.
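If you’d rather check the quota from the terminal, it is also visible in the project metadata; a rough sketch (the grep context flags are only there to make the relevant entry readable):

```bash
# Dump project quotas and look for the GPUS_ALL_REGIONS entry;
# its "limit" should be >= 1 once the request is approved
gcloud compute project-info describe --format=json | grep -B2 -A2 GPUS_ALL_REGIONS
```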
instance creation
Then you edit the following settings

| setting you should edit | what to pick |
|---|---|
| region | GPUs are quite popular, especially in US regions, so trying to get a higher-tier GPU there is hard. I’ve had luck with certain EU regions, but it’s more or less just hoping you pick an okay region. |
| gpu | This is going to depend a lot on your workflow; an Nvidia L4 should suffice for most moderately intense models. It’s not going to be the fastest, but it gets the job done. |
| boot disk | Here you want to pick an image with CUDA preinstalled; later versions generally have patches that fix certain bugs, so CUDA >12 is a good thing to go with unless you need older versions. |
| boot disk size | This again depends mainly on the model and data you are working with, but anything between 50 and 100 GB should be okay. |
Visually the most important parts look like this
At the bottom, you then press CREATE, which spawns your VM instance back on the main screen.
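For reference, roughly the same instance can be created from the CLI. This is only a sketch: the VM name and zone are placeholders, and the image family is an assumption that changes over time (list the current ones with `gcloud compute images list --project deeplearning-platform-release`):

```bash
# g2-standard-4 comes with one L4 GPU attached, so no --accelerator flag is needed.
# --maintenance-policy=TERMINATE is required for GPU VMs.
gcloud compute instances create my-dl-vm \
  --zone=europe-west4-a \
  --machine-type=g2-standard-4 \
  --image-family=common-cu121-debian-11 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE
```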
ssh keys
For brevity’s sake I’m going to assume you have `ssh` installed, after which the process here becomes mostly trivial.
- open whatever shell you use
- type `ssh-keygen`
- click through all the options (remember what your key is called)
- type `cat ~/.ssh/<yourkey>.pub`
- copy the key to your clipboard
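Concretely, the key generation might look like this (the file name `gcp-vm` and the comment are just examples; GCP reads the trailing comment on the public key as the login username, so set it to the user you want on the VM):

```bash
# Generate a fresh ed25519 key pair for the VM
ssh-keygen -t ed25519 -f ~/.ssh/gcp-vm -C "your-username"

# Print the public half; this is what gets pasted into the GCP console
cat ~/.ssh/gcp-vm.pub
```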
Once you have the key in your clipboard it’s back to clicking through dashboards; you want to
- go back to your homepage
- click on the instance name
- click `EDIT`
- scroll down to find the SSH key section
- click `Add item`
- paste the SSH key
- click `Save`
Once again the visual overview of this is as follows
At this stage, your VM should be up and running
coding setup
vscode
For the editor, we are going to use VSCode since it can attach to remote instances via the `Remote - SSH` extension. So first things first
- Install `Remote - SSH`
- Press `CTRL + SHIFT + P` and type “Connect to Host”
- This should bring up a menu; you likely want to configure SSH Hosts, since this makes the entire process a lot easier
- Clicking on this should bring up your SSH `config` file, where you want to paste the following
```
Host <external IP>
    HostName <external IP>
    IdentityFile ~/.ssh/<privkey>
    User <ssh-user>
```
Here
- The external IP is the one on the main dashboard page
- The identity file is the private SSH key you generated using `ssh-keygen`
- The user is the user account associated with that SSH key; if you are confused about what this is, look at the string at the end of the generated public key
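Before involving VSCode at all, it can be worth sanity-checking the connection from a plain shell (IP, key name, and user are the same placeholders as above):

```bash
# If this drops you into a shell on the VM, the VSCode side will work too
ssh -i ~/.ssh/<privkey> <ssh-user>@<external IP>
```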
Once you have this set up, VSCode should work its magic and install the VSCode server on the remote machine.
working GPU
This can be one of the most annoying steps. First things first, make sure to install the Nvidia driver; you should be prompted to do this, and if not then run the following command

```bash
sudo /opt/deeplearning/install-driver.sh
```

If your GPU is suddenly not being detected after you stop and start the VM, re-running this command can potentially fix that problem.
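A quick way to confirm the driver actually took is:

```bash
# Should print a table with the attached GPU (e.g. the L4) plus driver and CUDA versions
nvidia-smi
```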
poetry
For dependency management, we are going to use poetry.
It doesn’t matter too much what you use here; the rough idea is usually the same. I quite like poetry because I think it’s pretty simple and clean to use, but you could just as well use something like conda, which is a bit more popular for scientific computing.
Run the following commands
```bash
curl -sSL https://install.python-poetry.org | python -
export PATH=$PATH:/home/<user>/.local/bin
```

where `<user>` is the user you are logged in with.
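Note that the `export` only lasts for the current session; to make it stick across logins you can append it to your shell profile (a sketch assuming bash, the default on these images):

```bash
# Persist the poetry location in PATH for future shells
echo 'export PATH=$PATH:$HOME/.local/bin' >> ~/.bashrc
source ~/.bashrc
```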
project setup
To create the project simply run
```bash
mkdir proj-name
cd proj-name
poetry init
poetry add tensorflow[and-cuda]
```
final test
To then test whether TensorFlow can detect the GPU, run

```bash
poetry shell
```

so we enter the venv, then start `python` and paste the following

```python
import os
# Verbose TF logging; useful later if something fails during import
os.environ["TF_CPP_MAX_VLOG_LEVEL"] = "3"
import tensorflow as tf
# Should print a non-empty list containing the GPU
print(tf.config.experimental.list_physical_devices('GPU'))
```
If the last line looks like this

```
...
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```

then everything is ready for you to start your regular ML workflow. If not, you might have to do some debugging.
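For later re-checks, the same test also works as a one-liner from the project folder (using the non-experimental API, which newer TensorFlow versions prefer):

```bash
poetry run python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```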
GitHub actions
If you are working with a group of people, a source of annoying merge conflicts can be notebooks that don’t have all their cell outputs cleared. Luckily, someone already thought of this as a potential issue and created a GitHub Actions workflow that fails if notebook outputs aren’t cleared, so a good strategy is
- Add a branch protection rule to `main`
- Run `mkdir -p .github/workflows`
- Create the following workflow in the folder `.github/workflows`
```yaml
name: Lint Jupyter Notebooks
on:
  push:
    paths:
      - '**.ipynb'
  pull_request:
    paths:
      - '**.ipynb'
jobs:
  lint-notebooks:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v2
      - name: Ensure Clean Jupyter Notebooks
        uses: ResearchSoftwareActions/EnsureCleanNotebooksAction@1.1
```
- Enforce this as a status check for PRs

This forces any contributor to clear their notebook cells and avoids annoying merge conflicts.
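On the contributor side, outputs can be cleared locally before committing so the check never trips; one way is nbconvert, which ships with the usual Jupyter tooling:

```bash
# Strip all cell outputs from a notebook in place
jupyter nbconvert --clear-output --inplace path/to/notebook.ipynb
```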
Final Notes
- Remember to stop your VM if you aren’t using it; this essentially just shuts things down and does not affect your drive in any way (see the sketch after this list)
- An acknowledgment to a friend has to be made for helping me debug a bunch of issues
- If you want to see an example of this type of setup check out this project
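The stop mentioned above can also be done from the CLI (VM name and zone are placeholders):

```bash
# Stop (not delete) the VM; you keep the boot disk and stop paying for compute
gcloud compute instances stop my-dl-vm --zone=europe-west4-a
```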
Debugging
Things can sometimes go wrong, here are some general debugging tips
- remember `sudo /opt/deeplearning/install-driver.sh` in case the GPU is suddenly not being detected
- `os.environ["TF_CPP_MAX_VLOG_LEVEL"] = "3"` dumps a bunch of debug information about what TensorFlow does when it’s being imported; if at any point something fails, it should show up here. The most important thing happening is that TensorFlow loads all the CUDA-related libraries, and for this it uses `$LD_LIBRARY_PATH`
- check `$LD_LIBRARY_PATH`: running `echo $LD_LIBRARY_PATH` should show something like this

```
/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
```

- the command `nvidia-smi` is useful for seeing if the GPU is being detected by the system in general; if you want to monitor your GPU during training you can do `watch -n0.1 nvidia-smi`, which is nice to have open in a second terminal