Being able to leverage remote machines with more compute power or different accelerators than your local machine is a very nice capability to have - especially for the large C++ codebases I’m used to in the self-driving car world.

Background

There are many different ways to lean on remote machines in your workflows. A non-exhaustive list of examples:

You don’t need to use these methods individually; for example, you may get the best results from combining all three:

My Specific Problem

Back in the day I did almost all my development in neovim over SSH, but over the past 4 years or so I’ve been seduced by VS Code. Usually I’ll alternate at random between sitting on my couch using VS Code Remote from a laptop and sitting at my desk developing locally on my desktop machine. Recently the problem I’ve wanted to solve is being able to waste my life playing video games without having to kill my ML model training runs.

I spend most of my time in Ubuntu and flip over into Windows when I feel the itch to play a video game that involves crabs constantly leaping through the air spraying bullets. The obvious problem here (aside from my addiction to pretending to be a crab) is that rebooting into Windows means that even if my RTX 3090 has spare capacity to render crab graphics, I still have to stop my training jobs.

Requirements

Based on the problem and my preferences I’ve drawn up some loose requirements for a remote editing solution.

Potential Solutions

The Dell R730 that runs a lot of my Kubernetes workload, including this blog, has reasonable compute capacity: 128GB DDR4 RAM, 72 logical CPU cores, a large amount of RAID’d storage, and a GTX 980 Ti. The GPU is nothing to write home about but I plan to replace it at some point in the near future with something more powerful, likely the 3090 out of my desktop when I upgrade to a 4090. Given that I’ve got a cluster with the hardware I need, the question is just what exactly to run on it.

A Quick Market Survey

A quick glance shows that there are some good-looking open source options floating around. The ones that caught my attention are:

If I were trying to support a large organization of developers I’d probably evaluate Eclipse Che, but since it’s just me and VS Code supports attaching to Kubernetes pods all I really need is a quick way to deploy, track, and connect to pods.

The Chosen Solution

I’ve taken a long road to an unimpressive solution for my use case: just use kubectl and label things well.

Since VS Code can handle attaching to pods, forwarding ports, and forwarding SSH credentials, the only thing I need a solution for is provisioning. Unsurprisingly, it turns out kubectl does that quite well out of the box.

commands

# Create a development environment
kubectl create -f example.yaml
kubectl label -f example.yaml domain.example/dev-environment=ml-1

# Find all dev environments
kubectl get all --selector "domain.example/dev-environment"

# Find resources in a particular dev environment
kubectl get all --selector "domain.example/dev-environment=ml-1"

# Delete a particular dev environment
kubectl delete all --selector "domain.example/dev-environment=ml-1"

example.yaml

apiVersion: v1
kind: Pod
metadata:
  name: dev-pod
spec:
  containers:
    - name: primary
      image: ubuntu:lunar
      # Keep the container alive so there's something to attach to
      command: [sleep, infinity]
      resources:
        # If your development is really bursty you may want to set lower requests
        limits:
          cpu: "32000m"
          memory: "64Gi"
          # Assumes the NVIDIA device plugin is exposing shared (time-sliced) GPUs
          nvidia.com/gpu.shared: "1"

Improving on kubectl

Using kubectl directly works fine for my small use case, but to support multiple users this approach needs work:

To solve automatic label management you could either make some shell functions and aliases or create a small Python (or your language of choice) program. I strongly favor an anything-but-shell approach if only because unit testing becomes a lot easier - but even in Python I’d probably still subprocess kubectl for convenience rather than use the Kubernetes API.
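To make that concrete, here's a minimal sketch of what the Python-wrapping-kubectl approach could look like. The file name and function names are placeholders of my own invention; the label key is the same one used in the commands above.

dev_env.py

#!/usr/bin/env python3
"""Thin wrapper around kubectl for managing labeled dev environments (sketch)."""
import subprocess

LABEL = "domain.example/dev-environment"


def kubectl(*args: str) -> str:
    """Run kubectl and return stdout, raising if the command fails."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def create(manifest: str, name: str) -> None:
    """Create the resources in a manifest and label them as one dev environment."""
    kubectl("create", "-f", manifest)
    kubectl("label", "-f", manifest, f"{LABEL}={name}")


def list_environments() -> str:
    """List every resource that belongs to any dev environment."""
    return kubectl("get", "all", "--selector", LABEL)


def delete(name: str) -> None:
    """Tear down everything labeled with a particular dev environment."""
    kubectl("delete", "all", "--selector", f"{LABEL}={name}")


if __name__ == "__main__":
    create("example.yaml", "ml-1")
    print(list_environments())

Nothing here is clever; the point is just that once the label is applied consistently by code instead of by hand, the rest of the lifecycle is a couple of selector queries.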

Providing the ability to customize the development environment really depends on what your existing deployment system looks like. For my purposes I use Jsonnet for most things, but you could also use Kustomize, Jinja templates, or generate your manifest from Python.
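As an illustration of that last option, here's a rough sketch of generating the manifest from Python. It just reproduces example.yaml with the resource limits pulled out as parameters; it assumes PyYAML is installed, and the function name and defaults are arbitrary.

render_manifest.py

import yaml  # PyYAML


def dev_pod(name: str, cpu: str = "32000m", memory: str = "64Gi", gpus: int = 1) -> str:
    """Render a dev-environment pod manifest with customizable resource limits."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "primary",
                    "image": "ubuntu:lunar",
                    "command": ["sleep", "infinity"],
                    "resources": {
                        "limits": {
                            "cpu": cpu,
                            "memory": memory,
                            "nvidia.com/gpu.shared": str(gpus),
                        }
                    },
                }
            ]
        },
    }
    return yaml.safe_dump(pod, sort_keys=False)


if __name__ == "__main__":
    # Write out a manifest equivalent to example.yaml, ready for kubectl create -f.
    print(dev_pod("dev-pod"))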

Stephan Wolski is a robot engineer, founder, angel investor, penguin enthusiast, and all-around cliché.