I spent the last few months reading ten systems papers that, taken together, map the design space for serverless sandboxing. This post is not a paper-by-paper summary. It is an attempt to identify the fault lines: the tensions, convergence points, and trade-offs that define where the field is heading. I am writing this as someone actively building a serverless sandbox (Shimmy), so the lens is practical: what would you actually build today?
Reviewing your own code a few weeks after writing it is a particular kind of experience. The decisions that felt obvious at the time now look questionable. The "temporary" shortcuts are still there. And some things you were sure were correct turn out to have bugs that you can trace directly to an assumption you made at 2am.
The Leverage OJ rewrite ended with a working platform: backend, frontend, judge engine, Elo rating system, real-time human-vs-bot matches. The natural next question was whether an AI agent could use it autonomously — not just run code against an API, but design an entire game from scratch.
The answer turned out to be yes, with one key ingredient: a machine-readable protocol document and an MCP server.
I've been documenting the evolution of sandbox_exec into something more general. This post covers Sandlock v1.4.0 — the point where it became a proper multi-layer security system rather than a clever wrapper.
Update 2026-03-09: sandbox_exec has since evolved into Sandlock — a modular, full-stack sandbox with strict mode, language-level sandboxes (Python/JS), a source scanner, and LD_PRELOAD hooks. See Sandlock v1.4: From Single File to Full-Stack Sandbox and the GitHub repo.
The previous two posts covered the threat model and the seccomp sandbox. This one is about going further: a WebAssembly execution environment where the security properties come from the compilation target, not from OS-level filters.
Last week we shipped sandbox_exec — a 224-line C program using seccomp-bpf to isolate student code in AWS Lambda. The honest answer at the time was: "WASM would be cleaner, but the Python ecosystem isn't there yet."
This week we measured exactly what "the Python ecosystem isn't there yet" costs in milliseconds. The answer is more nuanced than expected.
A few months ago I started a serious code review of Leverage, a NestJS Online Judge platform that had been running in production for years. No tests. No linter enforcement. No formal review process. Just code that had grown organically, feature by feature, under deadline pressure.
I came out of it with 29 documented issues. Some were minor style things. Six of them were the kind of bugs that make you stare at the screen for a moment and think "how has this been running?"
Authentication is one of those things that feels solved — until you inherit a codebase where it isn't. When I started the Leverage OJ rewrite, the auth system was three separate problems wearing a trench coat: a session setup that broke under PM2, a ContestUser concept that had diverged into its own parallel auth universe, and a password hashing scheme that was one config leak away from a full credential dump.
The submission pipeline is the critical path of an Online Judge. A student submits code, it goes into a queue, a worker picks it up, sends it to the judge, waits for results, writes them back. Simple in theory. The original Leverage implementation was a custom queue built on Redis Lists — and it had problems that only showed up when things went sideways.
