EduHam — Coding Learning Platform from Zero to Scale
A coding learning platform built from scratch in twelve months. Custom multi-language sandbox, instructor authoring tools, real-time classrooms, and a closed-loop analytics stack — running 5,000 concurrent learners by handover.
The Challenge
A B2C learn-to-code platform with two product modes — self-paced courses and instructor-led cohorts — and one hard requirement: every lesson ends in code the learner actually runs, not a multiple-choice question. Over twelve months we built the platform from an empty repository: identity, courses, lesson authoring, quiz and assignment builders, a multi-language code-execution sandbox, real-time classrooms, billing, and a closed-loop analytics pipeline. By month twelve the platform served 5,000 concurrent learners across ten UI languages and executed 50,000+ code submissions a day.
Four Builder Surfaces, One Platform
Every course is assembled from four kinds of unit. Treating them as separate apps would have given instructors four logins, four content models, and four analytics surfaces. Treating them as one platform required a shared content model and a single submission pipeline — the choice paid back the first time an instructor dropped a code assignment into the middle of a live classroom in three clicks, not a release.
| Dimension | Lesson Builder | Quiz Builder | Assignment Builder | Live Classroom |
|---|---|---|---|---|
| Primary author | Curriculum lead | Instructor | Instructor | Instructor |
| Learner surface | Self-paced reader | Self-paced or cohort | Self-paced or cohort | Cohort only |
| Sandbox required | Optional (snippets) | Optional (code questions) | Always | Always |
| Completion measure | Time-on-page + scroll depth | Score + attempts | Tests passed | Attendance + submissions |
When the curriculum team launched an early language track in month seven, they assembled it from existing unit pools in eleven business days — no new components shipped.
Solution Architecture
Microservice decomposition
A handful of Laravel services with hard domain boundaries. Most own their own Postgres database; tightly-coupled domains share a connection rather than forcing a boundary that was not real. Each service exposes a versioned contract — HTTP for synchronous calls, a message queue for async paths.
One frontend, three surfaces
A single Vue (TypeScript) codebase split into three surfaces — learner app, instructor authoring, operator dashboard — sharing the same component library and i18n bundle. Pusher carries real-time presence, live submissions, and instructor notifications.
Isolated code sandbox
The sandbox is a separate domain on its own Kubernetes node pool. A Go orchestrator schedules submissions onto pre-warmed language containers — seccomp-locked, egress-denied, ephemeral rootfs. Ten programming languages by month twelve, p99 cold start under 500 ms.
Closed-loop analytics
Managed change-data-capture streams operational databases into a columnar warehouse. The data team and instructors query learner behaviour, completion curves, and content health on near-real-time dashboards. Reporting is live views with notification triggers — not a morning email.
Domain microservices, one Vue frontend, an isolated sandbox node pool, and a CDC→warehouse analytics spine.
Code Sandbox: Request to Result in Under 500 ms p99
The sandbox is the load-bearing piece of the product. A learner who waits four seconds for their first code run drops; one who waits 200 ms keeps going. Four stages run in sequence on the orchestrator's hot path, each with its own latency budget.
Validate
Auth, rate limits, payload size, language whitelist, lightweight syntax pre-check. Hard cut on obvious abuse.
Schedule
Pick a pre-warmed container from the language-specific pool, attach the submission rootfs, apply the security profile. No image pull on the hot path.
Execute
Run the learner's code with a wall-clock timeout, memory cap, CPU quota, and egress-denied network namespace. The hot pool keeps the cold-start tail off the p99.
Capture and score
Stream stdout/stderr, diff against expected output for assignments, persist artifacts, emit a submission event to the analytics bus.
The cheapest win was deleting the image-pull step from the hot path: a 3.2-second median cold start in week six became a 90 ms hot start by month nine.
Learner Funnel — Month 3 MVP vs Month 12 at Scale
The MVP shipped at month six with a single language, no real-time, and a single UI language. Month twelve served ten programming languages, real-time classrooms, and instructor-built assignments. The funnel compares cohorts of 1,000 newly signed-up learners on the month-3 MVP against the month-12 production platform — same channel, same onboarding email.
The largest single lift came from cutting the first-code-run delay. When the median time from signup to first successful code execution dropped from 4 minutes 12 seconds to 38 seconds, the day-7 retention curve moved with it.
Technical Stack
Customer + operator UI
One frontend codebase for learner, instructor, and operator surfaces
Domain services
Domain-rich CRUD that ships weekly across identity, courses, billing, notifications
Sandbox orchestrator
Submission scheduler in Go for predictable latency; seccomp-locked, egress-denied containers per submission
Real-time
Managed because the team was small and the SLA was tight
Analytics
Bought instead of built — connector quality and schema-drift handling outpaced what an eight-person team could maintain
Infrastructure
Burst capacity for cohort launches; the sandbox node pool runs under tighter isolation than general workloads
Key Technical Decisions
Custom sandbox over a hosted runner
Tradeoff: Six engineering weeks of upfront work and a node pool to operate from day one
Why: Hosted runners either capped at three languages we cared about, charged per-execution at a multiple of our own pool cost at scale, or sat behind a public egress we could not lock down. Owning the sandbox kept p99 cold start under 500 ms, opened the door to ten languages within twelve months, and gave us a single security profile to reason about.
Pusher Channels over a self-hosted WebSocket fleet
Tradeoff: Per-message vendor cost; channel sharding strategy had to be designed early
Why: Real-time would have eaten two engineers for three months. Pusher gave us sub-100 ms fan-out on day one. Channel namespaces by tenant kept the vendor cost linear with paying cohorts, not with total signups.
Managed CDC + warehouse over building our own ETL
Tradeoff: Vendor cost on the connector; less control over edge-case schema changes
Why: An in-house ETL would have absorbed a data engineer for six months and produced a worse result. The data team shipped instructor dashboards in month nine instead of month fifteen.
Implementation Timeline
Discovery, architecture, core hires
Domain map, service boundaries, hiring loop designed. First three engineers on board by end of month two.
Identity, courses, authoring shell
Laravel skeleton for identity and courses, Vue shell for instructor authoring, first lesson rendered end-to-end.
MVP — sandbox v1, quiz, assignment, billing
Code sandbox v1 in production with two languages, quiz and assignment builders shipped, Stripe live. Private beta opens at end of month six.
Real-time classrooms, i18n to 10 UI languages
Pusher integration for classrooms, first cohort runs, i18n bundle to 10 UI languages, more programming languages added.
Analytics warehouse, instructor dashboards
Managed CDC into the warehouse, learner-event schema modelled, near-real-time instructor and CRO dashboards.
Scale hardening and handover
Channel sharding by tenant, sandbox pool sized for peak, observability hardened, on-call handed to the in-house team. Platform exits at 5,000 concurrent and steady.
Challenges and How We Solved Them
The Problem
First sustained-load stress at ~3,000 concurrent. A paid-acquisition push in month eight ran longer than the sandbox pool had been sized for, and the real-time fan-out plus Postgres read pressure on the courses service hit at the same fifteen-minute window.
Approach
Sharded real-time channels by tenant so one noisy classroom could not starve the fleet, added read replicas behind the courses service, moved hot reads for course catalogues into Redis, and resized the sandbox pre-warm pool to track the previous hour's submission rate.
Outcome
Latency stayed inside SLA for the remaining 90 minutes of the campaign. The same architecture absorbed 5,000 concurrent two months later without a second incident. The catalogue cache cut p95 lesson-load from 1.4 s to 320 ms.
The Problem
Sandbox abuse — shortly after opening a lower-level language runner, two learners tested the boundaries. One ran an outbound port scanner from inside the container; another launched a long-lived background process that consumed compute for forty minutes before the wall-clock budget killed it.
Approach
Locked the egress namespace to DNS plus the internal package mirror, tightened the security profile to deny raw sockets, and added a per-container CPU monitor that auto-kills jobs whose wall-clock-to-CPU ratio matches a sustained-abuse profile.
Outcome
Zero confirmed escape attempts since. The abuse-detection rule has fired three times in nine months — all genuine. The egress-denied default became a documented platform invariant.
Numbers That Moved
5,000+
Concurrent learners (peak)
was 0 ↑
50,000+ / day
Code submissions executed
was 900 / day ↑
480 ms
Sandbox cold start, p99
was 3.2 s ↓
10
Programming languages supported
was 1 ↑
10
UI languages shipped
was 1 ↑
41%
Day-7 learner retention
was 28% ↑
22 min
MTTR on production incidents
was 3h 40min ↓
8
Engineers on the team
was 0 ↑
Engagement Team
Lessons Learned
Designing for scalability on day one is cheaper than retrofitting it in month nine. The architectural decisions made before the second engineer joined paid back every time the platform absorbed a step-change in concurrent learners.
Observability is non-negotiable in a code-execution platform. Every container had structured telemetry from the first commit; sandbox abuse incidents were caught and closed in days, not weeks.
Analytics velocity is bought, not built, at this stage. Managed CDC plus a managed warehouse replaced six months of platform engineering — the data engineer instead shipped dashboards the curriculum team used every day.
More cases
View all cases
FinCue
FinCue — Consumer Lending Platform across Two Markets
ObjectFirst