Field Dispatch · Stack Overflow · #7 · 2026-05-30

Designing a Scalable Code Execution Service (LeetCode-like) for ~20k Users / 1000 Concurrent Users

Tags
postgresql, azure, codesandbox, judge-api
Score
0
Answers
0
Views
15
Answered
No
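Pain point analysis · Published 2026/05/29

The pain point is a preliminary AI-generated summary of the raw upstream evidence; it includes no additional China market research.

Pain point

The user is building a LeetCode-like online code execution platform. The current architecture works at around 100 users, but serious problems appear when scaling to 20,000 total users and 1,000 concurrent users: Docker containers are not reliably cleaned up after creation, so orphaned containers accumulate; the execution failure rate rises as concurrency grows; and resource usage on the Judge VM becomes unpredictable. The user's core task is to keep the code execution service stable, low-latency, and cost-controlled at scale, but the existing approach shows clear friction in container lifecycle management, resource isolation, and cleanup. The result is a heavier operational burden and lower system reliability, and because the existing pattern cannot be reused as-is, the architecture needs to be redesigned.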
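To make the orphaned-container symptom concrete, here is a minimal sketch of the selection step a periodic sweep could use. It is not taken from the question: `TrackedContainer`, `selectOrphans`, and the age threshold are hypothetical names chosen for illustration, and the sketch assumes the judge records when each container was started.

```typescript
// Hypothetical bookkeeping record; the question does not describe how
// (or whether) the judge tracks the containers it starts.
interface TrackedContainer {
  id: string;
  startedAtMs: number; // creation time, in milliseconds since the epoch
}

// Returns the ids of containers that have outlived maxAgeMs.
// A periodic sweep could force-remove these, on the assumption that no
// legitimate execution runs longer than the configured limit.
function selectOrphans(
  containers: TrackedContainer[],
  nowMs: number,
  maxAgeMs: number
): string[] {
  return containers
    .filter((c) => nowMs - c.startedAtMs > maxAgeMs)
    .map((c) => c.id);
}
```

The selection is a pure function, so it can be checked without a Docker daemon; actually removing the selected containers is left out of the sketch.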

§ Dossier

Stack Overflow question

I am building an online code execution platform similar to LeetCode.

Current Architecture

- Fastify Backend
  - Receives requests from users.
  - Sends code execution requests to a separate Judge service.
- Judge Service (Dedicated VM)
  - Receives code, language, and input.
  - Spins up a Docker container for each execution request.
  - Runs the code inside the container.
  - Returns the output.
  - Removes the container after execution.

Current Problem

This architecture works reasonably well for small-scale usage (~100 users), but under higher load I start seeing issues:

- Some Docker containers are created but never removed.
- Container cleanup becomes unreliable.
- Execution failures increase as concurrency grows.
- Resource usage on the Judge VM becomes unpredictable.

Submission Flow

I currently have two execution paths:

1. Run Code API
   - Used when users click "Run".
   - Direct HTTP request to the Judge service.
   - No queue is used because users expect an immediate response.
2. Submit Code API
   - Used when users submit a solution.
   - Request is pushed to a Redis queue.
   - BullMQ workers consume jobs and run test cases asynchronously.

Scaling Goal

I want to scale this system to approximately:

- 20,000 total users
- ~1,000 concurrent users
- Fast response times for the "Run Code" feature
- Cost-efficient infrastructure
- Reliable sandboxing and container cleanup

Questions

1. What architecture do large platforms (e.g., LeetCode, HackerRank, Codeforces) typically use for code execution at scale?
2. Is spinning up a Docker container per request still a good approach at this scale?
3. Should the "Run Code" API also use a queue, or is there a better pattern for low-latency execution?
4. Would Kubernetes be the recommended solution, or are there better alternatives?
5. How should sandbox lifecycle management and cleanup be handled to prevent orphaned containers?
6. What would be a cost-optimized architecture capable of handling ~1,000 concurrent executions?
7. Are there any open-source judge systems or execution architectures worth studying?

Any architecture diagrams, production experience, or recommendations would be greatly appreciated.
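
The create, run, remove lifecycle described in the question can be sketched as follows. This is an illustration under stated assumptions, not the asker's code: the `Sandbox` interface, `Semaphore`, `runOnce`, and `withTimeout` are hypothetical names, and a real `Sandbox` would wrap a Docker client. It shows two things: removal placed in a `finally` block so that it runs even when execution throws or times out, and a fixed cap on concurrent executions.

```typescript
// Hypothetical sandbox abstraction (an assumption, not from the question).
interface Sandbox {
  create(): Promise<string>; // resolves to a container id
  exec(id: string, code: string): Promise<string>; // resolves to program output
  remove(id: string): Promise<void>;
}

// Minimal counting semaphore that caps how many executions run at once.
class Semaphore {
  private waiters: Array<() => void> = [];
  private available: number;

  constructor(permits: number) {
    this.available = permits;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit straight to the next waiter
    else this.available++;
  }
}

// Rejects if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("execution timed out")), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Runs one submission: acquire a permit, create, exec, then always remove.
async function runOnce(
  sandbox: Sandbox,
  limiter: Semaphore,
  code: string,
  timeoutMs: number
): Promise<string> {
  await limiter.acquire();
  let id: string | undefined;
  try {
    id = await sandbox.create();
    return await withTimeout(sandbox.exec(id, code), timeoutMs);
  } finally {
    try {
      // Runs on success, on a thrown error, and on timeout.
      if (id !== undefined) await sandbox.remove(id);
    } finally {
      limiter.release(); // release the permit even if removal itself fails
    }
  }
}
```

In this sketch a failed `remove` still propagates as an error, and nothing here recovers containers left behind when the judge process itself crashes.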

§ Dossier

Question details

View count
15
Answer count
0
Last activity
2026/05/29
Source data · Raw Archive
source
Stack Overflow
upstream_source
stackoverflow
upstream_item_id
79948683
daily_ranking_item_id
94decec2-d041-4b28-90d4-be24b46b6b77
rank_date
2026-05-30
rank
7
name
Designing a Scalable Code Execution Service (LeetCode-like) for ~20k Users / 1000 Concurrent Users
tagline
postgresql, azure, codesandbox, judge-api
votes_count
0
comments_count
0
created_at_on_source
2026-05-29T20:49:30.000Z
topics
postgresql, azure, codesandbox, judge-api
media / source-specific data
{
  "stackoverflow": {
    "score": 0,
    "view_count": 15,
    "is_answered": false,
    "top_answers": [],
    "answer_count": 0,
    "accepted_answer_id": null,
    "last_activity_date": 1780087770
  }
}
raw_payload
{
  "stats": {
    "score": 0,
    "view_count": 15,
    "is_answered": false,
    "answer_count": 0,
    "creation_date": 1780087770,
    "last_edit_date": null,
    "accepted_answer_id": null,
    "last_activity_date": 1780087770
  },
  "api_wrapper": {
    "backoff": null,
    "has_more": true,
    "page_size": 8,
    "quota_max": 300,
    "quota_remaining": 299
  },
  "question_id": 79948683,
  "answer_fetch": {
    "has_more": false,
    "answers_fetched": 0,
    "answer_page_size": 3
  },
  "snapshot_version": "stackoverflow_question_v1"
}
source_raw_snapshot
{
  "id": "fbedd208-158b-4851-8a45-985e88087220",
  "daily_ranking_item_id": "94decec2-d041-4b28-90d4-be24b46b6b77",
  "source": "stackoverflow",
  "external_id": "79948683",
  "fetched_at": "2026-05-29T22:02:13.965Z",
  "question_raw": {
    "body": "<p>I am building an online code execution platform similar to LeetCode.</p>\n<h3>Current Architecture</h3>\n<ul>\n<li><p><strong>Fastify Backend</strong></p>\n<ul>\n<li><p>Receives requests from users.</p>\n</li>\n<li><p>Sends code execution requests to a separate Judge service.</p>\n</li>\n</ul>\n</li>\n<li><p><strong>Judge Service (Dedicated VM)</strong></p>\n<ul>\n<li><p>Receives code, language, and input.</p>\n</li>\n<li><p>Spins up a Docker container for each execution request.</p>\n</li>\n<li><p>Runs the code inside the container.</p>\n</li>\n<li><p>Returns the output.</p>\n</li>\n<li><p>Removes the container after execution.</p>\n</li>\n</ul>\n</li>\n</ul>\n<h3>Current Problem</h3>\n<p>This architecture works reasonably well for small-scale usage (~100 users), but under higher load I start seeing issues:</p>\n<ul>\n<li><p>Some Docker containers are created but never removed.</p>\n</li>\n<li><p>Container cleanup becomes unreliable.</p>\n</li>\n<li><p>Execution failures increase as concurrency grows.</p>\n</li>\n<li><p>Resource usage on the Judge VM becomes unpredictable.</p>\n</li>\n</ul>\n<h3>Submission Flow</h3>\n<p>I currently have two execution paths:</p>\n<ol>\n<li><p><strong>Run Code API</strong></p>\n<ul>\n<li><p>Used when users click &quot;Run&quot;.</p>\n</li>\n<li><p>Direct HTTP request to the Judge service.</p>\n</li>\n<li><p>No queue is used because users expect an immediate response.</p>\n</li>\n</ul>\n</li>\n<li><p><strong>Submit Code API</strong></p>\n<ul>\n<li><p>Used when users submit a solution.</p>\n</li>\n<li><p>Request is pushed to a Redis queue.</p>\n</li>\n<li><p>BullMQ workers consume jobs and run test cases asynchronously.</p>\n</li>\n</ul>\n</li>\n</ol>\n<h3>Scaling Goal</h3>\n<p>I want to scale this system to approximately:</p>\n<ul>\n<li><p>20,000 total users</p>\n</li>\n<li><p>~1,000 concurrent users</p>\n</li>\n<li><p>Fast response times for the &quot;Run Code&quot; feature</p>\n</li>\n<li><p>Cost-efficient infrastructure</p>\n</li>\n<li><p>Reliable sandboxing and container cleanup</p>\n</li>\n</ul>\n<h3>Questions</h3>\n<ol>\n<li><p>What architecture do large platforms (e.g., LeetCode, HackerRank, Codeforces) typically use for code execution at scale?</p>\n</li>\n<li><p>Is spinning up a Docker container per request still a good approach at this scale?</p>\n</li>\n<li><p>Should the &quot;Run Code&quot; API also use a queue, or is there a better pattern for low-latency execution?</p>\n</li>\n<li><p>Would Kubernetes be the recommended solution, or are there better alternatives?</p>\n</li>\n<li><p>How should sandbox lifecycle management and cleanup be handled to prevent orphaned containers?</p>\n</li>\n<li><p>What would be a cost-optimized architecture capable of handling ~1,000 concurrent executions?</p>\n</li>\n<li><p>Are there any open-source judge systems or execution architectures worth studying?</p>\n</li>\n</ol>\n<p>Any architecture diagrams, production experience, or recommendations would be greatly appreciated.</p>\n",
    "link": "https://stackoverflow.com/questions/79948683/designing-a-scalable-code-execution-service-leetcode-like-for-20k-users-100",
    "tags": [
      "postgresql",
      "azure",
      "codesandbox",
      "judge-api"
    ],
    "owner": {
      "link": "https://stackoverflow.com/users/27308944/jnanesh",
      "user_id": 27308944,
      "user_type": "registered",
      "account_id": 35617201,
      "reputation": 1,
      "display_name": "Jnanesh",
      "profile_image": "https://www.gravatar.com/avatar/bc5ab54c3e7477be3f5e77a05d0f1b86?s=256&d=identicon&r=PG&f=y&so-version=2"
    },
    "score": 0,
    "title": "Designing a Scalable Code Execution Service (LeetCode-like) for ~20k Users / 1000 Concurrent Users",
    "view_count": 15,
    "is_answered": false,
    "question_id": 79948683,
    "answer_count": 0,
    "creation_date": 1780087770,
    "content_license": "CC BY-SA 4.0",
    "last_activity_date": 1780087770
  },
  "answers_raw": [],
  "tags_raw": [
    "postgresql",
    "azure",
    "codesandbox",
    "judge-api"
  ],
  "stats_raw": {
    "score": 0,
    "view_count": 15,
    "is_answered": false,
    "answer_count": 0,
    "creation_date": 1780087770,
    "last_edit_date": null,
    "accepted_answer_id": null,
    "last_activity_date": 1780087770
  },
  "selection_meta": {
    "site": "stackoverflow",
    "api_wrapper": {
      "backoff": null,
      "has_more": true,
      "page_size": 8,
      "quota_max": 300,
      "quota_remaining": 299
    },
    "answer_fetch": {
      "backoff": null,
      "has_more": false,
      "answers_fetched": 0,
      "quota_remaining": 272,
      "answer_page_size": 3
    },
    "snapshot_version": "stackoverflow_question_v1",
    "selection_strategy": "tag_whitelist_unanswered_high_score_recent_active"
  },
  "created_at": "2026-05-29T22:02:14.151Z",
  "updated_at": "2026-05-29T22:02:14.151Z"
}