Scaling a live trivia game to half a million concurrent users

4 min read · Jan 27, 2019

TL;DR: This post describes the high-level architecture for scaling a live trivia game show to 500,000 concurrent users.

I have been working at Swoo, a company that builds live interactive games, for the past year.

Live games and live game shows are hard to scale. The problems we faced while scaling up:

Sudden incoming traffic at the start of the game
Showing accurate, live game stats
Handling and processing a huge number of concurrent requests

I have worked on five different live games and built an architecture that can easily handle 250,000+ concurrent users.

This post shares the high-level architecture and takeaways for building a live trivia game show that can scale to half a million concurrent users.

What is live Trivia?

Trivia is a live, interactive, KBC-style (Kaun Banega Crorepati) game show with a host. The host asks questions, and users choose an answer from a list in the Android/iOS app within a limited time.
The winners get prize money.

Display question and Display statistics screens

How does live Trivia work?

The host asks a question, and the user is presented with the question and multiple options (typically three) to choose the right answer from.
The user gets 10 seconds to choose the right answer from the presented options.
After 10 seconds, the user either moves on to the next question (if they chose the correct answer) or is eliminated.
Once the 10 seconds are up, the user immediately sees stats for the question on screen. Stats include the percentage of submitted answers for each option, plus the correct answer.
This continues until all the questions are finished, typically 10 questions per show.
Prize money is distributed among the users who answered all 10 questions correctly.
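
To make the rules above concrete, here is a minimal Python sketch of the per-question stats, the elimination step, and the prize split. The function names and the `answers` mapping (user id to chosen option) are hypothetical, and splitting the prize pool equally is an assumption; the rules above only say the money is distributed among the winners.

```python
from collections import Counter


def answer_stats(answers, options, correct_option):
    """Percentage of submitted answers per option, plus the correct option."""
    counts = Counter(answers.values())
    total = sum(counts[option] for option in options)
    percentages = {
        option: round(100 * counts[option] / total, 1) if total else 0.0
        for option in options
    }
    return {"percentages": percentages, "correct": correct_option}


def survivors(answers, correct_option):
    """Users who chose the correct option move on; everyone else is eliminated."""
    return {user for user, choice in answers.items() if choice == correct_option}


def prize_per_winner(prize_pool, winners):
    """Equal split of the pool (an assumption; the split rule is not specified)."""
    return prize_pool / len(winners) if winners else 0.0
```

In a real show these numbers would come from the aggregated answer counts rather than from an in-memory dictionary, but the arithmetic is the same.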

There are more than 10 variations of this game in Swoo, but I am sticking to the above rules for simplicity's sake.

Components

The client and the game operator are the two major components.

Client: The user playing the game on web/app. Each client has two main states:

Display Question. The user is shown a question and the answer options
Display Statistics. Stats are shown for each answer option

Game operator: The person or job that sends signals telling the client which state to show next.
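
The two client states and the operator's role can be sketched as a tiny state sequence. This is only an illustration: the tuple representation of a state and the names `game_states` and `client_screen` are hypothetical, not Swoo's actual code.

```python
def game_states(total_questions=10):
    """Yield the states the game operator steps through, in order."""
    for number in range(1, total_questions + 1):
        yield ("display_question", number)
        yield ("display_statistics", number)


def client_screen(state):
    """Map a game state to what the client should render."""
    kind, number = state
    if kind == "display_question":
        return f"Question {number}: show question and answer options"
    if kind == "display_statistics":
        return f"Question {number}: show stats for each answer option"
    raise ValueError(f"unknown game state: {state!r}")
```

The operator only ever moves forward through this sequence, and each client just renders whatever state it last learned about.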

Game flow

High level architecture

Clients poll the server every x seconds (5 seconds here) to get the next game state.
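
A minimal sketch of such a polling loop, written in Python for illustration (the real clients are Android/iOS/web apps). `fetch_state` and `on_change` are hypothetical callables standing in for the HTTP call to the polling service and the UI update; `max_polls` and `sleep` exist only so the loop can be exercised without waiting.

```python
import time


def poll_game_state(fetch_state, on_change, interval_seconds=5.0,
                    max_polls=None, sleep=time.sleep):
    """Poll for the current game state and react only when it changes."""
    last_state = None
    polls = 0
    while max_polls is None or polls < max_polls:
        state = fetch_state()
        if state != last_state:
            on_change(state)  # e.g. render the question or the stats screen
            last_state = state
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
    return last_state
```

Because the client acts only on a change, repeated polls that return the same state cost a cheap request and nothing else.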

The game operator changes the game state (display question/display statistics). Let’s go with game state “display question1”.
The gaming service accepts requests from the game operator and writes the game state to Redis. In this case, the game state in Redis is changed to “display question1”.
The client polls the polling service continuously to check the current state. If the state has changed, the client takes action based on the new state.
The polling service returns the new state “display question1”, and the client displays the question on screen with the options to choose from. The polling service is built on Kong (a Lua application running on Nginx via OpenResty), which can talk to Redis out of the box. Whenever the state changes, a Redis pub/sub event is triggered. We wrote a simple Lua cache loader script that reads the event and updates a global cache of the game state for all the workers to consume. Each worker's cache is 8 KB, and the global cache shared by all workers is 64 MB.
The client submits an answer by selecting one of the options, and an answer submission request is sent to the gaming service.
The gaming service receives the answer submission request and adds an event to a Kafka topic.
Kafka brokers are set up to receive answer submission events from all gaming service hosts.
A Flink cluster consumes event batches from Kafka for processing and updates multiple data stores. The current system at Swoo doesn’t have Flink set up, as we have seen a maximum of 100K concurrent users. Kafka Streams serves well if your live concurrent users are below 100K.
ScyllaDB receives the user answer submission data and persists it.
Flink also updates Redis with the answer count for each option after processing events.
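
The cache loader idea from the polling-service step can be sketched as follows. This is a Python stand-in, not the Lua script we run: `InMemoryPubSub` replaces Redis pub/sub so the example runs without a server, and the class names and the `game_state` channel name are assumptions made for illustration.

```python
import threading


class InMemoryPubSub:
    """Stand-in for Redis pub/sub so the sketch runs without a Redis server."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, channel, callback):
        self._subscribers.setdefault(channel, []).append(callback)

    def publish(self, channel, message):
        for callback in self._subscribers.get(channel, []):
            callback(message)


class GameStateCache:
    """Shared cache that workers read instead of querying Redis per request."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = None

    def load(self, new_state):
        # Called by the cache loader whenever a state-change event arrives.
        with self._lock:
            self._state = new_state

    def current_state(self):
        with self._lock:
            return self._state


def handle_poll(cache):
    """What a polling-service worker would return to a client."""
    return {"state": cache.current_state()}


bus = InMemoryPubSub()
cache = GameStateCache()
bus.subscribe("game_state", cache.load)
bus.publish("game_state", "display question1")
print(handle_poll(cache))  # {'state': 'display question1'}
```

The point of the design is that a poll never touches Redis: state changes are pushed into the cache once, and every poll is answered from memory.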

The cycle repeats for the next game state, “Display stats for q1”, and continues until all the questions are completed.
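
The answer-submission path (gaming service to Kafka to stream processor to ScyllaDB and Redis) can be sketched with in-memory stand-ins. Everything here is illustrative: a `deque` plays the Kafka topic, a dict plays ScyllaDB, a `Counter` plays the Redis counters, and the rule that only a user's first answer per question counts is an assumption, not something taken from the production system.

```python
from collections import Counter, deque


def submit_answer(queue, user_id, question_id, option):
    """Gaming service: append an answer event to the queue (Kafka stand-in)."""
    queue.append({"user_id": user_id, "question_id": question_id, "option": option})


def process_batch(queue, batch_size, answer_store, option_counts):
    """Stream processor: drain up to batch_size events, persist and count them."""
    processed = 0
    while queue and processed < batch_size:
        event = queue.popleft()
        processed += 1
        key = (event["question_id"], event["user_id"])
        if key in answer_store:
            continue  # Assumption: only a user's first answer per question counts.
        answer_store[key] = event["option"]  # ScyllaDB stand-in
        option_counts[(event["question_id"], event["option"])] += 1  # Redis stand-in
    return processed
```

Decoupling submission from processing this way is what lets the gaming service absorb the 10-second burst of answers: it only appends events, and the counting happens downstream in batches.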

Infra:

Infrastructure details for 150K+ concurrent users. You can scale your infrastructure using the data below as a benchmark, together with your estimated traffic. We run entirely on AWS.

Kong (load balancer): 5 instances of c5.xlarge, 4 cores, 8 GB RAM
Polling service: 5 instances of c4.xlarge, 4 cores, 8 GB RAM
Gaming service: 14 instances of c5.large, 2 cores, 4 GB RAM
Kafka: 4 instances of t2.medium, 2 cores, 4 GB RAM
Flink: 4 instances of m4.xlarge, 4 cores, 16 GB RAM
Redis: 2 cache.m3.2xlarge machines (primary and secondary), 8 cores, 33.8 GB RAM
ScyllaDB: 3 instances of i2.xlarge, 4 cores, 30.5 GB RAM
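
As a rough way to use these numbers, the polling load can be estimated from the user count and the poll interval. This is a back-of-the-envelope sketch, assuming every client polls exactly once per interval and requests are spread evenly across instances; it ignores answer submissions and traffic spikes, and `polling_load` is a hypothetical helper.

```python
def polling_load(concurrent_users, poll_interval_seconds, instances, cores_per_instance):
    """Return (total, per-instance, per-core) polling requests per second."""
    total_rps = concurrent_users / poll_interval_seconds
    per_instance_rps = total_rps / instances
    per_core_rps = per_instance_rps / cores_per_instance
    return total_rps, per_instance_rps, per_core_rps


# 150K users polling every 5 seconds across 5 polling instances with 4 cores each
print(polling_load(150_000, 5, 5, 4))  # (30000.0, 6000.0, 1500.0)
```

By the same arithmetic, 500,000 users polling every 5 seconds produce about 100,000 polling requests per second, which gives a starting point for sizing the polling tier.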

Feel free to comment if you have any questions. When I get time, I will add some benchmark graphs for latencies, requests, and CPU.