Scaling live trivia game to half million concurrent users
TL;DR. This post talks about high level architecture to scale a live trivia game show to 500,000 concurrent users
I have been working with Swoo for the past one year, a company that builds live interactive games.
Live games and live game shows are a hard problem to solve. Problems we faced scaling up:
- Sudden incoming traffic at the start of the game
- Showing accurate, live game stats
- Handling and processing huge number of concurrent requests
I have worked on five different live games and built architecture that can easily handle 250,000+ concurrent users.
Sharing the high level architecture/take-aways to build a live trivia game show which can scale to half a million concurrent users.
What is live Trivia?
Trivia is live interactive KBC style game show with a host. The host asks questions and the users can choose the answer from a list in android/iOS app within a limited time.
The winners get prize money.
How does live Trivia work?
- Host asks a question and user is presented with question and multiple options(typically three) to choose right answer from.
- User gets 10 seconds to choose right answer from presented options.
- Post 10 seconds, the user is eliminated or moves to the next question by choosing the correct answer.
- For every question/answer, user can see stats related to the question immediately on the screen post 10 seconds. Stats include % of submitted answers to each of the presented options + correct answer
- This continues till all the questions are finished. Typically 10 questions per show
- Prize money is distributed among the users who answered all 10 questions correctly.
There are more than 10 variations to this game in Swoo but sticking to the above rules for simplicity sake.
Client and Game operator are the two major components
Client: User playing the game on web/app. Each client has two main states
- Display Question. User is shown a question + answer options
- Display Statistics. Stats to be shown for each answer
Game operator: Person/Job which sends signals to show next state to the client.
High level architecture
Clients poll server every x-seconds(5 secs here) to get next game state
- Game operator changes the game state(display question/display statistics). Let’s go with game state “display question1”.
- Gaming service accepts requests from game operator and updates the game state to Redis. In this case, game state in Redis is changed to “display question1”
- Client polls polling service continuously to check current state. If there is state change then client takes action based on new state.
- Polling service returns new state “display question1” and client displays question on the screen with options to select right answer. Polling service is built on Kong (which is Lua application over nginx). Nginx support Redis out of box. When ever there is state change, a Redis pub-sub event is triggered. We wrote simple Lua cache loader script which reads event and updates global cache of game state for all the workers to consume. Each worker has cache is of size 8KB and Global cache of all workers is of size 64MB.
- Clients submits answer to the question by selecting one of the options and answer submission request is sent to gaming service
- Gaming service receives answer submission request and adds event to Kafka topic
- Kafka brokers are setup to receive answer submission events from all gaming service hosts.
- Flink cluster consumes event batch from Kafka for processing and updates multiple sources. The current system at Swoo doesn’t have Flink setup as we have seen max 100K concurrent users. Kafka Streams serves well if your live concurrent users are below 100K.
- SycllaDB receives user answer submission data and persists it.
- Flink also updates Redis with answer count for each option after processing events.
Cycle is repeated for next game state “Display stats for q1” and repeated until total number of questions are completed.
Infrastructure details for 150K+ concurrent users. You can scale infra using below data as benchmarks and estimated traffic. We run completely on AWS.
- Kong (Load balancer): 5 instances of c5.xlarge, 4 cores, 8gb ram.
- Polling Service: 5 instances of c4.xlarge, 4 cores, 8gb RAM
- Gaming Service: 14 instances of c5.large, 2 cores, 4gb RAM
- Kafka: 4 instances of t2.medium, 2 cores, 4gb RAM
- Flink: 4 instances of m4.xlarge, 4 cores, 16gb RAM
- Redis: 2 cache.m3.2xlarge machines(primary and secondary), 8 cores, 33.8 gb RAM
- SycllaDB: 3 instancies of i2.xlarge, 4 cores, 30.5gb RAM
Feel free to comment if you have any questions. When I get time I will add some benchmark graphs w.r.t latencies, requests and CPU.