Powering GCC backend with Apache Kafka

Prathamesh Mane
7 min readDec 26, 2021

About Global Coding Challenge

Each year, Credit Suisse organizes the Global Coding Challenge (GCC), an online coding competition between participants across the globe. Over roughly three weeks, participants attempt solutions to nine coding problems and can improve their code as many times as they like during the competition. The top performers across the regions win exciting prizes and also have a chance to join Credit Suisse as a Technical Analyst (TA).

The competition has been entirely designed, built, and run by Credit Suisse TAs. As a former TA myself, I work with my colleagues to improve the participants’ overall experience year on year.

Initial Architecture

Once a participant signs up for the competition, a private git repository is allocated to them, and they commit code to that repository only. Once the code is pushed, we capture the commit via git hooks, run the code in a sandboxed environment, and calculate the ranked score of the submission. An overview of the submission process is shown below —

GCC: Code Submission Flow

Along with code evaluation, we have multiple services to manage. Some of the critical ones include —

  1. User Management Service — supports the user management lifecycle: sign-ups, logins, profiles, etc.
  2. Evaluator Service — runs the user-submitted code in a sandboxed environment and calculates the score.
  3. Support Service — manages support queries raised by participants.
  4. Mailing Service — sends competition-related updates and results.
  5. Cache Service — serves the most frequently accessed data.

For the first two years, when we were mostly focused on bringing core features to users, the system architecture looked like this —

GCC: Initial Design of the Backend System

This architecture worked pretty well during the initial phase, when we had only three regions and participation in the few thousands. But in the second edition, thanks to the popularity of the first season and the addition of new regions, we hit full throttle. The system reached its absolute capacity, and we were certain we needed to rethink the architecture to support future scale.

Solution

The day-one solution had many point-to-point connections between the services, which caused backpressure on other services whenever one dependent system was overloaded with user requests. One observation we made is that user submissions spike after 7 PM IST: most participants are from India, and this is when universities close for the day and students get free time to work on the challenges. Hence we had to be prepared to handle sudden spikes in submissions.

Secondly, the challenge requirements are mostly suited to an asynchronous architecture. Consider the evaluator service as an example. The service is triggered only when a user commits code to the git repo or the online code editor, and once the code is evaluated, we either send an email with the results or show them in the code editor. This can definitely be processed asynchronously.
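To make that hand-off concrete, here is a minimal, illustrative sketch of how a git-hook handler could publish a submission event and return immediately, leaving evaluation to happen later. The topic name matches the one used in the architecture described below; the class name, payload shape, and producer settings are assumptions for illustration, not our actual code.

```java
// Illustrative sketch: publishing a submission event asynchronously.
// The topic name "evaluator-submission" is from the architecture below;
// the payload format and class name are assumptions.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SubmissionPublisher {

    private final KafkaProducer<String, String> producer;

    public SubmissionPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        this.producer = new KafkaProducer<>(props);
    }

    /** Called from the git-hook handler; returns immediately, evaluation happens later. */
    public void publish(String submissionId, String payloadJson) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("evaluator-submission", submissionId, payloadJson);
        // Asynchronous send: the git-hook request is not blocked on evaluation.
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // In a real service this would be retried or alerted on.
                exception.printStackTrace();
            }
        });
    }
}
```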

The third aspect, not as important as the above but crucial for a smooth user experience, is caching the most frequently accessed data. For a mammoth coding event like this, what could be the most accessed information? You guessed it correctly: the submission results. This data is consumed by multiple services, such as the frontend, leaderboards, and email services. And as those services are distributed across multiple systems, the cache can’t sit inside a single service; we need a distributed solution.

Considering the above factors, we need a solution that satisfies the following criteria —

  1. High availability — The system should remain available through partial failures and while rolling out new updates.
  2. Decoupled producers and consumers — Producers and consumers share information asynchronously.
  3. Independent scaling — Producers and consumers can be scaled up and down independently.
  4. Message replay — The ability to replay messages later.
  5. Reads by multiple consumers — Consumers from different groups can read the same data independently.
  6. Load balancing between consumers — Consumers in the same group must read mutually exclusive sets of data.

Kafka matches these requirements because —

  1. Kafka is a distributed commit-log messaging system, so it is highly available by design, duplicating each message across multiple replicas.
  2. As Kafka acts as a message queue, producers can independently send messages to Kafka topics, and consumers can read them any time they want.
  3. This allows producers and consumers to scale independently, so slow consumers cannot backpressure producers into halting their operations.
  4. Kafka can act as a durable store for messages. You can set the cleanup policy to prune messages after a certain amount of time, or to keep only the latest value per key (see the topic-configuration sketch below).
  5. Consumer-group semantics allow clients from different consumer groups to read the same messages independently.
  6. The same semantics allow clients within one consumer group to load-balance the data exclusively between them.
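As a rough illustration of point 4 (and the replication mentioned in point 1), much of this behaviour is driven by topic-level configuration. The sketch below shows how topics with time-based retention and log compaction could be created with Kafka’s AdminClient; the partition counts, replication factor, and retention period are assumptions, not our production values.

```java
// Sketch: creating topics with retention and compaction settings.
// Topic names come from this post; partition/replication numbers are assumptions.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Time-based retention: submission events can be replayed for up to 7 days.
            NewTopic submissions = new NewTopic("evaluator-submission", 12, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(7L * 24 * 60 * 60 * 1000)));

            // Log compaction: keep only the latest score per submission key.
            NewTopic scores = new NewTopic("score-result", 12, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                            TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(List.of(submissions, scores)).all().get();
        }
    }
}
```

A compacted topic keeps only the latest record per key, which is exactly what a “latest score per submission” view needs.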

The new approach we used for GCC 3.0 is shown below —

The details of each component are —

  1. Github (Submission) Controller — The producer responsible for sending user-submitted code to the Kafka topic named “evaluator-submission”. We allow test submissions (which are not counted toward the final score and run against a small number of test cases) and real submissions (where an exhaustive set of test cases is executed against the submission), so the producer has to cater to both submission types. The actual handling of the messages is out of scope for this article.
  2. Evaluator Service — The service responsible for running the submitted code in a sandboxed environment and collecting metrics such as total execution time, number of test cases passed, and amount of memory used, to name a few. The result is then pushed to the “evaluator-result” topic. During peak times we receive hundreds of requests per minute, and each submission takes on average ~1.2 minutes to execute, so we run 40–60 processors in parallel. The number of consumers in the same group that Kafka can load-balance is capped by the number of partitions of the input topic, so we configured the input topic’s partition count accordingly.
  3. Submission Controller — A stream processor that reads from the “evaluator-result” topic, applies the scoring function, and writes to the “score-result” topic. This topic is read back to form a KTable that stores the latest submission and its score. The table is then exposed to multiple services, such as the leaderboard, via Interactive Queries, and we use it as a distributed cache to serve read requests for submissions. This is a plus for us, as we don’t have to bring in another caching library (a minimal sketch of this processor follows this list).
  4. Score Persistor — A simple microservice that reads from the “score-result” topic and stores the records in MongoDB. This part could be replaced with a MongoDB sink connector for Apache Kafka. The persisted data is then used for reporting and other purposes.
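For the Submission Controller (item 3), a minimal Kafka Streams sketch of the read–score–write flow and the queryable KTable might look like the following. The topic names come from this post; the scoring function, the serdes (plain JSON strings here), and the store name are placeholders, not our actual implementation.

```java
// Sketch of the Submission Controller topology. Topic names are from the
// architecture above; applyScore(), the value types, and "score-store" are placeholders.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

import java.util.Properties;

public class SubmissionController {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "submission-controller");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // 1. Read raw evaluation results and apply the scoring function.
        KStream<String, String> results =
                builder.stream("evaluator-result", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> scored = results.mapValues(SubmissionController::applyScore);

        // 2. Write scores out for downstream consumers (persistor, mailing, etc.).
        scored.to("score-result", Produced.with(Serdes.String(), Serdes.String()));

        // 3. Materialize the latest score per submission as a queryable KTable.
        KTable<String, String> latestScores = builder.table(
                "score-result",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("score-store")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // 4. Interactive Queries: other services read the store instead of a separate cache.
        //    (Callable once the instance has reached the RUNNING state.)
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("score-store",
                        QueryableStoreTypes.keyValueStore()));
        // e.g. store.get(submissionId) behind a small HTTP endpoint for the leaderboard.

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static String applyScore(String evaluationJson) {
        // Placeholder: parse test-case results and compute the ranked score.
        return evaluationJson;
    }
}
```

Interactive Queries let each instance serve reads from its local portion of the materialized table, which is what allows the KTable to double as a distributed cache without pulling in an extra caching library.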

The above architecture allowed us to scale our system to a much greater load without any hassle. This year, a total of 25,000+ students across 7 regions signed up for the contest. During the competition days, we saw traffic of 100,000+ messages flowing through the Kafka cluster, and we scaled the evaluator service between a few tens of instances and a maximum of 80 running in parallel. So, all in all, we can say we are heading down the right path :)

Key Learnings and Way Forward

If I said we had found the perfect secret sauce for our problem, it would be a total lie. While observing the system over the course of the competition, we noted a few issues that we will try to mitigate next year.

  1. Head-of-line blocking — It only takes a small number of slow requests to hold up the processing of subsequent ones. As an evaluation engine, we allow participants to submit code in multiple languages like C, C++, Java, Python, etc. Solutions in some languages, such as JavaScript, tend to take more time to execute than others, so we set a larger timeout interval for JavaScript evaluations. The “evaluator-submission” topic is keyed by submission-id, and Kafka’s default partitioner hashes the key, takes it modulo the number of partitions, and assigns the message to that partition. Hence a partition with JS submissions at its head starved much faster submissions queued behind them, and our overall response time suffered. The good part is that we had very few JS submissions (a few tens). As a solution, we are planning to write a custom partitioner that routes slow-executing submissions to a dedicated partition (a sketch follows this list).
  2. Managing persistence of state stores — We use Heroku to host our services and Apache Kafka. Heroku’s containers, called dynos, do not provide persistent storage by default, so to persist the stream processor’s state stores we would have to buy either S3 storage or an NFS solution. A simpler approach is to use standby replicas of the stream processor. Standby tasks are replicas of stream tasks that maintain fully replicated copies of state, allowing a restarted dyno to resume work immediately instead of waiting for the state to be rebuilt from changelog topics. Rebuilding the state can take 15–20 minutes given that we serve ~100k messages per day, which is not workable as Heroku requires dynos to come up within 180 seconds. Hence enabling standby tasks by setting the num.standby.replicas config in the application seems the way forward.
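For the head-of-line problem, a custom producer partitioner along these lines is the kind of thing we have in mind. The sketch below pins known-slow submissions (JavaScript in this example) to the last partition and hashes everything else across the remaining ones; the language-detection helper and the partition layout are assumptions for illustration.

```java
// Sketch of a language-aware partitioner: route slow languages to a dedicated
// partition so they cannot block faster submissions. isSlowLanguage() and the
// "last partition" layout are assumptions, not a finalized design.
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

import java.util.Map;

public class LanguageAwarePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        Integer count = cluster.partitionCountForTopic(topic);
        int numPartitions = (count == null) ? 1 : count;
        if (numPartitions <= 1) {
            return 0;
        }
        // Pin slow-executing submissions (e.g. JavaScript) to the last partition.
        if (isSlowLanguage(value)) {
            return numPartitions - 1;
        }
        // Everything else keeps the default behaviour: hash of the key,
        // spread over the remaining partitions.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    private boolean isSlowLanguage(Object value) {
        // Placeholder: in practice, inspect the submission payload or headers.
        return value != null && value.toString().contains("\"language\":\"javascript\"");
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

Such a partitioner would be registered on the producer via the partitioner.class setting. The standby-task change, by contrast, is a single configuration on the stream processor: setting num.standby.replicas (for example, to 1) keeps a warm copy of each state store on another instance.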
