Incident report - April 16th
On April 16th at 3:20 PM IST, Fermion's backend performance was severely degraded for about 35 minutes. This impacted all customers using Fermion and made many services unavailable, including the Fermion API.
As an LMS provider and a piece of infrastructure that our customers build on, we treat uptime as a core metric, and we are very sorry this incident happened. We are taking concrete measures to make sure it does not happen again.
Background
A customer we recently onboarded organized a 5000-student contest on Fermion, consisting of coding lab questions and MCQs. 5000 users is not a huge scale for us; we handle this kind of load on a daily basis.
The contest was scheduled for the time the incident began. However, something in the contest severely impacted our backend performance and brought down our core infrastructure.
What went wrong
Fermion's backend infrastructure is roughly as follows: internet traffic is received by a load balancer, which forwards it to one of many Node.js processes, each connected to our production database through PgBouncer.
The setup is tested to handle more than 5000 RPS and is fault tolerant to a single host failure. Our current peak traffic is usually about 1000 RPS.
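For illustration, here is a minimal sketch of a single Node.js process in this topology. The hostnames, port numbers, and pool size below are assumptions for the example, not our actual production configuration; the point is only that each process keeps a small connection pool against PgBouncer, which multiplexes connections to Postgres.

```typescript
import http from "node:http";
import { Pool } from "pg"; // node-postgres

// Hypothetical settings: real production values differ.
// The pool points at PgBouncer rather than Postgres directly, so each
// Node.js process holds only a small number of server-side connections.
const pool = new Pool({
  host: "pgbouncer.internal", // assumed internal hostname
  port: 6432,                 // PgBouncer's conventional port
  database: "fermion",
  max: 10,                    // small per-process pool; PgBouncer multiplexes
});

const server = http.createServer(async (_req, res) => {
  try {
    const { rows } = await pool.query("SELECT 1 AS ok");
    res.writeHead(200, { "content-type": "application/json" });
    res.end(JSON.stringify(rows[0]));
  } catch {
    res.writeHead(500);
    res.end("database error");
  }
});

// The load balancer forwards traffic to many such processes on this port.
server.listen(3000);
```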
The customer's 5000-student contest caused the backend to go down, but not because of the volume of API requests. MCQ quizzes inside a Fermion contest run on legacy software (from Codedamn) written a few years back.
The logic for the software works as follows:
In Fermion, a single quiz can have multiple questions, and a single contest can contain multiple quizzes.
The customer created multiple quizzes with about 1000 questions in each. Fermion allows limiting and randomizing the questions shown to users, so even if you upload 1000 questions you can limit each attempt to, say, 100 questions, fully randomized (with a stable, seeded randomizer) for every user.
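As a rough sketch of what stable, seeded selection looks like: the same (user, quiz) pair always produces the same subset and order of questions. The hash, PRNG, and function names below are illustrative choices for this example, not Fermion's actual implementation.

```typescript
// Seeded question selection: deterministic per (userId, quizId) pair.
// FNV-1a and mulberry32 are illustrative, not Fermion's actual code.

function hashSeed(input: string): number {
  // FNV-1a 32-bit hash of the seed string
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function mulberry32(seed: number): () => number {
  // Small deterministic PRNG returning floats in [0, 1)
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), seed | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pickQuestions<T>(all: T[], limit: number, userId: string, quizId: string): T[] {
  const rand = mulberry32(hashSeed(`${userId}:${quizId}`));
  const copy = [...all];
  // Fisher-Yates shuffle driven by the seeded PRNG
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, limit); // e.g. 100 of 1000 uploaded questions
}
```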
When the attempt window opened, 5000 people started their first question. Here is the sequence of events, in order:
Furthermore, once the CPU was blocked for a long time, Node.js kept accepting new HTTP requests but never responded to them, hanging every subsequent request the load balancer directed to that specific process.
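To make the failure mode concrete, here is a minimal, purely illustrative sketch (not the actual legacy quiz code) of how synchronous CPU-heavy work starves the Node.js event loop: the kernel keeps accepting new TCP connections into the listen backlog, so the load balancer keeps routing traffic to the process, but no request gets a response until the blocking work finishes.

```typescript
import http from "node:http";

// Illustrative only: synchronous, CPU-heavy work. While this loop runs,
// the event loop cannot execute any other callback, so every other request
// on this process sits unanswered even though its connection was accepted.
function expensiveSyncWork(): number {
  let acc = 0;
  for (let i = 0; i < 5_000_000_000; i++) acc += i % 7; // deliberately long
  return acc;
}

const server = http.createServer((req, res) => {
  if (req.url === "/blocking") {
    const result = expensiveSyncWork(); // blocks the entire process
    res.end(String(result));
  } else {
    res.end("ok"); // fast path, but stuck behind the blocking handler
  }
});

server.listen(3000);
```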
Incident timeline
All timestamps are in IST on April 16th, 2025.
Follow up steps
We take these incidents very seriously and deeply apologize for the magnitude of impact and distress this caused to customers using Fermion. Here is what we are doing, either immediately or in the near future:
Conclusion
The conditions necessary to trigger this bug in the backend service are no longer possible in our production environment. We are deeply sorry for any disruption this caused our customers and the end users trying to access their services.