Incident report - April 16th

On April 16th at 3:20 PM IST, Fermion's backend performance was severely degraded for about 35 minutes. This impacted all customers using Fermion and made many services unavailable, including the Fermion API.

As an LMS provider and an infrastructure layer for other products, we treat uptime as a core metric, and we are very sorry for this incident. We are taking concrete measures to make sure it does not happen again.

Background

A customer we recently onboarded organized a 5000-student contest on Fermion, consisting of coding lab questions and MCQs. 5000 users is not a huge scale for us; we handle this level of traffic on a daily basis.

The contest was scheduled for the afternoon of April 16th. However, something about this contest severely impacted our backend performance and brought down our core infrastructure.

What went wrong

Fermion's backend infrastructure is roughly as follows: internet traffic is received by a load balancer, which forwards each request to one of many Node.js processes. These processes connect to our production database through PgBouncer.

This setup is tested to handle more than 5000 RPS and is fault tolerant to a single host failure. Our current peak traffic is usually about 1000 RPS.
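
For context, the database side of that path looks roughly like the sketch below: each Node.js process opens a small connection pool against PgBouncer rather than against Postgres directly. The host name, port, database name, and pool size here are illustrative assumptions, not our actual configuration.

```typescript
// Illustrative only: how a Node.js process might connect through PgBouncer.
// Host name, port, database name, and pool size are assumptions, not our real config.
import { Pool } from "pg";

const pool = new Pool({
  host: "pgbouncer.internal",   // PgBouncer endpoint, not Postgres directly
  port: 6432,                   // PgBouncer's default listen port
  database: "app",
  user: "app",
  password: process.env.DB_PASSWORD,
  max: 20,                      // small per-process pool; PgBouncer multiplexes these
});

export async function healthCheck(): Promise<boolean> {
  const { rows } = await pool.query("SELECT 1 AS ok");
  return rows[0].ok === 1;
}
```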

The customer running the 5000-student contest caused the backend to go down, but not through raw API request volume. MCQ quizzes inside a Fermion contest use legacy code (from Codedamn) written a few years back.

The logic for the software works as follows:

Store the instructions, title, and options in markdown format.
When the client requests the quiz, convert the markdown into HTML on the server side. We used to do this for SEO reasons on Codedamn, and carried this code over into Fermion when building it last year.
Once the markdown-to-HTML conversion is done, we hash the markdown (MD5) and store the generated HTML in a cache keyed by that hash, to avoid recomputing the same markdown over and over. Importantly, the hash covers not just the markdown value but also some metadata about the markdown source (sketched below).
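
Here is a minimal sketch of that render-and-cache flow. The function and field names are illustrative, not the actual Fermion code; the important detail is that the cache key mixes per-question metadata into the MD5 hash along with the markdown itself.

```typescript
// Rough sketch of the legacy render-and-cache flow; names are illustrative.
import { createHash } from "crypto";

const htmlCache = new Map<string, string>(); // stands in for the real shared cache

function cacheKey(markdown: string, metadata: { quizId: string; questionId: string }): string {
  // The key hashes the markdown *plus* per-question metadata, which is what later
  // prevented cache hits across identical markdown during the contest.
  return createHash("md5")
    .update(markdown)
    .update(JSON.stringify(metadata))
    .digest("hex");
}

function renderQuestionHtml(
  markdown: string,
  metadata: { quizId: string; questionId: string },
  markdownToHtml: (md: string) => string // the CPU-bound convert + sanitize step
): string {
  const key = cacheKey(markdown, metadata);
  const cached = htmlCache.get(key);
  if (cached !== undefined) return cached;

  const html = markdownToHtml(markdown); // blocks the event loop for ~0.5-5ms
  htmlCache.set(key, html);
  return html;
}
```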

In Fermion, a single quiz can have multiple questions. Also, you can have multiple quizzes in a single contest.

The customer running the contest created multiple quizzes with about 1000 questions each. Fermion lets you limit and randomize the questions shown to users, so even if you upload 1000 questions, you can serve only 100 of them per user, fully randomized (with a stable, seeded randomizer).
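
As an illustration of what "limit to 100, fully randomized with a stable seed" can look like, here is a sketch using a plain seeded shuffle (mulberry32 + Fisher-Yates). The function names and the PRNG choice are illustrative, not the exact algorithm Fermion uses.

```typescript
// Illustrative: serve `limit` of N questions, randomized per user with a stable seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pickQuestionsForUser<T>(allQuestions: T[], limit: number, seed: number): T[] {
  const rand = mulberry32(seed);
  const shuffled = [...allQuestions];
  // Fisher-Yates shuffle driven by the seeded PRNG: same seed, same order every time.
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled.slice(0, limit);
}

// Same user + quiz => same seed => the same 100 questions on every request.
// const questions = pickQuestionsForUser(questionPool, 100, seedFor(userId, quizId));
```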



When the attempt window opened, 5000 people started on their first question. Here is the sequence of events, in order:

On the client side, we eagerly fetch all questions in a quiz in a single request. Our existing customers' quizzes have always been around 10 questions long (we checked).
Questions were randomized, so virtually no two users received the same set of 100 questions: the number of possible 100-question subsets of a 1000-question pool is astronomically large.
Node.js is not good at CPU-bound tasks. Converting markdown to HTML is a two-step process for us: first we convert it to raw HTML, then we run a DOMParser over the output to sanitize any malicious input. This whole process can take anywhere between 0.5-5ms depending on the size and complexity of the markdown.
Furthermore, our markdown-to-HTML cache key included metadata about the contest question set itself, which made it impossible to get a cache hit on previously converted markdown.
This is where the problem happened: 5000 users were each requesting about 100 fully randomized questions, and every question's markdown had to be rendered to HTML on the server, uncached, blocking the Node.js CPU (a back-of-envelope estimate follows this list).
Since the load balancer distributes requests evenly, every single Node.js process we had running ended up choking on this markdown-to-HTML conversion.
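
To put the resulting load in perspective, here is a rough back-of-envelope estimate. The per-conversion cost uses the midpoint of the 0.5-5ms range above, and the process count is an illustrative assumption rather than our actual fleet size.

```typescript
// Rough load estimate for the contest start; assumed numbers are illustrative only.
const users = 5000;
const questionsPerUser = 100;   // eagerly fetched in one go
const msPerConversion = 2;      // assumed average cost of markdown -> HTML + sanitize
const nodeProcesses = 16;       // assumed fleet size, for illustration

const totalCpuMs = users * questionsPerUser * msPerConversion; // 1,000,000 ms of CPU work
const secondsPerProcess = totalCpuMs / nodeProcesses / 1000;   // ~62s of pure CPU time each

console.log(`~${totalCpuMs / 1000}s of rendering total, ~${secondsPerProcess.toFixed(0)}s per process`);
```

Because nearly all of that work arrived within the first minutes of the contest, every process's event loop was saturated at roughly the same time.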

Furthermore, once the CPU was blocked for long stretches, Node.js kept accepting new HTTP requests but never got a chance to respond to them, hanging every subsequent request the load balancer directed to that process.
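
This behavior is easy to reproduce in isolation. The snippet below is an illustrative, stripped-down server (not Fermion code) in which a synchronous busy-wait stands in for the markdown conversion; while it runs, the process accepts connections but cannot serve any other request.

```typescript
// Illustrative reproduction of the failure mode: synchronous work starves the event loop.
import { createServer } from "http";

function blockCpu(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // synchronous work: nothing else on this process runs until this loop finishes
  }
}

const server = createServer((req, res) => {
  if (req.url === "/quiz") {
    blockCpu(5000); // 5 seconds of blocking "rendering"
  }
  res.end("ok");
});

// While "/quiz" is being handled, a concurrent request to any route on this process
// is accepted at the TCP level, but its handler never runs until the loop is free.
server.listen(3000);
```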



Incident timeline

All timestamps are in IST on April 16th, 2025.

3:20 PM - We started receiving health check alerts that 50%+ of Node.js processes were non-responsive.
3:30 PM - Initial diagnosis was completed, and we determined that a CPU block was the root cause. It took another 5 minutes to find the exact place where it was happening (inside the quiz markdown-to-HTML code).
3:40 PM - A hotfix was pushed through our CI to bypass the markdown-to-HTML conversion and return the raw markdown directly to the client (a simplified sketch follows this timeline). The drawback was that users would see unrendered markdown (e.g. # characters instead of real headings), but it was a favorable compromise for the moment.
3:50 PM - The fix was deployed successfully and systems started to recover.
3:55 PM - All systems were fully up again.
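
For the curious, the hotfix at 3:40 PM was essentially a bypass switch around the conversion step. The sketch below shows its general shape; the flag and function names are assumptions, and this is a simplification rather than the exact patch.

```typescript
// Simplified sketch of the hotfix's shape; names are illustrative assumptions.
const BYPASS_MARKDOWN_RENDERING = process.env.BYPASS_MARKDOWN_RENDERING === "true";

function getQuestionBody(
  markdown: string,
  renderToHtml: (md: string) => string
): { format: "markdown" | "html"; content: string } {
  if (BYPASS_MARKDOWN_RENDERING) {
    // Hotfix path: skip the CPU-bound conversion and let the client show raw markdown.
    return { format: "markdown", content: markdown };
  }
  return { format: "html", content: renderToHtml(markdown) };
}
```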

Follow up steps

We take these incidents very seriously and deeply apologize for the magnitude of impact and distress this caused to customers using Fermion. Here is what we're doing, either immediately or in the near future:

Minimizing the impact of CPU-blocking code in consumer-facing backend APIs: Converting markdown to HTML for a bulk question set (100+ questions) in one go was a massive oversight on our side and should not have happened. We're turning our backend upside down to make sure we do not have such points of failure anywhere else in the codebase (a sketch of one possible approach follows this list).
Building isolated backend processes dedicated to DSA execution and other critical infrastructure: Fermion offers an API for DSA coding lab execution, and although unrelated to quizzes, this service also went down because the shared Node.js backend processes were busy converting markdown to HTML.
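
As one illustration of how CPU-heavy work can be kept off the request-serving event loop, the sketch below offloads the markdown conversion to a Node.js worker thread. This is a sketch of one possible approach under assumed names, not necessarily the exact design we will ship.

```typescript
// Illustrative: offload the CPU-bound conversion to a worker thread so the main
// event loop keeps serving requests. Names are assumptions, not shipped code.
import { Worker, isMainThread, parentPort, workerData } from "worker_threads";

// Placeholder for the real markdown library + sanitizer described earlier.
function convertMarkdownToHtml(markdown: string): string {
  return `<p>${markdown}</p>`;
}

if (!isMainThread) {
  // Worker side: the expensive conversion happens here, off the main thread.
  parentPort?.postMessage(convertMarkdownToHtml(workerData.markdown));
}

// Main-thread side: spawn a worker per job (a real system would use a thread pool).
export function renderMarkdownOffMainThread(markdown: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(__filename, { workerData: { markdown } });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}
```

The main process stays responsive because the expensive conversion runs on a separate thread, and the result comes back asynchronously as a message.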



Conclusion

The conditions necessary to trigger this bug in the backend service are no longer possible in our production environment. We are deeply sorry for any disruption this caused to our customers and to the end users trying to access their services.