Scaling software is often thought of as a hard problem, yet it is one of the parts of software development I enjoy the most. It surfaces niche bugs that aren’t simply ‘You forgot to add a ; to the end of the line’, but instead pose scenarios such as ‘What happens when N people attempt to use Y at exactly the same time? Are all requests served by the server? Are none?’
## The Exception
All good blog posts start with things going wrong in some way and this is no different. I was peeking at the exception logging for a large service I run and noticed the following exception stacktrace. I’ve trimmed it for readability, but at a glance it provides conflicting information.
```
Traceback (most recent call last):
  File "/code/suggestions/state.py", line 344, in fetch_user
    return self.object_cache.get_entry(user_id)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/code/.venv/lib/python3.13/site-packages/commons/caching/timed_cache.py", line 157, in get_entry
    raise NonExistentEntry
commons.caching.timed_cache.NonExistentEntry

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/code/suggestions/state.py", line 347, in fetch_user
    self.object_cache.add_entry(user_id, user)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/code/.venv/lib/python3.13/site-packages/commons/caching/timed_cache.py", line 110, in add_entry
    raise ExistingEntry
commons.caching.timed_cache.ExistingEntry
```
This exception indicates that:
- The initial problem was that a cache lookup failed to find an entry. A perfectly fine, even expected, situation during normal usage.
- Yet while handling that failed lookup and populating the cache, the cache somehow already contained the very item it had just failed to look up?
How interesting! Let’s dive deeper.
## The Code
Here is the relevant function responsible for this stacktrace.
```python
async def fetch_user(self, user_id: int) -> disnake.User:
    try:
        return self.object_cache.get_entry(user_id)
    except NonExistentEntry:
        user = await self.bot.fetch_user(user_id)
        self.object_cache.add_entry(user_id, user)
        return user
```
It’s fairly simple at a glance.
- Attempt to return a user object from the cache
- If it isn’t in the cache:
- Fetch the object
- Add it to the cache
- Return the newly fetched object
If one were to call this function a few times, it’s easy to expect it to work. Logically it makes sense: return the cached item, or fetch it and add it to the cache. But in software at scale there are other questions to consider, such as ‘What happens when this is called concurrently?’
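We can actually provoke the failure on demand. The sketch below uses hypothetical stand-ins for the real cache and bot (`ObjectCache`, the `user-<id>` value, and the `asyncio.sleep(0)` in place of the network call are all my simplifications, not the real `commons.caching.timed_cache` code), but it preserves the crucial shape: an `await` sits between the failed lookup and the cache write, giving a second caller room to sneak in.

```python
import asyncio


# Hypothetical stand-ins for the cache's exceptions and behaviour.
class NonExistentEntry(Exception): ...
class ExistingEntry(Exception): ...


class ObjectCache:
    def __init__(self):
        self._data = {}

    def get_entry(self, key):
        if key not in self._data:
            raise NonExistentEntry
        return self._data[key]

    def add_entry(self, key, value):
        if key in self._data:
            raise ExistingEntry
        self._data[key] = value


cache = ObjectCache()


async def fetch_user(user_id: int) -> str:
    try:
        return cache.get_entry(user_id)
    except NonExistentEntry:
        # This await stands in for `await self.bot.fetch_user(user_id)`.
        # It yields to the event loop, letting the *other* caller run
        # its own (also failing) cache lookup before we write ours.
        await asyncio.sleep(0)
        user = f"user-{user_id}"
        cache.add_entry(user_id, user)
        return user


async def main():
    # Two concurrent calls for the same uncached user: whichever call
    # writes the cache second raises ExistingEntry.
    return await asyncio.gather(
        fetch_user(1), fetch_user(1), return_exceptions=True
    )


results = asyncio.run(main())
print(results)
```

One call succeeds and the other surfaces `ExistingEntry`, exactly the shape of the stacktrace above.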
## The Cause
Ultimately this was a simple bug: by default the cache doesn’t let callers silently overwrite existing values, and there is no locking or queueing around access to the resource. Reasonable enough on its own.

But then think about it. In order to actually run into this issue (and at multiple occurrences, according to my logs), one must have called `fetch_user` under the following conditions:
- Two or more calls must be made to `fetch_user` at nearly the same time
- The user must not initially exist in the cache for all function calls
- One call to `fetch_user` must fetch and store the user in the cache before the other callers, but after those callers have already done the initial cache lookup
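One way to close that window (a sketch of a possible fix, not necessarily what the service shipped, and again using hypothetical stand-ins for the cache and bot) is to serialise the check-then-populate step behind an `asyncio.Lock` and re-check the cache after acquiring it:

```python
import asyncio


# Hypothetical stand-ins so the sketch runs on its own.
class NonExistentEntry(Exception): ...
class ExistingEntry(Exception): ...


class ObjectCache:
    def __init__(self):
        self._data = {}

    def get_entry(self, key):
        if key not in self._data:
            raise NonExistentEntry
        return self._data[key]

    def add_entry(self, key, value):
        if key in self._data:
            raise ExistingEntry
        self._data[key] = value


class FakeBot:
    def __init__(self):
        self.fetch_count = 0

    async def fetch_user(self, user_id):
        self.fetch_count += 1
        await asyncio.sleep(0)  # simulate the network round trip
        return f"user-{user_id}"


class State:
    def __init__(self, bot):
        self.bot = bot
        self.object_cache = ObjectCache()
        self._fetch_lock = asyncio.Lock()  # serialises cache population

    async def fetch_user(self, user_id: int):
        try:
            return self.object_cache.get_entry(user_id)
        except NonExistentEntry:
            pass
        async with self._fetch_lock:
            # Re-check: another caller may have populated the cache
            # while we were waiting on the lock.
            try:
                return self.object_cache.get_entry(user_id)
            except NonExistentEntry:
                user = await self.bot.fetch_user(user_id)
                self.object_cache.add_entry(user_id, user)
                return user


async def main():
    state = State(FakeBot())
    results = await asyncio.gather(*(state.fetch_user(1) for _ in range(5)))
    return results, state.bot.fetch_count


results, fetches = asyncio.run(main())
print(results, fetches)
```

Five concurrent callers all receive the same user, and the underlying fetch happens exactly once. A cheaper alternative, if duplicate fetches are acceptable, is simply to catch `ExistingEntry` around `add_entry` and treat losing the race as benign.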
Now to you, the reader, I ask: what is the likelihood of that occurring multiple times in a small, low-user-count application? I’d say almost never, and yet these are the kinds of software problems I enjoy solving.
Anywho, thanks for reading my little deviation from the standard blogging content, and I hope you enjoy the rest of your day, wherever that may be :)