Editing Shared Resources: "Pessimistic Locking"

I think a better name for this would have been “greedy lock”, because it blocks everything from changing the resource until it is done. It’s a bit like having a one-bathroom home shared with 10 people.

The main idea is to lock the resource immediately, compute all changes, write the changes, and afterwards release it. All other processes must wait until it is done.

This diagram might be a bit dizzying because of all the code-branching, so here’s a short description (with a code sketch after the list):

  • Check if there is a lock on the resource; if there is, wait a few milliseconds and then check again.

  • Set the resource to locked.

  • Compute and set changes

  • Set resource to unlocked
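
To make the steps above concrete, here is a minimal sketch in Python using the redis-py client. The lock key name, the “locked” placeholder value, the 25 ms retry interval, and the 30-second lock TTL are illustrative choices, not taken from any particular framework:

    import time
    import redis

    r = redis.Redis()

    def with_pessimistic_lock(resource_key, compute_and_write,
                              retry_interval=0.025, lock_ttl_ms=30_000):
        lock_key = f"lock:{resource_key}"
        # Steps 1 and 2: check for an existing lock and set our own.
        # SET with NX does both atomically; the TTL guards against a
        # crashed worker holding the lock forever.
        while not r.set(lock_key, "locked", nx=True, px=lock_ttl_ms):
            time.sleep(retry_interval)  # "wait a few milliseconds and check again"
        try:
            compute_and_write()         # step 3: compute and write the changes
        finally:
            r.delete(lock_key)          # step 4: set the resource to unlocked

Every other worker that wants the same resource spins in that while loop, and that waiting is exactly the cost discussed below.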

This is great for preventing errors when two workers want to write to the same resource at the same time (one of them just has to wait). It also ensures that each process sees the updated resource before it starts working on it. However, it introduces a new problem: the retry loop blocks servers for a long time while they wait and do nothing. This lowers the maximum number of requests you can serve per second, reduces your ability to scale servers cost-effectively, and can introduce an easy-to-exploit weakness where every worker in a web cluster ends up “busy” executing “sleep 25ms”. If an attacker spams slow-performing parts of your website, everything everywhere gets blocked (assuming no special mitigations are taken).

Each active request must “wait” for the entire length of time the previous request spends waiting and working, and this inherited delay adds up quickly, especially if you have a few pages that (under normal conditions) are slow enough to take 5 seconds to process. If three requests for the same resource arrive together and the first takes 5 seconds, the second starts roughly 5 seconds late and the third roughly 10.

Another downside comes from how most big frameworks implement this. Framework authors want to make this stuff easy for developers, so they turn it on by default so you don’t need to think about it. This becomes an obvious drawback when a process sets a lock for no reason: no change will happen, but a lock was acquired anyway, wasting time. (By the way, if you think you can just disable it for GET requests: keep reading, I have bad news for you further down.)

That said, this is the “safest” generic approach for preventing data loss. There are a number of “mitigations” against the previously mentioned issues that you can use:

  • Release Early, and Lock Late: Some web frameworks might acquire a lock earlier than they need to (for example, in pre- and post-controller hooks). This blocks the resource for longer than necessary. Similarly, they might release the lock much later than they could (for example, after the response is sent). If you can narrow down the time spent holding a lock, you will improve the overall situation (but never fully solve the problem).
    Some implementations go to extreme lengths to reduce this time by executing Lua scripts on Redis. These scripts check and set the session lock in one Redis command instead of two, closing the small (~2 ms) window of error between the check and the set (see the first sketch after this list).

  • Limiting requests: If your user is sending 5 requests at the same time, consider using a load balancer or firewall to either queue the requests sequentially or deny them entirely (HTTP 429 “Too Many Requests”). This doesn’t make the worst case faster, but it does prevent one user from filling every worker in your cluster with code that only runs “sleep 25ms”.

  • Limiting locks: This is easier said than done. A lot of little “simple features” are built in ways that need to change session data on GET requests, especially little things like “flash message notifications” (I’m talking about Symfony Flash Messages, Django Flash Messages, Rails Flash Messages… probably most of the major frameworks).
    There’s one solution I found to be quite a nice compromise in ASP.NET, where the locking strategy is configured at the route/page level (a hypothetical sketch of the idea follows this list). But be warned, this brings a lot of different tradeoffs regarding software maintenance and flexibility (those topics are out of scope for this article).
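
As mentioned in the first bullet above, some implementations push the check and the set into a single Redis round trip. Here is a rough illustration of that idea using redis-py’s register_script; the key name, the token, and the TTL are illustrative, not taken from any specific framework:

    import redis

    r = redis.Redis()

    # Runs atomically on the Redis server: check for an existing lock and,
    # only if none exists, set one with an expiry, all in one round trip.
    CHECK_AND_SET = r.register_script("""
    if redis.call('EXISTS', KEYS[1]) == 1 then
      return 0
    end
    redis.call('SET', KEYS[1], ARGV[1], 'PX', ARGV[2])
    return 1
    """)

    def try_acquire(session_id, token, ttl_ms=30000):
        # Returns 1 if we acquired the lock, 0 if someone else holds it.
        return CHECK_AND_SET(keys=[f"session_lock:{session_id}"], args=[token, ttl_ms])

(In many cases a plain SET with the NX and PX options, as in the earlier sketch, achieves the same thing in one command; the Lua form mainly pays off when the check involves more than the key’s existence.)

And as a purely hypothetical sketch of the route-level idea from the last bullet (this is not the actual ASP.NET mechanism, nor any real framework’s API): a view could declare whether it needs a write lock on the session at all, and a middleware could skip the lock for read-only pages:

    def session_lock(mode):
        # Marks a view with the locking strategy a (hypothetical) session
        # middleware should use for it.
        def decorator(view):
            view.session_lock_mode = mode
            return view
        return decorator

    @session_lock("read-only")
    def product_page(request):
        # Only reads session data, so no write lock needs to be held
        # while the page renders.
        ...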

It’s important to keep in mind that these mitigations only make the underlying issue less serious; the problem (“sleep 25ms”) is still there. So we need to look a bit deeper if we want to free up some of that time wasted in a lock.