Imri Goldberg
June 20, 2022
At Piiano we are developing a Vault, a type of storage database dedicated to PII. Vault is implemented in Go, which is a great language for implementing cloud services. One of the basic requirements of any server is to implement a request timeout. Any request that is handled by the server should have a timeout which is typically around 15-30 seconds on web servers these days.
When that timeout expires the user should receive a “503 Timeout Exceeded” response, and the handling of the request on the backend should be stopped. As this was my first contribution to Vault, and also my first time writing Go, this definitely has been a learning experience. I thought that sharing a few lessons about implementing this feature would be helpful for other developers.
In Go, many functions, and certainly I/O functions, receive a context.Context argument that makes a request cancelable, or gives it a deadline, in a cooperative manner. This means that the implementation of any such function should honor the cancel signal in the context, for example by using select on the context's Done channel. This is pretty standard in Go, and when you are looking to add a 10-second timeout to your request-handling code, you could write something like the following:
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
Error handling in Go is also pretty standardized. When you get an error from an internal function, you would normally wrap it e.g. using: fmt.Errorf("Unable to add person: %w", err)
Then, when you go up the call tree, before returning the HTTP error to the user, you could check if the error was initiated by a timeout or not, and if it’s from a timeout, handle it differently. You can implement this check using code such as:
func IsTimeoutError(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return false
}
If IsTimeoutError returns true, return a 503 timeout error; otherwise, return your standard error response (a 4xx or a 500, depending on the failure).
The above solution almost worked. However, one of our tests was flaky. Once every 10 runs or so it would return a 500 error when it was supposed to return a 503 timeout error. Essentially, it was treating a timeout error as a “generic” error. The reason was that some errors caused by timeouts were translated into non-timeout errors. Why?
Read on for the situation that we ran into. Imagine you have a bit of code that starts a DB transaction, makes an INSERT query, and then commits the transaction. If the timeout expires during the INSERT, the DB API call will fail with a timeout error, which is easy to detect. However, if the timeout expires exactly in the time window after the INSERT but before the COMMIT of the transaction, the SQL library will roll back the transaction, because the timeout was sent to the DB when the transaction was started.
When we then try to COMMIT the transaction, instead of a timeout error, we get an error saying that we are trying to commit a transaction that has already been rolled back. This error will NOT be a timeout error, so our code will treat it as some unrelated error: a beautiful race-condition bug. This is easy to reproduce if you add a time.Sleep(10 * time.Second) between the last query and the commit. (How to discover this is left as an exercise for the reader :)
Fortunately in the above case, Go comes to our rescue. If the timeout on our context expires, we are guaranteed that ctx.Err() would be context.DeadlineExceeded (or an error wrapping it). So to handle our error, our code should now be:
if IsTimeoutError(ctx.Err()) {
	return ...HTTPTimeoutError...
}
Where ...HTTPTimeoutError... should be replaced with your own code to return the appropriate HTTP error.
How should we test our code? Testing timeouts can be particularly tricky. First, we need to understand the requirements of our timeout test:
This is quite a tall order. Here is my initial approach:
Due to the way the API is structured, observing the state of the system is done with an API call. This is all fine and dandy until you realize that reading the state of the system will also fail due to a timeout. The solution: resetting the server with a new, longer timeout, without resetting the DB.
On fast machines, the action that should fail will actually succeed if the timeout is not short enough. But if the timeout is too short, the server won’t even start processing the request and you’re not testing anything. So you need to find a sweet spot for the timeout in the test. Let’s say, 10ms.
However, on some faster machines, even that is enough time for the action to sometimes succeed, which makes the test flaky. The solution? Fault injection. We will add a configuration option that is not user visible and will only be used in testing, called FaultInjection. For some particular value, the AddPerson API call will make an SQL query with pg_sleep(1) (note: this is PostgreSQL-specific). This will be sufficient to make sure the call fails on timeout, and our test for the timeout is now stable.
So there you have it, timeouts implemented with a stable test that proves that they work.
Tech Lead Advisor
Imri is a lifelong software engineer who loves Python and has taken on CTO roles as an entrepreneur. He has been writing code professionally for his entire life, and he is passionate about building SaaS companies.