Accidents happen, and in the realm of server maintenance, a single misstep can lead to significant consequences. On November 20th, 2023, I found myself facing the aftermath of an inadvertent error that impacted servers I had diligently maintained for over three years. The decision was clear—rebuild from scratch. In this tech blog post, I will share the challenges encountered and the step-by-step process of rebuilding 4 AMD64 servers and 3 ARM64 servers, each hosting critical applications for more than 45 services.
Assessing the Damage
The first order of business was to assess the extent of the damage. While the incident had no direct impact on the main site, Trapnest, it did result in the loss of data for a multitude of services. These services included but were not limited to PeerTube, Discourse, Mastodon, Navidrome, Azuracast, GitLab, Harbor, PhotoPrism, Master Sensei, Wikijs, Misskey, Lychee, and the indispensable Trapnest Backup.
Fresh deployment and migration were the two primary options on the table. However, given the complexity of the services and the interconnected nature of the data, neither option proved to be straightforward. After careful consideration, the decision was made to opt for a complete rebuild from scratch. This choice provided an opportunity to not only restore the lost services but also to optimize and enhance the infrastructure.
Building the Foundations
Docker Compose simplified the deployment of interconnected services. Each service, including PeerTube, Discourse, Mastodon, etc., was defined in a Docker Compose file, streamlining the process of bringing up the entire stack.
Rebuilding servers from scratch proved to be a formidable but necessary undertaking. The incident served as a catalyst for optimizing the infrastructure, implementing best practices, and documenting recovery processes. While the road to recovery was challenging, it paved the way for a more resilient and well-architected system.
As we reflect on this journey, the lessons learned resonate beyond the realm of server maintenance. The importance of robust backup strategies, automation, and infrastructure documentation cannot be overstated. In the face of adversity, the tech community thrives on the ability to adapt, learn, and emerge stronger.
The rebuilt servers now stand as a testament to the resilience of Trapnest's infrastructure, ready to support the diverse services and applications that define its digital ecosystem.
Copyright statement: Unless otherwise stated, all articles on this blog adopt the CC BY-NC-SA 4.0 license agreement. For non-commercial reprints and citations, please indicate the author: Henry, and original article URL. For commercial reprints, please contact the author for authorization.