G.Jonathan

@beardedtech_guy

Apr 23

𝗗𝗮𝘆 𝟭𝟬𝟬 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗙𝗶𝗻𝗮𝗹 𝗥𝗲𝗰𝗮𝗽 One hundred days later, and one thing is clear. 𝗦𝘆𝘀𝘁𝗲𝗺 𝗱𝗲𝘀𝗶𝗴𝗻 is not just about building systems that work. It is about building systems that continue to work under pressure, failure, growth, and change. From the early days of understanding how systems are structured, to diving into distributed systems, scaling, resilience, and real-world trade-offs, the journey has been less about memorizing patterns and more about developing a way of thinking. Along the way, certain themes kept repeating. 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗶𝘀 𝗻𝗼𝘁 𝗮 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝘆𝗼𝘂 𝗮𝗱𝗱 𝗮𝘁 𝘁𝗵𝗲 𝗲𝗻𝗱. It is something you design for from the beginning, whether through availability, failover, or graceful degradation. 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗶𝘀 𝗻𝗼𝘁 𝗷𝘂𝘀𝘁 𝗮𝗯𝗼𝘂𝘁 𝗵𝗮𝗻𝗱𝗹𝗶𝗻𝗴 𝗺𝗼𝗿𝗲 𝘁𝗿𝗮𝗳𝗳𝗶𝗰. It is about understanding your system, whether it is read-heavy or write-heavy, and making the right decisions around data, caching, and distribution. 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆 𝗶𝘀 𝗻𝗼𝘁 𝗮𝗹𝘄𝗮𝘆𝘀 𝗮𝗯𝘀𝗼𝗹𝘂𝘁𝗲. In many cases, it becomes a trade-off, balanced against performance, latency, and availability. 𝗖𝗼𝘀𝘁 𝗶𝘀 𝗻𝗼𝘁 𝗷𝘂𝘀𝘁 𝗮 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁. It is part of the design. Every decision, from replication to infrastructure, carries a cost that must be justified. And most importantly, there is no perfect system. Every design is a series of trade-offs, shaped by what matters most for that particular use case. What changed the most over these 100 days is perspective. It is easier now to look at a system and not just see what it does, but understand why it was designed that way. To recognize the compromises behind the architecture, and the problems it is trying to solve. Because in the real world, systems are not built in ideal conditions. They are built under constraints, evolving over time, and constantly adapting. If there is one takeaway from this journey, it is this: Good systems work. Great systems are designed to survive. @TosinOlugbenga We did it sha, lol. I am going to sleeeeepppppppp.

G.Jonathan

@beardedtech_guy

Apr 22

𝗗𝗮𝘆 𝟵𝟵 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗗𝗲𝘀𝗶𝗴𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝗣𝗲𝗮𝗸 𝗧𝗿𝗮𝗳𝗳𝗶𝗰 In system design, the real test of a system is not how it performs under normal conditions, but how it behaves when demand is at its highest. Traffic is rarely consistent. Systems often experience spikes during promotions, major events, or unexpected surges, and these peak moments are where weaknesses are exposed. Designing for average traffic might make a system efficient, but it does not make it resilient. When peak traffic hits, systems that are not prepared can slow down, experience high latency, or fail completely. Designing for peak traffic means planning for the worst-case scenario. It involves scaling infrastructure, using caching to reduce load, and distributing requests effectively through load balancing and auto-scaling mechanisms. However, this approach comes with trade-offs. Preparing for peak demand can increase costs, as resources may remain underutilized during normal operation. The challenge is finding a balance between readiness and efficiency. In the end, a system is only as reliable as it is during its most demanding moments. Because peak traffic is not an exception. It is the moment your system is truly tested.
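The readiness versus efficiency trade-off above can be put into rough numbers. This is a minimal Python sketch, assuming each instance handles a fixed request rate and that you want to stay below a chosen utilization at peak; `instances_needed`, `idle_fraction`, and all figures are hypothetical, not taken from a real system.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float,
                     target_utilization: float = 0.7) -> int:
    """Instances needed so each one stays at or below target_utilization at peak."""
    if rps_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rps_per_instance must be > 0 and target_utilization in (0, 1]")
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps)


def idle_fraction(average_rps: float, peak_rps: float) -> float:
    """Share of peak-sized capacity that sits unused at average load."""
    return 1.0 - average_rps / peak_rps
```

With a peak of 10,000 requests per second and an average of 2,000, a fleet sized for the peak sits roughly 80% idle most of the time, which is the cost that auto-scaling tries to claw back.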

G.Jonathan

@beardedtech_guy

Apr 21

𝗗𝗮𝘆 𝟵𝟴 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗢𝘃𝗲𝗿-𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝘃𝘀 𝗨𝗻𝗱𝗲𝗿-𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 In system design, one of the hardest decisions is not just what to build, but how much to build. 𝗢𝘃𝗲𝗿-𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗮𝗻𝗱 𝘂𝗻𝗱𝗲𝗿-𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 are two extremes that teams often fall into, especially when trying to balance current needs with future expectations. 𝗢𝘃𝗲𝗿-𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 happens when systems are designed with more complexity than necessary, often in anticipation of scale or problems that may never come. While it may seem like preparing for the future, it can slow down development, increase costs, and make systems harder to understand and maintain. 𝗨𝗻𝗱𝗲𝗿-𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴, on the other hand, occurs when systems are built too simply, without considering growth or real-world conditions. This often leads to performance issues, instability, and frequent rework as the system struggles to keep up. The difference between the two is not just technical, it is about timing and judgment. Over-engineering solves problems too early, while under-engineering solves them too late. The goal is to find a balance, designing systems that meet current needs while leaving room to evolve as requirements grow. In the end, good system design is not about building the most advanced solution. It is about building the right solution at the right time.

G.Jonathan

@beardedtech_guy

Apr 20

𝗗𝗮𝘆 𝟵𝟳 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗖𝗼𝘀𝘁 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗶𝗻 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 As systems scale, one thing becomes very clear. Performance and reliability are not the only concerns anymore. Cost becomes just as important. It is easy to design a system that works well by throwing more resources at the problem, adding more servers, more replicas, and more infrastructure. But that approach does not scale sustainably. Cost optimization is about making intentional decisions on how resources are used, ensuring that systems meet their performance and reliability goals without unnecessary spending. In distributed systems, costs come from multiple areas, including compute, storage, network usage, and replication across regions. As traffic grows, these costs can increase rapidly if not managed carefully. This is why optimization becomes necessary. Techniques like right-sizing infrastructure, using caching to reduce repeated work, and scaling resources based on actual demand help keep costs under control while maintaining performance. However, there are always trade-offs. Reducing cost can impact redundancy or performance if not handled properly, so the goal is not simply to minimize cost, but to balance it against system requirements. In the end, good system design is not about building the most expensive or the most powerful system. It is about building a system that is efficient, sustainable, and capable of growing without becoming a burden.
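One of the techniques mentioned above, caching to reduce repeated work, can be sketched with Python's standard `functools.lru_cache`. The function `product_summary` and the `stats` counter are hypothetical stand-ins for an expensive backend query; the point is only that repeated requests stop reaching the costly path.

```python
from functools import lru_cache

# Counts how often the expensive path actually runs (illustrative only).
stats = {"backend_calls": 0}


@lru_cache(maxsize=1024)
def product_summary(product_id: int) -> str:
    """Stand-in for an expensive query; the body only runs on a cache miss."""
    stats["backend_calls"] += 1
    return f"product-{product_id}"


# Five requests for the same product reach the backend once.
for _ in range(5):
    product_summary(42)
```

The trade-off from the tweet still applies: a cache costs memory and can serve stale data, so `maxsize` and invalidation are design decisions, not free wins.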

G.Jonathan

@beardedtech_guy

Apr 19

𝗗𝗮𝘆 𝟵𝟲 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗗𝗮𝘁𝗮 𝗟𝗼𝗰𝗮𝗹𝗶𝘁𝘆 In distributed systems, performance is not just about how fast your system is, but how far your data has to travel. 𝗗𝗮𝘁𝗮 𝗹𝗼𝗰𝗮𝗹𝗶𝘁𝘆 is the idea of bringing data closer to where it is needed, whether that is closer to users or closer to the services processing it. When data is stored far away, every request has to cross regions or networks, increasing latency and slowing down the system. As systems scale globally, this delay becomes more noticeable and impacts user experience. By designing systems so that data is stored and processed near its point of use, response times improve, network costs are reduced, and the system becomes more efficient overall. In practice, this involves techniques like partitioning data across regions, using caching, and aligning compute resources with where the data lives. However, improving data locality introduces trade-offs. Keeping data in multiple locations requires managing consistency and synchronization, which adds complexity to the system. In the end, data locality is about making systems faster not by doing more work, but by doing work closer to where it matters. Because in distributed systems, distance is a cost you cannot ignore.
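To see why distance is a cost, here is a small Python sketch under a simplifying assumption: a request makes several sequential data calls, and each call pays one network round trip. The round-trip times (80 ms cross-region, 2 ms local) are made-up illustrative figures.

```python
def request_latency_ms(sequential_calls: int, round_trip_ms: float,
                       processing_ms: float = 0.0) -> float:
    """Total latency when each data call waits for the previous one."""
    return sequential_calls * round_trip_ms + processing_ms


# Same work, same code, different distance to the data.
remote = request_latency_ms(sequential_calls=5, round_trip_ms=80, processing_ms=10)
local = request_latency_ms(sequential_calls=5, round_trip_ms=2, processing_ms=10)
```

Under these assumed numbers the remote request takes 410 ms and the local one 20 ms, even though the processing work is identical.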

G.Jonathan

@beardedtech_guy

Apr 18

𝗗𝗮𝘆 𝟵𝟱 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗥𝗲𝗰𝗮𝗽: 𝗦𝗰𝗮𝗹𝗶𝗻𝗴, 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻, 𝗮𝗻𝗱 𝗚𝗹𝗼𝗯𝗮𝗹 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟵, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗥𝗼𝗹𝗹𝗯𝗮𝗰𝗸𝘀, understanding how systems recover quickly from bad deployments by reverting to a stable version instead of trying to fix issues in a broken state. 𝗢𝗻 𝗱𝗮𝘆 𝟵𝟬, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗦𝗰𝗵𝗲𝗺𝗮 𝗠𝗶𝗴𝗿𝗮𝘁𝗶𝗼𝗻𝘀, exploring how databases evolve safely without breaking running systems, and why changes to data structures require careful planning. 𝗢𝗻 𝗱𝗮𝘆 𝟵𝟭, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗭𝗲𝗿𝗼-𝗗𝗼𝘄𝗻𝘁𝗶𝗺𝗲 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁𝘀, learning how systems can be updated without interrupting users by allowing old and new versions to coexist during transitions. 𝗢𝗻 𝗱𝗮𝘆 𝟵𝟮, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗥𝗲𝗮𝗱𝘀 𝘃𝘀 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗪𝗿𝗶𝘁𝗲𝘀, breaking down how different workloads require different scaling strategies, and why writes are often harder to scale than reads. 𝗢𝗻 𝗱𝗮𝘆 𝟵𝟯, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗠𝘂𝗹𝘁𝗶-𝗥𝗲𝗴𝗶𝗼𝗻 𝗦𝘆𝘀𝘁𝗲𝗺𝘀, where systems expand across geographic locations to improve performance and availability for users around the world. 𝗢𝗻 𝗱𝗮𝘆 𝟵𝟰, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗚𝗲𝗼-𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗧𝗿𝗮𝗱𝗲-𝗼𝗳𝗳𝘀, understanding the balance between consistency, latency, and availability when data is replicated across regions. What ties all of these together is the idea of growth. As systems scale, the challenges move beyond just handling more traffic and start involving how systems evolve, how data is managed, and how performance is maintained across distance. Scaling introduces complexity, replication introduces trade-offs, and global systems introduce new constraints that cannot be ignored. Because at this stage, system design is no longer just about building something that works. It is about building something that continues to work as it grows, changes, and reaches users everywhere.

G.Jonathan

@beardedtech_guy

Apr 17

𝗗𝗮𝘆 𝟵𝟰 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗚𝗲𝗼-𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗧𝗿𝗮𝗱𝗲-𝗼𝗳𝗳𝘀 As systems expand across regions, data needs to be available closer to users, and this is where geo-replication becomes essential. 𝗚𝗲𝗼-𝗿𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗮𝗹𝗹𝗼𝘄𝘀 𝗱𝗮𝘁𝗮 𝘁𝗼 𝗯𝗲 𝗰𝗼𝗽𝗶𝗲𝗱 𝗮𝗰𝗿𝗼𝘀𝘀 𝗺𝘂𝗹𝘁𝗶𝗽𝗹𝗲 𝗿𝗲𝗴𝗶𝗼𝗻𝘀, 𝗶𝗺𝗽𝗿𝗼𝘃𝗶𝗻𝗴 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗮𝗻𝗱 𝗲𝗻𝘀𝘂𝗿𝗶𝗻𝗴 𝘁𝗵𝗮𝘁 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 𝗿𝗲𝗺𝗮𝗶𝗻 𝗮𝘃𝗮𝗶𝗹𝗮𝗯𝗹𝗲 𝗲𝘃𝗲𝗻 𝗶𝗳 𝗼𝗻𝗲 𝗿𝗲𝗴𝗶𝗼𝗻 𝗴𝗼𝗲𝘀 𝗱𝗼𝘄𝗻. But this comes with trade-offs that cannot be ignored. The biggest challenge is balancing consistency, latency, and availability. Ensuring that all regions always have the exact same data can slow down the system, while relaxing consistency can improve performance but introduce temporary inconsistencies. Strong consistency provides correctness but increases latency, especially when updates must be synchronized across distant regions. Eventual consistency improves speed and scalability but requires systems to handle situations where data may not be immediately aligned. There is also the added complexity of managing replication, handling conflicts, and dealing with network partitions, all of which become more prominent in globally distributed systems. In the end, geo-replication is not about finding a perfect solution. It is about choosing the right trade-offs based on what your system values most. Because in distributed systems, every improvement comes with a cost.
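Conflict handling under eventual consistency can be illustrated with last-write-wins, one common (and lossy) strategy. This is a sketch and not the only approach: it assumes each write carries a timestamp from the writer's clock, and the `Versioned` type and `merge_lww` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # assumed: writer's clock at write time
    region: str        # used only to break timestamp ties deterministically


def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    """Keep the later write; every region applying this rule converges on the same value."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))
```

The cost is visible in the code: the losing write is silently dropped, and the result depends on clocks that may drift between regions. That is the kind of trade-off the tweet describes.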

G.Jonathan

@beardedtech_guy

Apr 16

𝗗𝗮𝘆 𝟵𝟯 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗠𝘂𝗹𝘁𝗶-𝗥𝗲𝗴𝗶𝗼𝗻 𝗦𝘆𝘀𝘁𝗲𝗺𝘀 As systems grow, users are no longer in one place, and serving everyone from a single region starts to create problems with latency, availability, and overall user experience. 𝗠𝘂𝗹𝘁𝗶-𝗿𝗲𝗴𝗶𝗼𝗻 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 solve this by distributing infrastructure across different geographic locations, allowing users to connect to the closest region for faster response times while also improving system resilience. Instead of relying on a single data center, services and data are replicated across regions, ensuring that if one region fails, others can continue serving traffic without significant disruption. However, this introduces a new level of complexity, especially when it comes to data consistency. Keeping data synchronized across regions while dealing with network delays and possible conflicts becomes one of the hardest challenges in system design. There are also trade-offs to consider. While multi-region systems improve performance and availability, they increase infrastructure cost and require more sophisticated design to manage replication and coordination. In the end, building globally distributed systems is not just about scaling up, it is about scaling smart. Because once your system goes global, distance becomes part of your design.
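Routing users to the closest region while surviving a regional failure can be sketched in a few lines of Python. The latency table and health set are assumed inputs (in practice they would come from measurements and health checks), and `pick_region` is a hypothetical helper.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the lowest-latency region that is currently healthy."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)
```

When the nearest region drops out of the healthy set, the same call falls through to the next closest one, trading some latency for continued availability.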

G.Jonathan

@beardedtech_guy

Apr 15

𝗗𝗮𝘆 𝟵𝟮 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗥𝗲𝗮𝗱𝘀 𝘃𝘀 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗪𝗿𝗶𝘁𝗲𝘀 As systems grow, handling more traffic is not just about scaling infrastructure, it is about understanding the type of traffic your system receives. Most systems are not balanced. Some are read-heavy, where users mostly fetch data, while others are write-heavy, where data is constantly being created or updated. 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗿𝗲𝗮𝗱𝘀 𝗶𝘀 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗹𝘆 𝗲𝗮𝘀𝗶𝗲𝗿 𝗯𝗲𝗰𝗮𝘂𝘀𝗲 𝗱𝗮𝘁𝗮 𝗰𝗮𝗻 𝗯𝗲 𝗱𝘂𝗽𝗹𝗶𝗰𝗮𝘁𝗲𝗱 𝗮𝗻𝗱 𝗱𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗮𝗰𝗿𝗼𝘀𝘀 𝗺𝘂𝗹𝘁𝗶𝗽𝗹𝗲 𝗹𝗮𝘆𝗲𝗿𝘀. Techniques like caching, read replicas, and CDNs allow systems to serve data quickly without putting too much pressure on the primary database. 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝘄𝗿𝗶𝘁𝗲𝘀 𝗶𝘀 𝗺𝗼𝗿𝗲 𝗰𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗶𝗻𝗴 𝗯𝗲𝗰𝗮𝘂𝘀𝗲 𝗲𝘃𝗲𝗿𝘆 𝘄𝗿𝗶𝘁𝗲 𝗰𝗵𝗮𝗻𝗴𝗲𝘀 𝘁𝗵𝗲 𝘀𝘆𝘀𝘁𝗲𝗺 𝘀𝘁𝗮𝘁𝗲 𝗮𝗻𝗱 𝗺𝘂𝘀𝘁 𝗿𝗲𝗺𝗮𝗶𝗻 𝗰𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝘁. This often requires partitioning data across multiple nodes, coordinating updates, and managing conflicts, which adds complexity to the system. The key difference lies in how these operations behave. Reads can be scaled by copying data, while writes require careful coordination to maintain correctness. This is where trade-offs come in. Optimizing for reads can improve performance but may introduce stale data, while scaling writes increases throughput but makes the system harder to manage. In the end, effective system design starts with understanding your workload. Because how you scale depends on what your system does most.
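The asymmetry described above (reads scale by copying, writes scale by partitioning) can be sketched as a tiny router. This is an illustrative Python sketch, assuming hash-based partitioning for writes and round-robin across replicas for reads; `ReadWriteRouter` and the node names are hypothetical.

```python
import hashlib
import itertools
from typing import List


class ReadWriteRouter:
    def __init__(self, write_shards: List[str], read_replicas: List[str]):
        self.write_shards = write_shards
        self._replicas = itertools.cycle(read_replicas)

    def shard_for_write(self, key: str) -> str:
        """Each key has exactly one owning shard, so writes to it stay ordered."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.write_shards)
        return self.write_shards[index]

    def replica_for_read(self) -> str:
        """Any replica can answer a read, so reads are simply spread out."""
        return next(self._replicas)
```

Adding a read replica only extends the cycle, while adding a write shard changes which shard owns many keys under this simple modulo scheme. That difference is a small picture of why writes are harder to scale.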

G.Jonathan

@beardedtech_guy

Apr 14

𝗗𝗮𝘆 𝟵𝟭 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗭𝗲𝗿𝗼-𝗗𝗼𝘄𝗻𝘁𝗶𝗺𝗲 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁𝘀 In distributed systems, downtime during deployments is no longer acceptable, because users expect services to be available at all times, regardless of updates or changes happening behind the scenes. Zero-downtime deployments are designed to meet this expectation by allowing systems to be updated without taking them offline, ensuring that users can continue interacting with the system without interruption. Instead of shutting down services to apply changes, new versions are introduced gradually while the system is still running. Old and new versions coexist for a period of time, and traffic is shifted carefully until the transition is complete. This approach relies on strategies like rolling updates, blue-green deployments, and canary releases, used alone or in combination to make deployments smooth and controlled. The challenge, however, lies in ensuring compatibility. Both versions of the system must work together seamlessly, especially when dealing with shared data and ongoing user activity. Without this level of planning, deployments can introduce inconsistencies or unexpected failures. With it, deployments become invisible to users. Because in modern system design, it is not just about releasing new features. It is about releasing them without anyone noticing.
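The "old and new versions coexist" idea can be shown with a toy rolling update. This Python sketch models a fleet as a list of version labels and replaces instances in batches; real orchestrators also wait for health checks between batches, which is omitted here. `rolling_update` is a hypothetical name.

```python
from typing import Iterator, List


def rolling_update(fleet: List[str], new_version: str,
                   batch_size: int = 1) -> Iterator[List[str]]:
    """Yield the fleet after each batch; untouched instances keep serving traffic."""
    current = list(fleet)
    for start in range(0, len(current), batch_size):
        for index in range(start, min(start + batch_size, len(current))):
            current[index] = new_version
        yield list(current)


steps = list(rolling_update(["v1", "v1", "v1"], "v2"))
```

Every intermediate step contains both `v1` and `v2`, which is exactly why both versions must be compatible with the same shared data.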

G.Jonathan

@beardedtech_guy

Apr 13

𝗗𝗮𝘆 𝟵𝟬 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗦𝗰𝗵𝗲𝗺𝗮 𝗠𝗶𝗴𝗿𝗮𝘁𝗶𝗼𝗻𝘀 In distributed systems, evolving your application is expected, but evolving your data safely is where the real challenge lies. 𝗦𝗰𝗵𝗲𝗺𝗮 𝗺𝗶𝗴𝗿𝗮𝘁𝗶𝗼𝗻𝘀 are how databases adapt to change, allowing you to modify structures like tables and columns without breaking the system that depends on them. Unlike code, database changes are harder to reverse and often affect large volumes of data, which means a small mistake can lead to downtime, inconsistencies, or even data loss. That is why migrations must be handled carefully, not as one-time changes but as controlled transitions. In practice, safe migrations are done incrementally by introducing new structures first, updating the application to use them, and only removing old ones after everything is stable. This ensures that both old and new versions of the system can coexist during the transition. Without this approach, deployments become risky, especially in systems that need to remain available at all times. With well-planned migrations, systems can evolve continuously without disrupting users or compromising data integrity. Because in the end, system design is not just about building features. It is about evolving safely. #SystemDesign #DistributedSystems #BackendEngineering #DatabaseDesign #100DaysOfCode
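The incremental approach above (introduce the new structure, move the application over, remove the old one later) is often called expand and contract. Here is a minimal sketch of the first two steps using Python's built-in `sqlite3`; the `users` table and its column names are invented for illustration.

```python
import sqlite3


def expand_and_backfill(conn: sqlite3.Connection) -> None:
    # Expand: add the new column. Old code that only knows full_name keeps working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    # Backfill: copy existing values so new code can rely on display_name.
    conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")
    conn.commit()
    # Contract (dropping full_name) is a separate, later step once nothing reads it.


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")
expand_and_backfill(conn)
rows = conn.execute("SELECT full_name, display_name FROM users").fetchall()
```

After the migration both columns hold the same value, so the old and new application versions can run side by side during the transition.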

G.Jonathan

@beardedtech_guy

Apr 12

𝗗𝗮𝘆 𝟴𝟵 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗥𝗼𝗹𝗹𝗯𝗮𝗰𝗸𝘀 In distributed systems, no deployment is ever completely safe, because even well-tested changes can fail under real-world conditions. Rollbacks exist to make those failures manageable by providing a way to quickly return to a previous stable version instead of trying to fix issues while users are already affected. At its core, a rollback is about restoring stability. When a new release introduces errors or degrades performance, the system simply switches back to the last known working version, allowing normal operations to resume while the issue is investigated. Without rollbacks, a bad deployment can turn into a prolonged outage, as teams scramble to debug and patch problems in a live environment. With rollbacks, recovery becomes immediate, reducing impact and giving teams the space to fix issues properly. However, rollbacks are not always trivial. They require careful versioning, backward compatibility, and consideration of data changes, because reverting code without aligning data can create new inconsistencies. In the end, rollbacks are not just a fallback plan. They are a core part of safe system design, ensuring that no change is ever truly irreversible.
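Switching back to the last known working version depends on keeping that version around. A minimal Python sketch, assuming releases are tracked as an ordered history; `ReleaseHistory` is a hypothetical class, and it deliberately ignores the data-compatibility problem mentioned above.

```python
from typing import List


class ReleaseHistory:
    def __init__(self) -> None:
        self._versions: List[str] = []

    def deploy(self, version: str) -> None:
        self._versions.append(version)

    @property
    def current(self) -> str:
        return self._versions[-1]

    def rollback(self) -> str:
        """Discard the latest release and return the previous stable one."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        return self.current
```

The rollback itself is one cheap operation. The hard part, as the tweet notes, is making sure the older code can still read whatever data the newer version has already written.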

G.Jonathan

@beardedtech_guy

Apr 11

𝗗𝗮𝘆 𝟴𝟴 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗥𝗲𝗰𝗮𝗽: 𝗔𝘃𝗮𝗶𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆, 𝗥𝗲𝗰𝗼𝘃𝗲𝗿𝘆, 𝗮𝗻𝗱 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟭, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗛𝗶𝗴𝗵 𝗔𝘃𝗮𝗶𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝘃𝘀 𝗙𝗮𝘂𝗹𝘁 𝗧𝗼𝗹𝗲𝗿𝗮𝗻𝗰𝗲, understanding the difference between systems that recover quickly from failure and systems that are designed to never go down at all. 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟮, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗙𝗮𝗶𝗹𝗼𝘃𝗲𝗿 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀, exploring how systems switch to healthy components when failures occur, ensuring continuity instead of downtime. 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟯, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗗𝗶𝘀𝗮𝘀𝘁𝗲𝗿 𝗥𝗲𝗰𝗼𝘃𝗲𝗿𝘆, shifting the focus to large-scale failures and how systems are restored after catastrophic events. 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟰, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗥𝗧𝗢 & 𝗥𝗣𝗢, defining how fast systems should recover and how much data loss is acceptable, bringing structure to recovery planning. 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟱, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗗𝗮𝘁𝗮 𝗕𝗮𝗰𝗸𝘂𝗽𝘀 𝗮𝘁 𝗦𝗰𝗮𝗹𝗲, understanding how data is protected reliably as systems grow and become more complex. 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟲, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗕𝗹𝘂𝗲-𝗚𝗿𝗲𝗲𝗻 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁𝘀, learning how to release changes safely by switching between identical environments without downtime. 𝗢𝗻 𝗱𝗮𝘆 𝟴𝟳, 𝘄𝗲 𝘁𝗮𝗹𝗸𝗲𝗱 𝗮𝗯𝗼𝘂𝘁 𝗖𝗮𝗻𝗮𝗿𝘆 𝗥𝗲𝗹𝗲𝗮𝘀𝗲𝘀, introducing gradual rollouts that reduce risk by exposing changes to a small subset of users before going fully live. What ties all of these together is a single idea: systems are not just designed to work, they are designed to handle failure and change. Availability ensures systems stay accessible. Recovery ensures systems can bounce back. Deployment strategies ensure systems can evolve safely. Because in real-world systems, it is not enough to build for success. You have to design for failure… and still keep moving forward.

G.Jonathan

@beardedtech_guy

Apr 10

𝗗𝗮𝘆 𝟴𝟳 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗖𝗮𝗻𝗮𝗿𝘆 𝗥𝗲𝗹𝗲𝗮𝘀𝗲𝘀 In distributed systems, releasing a new version to all users at once can be one of the riskiest decisions a team makes, because even a small issue can quickly scale into a widespread failure when exposed to full production traffic. 𝗖𝗮𝗻𝗮𝗿𝘆 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝘀 solve this problem by introducing change gradually instead of all at once, allowing a new version of a system to be deployed to a small subset of users while the majority continues using the stable version. This creates an opportunity to observe real-world behavior, monitor system performance, and detect issues early before they impact everyone. As confidence grows, the rollout is expanded step by step until the new version fully replaces the old one, making the entire deployment process feel less like a leap and more like a controlled transition. Without canary releases, failures tend to affect all users at the same time, making them harder to contain and more damaging. With canary releases, the impact is limited, giving teams the ability to react quickly and make informed decisions based on actual system behavior. This approach does come with added complexity, as it requires strong monitoring, traffic routing, and the ability to manage multiple versions of a system simultaneously, but the trade-off is a much safer and more reliable deployment process. In the end, canary releases shift deployments from high-risk events into gradual experiments, where systems evolve carefully instead of changing all at once.
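Deploying to a small subset of users needs a stable way to decide who is in that subset. One common approach, sketched here in Python under that assumption, hashes the user ID into a bucket from 0 to 99 and compares it with the current canary percentage; `bucket_for` and `version_for` are hypothetical names.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket in 0..99, independent of process or host."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def version_for(user_id: str, canary_percent: int) -> str:
    """Users whose bucket falls under the threshold get the canary build."""
    return "canary" if bucket_for(user_id) < canary_percent else "stable"
```

Because the bucket is derived from the user ID, a user who sees the canary at 5% keeps seeing it at 25%, so expanding the rollout never flips anyone back and forth between versions.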

G.Jonathan

@beardedtech_guy

Apr 9

𝗗𝗮𝘆 𝟴𝟲 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗕𝗹𝘂𝗲-𝗚𝗿𝗲𝗲𝗻 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁𝘀 In distributed systems, deployments are one of the riskiest moments. A single bad release can break features, affect users, or bring everything down. Blue-green deployments are designed to remove that risk by changing how releases happen. Instead of updating the live system directly, you maintain two identical environments. One runs the current version, while the other holds the new version ready to go. The new version is deployed and tested in isolation, without affecting users. When everything is confirmed to be working, traffic is simply switched to the new environment, making the release instant and seamless. If anything goes wrong, switching back is just as fast. Without this approach, deployments can feel like a gamble. With blue-green deployments, releases become controlled, predictable, and reversible. The trade-off is cost and complexity, since you need to maintain duplicate environments and handle data consistency carefully. But in return, you gain confidence. Because in real systems, it is not just about building features. It is about releasing them safely.
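The traffic switch at the heart of blue-green can be modelled as a single pointer. A minimal Python sketch, assuming each environment is just a callable that handles a request; `BlueGreenRouter` is a hypothetical class, and shared-data concerns are left out.

```python
from typing import Callable, Dict


class BlueGreenRouter:
    def __init__(self, blue: Callable[[str], str], green: Callable[[str], str]):
        self.environments: Dict[str, Callable[[str], str]] = {"blue": blue, "green": green}
        self.live = "blue"  # all traffic goes to whichever environment is live

    def handle(self, request: str) -> str:
        return self.environments[self.live](request)

    def switch(self) -> str:
        """Flip traffic to the other environment; calling it again switches back."""
        self.live = "green" if self.live == "blue" else "blue"
        return self.live
```

Release and rollback are the same operation here, which is what makes the switch back just as fast as the switch forward.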

G.Jonathan

@beardedtech_guy

Apr 8

𝗗𝗮𝘆 𝟴𝟱 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗗𝗮𝘁𝗮 𝗕𝗮𝗰𝗸𝘂𝗽𝘀 𝗮𝘁 𝗦𝗰𝗮𝗹𝗲 In distributed systems, data is more valuable than uptime. You can recover from downtime. You can’t always recover from lost data. That’s why backups are not just a safety net, they are a core part of system design, especially as systems grow. At a small scale, backups feel simple. You copy data, store it somewhere safe, and restore it when needed. But at scale, things change. Data grows rapidly, systems become distributed, and backing up everything frequently becomes expensive, slow, and sometimes impractical. This is where strategy comes in. Instead of copying everything repeatedly, systems rely on incremental backups and snapshots, capturing only what has changed. This reduces storage costs, saves time, and makes backups more efficient without sacrificing reliability. But even with these strategies, trade-offs remain. Frequent backups reduce data loss but increase cost and resource usage. Less frequent backups save resources but increase risk. There is no perfect setup, only the right balance based on how much data you can afford to lose and how quickly you need to recover. Because in the end, backups are not about storing data. They are about making sure that when something goes wrong, recovery is not a question… it is a guarantee.
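Capturing only what has changed can be sketched with content hashes. This Python example assumes files are held as an in-memory mapping of path to bytes and that the previous backup left behind a manifest of hashes; `incremental_backup` is a hypothetical function, not a real backup tool.

```python
import hashlib
from typing import Dict, Tuple


def incremental_backup(files: Dict[str, bytes],
                       previous_manifest: Dict[str, str]) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Return (changed files to store, new manifest) by comparing content hashes."""
    changed: Dict[str, bytes] = {}
    manifest: Dict[str, str] = {}
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        manifest[path] = digest
        if previous_manifest.get(path) != digest:
            changed[path] = data  # new or modified since the last backup
    return changed, manifest
```

The first run copies everything and later runs copy only the differences. A restore then needs the full chain of increments, which is part of the trade-off between backup cost and recovery speed.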

G.Jonathan

@beardedtech_guy

Apr 7

𝗗𝗮𝘆 𝟴𝟰 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗥𝗧𝗢 & 𝗥𝗣𝗢 In distributed systems, planning for failure is not enough. You also need to define how fast you recover and how much data you can afford to lose. That’s where RTO and RPO come in. 𝗥𝗧𝗢, 𝗼𝗿 𝗥𝗲𝗰𝗼𝘃𝗲𝗿𝘆 𝗧𝗶𝗺𝗲 𝗢𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲, defines how quickly a system should be restored after a failure. It answers the question: how long can the system be down before it becomes a problem? 𝗥𝗣𝗢, 𝗼𝗿 𝗥𝗲𝗰𝗼𝘃𝗲𝗿𝘆 𝗣𝗼𝗶𝗻𝘁 𝗢𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲, defines how much data loss is acceptable. It answers a different question: how far back in time can we go when restoring data? These two concepts shape how disaster recovery systems are designed. A low RTO means faster recovery, often requiring automated failover and highly available infrastructure. A low RPO means minimal data loss, which usually requires frequent backups or real-time data replication. Without clearly defined RTO and RPO, recovery becomes guesswork. With them, system design becomes intentional, balancing business needs, cost, and complexity. The reality is, you can’t optimize for everything. Faster recovery and less data loss come at a cost, and every system must decide what is acceptable based on its use case. Because in the end, resilience is not just about surviving failure. It’s about knowing how fast you recover and how much you can afford to lose.
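RTO and RPO become useful once they are checked against what the system actually does. A rough Python sketch, assuming periodic backups (so the worst-case loss is about one backup interval) and a recovery made up of detection time plus restore time; all function names and durations are illustrative.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses about one full interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(detection: timedelta, restore: timedelta, rto: timedelta) -> bool:
    """Downtime is the time to notice the failure plus the time to restore service."""
    return detection + restore <= rto
```

Hourly backups cannot satisfy a 15-minute RPO no matter how fast the restore is, which is why a low RPO pushes designs toward frequent backups or continuous replication.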

G.Jonathan

@beardedtech_guy

Apr 5

𝗗𝗮𝘆 𝟴𝟮 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗙𝗮𝗶𝗹𝗼𝘃𝗲𝗿 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 In distributed systems, failure is inevitable, but downtime is a choice. 𝗙𝗮𝗶𝗹𝗼𝘃𝗲𝗿 𝘀𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀 define how a system responds when something breaks, ensuring that services remain available by shifting operations to a backup or standby system instead of waiting for recovery. At its core, failover is about continuity. When a service goes down, the system detects it and redirects traffic to another healthy instance so users can continue without significant disruption. This is made possible through redundancy, health checks, and intelligent routing, often handled by load balancers or orchestration systems working behind the scenes. There are different ways to approach this. Some systems use an active-passive setup where a standby system takes over only when failure occurs, while others use active-active configurations where multiple systems are running at the same time, sharing the load and reducing the risk of downtime. Without failover, a single failure can make an entire service unavailable. With failover, failures become events the system can handle, not disasters users have to experience. Designing failover is not just about having backups, it is about deciding how quickly your system should recover and how seamless that recovery needs to be. Because in real-world systems, it is not enough to build for success. You have to design for what happens next when things go wrong.
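An active-passive setup driven by health checks can be sketched as a small state machine. This Python example assumes the caller reports each health-check result for the primary and that failover triggers after a configurable number of consecutive failures; `FailoverMonitor` is a hypothetical class with no automatic failback.

```python
class FailoverMonitor:
    def __init__(self, primary: str, standby: str, failure_threshold: int = 3):
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.active = primary
        self.consecutive_failures = 0

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health check of the primary and return where traffic should go."""
        if self.active == self.primary:
            if primary_healthy:
                self.consecutive_failures = 0
            else:
                self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.standby  # fail over; stays here until an operator resets
        return self.active
```

The threshold is the design decision: a low value fails over quickly but reacts to blips, while a high value avoids flapping at the cost of a longer outage before the standby takes over.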

G.Jonathan

@beardedtech_guy

Apr 4

𝗗𝗮𝘆 𝟴𝟭 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗛𝗶𝗴𝗵 𝗔𝘃𝗮𝗶𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝘃𝘀 𝗙𝗮𝘂𝗹𝘁 𝗧𝗼𝗹𝗲𝗿𝗮𝗻𝗰𝗲 In distributed systems, failure is not something you try to avoid completely. It is something you design for. 𝗛𝗶𝗴𝗵 𝗔𝘃𝗮𝗶𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗮𝗻𝗱 𝗙𝗮𝘂𝗹𝘁 𝗧𝗼𝗹𝗲𝗿𝗮𝗻𝗰𝗲 are often used interchangeably, but they represent two different approaches to handling failure in real systems. 𝗛𝗶𝗴𝗵 𝗔𝘃𝗮𝗶𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 is about keeping systems accessible by reducing downtime as much as possible. When a failure occurs, the system may experience a brief disruption, but it recovers quickly and continues serving users through mechanisms like redundancy and failover. 𝗙𝗮𝘂𝗹𝘁 𝗧𝗼𝗹𝗲𝗿𝗮𝗻𝗰𝗲 takes this a step further by ensuring that the system continues to operate without any visible interruption, even while failures are happening. Instead of recovering after failure, the system is designed to absorb it in real time. Without high availability, systems remain down longer than necessary, affecting user experience and reliability. Without fault tolerance, failures become noticeable, even if the system eventually recovers. The difference lies in timing and expectation. High availability accepts that failures may cause short downtime but focuses on rapid recovery, while fault tolerance is designed to prevent downtime altogether.
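"Short downtime" is easier to reason about as a number. This Python sketch converts an availability figure into expected downtime per year and shows how redundancy raises availability. It assumes a 365-day year and fully independent failures between copies, which real systems rarely achieve, so treat the results as optimistic estimates.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime for an availability expressed as a fraction (0.999 = 99.9%)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single: float, copies: int) -> float:
    """Availability of redundant copies where any one copy is enough to serve."""
    return 1.0 - (1.0 - single) ** copies
```

Under these assumptions, 99.9% availability still allows roughly 525.6 minutes of downtime a year, and two independent 99% instances together reach about 99.99%.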

G.Jonathan

@beardedtech_guy

Apr 4

𝗗𝗮𝘆 𝟴𝟬 𝗼𝗳 #𝟭𝟬𝟬𝗗𝗮𝘆𝘀𝗢𝗳𝗦𝘆𝘀𝘁𝗲𝗺𝗗𝗲𝘀𝗶𝗴𝗻 — 𝗥𝗲𝗰𝗮𝗽𝘀 (𝗥𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝗰𝗲 𝗽𝗮𝘁𝘁𝗲𝗿𝗻𝘀: 𝗳𝗮𝘂𝗹𝘁 𝗶𝘀𝗼𝗹𝗮𝘁𝗶𝗼𝗻, 𝘁𝗿𝗮𝗳𝗳𝗶𝗰 𝗰𝗼𝗻𝘁𝗿𝗼𝗹, 𝗮𝗻𝗱 𝗴𝗿𝗮𝗰𝗲𝗳𝘂𝗹 𝗱𝗲𝗴𝗿𝗮𝗱𝗮𝘁𝗶𝗼𝗻) From day 76 to day 79, the focus shifted into a deeper layer of system resilience: not just building systems that work, but systems that survive pressure, failure, and unpredictability. On day 76, we explored Throttling vs Quotas, understanding how systems control usage both in bursts and over time, shaping how resources are shared fairly. On day 77, we talked about Graceful Degradation, a reminder that failure doesn’t have to mean total collapse; systems can bend without breaking. On day 78, we introduced Backpressure, where systems stop pretending everything is fine and start communicating when they’re overwhelmed. On day 79, we covered Bulkheads, isolating failures so that one weak part doesn’t take everything down with it. At this point in the journey, one thing is becoming clear: System design is less about building features and more about designing behavior under stress. On to the next one. #SystemDesign #BackendEngineering #DistributedSystems #BuildInPublic #100DaysOfCode