Atlantis UI Refresh: Plan Persistence & User Experience

by Viktoria Ivanova

Hey guys! Let's dive into a super important discussion about refreshing the Atlantis UI. This is all about making our lives easier when we're managing infrastructure as code. We're talking about making sure those crucial plan results and job histories stick around, even when things get a little shaky with updates and scaling. So, grab your favorite beverage, and let’s get into the details!

The Current Challenge with Atlantis UI

Currently, many teams rely on the Atlantis UI to review plan results and the complete job history. The challenge arises during those pesky rolling updates or auto-scaling events. You know the drill: a container gets stopped and replaced. Now, here’s the kicker – after the replacement, the Jobs list in the UI? Poof! It's empty. And those detailed step-by-step plan outputs? Gone, even though the actual workflow-workspace.tfplan files are chilling safely in your shared storage, like Amazon EFS. This can be a real headache, especially when you need to go back and review what happened.

Imagine this: You've got multiple Atlantis containers happily deployed on AWS ECS. You’re using a shared file system, like Amazon EFS, mounted at the same path in every container. You've even got a lock database humming along in Redis. Everything seems smooth, right? But then, bam, a container gets replaced, and suddenly your UI is showing a blank slate. It’s like your Atlantis instance has amnesia! This is not ideal, especially when you need to track down past plans or troubleshoot an issue. We need a solution that ensures the UI reflects the actual state of our infrastructure, even across container restarts.

This ephemeral nature of the UI state means that each time a container restarts, the historical context is lost. This forces teams to rely on other means of tracking plan executions, which can be cumbersome and less efficient. We want to provide a seamless experience where the UI consistently displays the history of plans, regardless of container lifecycle events. This is not just about convenience; it's about maintaining a reliable audit trail and ensuring that teams have the information they need at their fingertips. By addressing this issue, we can significantly improve the usability and value of Atlantis in dynamic environments.

Proposed Solution: Re-hydrating the Atlantis UI

So, what’s the fix? The idea is simple: on container start-up, Atlantis should re-hydrate the UI from any plan artifacts already hanging out in the shared storage. Think of it like Atlantis waking up, taking a look around, and saying, "Oh yeah, I remember those plans!" This involves two key phases:

  1. Discovery Phase: Atlantis scans the EFS mount for those workflow-workspace.tfplan files for every repo and workspace. It's like a little detective, sniffing out all the past plans.
  2. Re-index Phase: Atlantis then populates its internal job cache, so the Jobs page in the UI shows those historical entries, exactly as they appeared before the container restart. It's like the UI getting its memory back!
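To make the two phases concrete, here's a minimal Python sketch. Treat it as an illustration only, not Atlantis's actual implementation (Atlantis itself is written in Go). It assumes a hypothetical layout of <root>/<repo path>/<workspace>/<file>.tfplan on the shared mount. The names PlanArtifact, discover_plans, and reindex are made up for this example, and the file's modification time stands in for "when the plan ran".

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlanArtifact:
    repo: str
    workspace: str
    path: Path
    modified: float  # file mtime, a stand-in for when the plan ran


def discover_plans(root: Path) -> list:
    """Discovery phase: find every *.tfplan file under the shared mount."""
    found = []
    for path in root.rglob("*.tfplan"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) < 3:
            continue  # does not fit the assumed <repo>/<workspace>/<file> layout
        repo = "/".join(parts[:-2])
        workspace = parts[-2]
        found.append(PlanArtifact(repo, workspace, path, path.stat().st_mtime))
    return found


def reindex(artifacts) -> dict:
    """Re-index phase: group artifacts per (repo, workspace), newest first."""
    cache = {}
    for artifact in artifacts:
        cache.setdefault((artifact.repo, artifact.workspace), []).append(artifact)
    for entries in cache.values():
        entries.sort(key=lambda a: a.modified, reverse=True)
    return cache
```

At start-up, something like reindex(discover_plans(Path("/mnt/efs/atlantis"))) would produce the structure a Jobs page could render from. The mount path here is just a placeholder.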

This approach makes plans and their metadata first-class persisted resources, not just temporary container state. It’s similar to how locks survive restarts when you’re using Redis. It's all about making sure that the UI reflects the actual, persistent state of your plans, rather than just what's happening in the current container instance. By treating plan data as persistent, we can ensure a more consistent and reliable user experience, especially in dynamic environments where containers are frequently cycled.

The goal here is to create an experience that feels robust and dependable. Users should be able to rely on the Atlantis UI as a single source of truth for their infrastructure plans, regardless of the underlying container lifecycle. This re-hydration process ensures that the UI accurately reflects the history of plan executions, providing valuable context for debugging, auditing, and collaboration. Ultimately, this enhances the overall value and usability of Atlantis, making it an even more powerful tool for managing infrastructure as code.

Potential Drawbacks

Of course, no solution is perfect, and it’s crucial to consider the downsides. Here are a few potential drawbacks we need to think about:

  • Startup Latency: Imagine you have massive repositories with tons of historical plans. Scanning all those files could slow down container boot time while the index gets rebuilt. It’s like waiting for your computer to start up when you have a million files on your desktop – it can take a while!
  • Metadata Drift: What if a plan file exists, but the corresponding PR or commit has been deleted? The UI might show these “orphaned” entries, which could be confusing. We’d need some extra logic to handle these cases.
  • Concurrency Complexity: Picture multiple containers all running this discovery process at the same time. They might race to write the same metadata into Redis, leading to conflicts or duplicate entries. We’d need some coordination, like Redis transactions or leader election, to keep things in order.
  • Maintenance Overhead: If we change the plan file formats or storage paths in the future, we’d need to update the re-hydration code to match. It’s like needing to update your GPS every time the roads change.
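For the concurrency point, one common way to coordinate is a simple election built on Redis's SET with the NX and EX options: whichever container sets the key first runs the scan, and everyone else skips it. Here's a rough sketch, assuming a client with a redis-py style set(name, value, nx=..., ex=...) method. The key name and try_become_indexer are hypothetical, and FakeRedis is only there so the example runs without a server.

```python
import uuid


def try_become_indexer(client, key="atlantis:reindex-leader", ttl_seconds=300):
    """Return a token if this container won the election, else None."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl: succeeds only if the key does not exist yet.
    won = client.set(key, token, nx=True, ex=ttl_seconds)
    return token if won else None


class FakeRedis:
    """In-memory stand-in that mimics the return values of redis-py's set()."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self._data:
            return None  # redis-py returns None when NX blocks the write
        self._data[name] = value
        return True


shared = FakeRedis()
first = try_become_indexer(shared)   # this container wins and gets a token
second = try_become_indexer(shared)  # this one loses and gets None
```

The expiry matters: if the winning container dies mid-scan, the key disappears after ttl_seconds and another container can try again.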

It’s important to address these drawbacks proactively. For startup latency, we could build the index in the background or load entries lazily, so the container starts serving before the scan finishes. For metadata drift, a periodic cleanup could remove orphaned entries. For concurrency, a Redis lock or leader election could ensure only one container writes shared metadata at a time. And for maintenance overhead, we should keep the re-hydration logic modular so it’s easy to adapt when plan storage or formats change.
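Here's what indexing in the background could look like, as a small Python sketch. The idea is that the server starts right away, the scan runs on a worker thread, and the UI can check a "ready" flag to tell whether the history is still loading. JobIndex and start_background_reindex are hypothetical names, and scan is any function that yields entries.

```python
import threading


class JobIndex:
    """Thread-safe list of job entries plus a flag for 'scan finished'."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []
        self.ready = threading.Event()

    def add(self, entry):
        with self._lock:
            self._entries.append(entry)

    def snapshot(self):
        # Return a copy so callers never see the list while it is being changed.
        with self._lock:
            return list(self._entries)


def start_background_reindex(index, scan):
    """Run scan() on a daemon thread so start-up is not blocked by it."""

    def worker():
        for entry in scan():
            index.add(entry)
        index.ready.set()

    thread = threading.Thread(target=worker, name="reindex", daemon=True)
    thread.start()
    return thread
```

While index.ready is not set, the Jobs page could show whatever snapshot() returns so far along with a "still loading history" hint.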

By carefully considering these drawbacks and developing mitigation strategies, we can ensure that the re-hydration solution is both effective and maintainable in the long run. This proactive approach will help us deliver a robust and reliable user experience, even as our infrastructure and tooling evolve.

Alternative Solutions Considered

We also brainstormed a few other ways to tackle this problem, but they each had their own downsides:

  1. Separate “Archived Plans” Tab: We could keep the current Jobs list as is (ephemeral) but add a new tab that lists plans found on disk. But then users might get confused about where to look for plans.
  2. Persist Job Metadata in Redis (or DynamoDB): We could write a small record to Redis each time a plan completes and rebuild the UI from Redis keys at startup. But this introduces a second persistence strategy, and if the cache gets flushed, the index is lost.
  3. Force Containers to Run in “Sticky” Mode: We could disable automatic task replacement to avoid losing UI state. But this defeats the purpose of using ECS for automatic updates and rescheduling.

Let's break down these alternatives a bit further. The "Archived Plans" tab idea, while seemingly straightforward, could lead to a fragmented user experience. Users might not immediately know whether to look in the current Jobs list or the Archived Plans tab, potentially causing confusion and frustration. The separation of views could also make it harder to track the complete history of a plan execution, as users would need to switch between tabs to piece together the information.

Persisting job metadata in Redis or DynamoDB has its merits, but it also introduces additional complexity. Maintaining two separate persistence mechanisms (plans on EFS and metadata in Redis) can be challenging, especially when it comes to keeping the two consistent. A cache flush in Redis could result in the loss of valuable metadata, leaving the UI incomplete despite the plan files still being available on EFS. This approach also adds operational overhead, as we would need to manage and monitor the Redis instance or DynamoDB table that holds the index.
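To see that trade-off in miniature, here's a hedged sketch of the Redis variant: one small JSON record per completed plan, and an index rebuilt by scanning keys. It assumes a redis-py style client with set, get, and scan_iter. The key prefix, the record fields, and the helper names are all made up for illustration, and InMemoryRedis is a stand-in so the example runs without a server.

```python
import fnmatch
import json

KEY_PREFIX = "atlantis:job:"  # hypothetical key namespace


def record_plan(client, repo, workspace, pull, finished_at):
    """Write one small record when a plan completes."""
    record = {"repo": repo, "workspace": workspace,
              "pull": pull, "finished_at": finished_at}
    client.set(f"{KEY_PREFIX}{repo}:{pull}:{workspace}", json.dumps(record))


def rebuild_index(client):
    """Rebuild the job list from stored records, newest first."""
    records = [json.loads(client.get(key))
               for key in client.scan_iter(match=KEY_PREFIX + "*")]
    return sorted(records, key=lambda r: r["finished_at"], reverse=True)


class InMemoryRedis:
    """Tiny stand-in covering only the calls used above."""

    def __init__(self):
        self._data = {}

    def set(self, name, value):
        self._data[name] = value
        return True

    def get(self, name):
        return self._data.get(name)

    def scan_iter(self, match=None):
        for key in list(self._data):
            if match is None or fnmatch.fnmatchcase(key, match):
                yield key

    def flushdb(self):
        self._data.clear()
        return True
```

After a flushdb(), rebuild_index returns an empty list even though every plan file would still be sitting on EFS, which is exactly the failure mode described above.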

Forcing containers to run in "sticky" mode is the most operationally restrictive option. While it would prevent the loss of UI state during container restarts, it would also negate the benefits of using ECS for automatic scaling and updates. This approach would make it harder to deploy new versions of Atlantis and could lead to downtime or service disruptions. It's simply not a viable option for environments that require high availability and continuous deployment.

Why Re-hydrating from EFS Wins

Given all these trade-offs, re-hydrating the Jobs list directly from EFS seems like the sweet spot. It balances user experience and architectural simplicity. Plus, it aligns perfectly with how plan artifacts are already stored today. It’s all about sticking with what works and making it even better!

Think about it: we’re already storing the plan files on EFS, so why not leverage that? By scanning EFS for existing plans, we can rebuild the UI state without introducing new dependencies or persistence mechanisms. This approach keeps the architecture clean and straightforward, making it easier to maintain and troubleshoot. It also ensures that the UI accurately reflects the actual plans that have been executed, providing a reliable source of truth for users.

Moreover, re-hydrating from EFS provides a seamless user experience. Users can access the complete history of plans, even after container restarts, without having to navigate to different views or tabs. This consistency is crucial for maintaining a smooth workflow and ensuring that users have the information they need at their fingertips. By focusing on simplicity and usability, we can make Atlantis an even more valuable tool for infrastructure as code management.

In conclusion, the re-hydration approach strikes the right balance between functionality, maintainability, and user experience. It addresses the core issue of UI state persistence in a straightforward and efficient manner, while minimizing the risks and complexities associated with alternative solutions. This makes it the most compelling option for enhancing the Atlantis UI and providing a more robust and reliable experience for our users.

Next Steps

So, what’s next? Let's keep this conversation going! Share your thoughts, ideas, and concerns. Together, we can make the Atlantis UI even better and more user-friendly. Let's make managing our infrastructure a breeze, guys!