Architecture Decision Records in the Age of AI
Rethinking pull requests with GitHub stacked PRs and discussing Claude Mythos
In the age of AI, where everything seems to be a markdown file somewhere, it is interesting to see a renewed interest in having the (right) documentation available.
An Architecture Decision Record (ADR) is one type of documentation that can be valuable, as it captures why a certain decision was made.
Our code tends to capture only what is being done, leaving the reasons for choosing a certain approach or technology as tribal knowledge. As with most things in our field, context is king, and as it changes, past decisions may need to be revisited.
An ADR captures these decisions, helping you avoid the merry-go-round of revisiting previous options without knowing whether the conditions that warranted the original decision have changed.
As Martin Fowler recently shared, ADRs should be short documents, preferably stored in the source repository, alongside the code.
They should follow the inverted pyramid style, where the most important things (such as the status and the decision) come first, and the additional details (the assessed options, pros and cons) come at the end.
Once a decision has been made, an ADR must never be updated. If the decision changes, write a new ADR that supersedes the previous one. This keeps the history intact and makes it clear which version is the latest.
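To make the "never update, only supersede" rule concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the ADR class, its field names, the ADR-0001 numbering format, and the example decisions do not come from any particular template. Records are frozen, a new record points at the one it replaces, and the rendering puts status and decision before the assessed options.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a recorded decision is never edited in place
class ADR:
    """Hypothetical minimal ADR; field names are illustrative only."""
    number: int
    title: str
    decision: str
    options: tuple[str, ...] = ()
    supersedes: Optional[int] = None  # number of the ADR this one replaces


def latest(log: list[ADR]) -> list[ADR]:
    """Return the ADRs that no later record has superseded."""
    replaced = {a.supersedes for a in log if a.supersedes is not None}
    return [a for a in log if a.number not in replaced]


def render(adr: ADR, log: list[ADR]) -> str:
    """Inverted pyramid: status and decision first, assessed options last."""
    successor = next((a.number for a in log if a.supersedes == adr.number), None)
    status = "Accepted" if successor is None else f"Superseded by ADR-{successor:04d}"
    lines = [
        f"ADR-{adr.number:04d}: {adr.title}",
        f"Status: {status}",
        f"Decision: {adr.decision}",
    ]
    if adr.options:
        lines.append("Options considered:")
        lines.extend(f"- {option}" for option in adr.options)
    return "\n".join(lines)


# Example data (invented): the second record supersedes the first.
log = [
    ADR(1, "Session storage", "Keep sessions in the database",
        options=("Database", "In-memory cache")),
    ADR(2, "Session storage", "Move sessions to an in-memory cache",
        options=("Database", "In-memory cache"), supersedes=1),
]

if __name__ == "__main__":
    for adr in log:
        print(render(adr, log), end="\n\n")
```

The status of the older record is derived from the log rather than written into it, so the original file never has to change. Many teams instead allow a one-line status edit on the superseded ADR; either convention keeps the history readable.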
Some tips:
Don’t reinvent the wheel: There are many ADR templates out there. Pick the one that fits you best and adjust it to your reality.
Keep it short: Some ADR templates have far too many sections, which adds confusion, invites unnecessary information, and discourages the audience from reading them.
Automate as much as possible: In my case, I created an agent skill that I invoke with the relevant context. The agent asks for information based on my template and reasons with me to make sure my explanation addresses the concerns, pointing out possible weak spots.
Then save it in your repository and start documenting your decisions!
GitHub Stacked PRs
It is a known fact that long code review sessions tend to hurt the quality of the review. It is the famous pattern of a single LGTM on a 50-file PR versus 10 comments on a 2-file one.
So keeping PRs small(er) is meant to protect the quality of the review. But how do you slice your changes? And how do you make sure your successive PRs build on top of each other?
One approach has been to create multiple branches in a cascade fashion (see below) and then submit. Some challenges are:
There is no clear indication of the order of the PRs
If changes are requested or made to the first PR, you need to propagate them to every branch that descends from it
Until now, addressing this meant adopting additional tools such as Graphite. GitHub has now introduced stacked PRs as a private preview.
A stack is an ordered sequence of branches, each building on the one below it, all targeting a single trunk branch (typically main). Each branch becomes its own PR, and the dependency order is explicit; reviewers always know which PR to read first.
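The definition above can be sketched as a small data model. This is not the gh-stack implementation or its API; the Stack class, its methods, and the branch names are hypothetical. The sketch assumes each PR uses the branch directly below it as its base, with the bottom branch based on the trunk.

```python
from dataclasses import dataclass


@dataclass
class Stack:
    """Hypothetical model of a stack: ordered branches on top of one trunk."""
    trunk: str
    branches: list[str]  # ordered from bottom (review first) to top

    def base_of(self, branch: str) -> str:
        """Assumed rule: a branch builds on the one below it, or on the trunk."""
        i = self.branches.index(branch)
        return self.trunk if i == 0 else self.branches[i - 1]

    def review_order(self) -> list[tuple[int, str, str]]:
        """(position, branch, base) triples, in the order reviewers should read."""
        return [(n, b, self.base_of(b)) for n, b in enumerate(self.branches, start=1)]

    def needs_restack(self, changed: str) -> list[str]:
        """Branches above `changed` that must pick up its new commits."""
        i = self.branches.index(changed)
        return self.branches[i + 1:]


# Invented branch names, for illustration only.
stack = Stack(
    trunk="main",
    branches=["fraud-1-model", "fraud-2-endpoint", "fraud-3-scoring", "fraud-4-integration"],
)

if __name__ == "__main__":
    for position, branch, base in stack.review_order():
        print(f"PR{position}: {branch} -> {base}")
    print("After a fix in fraud-2-endpoint, restack:", stack.needs_restack("fraud-2-endpoint"))
```

Both pain points of hand-made cascading branches become simple lookups here: the review order is the list order, and a change to one branch tells you exactly which branches above it need updating.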
While not mandatory, I think it works well with vertical slices. Rather than cutting your work horizontally by layer (all the database changes in one PR, all the API changes in another, all the UI in a third), you cut vertically, and each PR delivers one coherent slice of behavior, end-to-end, even if some parts are initially stubbed.
Take the Fraud Risk feature as an example. A horizontal split would have reviewers staring at schema migrations with no business context, then a wall of service code with no UI, then a UI with no backend to reason about. A vertical stack tells a story instead:
PR1 is pure infrastructure with the FraudRisk entity, the database migration, and the repository interface. No business logic yet, but it compiles, tests pass, and a reviewer can evaluate the data model.
PR2 adds the endpoint and wires everything together, but the fraud score is hardcoded to 0. The full request/response contract is visible and reviewable without needing to understand any scoring logic.
PR3 introduces the actual calculation, but PaymentScoringService is mocked. A reviewer can focus entirely on the scoring logic and how it flows through the system, without getting pulled into external integration concerns.
PR4 swaps the mock for the real PaymentScoringService and handles the integration edge cases. By now, a reviewer has all the context they need to evaluate whether the integration is correct.
Each PR is independently reviewable and tells a clear story. The mock isn’t a shortcut; it’s a boundary that keeps each slice focused.
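As a rough sketch of what the PR3 slice could look like, the Python below puts PaymentScoringService behind an interface and supplies a mock. Only the names FraudRisk and PaymentScoringService come from the example above. The method payment_score, the function calculate_fraud_risk, the threshold, and the scoring rule itself are invented for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class PaymentScoringService(Protocol):
    """The boundary: PR3 depends only on this shape, not on the real client."""

    def payment_score(self, account_id: str) -> float: ...


@dataclass(frozen=True)
class FraudRisk:
    account_id: str
    score: float  # assumed range: 0.0 (no risk) to 1.0 (highest risk)


class MockPaymentScoringService:
    """PR3 stand-in: returns a fixed score instead of calling an external system."""

    def __init__(self, fixed_score: float) -> None:
        self.fixed_score = fixed_score

    def payment_score(self, account_id: str) -> float:
        return self.fixed_score


def calculate_fraud_risk(
    account_id: str, amount: float, payments: PaymentScoringService
) -> FraudRisk:
    # Hypothetical rule: start from the payment score and add a surcharge
    # for large amounts, capped at 1.0.
    base = payments.payment_score(account_id)
    surcharge = 0.25 if amount > 1000 else 0.0
    return FraudRisk(account_id, min(1.0, base + surcharge))


if __name__ == "__main__":
    mock = MockPaymentScoringService(fixed_score=0.5)
    print(calculate_fraud_risk("acc-1", 1500.0, mock))
```

In PR4, a real client with the same payment_score method would replace the mock, and calculate_fraud_risk would not need to change. That is what keeps the PR4 review focused on the integration alone.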
So, with stacked PRs, we would do this:
Imagine the reviewer finds something to update in one of the earlier PRs in the stack:
For more workflows, check this page.
Hopefully the feature matures soon and becomes available to everyone. And, as in most cases, there is an agent skill to help you with these operations: npx skills add github/gh-stack
Assessing Claude Mythos Preview’s cybersecurity capabilities
The discussion about using GenAI to uncover vulnerabilities in open source software has been at an all-time high, with companies revisiting their bug bounty programs due to the sheer volume of reports, bogus or not, largely accelerated via the use of AI.
Anthropic added to this by announcing a limited-access preview of its new model, dubbed Mythos. They discussed the capabilities already shown in the preview here, with the overall message that it could find and generate exploits for zero-day vulnerabilities, the oldest being a 27-year-old bug.
What was particularly interesting about this model is that the exploits went beyond direct, stack-related ones and chained many vulnerabilities together to escape the sandboxes of the layers involved.
This rapid evolution is worrying many sectors, ranging from those who provide services and tools for security assessment and prevention, to governments that fear it being used to attack financial and infrastructure-related companies.
While models like Opus 4.6 can find and fix vulnerabilities, they have a poor track record (nearly 0%) at exploiting them. Mythos, on the other hand, had a much higher success rate (72%) in the same experiment.
Software security is not a new challenge: the cycle of vulnerabilities being found, fixed, and exploited follows a known pattern. Toolkits for creating exploits have always existed, enabling targeted attacks by less technically savvy individuals.
Now this is potentially amplified to a new level, with an attacker able to identify a vulnerability and create the exploit overnight.
The article can’t share details on all the cases, since only 1% of the reported issues have been patched, but in Firefox alone, the model found 112 confirmed bugs.
While the results can’t be verified without access to Mythos, the advice is to double down on using GenAI to help you develop and secure your software. From triaging security reports to writing or applying fixes, you can shrink the window of opportunity for attackers.





