
Why automation needs snapshot-and-undo.

Automation in commerce has historically had two settings: off, or terrifying. The reason is architectural. Snapshot-and-undo is the primitive that makes automation deployable across a portfolio of 50 client shops.

By Sumeru Engineering · automation · safety · architecture

The default automation experience is "off"

Ask any agency operations lead how much automation they actually have running across their portfolio. The honest answer is "as little as possible."

It's not that operators don't want automation. The automation tools available to them (Zapier, Make, custom-coded scripts, the built-in rules in their attribution platform) share a particular failure mode: when something goes wrong, nothing rolls back. The bid you accidentally pushed 30% too high stays 30% too high until someone notices and fixes it by hand. The 47 products you accidentally paused stay paused. The audit trail says "user X ran this action at 14:32," which doesn't help when the 14:32 action was fired by an automation, not by user X.

So the rational response from operators is to use automation only for actions where the failure mode is benign. Alt-text generation? Sure — worst case the alt-text is wrong. Cart-recovery emails? OK — worst case is a slightly off-tone message. Anything touching paid spend? Touching product publish state? Touching pricing? Not a chance. Manual approval, every time.

The result is that human attention becomes the bottleneck. The blocker isn't the automation technology itself; it's the missing safety primitive.

What snapshot-and-undo actually is

Every action in Sumeru's automation engine exports three functions:

// Per-handler contract
export async function snapshot(ctx) {
  // Capture the full before-state needed to reverse this action.
  // Return a typed JSON object stored alongside the action record.
}

export async function handler(ctx, snapshot) {
  // Perform the mutation. May call Shopify, Google Ads, or any other API.
  // Should be idempotent — safe to retry if the worker crashes mid-flight.
}

export async function undoHandler(ctx, snapshot) {
  // Restore the system to the state described in `snapshot`.
  // Called when an operator clicks "undo" or an audit determines the action was wrong.
}

The contract is simple but the implications are substantial. Three things become true:

  1. Every action has a documented before-state. The snapshot is persisted; it's queryable; it's part of the audit log.
  2. Every action has a documented reverse path. The undo function is the formal specification of "what does it mean to undo this." There's no ambiguity, no "we'll figure it out."
  3. Every action is auditable end-to-end. Action record + snapshot + handler outcome + undo function = the complete description of what happened and how to reverse it.

This isn't novel as a software pattern; it's how databases have done transactional writes for 50 years. What's novel is treating cross-system commerce actions — "reduce Google Ads bid by 12% on campaign 47" — with the same rigour.
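
One way to enforce the three-function contract mechanically is to check each handler module when it is registered. The following is a minimal sketch under that assumption; `REQUIRED_EXPORTS`, `missingContractFunctions`, and `registerHandler` are hypothetical names used for illustration, not part of Sumeru's published API.

```javascript
// Hypothetical registration-time check; all names here are illustrative assumptions.
const REQUIRED_EXPORTS = ['snapshot', 'handler', 'undoHandler'];

// Returns the contract functions a handler module fails to provide.
function missingContractFunctions(mod) {
  return REQUIRED_EXPORTS.filter((name) => typeof mod[name] !== 'function');
}

// Refuses to register a handler that lacks a snapshot or an undo path.
function registerHandler(registry, actionName, mod) {
  const missing = missingContractFunctions(mod);
  if (missing.length > 0) {
    throw new Error(`${actionName} is missing: ${missing.join(', ')}`);
  }
  registry.set(actionName, mod);
  return registry;
}
```

Failing at registration time, rather than at undo time, means a handler without a reverse path cannot reach production in the first place.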

An example: reduce_paid_bid

The reduce_paid_bid action is one of Sumeru's 13 production handlers. It's also one of the highest-stakes — paid-spend mutations can cost real money fast. Here's its three-function contract, roughly:

// snapshot()
async function snapshot({ campaignId, account, shop }) {
  const campaign = await googleAds.getCampaign(account, campaignId);
  return {
    campaignId,
    accountId: account.id,
    shopId:    shop.id,
    beforeBid: campaign.bidStrategy.targetCpa,
    beforeBudget: campaign.dailyBudget,
    capturedAt: new Date().toISOString(),
  };
}

// handler()
async function handler({ adjustmentPercent, ...ctx }, snap) {
  const newBid = snap.beforeBid * (1 - adjustmentPercent / 100);
  await googleAds.updateCampaignBid(snap.accountId, snap.campaignId, newBid);
  return { newBid, appliedAt: new Date().toISOString() };
}

// undoHandler()
async function undoHandler(_ctx, snap) {
  // The before-state lives in the snapshot (second argument), not in ctx.
  const { accountId, campaignId, beforeBid } = snap;
  await googleAds.updateCampaignBid(accountId, campaignId, beforeBid);
  return { restoredBid: beforeBid, undoneAt: new Date().toISOString() };
}

Three functions, each under 20 lines, each independently testable. The persistence layer wires them together: the action record carries the snapshot, the handler outcome, and a pointer to the undo function. Every action is a transaction.
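
The wiring described above might look roughly like the following. This is a minimal sketch, not Sumeru's actual persistence layer: `createActionStore`, `runAction`, `undoAction`, the record shape, and the status values are all assumptions made for illustration, and an in-memory `Map` stands in for a real database.

```javascript
// Hypothetical wiring sketch: names, record shape, and statuses are assumptions.
// An in-memory Map stands in for the real persistence layer.
function createActionStore() {
  const records = new Map();
  let nextId = 1;
  return {
    insert(record) {
      const id = String(nextId++);
      records.set(id, { id, ...record });
      return id;
    },
    update(id, patch) {
      records.set(id, { ...records.get(id), ...patch });
    },
    get(id) {
      return records.get(id);
    },
  };
}

// Runs one action: snapshot first, persist it, then mutate.
async function runAction(store, handlers, actionName, ctx) {
  const action = handlers[actionName];
  const snap = await action.snapshot(ctx);
  // Persist the before-state BEFORE mutating, so a crash mid-handler
  // still leaves enough information to retry or undo.
  const id = store.insert({ actionName, ctx, snapshot: snap, status: 'pending' });
  try {
    const outcome = await action.handler(ctx, snap);
    store.update(id, { outcome, status: 'applied' });
  } catch (err) {
    store.update(id, { error: String(err), status: 'failed' });
    throw err;
  }
  return id;
}

// Reverses a previously applied action using its stored snapshot.
async function undoAction(store, handlers, id) {
  const record = store.get(id);
  if (!record || record.status !== 'applied') {
    throw new Error(`action ${id} is not in an undoable state`);
  }
  const action = handlers[record.actionName];
  const undoOutcome = await action.undoHandler(record.ctx, record.snapshot);
  store.update(id, { undoOutcome, status: 'undone' });
  return undoOutcome;
}
```

The ordering is the design choice worth noting: because the snapshot is stored before the handler runs, the undo path never depends on the handler having finished cleanly.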

What this unlocks: 7-day mandatory dry-run for paid spend

Once every action has snapshot-and-undo, you can ship a stronger safety primitive: enforced dry-run for high-blast-radius actions.

Sumeru's runtime marks handlers as customerFacing=true when they affect paid spend or customer messaging. For these, every rule has a mandatory 7-day dry-run period: the rule runs, captures snapshots, and computes what the mutation would have been, but never sends the write to Google Ads (or whichever system the handler targets). After 7 days of dry-run, an operator reviews the simulated mutations and decides whether to flip the rule to live.

The cost of this is one week of delay before automation takes effect. The benefit is structural: most bad rules are caught in dry-run, not in production.

An anonymised example: a customer ops engineer wrote a reduce_paid_bid rule with an off-by-one threshold. In dry-run, the rule flagged 320 simulated bid mutations that would have zeroed out a $40k/wk campaign. The rule was caught and rewritten before it ever shipped. The next time anyone touches a high-stakes automation, that incident is the precedent.
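
A dry-run gate of this kind can be sketched as follows. The sketch assumes the handler is split into a pure `plan` step and a separate `apply` step that performs the write; that split, along with `isDryRun`, `executeRule`, `createdAt`, and `promotedToLive`, is a hypothetical illustration rather than Sumeru's actual runtime.

```javascript
// Hypothetical dry-run gate. `plan`, `apply`, `createdAt`, and `promotedToLive`
// are assumed names; the article's contract only defines snapshot/handler/undoHandler.
const DRY_RUN_DAYS = 7;
const DAY_MS = 24 * 60 * 60 * 1000;

// A customerFacing rule stays in dry-run until the 7-day window has elapsed
// AND an operator has explicitly promoted it to live.
function isDryRun(rule, action, now) {
  if (!action.customerFacing) return false;
  const windowElapsed = now - rule.createdAt >= DRY_RUN_DAYS * DAY_MS;
  return !(windowElapsed && rule.promotedToLive);
}

async function executeRule(rule, action, ctx, now = Date.now()) {
  const snap = await action.snapshot(ctx); // read-only: captures the before-state
  const planned = action.plan(ctx, snap);  // pure computation, no I/O
  if (isDryRun(rule, action, now)) {
    // Record what would have happened; the write is never sent.
    return { mode: 'dry-run', snapshot: snap, planned };
  }
  const outcome = await action.apply(ctx, snap, planned); // the only write
  return { mode: 'live', snapshot: snap, planned, outcome };
}
```

Under these assumptions, a rule like the one in the anecdote would accumulate a week of `planned` results for review without `apply` ever being called.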

The audit trail is the evidence layer

Snapshot-and-undo also fixes the audit-log problem.

Most automation platforms log actions like this: 2026-03-15 14:32 | user=system | action=reduce_bid | campaign=47. That's not useful. It tells you something happened; it doesn't tell you why, what the before-state was, or how to reverse it.

Sumeru's engineAudit() writes plain-language reason rows on every autonomous action. A typical row looks like this:

2026-03-15 14:32  reduce_paid_bid  campaign=47
Reason: Sustained ROAS drop -34% over 6 days, threshold 25%.
Before: bid=$2.40, budget=$8,400/day.
After:  bid=$2.11, budget=$8,400/day.
Trace:  attribution.engine.flagged → automation.rule.matched → snapshot.captured → handler.applied
Undo:   one-click via /app/automations/audit/01HXX...

This row is searchable, correlated by traceId, and self-contained. A compliance team can read these rows directly, with no Looker dashboard and no engineer needed to interpret them. Rows are retained for 365 days by default.
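
To make the row format concrete, here is a minimal sketch of a formatter that renders a structured audit entry in the layout shown above. The `entry` field names, `usd`, and `formatAuditRow` are assumptions for illustration; the real schema behind engineAudit() is not described in this article.

```javascript
// Hypothetical formatter; the `entry` field names are assumptions,
// not the real engineAudit() schema.

// Formats a dollar amount with thousands separators and fixed decimals.
function usd(amount, decimals) {
  const [whole, frac] = amount.toFixed(decimals).split('.');
  const grouped = whole.replace(/\B(?=(\d{3})+(?!\d))/g, ',');
  return '$' + (frac ? `${grouped}.${frac}` : grouped);
}

// Renders one audit entry as a plain-language, multi-line row.
function formatAuditRow(entry) {
  // Labels are padded to a fixed width so the values line up.
  const line = (label, text) => `${(label + ':').padEnd(8)}${text}`;
  const state = (s) => `bid=${usd(s.bid, 2)}, budget=${usd(s.dailyBudget, 0)}/day.`;
  return [
    `${entry.timestamp}  ${entry.action}  campaign=${entry.campaignId}`,
    line('Reason', entry.reason),
    line('Before', state(entry.before)),
    line('After', state(entry.after)),
    line('Trace', entry.trace.join(' → ')),
    line('Undo', `one-click via ${entry.undoUrl}`),
  ].join('\n');
}
```

Keeping the structured entry as the source of truth and rendering the text from it is one way to get rows that are both machine-searchable and readable by a non-engineer.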

What it costs to build

Snapshot-and-undo is not free engineering. Every handler that ships to production requires three functions, three tests, and integration with the snapshot-persistence layer. Adding a 14th handler took us about 3 days of careful work, plus a week of dry-run in staging before promotion.

The cost is borne by the platform team. The benefit is borne by every operator who deploys the automation. Across a portfolio of 50 client shops, the math is overwhelmingly in favour of the investment.

The alternative — "we'll just be careful" — has a known failure rate. Snapshot-and-undo turns "be careful" into a structural property of the system, not an operational discipline.

The principle: safety primitives that compose

Snapshot-and-undo composes with the other safety primitives in Sumeru's runtime:

  • 7-day mandatory dry-run for customerFacing handlers — built on snapshot, because dry-run simulates the mutation without executing
  • Approval queues for agency-managed shops — built on the action record, because approvals route the action through human review before dispatch
  • Kill-switch for an entire shop — built on the action queue, because kill-switch pauses queued actions before they fire
  • BFCM-safe freeze windows — same primitive, scoped to a date range

None of these would be possible without snapshot-and-undo as the foundation. Once every action is a recorded unit with a captured before-state, you can simulate it (dry-run), defer it (approval), pause it (kill-switch), and reverse it (undo). Without that, you have none of them.
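
How these primitives compose can be sketched as a single gate that every queued action passes through before dispatch. This is an illustrative sketch only: `decideDispatch`, the field names (`killSwitch`, `freezeWindows`, `agencyManaged`, `approvedBy`, `dryRun`), and the order of the checks are all assumptions, not Sumeru's actual runtime.

```javascript
// Hypothetical dispatch gate; every field name and the check order are assumptions.
function decideDispatch(action, shop, rule, now) {
  // Kill-switch: pause everything queued for this shop.
  if (shop.killSwitch) return 'paused';

  // Freeze windows: the same pause, scoped to a date range.
  const frozen = shop.freezeWindows.some((w) => now >= w.start && now < w.end);
  if (frozen) return 'paused';

  // Dry-run: customerFacing handlers simulate until the rule is promoted.
  if (action.customerFacing && rule.dryRun) return 'dry-run';

  // Approval queue: agency-managed shops need a human sign-off first.
  if (shop.agencyManaged && !action.approvedBy) return 'awaiting-approval';

  return 'dispatch';
}
```

Each check reads only the action record and shop state, which is why the gates can be added or reordered without touching individual handlers.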

The takeaway

If you're evaluating an automation platform — or building one — the diagnostic question is: does every action have a snapshot, a handler, and an undo function?

If the answer is "no" (or "kind of"), you're going to find yourself in the same place every operator does: automation only for benign actions, manual approval for everything else. The blocker isn't your team; it's the missing primitive.

The fix isn't a feature. It's an architectural choice you make once and then live with.


If you want to see snapshot-and-undo on a real Sumeru action, request a demo at /contact or read the Automation Engine page. The demo walks through firing a representative automation in dry-run, capturing the snapshot, then undoing it, end to end in 8 minutes.