February 25, 2025
Layers, Not Loops [In Progress]
A customer churns. The AE picks "Budget" from a dropdown. The CRM records it. Leadership reviews a pie chart of loss reasons next quarter and concludes pricing is the problem.
It's not that they're lying. A single-select dropdown can't capture what actually happened. Churn is multi-causal and unfolds over months. The loss field says "Budget" — but the emails show a frustrated VP who stopped getting value six months ago, the call transcripts reveal a competitor evaluation nobody flagged, and the usage metrics show an inflection point nobody noticed.
I built a system that triangulates all of this automatically: emails, call transcripts, and product usage metrics fed through a 4-stage LLM pipeline that produces polished HTML reports — per account, per quarter, and per year. No agents. Just structured data, careful prompt engineering, and a clean notebook triggered on closed-lost.
The result: evidence-backed churn reports that reveal the actual causal chain behind each loss — and for active accounts, the same signals surface early enough to act on.
The pipeline at a glance
Data flows from three sources — Gong for call transcripts, People AI for emails, and Domo for product usage metrics — through an ETL layer, into a Jupyter notebook that runs the 4-stage LLM pipeline. Each stage has clearly defined inputs and outputs, and most stages run in parallel across accounts.
Pipeline overview:
- Sources: Gong (historical call recordings), People AI (customer correspondence), Domo (product interaction data)
- ETL layer: filter, join, and stage datasets for the pipeline
- Notebook: triggered on closed-lost opportunities
- Per source: clean emails → extract signals; extract call signals; z-score normalization of usage metrics
- Per account: deep reasoning → JSON analysis; JSON → HTML report
- Per cohort: chunk → JSON analysis; consolidate → JSON report; JSON report → cohort HTML
- Delivery: HTML reports hosted and shared with CS leaders and executives
Before any LLM touches anything, we've already parsed email content, built rich context cards for each account, and structured every input so the model can focus purely on analysis. We built a custom SDK (domo_sdk) that wraps LLM calls and Domo data I/O into a clean interface — llm.prompt() for single calls, llm.parallel() for fanning out across accounts with configurable worker counts and built-in retry logic.
# Without the SDK: raw Bedrock calls plus threading boilerplate
import boto3, json, concurrent.futures

bedrock = boto3.client("bedrock-runtime")

def call_bedrock(item):
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": build_prompt(item)
        }]
    })
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet...",
        body=body
    )
    result = json.loads(resp["body"].read())
    return json.loads(result["content"][0]["text"])

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
    futures = {
        executor.submit(call_bedrock, item): item
        for item in email_records
    }
    for future in concurrent.futures.as_completed(futures):
        try:
            results.append(future.result())
        except Exception as e:
            # retry logic, backoff, etc.
            handle_retry(futures[future], e)

# With domo_sdk: the same fan-out in one call
results = llm.parallel(
    items=email_records,
    prompt_func=process_email,
    max_workers=60,
)

This is the foundation that makes rapid iteration possible — you focus on improving prompts instead of debugging LLM boilerplate.
Making metrics LLM-readable with z-scores
Raw metrics are meaningless to an LLM. "694 Unique User Logins" — is that good or bad? The model has no idea. It doesn't know that this account normally has 850 logins, or that 694 represents a significant decline.
Instead, we z-score every metric against the account's own historical baseline and embed the z-score directly in the value string. The format 694(z=-1.8) packs the raw value and its deviation from baseline into a compact string. A z-score of -1.8 immediately tells the model "nearly two standard deviations below normal."
Z-scores also normalize across metrics and accounts. Card Loads and Unique User Logins have wildly different magnitudes, but z=-1.8 means the same thing for both. A $400K account creates far more dataflows than a $40K account, but a z-score of -2.0 means the same proportional decline. The model can compare across metrics and accounts without normalization gymnastics.
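A minimal sketch of that transformation, assuming the baseline is the account's own trailing history (the helper and the baseline series here are illustrative, not the production code):

import pandas as pd

def zscore_format(values: pd.Series, baseline: pd.Series) -> dict:
    """Encode each metric value as 'raw(z=...)' relative to the account's own baseline."""
    mean, std = baseline.mean(), baseline.std(ddof=0)
    std = std or 1.0  # guard against a flat baseline with zero variance
    return {month: f"{v:.0f}(z={(v - mean) / std:.1f})" for month, v in values.items()}

# Hypothetical usage: recent monthly Card Loads scored against a trailing baseline
baseline = pd.Series([9500, 14200, 11800, 15600, 8900, 12400])
recent = pd.Series(
    [12470, 11890, 8940, 6940, 4120],
    index=["2024-11", "2024-12", "2025-01", "2025-02", "2025-03"],
)
print(zscore_format(recent, baseline))  # e.g. {'2024-11': '12470(z=0.2)', ..., '2025-03': '4120(z=-3.3)'}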
What the model actually sees
{
"Card Loads": {
"2024-11": "12470(z=0.3)",
"2024-12": "11890(z=0.1)",
"2025-01": "8940(z=-0.9)",
"2025-02": "6940(z=-1.8)",
"2025-03": "4120(z=-2.6)"
},
"Unique User Logins": {
"2024-11": "87(z=0.2)",
"2024-12": "82(z=0.0)",
"2025-01": "64(z=-1.1)",
"2025-02": "41(z=-2.3)",
"2025-03": "22(z=-3.1)"
}
}

The prompt tells the model how to interpret them:
Z-scores: Values with z < -1.5 indicate significant declines, z > 1.5 indicate significant increases
Triangulation: Match email/call dates with corresponding days_to_cancel periods for signal-metric correlation
Signal extraction
Cleaning the mess
Enterprise email data is filthy — forwarding chains nested six deep, HTML signatures, legal disclaimers, auto-generated calendar confirmations. Every email goes through a cleaning prompt before analysis:
EMAIL_CLEANING_PROMPT = """
# TASK
Extract only the relevant email content, removing all formatting artifacts,
signatures, headers, and forwarding chains.
# INPUT
{email_content}
# QUALITY STANDARDS
- Preserve all numbers, dates, and specific details mentioned
- Keep conversational context that explains what's being discussed
- Remove only formatting junk, not substantive content
- If multiple messages in chain, separate with "---" between messages
- Keep direct quotes from customers intact
Return only the JSON object.
"""
# Process all emails for an account in parallel -- 60 workers
cleaned_emails = llm.parallel(
items=email_records,
prompt_func=process_single_email_clean,
max_workers=60,
)

Each cleaning call is cheap and fast, so we fan out with 60 workers. The cleaned output is structured JSON with sender, date, and body — ready for signal extraction.
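The prompt_func passed to llm.parallel is just a thin wrapper around the prompt template; a sketch of what it might look like, assuming llm.prompt fills template variables from keyword arguments as in the Stage 2 call later on (the raw_body field name is illustrative):

def process_single_email_clean(email_record):
    # Fill the cleaning prompt with one raw email; the SDK returns the model's structured JSON
    return llm.prompt(
        EMAIL_CLEANING_PROMPT,
        email_content=email_record["raw_body"],  # illustrative field name
    )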
Extracting signals with confidence thresholds
Each email thread and call transcript gets analyzed for "loss signals" — specific statements indicating churn risk, extracted into 9 categories (Competitive Pressure, Budget & Cost, Consumption Model Concerns, Product Gaps, Value Realization, Support Problems, Implementation Challenges, Internal Changes, and Relationship Issues).
The critical design decision: a confidence threshold of >= 0.75, with explicit calibration baked into the prompt:
## CONFIDENCE SCORING & VALIDATION
Apply these confidence thresholds rigorously:
**0.85-1.0 (Certainly Drives Loss)**
- Formal cancellation notice or explicit non-renewal statement
- Executive-level escalation with documented decision criteria
- Documented competitive evaluations with specific alternatives named
**0.75-0.84 (Likely Contributes to Loss)**
- Strong budget pressure language from decision-maker
- Explicit feature gap complaints tied to business impact
**Below 0.75 (Exclude)**
- Routine operational questions or minor issues
- Concerns that appear to be addressed or resolved in thread
- Vague dissatisfaction without actionable specifics
Each extracted signal must include a verbatim quote, sender name and role, date, conversational context, and loss implication:
{
"type": "Competitive Pressure",
"verbatim_quote_or_statement": "We've been evaluating Power BI as part of our Microsoft consolidation strategy. The bundled pricing is hard to argue against internally.",
"email_date": "2024-11-08",
"sender_name": "David Park, VP of Analytics",
"context_and_trigger": "In response to renewal discussion, VP mentioned...",
"loss_implication": "Active competitive evaluation driven by platform consolidation.",
"confidence_score": 0.92,
"competitor_name": "Power BI"
}

Stage 1 prompts are only responsible for extracting raw signals — no synthesis, no narratives, no connecting dots across sources. That's Stage 2's job.
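Enforcing the cutoff again in code is cheap insurance in case the model emits a low-confidence signal anyway; a minimal sketch (the field name follows the example signal above, the helper is illustrative):

CONFIDENCE_THRESHOLD = 0.75

def filter_signals(signals: list[dict]) -> list[dict]:
    """Drop anything the model scored below the calibrated cutoff."""
    return [s for s in signals if s.get("confidence_score", 0.0) >= CONFIDENCE_THRESHOLD]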
Account-level triangulation
Chain-of-thought as a first-class output field
Stage 2 takes all Stage 1 outputs — email signals, call signals, z-scored metrics — and synthesizes them into a unified analysis. The key design decision: reasoning_scratchpad is a required field in the output JSON, populated before the final analysis fields.
# CORE TASK
Synthesize the provided email analysis, call analysis, and quantitative metric data
into an evidence-backed JSON report. You will first create a "reasoning_scratchpad"
to build your chain of thought, then use that reasoning to generate the four core
components: 1) Loss Summary, 2) Key Inflection Points, 3) Loss Narrative,
and 4) Missed Signals.
## Step 1: Pre-Analysis Reasoning & Triangulation (Chain of Thought)
- Scan All Evidence: Review all inputs to get a holistic view.
- Identify Key Signals: List the 3-5 most critical qualitative signals.
- Identify Key Inflections: List the 2-4 most significant metric inflection
points (e.g., "Unique User Logins z=-2.5 at 120 days_to_cancel").
- Form Triangulations: Explicitly connect signals and metrics.
Example: "The customer's email about 'implementation challenges' on Jan 15
(110 days_to_cancel) lines up perfectly with 'Cards Created' falling to
0 (z=-3.1) in that same period."
- Hypothesize Root Cause: Formulate a 1-2 sentence hypothesis.
- Populate reasoning_scratchpad: Place this analysis in the output JSON.
Making the scratchpad a required JSON field means the model must reason before concluding. And because it's in the output, we can inspect the reasoning when a report looks off. Chain-of-thought that's auditable, not hidden.
report_json_str = llm.prompt(
STAGE2_TRIANGULATION_PROMPT,
account_name=account_name,
cancellation_date=cancellation_date,
email_analysis=email_analysis,
call_analysis=call_analysis,
metrics_json=metrics_json,
context_card=context_card
)

Six variables. One prompt. One structured JSON output.
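Because the scratchpad is part of the schema, a simple check can reject any Stage 2 output that skipped the reasoning step. A sketch, with output key names assumed from the components named in the prompt (the production field names may differ):

import json

REQUIRED_KEYS = [
    "reasoning_scratchpad",   # must be populated before the analysis fields
    "loss_summary",
    "key_inflection_points",
    "loss_narrative",
    "missed_signals",
]

def validate_stage2(report_json_str: str) -> dict:
    report = json.loads(report_json_str)
    missing = [k for k in REQUIRED_KEYS if not report.get(k)]
    if missing:
        raise ValueError(f"Stage 2 output missing or empty fields: {missing}")
    return report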
Separating analysis from formatting
Early versions tried to do analysis and HTML generation in one prompt. The model would hallucinate data points while trying to make the HTML look good, or produce ugly HTML while reasoning carefully. Two competing objectives, and both suffered.
The fix: split into two sub-stages.
- Stage 2a (Triangulation): Analyze everything, output structured JSON. No formatting concerns.
- Stage 2b (Formatter): Take the JSON, populate an HTML template. No new analysis allowed.
STAGE2B_FORMATTER_PROMPT = """
# ROLE
You are an expert HTML report generator. Your *only* task is to take the provided
account metadata and the JSON analysis object and perfectly populate a final HTML
report template. You must not add any new analysis, opinions, or text that isn't
derived from the inputs.
"""With separation, debugging is straightforward — if the analysis is good but the HTML is bad, fix the formatter. If the HTML is fine but the analysis is wrong, fix the triangulation prompt. Each stage is independently testable.
Cohort aggregation via map/reduce
With 34 churned accounts in a quarter, each producing a multi-thousand-token analysis JSON, the total blows past any reasonable context window. The solution: map/reduce.
CHUNK_SIZE = 8
chunks = split_dataframe_into_chunks(stage2_df, CHUNK_SIZE)
# Map: generate sub-cohort summaries in parallel
stage3a_results_list = llm.parallel(
items=chunks,
prompt_func=process_3a_chunk,
max_workers=4,
)
# Reduce: consolidate all sub-cohort JSONs into master report
final_cohort_json = llm.prompt(
STAGE3B_CONSOLIDATION_PROMPT,
agg_cohort_context=context_str,
aggregate_sub_cohort_json=aggregate_sub_cohort_json_str
)

Stage 3a (map) processes ~8 accounts at a time, aggregating loss theme distributions and selecting representative highlights. Stage 3b (reduce) merges all sub-cohort reports into a single master JSON. Stage 3c formats the master JSON into a cohort-level HTML report, following the same analysis-free formatting pattern as Stage 2b.
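The chunking helper itself is trivial; a sketch of what split_dataframe_into_chunks might look like (the production version lives in the internal SDK, so treat this as an assumption of its shape):

import pandas as pd

def split_dataframe_into_chunks(df: pd.DataFrame, chunk_size: int) -> list[pd.DataFrame]:
    """Split the Stage 2 results frame into sub-cohorts of at most chunk_size accounts."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]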
What the output looks like
Here's what a CS leader sees. This is a fabricated example for "Meridian Financial Group" — a fictional mid-market financial services company — to illustrate the output format without exposing real customer data.
Meridian Financial Group -- Account Loss Report
| Lost ARR | Tenure | AE | CSM | Cancellation Date |
|---|---|---|---|---|
| $185,000 | 3.2 years | Michael Torres | Sarah Chen | April 15, 2025 |

Loss Summary
Primary Themes: Competitive Pressure, Budget & Cost Pressures
Meridian churned due to a Microsoft platform consolidation strategy where Power BI's bundled pricing undercut Domo's standalone value proposition, accelerated by a CFO-mandated 40% reduction in analytics spend. The competitive evaluation began in November 2024 — five months before cancellation — but was not flagged by the CS team until February 2025 when usage metrics had already declined past the point of recovery.
Timeline
- Nov 8, 2024 (158 days before cancel): VP of Analytics emails CSM mentioning "evaluating Power BI as part of our Microsoft consolidation strategy"
- Dec 2, 2024 (134 days): Unique User Logins drop to z=-1.1. Card Loads begin declining.
- Dec 12, 2024 (124 days): CFO states on QBR call: "We need to reduce our analytics spend by 40%"
- Jan 15, 2025 (90 days): Card Loads hit z=-1.8. Dataflow Runs at z=-2.1. Power BI pilot expands to second business unit.
- Feb 20, 2025 (54 days): CSM notes competitive risk for first time. Unique User Logins at z=-3.1.
- Mar 28, 2025 (18 days): Formal non-renewal notice received.
Loss Narrative
The churn trajectory became visible in November 2024, when VP of Analytics David Park noted in a direct email to the CSM: "We've been evaluating Power BI as part of our Microsoft consolidation strategy. The bundled pricing is hard to argue against internally." At this point, Unique User Logins had already begun a subtle decline from 87 to 82 (z=0.0), but the real signal was qualitative — an active competitive evaluation driven by platform economics, not product dissatisfaction.
By December, the financial dimension crystallized. CFO Lisa Morgan stated on a quarterly business review call that the company needed to "reduce analytics spend by 40%," citing pressure from the board to consolidate vendor relationships. Card Loads dropped from 11,890 to 8,940 (z=-0.9) — not yet alarming in isolation, but the combination of a CFO budget mandate and an active competitive evaluation should have triggered an immediate retention response.
The window closed in January. Card Loads hit z=-1.8 and Dataflow Runs fell to z=-2.1 as the Power BI pilot expanded to a second business unit. By the time the CSM noted competitive risk in February, Unique User Logins were at z=-3.1 — the account was functionally already gone.
Missed Signals
Nov 8, 2024: CS missed the competitive evaluation signal when VP Park mentioned Power BI in an email. At this point, a competitive save offer or executive sponsor engagement could have changed the outcome. Unique User Logins were still at baseline.
Dec 12, 2024: CS missed the budget pressure signal when CFO Morgan stated "reduce analytics spend by 40%." This should have triggered immediate escalation to Sales leadership for a pricing conversation, before the competitive evaluation gained organizational momentum.
It's not a summary — it's a story with dates, names, quotes, and metric evidence. A CS leader can walk through the timeline in a meeting. An executive can read the loss summary in 30 seconds. The missed signals section turns retrospective analysis into prospective training.
Reports live in two places: as sharable HTML documents and inside a Domo app where the CS team can browse, filter, and search across all accounts. The cohort-level report aggregates patterns — "35% of Q4 losses involved competitive pressure, with Power BI and Tableau appearing in 60% of those cases" — giving leadership the systemic view.
Layers, not loops
An agent is an LLM with access to tools, operating in a loop where the model decides what to do next based on intermediate results. The key property is runtime branching — the execution path isn't predetermined.
This pipeline is the opposite. Stage 1 always runs before Stage 2. Stage 2 always runs before Stage 3. No branching, no tool selection. Each stage has a fixed input schema and a fixed output schema. The prompts are carefully engineered, but they are functions — not decision-makers.
| Property | Agent | Pipeline (this system) |
|---|---|---|
| Execution order | Runtime-determined | Fixed at design time |
| Tool selection | Model chooses | No tools, just prompts |
| Looping | Yes, model decides when to stop | No loops, linear stages |
| Failure mode | Unbounded cost/time, hard to debug | Bounded, each stage independently testable |
| Debugging | Trace through decision tree | Inspect input/output at each stage |
| When to use | Ambiguous tasks, open-ended exploration | Well-defined transformations with known data |
The distinction matters for production systems. An agent that goes off-track might burn through $50 of API calls before you notice. A pipeline stage that fails gives you a clear error at a known point, with the previous stage's output intact for debugging.
The limitation is real. This pipeline can't search for additional data sources when evidence is sparse, can't self-correct poor report quality, can't dynamically query different databases based on what it finds. For this use case, those tradeoffs are acceptable — the data sources are known, the analysis structure is well-defined, the output format is fixed. When your problem is well-structured, the pipeline wins on cost, latency, debuggability, and trust.
An agent decides what to do next. A pipeline knows what to do next. Know which one your problem needs.
What I'd do differently
HTML chunking is a hack. Domo's dataset columns have character limits, so we split HTML reports into 7 columns and reassemble on the app side. A proper document storage layer would be cleaner.
Email cleaning could be cheaper. A regex + heuristic pre-pass could handle 80% of formatting cleanup before burning an LLM call on every email — the highest-volume stage at 60 workers.
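A sketch of what that pre-pass might look like (the patterns are illustrative, not production rules):

import re

SIGNATURE_DELIM = re.compile(r"(?m)^-- ?$")                                    # conventional signature separator
FORWARD_HEADERS = re.compile(r"(?im)^(from|sent|to|cc|subject):.*$")           # forwarded-message headers
DISCLAIMER = re.compile(r"(?is)this (e-?mail|message).{0,40}confidential.*")   # legal boilerplate

def pre_clean(body: str) -> str:
    """Strip obvious boilerplate before spending an LLM call on the email."""
    body = SIGNATURE_DELIM.split(body, maxsplit=1)[0]   # drop everything after the signature marker
    body = FORWARD_HEADERS.sub("", body)
    body = DISCLAIMER.sub("", body)
    return body.strip()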
Confidence thresholds are hand-tuned. The 0.75 cutoff feels right from iteration, but a small labeled dataset would let us calibrate empirically.
The real prize is real-time. This system is retrospective. The same architecture could power an early warning system: run Stage 1 continuously on active accounts, flag when signal density or z-scores cross thresholds, and surface at-risk accounts before the cancellation notice arrives.
The code for this system runs on Domo's internal infrastructure using our custom SDK. The prompts, schemas, and architecture patterns described here are from the production system, with customer data replaced by fictional examples.