Technical Deep Dive · 2026

Gauntlet's
Data Platform Migration

a fragile ETL platform → a reliable scaled orchestrator

Simon Frid
Head of Infra, Gauntlet
Circa 2023

Preface

Paradox of Choice

Aera V3 · Most technical · On-chain / off-chain hybrid vault protocol. Near-real-time oracle, powered by a real-time data architecture. Strict security. dApp. SDKs. $100M+ TVL in 6 months.
Wells Fargo Compliance System · Most operational · Bespoke on-prem legal & compliance system, powered by a TB-scale versioned international legal-doc datastore. 100s of data-entry/QA staff and engineers involved.
COVID Act Now · Most globally impactful · Public data platform scaled 0 → 4M users in 48h; cited in White House briefings.
Gigster · Most platform-shaped · Gig CLI + self-managed k8s (OIDC), multi-cloud AWS/GCP, ~1,000 freelance engineers.

→ The Gauntlet data-platform migration · Kubeflow → Dagster · ~2,000 jobs transpiled, zero-downtime cutover.

→ Real-time Data Arch · EVM & Solana ingestion, transformation & calculation · half-PB real-time aggregation on Rust, Kafka & ClickHouse.

What's at stake

Risk management for the biggest protocols in DeFi

Gauntlet's quants and data platform - data ingestion, risk simulations, parameter optimization - managing risk across Aave, Moonwell, Uniswap, Euler and Compound; $10+ billion in TVL.

i.e. what risk parameters keep each protocol solvent under stress?

The problem

A daily ETL job with critical business logic that never finished. 😅

·
1 successful run in a nine-month period.
·
Each nightly run took ~18h and fired 1,000+ operations - the full suite determined dynamically by the experiments.
·
Unrenderable UI - the batch topology had no observability or debuggability
·
Idempotency not guaranteed - reruns and backfills had to be managed manually.
·
Quants added more ops each day; backfills launched from local laptops compounded the scale.
·
etcd request-size limit hit periodically - the Argo/KFP workflow manifest outgrew etcd's ~1.5 MB per-object ceiling.
·
Backfills spawned ~100k pods in a single zonal cluster → the k8s master API went unresponsive.
·
Urgent forcing function: GKE v1.21 hit EOL in January - by mid-April a forced upgrade would break the cluster.
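
The etcd ceiling in the list above can be checked with simple arithmetic. A minimal sketch, assuming a hypothetical per-task manifest shape (`make_task` and its fields are illustrative, not the actual Argo/KFP templates), that serializes a workflow and compares it with the ~1.5 MB limit quoted above:

```python
import json

ETCD_OBJECT_LIMIT_BYTES = 1_500_000  # ~1.5 MB per-object ceiling quoted above


def make_task(i: int) -> dict:
    """Hypothetical task entry; real Argo/KFP templates carry more fields."""
    return {
        "name": f"op-{i}",
        "template": "run-op",
        "dependencies": [f"op-{i - 1}"] if i > 0 else [],
        "arguments": {"parameters": [{"name": "partition", "value": "2023-01-01"}]},
    }


def manifest_size_bytes(num_tasks: int) -> int:
    """Size of the whole workflow when submitted as a single JSON object."""
    manifest = {
        "kind": "Workflow",
        "spec": {"tasks": [make_task(i) for i in range(num_tasks)]},
    }
    return len(json.dumps(manifest).encode("utf-8"))


def fits_in_etcd(num_tasks: int) -> bool:
    return manifest_size_bytes(num_tasks) <= ETCD_OBJECT_LIMIT_BYTES
```

Because every task adds bytes to the same object, a single-manifest DAG hits the ceiling purely as a function of task count, regardless of cluster size.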

My approach

Share a one-pager with a phased roadmap

Notion · No Mo Kubeflo

No Mo Kubeflo

How I Learned to Stop Worrying and Love the Orchestrator

Goal

Replace Kubeflow with either Argo or Dagster.

Why

Kubeflow doesn't scale - manifest too big to submit as one DAG.
Need better observability, reproducibility & reliability.
Pay down infra-level tech debt from our Kubeflow distro.

Evaluation - 6 dimensions

Scalability	50k+ tasks/job · ~1,000 concurrent (vs <500 today)
Operator UX	UI, run + asset lineage, debugging
Performance	spin-up latency, compile/submit time
Efficiency	pod packing, container reuse, cost/run
Developer UX	local dev loop, fast iteration, dbt + CI
Maintainability	community health, abstraction, upgrade path

Pick the best tool, not the easiest path.
Barbell strategy: POC only Argo + Dagster to cover the widest solution space.

Plan

Jan · Feb · Mar · Apr

Phase 1 ✓ Standardize setup · Jan

Phase 2 ✓ Measure capabilities · Jan–Feb

Phase 3 ✓ Migrate to Dagster · Feb–Mar

Phase 4 ✓ Launch orchestrator · Mar–Apr

Phase 5 ✓ Clean up & consolidate · Apr →

Trade-offs

Orchestrator Comparison Matrix

		Airflow	Argo	Dagster
Sizing	Migration speed	6–9 mo estimate	<3 mo estimate	unclear
Sizing	Key Risk	slowest learning	preserved old model	more adaptation, less drag
Architecture	Execution model	task DAG	task/container	asset graph
	Footprint	bloated, heavy deps	thin, k8s-coupled	minimal, lean core
	Job definition	shared env, dep conflicts	container-level only	isolated code-locations + deps per team
Performance & capability	Scalability proof	not POC'd	failed - ~1h compile time for YAML	designed for 50k+ tasks/job
	Concurrency	scheduler limits at fan-out	k8s-bound, heavy	~1,000+ concurrent
	Integration	BigQuery via providers	bring-your-own glue	native BigQuery + pandas/SQL
	Data quality	external tooling	none built-in	asset checks + lineage native
Operations & cost	Dev experience	boilerplate, slow iteration	YAML, rebuild to test	typed Python, real unit tests
	Operability	mature, clunky	low-level k8s	UI + lineage + local loop
	Cost	heavy ops headcount	low infra, high upkeep	upfront migration, lower run cost
	Ecosystem & hiring	largest community, easy hiring	CNCF/k8s niche	younger, smaller talent pool
	Verdict	✕ too much scope	✕ easy path failed	✓ chosen

Execution

Mobilize and Act

Tech

Transpiler compiled ~2,000 legacy jobs → Dagster assets - no hand-porting
Re-targeted our in-house simulation engine onto Dagster - reused its graph compiler instead of rewriting the sims
BigQuery adapters for pandas + SQL workloads
Build efficiency - forkserver w/ preload, Python packaging, and Docker caching
Fresh GKE w/ reproducible, declarative IaC and a local-dev for infra

People

Evangelized the vision, the need and the solution - met skepticism!
Rallied a cross-functional team of 8 engineers around the plan
Clear owner per workstream
Brought in both quants and eng
Pairing as escalation when blocked

Process

Measure. Iterate through deltas and errors, until parity*
Broke the migration into parallel workstreams
Weekly team planning
Daily standups to aggressively unblock
Daily communications for company-wide updates and for executives

QA control

Drive to provable parity

Pairwise BQ → BQ diff (legacy vs new) - full-row INTERSECT/EXCEPT per table × partition, rolled up across ~2,000 tables
Manual investigation of stochastic outputs (sims)
Run old + new in parallel instead of asking the team to trust a rewrite
Monitor 7+ days: table diffs, failed jobs, and discrepancy tickets
2-day code freeze before the cutover
Integrate on-call before switching the production writer

parity-review · BQ → BQ

Table	prod	new	common
chain.transactions	#	#	#
dex.swaps	#	#	#
risk.exposures	#	#	#
risk.user_fact	#	#	#
chain.blocks	#	#	#
collateral.balances	#	#	#
oracle.prices	#	#	#
lending.positions	#	#	#
wallets.active	#	#	#
borrow.rates	#	#	#
governance.aip_events	#	#	#
liquidations	#	#	#
sim.var_curves	#	#	#

Launch

Zero downtime.

The platform was live; the team had to keep moving the same day.

·
Company demo - set the stage; the company adopted the new paradigm.
·
Team retro - lessons learned; solidify morale.

Aftermath

Harden the platform, ramp the team

Onboarding & ramping the company

Tutorials & quickstart
Office hours & 1:1 support

Operational management

On-call guide
Several hours/day → <20 min/day

Monitoring & alerting

Sensor-based alerting with aggregated summaries in Slack
Sentry error aggregation
Custom asset-health dashboards
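
Aggregated summaries keep a burst of failures from turning into a burst of pages. A minimal sketch of the aggregation step only, in plain Python; `RunFailure` and `summarize_failures` are hypothetical names, and wiring the result into a Dagster sensor and a Slack client is left out:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunFailure:
    job: str
    error: str


def summarize_failures(failures: list[RunFailure], top_n: int = 3) -> str:
    """Collapse many failures into one message, listing the noisiest jobs first."""
    if not failures:
        return "All runs healthy."
    by_job = Counter(f.job for f in failures)
    lines = [f"{len(failures)} failed runs across {len(by_job)} jobs"]
    for job, n in by_job.most_common(top_n):
        lines.append(f"- {job}: {n}")
    hidden = len(by_job) - top_n
    if hidden > 0:
        lines.append(f"- ... and {hidden} more jobs")
    return "\n".join(lines)
```

One message per evaluation window, rather than one per failed run, is what makes a sub-20-minute on-call day plausible.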

Self-serve: one declaration provides an asset per protocol

# one declaration → a VaR asset per protocol
SIMPSON_LAR_VAR = {
    protocol: FunctionBasedAsset(
        logical_name=f"{protocol.value}SimpsonLarVar",
        materialize_fn=_fetch_data(protocol, info),
        data_schema=SimpsonLarVarSchema,
        table_name=f"{protocol.value.protocol_str}_lar_var",
        grouping=AssetGrouping.SIMULATION,
    )
    for protocol in PROTOCOLS   # compound · euler · moonwell
}

40+

engineers shipping
on Dagster

Continuous improvement

Incrementally improved over the next 6 months

	Before · Kubeflow	After · Dagster
Freshness	daily SLA · best-effort	daily & hourly DAGs; eventually 5min
Success Rate	~85%	~95%+ initially; 99%+ eventually
Operator UI	couldn't render a day's run	loads in ~1 s
Compile / load	~15 min full compile	2 min initially; 10s eventually, via added optimization
Time on-call	multiple hours/day	under 20 minutes/day
Job Duration	18 hours	4 hr daily initial · <1 hr eventual
Backfills	manual · high cognitive load	partition-aware · transitive replay
Access	manual user provisioning w/ global permissions	IAP-based, team-owned DAGs, code-location multitenancy
Compute cost	~$100k / mo	~$50k / mo initially → $20k / mo eventually
Plugins	bespoke	dbt toolkit
Dashboarding	IPython notebooks, Metabase	Mode, Hex
Rituals	irregular	regular eng/platform cost reviews
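
The "partition-aware · transitive replay" row can be made concrete. A minimal sketch, not Dagster's API (Dagster handles this natively), of computing which (partition, asset) pairs a backfill must re-materialize when one upstream asset changes. The asset names in the usage note are hypothetical:

```python
from collections import deque


def downstream_closure(deps: dict[str, set[str]], root: str) -> set[str]:
    """`deps` maps each asset to its upstream assets.

    Returns `root` plus every asset that depends on it, directly or transitively.
    """
    children: dict[str, set[str]] = {}
    for asset, upstreams in deps.items():
        for up in upstreams:
            children.setdefault(up, set()).add(asset)

    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def replay_plan(deps: dict[str, set[str]], root: str, partitions: list[str]) -> list[tuple[str, str]]:
    """Every (partition, asset) pair to re-run; execution order is left to the orchestrator."""
    assets = downstream_closure(deps, root)
    return sorted((p, a) for p in partitions for a in assets)
```

For example, with `deps = {"prices": set(), "exposures": {"prices"}, "var": {"exposures"}}`, replaying `"exposures"` for two daily partitions yields four pairs and leaves `"prices"` untouched.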

Compute efficiency improved sharply, while BigQuery spend grew over the following years. Net platform cost still ended up ~50% lower overall.

Next Phase

REALISM - Real-time Streaming

Reliable Events And Logic In Six Minutes

Re-org-aware, Rust-based WebSocket indexers for Solana & EVM
Continuous complex window aggregations in 15s
Cascade of 20+ materialized views
Idempotent processing at ½ PB scale
Table-aware metric APIs with caching
Python asyncio streaming subgraph for the logical data-engineering framework
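
Re-org awareness and idempotency in the list above combine naturally. A minimal in-memory sketch, assuming events arrive in order from a single source and that a new block hash at an already-seen height always supersedes the old one; `Event` and `ReorgAwareStore` are hypothetical names, not the production indexer:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    block_number: int
    block_hash: str
    log_index: int
    amount: int


@dataclass
class ReorgAwareStore:
    # Canonical block hash per height, plus events keyed for idempotent upserts.
    canonical: dict[int, str] = field(default_factory=dict)
    events: dict[tuple[int, str, int], Event] = field(default_factory=dict)

    def apply(self, event: Event) -> None:
        known = self.canonical.get(event.block_number)
        if known is not None and known != event.block_hash:
            # Re-org: drop everything at or above the replaced height.
            height = event.block_number
            self.events = {k: e for k, e in self.events.items() if k[0] < height}
            self.canonical = {h: b for h, b in self.canonical.items() if h < height}
        self.canonical[event.block_number] = event.block_hash
        # Same key twice overwrites instead of double-counting.
        key = (event.block_number, event.block_hash, event.log_index)
        self.events[key] = event

    def total(self) -> int:
        return sum(e.amount for e in self.events.values())
```

Keying on (height, hash, log index) makes redelivery harmless, which is what lets an at-least-once transport such as Kafka feed aggregates without inflating them.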

Real-time streaming architecture: a WebSocket source (Solana + EVM) feeds Rust-based indexers, which publish incoming events to Kafka; a Python async streaming graph consumes them, a sync service writes to ClickHouse materialized views, and an API (driven by parameters) serves a dashboard.
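
The shape of that pipeline can be sketched with asyncio alone. In this minimal sketch a bounded `asyncio.Queue` stands in for the Kafka topic and a plain dict stands in for the ClickHouse sink; both substitutions, and all function names, are assumptions for illustration:

```python
import asyncio

_DONE = object()  # sentinel marking the end of the stream


async def indexer(raw_events: list[dict], topic: asyncio.Queue) -> None:
    """Stand-in for an indexer: normalize raw events and publish them."""
    for raw in raw_events:
        await topic.put({"chain": raw["chain"], "value": int(raw["value"])})
    await topic.put(_DONE)


async def consumer(topic: asyncio.Queue, sink: dict) -> None:
    """Stand-in for the streaming graph + sync service: aggregate per chain."""
    while True:
        event = await topic.get()
        if event is _DONE:
            break
        sink[event["chain"]] = sink.get(event["chain"], 0) + event["value"]


async def run_pipeline(raw_events: list[dict]) -> dict:
    # A bounded queue makes the producer wait when the consumer falls behind.
    topic: asyncio.Queue = asyncio.Queue(maxsize=100)
    sink: dict = {}
    await asyncio.gather(indexer(raw_events, topic), consumer(topic, sink))
    return sink
```

Running `asyncio.run(run_pipeline(events))` returns the per-chain totals. The real system replaces each stand-in with a durable component, but the producer / buffer / consumer split is the same.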

Next Phase · REALISM

Progress through Iteration.

Process - experimental & iterative

Benchmarked ClickHouse vs TimescaleDB for the time-series store
Started with GCP Pub/Sub → Kafka on Confluent
Syncing: Kafka table engine / ClickPipes → bespoke Python service → Datadog's Vector
Serialization: Protobuf → JSON → Avro
Transformation: Flink / Bytewax → Rust + Python
Integrated ClickHouse · Dagster · BigQuery with DBA best practices

Team - grew over time

1–2 engineers prototyping → 12 engineers, matrixed across app dev, SDK, data eng, and strategy

Long-term Handoff

New team gradually began to own infra, batch & ETL, and streaming services - as I shifted my role to managing Gauntlet's on-chain protocol.

Lessons

Lasting Retrospective

·
Prioritizing DevX was the most reliable bet.
·
Follow-through after the crisis was important.
The fully materialized wins came after 6+ months of continuous improvement.
·
Could have done better: closing the loop sooner, so teams could reliably address their own DAGs.

Q&A - Objections

Ask me anything

"Why not just use Argo?"

"Why self-host over Dagster Cloud?"

"Was zero-downtime worth a transpiler?"

Appendix - under the hood

Transpiler: ~2,000 legacy jobs → Dagster, by graph

import networkx as nx

class LegacyRegistryCompiler:
    # compile every legacy Kubeflow job into a Dagster asset
    # (`PlanRegistry`, `transpile_map` and `refs` are defined elsewhere; elided from this excerpt)
    def from_registry(self, registry: PlanRegistry, selections=None):
        return [
            self.transpile_map[type(entry)](entry, refs)
            # walk the dependency graph in topological order
            for generation in nx.topological_generations(
                registry.inverted_graph()
            )
            for entry in generation
            if selections is None or entry.logical_name in selections
        ]

One pass over the legacy registry, topologically sorted - no hand-porting, dependency order preserved.
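
The same generation-by-generation walk can be reproduced in a self-contained form. A minimal sketch that swaps networkx for the standard-library `graphlib`; `LegacyJob`, `CompiledAsset` and `transpile` are hypothetical stand-ins for the in-house registry types, and every upstream name is assumed to exist in `jobs`:

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class LegacyJob:  # hypothetical legacy registry entry
    name: str
    upstream: tuple[str, ...] = ()


@dataclass(frozen=True)
class CompiledAsset:  # hypothetical compiled output
    name: str
    deps: tuple[str, ...]


def topological_generations(jobs: dict[str, LegacyJob]):
    """Yield batches of job names whose upstreams all sit in earlier batches."""
    sorter = TopologicalSorter({name: job.upstream for name, job in jobs.items()})
    sorter.prepare()  # raises graphlib.CycleError on cyclic input
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


def transpile(jobs: dict[str, LegacyJob], selections=None) -> list[CompiledAsset]:
    compiled: dict[str, CompiledAsset] = {}
    for generation in topological_generations(jobs):
        for name in generation:
            if selections is not None and name not in selections:
                continue
            # Upstreams were handled in earlier generations; keep the selected ones.
            deps = tuple(u for u in jobs[name].upstream if u in compiled)
            compiled[name] = CompiledAsset(name=name, deps=deps)
    return list(compiled.values())
```

Jobs within one generation have no dependencies on each other, so each batch could also be compiled in parallel.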