Technical Deep Dive · 2026
Gauntlet's
Data Platform Migration
a fragile ETL platform → a reliable scaled orchestrator
Simon Frid
Head of Infra, Gauntlet
Circa 2023
Preface
Paradox of Choice
- Aera V3 · Most technical · on-chain / off-chain hybrid vault protocol. Near-realtime oracle, powered by a realtime data architecture. Strict security. dApp. SDKs. $100M+ TVL in 6 months.
- Wells Fargo Compliance System · Most operational · bespoke on-prem legal & compliance system, powered by a TB-scale versioned international legal-doc datastore. 100s of data-entry/QA staff and engineers involved.
- COVID Act Now · Most globally impactful · public data platform scaled 0 → 4M users in 48h; cited in White House briefings.
- Gigster · Most platform-shaped · Gig CLI + self-managed k8s (OIDC), multi-cloud AWS/GCP, ~1,000 freelance engineers.
the Gauntlet data-platform migration
Kubeflow → Dagster - ~2,000 jobs transpiled, zero-downtime cutover.
→ Real-time Data Arch
EVM & Solana ingestion, transformation & calculation - half-PB real-time aggregation on Rust, Kafka & ClickHouse.
What's at stake
Risk management for the biggest protocols in DeFi
i.e. what risk parameters keep each protocol solvent under stress?
The problem
A daily ETL job with critical business logic that never finished. 😅
- 1 successful run in a nine-month period.
- Each nightly run took ~18h and fired 1,000+ operations - the full suite determined dynamically by the experiments.
- Unrenderable UI - the batch topology had no observability or debuggability.
- Idempotency not guaranteed - reruns and backfills had to be managed manually.
- Quants added more ops each day; backfills launched from local laptops piled on compounding scale.
- etcd request-size limit hit periodically - the Argo/KFP workflow manifest outgrew etcd's ~1.5 MB per-object ceiling.
- Backfills spawned ~100k pods in a single zonal cluster → the k8s master API went unresponsive.
- Urgent forcing function: GKE v1.21 hit EOL in January - by mid-April a forced upgrade would break the cluster.
My approach
Share a 1 pager, with a phased roadmap
Notion · No Mo Kubeflo
No Mo Kubeflo
How I Learned to Stop Worrying and Love the Orchestrator
Goal
Replace Kubeflow with either Argo or Dagster.
Why
- Kubeflow doesn't scale - manifest too big to submit as one DAG.
- Need better observability, reproducibility & reliability.
- Pay down infra-level tech debt from our Kubeflow distro.
Evaluation - 6 dimensions
| Dimension | Criteria |
|---|---|
| Scalability | 50k+ tasks/job · ~1,000 concurrent (vs <500 today) |
| Operator UX | UI, run + asset lineage, debugging |
| Performance | spin-up latency, compile/submit time |
| Efficiency | pod packing, container reuse, cost/run |
| Developer UX | local dev loop, fast iteration, dbt + CI |
| Maintainability | community health, abstraction, upgrade path |
Pick the best tool, not the easiest path.
Barbell strategy: POC only Argo + Dagster to cover the widest solution space.
Plan
Timeline: Jan · Feb · Mar · Apr
- Phase 1 ✓ Standardize setup · Jan
- Phase 2 ✓ Measure capabilities · Jan–Feb
- Phase 3 ✓ Migrate to Dagster · Feb–Mar
- Phase 4 ✓ Launch orchestrator · Mar–Apr
- Phase 5 ✓ Clean up & consolidate · Apr →
Trade-offs
Orchestrator Comparison Matrix
| Area | Dimension | Airflow | Argo | Dagster |
|---|---|---|---|---|
| Sizing | Migration speed | 6–9 mo estimate | <3 mo estimate | unclear |
| | Key risk | slowest learning | preserved old model | more adaptation, less drag |
| Architecture | Execution model | task DAG | task/container | asset graph |
| | Footprint | bloated, heavy deps | thin, k8s-coupled | minimal, lean core |
| | Job definition | shared env, dep conflicts | container-level only | isolated code-locations + deps per team |
| Performance & capability | Scalability proof | not POC'd | failed - ~1h compile time for YAML | designed for 50k+ tasks/job |
| | Concurrency | scheduler limits at fan-out | k8s-bound, heavy | ~1,000+ concurrent |
| | Integration | BigQuery via providers | bring-your-own glue | native BigQuery + pandas/SQL |
| | Data quality | external tooling | none built-in | asset checks + lineage native |
| Operations & cost | Dev experience | boilerplate, slow iteration | YAML, rebuild to test | typed Python, real unit tests |
| | Operability | mature, clunky | low-level k8s | UI + lineage + local loop |
| | Cost | heavy ops headcount | low infra, high upkeep | upfront migration, lower run cost |
| | Ecosystem & hiring | largest community, easy hiring | CNCF/k8s niche | younger, smaller talent pool |
| Verdict | | ✕ too much scope | ✕ easy path failed | ✓ chosen |
Execution
Mobilize and Act
Tech
- Transpiler compiled ~2,000 legacy jobs → Dagster assets - no hand-porting
- Re-targeted our in-house simulation engine onto Dagster - reused its graph compiler instead of rewriting the sims
- BigQuery adapters for pandas + SQL workloads
- Build efficiency - forkserver w/ preload, Python packaging, and Docker caching
- Fresh GKE w/ reproducible, declarative IaC and a local-dev for infra
People
- Evangelize the vision, need and solution - met skepticism!
- Rallied a cross-functional team of 8 engineers around the plan
- Clear owner per workstream
- Brought in both quants and eng
- Pairing as escalation when blocked
Process
- Measure. Iterate through deltas and errors, until parity*
- Broke the migration into parallel workstreams
- Weekly team planning
- Daily standups to aggressively unblock
- Daily communications for company-wide updates and for executives
QA control
Drive to provable parity
- Pairwise BQ → BQ diff (legacy vs new) - full-row INTERSECT / EXCEPT per table × partition, rolled up across ~2,000 tables
- Manual investigation of stochastic outputs (sims)
- Run old + new in parallel instead of asking the team to trust a rewrite
- Monitor 7+ days: table diffs, failed jobs, and discrepancy tickets
- 2-day code freeze before the cutover
- Integrate on-call before switching the production writer
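The pairwise diff in the list above can be sketched as a small SQL generator. This is a minimal sketch under stated assumptions, not the tooling used in the migration: the function name `parity_sql`, the project prefixes `legacy-project` / `new-project`, and the partition column `ds` are all hypothetical. It uses BigQuery's `EXCEPT DISTINCT` and `INTERSECT DISTINCT` set operators, and reads the prod / new / common columns of the parity review as "rows only in legacy, rows only in new, rows in both", which is one plausible interpretation.

```python
def parity_sql(
    table: str,
    partition: str,
    legacy_prefix: str = "legacy-project",  # hypothetical project name
    new_prefix: str = "new-project",        # hypothetical project name
    partition_col: str = "ds",              # hypothetical partition column
) -> str:
    """Build one BigQuery query comparing a single table partition full-row.

    Inputs are treated as trusted identifiers; this sketch does no escaping.
    """
    old = f"SELECT * FROM `{legacy_prefix}.{table}` WHERE {partition_col} = '{partition}'"
    new = f"SELECT * FROM `{new_prefix}.{table}` WHERE {partition_col} = '{partition}'"
    return (
        "SELECT\n"
        f"  (SELECT COUNT(*) FROM ({old} EXCEPT DISTINCT {new})) AS only_legacy,\n"
        f"  (SELECT COUNT(*) FROM ({new} EXCEPT DISTINCT {old})) AS only_new,\n"
        f"  (SELECT COUNT(*) FROM ({old} INTERSECT DISTINCT {new})) AS common"
    )


# One query per table × partition; results would then be rolled up per table.
print(parity_sql("dex.swaps", "2023-03-01"))
```

Because the operators are `DISTINCT`, duplicate rows collapse before counting, so a table with legitimate duplicates would need a separate row-count check.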
parity-review · BQ → BQ
| Table | prod | new | common |
|---|---|---|---|
| chain.transactions | # | # | # |
| dex.swaps | # | # | # |
| risk.exposures | # | # | # |
| risk.user_fact | # | # | # |
| chain.blocks | # | # | # |
| collateral.balances | # | # | # |
| oracle.prices | # | # | # |
| lending.positions | # | # | # |
| wallets.active | # | # | # |
| borrow.rates | # | # | # |
| governance.aip_events | # | # | # |
| liquidations | # | # | # |
| sim.var_curves | # | # | # |
Launch
Zero downtime.
The platform was live; the team had to keep moving the same day.
- Company demo - set the stage; the company adopted the new paradigm.
- Team retro - lessons learned; solidify morale.
Aftermath
Harden the platform, ramp the team
Onboarding & ramping the company
- Tutorials & quickstart
- Office hours & 1on1 support
Operational management
- On-call guide
- Several hours/day → <20 min/day
Monitoring & alerting
- Sensor-based alerting, with aggregated summaries in Slack
- Sentry error aggregation
- Custom asset-health dashboards
Self-serve: one declaration provides an asset per protocol
```python
# one declaration → a VaR asset per protocol
SIMPSON_LAR_VAR = {
    protocol: FunctionBasedAsset(
        logical_name=f"{protocol.value}SimpsonLarVar",
        materialize_fn=_fetch_data(protocol, info),
        data_schema=SimpsonLarVarSchema,
        table_name=f"{protocol.value.protocol_str}_lar_var",
        grouping=AssetGrouping.SIMULATION,
    )
    for protocol in PROTOCOLS  # compound · euler · moonwell
}
```
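The declaration above is a dict comprehension over a protocol list. A self-contained sketch of the same pattern follows; `Protocol`, `AssetSpec`, and `_fetcher` are hypothetical stand-ins for the in-house `PROTOCOLS`, `FunctionBasedAsset`, and `_fetch_data`, whose real definitions are not shown in this deck.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Protocol(Enum):  # hypothetical stand-in for PROTOCOLS
    COMPOUND = "compound"
    EULER = "euler"
    MOONWELL = "moonwell"


@dataclass(frozen=True)
class AssetSpec:  # hypothetical stand-in for FunctionBasedAsset
    logical_name: str
    table_name: str
    materialize_fn: Callable[[], dict]


def _fetcher(protocol: Protocol) -> Callable[[], dict]:
    """Bind `protocol` per asset via a factory, avoiding late-binding closure bugs."""
    def fetch() -> dict:
        return {"protocol": protocol.value, "rows": []}  # placeholder payload
    return fetch


# One declaration expands to one asset per protocol.
VAR_ASSETS: Dict[Protocol, AssetSpec] = {
    p: AssetSpec(
        logical_name=f"{p.value.capitalize()}SimpsonLarVar",
        table_name=f"{p.value}_lar_var",
        materialize_fn=_fetcher(p),
    )
    for p in Protocol
}
```

Adding a protocol to the enum is then enough to get a new asset, which is the self-serve property the slide describes.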
40+ engineers shipping on Dagster
Continuous improvement
Incrementally improved over the next 6 months
| | Before · Kubeflow | After · Dagster |
|---|---|---|
| Freshness | daily SLA · best-effort | daily & hourly DAGs; eventually 5 min |
| Success rate | ~85% | ~95%+ initially; 99%+ eventually |
| Operator UI | couldn't render a day's run | loads in ~1 s |
| Compile / load | ~15 min full compile | 2 min initially; 10 s eventually, via added optimization |
| Time on-call | multiple hours/day | under 20 minutes/day |
| Job duration | 18 hours | 4 hr daily initially · <1 hr eventually |
| Backfills | manual · high cognitive load | partition-aware · transitive replay |
| Access | manual user provisioning w/ global permissions | IAP-based, team-owned DAGs, code-location multitenancy |
| Compute cost | ~$100k / mo | ~$50k / mo initially → $20k / mo eventually |
| Plugins | bespoke | dbt toolkit |
| Dashboarding | IPython notebooks, Metabase | Mode, Hex |
| Rituals | irregular | regular eng/platform cost reviews |
Compute efficiency improved sharply, but BigQuery spend grew over the following years. Net platform cost was still ~50% lower overall.
Next Phase
REALISM - Real-time Streaming
Reliable Events And Logic In Six Minutes
- Re-org-aware, Rust-based WebSocket indexers for Solana & EVM
- Continuous complex window aggregations in 15s
- Cascade of 20+ materialized views
- Idempotent processing at ½ PB scale
- Table-aware metric APIs with caching
- Python asyncio streaming subgraph for the logical data-eng framework
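To make "re-org aware" concrete, here is a minimal sketch of the core bookkeeping. It is written in Python for consistency with the other snippets in this deck, although the indexers themselves are Rust-based; `Block` and `ReorgAwareChain` are hypothetical names, and the real indexer logic is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    number: int
    hash: str
    parent_hash: str


class ReorgAwareChain:
    """Keeps a short window of recent blocks and reports which ones a re-org orphaned."""

    def __init__(self, max_depth: int = 64):
        self.max_depth = max_depth  # assumed to be >= 1
        self.blocks: List[Block] = []

    def apply(self, block: Block) -> List[Block]:
        """Append `block`; return the local blocks it orphaned, newest first."""
        orphaned: List[Block] = []
        # Pop local blocks until the incoming block attaches to our head.
        # If its parent is not in the window at all, everything is dropped;
        # a real indexer would refetch from the node instead.
        while self.blocks and self.blocks[-1].hash != block.parent_hash:
            orphaned.append(self.blocks.pop())
        self.blocks.append(block)
        del self.blocks[:-self.max_depth]  # keep only the most recent window
        return orphaned
```

Downstream consumers can use the returned orphans to retract or overwrite rows keyed by block hash, which is one way to keep reprocessing idempotent.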
Next Phase · REALISM
Progress through Iteration.
Process - experimental & iterative
- Benchmarked ClickHouse vs TimescaleDB for the time-series store
- Started with GCP Pub/Sub → Kafka on Confluent
- Syncing: Kafka table engine / ClickPipes → bespoke python service → Datadog's Vector
- Serialization: Protobuf → JSON → Avro
- Transformation: Flink / Bytewax → Rust + Python
- Integrated ClickHouse · Dagster · BigQuery with DBA best practices
Team - grew over time
- 1–2 engineers prototyping → 12 engineers, matrixed across app dev, SDK, data eng, and strategy
Longterm Handoff
A new team gradually took ownership of infra, batch & ETL, and streaming services as I shifted my role to managing Gauntlet's on-chain protocol.
Lessons
Lasting Retrospective
- Prioritizing DevX was the most reliable bet.
- Follow-through after the crisis was important. Fully materialized wins came after 6+ months of continuous improvement.
- Could have done better: close the loop sooner, so teams could reliably address their own DAGs.
Q&A - Objections
Ask me anything
"Why not just use Argo?"
"Why self-host over Dagster Cloud?"
"Was zero-downtime worth a transpiler?"
Appendix - under the hood
Transpiler: ~2,000 legacy jobs → Dagster, by graph
```python
import networkx as nx


class LegacyRegistryCompiler:
    # compile every legacy Kubeflow job into a Dagster asset
    def from_registry(self, registry: PlanRegistry, selections=None):
        # NOTE: `refs` and `PlanRegistry` are defined outside this excerpt
        return [
            self.transpile_map[type(entry)](entry, refs)
            # walk the dependency graph in topological order
            for generation in nx.topological_generations(
                registry.inverted_graph()
            )
            for entry in generation
            if selections is None or entry.logical_name in selections
        ]
```
One pass over the legacy registry, topologically sorted - no hand-porting, dependency order preserved.
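A self-contained sketch of the same idea, assuming a plain dependency map rather than the in-house registry. It swaps networkx for the standard library's `graphlib.TopologicalSorter`; `transpile_in_order` and the `transpile` callback are hypothetical names, and `refs` here is only a guess at the role the undefined `refs` plays in the excerpt.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def transpile_in_order(
    deps: Dict[str, Set[str]],
    transpile: Callable[[str, Dict[str, str]], str],
) -> List[str]:
    """Transpile each job after all of its upstreams.

    `deps` maps a job name to the set of jobs it depends on. `refs` collects
    already-built assets so each transpile call can look up its upstreams.
    """
    refs: Dict[str, str] = {}
    order = list(TopologicalSorter(deps).static_order())  # upstreams come first
    for name in order:
        refs[name] = transpile(name, refs)
    return [refs[name] for name in order]


deps = {"prices": set(), "positions": {"prices"}, "var": {"positions"}}


def to_asset(name: str, refs: Dict[str, str]) -> str:
    # Every upstream must already be built when we get here.
    missing = deps[name] - refs.keys()
    if missing:
        raise RuntimeError(f"{name} transpiled before {sorted(missing)}")
    return f"asset:{name}"


assets = transpile_in_order(deps, to_asset)
```

A cycle in `deps` raises `graphlib.CycleError`, which is a useful early failure when the legacy registry contains a bad dependency.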
Appendix - under the hood
Hard-won infra gotchas
Multiprocessing start method is per-environment
- forkserver on k8s - avoids zombie subprocesses; preload modules so each fork starts warm
- spawn on Apple Silicon - forkserver hits malloc errors locally. Never plain fork.
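A minimal sketch of picking the start method per environment, using only the standard `multiprocessing` module. The helper names `pick_start_method` and `make_context` are hypothetical, and the sketch uses "macOS" as a coarse proxy for the Apple Silicon laptops mentioned above; how this was wired into the orchestrator is not shown in the deck.

```python
import multiprocessing as mp
import sys
from typing import Sequence


def pick_start_method() -> str:
    """forkserver where it is available and stable, spawn otherwise; never plain fork."""
    if sys.platform == "darwin":  # local laptops: forkserver was unreliable there
        return "spawn"
    if "forkserver" in mp.get_all_start_methods():
        return "forkserver"
    return "spawn"


def make_context(preload: Sequence[str] = ()):
    """Return a multiprocessing context, preloading heavy modules for forkserver."""
    method = pick_start_method()
    ctx = mp.get_context(method)
    if method == "forkserver" and preload:
        # Modules are imported once in the fork server, so each child starts warm.
        ctx.set_forkserver_preload(list(preload))
    return ctx
```

Preloading only helps with forkserver; under spawn every child re-imports its modules, which is acceptable for a local dev loop.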
safe-to-evict: "false" is narrower than it looks
- Only blocks the cluster-autoscaler's voluntary scale-down of a live run's pod
- Does nothing against OOMKill, node-pressure, preemption, or kubectl drain (that needs a PDB)
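The two protections above live in different objects: the annotation goes on the pod, the PDB is a separate resource. A minimal sketch as Python dicts that mirror the Kubernetes manifests; the label `app: orchestrator-run`, the PDB name, and the helper names are hypothetical, while the annotation key and the `policy/v1` PodDisruptionBudget kind are standard Kubernetes.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def run_pod_metadata(run_id: str) -> dict:
    """Pod metadata that opts a run pod out of autoscaler scale-down."""
    return {
        "labels": {"app": "orchestrator-run", "run-id": run_id},  # hypothetical labels
        "annotations": {SAFE_TO_EVICT: "false"},  # annotation values must be strings
    }


def run_pdb(namespace: str) -> dict:
    """PodDisruptionBudget that blocks voluntary evictions such as kubectl drain."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": "orchestrator-run-pdb", "namespace": namespace},
        "spec": {
            "maxUnavailable": 0,
            "selector": {"matchLabels": {"app": "orchestrator-run"}},
        },
    }
```

Neither object helps with OOMKill, node-pressure eviction, or preemption; those need right-sized requests and limits, and the resume and retry behavior described later in this appendix.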
Anti-eviction trades cost for reliability
- One long run pins a whole half-idle node → slow scale-down
- Deliberate: never lose a multi-hour sim just to reclaim a node
Throttle shared upstreams, not just pods
- Pod / step parallelism will happily DoS a rate-limited API
- tag_concurrency_limits is the global governor - DEX aggregator capped to 1 across all runs
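The idea behind a tag-keyed limit can be illustrated with a small in-process limiter. This is a generic sketch, not Dagster's implementation or its configuration syntax: `TagConcurrencyLimiter` and the tag `("upstream", "dex_aggregator")` are hypothetical, and a real global governor has to coordinate across processes and pods, which this single-process example does not.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, Tuple

Tag = Tuple[str, str]  # (key, value), e.g. ("upstream", "dex_aggregator")


class TagConcurrencyLimiter:
    """Caps how many callers may hold a given tag at once, regardless of worker count."""

    def __init__(self, limits: Dict[Tag, int]):
        self._slots = {tag: threading.BoundedSemaphore(n) for tag, n in limits.items()}

    @contextmanager
    def slot(self, tags: Dict[str, str]) -> Iterator[None]:
        # Acquire in sorted order so two callers with overlapping tags cannot deadlock.
        needed = sorted(tag for tag in tags.items() if tag in self._slots)
        held = []
        try:
            for tag in needed:
                self._slots[tag].acquire()
                held.append(tag)
            yield
        finally:
            for tag in reversed(held):
                self._slots[tag].release()


limiter = TagConcurrencyLimiter({("upstream", "dex_aggregator"): 1})
with limiter.slot({"upstream": "dex_aggregator"}):
    pass  # call the rate-limited API here; untagged work is never throttled
```

The design point is that the limit is attached to the shared upstream, not to the pod or step count, so adding workers cannot increase pressure on the rate-limited API.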
Resume the evicted, retry the rest
- Run-monitoring auto-resumes dead pods - up to 2 attempts
- Op-level retry policy (max_retries=2) catches transient step failures
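The op-level half of this can be sketched generically. The helper `run_with_retries` below is hypothetical and only illustrates the semantics of `max_retries=2` (one initial attempt plus up to two retries); in the platform itself this is expressed through the orchestrator's retry policy on the op, per the bullet above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T], max_retries: int = 2, delay_s: float = 0.0) -> T:
    """Call `step`; on failure retry up to `max_retries` more times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return step()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(delay_s)
    raise AssertionError("unreachable")
```

Retries like this only make sense for steps that are idempotent; a dead pod is a different failure class and is handled by resuming the run rather than retrying the step.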
etcd's ~1.5 MB per-object ceiling
- The original Kubeflow killer - the Argo/KFP manifest outgrew it
- Asset graphs sidestep it - no single giant DAG object to submit
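A rough way to see why a single workflow object hits the ceiling is to serialize a synthetic manifest and compare its size to the limit. The manifest shape below is a simplified, hypothetical Argo-style DAG, and the per-task payload is far smaller than a real one, so real manifests cross the limit at much lower task counts.

```python
import json

ETCD_OBJECT_LIMIT_BYTES = int(1.5 * 1024 * 1024)  # ~1.5 MB, the figure quoted above


def build_manifest(n_tasks: int) -> dict:
    """Hypothetical workflow: one object that lists every task in the DAG."""
    tasks = [
        {
            "name": f"task-{i}",
            "template": "run-op",
            "dependencies": [f"task-{i - 1}"] if i else [],
            "arguments": {"parameters": [{"name": "op_id", "value": str(i)}]},
        }
        for i in range(n_tasks)
    ]
    return {"kind": "Workflow", "spec": {"templates": [{"name": "main", "dag": {"tasks": tasks}}]}}


def manifest_bytes(manifest: dict) -> int:
    """Size of the compact JSON encoding, as a proxy for the stored object size."""
    return len(json.dumps(manifest, separators=(",", ":")).encode("utf-8"))


def fits_in_etcd(manifest: dict) -> bool:
    return manifest_bytes(manifest) <= ETCD_OBJECT_LIMIT_BYTES
```

Size grows linearly with task count, so a job that is defined dynamically and gains ops every day will eventually stop fitting, no matter how the cluster is sized.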