Technical Deep Dive · 2026

Gauntlet's
Data Platform Migration

a fragile ETL platform → a reliable scaled orchestrator

Simon Frid
Head of Infra, Gauntlet
Circa 2023
Preface

Paradox of Choice

  • Aera V3: Most technical · on-chain / off-chain hybrid vault protocol. Near-realtime oracle, powered by realtime data arch. Strict security. dApp. SDKs. $100M+ TVL in 6mo.
  • Wells Fargo Compliance System: Most operational · bespoke on-prem legal & compliance system, powered by a TB-scale versioned international legal-doc datastore. 100s of data-entry/QA staff and engineers involved.
  • COVID Act Now: Most globally impactful · public data platform scaled 0 → 4M users in 48h; cited in White House briefings.
  • Gigster: Most platform-shaped · Gig CLI + self-managed k8s (OIDC), multi-cloud AWS/GCP, ~1,000 freelance engineers.
The Gauntlet data-platform migration: Kubeflow → Dagster - ~2,000 jobs transpiled, zero-downtime cutover.
Real-time Data Arch: EVM & Solana ingestion, transformation & calculation - half-PB real-time aggregation on Rust, Kafka & ClickHouse.
What's at stake

Risk management for the biggest protocols in DeFi

Gauntlet's quants and data platform - data ingestion, risk simulations, parameter optimization - managing risk across Aave, Moonwell, Uniswap, Euler and Compound; $10+ billion in TVL.

i.e. what risk parameters keep each protocol solvent under stress?

The problem

A daily ETL job with critical business logic that never finished. 😅

  • 1 successful run in a nine-month period.
  • Each nightly run took ~18h and fired 1,000+ operations - the full suite determined dynamically by the experiments.
  • Unrenderable UI - the batch topology had no observability or debuggability.
  • Idempotency not guaranteed - reruns and backfills had to be managed manually.
  • Quants added more ops each day; backfills launched from local laptops compounded the scale.
  • etcd request-size limit hit periodically - the Argo/KFP workflow manifest outgrew etcd's ~1.5 MB per-object ceiling.
  • Backfills spawned ~100k pods in a single zonal cluster → the k8s master API went unresponsive.
  • Urgent forcing function: GKE v1.21 hit EOL in January - by mid-April a forced upgrade would break the cluster.
My approach

Share a 1 pager, with a phased roadmap

Notion · No Mo Kubeflo
No Mo Kubeflo
How I Learned to Stop Worrying and Love the Orchestrator
Goal
Replace Kubeflow with either Argo or Dagster.
Why
  1. Kubeflow doesn't scale - manifest too big to submit as one DAG.
  2. Need better observability, reproducibility & reliability.
  3. Pay down infra-level tech debt from our Kubeflow distro.
Evaluation - 6 dimensions
Scalability: 50k+ tasks/job · ~1,000 concurrent (vs <500 today)
Operator UX: UI, run + asset lineage, debugging
Performance: spin-up latency, compile/submit time
Efficiency: pod packing, container reuse, cost/run
Developer UX: local dev loop, fast iteration, dbt + CI
Maintainability: community health, abstraction, upgrade path
Pick the best tool, not the easiest path.
Barbell strategy: POC only Argo + Dagster to cover the widest solution space.
Plan
Timeline: Jan · Feb · Mar · Apr
Phase 1 · Standardize setup · Jan
Phase 2 · Measure capabilities · Jan–Feb
Phase 3 · Migrate to Dagster · Feb–Mar
Phase 4 · Launch orchestrator · Mar–Apr
Phase 5 · Clean up & consolidate · Apr →
Trade-offs

Orchestrator Comparison Matrix

Dimension | Airflow | Argo | Dagster

Sizing
Migration speed | 6–9 mo estimate | <3 mo estimate | unclear
Key risk | slowest learning | preserved old model | more adaptation, less drag

Architecture
Execution model | task DAG | task/container | asset graph
Footprint | bloated, heavy deps | thin, k8s-coupled | minimal, lean core
Job definition | shared env, dep conflicts | container-level only | isolated code-locations + deps per team

Performance & capability
Scalability proof | not POC'd | failed - ~1h compile time for YAML | designed for 50k+ tasks/job
Concurrency | scheduler limits at fan-out | k8s-bound, heavy | ~1,000+ concurrent
Integration | BigQuery via providers | bring-your-own glue | native BigQuery + pandas/SQL
Data quality | external tooling | none built-in | asset checks + lineage native

Operations & cost
Dev experience | boilerplate, slow iteration | YAML, rebuild to test | typed Python, real unit tests
Operability | mature, clunky | low-level k8s | UI + lineage + local loop
Cost | heavy ops headcount | low infra, high upkeep | upfront migration, lower run cost
Ecosystem & hiring | largest community, easy hiring | CNCF/k8s niche | younger, smaller talent pool

Verdict | ✕ too much scope | ✕ easy path failed | ✓ chosen
Execution

Mobilize and Act

Tech

  • Transpiler compiled ~2,000 legacy jobs → Dagster assets - no hand-porting
  • Re-targeted our in-house simulation engine onto Dagster - reused its graph compiler instead of rewriting the sims
  • BigQuery adapters for pandas + SQL workloads
  • Build efficiency - forkserver w/ preload, Python packaging, and Docker caching
  • Fresh GKE cluster w/ reproducible, declarative IaC and a local dev environment for infra

People

  • Evangelized the vision, need, and solution - met skepticism!
  • Rallied a cross-functional team of 8 engineers around the plan
  • Clear owner per workstream
  • Brought in both quants and eng
  • Pairing as escalation when blocked

Process

  • Measure. Iterate through deltas and errors until parity
  • Broke the migration into parallel workstreams
  • Weekly team planning
  • Daily standups to aggressively unblock
  • Daily communications for company-wide updates and for executives
QA control

Drive to provable parity

  • Pairwise BQ → BQ diff (legacy vs new) - full-row INTERSECT/EXCEPT per table × partition, rolled up across ~2,000 tables
  • Manual investigation of stochastic outputs (sims)
  • Run old + new in parallel instead of asking the team to trust a rewrite
  • Monitor 7+ days: table diffs, failed jobs, and discrepancy tickets
  • 2-day code freeze before the cutover
  • Integrate on-call before switching the production writer
parity-review · BQ → BQ
Table | prod | new | common
chain.transactions | # | # | #
dex.swaps | # | # | #
risk.exposures | # | # | #
risk.user_fact | # | # | #
chain.blocks | # | # | #
collateral.balances | # | # | #
oracle.prices | # | # | #
lending.positions | # | # | #
wallets.active | # | # | #
borrow.rates | # | # | #
governance.aip_events | # | # | #
liquidations | # | # | #
sim.var_curves | # | # | #
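The pairwise diff behind the prod / new / common columns can be sketched as a small query builder. This is a minimal sketch, not the migration's actual tooling: the table names, the `partition_date` column, and the helper names are all hypothetical, and it assumes BigQuery Standard SQL, where set operators must be written as `EXCEPT DISTINCT` and `INTERSECT DISTINCT`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParityQueries:
    """Three row-count queries for one table x partition comparison."""
    only_in_prod: str
    only_in_new: str
    common: str


def _partition_select(table: str, partition_col: str, partition: str) -> str:
    # Full-row select restricted to a single date partition.
    return f"SELECT * FROM `{table}` WHERE {partition_col} = DATE '{partition}'"


def _count(set_expr: str) -> str:
    # Wrap a set expression so the query returns a single row count.
    return f"SELECT COUNT(*) AS n FROM (\n{set_expr}\n)"


def build_parity_queries(prod_table, new_table, partition,
                         partition_col="partition_date"):
    """Build the prod-only, new-only, and common row-count queries."""
    prod = _partition_select(prod_table, partition_col, partition)
    new = _partition_select(new_table, partition_col, partition)
    return ParityQueries(
        only_in_prod=_count(f"{prod}\nEXCEPT DISTINCT\n{new}"),
        only_in_new=_count(f"{new}\nEXCEPT DISTINCT\n{prod}"),
        common=_count(f"{prod}\nINTERSECT DISTINCT\n{new}"),
    )


def is_parity(only_in_prod: int, only_in_new: int) -> bool:
    # A partition is at parity when neither side has rows the other lacks.
    return only_in_prod == 0 and only_in_new == 0


queries = build_parity_queries(
    "legacy.risk_exposures", "dagster.risk_exposures", "2023-03-01"
)
print(queries.only_in_prod)
```

One caveat of this sketch: `DISTINCT` set operators collapse duplicate rows, so a table whose only discrepancy is a duplicated row would still report parity; a separate row-count comparison would be needed to catch that.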
Launch
Zero downtime.

The platform was live; the team had to keep moving the same day.

  • Company demo - set the stage; the company adopted the new paradigm.
  • Team retro - lessons learned; solidify morale.
Aftermath

Harden the platform, ramp the team

Onboarding & ramping the company
  • Tutorials & quickstart
  • Office hours & 1:1 support
Operational management
  • On-call guide
  • Several hours/day → <20 min/day
Monitoring & alerting
  • Sensor-based alerting, with aggregated summaries in Slack
  • Sentry error aggregation
  • Custom asset-health dashboards
Self-serve: one declaration provides an asset per protocol
# one declaration → a VaR asset per protocol
SIMPSON_LAR_VAR = {
    protocol: FunctionBasedAsset(
        logical_name=f"{protocol.value}SimpsonLarVar",
        materialize_fn=_fetch_data(protocol, info),
        data_schema=SimpsonLarVarSchema,
        table_name=f"{protocol.value.protocol_str}_lar_var",
        grouping=AssetGrouping.SIMULATION,
    )
    for protocol in PROTOCOLS   # compound · euler · moonwell
}
40+ engineers shipping on Dagster
Continuous improvement

Incrementally improved over the next 6 months

Metric | Before · Kubeflow | After · Dagster
Freshness | daily SLA · best-effort | daily & hourly DAGs; eventually 5 min
Success rate | ~85% | ~95%+ initially; 99%+ eventually
Operator UI | couldn't render a day's run | loads in ~1 s
Compile / load | ~15 min full compile | 2 min initially; 10 s eventually, via added optimization
Time on-call | multiple hours/day | under 20 minutes/day
Job duration | 18 hours | 4 hr daily initially · <1 hr eventually
Backfills | manual · high cognitive load | partition-aware · transitive replay
Access | manual user provisioning w/ global permissions | IAP-based, team-owned DAGs, code-location multitenancy
Compute cost | ~$100k / mo | ~$50k / mo initially → $20k / mo eventually
Plugins | bespoke | dbt toolkit
Dashboarding | IPython notebooks, Metabase | Mode, Hex
Rituals | irregular | regular eng/platform cost reviews
With compute efficiency greatly improved, BigQuery spend grew over the following years. Net platform cost was still ~50% lower overall.
Next Phase

REALISM - Real-time Streaming

Reliable Events And Logic In Six Minutes

  • Re-org-aware, Rust-based WebSocket indexers for Solana & EVM
  • Continuous complex window aggregations in 15s
  • Cascade of 20+ materialized views
  • Idempotent processing at ½ PB scale
  • Table-aware metric APIs with caching
  • Python asyncio streaming subgraph for the logical data-eng framework
Real-time streaming architecture: a WebSocket source (Solana + EVM) feeds Rust-based indexers, which publish incoming events to Kafka; a Python async streaming graph consumes them, a sync service writes to ClickHouse materialized views, and an API (driven by parameters) serves a dashboard.
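To make the Python async stage of that pipeline concrete, here is a minimal sketch of a tumbling-window aggregation written as an asyncio generator. It is an illustration only: the event shape `(timestamp, key, value)`, the function name, and the use of 15 seconds as the window length are assumptions, and it presumes events arrive in timestamp order, so it handles neither re-orgs nor late data.

```python
import asyncio
from collections import defaultdict


async def tumbling_window_sums(events, window_s=15):
    """Sum values per key over fixed, non-overlapping windows.

    `events` is an async iterator of (ts_seconds, key, value) tuples,
    assumed to be ordered by timestamp.
    """
    current = None              # start of the window being filled
    sums = defaultdict(float)
    async for ts, key, value in events:
        window = int(ts // window_s) * window_s
        if current is not None and window != current:
            # The previous window is complete: emit it and start fresh.
            yield current, dict(sums)
            sums.clear()
        current = window
        sums[key] += value
    if current is not None:
        # Flush the final, possibly partial, window at end of stream.
        yield current, dict(sums)


async def demo_source():
    # Hypothetical events standing in for messages consumed from Kafka.
    for event in [(0, "ETH", 1.0), (5, "ETH", 2.0), (14.9, "SOL", 4.0),
                  (15, "ETH", 8.0), (31, "SOL", 16.0)]:
        yield event


async def collect():
    return [window async for window in tumbling_window_sums(demo_source())]


print(asyncio.run(collect()))
```

Keying each window by its start timestamp keeps the output deterministic for a given input, which is one ingredient of idempotent downstream writes.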
Next Phase · REALISM

Progress through Iteration.

Process - experimental & iterative
  • Benchmarked ClickHouse vs TimescaleDB for the time-series store
  • Messaging: GCP Pub/Sub → Kafka on Confluent
  • Syncing: Kafka table engine / ClickPipes → bespoke Python service → Datadog's Vector
  • Serialization: Protobuf → JSON → Avro
  • Transformation: Flink / Bytewax → Rust + Python
  • Integrated ClickHouse · Dagster · BigQuery with DBA best practices
Team - grew over time
  • 1–2 engineers prototyping → 12 engineers, matrixed across app dev, SDK, data eng, and strategy
Long-term Handoff
New team gradually began to own infra, batch & ETL, and streaming services - as I shifted my role to managing Gauntlet's on-chain protocol.
Lessons

Lasting Retrospective

  • Prioritizing DevX was the most reliable bet.
  • Following through after the crisis mattered: the fully materialized wins came after 6+ months of continuous improvement.
  • Could have done better - closing the loop sooner so teams could reliably address their own DAGs.
Q&A - Objections

Ask me anything

"Why not just use Argo?"
"Why self-host over Dagster Cloud?"
"Was zero-downtime worth a transpiler?"
Appendix - under the hood

Transpiler: ~2,000 legacy jobs → Dagster, by graph

class LegacyRegistryCompiler:
    # compile every legacy Kubeflow job into a Dagster asset
    def from_registry(self, registry: PlanRegistry, selections=None):
        return [
            self.transpile_map[type(entry)](entry, refs)
            # walk the dependency graph in topological order
            for generation in nx.topological_generations(
                registry.inverted_graph()
            )
            for entry in generation
            if selections is None or entry.logical_name in selections
        ]

One pass over the legacy registry, topologically sorted - no hand-porting, dependency order preserved.
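The same shape, a per-type dispatch table applied while walking the graph generation by generation, can be sketched end to end with only the standard library. This is a hedged illustration rather than the real compiler: it swaps networkx for `graphlib.TopologicalSorter`, and the entry types, transpile functions, and dictionary output are hypothetical stand-ins for the legacy registry and Dagster assets.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class SqlJob:       # hypothetical legacy entry type
    logical_name: str
    query: str


@dataclass(frozen=True)
class PythonJob:    # hypothetical legacy entry type
    logical_name: str
    entrypoint: str


def transpile_sql(entry, refs):
    return {"asset": entry.logical_name, "kind": "sql", "deps": sorted(refs)}


def transpile_python(entry, refs):
    return {"asset": entry.logical_name, "kind": "python", "deps": sorted(refs)}


# One transpile function per legacy entry type.
TRANSPILE_MAP = {SqlJob: transpile_sql, PythonJob: transpile_python}


def topological_generations(deps):
    """Yield lists of nodes whose upstream dependencies are all done."""
    sorter = TopologicalSorter(deps)   # maps node -> set of upstream nodes
    sorter.prepare()                   # raises CycleError on a cyclic graph
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


def compile_registry(entries, deps, selections=None):
    """Transpile every selected entry, upstream entries first."""
    by_name = {entry.logical_name: entry for entry in entries}
    assets = []
    for generation in topological_generations(deps):
        for name in generation:
            if selections is not None and name not in selections:
                continue
            entry = by_name[name]
            assets.append(TRANSPILE_MAP[type(entry)](entry, deps.get(name, ())))
    return assets


entries = [
    SqlJob("prices", "SELECT 1"),
    SqlJob("positions", "SELECT 2"),
    PythonJob("var", "sims.var:main"),
]
deps = {"prices": set(), "positions": {"prices"}, "var": {"prices", "positions"}}
for asset in compile_registry(entries, deps):
    print(asset)
```

Sorting each generation makes the output order stable between runs, which helps when diffing the compiled result during a migration.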

Hard-won infra gotchas

Multiprocessing start method is per-environment
  • forkserver on k8s - avoids zombie subprocesses; preload modules so each fork starts warm
  • spawn on Apple Silicon - forkserver hits malloc errors locally. Never plain fork.
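A minimal sketch of picking the start method per environment with the standard `multiprocessing` module follows. The helper names and the preload list are hypothetical, and treating every non-Linux host like the Apple Silicon case is an assumption made here to keep the example small.

```python
import multiprocessing as mp
import platform
from typing import Optional, Sequence


def choose_start_method(system: Optional[str] = None) -> str:
    """Return 'forkserver' on Linux (k8s) and 'spawn' elsewhere; never 'fork'."""
    system = system or platform.system()
    return "forkserver" if system == "Linux" else "spawn"


def make_context(preload: Sequence[str] = (), system: Optional[str] = None):
    """Build a multiprocessing context for the current environment."""
    method = choose_start_method(system)
    ctx = mp.get_context(method)
    if method == "forkserver":
        # Modules listed here are imported once in the fork server, so
        # workers forked from it start with them already loaded. This must
        # be set before the first worker process is started.
        ctx.set_forkserver_preload(list(preload))
    return ctx


# Hypothetical heavy modules; in practice these would be the slow imports.
ctx = make_context(preload=["json", "decimal"])
print(ctx.get_start_method())
```

Using an explicit context instead of the global `set_start_method` keeps the choice local to the code that creates the pool, so libraries in the same process are not affected.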
safe-to-evict:"false" is narrower than it looks
  • Only blocks the cluster-autoscaler's voluntary scale-down of a live run's pod
  • Does nothing against OOMKill, node-pressure, preemption, or kubectl drain (that needs a PDB)
Anti-eviction trades cost for reliability
  • One long run pins a whole half-idle node → slow scale-down
  • Deliberate: never lose a multi-hour sim just to reclaim a node
Throttle shared upstreams, not just pods
  • Pod / step parallelism will happily DoS a rate-limited API
  • tag_concurrency_limits is the global governor - DEX aggregator capped to 1 across all runs
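The idea of a per-tag global cap can be shown with a toy admission check. This is not Dagster's implementation, only a model of the behaviour described above, assuming the limit is enforced when runs are dequeued; the tag key `upstream` and its values are hypothetical.

```python
from collections import Counter


def can_launch(run_tags, in_flight, limits):
    """Decide whether a queued run may start.

    run_tags:  tags on the candidate run, e.g. {"upstream": "dex_aggregator"}
    in_flight: list of tag dicts, one per run currently executing
    limits:    {(tag_key, tag_value): max concurrent runs with that tag}
    """
    active = Counter(
        (key, value) for tags in in_flight for key, value in tags.items()
    )
    for (key, value), limit in limits.items():
        # Only limits matching one of the candidate's tags apply to it.
        if run_tags.get(key) == value and active[(key, value)] >= limit:
            return False
    return True


limits = {("upstream", "dex_aggregator"): 1}
running = [{"upstream": "dex_aggregator"}]
print(can_launch({"upstream": "dex_aggregator"}, running, limits))  # blocked
print(can_launch({"upstream": "bigquery"}, running, limits))        # allowed
```

Because the check looks at all in-flight runs rather than the pods of a single run, it throttles the shared upstream no matter how many jobs fan out at once.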
Resume the evicted, retry the rest
  • Run-monitoring auto-resumes dead pods - up to 2 attempts
  • Op-level retry policy (max_retries=2) catches transient step failures
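The step-level half of this can be sketched as a plain retry wrapper. It is a generic stand-in, not Dagster's retry policy: the function name and parameters are hypothetical, and `max_retries=2` is read here as two retries after the first attempt, so three attempts in total.

```python
import time


def run_with_retries(step, max_retries=2, delay_s=0.0, retry_on=(Exception,)):
    """Call `step`, retrying up to `max_retries` times on failure."""
    attempt = 0
    while True:
        try:
            return step()
        except retry_on:
            if attempt >= max_retries:
                raise            # retries exhausted: surface the failure
            attempt += 1
            time.sleep(delay_s)  # optional pause before the next attempt


calls = {"n": 0}


def flaky_step():
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("transient upstream failure")
    return "ok"


print(run_with_retries(flaky_step), "after", calls["n"], "attempts")
```

Retries of this kind only help with failures inside a live process; a pod that is evicted takes the retry loop down with it, which is why resuming at the run level is a separate mechanism.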
etcd's ~1.5 MB per-object ceiling
  • The original Kubeflow killer - the Argo/KFP manifest outgrew it
  • Asset graphs sidestep it - no single giant DAG object to submit
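A rough way to see why a single submitted manifest hits this ceiling is to measure a serialized workflow as the task count grows. The sketch below uses a toy manifest whose field names are illustrative rather than the real Argo schema, and takes etcd's default request limit of 1.5 MiB as the threshold.

```python
import json

# etcd's default request-size limit (--max-request-bytes) is 1.5 MiB.
ETCD_LIMIT_BYTES = int(1.5 * 1024 * 1024)


def manifest_size_bytes(manifest: dict) -> int:
    """Size of the manifest when serialized as compact JSON."""
    return len(json.dumps(manifest, separators=(",", ":")).encode("utf-8"))


def fits_in_etcd(manifest: dict, limit: int = ETCD_LIMIT_BYTES) -> bool:
    return manifest_size_bytes(manifest) <= limit


def toy_workflow(n_tasks: int) -> dict:
    """A linear chain of tasks, all embedded in one workflow object."""
    return {
        "kind": "Workflow",
        "spec": {
            "templates": [
                {
                    "name": f"task-{i}",
                    "container": {
                        "image": "example/image:tag",
                        "args": ["--task", str(i)],
                    },
                    "dependencies": [f"task-{i - 1}"] if i else [],
                }
                for i in range(n_tasks)
            ]
        },
    }


for n in (1_000, 50_000):
    wf = toy_workflow(n)
    print(n, manifest_size_bytes(wf), fits_in_etcd(wf))
```

Real task templates carry far more per-task detail than this toy one, so an actual workflow would cross the limit at a much smaller task count.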