Engineering · 2026-02-20 · Last verified: February 2026

Rust in Production: How We Rewrote a Fintech API and Cut AWS Costs by 70%

J&L Dev Team
Senior Engineering Team

A real-world account of how we rewrote a client's Node.js payment validation service in Rust — and the numbers that followed: p99 latency dropped from 45ms to 3ms, memory from 512MB to 48MB per container, and the monthly AWS bill fell by 70%.

In late 2025, one of our Estonian fintech clients came to us with a problem that sounds familiar to many engineering teams: their Node.js payment validation service was becoming expensive to run and difficult to scale under peak load. Transaction volumes had grown 4x over two years, and the infrastructure bill had grown to match — without a proportional improvement in performance.

We proposed a full rewrite in Rust. The client was skeptical — Rust has a reputation for steep learning curves and slow development velocity. This post is an honest account of what happened: what we built, what we found, and what the numbers looked like after six months in production.

The Starting Point: A Node.js Validation Service

The service in question handled payment validation — checking transaction limits, fraud signals, merchant rules, and account balance pre-checks before forwarding to the payment processor. It processed roughly 2,000 requests per second at peak, with strict SLA requirements: p99 latency under 50ms, 99.99% uptime.

The Node.js implementation had evolved organically over four years. It worked, but it showed its age:

  • Memory usage: Each instance consumed 512MB at rest, scaling to 900MB under load. Running 12 instances for redundancy cost ~$1,400/month in EC2 alone.
  • GC pauses: V8's garbage collector caused unpredictable latency spikes — p99 was usually around 45ms but spiked to 200ms+ several times per day.
  • CPU overhead: A single Node.js worker could handle ~300 requests/second before saturating. They needed 8 workers per instance for headroom.

The codebase was also carrying four years of incremental patches. Adding new validation rules required understanding an increasingly tangled chain of async callbacks and shared state.

The Rewrite: Axum + Tokio on Stable Rust

We spent two weeks in discovery — mapping every validation rule, every edge case, every error code the existing service produced. This was non-negotiable: behavior parity had to be 100% before we could switch traffic.

The new stack:

  • Axum (0.8) as the HTTP framework — ergonomic, async-native, excellent Tower middleware ecosystem
  • Tokio as the async runtime — battle-tested, predictable performance characteristics
  • sqlx for database access — compile-time verified SQL queries, zero runtime surprises
  • serde + serde_json for serialization — efficient deserialization of validation payloads
  • Redis client (bb8 + redis-rs) for distributed rate limiting and rule caching

The validation logic itself was modeled as a pipeline of composable rules. Each rule implements a single validate(&self, payload: &PaymentPayload, ctx: &ValidationContext) -> Result<(), ValidationError> method. New rules are added by implementing the trait: no shared mutable state, no callback chains, no hidden dependencies.

// The async_trait crate keeps the trait object-safe, so rules can be
// stored as Box<dyn ValidationRule> in an ordered pipeline.
#[async_trait::async_trait]
pub trait ValidationRule: Send + Sync {
    fn name(&self) -> &'static str;
    async fn validate(&self, payload: &PaymentPayload, ctx: &ValidationContext) -> Result<(), ValidationError>;
}

This design made the codebase dramatically easier to reason about. Every rule is an isolated unit that can be tested independently, added without touching existing code, and ordered declaratively.
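As a minimal sketch of that pipeline design (synchronous and stdlib-only here for brevity; the production version is async, and all type and rule names below are illustrative, not the client's actual rules):

```rust
// Simplified, synchronous sketch of the rule-pipeline design.
// The production service uses async rules; names here are illustrative.
#[derive(Debug, PartialEq)]
pub struct ValidationError(pub String);

pub struct PaymentPayload {
    pub amount_cents: u64,
}

pub trait ValidationRule: Send + Sync {
    fn name(&self) -> &'static str;
    fn validate(&self, payload: &PaymentPayload) -> Result<(), ValidationError>;
}

// One isolated rule: reject transactions above a per-transaction limit.
struct AmountLimit {
    max_cents: u64,
}

impl ValidationRule for AmountLimit {
    fn name(&self) -> &'static str {
        "amount_limit"
    }
    fn validate(&self, payload: &PaymentPayload) -> Result<(), ValidationError> {
        if payload.amount_cents > self.max_cents {
            return Err(ValidationError(format!("{} exceeded", self.name())));
        }
        Ok(())
    }
}

// The pipeline runs rules in declared order and stops at the first failure.
pub fn run_pipeline(
    rules: &[Box<dyn ValidationRule>],
    payload: &PaymentPayload,
) -> Result<(), ValidationError> {
    for rule in rules {
        rule.validate(payload)?;
    }
    Ok(())
}

fn main() {
    let rules: Vec<Box<dyn ValidationRule>> =
        vec![Box::new(AmountLimit { max_cents: 10_000 })];
    assert!(run_pipeline(&rules, &PaymentPayload { amount_cents: 5_000 }).is_ok());
    assert!(run_pipeline(&rules, &PaymentPayload { amount_cents: 50_000 }).is_err());
    println!("pipeline ok");
}
```

Ordering the rules is just ordering the Vec, which is what makes the pipeline declarative.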

Development Timeline

The total rewrite took 11 weeks for a team of two senior Rust engineers:

  • Weeks 1–2: Discovery, behavior documentation, test harness construction (we captured 50,000 real request/response pairs from production for regression testing)
  • Weeks 3–7: Core service implementation and rule migration
  • Weeks 8–9: Load testing, performance tuning, Redis integration
  • Weeks 10–11: Canary deployment (5% traffic → 25% → 100%), monitoring, cutover
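The replay harness from weeks 1–2 can be sketched roughly as follows. Everything here is a hypothetical stand-in (the in-memory "service", the field names, the status strings); the real harness replayed full HTTP request/response pairs:

```rust
// Hypothetical sketch of the replay-harness idea: captured production
// request/response pairs become golden tests for the rewrite.
#[derive(Debug, Clone)]
struct CapturedPair {
    request_amount_cents: u64,
    expected_status: &'static str, // e.g. "approved" or "limit_exceeded"
}

// Stand-in for the rewritten service under test.
fn new_service(amount_cents: u64) -> &'static str {
    if amount_cents > 10_000 { "limit_exceeded" } else { "approved" }
}

// Replay every captured pair and collect the indices of mismatches
// instead of stopping at the first, so one run reports full parity.
fn replay(pairs: &[CapturedPair]) -> Vec<usize> {
    pairs
        .iter()
        .enumerate()
        .filter(|(_, p)| new_service(p.request_amount_cents) != p.expected_status)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let pairs = vec![
        CapturedPair { request_amount_cents: 5_000, expected_status: "approved" },
        CapturedPair { request_amount_cents: 50_000, expected_status: "limit_exceeded" },
    ];
    assert!(replay(&pairs).is_empty());
    println!("behavior parity: ok");
}
```

Collecting all mismatches per run, rather than failing fast, is what made the 50,000-pair corpus practical to iterate against.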

One important note on velocity: Rust's compiler is strict, and the initial learning investment is real. But we found that after the first 3 weeks, development velocity stabilized. The type system caught issues that would have been subtle runtime bugs in Node.js — null handling, error propagation, data race conditions in concurrent processing. The total number of memory-safety or concurrency-related production incidents after cutover: zero in the first three months.

The Numbers After Six Months

We measured everything. Here are the results after six months of full production load:

Latency:

  • p50: 0.8ms (was 8ms) — 90% reduction
  • p99: 3ms (was 45ms) — 93% reduction
  • p99.9: 7ms (was 200ms+) — 97%+ reduction
  • GC pauses: eliminated entirely

Memory:

  • Idle: 48MB per instance (was 512MB) — 91% reduction
  • Under peak load: 72MB per instance (was 900MB) — 92% reduction
  • Instances required: reduced from 12 to 4 (with more headroom than before)

Throughput:

  • Single instance peak: 3,400 req/s (was 300 req/s per worker, or ~2,400 with 8 workers)
  • No request queuing observed below 2,800 req/s

Cost:

  • EC2 compute (payment validation only): $420/month (was $1,400/month) — 70% reduction
  • With savings across load balancers, data transfer, and CloudWatch metrics: ~$1,100/month total savings

What Made the Difference

Three factors drove the performance gains:

1. No garbage collector. Rust's ownership system deterministically frees memory when values go out of scope. There are no stop-the-world pauses, no unpredictable latency spikes. For a service with a hard p99 SLA, this is transformative.
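A tiny self-contained illustration of that determinism (the buffer type and counter are contrived for the demo, not production code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Demonstrates deterministic cleanup: Drop runs exactly when a value
// goes out of scope, not at some later collector-chosen moment.
static LIVE_BUFFERS: AtomicUsize = AtomicUsize::new(0);

struct RequestBuffer(Vec<u8>);

impl RequestBuffer {
    fn new(len: usize) -> Self {
        LIVE_BUFFERS.fetch_add(1, Ordering::SeqCst);
        RequestBuffer(vec![0; len])
    }
}

impl Drop for RequestBuffer {
    fn drop(&mut self) {
        LIVE_BUFFERS.fetch_sub(1, Ordering::SeqCst);
    }
}

fn live_buffers() -> usize {
    LIVE_BUFFERS.load(Ordering::SeqCst)
}

fn main() {
    {
        let _buf = RequestBuffer::new(1024);
        assert_eq!(live_buffers(), 1);
    } // _buf dropped here; its memory is returned immediately
    assert_eq!(live_buffers(), 0);
    println!("deterministic drop: ok");
}
```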

2. True multi-threading. Node.js runs a single-threaded event loop per worker process. Tokio on Rust uses a work-stealing thread pool that saturates all available CPU cores. A single Rust instance replaced 8 Node.js workers.
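The multi-core point can be shown with plain std threads, no Tokio required (Tokio's scheduler adds work stealing on top of this; the workload below is a contrived sum):

```rust
use std::thread;

// Splits work across every available core, where a single-threaded
// event loop would use only one.
fn parallel_sum(data: &[u64]) -> u64 {
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk = (data.len() / workers).max(1);
    // Scoped threads may borrow `data` directly; all joins happen
    // before the scope returns.
    thread::scope(|s| {
        data.chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    assert_eq!(parallel_sum(&data), 500_500);
    println!("parallel sum: ok");
}
```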

3. Zero-cost serialization. The serde ecosystem deserializes JSON payloads directly into strongly-typed structs with no intermediate allocation. The old Node.js service was doing unnecessary object copies at every step of the validation pipeline.

Honest Trade-offs

Rust is not the right choice for every service. Here is what we tell clients honestly:

  • Higher initial engineering cost. Rust development takes longer than Node.js for engineers new to the language. Budget 20–30% extra time for the first Rust project on a team.
  • Smaller hiring pool. There are fewer Rust engineers than Node.js engineers. This is improving rapidly — but it is a real consideration for team planning.
  • Slower iteration for non-critical services. For a simple CRUD API with minimal performance requirements, the overhead of Rust's type system provides less return on investment.

The formula we use: if a service handles more than 500 req/s, has strict latency SLAs (p99 < 20ms), or runs persistently in production with a large steady-state memory footprint, Rust pays for itself within 6–12 months through cloud savings alone.
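The arithmetic behind that rule of thumb is just payback period. All dollar figures below are hypothetical inputs for illustration, not client data:

```rust
// Back-of-the-envelope payback calculation behind the rule of thumb.
// Dollar figures passed in are hypothetical, not client data.
fn payback_months(rewrite_cost: f64, monthly_savings: f64) -> Option<f64> {
    if monthly_savings <= 0.0 {
        return None; // the rewrite never pays for itself on cloud costs alone
    }
    Some(rewrite_cost / monthly_savings)
}

fn main() {
    // e.g. an $8,000 net engineering premium against $1,100/month saved
    let months = payback_months(8_000.0, 1_100.0).unwrap();
    assert!(months > 7.0 && months < 8.0); // ~7.3 months, inside the 6-12 band
    println!("payback in about {:.1} months", months);
}
```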

What We Learned

This project reinforced three convictions we now treat as engineering policy:

  1. Capture production traffic before rewriting. The 50,000 request/response pairs we captured were the most valuable artifact of the entire project. They revealed edge cases we never would have found by reading the old code.

  2. Canary deployments are non-negotiable for financial services. Running at 5% traffic for a week before increasing gave us confidence that was worth more than any amount of load testing.

  3. Rust's compile-time guarantees eliminate an entire category of production incidents. In 6 months post-cutover: zero null pointer panics, zero data race conditions, zero memory leaks. The type system enforced correctness that would have required extensive runtime monitoring to catch in a dynamically typed language.

For teams evaluating Rust for their performance-critical services: the learning curve is real, the payoff is real, and the compounding benefits — fewer production incidents, lower cloud costs, simpler reasoning about concurrent code — make it the right choice for the right problems.

Production Performance Results

Latency Reduction (p99): 93%
Memory Reduction: 91%
AWS Cost Savings: 70%

* Results from 6 months of production operation. Client anonymized by agreement.