Ceph · Proxmox · 25GbE · Storage

# Your 25GbE Dream Is Slamming Into a Wall: The Brutal Truth About My 3-Node Ceph Cluster

    February 20, 2026
    7 min read
I spun up what felt like a monster on paper: a three-node Proxmox cluster, four NVMe OSDs per node, all enterprise-grade. Separate 25Gb NICs and switches for Ceph public and private traffic. Clean. Overbuilt. Fast.

Or at least it was supposed to be.

Then the benchmark numbers came back. Bandwidth: 2756 MB/sec. IOPS: 689. Average latency: 23ms. And suddenly, that shiny 25Gb network didn't feel so shiny anymore.

This wasn't a production system. It was a learning lab. A playground. But when you spend real money on NVMe drives and 25Gb networking, you expect fireworks. Instead, I got a ceiling. And the comments started rolling in.

---

### The Setup: It Should've Screamed

Here's the hardware:

- 3x MS-02 Ultra 285HX nodes
- 64GB DDR5 5600 per node
- 4x Micron 7450 Pro NVMe drives per node
- PCIe Gen4
- Dual 25Gb networking (separate public and cluster networks)

On paper? This thing should fly.

The write test was run with:

```
rados bench -p ultra-pool 20 write -t 16 --object_size=4MB --no-cleanup
```

And the result was consistent. Around 2.7 to 2.8 GB/sec. Not spiky. Not unstable. Just… capped.

One reply summed it up bluntly: *You're hitting the network limit.*

Another user did the math out loud: almost 3 GBytes/sec is about 24 Gbits. That's basically 25GbE maxed out.

It stings when someone else says it. Because deep down, you already know.

---

### "Ceph Shines at Scale" — And That's the Catch

One of the first responses cut through the noise:

> It's hard to tell. But I'd say the small size of the cluster. Ceph shines at scale. 3 nodes is the very bare minimum.

That's the part people don't always want to hear. Three nodes isn't "a cluster." It's the minimum viable cluster. It's the point where Ceph barely starts being Ceph.

Distributed storage doesn't flex its muscles until you give it room to breathe. Add more OSDs. Add more nodes. Spread the load wider. Then you start to see real parallelism.

With only three nodes, you're boxed in. Replication traffic bounces between the same machines. Network saturation shows up fast. There's nowhere for the data to hide.

This isn't Ceph being slow. This is Ceph being honest.

---

### The IOPS Fear

What really scared one commenter wasn't the bandwidth. It was the IOPS. 689 average IOPS.

Someone chimed in saying they get roughly 127 IOPS out of their SAS SSDs — so comparatively, this looks better. But let's be real: these are Micron 7450 Pro NVMe drives. Locally, they can do absurd numbers. (To be fair, at a 4MB object size, 689 ops/sec is just the 2756 MB/sec bandwidth expressed a different way; it says nothing about small-block IOPS.)

So why does distributed storage look so… ordinary?

Because this isn't local storage. It's not a single NVMe talking to a CPU over PCIe. It's:

- Client write
- Network hop
- Primary OSD write
- Replica OSD write(s)
- Acknowledgment chain
- Journaling
- Commit

Every 4MB object goes through ceremony. Through consensus. Through safety.

You don't get NVMe marketing numbers in a replicated cluster. You get durability. You get fault tolerance. You get survival. There's a price for that.

---

### Is 25GbE the Villain?

Let's talk about the network. 25GbE sounds fast. And it is. Until you start pushing multiple 4MB objects across replication streams.

At ~2.8 GB/sec, you're basically saturating a 25Gb link when overhead is included. You're not "close." You're there.

One commenter offered a practical suggestion: LAG or ECMP.

> Individual streams won't exceed 25Gb. But you can run 50Gb per second no problem.

That's the nuance. A single TCP stream won't magically exceed 25Gb. But multiple flows across multiple ports? Now you're talking.
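For the LAG route, here's a minimal sketch of a two-port 802.3ad bond for the Ceph cluster network, in the /etc/network/interfaces style Proxmox uses. The interface names and addressing are placeholders, and it assumes the switch side is configured for LACP:

```
# Hypothetical two-port LACP bond for the Ceph cluster network (ifupdown2 syntax).
# enp1s0f0/enp1s0f1 and the 10.10.10.x addressing are made-up examples.
auto bond1
iface bond1 inet static
    address 10.10.10.11/24
    bond-slaves enp1s0f0 enp1s0f1
    bond-miimon 100
    bond-mode 802.3ad
    # layer3+4 hashes on IP and port, so different OSD connections can land on different members
    bond-xmit-hash-policy layer3+4
```

Any single connection still tops out at one member's 25Gb; the gain only shows up when several OSD-to-OSD and client flows hash onto different links.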
ECMP can spread traffic across multiple links. LAG can work too, but it gets weird past two ports on some gear.

Suddenly you're not just benchmarking storage — you're deep in networking architecture. This is how it happens. You start with a cluster and end up redesigning your switching fabric.

---

### The 4K Block Size Rabbit Hole

Then things got more interesting. Someone asked: switch or full mesh network? Then came the suggestion: reformat NVMe to 4K block size.

That's when you know you're officially in the weeds.

The Micron 7450 drives might be running 512e instead of native 4Kn. That mismatch can cost performance. Especially in distributed storage systems that care deeply about alignment.

But reformatting NVMe drives isn't casual. You don't click a button. You:

- Remove an OSD
- Wait for rebalance
- Reformat drive
- Re-add as new OSD
- Wait for rebalance
- Repeat

Slowly. Carefully. One disk at a time. That's not just tuning. That's surgery.

And here's the emotional reality: when you're new to Ceph, even asking whether you need to redo the whole cluster feels overwhelming. You don't want to blow up the lab you just built. But that's how you learn.

---

### Benchmarks vs Reality

Another voice of reason popped up:

> The best benchmark is probably your real workload or a fio benchmark which resembles your workload.

That's the uncomfortable truth about storage benchmarks. `rados bench` is synthetic. It's clean. It's controlled. It doesn't look like VM traffic. It doesn't look like databases. It doesn't look like backups.

You can chase benchmark numbers forever. Or you can ask: does this handle what I actually plan to run?

If your workload isn't saturating 25Gb, maybe you're not bottlenecked in real life. Maybe this "limit" only exists in lab stress tests. That perspective matters.

---

### Distributed Storage Isn't "Bad." It's Expensive.

There was a moment in the discussion where someone basically asked: Is distributed storage really that bad?

It's not bad. It's expensive. Not just financially. Architecturally. You're trading raw local NVMe speed for:

- Redundancy
- Self-healing
- Data distribution
- Node failure tolerance

Ceph isn't designed to win single-drive benchmarks. It's designed to survive a node dying at 2am without you noticing. That's the real performance metric.

---

### The Beginner Phase Is the Best Phase

One comment stuck with me:

> New to all this, but it's been fun learning so far.

That's the energy that matters. Because this phase — where you question everything — is where the deep understanding forms.

Why does Ceph scale with nodes? Why does replication amplify network load? Why does block size alignment matter? Why does ECMP change throughput characteristics?

You don't really get distributed systems until they disappoint you. Then you start pulling threads.

---

### So What's Actually Happening Here?

Let's zoom out.

- ~2.7 GB/sec write bandwidth
- ~24 Gbit/sec effective throughput
- 25Gb network
- 3-node replicated cluster
- 4MB objects
- 16 threads

This is almost textbook network saturation. The drives aren't the bottleneck. PCIe Gen4 isn't the bottleneck. The CPUs (285HX) aren't gasping. The fabric is. And with only three nodes, there's limited horizontal scaling.

Add more nodes? Performance increases because replication spreads wider and aggregate bandwidth grows. Add more links? You break past the single-link ceiling. Tune block size and alignment? You might squeeze extra efficiency.
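A quick back-of-the-envelope check says the same thing (the only hedge is whether rados bench's "MB" are decimal megabytes or, as is typical, MiB):

```
2756 MB/s  × 8 ≈ 22.0 Gbit/s   (if decimal megabytes)
2756 MiB/s × 8 ≈ 23.1 Gbit/s   (if MiB, which rados bench usually reports)
25GbE line rate = 25 Gbit/s, before TCP/IP and Ceph messenger overhead
```

Either way, the wire is doing essentially all it can.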
But nothing here screams "broken." It screams "physics."

---

### The Hard Truth About "Overkill" Labs

There's something humbling about realizing your "overbuilt" lab is already maxed out.

You thought 25Gb was huge. It's not infinite. You thought enterprise NVMe meant absurd cluster numbers. It doesn't override replication math. You thought three nodes was serious scale. It's the starting line.

That's not failure. That's distributed systems pulling the curtain back.

---

### The Real Win

Here's the part that matters:

The cluster is stable. The bandwidth is consistent. The latency isn't chaotic. The system behaves predictably.

Predictability is gold in storage.

You didn't uncover a flaw. You uncovered a limit. And limits are where real architecture decisions begin. Do you:

- Add nodes?
- Add network links?
- Switch to 100Gb?
- Tune OSD settings?
- Change replication?
- Accept the ceiling?

That's not a benchmark question anymore. That's design.

---

Three nodes. Twelve NVMe drives. 25GbE. Nearly 3 GB/sec writes.

Not a failure. Just the moment you realize distributed storage doesn't care about your expectations — only your topology.

And honestly? That's what makes it addictive.