More hardware won’t fix bad engineering
Monday, September 15, 2025, 11:00 AM, from InfoWorld
As an industry, we’ve gotten good at buying our way out of bad decisions. Need more throughput? Add instances. Tail latencies get spiky? Add a cache in front of the cache. Kelly Sommers nails the root cause: Pattern-driven architectures can be organizationally tidy yet computationally wasteful. The fix isn’t another layer—it’s fundamentals. If you fund or run a backend team, data structures and algorithms aren’t an interview hoop. They are operating leverage for service-level objectives (SLOs) and cost of goods sold (COGS).
Deep down, developers already know this. Technical leaders often feel it in the COGS line when the cloud bill swells. In both cases, the antidote is the same: build a culture where choosing and shaping data structures is a first-class architectural decision, and where algorithmic trade-offs are measured the way finance measures ROI. We need, as Sommers stresses, “developers to build clean, maintainable systems that actually respect how computers work.”

Fundamentals aren’t about nostalgia

Start with a simple premise: At scale, small inefficiencies become whole features’ worth of cost and user pain. Jeff Dean’s well-worn “latency numbers” cheat sheet exists for a reason. A main-memory access is hundreds of times slower than an L1 cache hit; a trip across a data center is orders of magnitude slower again. If your hot paths bounce around memory or the network without regard to locality, the user pays with time, and you pay with dollars. It turns out that basic physics matters. A lot.

Pair that with what Dean and Luiz André Barroso called the “tail at scale” back in 2013. The 99th percentile latency is where your SLAs (service-level agreements) go to die, because in a fan-out service even rare hiccups become common. Tail-tolerant systems are as much about algorithmic and data-layout choices as they are about replicas and retries. In other words, fundamentals show up on the right side of your SLOs and the left side of your financials.

If this sounds abstract, consider Java’s HashMap. Before Java 8, an attacker who forced many keys into the same bucket could degrade lookups from an average of O(1) to a worst case of O(n), hobbling performance or enabling a denial of service. The Java team fixed this in JEP 180 by “tree-ifying” long collision chains into balanced red-black trees, improving the worst case to O(log n). That’s an algorithm and data structure decision, not a micro-optimization, and it changed the security and performance profile of one of the most used collections on earth. If you’re a VP of architecture, that’s the kind of “fundamentals” discussion you want in your design reviews.

CS101 teaches Big O notation, but in production, memory rules. Ulrich Drepper’s classic 2007 paper explains why code that looks linear can behave super-linearly once you thrash caches or wander across NUMA boundaries. Data structures and access patterns that maximize locality (think B-trees with page-sized nodes, structure-of-arrays (SoA) versus array-of-structures (AoS) layouts, ring buffers) are not academic details; they’re the difference between CPUs working and CPUs waiting. Here’s the executive version: Cache-friendly data structures turn compute you’re already paying for into throughput you can actually use.

Storage engines are data structures with budgets

Every database storage engine is a data structure with a profit-and-loss statement attached. Engines built on B+ trees, optimized for fast disk-based reads and range scans, trade higher write costs (write amplification) for excellent read locality; engines built on log-structured merge-trees (LSM trees) flip that, optimizing for high write rates at the cost of compaction and read amplification. Neither is better. Each is a conscious algorithmic trade-off with direct operational consequences: IOPS, SSD wear, CPU burn during compaction. If your workload is write-heavy with batched reads, an LSM tree makes sense. If your workload is read-latency sensitive with range scans, B+ trees often win, as the sketch below illustrates.
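To make that trade concrete, here is a minimal, illustrative sketch of the LSM idea in plain Java. It assumes an in-memory TreeMap as the memtable and in-memory sorted runs standing in for on-disk SSTables; the class and method names (TinyLsm, put, get, compact) are hypothetical, not any real engine’s API, and a production engine adds write-ahead logging, per-run Bloom filters, tiered or leveled compaction, and real disk I/O.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal, illustrative LSM-style store: an in-memory memtable plus sorted runs.
// Hypothetical sketch for this article, not any real storage engine's API.
public class TinyLsm {
    private static final int MEMTABLE_LIMIT = 4;                 // tiny flush threshold for the demo
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final Deque<NavigableMap<String, String>> runs = new ArrayDeque<>();

    // Writes are cheap: an in-memory insert, turned into a sorted run when the memtable fills.
    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= MEMTABLE_LIMIT) {
            runs.addFirst(memtable);                              // newest run kept at the front
            memtable = new TreeMap<>();
        }
    }

    // Reads pay for it: check the memtable, then every run from newest to oldest.
    public String get(String key) {
        String value = memtable.get(key);
        if (value != null) return value;
        for (NavigableMap<String, String> run : runs) {           // read amplification grows with run count
            value = run.get(key);
            if (value != null) return value;
        }
        return null;
    }

    // "Compaction": merge all runs into one, spending CPU and I/O now to make reads cheap later.
    public void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Iterator<NavigableMap<String, String>> it = runs.descendingIterator(); it.hasNext(); ) {
            merged.putAll(it.next());                             // newer runs overwrite older entries
        }
        runs.clear();
        runs.addFirst(merged);
    }

    public static void main(String[] args) {
        TinyLsm db = new TinyLsm();
        for (int i = 0; i < 10; i++) db.put("user:" + i, "v" + i);
        System.out.println(db.get("user:3"));                     // may probe several runs
        db.compact();
        System.out.println(db.get("user:3"));                     // a single run after compaction
    }
}
```

Even at toy scale the shape of the trade-off is visible: put is a cheap in-memory insert, get may have to probe every un-compacted run, and compact buys the read path back by spending CPU and I/O now.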
Your choice is a data-structure selection problem mapped onto cloud bills and SLOs. Treat it that way.

Not convinced? There’s an interesting paper by Frank McSherry, Michael Isard, and Derek Murray that asks a blunt question: How many machines do you need before your hip, cool parallel system beats a competent single thread? They call the metric COST (Configuration that Outperforms a Single Thread), and the answer for many published systems is “a lot,” sometimes hundreds of cores. If a better algorithm or data structure obliterates your need for a cluster, that’s not simply an engineering flex; it’s millions of dollars saved and an attack surface reduced.

You don’t even have to look far for a pure algorithmic win. Facebook’s switch to Zstandard (zstd) wasn’t “premature optimization.” It was a deliberate algorithm choice that yielded better compression and faster compression and decompression than zlib, improving performance and reducing storage and egress costs at enormous scale. Again: fundamentals with a business case.

“But AI changes all this…”

Some developers think AI alters the equation, and the answer is: sort of. The equation simply favors the fundamentals of sound data structures even more. Machine learning pipelines are just data structures in motion: columnar formats, vector indexes, Bloom filters, segment trees, message queues, cache layers. Poor choices cascade: ETL jobs that churn because of unbounded joins, vector stores with pathological recall/latency trade-offs, inference paths dominated by serialization overhead rather than model compute. The fastest optimization in many AI systems isn’t a bigger GPU; it’s picking the right index and batch size, structuring features for cache locality, and designing data movement like you pay for it, because you do. (A minimal Bloom filter sketch appears at the end of this piece.) If you run a backend engineering team and your design docs aren’t making data-structure choices explicit, complete with measured trade-offs, you’re probably compensating for fundamentals with infrastructure expensed elsewhere on the balance sheet.

All that said, Sommers is insistent but not fanatical on the topic. Fundamentals matter, but sometimes the right answer is to get as much good as a team will allow into their architecture: “Sometimes the best architecture isn’t about being right, it’s about sneaking good fundamentals into whatever framework your team already loves.”

Sommers is right to drag our attention back to basics. The fundamentals of computing, not the latest framework, determine whether your backend is fast, predictable, and cost-effective. If your team only hits SLOs when your “performance person” breaks out perf at midnight, you’ve built a lottery system. If fundamentals are routine, if everyone understands why the main index is a B+ tree with 4KB pages and knows where the compaction debt hides, you get predictability. Predictability is what you sell to your customers and your CFO.

It’s seductively easy to paper over fundamentals with more hardware. But in the long run, algorithmic clarity and thoughtful data structures compound like interest. They’re how you keep the promises you make to users, and to your P&L.
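Since Bloom filters came up above, here is a minimal, illustrative sketch in Java of the kind of structure that lets a pipeline skip expensive work outright. The class name, sizing, and hash mixing are assumptions made for this demo, not a tuned library implementation; the guarantee that matters is that a “no” answer is always correct, while a “yes” may be a false positive.

```java
import java.util.BitSet;

// Minimal, illustrative Bloom filter: a bit array plus k derived hash positions.
// Sizing and hash mixing are demo-grade assumptions, not a tuned library implementation.
public class TinyBloom {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public TinyBloom(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two hashes (double-hashing scheme).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.reverse(h1) ^ 0x9E3779B9;                // cheap second hash for the demo
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(position(key, i));
    }

    // A false answer is definitive; a true answer may be a false positive.
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) return false;        // definitely never added
        }
        return true;                                              // probably added
    }

    public static void main(String[] args) {
        TinyBloom seen = new TinyBloom(1 << 16, 4);
        seen.add("doc-42");
        System.out.println(seen.mightContain("doc-42"));          // true
        System.out.println(seen.mightContain("doc-99"));          // almost certainly false
    }
}
```

Typical usage is to guard a remote or on-disk lookup with mightContain and only pay for the expensive path when it returns true.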
https://www.infoworld.com/article/4056660/more-hardware-wont-fix-bad-engineering.html