One of the biggest frustrations I have as a software engineer is when one of my coworkers “Requests Changes” on one of my pull requests because I don’t handle a particular edge case that hasn’t come up yet. Or maybe they like my feature, but think that I should go ahead and make it a little more optimized before it gets in front of customers.
There are times, obviously, when concerns like this are valid. When working for a big tech company (the setting that engineers are trained to operate in), the slightest inefficiencies in database operations can be very expensive. Similarly, if your code is likely to be used by teams that won’t know the intricacies of what you’re designing, handling edge cases that aren’t yet expected can be important to providing a good abstraction boundary for your distant coworkers. But there are many cases, especially in close-knit startup teams, when these conditions…simply don’t matter.
You have to know when writing code or building a system/product of any type whether it’s time to experiment or time to optimize. If it’s time to experiment, then you’re better off ignoring these types of edge-case comments. If you know what you’re working on is going to immediately be deployed to thousands or millions of people, fine — you can sweat the details all day long.
Although the last twenty years or so of product development in Silicon Valley have recognized the value of “fail fast” and “move fast and break things” experimentation, I don’t think the day-to-day experience of engineers is optimized to exploit this. It’s very hard, if not impossible, to tell a coworker frankly that they need to spend less time thinking about potential side effects of their code and instead focus on getting things shipped fast. Usually managers are more focused on limiting downside than capturing upside, and/or were trained in a different environment where mistakes were more expensive. This leads to engineers over-optimizing their code instead of focusing on what startups really need — faster iteration.
There are some really interesting examples of premature optimization, both within tech and in the world more broadly, that I can’t help but bring up. My favorite actually comes from one of my hobbies, cross-country skiing.
Within nordic skiing, there are two styles, called “classic” and “skate”. In classic skiing, your skis stay parallel, and you kick one leg forward at a time in tandem with your arms. This style has been perfected and optimized for centuries. There’s special optimally-sticky wax (distinct for every snow type and temperature), professionalized equipment, and scientific training regimens focused on getting every last bit of speed out of this style.
For a while, everyone sort of assumed this was the fastest way to move yourself on skis across land. That started to change with Bill Koch, the amateur American athlete who shocked the world by winning a silver medal at the 1976 Olympics, and who in the early 1980s figured out that if you wax your skis smooth like downhill skis and push off your legs like an ice skater, you can actually go significantly faster (something like 20% faster, in fact). Since the rules back then didn’t dictate how you were allowed to ski competitively, Koch rode this totally wild style to the overall World Cup title in 1982, creating a new category no one ever anticipated would exist.
The nordic ski world, as understood before Koch, was figured out. It had become a science of marginal gains, the end state of any mature industry or product today. But what Koch did, as an amateur from a country that had never been competitive in nordic skiing (the USA had never won a single Olympic medal in the sport before him), was rethink from first principles what makes for speed. It’s a beautiful, risky, and maybe uniquely American accomplishment. Today, most new skiers are introduced to skate skiing first — after all, it’s easier, faster, and skate trails require far less grooming than classic trails do.
This happens even in supposedly well-developed industries like computer chips. The modern war between old-style CISC (complex instruction set computer) designs, pioneered by old-guard players like Intel, and newer RISC (reduced instruction set computer) designs is a perfect example. CISC hardware was built in an era when most programs were written by engineers who understood low-level machine code and memory was expensive, so it aimed to pack as much work as possible into each instruction. It did this by allowing variable-length instructions, so that clever programmers could issue a single dense instruction to do fancy things like modify the value of one memory location while simultaneously changing another memory location’s value based on a computation. This model worked great when most programmers knew how to use these complicated instructions and before anyone imagined that future processors would execute instructions in parallel.
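To make that contrast concrete, here’s a toy sketch in Python (the instruction names and the memory/register model are made up for illustration, not real x86 or ARM): the same bit of work expressed as a single dense CISC-style memory-to-memory add versus a RISC-style sequence of loads, an add on registers, and a store.

```python
# Toy sketch, not real x86 or ARM: the instruction names and the
# memory/register model below are made up for illustration.

memory = {0x10: 7, 0x20: 5, 0x30: 0, 0x40: 0}
registers = {"r1": 0, "r2": 0}

# CISC flavor: a single "add [dst], [src]" that reads and writes memory
# directly -- dense for the programmer, complex for the hardware to decode.
def cisc_add_mem_mem(dst_addr, src_addr):
    memory[dst_addr] += memory[src_addr]

# RISC flavor: only loads and stores touch memory; arithmetic stays in
# registers. More instructions, but each one is simple and uniform.
def risc_load(reg, addr):
    registers[reg] = memory[addr]

def risc_add(dst, src):
    registers[dst] += registers[src]

def risc_store(reg, addr):
    memory[addr] = registers[reg]

# "Add the value at 0x10 into a destination cell," expressed both ways:
cisc_add_mem_mem(0x30, 0x10)   # one dense instruction updates 0x30

risc_load("r1", 0x40)          # ...versus four plain ones updating 0x40
risc_load("r2", 0x10)
risc_add("r1", "r2")
risc_store("r1", 0x40)
```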
Things have changed, though. Now, most programmers (including myself lol) have no idea how to utilize these CISC hardware features, many of which were optimizations built for situations that rarely come up in real-world code. Not only that, we rely on compilers to translate our abstracted-out code into the bits that get fed to a processor and dictate how tiny circuits of silicon manipulate the data we tell them to work on. At the same time, processors have run up against physical limits on clock speed (the number of cycles a processor can complete per second). To move forward and deliver faster performance, chip designers have found clever ways to split programs into instructions that don’t depend on each other and can thus be executed in parallel by different execution units on the chip.
This is where CISC has found itself in deep trouble. Because CISC, in an effort to optimize the old paradigm, allowed variable-length instructions, it has proven really difficult to build this kind of parallelism natively into CISC architectures. Before a processor can send some instructions to one execution unit and others to another, it must first figure out where each instruction begins and ends; those boundaries are not easily computed in advance when each instruction might be a different number of bytes, as in CISC. In RISC architectures, by contrast, every instruction is the same fixed length in memory, so figuring out how to split instructions across parallel units is much easier and is oftentimes even handled at the compiler level.
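Here’s a minimal sketch of that boundary problem in Python, using toy encodings I made up rather than any real instruction set: with fixed-length instructions every boundary falls out of simple arithmetic and the decode work can be split up front, while with variable-length instructions you have to walk the byte stream one instruction at a time just to learn where each one starts.

```python
# Minimal sketch of the decode-boundary problem; the encodings here are
# toy conventions, not any real instruction set.

def fixed_length_boundaries(code: bytes, width: int = 4) -> list[int]:
    # Every instruction is exactly `width` bytes, so the Nth instruction
    # starts at N * width. Boundaries are pure arithmetic, which is why
    # the decoder can hand different chunks to different units up front.
    return list(range(0, len(code), width))

def variable_length_boundaries(code: bytes) -> list[int]:
    # Toy rule: the first byte of each instruction says how long it is.
    # You can't know where instruction N starts without walking through
    # instructions 0..N-1 first, so the scan is inherently sequential.
    boundaries, i = [], 0
    while i < len(code):
        boundaries.append(i)
        i += code[i]
    return boundaries

fixed = bytes(16)                                  # four 4-byte instructions
variable = bytes([2, 0, 5, 0, 0, 0, 0, 3, 0, 0])   # 2-, 5-, and 3-byte instructions

print(fixed_length_boundaries(fixed))        # [0, 4, 8, 12]
print(variable_length_boundaries(variable))  # [0, 2, 7]
```

Real decoders are vastly more sophisticated than this, of course, but the asymmetry is the point: one problem parallelizes trivially, the other resists it.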
This is why Intel stock has been a terrible buy recently and why Apple is investing in designing and building its own M-series chips instead of relying on Intel internals for the job.
I guess the lesson of these stories is that architects should always be aware of the assumptions under which they are optimizing the performance of the product they’re building. These assumptions are oftentimes weaker than they appear. Blindly optimizing a flawed design will only lead you to a local maximum, and blind you and your organization to the metrics that really matter.
Engineers, by nature, like to optimize and engineering culture does not naturally incorporate enough of the big-picture thinking that can lead an organization to think about whether it’s optimizing the right things.
So please, don’t Request Changes on my PR if it’s not gonna break anything…