Scale and Scalability [2010]

Few people are granted the gift of observing the onset, the heyday, and the downfall of multiple eras of human endeavor. I’ve witnessed and chronicled what would appear to be the lifespan of global social media. I also stood at the precipice of another technological time span. Today, we are seeing a period of genuine innovation in database infrastructure. It’s good to see something alive and kicking. The beginning of this era, and the situation that gave rise to it, seems like just the other day. And that’s because, as Bill Murray once sang, it was.

This essay ran in the software development blog Input/Output, published by Hewlett-Packard (prior to the HP / HPE split) in 2010.

Chawton, Jane Austen’s home, as photographed by Simon Burchell in 2013 [Wikimedia Commons]

“Tell you what,” the development tools vendor told me on the day Microsoft officially launched Windows Azure, “this solves the whole scalability problem forever, doesn’t it?  Just stick your application in the cloud.  Infinite scalability!”

Scalability, we’ve been told, is the inherent ability of an information system to acquire more resources and continue to perform normally.  But before a business invests in any bigger or better resource — for example, Microsoft Exchange or SharePoint or, for that matter, Windows Server itself — it’s sold on the premise that when the business gets bigger, the resource can, too.  The system will be just as affordable/efficient/practical in five years’ time as it is today.  This simple, if simplistic, notion of scalability made perfect sense just five years ago.

But there are fundamental presumptions implicit in that notion which, in the circumstances we find ourselves in today, no longer truly apply.  The availability of cloud computing, the versatility of virtualized processing, the global accessibility of the Internet, and the commoditization of both processors and storage are like four cosmic forces that converged at the same moment.  As a result, the capacity of a company’s information systems is no longer directly proportional to its mass.  For segments of the economy where information itself is the key product, the very meanings of “big” and “small” have become skewed.  Even beyond “Web x.0” markets like social networking and Web publishing, the enormous potential of individuals and small partnerships to produce goods and services competitive with those of global enterprises has forced companies to stop putting off the inevitable and rethink the very meaning of their businesses.

The Beginning of a Universe

Penn State University published a brilliant 2003 case study of exactly this phenomenon, as described by Robert Marti, a systems architect for leading reinsurance firm Swiss Re.  The scope of change Marti foresaw as necessary for Swiss Re during this pre-cloud period was something he called the “Big Bang.”  But since he could not make the business case for divorcing his company from decades of business logic set in stone, even if the benefits could be measured in francs, he advised an alternate methodology that met with the more granular, “applications-fielded” (read: siloed) demands of his employer.

“While the need for integrating the various information islands both along the value chain as well as across product lines and/or organizational units is unquestioned,” Marti wrote, “it is nearly impossible to make a sound business case for ‘big bang’ approaches such as developing (detailed) Enterprise Data Models or even a single integrated Enterprise Data Warehouse.”  Budgeting alone, which takes place on departmental levels, rendered impossible the funding of any project that would do away with, let alone cross, departmental boundaries.

So the approach he suggested was twofold:  First, build a new, comprehensive, high-level architectural framework — the way the system would have been devised had Swiss Re been founded yesterday instead of 1863.  Second, slowly, granularly invite each individual department in turn to adapt the framework, making revisions where necessary, for what Marti himself dubbed “piecemeal development and integration.”

That was “scaling up” circa 2003, when Marti’s remodeling project was already four years old.  The fundamental transformations in information systems since that time have changed the status of Marti’s goals from “impossible” to “urgent.”  In a completely new project begun and completed in 2010, Swiss Re invested in what it called a “knowledge sharing” platform, pairing a social network with a sophisticated content management system so that employees could freely exchange ideas with one another and feel more empowered.  Swiss Re then deployed that platform, called Ourspace, on a leased cloud in April, in a project it said took only 65 days.

It is clearly an attempt to embrace “the new,” or at least some portion of “the new.”  But it is not the fundamental re-architecture of business logic, the “Big Bang” that Marti suggested seven years earlier.  In fact, you might say it’s a big, cloud-based suggestions box.  It’s what “intranets” were for businesses during the 1990s: a way to develop a candy-coated shell that camouflages the problem.

The situation reminds me of how astronomers stared literally for years through the Hubble Telescope toward the edge of the universe, in hopes of gathering enough data to tell them the rate at which the expansion of the universe is slowing down.  By working the problem backwards, they could approximate the age of the universe, and pinpoint the moment when the Big Bang occurred.  When the data was finally computed, the deceleration they had expected to measure came back with a minus sign in front of it: the expansion isn’t slowing down at all; it’s speeding up.  Not even the universe itself scaled up the way we expected.  Years later, folks still cling to the original models, and to the comfortable assumptions built atop them, despite the big minus sign in front of the answer.  A universe without our original preconceptions would just be too uncomfortable to deal with.

A universe where businesses fail to scale up, and computing resources fail to stretch like a tube sock to meet their needs, is uncomfortable, daunting, scary.  For certain businesspeople, scalability has become as important and fundamental a principle as democracy, capitalism, or derivatives.  Now “the cloud,” once touted as the ultimate engineering answer to scalability, confronts both system architects and software developers with a harsh and painful reality:  To do business in this universe, we have to start completely over.

The Re-architects

“With the people we’re talking to, the first step is helping them come to the conclusion that whatever they have isn’t working,” says Bradford Stephens.  He’s an engineer and consultant whose startup firm, Drawn to Scale, is sought after by businesses coming to terms with this new, foreign universe.  “They come to it the hard way.  Either they’ve lost data or they’ve had to change their business model, which is surprisingly common.  So once they make that realization, it’s more like, ‘Okay, how do you translate from talking about this relational world to talking about this scalable, big-table world?’”

Stephens is among a new breed of system architects and developers (and, more frequently now, both) who were the first to realize, in every literal sense the word implies, the scale of the problem at hand.  He is also the first to admit there are no set solutions, no best practices, no templates — at least not yet — for remodeling business applications.  There are simply too many unique factors in each case:  Businesses scale down, they merge with one another, they get acquired, they shed departments, they absorb other departments, they outsource various tasks (sometimes seemingly at random), they cease to exist for certain intervals and are resurrected under new names.

“What we’ve found is, these companies who are experiencing these big data scalability pain points, of course, wish they had tackled the problem earlier, but these sorts of problems you don’t realize you have until you try to solve them.”

What cloud computing enabled, within months and in some cases weeks of its public availability, was for workshop-sized companies — many of them fresh startups — to deploy service-oriented architectures, using lightweight and often open-source frameworks, establishing instant information services for clients on an Internet scale.  No less importantly, what virtualization brought forth was a radical reorganization of the fundamentals of system architecture, such that the resources any company has at hand at any one time to process a job switched from a constant to a variable.  Suddenly, someone’s basement business could have enough processing power to serve as many customers as a multi-billion-dollar enterprise, for just the few days or hours it needed that power.  And unlike the enterprise, it could drop that power the moment it no longer required it.
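To put that shift in the crudest possible terms, here is a sketch of my own (the throughput figure is invented) of capacity as a variable rather than a constant: rent only the workers this hour’s demand requires, and hand them back afterward.

```python
import math

# Illustrative only: capacity as a function of the moment's demand,
# not a fixed asset on somebody's depreciation schedule.
REQUESTS_PER_WORKER_PER_HOUR = 50_000  # assumed throughput of one rented instance

def workers_needed(requests_this_hour: int) -> int:
    """Rent just enough instances to cover this hour's demand."""
    return max(1, math.ceil(requests_this_hour / REQUESTS_PER_WORKER_PER_HOUR))

# A quiet night, a lunchtime spike, then quiet again.
for hour, demand in [("03:00", 4_000), ("12:00", 2_250_000), ("23:00", 9_000)]:
    print(f"{hour}: {demand:>9,} requests -> {workers_needed(demand)} worker(s)")
```

The enterprise of five years ago bought hardware for the 12:00 spike and let it idle the other 23 hours; the basement business rents 45 workers for an hour and one for the rest of the day.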

Notes Stephens:  “FlightCaster, a little four-person startup, trolls through dozens of gigabytes of data a day to predict if your flight is going to be late.  Not that long ago, information on that scale was only generated by really big businesses, because only they had the ability to generate it.”

I know what you’re thinking:  Isn’t that supposed to be a good thing?  For FlightCaster, yes, but not for enterprises like Sabre Travel — the modern culmination of American Airlines’ multi-billion-dollar Sabre network — whose information services FlightCaster’s iPhone app just outmoded.  The problem with “disruptive technologies,” to borrow a phrase from Microsoft chief software architect Ray Ozzie, is that they’re so damn disruptive.

The sudden, and in some cases catastrophic, impact that disruptions such as this have had on businesses, and on the economy in which they function, has swept Microsoft itself into a role it never expected to play: counseling.

“The first thing every architect needs to know is, ‘You are not alone,’” reassures Justin Graham, Microsoft’s senior technical product manager for Windows Server.  “Seek help from a Microsoft Partner and/or Microsoft Services to get on the right track.  In the current environment, we understand budgets are tight, which is why Microsoft TechNet, MSDN, and User Group Communities are available to share information and assist.  From a process perspective, make sure to not overlook the opportunity a re-architect provides to optimize the infrastructure.  Thinking about it solely as, ‘How do I merge these technologies?’ [may not be as helpful as] ‘What is the best and most optimized infrastructure that can make the new organization successful?’”

For more and more businesses, the re-architect is the counterpart of the clinical psychologist.  Whether hired as a consultant or full-time, he enters the situation knowing that the only way to find a path toward a solution — the only way he can “optimize the infrastructure” — is by divorcing the business from its own false perceptions and bad habits.  One of those habits is throwing new hardware at the problem: the traditional route for “scaling up.”

“People need to change their mindsets from buying hardware by default, to that of a small company where they can’t afford to buy hardware, back in the day,” remarks Sean Leach.  Newly installed as the Chief Technology Officer of Web registrar Name.com, Leach has had recent experience as an architect for what could be described as the ultimate high-scale application: UltraDNS, a real-time, high-security extension to the Internet’s Domain Name System, providing a constantly updated directory of verified DNS addresses to which businesses subscribe.  As far as scaling up is concerned, Leach has been to the mountaintop.

“Cloud computing is actually making this problem a little bit worse,” states Leach, “because it is so easy just to throw hardware at the problem.  But at the end of the day, you’ve still got to figure, ‘I shouldn’t have to have all this hardware when my site doesn’t get that much traffic.’”

Failure of Scale

The core of the problem — which Stephens, Graham, and Leach all attack, each in his separate way — is that existing business applications were not designed to scale, or mutate, or metamorphose in the way they’re being pressured to do.  In an older world, a business would find a way to invest in more horsepower, buy new hardware, scale up.  But that presumed that the organizational structure of the business was its constitution, and that as it grew — as all things seem to grow, linearly — the structure would simply magnify.

“Traditionally, not only do companies think about databases, but also their silos,” noted Bradford Stephens.  “You’ve got your customer transaction database, your BI database, your HR database, and the database that powers your Web sites.  And this data is copied and replicated so that you’ve got a customer in your billing system and one in your CRM.  So people think about data not as data, but like vertical units.  This is my BI data, my data warehouse data, my transaction data.” Regardless of the inefficiencies and redundancies introduced when relational databases aren’t put to use for the job they’re designed for — relating — each department’s ownership of its own pocket of data is respected at all costs.  Scalability breaks here.

In a way, it’s impossible for this way of thinking not to have become ingrained into companies’ operations, because of the way they manage budgets.  Each department, like a rival sibling, scuffles with all the others for bigger outlays.  So to demonstrate to the CFO that it deserves more, it consumes more...more bandwidth, more gigabytes, more processors.  “Years ago, if you were the only guy in an enterprise with lots of data, the ability to spend money was proportional to how much data you generated.  It was linear,” says Stephens.  As a result of this thinking, Microsoft, IBM, and Oracle have historically been only too happy to oblige.

In just a few years, the rise of high-bandwidth media on the Internet has enabled one person, or a small group, to consume colossal amounts of data — terabytes per person — so that consumption rate no longer implies either relative size or worth.

It’s here, Stephens says, where scalability shows up where businesses want it least.  When businesses simply relocate their existing information systems model to the cloud, their problems become magnified at Internet scale.  “Twice as many connections generate four times as much data, and 10 times as many connections generate 100 times as much data... Little mistakes can have exponential impact.  When you run up really large scales — for example, if your code isn’t efficient and you’ve got a wrong loop somewhere — not only are you increasing network traffic 200% across your little, tiny network of two machines...but you’re going to be increasing network [traffic] 500% across [all these cloud] machines that you’ve rented.  That’s an extremely costly mistake.  So in a distributed, scalable world, you have to have metrics and you have to have cost analysis.  You have to plan for that from the beginning.”
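Stephens’ arithmetic is easy to check for yourself.  Here is a toy sketch of my own (not his code, and the numbers are invented) of why chatter between machines grows roughly with the square of their count, which is what turns a harmless inefficiency on two boxes into a real bill on a rented fleet.

```python
# Toy model: every node exchanges one message with every other node,
# so traffic grows with n * (n - 1) -- roughly the square of the node count.

def all_pairs_traffic(nodes: int, bytes_per_message: int = 1_000) -> int:
    """Total bytes on the wire if each node messages every other node once."""
    return nodes * (nodes - 1) * bytes_per_message

for n in (10, 20, 100):
    print(f"{n:>3} nodes -> {all_pairs_traffic(n):>12,} bytes")

# Doubling from 10 to 20 nodes roughly quadruples the traffic; going 10x
# (10 -> 100 nodes) costs roughly 100x. Harmless on two machines you own,
# expensive across a fleet you pay for by the hour.
```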

So Microsoft is advising these re-architects to turn an introspective mirror toward their own businesses.  “A best practice of focusing on management will serve a system re-architect very well,” advises Justin Graham.  “If the architect is trying to merge two organizations, or re-architect to meet the changed priorities of the business, any problems that existed in the past will exist in the future if a straight migration approach is taken.  Think about how management and an optimized process can help you re-architect the infrastructure to be agile and scalable.”

Bradford Stephens agrees:  “Scalability does not imply efficiency.  You may have a million boxes doing something, and not doing it particularly well.  When you build architectures, the first thing you have to worry about is scalability, because you can’t back-fill it.  It’s nearly impossible.

“Efficiency is incredibly important because it saves you money; and in this cloud world, in fact, it’s actually more important to be efficient because the impact of inefficient code is so much greater, and it can be measured.  If it takes me five boxes to handle 20 transactions per second, and then I can make it so I can handle 40 transactions per second, that’s something you can measure and you can justify spending engineering output on.  In sort of a cloud-ish or scalable world, that translates directly into saved money and saved time.”

Sean Leach is happy with the notion of turning that mirror even closer towards oneself:  “Ninety-nine percent of the time that I’ve seen performance problems that are blamed on the database, it’s actually the person who wrote the queries or [who wrote] the application on top of the database.  So there’s no magic ‘fast = true’ flag you can set; every database is very similar.  Some of them scale better with a lot of records... But at the end of the day, it’s the person who writes the application that will be the reason for it being slow, not the software itself.”

It’s not that scalability does not, or should not, exist.  It’s that we should divorce the growth pattern of the business from that of its information systems.  Rather than use tools such as Windows Workflow Foundation to model application tasks around what people do (especially if their departments may cease to exist in a few years), model the application instead around what the information systems should do.  Let the cloud disrupt the way of thinking that binds users, and departments of users, to machines rather than resources.  Then build front ends to model how people should use those systems.  If a company’s information systems are designed well from the outset, our experts tell us, with a loosely coupled approach between processors and business methods, then the company could completely mutate to unrecognizable proportions, and yet its systems’ logic may remain sound.

However, Sean Leach does allow us some breathing room: “Sometimes you have to use hardware.  But what you generally find is, the people up front don’t take the time to plan ahead.  There’s two trains of thought:  There’s just ‘get it out there,’ and if you have to scale, then that’s a good problem to have, worry about it later.  Ninety-nine percent of applications never get any traffic, so they don’t have to scale, right?  That’s one train of thought.  The other one is, you can spend six months trying to figure out, ‘How am I going to scale this properly?’  You design it from the beginning...and then you don’t actually ever launch something.

“The hardest point, the trickiest part is finding that happy medium where you don’t spend all your time up front trying to figure this out, but you’re at least designing these systems such that it’s not a complete rewrite when the thing gets popular... What I’m saying is, design it up front so you don’t have to throw the hardware at it so early in the game.  If there comes a time where you really do need to throw the hardware at it, then fine, make sure that your system can support it.  But the goal should be that you shouldn’t have to throw that extra hardware at it until you really need it.”

Data Scales By Itself

Depending on the growth pattern of the business, conceivably, its logic may not ever have to scale.  What will scale, almost regardless of the evolutionary path of the business, is its data.  Thus, suggests Stephens, businesses should design applications that don’t require incremental rescaling just to account for periodic explosions in data consumption.

“Just because your application scales doesn’t mean your data does.  Data is what drives businesses; data is the important part,” says Stephens.  “As the world becomes more connected, and you get data from lots of different sources, we have to solve this data scalability problem now...  You have to rethink everything you do with data from the bottom up... If your application is well-designed, you should only have to change your data layer.  Your front end should be totally independent.  But you’re going to have to go in and write queries, or make certain assumptions that you’re talking to a distributed cluster, and you’re going to re-architect your data layer — not your whole application.  Re-architect your data layer for that new reality.”
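To make the seam Stephens describes a little more concrete, here is a rough sketch of my own (the names are hypothetical, and both stores are stand-ins rather than anything Drawn to Scale ships): a front end that only ever talks to a narrow data-layer interface, so the relational store behind it can later be swapped for a distributed one without touching the rest of the application.

```python
import sqlite3
from typing import Optional, Protocol

class CustomerStore(Protocol):
    """The narrow seam the front end is allowed to see."""
    def get_email(self, customer_id: str) -> Optional[str]: ...
    def put_email(self, customer_id: str, email: str) -> None: ...

class RelationalStore:
    """Today: a single relational database (SQLite standing in for the real thing)."""
    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, email TEXT)")

    def get_email(self, customer_id: str) -> Optional[str]:
        row = self.db.execute("SELECT email FROM customers WHERE id = ?", (customer_id,)).fetchone()
        return row[0] if row else None

    def put_email(self, customer_id: str, email: str) -> None:
        self.db.execute("INSERT OR REPLACE INTO customers VALUES (?, ?)", (customer_id, email))

class DistributedStore:
    """Tomorrow: a key/value cluster (a plain dict standing in for it here)."""
    def __init__(self) -> None:
        self.kv = {}

    def get_email(self, customer_id: str) -> Optional[str]:
        return self.kv.get(customer_id)

    def put_email(self, customer_id: str, email: str) -> None:
        self.kv[customer_id] = email

def front_end(store: CustomerStore) -> None:
    """The front end never knows, or cares, which store it is talking to."""
    store.put_email("c-1001", "customer@example.com")
    print(store.get_email("c-1001"))

front_end(RelationalStore())   # the data layer you have today...
front_end(DistributedStore())  # ...and the one you re-architect toward
```

Only the class behind the seam changes; the front end, and whatever business logic sits on top of it, stays put.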

Among the tools Microsoft has developed to that end is one that recognizes that these terabytes per person aren’t really relational data at all, but rather documents, tagged by records that clog those databases.  So in Windows Server 2008 R2, Justin Graham tells us, the company implemented File Classification Infrastructure as a way for data layers based on document retention to evolve sensibly.

“FCI allows administrators to apply classification rules to documents on file servers,” said Graham.  “These classified files can then have actions taken against them based on their classification.  The best part, these classifications are carried to SharePoint if the file moves.  This is one example of Microsoft taking a solutions approach to document management.”

Some of the alternative approaches Stephens suggests are indeed quite radical, including a frame of mind he calls “NoSQL” — avoiding the use of a relational database in circumstances where tabular frameworks (employee ID / e-mail sent to customer / customer ID / filename of document / document ID, send date, receive date...) are too binding.  Just as inflexible business models stifle the scalability of applications, unfathomable schemas, Stephens believes, stifle the scalability of data.  And as big as data is becoming, simply moving it to the cloud amounts to nothing more than relocation.

“If you’re still using a traditional relational database, and at some point in time it’s not going to hold your data, or you think your business model might change, or growing your customers might add more data, that’s not really a cloud at all.  That’s just renting a server.  You may be able to scale your application as much as you want, but if you can’t scale your data, then you kinda suck.  And it’s going to affect your business model.”

Leach points to the rise of new, relatively simplistic, non-relational, yet highly scalable database systems such as Apache’s Cassandra project and the open source Redis project, as enabling businesses to deploy associative databases using simple key/value pairs (document ID -> document location).  Both, he says, enable you to “mix and match technologies so you don’t just have to rely on a relational database.  Relational databases are very good at certain things, but some things might be overkill, where you might need a simple key/value pair.”
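For the curious, here is about the simplest possible illustration of that key/value shape, using the open source redis-py client.  It assumes a Redis server running at its default local address, and the key and document location are mine, purely for illustration.

```python
import redis

# Assumes a Redis server is listening on localhost:6379 (the default).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The entire "schema": one key per document ID, whose value is the document's location.
r.set("doc:48213", "//archive/contracts/2010/48213.pdf")

print(r.get("doc:48213"))   # -> //archive/contracts/2010/48213.pdf
print(r.get("doc:99999"))   # -> None: no row, no join, no table to alter
```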

Scale by Leaps and Bounds

One of the most frequently cited modern scalability case studies involves the global messaging service Twitter.  In a case which Twitter itself could not help but make public, it has undergone at least four complete architectural overhauls since its launch just a few years ago, as systems designed for a few thousand simultaneous users suddenly found themselves servicing 350,000.  Each of Twitter’s foundational components (especially the Web application framework Ruby on Rails) was blamed for what appeared at first to be scalability roadblocks.

However, in perfect retrospect, it’s entirely plausible that if Twitter’s architects had designed the system from the beginning to undergo these same changes — if they had been planned rather than unanticipated — they might very well have made the exact same architectural choices each time.  Each choice may have been the right one, if only for a few months.

Perhaps, as Name.com’s Sean Leach advises, there’s a lesson to be learned from Twitter:  Rather than planning to scale incrementally — which, if a business finds or regains success, may now be impossible — it should plan to rework its fundamental architecture as needed, in phases, as old architectures that met earlier requirements are no longer applicable.

The numbers Leach plugs into his formula may sound a bit fanciful, but in light of Twitter, perhaps the sky’s the limit:  “Let’s say, the biggest you’re ever going to get is a trillion customers.  But instead of designing for a trillion customers, design to a million customers so that when you get halfway there, you can redesign the system over time to be able to support a billion customers.  Then when you get to almost a billion... just build it and get it out there, and then you spend your time up front and don’t worry about scaling.  Plan in phases where you scale to X, and then when you get close to X, you start thinking about Y...as opposed to waiting until X happens before you worry about scaling.

“Take the time to sit down up front and say, ‘What would we look like if we got really busy?’  And then plan to that.  That’s Application Design 101:  What should our hardware look like today, what will it look like in two years, and then what would we need to do to be able to make the system support what we look like in two years?  That’s simple.  You’d think that would be something everybody did, no matter what.  But it’s not always the case.”

In light of the often daunting tasks that system re-architects face today, Bradford Stephens offers a frame of mind that he calls, “How to Make Life Suck Less.”  It’s based on a simple concept of failure:  It will happen.  Thus, plan for redundancy such that when components fail, they get disconnected and maybe replaced, maybe not.  But you still get some sleep.  While that sounds like a dangerous camouflage for throwing hardware at the problem, at one level, it’s really not:  Virtualization and the cloud make it feasible, and even affordable, to follow Leach’s milestones: to rescale by powers of 10 rather than multiples of 2.
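Here is a toy sketch of my own of that mindset (not Stephens’ code, and the failure rate is invented): keep redundant workers, try them in turn, and quietly retire the ones that fail rather than waking anyone up at 2 a.m.

```python
import random

# Toy model of "plan for failure": several redundant workers, any of which
# may be down at any moment. A failed worker is dropped from the pool,
# to be replaced later -- or not.

def flaky_worker(name: str):
    def handle(request: str) -> str:
        if random.random() < 0.3:               # simulate an outage
            raise ConnectionError(f"{name} is down")
        return f"{name} handled {request}"
    return handle

pool = {f"node-{i}": flaky_worker(f"node-{i}") for i in range(4)}

def handle_with_failover(request: str) -> str:
    for name in list(pool):                     # iterate over a snapshot of the pool
        try:
            return pool[name](request)
        except ConnectionError:
            del pool[name]                      # disconnect it; nobody loses sleep
    raise RuntimeError("all replicas are down")

for req in ("req-1", "req-2", "req-3"):
    print(handle_with_failover(req))
```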

“We can sort of see the destination in the distance, and we see what we think the path is, but it may be kind of curvy,” Stephens warns.  “We may not know what’s right around the bend... I think it’s the sort of thing where, if you do it right, you’ll know because you won’t be getting the 2 a.m. phone calls, and deploying some buggy code won’t bring down your entire network.  There will be a lot of roadblocks in the way, and there’s going to be a lot of emerging best practices.  But there’s no set process that people go through when they say, ‘I need to scale my data infrastructure,’ or, ‘I need to evaluate a scalable data platform.’  We’re not there yet; and of course, we will be, because many, many people will have to tackle this problem.  But it’s a transitional period.”

We end with the one conclusion that articles on topics of this scale should perhaps never leave the reader with:  We don’t know the next steps on this road.  This is uncharted territory.  What we do know is that businesses can no longer afford to develop solutions around the edges of the core problem.  They can’t just point to the symbols for their systems’ various ills (latency, load imbalance) and invest in the symbols that represent their solutions (scalability, the cloud, CMS platforms, social network platforms) as tools for postponing the inevitable rethinking of their business models.  Throwing your business model at the cloud doesn’t make the symptoms go away; indeed, it magnifies them.  Symbols aren’t solutions.

I’m reminded of when Charlie Brown’s friend Lucy famously got her first political cartoon printed in the newspaper, a cartoon in which she had meticulously devised an appropriate symbol for every one of the world’s problems.  When she asked — hoping for some praise and admiration — whether he thought it would solve the world’s problems as she so earnestly intended, Charlie Brown responded, “No, I think it will add a few more to it.”
