State of Change, Chapter 19: The New Model of Data
“Database management involves the sharing of large quantities of data by many users — who, for the most part, conceive their actions on the data independently from one another,” wrote Dr. E. F. Codd, the creator of the relational model for data. “The opportunities for users to damage data shared in this way are enormous, unless all users, whether knowledgeable about programming or not, abide by a discipline.”
In Dr. Codd’s world, data was a fluid substance that needed to be channeled and cultivated in the storehouses of his time, the “data banks.” His argument was that the only way to ensure that data accurately reflected some sort of truth, and that it could efficiently point the way toward a possible course of action by a business, was the universal enactment of a set of standards and practices, and the implementation of a common language – his discipline, which led to SQL. In a modern context, one could say Codd advised a networking of the people who administer data, coupled with a centralization of data around an economy of principles.
The first commercial Internet ran contrary to Codd’s advice. It established a torrential sea of asynchronously communicating network hosts, all capable of servicing each other’s requests for data anonymously. The first efforts to establish some kind of centralized core, retaining a universal directory of data published on the Web by all its participants — the first Web portals — eventually failed. What did succeed was the search engine: a device that scans the contents of published documents after the fact, and generates an index of their content based on semantic relationships — educated guesses. To this day, search engineers refine the means by which those guesses are educated. The Web has become an expanding mass of text, one that has yielded fewer tools for businesses, governments, and schools to organize and make sense of it all than originally promised.
Yet a new and emerging model of the Internet is taking the place of the original model, such as it was. Cloud servers with extensive virtualization have changed what it means to be a “host.” An IP address becomes a point of contact for a much larger, more pliable and mutable construction of processors, memory, and storage. In this world, a database looks more and more like a compromise: a hybridized amalgam of Codd’s tower of perfect relations and the wild, wild Web.
There is, at last, some hope. The new and rapidly evolving industry of big data is centered around a cluster of technologies that enables vast amounts of unstructured, unprocessed data to be useful, practical, and analyzable without the need for vast, contextual indexing after the fact.
Scalability Replaces Enormity
It’s called “big data” for three reasons: first, because it really is different from the universe of data we have come to know; second, because it arose from the desperate need for what I called just a few years ago “scale and scalability;” third, because there were no English-language words or phrases left that weren’t already being used as names for pointless startup apps.
Yet the phrase tends to give people the wrong impression. Here, in summary, is the real story of big data: Operating a business efficiently is a science. The data that a business generates for itself is no longer sufficient to give it comprehensive insights about its market or about the economy. The Internet enables both the collection of data from multiple sources, and the distribution of that data among multiple data centers. As I explained for our series on business intelligence (BI), both amateur and professional analysts are looking for correlations in the patterns of business activity and consumer behavior, in hopes of learning how to repeat the patterns that yield the greatest revenue.
But in the act of being analyzed, data can outgrow the storage media that contain it. What’s more, as the sizes of these databases grow linearly, the processing time expended in maintaining them grows exponentially. The relational model, which has governed the structure and processing of data since Codd introduced it in 1970, has been steadily failing.
The solution, engineers discovered, is threefold: a new file system that enables data clusters to span multiple volumes; a new data structure that presents data as encoded documents rather than indexed tables; and a new distribution scheme that introduces new levels of fault tolerance through strategic redundancy and replication. The triumvirate of components that provide this solution are becoming the new realities of everyday existence in the data center: Hadoop, MongoDB, and Cassandra.
Collectively, these open source components are the most positively disruptive technologies ever created for servers. Their very existence has forced us to revise not just our understanding of the science of databases, but the lexicon we use to explain it.
Data – An encoded form of information, either in transit or at rest, that represents a state, a change, or an event. The change here is subtle, but profound: Data no longer sits someplace, waiting for something to happen to it. It’s in motion, like blood, part of a massive circulatory system.
Database – An assembly of data made available for processing and analysis. Up to now, a database had been a box for processed data, like a Tupperware container for prepared food. It was the place where you found data. Now, it’s a kind of album that gathers data together in some meaningful context, like a net collecting butterflies of the same species. In newer systems, data only needs to be properly encoded (for example, in the case of MongoDB, to be rendered as documents) to be considered part of a database.
Transaction – The processing of data that affects the outcome of queries. We used to think of databases as sums of knowledge. Now, we understand that information may exist before it’s “known,” the way sound pervades an empty forest. To explain further: Data in a distributed database may take many forms. In a conventional database system, processing a transaction makes it the “truth,” and any query processed subsequent to that point in time results in a rendering of that truth. This is the so-called atomicity principle, where the database represents the sum total of all facts, and a transaction is either taken as a whole fact or not taken at all. In a big-data system whose storage media may span the planet, such a black-and-white state of affairs may not be achievable. So instead, the system settles for a state called eventual consistency, in which documents contributed to the system can be expected to result in transactions at some point soon. Put more succinctly, we can collect information now, and we hope to know it all eventually. The act of turning information into knowledge is the transaction. (A brief sketch following these definitions illustrates both the document model and this softer consistency guarantee.)
Document – A container of properly encoded data. Usually a file, although in virtual systems, a document may logically comprise multiple files. The Web is largely responsible for this new metaphor. It’s full of encoded documents, most of which may be read anonymously — which runs contrary to the account-driven principle of conventional database access.
Data center – One place where storage media may be located, contributing to all or part of various databases. This change, although straightforward, is the most important of all because it affects the security model. Up to now, the data center was the universe of all data in a database. When you employed access control tools that protected the hardware in the data center, you protected the database. This is no longer guaranteed, because networked databases are not only distributed across data centers but in transit throughout the Internet.
Structured data – Data stored and processed within a database, in a form designed for the purposes of the database management system and the applications it runs. When Dr. Codd proposed the first relational data structures, his intention was to free data records from being bound to specific applications. But in the intervening years, in order to expedite queries, DBMSes used optimization techniques, tweaking each database’s structure for its own express purposes. This inevitably re-established some of the same vendor lock-in that Codd – even while at IBM – sought to avoid.
Unstructured data – The collection of all properly encoded documents, wherever they may reside, that may pertain to and be enrolled in a database. It’s this notion — that data already in existence someplace in the world may virtually comprise new databases in your data center — that is forcing the world’s businesses to rethink their approaches to managing their vital information.
Data warehouse – A collection of devices, programs, and services — some made available through the cloud — whose purpose is to enable data everywhere, including but not limited to what’s stored within data centers, to be referenced by applications. It is no longer a physical container, although there continue to be appliances and services that address the need for warehousing at a virtual level, giving applications the appearance of a single container.
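To make the document model concrete, here is a minimal sketch using PyMongo, MongoDB’s Python driver. The connection string, database name (“warehouse”), collection name (“events”), and field names are hypothetical placeholders; the point is that a properly encoded document becomes part of the database the moment it arrives, and that reads may reflect a slightly stale, eventually consistent view.

```python
# A minimal sketch of the document model, using PyMongo (MongoDB's Python
# driver). The host, database name, collection name, and fields below are
# hypothetical placeholders.
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017")
db = client["warehouse"]    # a database is just a named grouping of collections

# No table definition, no schema migration: a properly encoded document
# simply becomes part of the database the moment it is inserted.
db.events.insert_one({
    "type": "page_view",
    "customer": {"id": 1041, "region": "Pacific Northwest"},
    "tags": ["catalog", "outdoor"],
    "occurred_at": "2013-04-02T17:22:09Z",
})

# Reading from a secondary replica trades absolute, up-to-the-moment truth
# for availability: the "eventual consistency" described above.
events = db.get_collection("events",
                           read_preference=ReadPreference.SECONDARY_PREFERRED)
print(events.find_one({"customer.id": 1041}))
```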
The advent of the Web, coupled with the plummeting cost of storage, led to a situation where there was far more data outside of databases than in them. Software manufacturers began trying to take advantage of this situation by pitching the idea that this data must be collected and absorbed internally to be processed and to be made useful.
But businesses are discovering that this drastic step is not necessary. The old principle — that data has to be owned in order to be used — no longer applies.
“The cost of data generation is falling across the board. If you rewind to five or ten years ago, data generation was the rate-limiting step in how you manage that data, how you computed it, stored it, and asked questions of it,” remarks Matt Wood, Amazon Web Services’ principal data scientist. Wood continues:
Whether you’re working in genomic sequencing or you want to do social media analytics, or just look at the logs that your Web applications are generating, that rate-limiting step is no longer there. The cost of generating that data is falling all the time, so the economics are now favorable that the throughput of that data is faster. More of it is being generated, and that puts tremendous pressure on the infrastructure required to store, collect, compute, analyze, collaborate, and share all of the data, and the analytics [derived] from that data.
Memory Replaces Storage
Up until a few short years ago, data was logically structured as indexed tables with records, and relationships that were catalogued and queried. To ensure that related data could be more readily accessed from storage, relational database managers processed these associations in advance, building massive “card catalogs” for their own benefit. This was when stored data was a colossal library of spinning ceramic cylinders, and the processor was like a single custodian responding to inquiries lined up in a massive queue.
Virtualization, cloud dynamics, and highly optimized processors have all conspired to render this old machine obsolete. We look at how these new data engines work, fathom the full extent of their efficiencies, step back into our own data centers, and find ourselves in an antiques shop.
If you’ve studied the way engines work, you understand the principle of displacement. It’s a way of measuring how much volume is swept by all the cylinders in one round of explosions, or one cycle. If the cylinder were larger, it would not need so many cycles to displace the same volume, except for the fact that its greater mass would counteract the force of the energy it produced.
A database is an engine, at least in the virtual sense. (In the context of databases, “displacement” usually refers to the state of being jobless.) A DBMS processes data in cycles. The types of work done per cycle are described as “ETL” — extract, transform, load. The efficiency of those cycles, just as with a combustion engine, has everything to do with timing. If the latency of the storage media where data is contained were not an issue, the processor could simply acquire tables in infinitely large batches, and logic could be applied to those batches en masse. But cylinders can only spin so fast.
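To make the cycle concrete, here is a deliberately tiny extract-transform-load pass, sketched in Python with the standard csv and sqlite3 modules. The file name, table, and column names are hypothetical; a production DBMS batches and optimizes this work, but the shape of the cycle is the same.

```python
# A deliberately tiny ETL cycle: extract rows from a CSV file, transform
# them in memory, and load them into a relational table. The file, table,
# and column names here are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")

with open("daily_sales.csv", newline="") as f:               # extract
    rows = list(csv.DictReader(f))                           # expects "region" and "amount" headers

cleaned = [
    (row["region"].strip().title(), float(row["amount"]))    # transform
    for row in rows
    if row.get("amount")                                      # drop incomplete records
]

conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", cleaned)  # load
conn.commit()
conn.close()
```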
Virtualization is the science of making a single computing mechanism work by pooling resources from multiple machines. It is the engineering factor underlying all of cloud dynamics. No longer is any single storage device the slave of one processor. And no longer is memory the exclusive cache for the logic and data utilized by a single CPU.
It is a world where the laws of physics, such as we had come to understand them, no longer apply. Now that multiple machines can coalesce to produce a processing device with terabytes of virtual memory, essentially all of it represented by real memory, the “cylinder” with respect to a data engine can grow to enormous size, without suffering the cost of its mass.
Database architects initially took this news like schoolchildren hailing the opening of an infinite public park. For a time, they believed they could conduct essentially the same business they had always conducted, just on a larger scale. But scale has a way of altering one’s perspective. With virtual machines operating at these specifications, it no longer makes sense for the cycles embedded in the old data logic to work the way they do. If I may concoct yet another analogy, we’ve been improving the way folks with buckets fetch water from the proverbial river by speeding up the hand-offs between each person in the chain. Then along comes someone with a pipeline.
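The difference the pipeline makes can be glimpsed even at toy scale. The sketch below uses SQLite purely as a stand-in (it is not HANA, and a real columnar in-memory engine behaves very differently), running the same aggregate against a temporary on-disk database and against RAM; the numbers will vary by machine, but the in-memory copy never pays the cost of fetching from storage.

```python
# A toy stand-in for the in-memory idea, using SQLite (not HANA; a real
# in-memory column store behaves very differently). The same aggregate runs
# against a temporary on-disk database and against RAM.
import sqlite3
import time

def build_and_query(target):
    conn = sqlite3.connect(target)
    conn.execute("CREATE TABLE readings (sensor INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        ((i % 100, float(i)) for i in range(500_000)),
    )
    conn.commit()
    start = time.perf_counter()
    conn.execute("SELECT sensor, AVG(value) FROM readings GROUP BY sensor").fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print("on disk  :", build_and_query(""))          # "" = a private temporary on-disk database
print("in memory:", build_and_query(":memory:"))
```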
Platforms Replace Warehouses
A new industry has arisen in the production of in-memory databases whose logic takes advantage of the absence of the fetching mechanism. And to minimize the collateral damage from the disruption this new industry causes, its practitioners — most notably SAP, with HANA — are selling these databases as cloud services. This way, the terabytes of memory required for the operations don’t have to be yours exclusively.
“Think of it in terms of where the operational data could land in HANA. Now we have this thing called Suite-on-HANA, where you can actually put your ERP solution on the HANA box,” explains Charles Gadalla, SAP’s director of advanced analytics. “That means your analysis is being done on your operational data, non-indexed, linearly, as it’s coming in. That just changes the game completely.”
All of a sudden, research processes that would theoretically take years to complete with conventional systems now consume mere minutes. Tasks once considered critical enough for an organization to mandate years of capital investment suddenly take on the profile of water-cooler projects that can get done by the next quarter. Gadalla offers an example of a task that was previously impossible but is suddenly trivial: an examination of Twitter posts containing a customer’s hashtag, comparing the context in which particular items — most notably, brand names — are discussed against the discussions conducted by thousands of other Twitter users. Here’s another:
Think of a credit card company. I get a call from my credit card, “Mr. Gadalla, were you at Target in Seattle on Monday of last week? Do you remember how much you spent?” I have no idea. It’s been a week, and time has moved on. But if that credit card company could have called me as the transaction is happening, to verify that it is my credit card that may be under some type of duress, that changes the game. We could actually apprehend a criminal while they’re at the till, before a transaction is processed. It changes the game for some industries completely.
To return to my bucket chain analogy for a moment: Just as there is a physical limit to how much the bucket chain can be accelerated, there is a kind of technological barrier to how far traditional database logic can be accelerated in a virtual environment, regardless of its expanse. This may be why the leading stakeholders in the traditional methods are investing in their own in-memory products, with the objective of limiting their use to caches, temporary stores, and synchronization tools. IBM, for example, characterizes its solidDB in-memory product as an “in-memory database caching feature that accelerates virtually all leading relational databases, increasing their performance up to ten times.”
This while HANA — intentionally marketed not as an extension or a cache, but as a database management system in-memory — demonstrates speed increases for certain tasks by factors expressed in six digits.
Because storage as a commodity is becoming a trivial expense, and cloud storage has become so readily accessible, the notion that databases must be separate and exclusive entities in order to be practical is being challenged for the first time. For instance, companies that conduct research on their customers and collect data for that purpose are weighing the prospects of sharing their research, or utilizing research that has already been conducted. To capitalize on this, Salesforce and others are offering sales contact databases as platforms unto themselves, pre-integrating them into their CRM services. Suddenly, entire classifications of data that individual teams of researchers once collected exclusively can instead be harvested from pre-existing stores.
Developers Replace Administrators
Because time is money, and money sets priorities, these technological changes have immediate ramifications for the texture of an organization. An agile team (perhaps with a capital “A”) can take on a once-colossal task by assembling the ablest associates, leasing some cloud compute time, and building an app using a dynamic language on a cloud platform. Then the CIO can pull up the results on her smartphone. The ease with which this can now happen calls into question the very constitution of the information specialist teams in the organization.
Historically, the job of producing reports on the status or key performance indicators of a company has fallen to the designated database administrator. In a business where the platform is no longer administered the same way, and the nature of examining data as analytics becomes more experimental, the job of tendering this data falls to someone else. And there are extraordinary new candidates for that someone else.
Cloud services provider Appirio is the parent of a crowdsourcing platform called CloudSpokes. Its premise is both radically simple and, at once, alarming: Since the platforms on which a growing number of businesses’ applications run are literally singular, rather than distributed among data centers, ideally the developers for those platforms should all be in one place as well. But rather than create a colossal subcontractor, CloudSpokes advances the notion that groups of developers as small as one person can compete to design and implement tasks for companies — tasks that, in the earlier era, required entire corporate divisions, investments of millions, and patience measured in decades. Compensation for these developers comes in the form of prize money, which CloudSpokes awards to the team that delivers the working solution.
“Even more than the meritocracy of the skills of the developers, [CloudSpokes] rewards the time and proximity of experience,” explains Appirio’s chief strategy officer, Narinder Singh. One example Singh offers is a system for examining photographs to determine the location of the eye and apply tinting to simulate the appearance of contact lenses for prospective customers. It’s the kind of task that you might expect a Photoshop plug-in to perform, but perhaps not always adequately. On a platform scale, a system can learn the location of eyes by studying billions of photographed eyes.
“Some developer, someplace in the world, did something very close to that [task] very recently,” he continues, “and they have an inherent advantage in doing that kind of thing again. Incredibly, it’s a more efficient mechanism for doing this kind of matching of task to person.”
For this particular task, the contractor offered $2,000. Many developers with common skills would turn down such a task, concluding it might take as many as 50 person-hours to complete. Singh goes on:
But think about the person who did that same or similar task last week. They may say, “Man, this is gonna take me six hours, and I can do a fantastic job!” All of a sudden, that’s $333 an hour. All of a sudden, you’re saying, “Wow, I can work on lots of different things that leverage my best advantage, to allow me to produce the highest quality impact I can for somebody, and share in the value of that.” That is the glass-half-full, life-is-wonderful, side of this.
The challenge is this: All of a sudden, more and more work that we have is truly global in nature. Time and proximity can be applied to the solution from wherever you’re at in the world, and that obviously puts stress on our own education system, our own time and proximity skills that we develop as a nation, from an American perspective, to be able to make sure we have people who fit the mold of what I’ve described. We have to prove our advantage by showing the productivity and the proximity of experience we have to be able to accomplish other things, otherwise we have a much bigger pool of folks who also want to participate in that game. We have to make sure, as country or a college or a school, that we’re providing people with the right skills to be at the right time and place to apply their experience in the most efficient manner. That’s a critical part of having it not disrupt the quality of life, and the things we’re used to getting. In the past, we were protected by proximity; and now because of cloud and the Internet, that’s very different.
Managing Replaces Messaging
Up until very recently, building applications around a database was an exercise in making three incompatible machines communicate with one another, over a connection that feels like a telegraph line. Most relational databases are still operated using the SQL language. When an application needs to retrieve a batch of data records, or it needs to add a new one or update an existing one, it needs to assemble a query. But the application itself is not written in SQL, for one principal reason: SQL, by design, does not interact with the user. It only manages the database; it doesn’t display menus or listen for mouse clicks or verbal cues. This is so a database can be queried by multiple users; the database manager handles its own itinerary.
Thus the three machines, with SQL at the core. The user application is the second machine, though there are several simultaneous instances, perhaps hundreds at any one time. Think of SQL as the factory floor and the application as the showroom. Acting as a go-between in this “supply chain,” if you will, is a driver or data provider that negotiates between the two, such as Microsoft’s ODBC and OLE DB, and Java’s JDBC. These are the telegraph stations. They accept the SQL query from the user application, parse it, and present it to the database when it’s ready. Then they receive the result, which is often a set of records, and dole out those records to the client in sequence, because the client can only perceive records one at a time.
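Python’s standard database drivers still follow this telegraph-station pattern. The sketch below uses the built-in sqlite3 module as a stand-in for ODBC or JDBC; the table and column names are hypothetical, but the ceremony of query, cursor, and one-record-at-a-time retrieval is the same.

```python
# The driver-and-cursor ceremony, using Python's built-in sqlite3 module as
# a stand-in for ODBC/JDBC. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")            # the "telegraph station" opens a line
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Edgar")])

cursor = conn.cursor()                        # the go-between
cursor.execute("SELECT id, name FROM customers ORDER BY id")   # the SQL query, as text

row = cursor.fetchone()                       # records are doled out one at a time
while row is not None:
    print(row)
    row = cursor.fetchone()

conn.close()
```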
Engineering an application with these three disparate, asynchronous processes was difficult enough. But the massive networking of computers brought on by the Internet added two more mechanisms to the process. First, there needed to be a single gateway for all the processors that made up a data warehouse. One term for this gateway is the cube, and two existing technologies that were adapted to serve as the processing systems for such cubes were Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP). They presented the image of a single database manager, on behalf of all the networked databases behind them.
And then there was the Web itself. At first, software manufacturers devised remote consoles, some of which ran outside of Web browsers, that fulfilled the functions that the typical client in a closed network would perform. It would have been a catastrophe waiting to happen, except that it didn’t wait. The security hole this created was the size of a planet, and was immediately exploited. Modern Web applications don’t actually communicate with the database on the back end, but rather with a user application layer on the server. This gives developers the opportunity to avoid one of the most common security vulnerabilities in history: the dreaded SQL injection exploit. So the Web browser actually communicates with a database client running on the Web server.
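That application layer is also where injection is either invited or shut out. Here is a minimal sketch, again with sqlite3 and a hypothetical users table: the first query splices user input directly into the SQL text, while the second hands it to the driver as a bound parameter that can never be parsed as SQL.

```python
# Why parameterized queries matter, sketched with sqlite3 and a
# hypothetical "users" table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "' OR '1'='1"   # a classic injection payload

# DANGEROUS: user input becomes part of the SQL statement itself.
query = "SELECT * FROM users WHERE name = '" + user_input + "'"
print(conn.execute(query).fetchall())          # returns every row in the table

# SAFE: the driver binds the value; it is treated as data, never as SQL.
print(conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall())  # returns []
```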
What inevitably happened in organizations was a kind of “calcification”: a state of affairs that compartmentalized data into disparate components, where it then solidified. These organizations found themselves facing the following:
The data used by the organization’s business and financial functions was married to a class of application (often a custom one) whose sole job was reporting. So the only visibility the organization had into this core data came periodically.
It was only through the format of these reports that this reporting data could be made useful. Thus the quarterly or monthly production of spreadsheets, many of which replicated past history, making a single view of the organization’s timeline difficult to manage.
The management of all these spreadsheets required (you guessed it) a database. In many cases, Web platforms such as SharePoint were leveraged as document databases or virtual catalogs, enabling data to balloon to a size much larger than it needed to be.
The deployment of ERP and CRM systems required the core data to be periodically integrated, sometimes by having the core data “exported” as a spreadsheet (again) and then assimilated by the new application. (Ironically, many of these ERP and CRM applications are now cloud-based, and businesses have hired consultants to study the potential risks of losing custody of exported data.)
Business partners who exchange data outside of EDI transactions — in the form of catalogs, production schedules, manufacturing itineraries and processes, portfolios, and financial records as may be required by due diligence — now demand a higher quantity of data than ever before. Some businesses take the extra step of specifying the exact requirements of exchanged data. But the templates they produce are inevitably (here we go again) spreadsheets, in order to expedite integration. Sometimes, the only way this integration process can be easily automated is if the data documents are in a “least common denominator” format, such as Excel.
The primary mechanism for data exchange in this environment becomes e-mail. But the secondary mechanism becomes the proliferation of USB thumb drives. It’s one reason why these devices sell so well through FedEx Kinko’s and UPS Stores.
The emerging model of data moves closer toward what data scientists call the “single source of truth.” Its goal, as much as possible, is real-time incorporation. But because the volume of exchanged data is growing larger, the absolute consistency and atomicity of all data cannot be guaranteed in the big data model. Some of it can, through integration with a relational core, perhaps through a remodeled data warehouse.
The data warehouse concept is intended to set standards for how a user application (or a client app) makes requests of a data center containing databases from a variety of sources. But it ends up undoing the benefits of Codd’s original architecture. His introduction of predicate logic was supposed to separate the state of the database from the processes leading up to that state, so that a developer did not have to write the processes each time she rendered a query. She could write a SQL procedure declaring “the following statements,” so to speak, to be facts, and the database would make them so. Instead, the developer had to create cursors that stepped through the results in sequence, unraveling each perfect batch into its constituent parts. It’s this type of processing which makes large relational databases slow, and which slows them down exponentially as they grow linearly.
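The contrast Codd intended is easy to see in miniature. In the sketch below, against a hypothetical accounts table in sqlite3, the first version declares the desired end state in a single statement and lets the engine make it so; the second is the cursor habit, unraveling the batch row by row in application code.

```python
# Declarative (set-based) versus cursor-driven (row-at-a-time) processing,
# against a hypothetical "accounts" table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance REAL, overdrawn INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, 0)",
                 [(1, 250.0), (2, -40.0), (3, -5.0), (4, 1200.0)])

# Declarative: state the fact, and let the engine make it so.
conn.execute("UPDATE accounts SET overdrawn = 1 WHERE balance < 0")

# Cursor-driven: fetch every row and decide in application code, one record at a time.
rows = conn.execute("SELECT id, balance FROM accounts").fetchall()
for account_id, balance in rows:
    if balance < 0:
        conn.execute("UPDATE accounts SET overdrawn = 1 WHERE id = ?", (account_id,))

conn.commit()
print(conn.execute("SELECT id, overdrawn FROM accounts ORDER BY id").fetchall())
```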
So when the practitioners of the first big data systems, such as Cassandra, set forth to upend this state of affairs, the revolutionaries among them dubbed their technology “NoSQL.” But one of their goals was, ironically, the same as Dr. Codd’s: to build a scheme where large data stores could be distributed across multiple clusters, where a simpler declarative language could address data throughout those clusters. As it turned out, in the most delightful of ironies, the language best suited for achieving the goals of the burgeoning NoSQL movement was... SQL, or at least something that resembles it closely enough that, for a veteran, understanding it takes just a few minutes.
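The resemblance is plain in CQL, Cassandra’s query language, shown here through the DataStax Python driver (the cassandra-driver package). The contact point, keyspace (“retail”), table, and column names are hypothetical placeholders; note that consistency is something you tune per statement rather than something the system guarantees absolutely.

```python
# A minimal sketch of CQL through the DataStax Python driver
# (cassandra-driver). Host, keyspace, table, and columns are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])           # a contact point for the cluster
session = cluster.connect("retail")        # hypothetical keyspace

# CQL reads almost exactly like SQL, even though no joins or cross-table
# transactions are happening underneath.
query = SimpleStatement(
    "SELECT sku, qty_sold FROM daily_sales WHERE store_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,   # tunable, not absolute, consistency
)
for row in session.execute(query, ("seattle-041",)):
    print(row.sku, row.qty_sold)

cluster.shutdown()
```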
Iteration Replaces Calcification
The reporting systems upon which most enterprise data warehouses are based can be decades old, and the skill sets required to maintain them may have long ago left the building. The new and emerging model of data no longer hinges on reporting systems. Instead:
The new model realizes that relevant data may exist anyplace, perhaps in document form, and perhaps never having been indexed. Rather than being structured around reports, the new model enables architects to restructure data around efficiency. The goal here is to be able to provide business intelligence (BI) with a reasonable degree of consistency, while preserving absolute consistency for business-critical applications.
The new model provides for data management platforms (DMP) to oversee the processes necessary to attain maximum consistency, whatever they may be. These include automated integration of otherwise incompatible systems where necessary, automated import and export where necessary, and marshaling of the jobs run by big data platforms such as Hadoop that make unprocessed, unindexed, unstructured data useful and practical in a large system.
The new model is no longer predicated around the continuation of outdated applications, as the only means of communication with critical business data at the core of warehouses. In its place, big data platforms such as Microsoft’s and Hortonworks’ HDInsight let developers using cloud-based language platforms, including Java, interact directly with Hadoop processes and data sets, without the high-wire act of crafting queries for ODBC or some other go-between data provider driver. The result is iterative development — the ability to build and deploy reporting, analytics, and management applications today, in mere hours.
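One of the plainer ways to interact with Hadoop from a scripting language is Hadoop Streaming, which pipes data through any program that reads standard input and writes standard output. The sketch below is the canonical word count, written as a single Python file that serves as both mapper and reducer; the file name and invocation are hypothetical, and HDInsight also exposes richer interfaces (Hive, Pig, and Java or .NET SDKs) beyond this.

```python
# wordcount_streaming.py: a canonical Hadoop Streaming job in one file.
# Run the same script as mapper ("map") and reducer ("reduce"); Hadoop
# sorts the mapper output by key before the reducer sees it.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Locally, the same logic can be smoke-tested with a shell pipeline (cat sample.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce) before being submitted to a cluster with the hadoop-streaming jar.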
The resulting change to the structure of enterprises’ data center initiatives is truly as colossal as replacing a steam train with teleportation. The most obvious change reported by CIOs and division leaders is that they’ve been given direct control of all data processes. The calcified, encrusted, entrenched bureaucracies that separated them from their businesses’ critical information have been brushed away. And the skill sets that, just months earlier, were indisputably necessary to the operation of the enterprise, are no longer needed.
There’s a formula to all of this: The real size of big data is equal to the distance between the prosperity of a business prior to its implementation and the prosperity it attains afterward, divided by the amount of time consumed in the process. As of now, that denominator remains amazingly small. How long it stays that way will depend on what data technology vendors perceive as most important: preserving their legacy or persevering into the future.