It is easy to ignore how much data exists in the world. After all, every time you open an app or go to a web page, a piece of data is gathered. But when users are more concerned with an outcome, such as playing a game or reading a news article, they don't often dwell on what is happening behind the scenes.
Every aspect of a user's behavior is contributing to the vast world of Big Data. The amount of data created and copied annually more than doubles every two years, according to an EMC report by analysis firm IDC. In 2014, IDC anticipated that the digital world would grow from 4.4 trillion gigabytes in 2013 to 44 trillion gigabytes in 2020, growth by a factor of 10 within seven years.
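As a rough check on those figures, the short Python sketch below works out what the quoted forecast implies per year. It uses only the two sizes and dates cited above; the variable names are illustrative, and the calculation assumes smooth compound growth, which the report does not necessarily claim.

```python
import math

# Figures quoted above from the IDC forecast, in trillions of gigabytes.
size_2013 = 4.4
size_2020 = 44.0
years = 2020 - 2013

# Overall growth, then the steady yearly rate that would produce it
# (an assumption: real growth need not be evenly spread across years).
growth_factor = size_2020 / size_2013
annual_rate = growth_factor ** (1 / years) - 1
doubling_time = years * math.log(2) / math.log(growth_factor)

print(f"Overall growth: {growth_factor:.1f}x over {years} years")
print(f"Implied annual growth: {annual_rate:.1%}")
print(f"Implied doubling time: {doubling_time:.2f} years")
```

Under that assumption, tenfold growth in seven years works out to roughly 39 percent a year and a doubling time of a little over two years, close to the two-year doubling the report describes.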
The scale of that amassed data is difficult to grasp because it seems so out of context. But when viewed in parts, it is easy to understand why the digital world has seen explosive growth.
Take any online form. The voter registration form for the District of Columbia, for example, has fill-in-the-blank and dropdown fields for a local resident to complete. Below the surface, however, each of those fields represents a column in a data set, and each of the capital's thousands of registered residents is a row.
Voter registration forms by themselves tell a lot about an individual and the voting landscape of a city. But imagine if a city were to analyze voter registration data in tandem with tax records. The combined insight would tell a completely different story, uniting the voting preferences of a locale with income levels.
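The idea of analyzing two record sets in tandem can be sketched in a few lines of Python. Everything in this example is invented for illustration: the field names (`resident_id`, `ward`, `party`, `income`), the values, and the assumption that both data sets share a common identifier. It is a minimal sketch of a join followed by a grouped summary, not a description of how any city actually handles these records.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: every field name and value here is invented.
voter_rows = [
    {"resident_id": 1, "ward": "Ward 1", "party": "A"},
    {"resident_id": 2, "ward": "Ward 1", "party": "B"},
    {"resident_id": 3, "ward": "Ward 2", "party": "A"},
    {"resident_id": 4, "ward": "Ward 2", "party": "A"},
]
tax_rows = [
    {"resident_id": 1, "income": 52000},
    {"resident_id": 2, "income": 88000},
    {"resident_id": 3, "income": 41000},
    # Resident 4 has no tax record, so the join below drops that row.
]

def join_on_resident(voters, taxes):
    """Inner join: keep only voters that have a matching tax record."""
    income_by_id = {row["resident_id"]: row["income"] for row in taxes}
    return [
        {**voter, "income": income_by_id[voter["resident_id"]]}
        for voter in voters
        if voter["resident_id"] in income_by_id
    ]

def median_income_by(rows, key):
    """Group the joined rows by one column and take the median income."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["income"])
    return {group: median(values) for group, values in groups.items()}

combined = join_on_resident(voter_rows, tax_rows)
print(median_income_by(combined, "party"))
print(median_income_by(combined, "ward"))
```

Neither input can answer a question about income by party on its own; only the joined rows carry both columns.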
That is a very surface-level look at the promise of Big Data. With the modernization of storage technologies and the advent of cloud computing, it is easier than ever before for companies to gather and store information without needing to purge archives to make room for new data sets.
What data companies gather could dictate not only how well they understand their customers, but also their long-term place in the market.
The waiting game
Technology had to come a long way before it was ready for Big Data in its current form. In fact, Big Data used to be defined by the type of infrastructure required to work with a data set.
"The reason there was the term 'Big Data' is because you couldn't use the old databases to work with it," said Frank Bien, CEO of Looker, a business intelligence and Big Data analytics software company. "It was too big."
Now, because companies can store everything with the help of cloud service providers, there's no need to clip the amount of data an organization absorbs.
"The concept of the data set alone being the determinate of whether it counts as Big Data is maybe one that was more appropriate three or four years ago," said Alex Bakker, principal analyst at Information Services Group. "What was a huge data set for one tool a couple of years ago might be a totally tractable data set on a laptop with a different tool today."
Vendors have worked to re-imagine tooling, so not only are companies capable of working with larger data sets, they are also finding it cost effective.
A company can pull its sales data from Salesforce.com and combine it with marketing data from Marketo, customer support information from Zendesk and web traffic information, according to Bien. Alone, each source can tell a business a lot, but when brought together the data can highlight, for example, whether a customer had a complaint that affected their purchasing habits.
The real value comes from putting all those metrics together to derive insight, according to Bien. Then, companies can change "questions on the fly," offering businesses more agility and flexibility with how they use and respond to actionable data.
Over-collection
Just because companies can collect and store data doesn't always mean they should. After all, why should a company pay for storage if it's not going to use it?
Much of the market's current transition with Big Data is toward acting on the gathered information. Facebook and Google analyze vast consumer profiles daily to create value in advertising and product development. But for many businesses, that is not yet second nature.
Whether a company is under- or over-collecting data, "Big Data needs to lead to effortless information," said James Burke, director at ISG.
It's a given that organizations are going to collect too much data, but as long as they are deriving insights by applying analytics, over-collection doesn't pose a problem.
When it comes to data, Burke said, businesses should care about:
- Descriptive analytics: What happened
- Diagnostic analytics: Why something happened
- Predictive analytics: What will happen
- Prescriptive analytics: How to make something happen
Over-collecting data is a minor problem, however. Gathering vast troves of data and applying advanced analytics to contribute to ROI offers more value than clipping the amount of data an organization takes in.
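One way to picture the cost of clipping data at intake is to compare a stored summary with the raw records behind it. The Python sketch below uses invented sales rows and hypothetical field names (`region`, `product`, `amount`); it assumes a company first cared only about revenue per region and kept nothing else.

```python
from collections import defaultdict

# Hypothetical raw events; field names and values are invented.
raw_sales = [
    {"region": "East", "product": "basic", "amount": 100},
    {"region": "East", "product": "premium", "amount": 250},
    {"region": "West", "product": "basic", "amount": 120},
    {"region": "West", "product": "premium", "amount": 300},
]

def total_by(rows, key):
    """Sum the amount column for each distinct value of the given key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

# Original question: revenue per region. Suppose only this summary is kept.
summary_by_region = total_by(raw_sales, "region")
print(summary_by_region)

# New question: revenue per product. The regional summary no longer records
# which product each sale was for, so only the raw rows can answer it.
revenue_by_product = total_by(raw_sales, "product")
print(revenue_by_product)
```

If only `summary_by_region` had been stored, the second question would require collecting the original records again.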
"It's so easy to collect everything and when you start taking only part of stuff or doing what's called aggregation — summarizing the data — if you change the question later, it's impossible to get an answer because you have to go back to the plumbers and re-ingest all the data," Bien said.
Are data lakes the answer?
So a firm has gathered all this data. Now what? Companies are trying to find relationships among an array of data sets that live across disparate systems. Many struggle with master data management, and the experts who need to do analytics cannot access the necessary systems, according to Bakker.
That has given rise to data lakes, repositories for raw enterprise data. Rather than asking questions and limiting what data is collected at the outset, companies can pool data together to ask more complete and complex questions.
Some, however, are skeptical of the security of data lakes. If all a company’s proprietary information is pooled in one place, it creates a tempting target for malicious actors looking to disrupt the enterprise.
"Data lakes in and of themselves are security neutral. There's no reason that a data lake can't be secure," Bakker said. "It has, perhaps, increased security requirements because you are centralizing information. So the risk associated with bad security goes up, because you don't have that security through obscurity and the dispersion of responsibility across multiple systems."
Data pooled in a lake could possibly be more secure than data left on different systems. "The alternative is complete insecurity," Bien said. "The alternative is that people are taking data and they're putting it into spreadsheets or visualization workbooks and they're emailing it around."