Many organizations are launching, planning or considering “big data” analytics initiatives. But big data remains a relatively new concept that can mean different things to different people. Business intelligence (BI) analysts and vendors often define it primarily on the basis of the Three Vs of big data: volume, velocity and variety. Some view big data as referring specifically to unstructured data, such as Web server logs and data collected from sensors.
But according to Wayne Eckerson, director of research for TechTarget’s Business Applications and Architecture Media Group, such definitions cover only a slice of what big data really is. Eckerson describes big data as “a little something old, a little something new,” meaning that you don’t necessarily need to set aside all of your existing data warehousing and analytics technologies in favor of new tools in order to ramp up capabilities for managing and analyzing big data.
In this video interview recorded at SearchBusinessAnalytics.com’s “Delivering Deeper Insights with Big Data and Real-Time Analytics” seminar, Eckerson spoke with Editorial Director Hannah Smalltree about the different facets of big data. He also offered advice on starting a big data analytics program and discussed technologies that can play a role in big data initiatives.
Viewers of the video will:
- Get an explanation of what big data is, according to Wayne Eckerson
- Find out what some of the biggest misconceptions are about big data
- Learn about technologies other than Hadoop that can be useful when working with big data -- data visualization tools, for example
- Hear about case studies on how other companies are taking advantage of big data technology, including Kelley Blue Book and AT&T’s mobile operations
- Get advice on quick-win projects and how to get started with big data analytics
Read the full transcript from this video below:
Eckerson: 'Big data' analytics weighs a mix of the old and new
Hannah: Hello and welcome. I'm Hannah Smalltree. I'm the Editorial Director for SearchBusinessAnalytics.com and related sites. I'm here today at our seminar on delivering deeper insights with big data and real-time technologies.
Now, I'm speaking with Wayne Eckerson. He's TechTarget's Director of Research, covering business analytics and related technologies. Thank you so much for being here, Wayne.
Wayne: Thanks for having me.
Hannah: Now, how do you define big data, and what are some of the common misconceptions you've encountered around that term?
Wayne: Well, most people define big data as the three V's: volume, velocity, and variety. That's a pretty good definition as far as it goes. Big data obviously means lots of data, so you have big volumes by definition or default. You also have to ingest it in real-time or near real-time, so that's the velocity.
If you're talking lots of data, you're typically talking all different kinds of data. We at TechTarget also like to talk about big data in the terms of analytics, because you don't just acquire lots of data for the sake of acquiring lots of data. You need to do something with it.
We talk about big data analytics. You get data in, and then you analyze it and typically take action on that data. We also increasingly talk about visualization in big data, because oftentimes the best way to analyze big data is to visualize it. As you know, a picture sometimes is worth a thousand words, or numbers in this case.
Hannah: The second part of that is what are some of the misconceptions that you're encountering around that term?
Wayne: Well, I think there's a misconception that big data is something new. It is actually something new, but it's also something old. I've been in the data warehousing space for almost 20 years now, and it's always been about big data.
Now, the scale has changed. In the early days we were talking about hundreds of gigabytes of data; now we're talking about petabytes of data, and the technologies have evolved. It's always been about big data. It's always been about analyzing and reporting on that data, but there is some new stuff too. The new stuff is that we're talking about largely unstructured data.
In many ways, big data has become synonymous with this unstructured, semi-structured, or what some people call multi-structured data. It's also become synonymous with a certain way of capturing it and analyzing it, and that's using these new open source tools, in particular Hadoop, which many people call a new distributed computing environment for big data.
Hannah: What kinds of technologies are people using to implement these programs? You mentioned Hadoop earlier. Is this all about Hadoop?
Wayne: Right. That's a good question. Again, it gets back to what's old and what's new. Obviously, we continue to use relational databases that have been around for many years. In the last five years, the relational database market has undergone fundamental change, as vendors have developed new products specifically for analytics.
We call those analytical platforms now. They might be massively parallel processing databases, analytical appliances where you get the hardware and software all in one box, columnar databases, really a whole assemblage of different types of products specifically geared toward analyzing large volumes of data. That's kind of what the traditional, established data warehousing and business intelligence market brings to the big data table.
We also have this new environment, this Hadoop ecosystem, which comes essentially out of the internet companies that were trying to figure out a way to cost-effectively process large volumes of data to create their search indexes. Google, Yahoo, Amazon, Facebook, Twitter, all these folks have huge volumes of data that they need to process.
It was really, frankly, too expensive to do that with the traditional technology, so they've come out with some new techniques, programming techniques as well as distributed computing technologies, to do this in a cost-effective way using open source technology.
Hannah: Can you talk a little bit more about some of the most interesting case studies or use cases you've encountered with big data technologies?
Wayne: Yeah. There are any number of case studies. Two come to mind. The first is Kelley Blue Book. They started using an analytic appliance to ingest large volumes of car transaction data. They collect information about what cars sell for from any source possible, hundreds of different sources, mostly car auctions.
They have to bring all this data in and merge it, match it, normalize it, standardize it, so they're using an appliance plus fuzzy matching algorithms to do that. Then they run other statistical algorithms against that merged, standardized data to actually estimate car values, which they publish on their website and which are actually very accurate.
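As a rough illustration of what "fuzzy matching" means when vehicle names arrive spelled differently from hundreds of sources, here is a minimal Python sketch. It is not Kelley Blue Book's method: the canonical names, the `normalize` and `best_match` helpers, and the 0.8 threshold are all assumptions made for the example, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever algorithms the company actually runs.

```python
from difflib import SequenceMatcher

# Hypothetical canonical vehicle names, invented for this example.
CANONICAL = ["Honda Accord EX", "Toyota Camry LE", "Ford F-150 XLT"]


def normalize(text):
    """Lowercase the text and turn punctuation into single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def best_match(raw, candidates=CANONICAL, threshold=0.8):
    """Return the closest canonical name, or None if nothing is similar enough.

    The threshold is an assumed cutoff; a real system would tune it on data.
    """
    target = normalize(raw)
    scored = [
        (SequenceMatcher(None, target, normalize(name)).ratio(), name)
        for name in candidates
    ]
    score, name = max(scored)
    return name if score >= threshold else None


if __name__ == "__main__":
    for raw in ["honda acord ex", "FORD F150 XLT", "Tesla Model 3"]:
        print(raw, "->", best_match(raw))
```

Records that fall below the threshold come back as `None`, so they can be set aside for review rather than merged into the wrong vehicle.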
Another company is AT&T Mobility, which has 80 million customers. It really wants to know what those customers are doing on a day-to-day basis, especially if they're disconnecting the service, or churning, which is a big issue in the telecommunications industry. With big data and big data analytics, they can now watch what their customers are doing on a daily basis, and react to it.
If a segment of customers is churning, they can analyze who those customers are and immediately, within a day or 24 hours, implement new marketing campaigns to try to retain them.
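The daily churn check described here could be sketched along the following lines. This is only an illustration under stated assumptions: the segment names, the `(segment, disconnected)` record layout, and the 5% alert threshold are hypothetical and do not describe AT&T's actual systems.

```python
from collections import defaultdict


def churn_rate_by_segment(records):
    """Compute the share of customers who disconnected, per segment.

    `records` is an iterable of (segment, disconnected) pairs for one day;
    this layout is an assumption made for the example.
    """
    totals = defaultdict(int)
    churned = defaultdict(int)
    for segment, disconnected in records:
        totals[segment] += 1
        if disconnected:
            churned[segment] += 1
    return {segment: churned[segment] / totals[segment] for segment in totals}


def segments_to_target(records, threshold=0.05):
    """List segments whose daily churn rate exceeds the assumed threshold."""
    rates = churn_rate_by_segment(records)
    return sorted(segment for segment, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Made-up sample data for one day.
    day = [
        ("prepaid", True),
        ("prepaid", False),
        ("family", False),
        ("family", False),
    ]
    print(segments_to_target(day))
```

The segments returned are the ones a retention campaign would focus on the next day.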
Hannah: Now, how can people get started with big data, or these technologies? Are there any sort of recommended places to start?
Wayne: Well, in some cases, they don't have a choice. They're ingesting large volumes of data, and there's a pain point in the organization. Fortunately, with the new technologies, both on the established vendor side and the new open source technology, there's a way to scratch that itch, so use these new technologies to meet the business need.
From an analytical perspective, we oftentimes discuss how you can use the new analytics running against big data to overcome internal constraints. Every company has constraints. Usually it's time, staffing, money, so look at where your real pain points are for those constraints. Then you can start to use this technology to overcome them.
It could be as simple as this: you want to market to your customers, but you can't possibly afford to send each of them a direct mail brochure, which might be $10 apiece with the print, collateral, and postage. You have to prioritize. Who are the top 10% of customers most likely to respond to this offer? That's what you can use big data analytics for.
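The prioritization in the direct mail example can be sketched in a few lines of Python. The sketch assumes each customer already has a predicted response score from some model; the customer IDs, the scores, and the `top_percent` and `mailing_cost` helpers are hypothetical, and only the $10 per piece and the 10% cutoff come from the example above.

```python
def top_percent(scores, percent=10):
    """Return the customer IDs with the highest predicted response scores.

    `scores` maps customer ID to a model score (higher means more likely
    to respond). At least one customer is returned when any are given.
    """
    count = max(1, len(scores) * percent // 100)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:count]


def mailing_cost(customer_ids, cost_per_piece=10.0):
    """Total cost of mailing one brochure to each selected customer."""
    return len(customer_ids) * cost_per_piece


if __name__ == "__main__":
    # Made-up scores for 20 customers, c0 (lowest) through c19 (highest).
    scores = {f"c{i}": i / 100 for i in range(20)}
    selected = top_percent(scores)
    print(selected, mailing_cost(selected))
```

With 20 customers, mailing only the top 10% costs $20 instead of the $200 it would take to reach everyone.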
Hannah: Thank you so much for talking with us today. Wayne Eckerson, TechTarget's Director of Research for business intelligence, analytics, data warehousing, and related topics. Thanks to you for joining us today. Remember, you can find more articles, videos, and other resources on SearchBusinessAnalytics.com. Thank you so much for joining us today. We hope to see you here