
Data has become every company’s problem and opportunity. Researcher IDC predicted in 2014 that data would more than double every two years, reaching 44 trillion gigabytes by 2020. To manage it all, information technologies that emerged in the early 2000s at Internet firms like Google Inc. and Yahoo Inc. are moving into new markets that demand “webscale” data processing power.
Enter Hadoop. Although it’s been around for a few years, the ecosystem surrounding it has grown and many large companies are starting to ask what it can do for their business. Corporate adoption remains low as firms continue to explore “Big Data” projects, according to one researcher. But its benefits and drawbacks are a major part of the CIO’s discussion.
Here’s a look at Hadoop, its capabilities and its limitations.
What is Hadoop?
Hadoop is a software framework, meaning it provides a set of functions that can be customized by users. It’s open source. No one owns the underlying technology, and anyone can use it.
Unlike many traditional databases, which rely on monolithic systems that store data in pre-defined columns and rows, Hadoop is designed to work with data that is distributed across clusters of low-cost hardware that can scale based on a user’s need. That structure allows companies to catalog a larger volume and wider variety of data, and analyze it in an efficient way.
Where did it come from?

- Doug Cutting with Hadoop, his son’s stuffed elephant
- Cloudera
Hadoop started in the early 2000s, when engineers Doug Cutting and Mike Cafarella were building an open-source search engine called Nutch. Inspired by a pair of Google research papers, they borrowed the concepts of the distributed file system and the MapReduce framework for parallel processing to harness lots of computing power so Nutch could scale.
Yahoo hired Mr. Cutting full-time in 2006, and he and a team of contributors improved the software to the point where it could handle petabytes of data and thousands of computing nodes. Mr. Cutting named the technology Hadoop, after his son’s yellow stuffed elephant.
Hadoop's core: HDFS and MapReduce
Hadoop is made up of two primary components: Hadoop Distributed File System (HDFS) and MapReduce.
HDFS manages data storage. Files are broken down into smaller “blocks” that are distributed across a system of commodity computers called a cluster. Computers on the cluster are called nodes.
While many traditional relational databases store data in pre-defined columns and rows, HDFS lets companies store growing amounts of “unstructured data”, such as video and social media posts. Some companies use HDFS to create “data lakes”, essentially huge data repositories that can support analytics. Hadoop also is often used in conjunction with traditional database technologies.
Since HDFS stores data on commodity hardware rather than expensive, specialized machines, it’s generally less costly and easier to scale. HDFS also makes copies of each block and distributes them throughout the cluster, so data isn’t lost if a server goes down. (See diagram below.)
How data is stored in HDFS

- HDFS splits large files into blocks and distributes/replicates those blocks across multiple nodes in the cluster.
- Barclays Research
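To make the block-and-replica idea concrete, here is a purely illustrative Python sketch (not HDFS code) that splits a file into fixed-size blocks and assigns each block to several nodes. The block size, node names and file name are assumptions chosen for the example; real HDFS placement also accounts for rack topology.

    import itertools

    BLOCK_SIZE = 128 * 1024 * 1024   # HDFS commonly uses 128 MB blocks
    REPLICATION = 3                  # typical default replication factor
    NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

    def split_and_place(path):
        """Split a file into blocks and pick REPLICATION nodes for each block."""
        placements = []
        node_cycle = itertools.cycle(NODES)
        with open(path, "rb") as f:
            block_id = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                # Simple round-robin placement; real HDFS also weighs rack awareness.
                replicas = [next(node_cycle) for _ in range(REPLICATION)]
                placements.append((block_id, len(block), replicas))
                block_id += 1
        return placements

    # "clickstream.log" is a made-up input file for the example.
    for block_id, size, replicas in split_and_place("clickstream.log"):
        print(block_id, size, replicas)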
Computation happens simultaneously across nodes in the cluster, an approach called parallel processing. Spreading the work out can help companies handle larger volumes of data in less time.
MapReduce is the programming model for analyzing data in Hadoop. It happens in three main stages: Map, Shuffle and Reduce.
The Map function takes input data and gets it ready to be analyzed. In a March 2015 report published by Barclays, analysts explained how Hadoop could be used to analyze customer sentiment on social media. Starting with a bunch of raw text, the Map function would split the text into distinct words, determine the sentiment of those words and assign them a value of positive, negative or neutral, the report said.
The Shuffle function brings related pieces of data closer together to make the analysis more efficient. In the social media example, Shuffle would bring together the instances of each word stored across the cluster, the Barclays report said. This is an important step since data are spread out across various nodes in the cluster.
Reduce takes that data and performs calculations to get users closer to an answer. In this case, it would determine what percentage of the social media posts were positive, negative or neutral, Barclays said. A company could use that insight about overall customer sentiment to shape its marketing strategy.
Map, Shuffle and Reduce: Social media sentiment

- Example of MapReduce social media sentiment analysis by phase. Bars represent blocks, colors represent sentiments, circles represent nodes.
- Barclays Research
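To make the three stages concrete, here is a minimal single-process Python sketch of the same kind of sentiment count. The word lists and sample posts are invented for illustration; a real Hadoop job would distribute each stage across the nodes of the cluster.

    from collections import defaultdict

    # Hypothetical sentiment word lists for the example.
    POSITIVE = {"love", "great", "happy"}
    NEGATIVE = {"hate", "broken", "awful"}

    posts = ["I love this phone", "The screen is broken", "Great battery, great camera"]

    # Map: turn each post into (sentiment, 1) pairs, one per word.
    def map_phase(post):
        for word in post.lower().split():
            word = word.strip(",.!?")
            if word in POSITIVE:
                yield ("positive", 1)
            elif word in NEGATIVE:
                yield ("negative", 1)
            else:
                yield ("neutral", 1)

    # Shuffle: group values by key (sentiment), as Hadoop does between nodes.
    def shuffle_phase(mapped):
        grouped = defaultdict(list)
        for key, value in mapped:
            grouped[key].append(value)
        return grouped

    # Reduce: sum the counts per sentiment and compute percentages.
    def reduce_phase(grouped):
        totals = {key: sum(values) for key, values in grouped.items()}
        overall = sum(totals.values())
        return {key: round(100 * count / overall, 1) for key, count in totals.items()}

    mapped = [pair for post in posts for pair in map_phase(post)]
    print(reduce_phase(shuffle_phase(mapped)))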
YARN, short for Yet Another Resource Negotiator, is a third component often considered a key part of Hadoop. It schedules data processing jobs and manages computing resources inside the cluster, Barclays said.
Hadoop in the enterprise
In big companies, Hadoop often sits next to the enterprise data warehouse, which collects information from across the firm and uses a set of complex rules to analyze it and manage how it is used in business reports. Hadoop is often used to process unstructured data that may not yet fit into the data warehouse. It also may be useful for analyzing information that doesn’t yet have a clear business purpose.
The Hadoop ecosystem
Hadoop’s development is overseen by the Apache Software Foundation, a nonprofit that supports more than 350 open source projects. Individuals and companies have contributed to these projects and built both open and for-profit tools linked to Hadoop. The Apache Hive project developed a query language called HiveQL that resembles SQL, a well known query language used in many companies. Apache Pig provides a language – Pig Latin – to simplify MapReduce jobs, which are generally regarded as complex. Apache Sqoop helps facilitate the movement of data between Hadoop and relational databases. For more Hadoop-related Apache projects, click here.
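As a rough illustration of how HiveQL resembles SQL, the sketch below uses the third-party PyHive library to run a HiveQL query from Python. The host, database, table and column names are hypothetical, and it assumes a reachable HiveServer2.

    from pyhive import hive  # third-party client library for HiveServer2

    # Connection details and table/column names are hypothetical.
    conn = hive.Connection(host="hive.example.com", port=10000, database="default")
    cursor = conn.cursor()

    # HiveQL reads much like ordinary SQL, but runs as jobs over data stored in HDFS.
    cursor.execute("""
        SELECT sentiment, COUNT(*) AS posts
        FROM social_posts
        GROUP BY sentiment
    """)

    for sentiment, count in cursor.fetchall():
        print(sentiment, count)

    cursor.close()
    conn.close()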
Several companies are trying to make Hadoop usable for corporate customers. They include Hortonworks Inc., which went public in December 2014, as well as Cloudera Inc. and MapR Technologies Inc. Mr. Cutting now works at Cloudera.
Spark
Spark is an Apache project gaining attention among companies that need to analyze huge volumes of data in real time. Instead of writing the output of each analysis to a disk, Spark stores results inside resilient distributed datasets, or RDDs, that are cached in available memory. That reduces lag time and increases performance. Spark’s processing engine is more efficient than MapReduce, many proponents say. Big Data experts have said MapReduce can be clunky and difficult to learn, factors that could slow corporate adoption and spur the development of new and better tools.
“Spark will likely overtake MapReduce as the general purpose Hadoop engine of choice in the not too distant future,” Barclays researchers said in their March report. IBM said in June it would embed Spark into its analytics and commerce platforms.
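To show what the cached-RDD idea looks like in practice, here is a short PySpark sketch. It assumes a local Spark installation, and the input file name is hypothetical; the point is that the second computation reuses the in-memory RDD instead of rereading the data from disk.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "sentiment-demo")

    # Load raw text and keep the resulting RDD cached in memory.
    posts = sc.textFile("posts.txt").cache()  # "posts.txt" is a hypothetical input file

    # First pass over the data: count total posts.
    total = posts.count()

    # Second pass reuses the cached, in-memory copy rather than rereading from disk.
    positive = posts.filter(lambda line: "love" in line.lower()).count()

    print(total, positive)
    sc.stop()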
Corporate adoption
Researcher Gartner Inc. says Hadoop adoption remains low as firms struggle to articulate Hadoop’s business value and overcome a shortage of workers who have the skills to use it. A survey of 284 global IT and business leaders in May found more than half had no plans to invest in Hadoop. Adoption could grow with the use of tools based on SQL, a query language that corporate IT shops know well, Barclays analyst Raimo Lenschow said earlier this year.
Hadoop adoption may be aided as large technology vendors add integration and support. Some examples: SAP SE’s HANA combines an in-memory database, advanced analytics processing and integration to data stores including Hadoop. Oracle Corp. offers Hadoop integration as well as its own Big Data Appliance. Microsoft Corp. provides Hadoop as a managed cloud service on its Azure platform, and International Business Machines Corp. has Hadoop offerings of its own. Some companies offer security and other “enterprise-grade” features. Amazon.com Inc.’s Amazon Web Services unit sells a service called Elastic MapReduce, based on Hadoop, which gives companies a managed framework for processing large amounts of data.
Beyond Hadoop
Hadoop is one of many data technologies on the market today. Facebook Inc., for example, has developed an open source distributed SQL query engine called Presto to handle analytics in its own Big Data environment. It has its own interface and a custom query engine. Companies including Airbnb Inc. and Netflix Inc. use Presto, a Facebook spokeswoman said.
Hadoop has its limitations. It’s one of a range of data technologies inside the enterprise. Still, it’s one that companies need to understand as data-intensive technology such as mobile devices, social media and sensor-embedded, connected “things” redefine their business.
Write to [email protected]
@Michael Segel -
>
> I can tell you from personal experience that one can find value from joining the data from different
> groups within the organization
>
Totally agree here. This is not new. Combining or enriching data across silos has long been valuable.
AND it takes a lot of work to "combine" cross-silo data since the silos were built at different times, by different people, for different needs & objectives. Consequently the data is highly unlikely to be similar. It may look the same.
Looking similar & meaning similar are two very different things.
Say I'm a small (10,000 employees) financial services firm with a mere 120,000 files on the mainframe. How does one determine which files can be run through Hadoop?
For certain, all the file owners will claim their data is valuable & unique. Worse, the systems very likely were built to politically protect themselves by uniqueness. You can't ask the owners since they'll defend their turf.
So how do you determine what the data actually MEANS? Or is meaning simply not relevant in the Hadoop process?
@Michael Segel -
>
> that one can find value from joining the data from different groups within the organization,
>
If I have data from LoB A, LoB B, & LoB N & combine "like" data from them, where is the process to determine if the meanings of the data in the various LoBs are the same?
Either someone does a lot of work in A, B & N to accurately determine the meaning(s) (LoB owners are unlikely to have more than a flimsy grasp of the data flowing through their systems), or A, B & N get jammed together in the Hadoop process, right?
Or does Hadoop magic just not need to understand the data it's ingesting? This would seem to imply that "42" has universal meaning no matter where or how it's used.
And then again, perhaps the various LoBs have rigorous data profiling in place... which I don't believe for a second.
@Holden
Microsoft Azure's cloud is a bit interesting, and the only reason Microsoft is in the Hadoop game is thanks in large part to Hortonworks. Azure, when compared with AWS or Google, is a bit immature.
I'm not sure what you mean in your second paragraph.
First, Hadoop vendors like Cloudera, MapR and Hortonworks can all be run on AWS EC2 instances. MapR has inked a deal where you can build AWS EMR clusters using MapR alongside Amazon's proprietary Hadoop release. (MapR also has a deal with Google, if memory serves...)
To your point, regardless of whether you run in the cloud or on an on-premises solution, you will still have to move your data to the cluster. In terms of security and privacy, while cloud vendors like AWS have certified their solutions as secure, when dealing with PII you have more security concerns and more risk in the cloud than in your own data center. There are also other laws you have to consider, especially when doing business internationally.
What I find striking about this article is that the only reference to big data cloud computing is one sentence about Microsoft Azure, when there is a clear and massive shift of Hadoop computing happening on Amazon Web Services and Google Cloud Platform, driven by the cloud-native nature of many businesses as well as the agility of deployment and the availability of integrations with technologies coming from cloud-based vendors like Qubole, Databricks, and EMR.
Services like Cloudera, Hortonworks, and Oracle, by contrast, are for the most part hosted offerings you have to move your data to in order to run. These vendors fulfill specific use cases, such as situations that require PCI and HIPAA compliance, but even then the cloud IaaS providers (AWS, GCP, Azure, etc.) are all shifting to satisfy those compliance requirements.
Interesting article and comments. Ilya, though, must have financial ties with Oracle, or his skill set is Oracle-centric. Hadoop looks to be a low-cost alternative in areas where Oracle isn't clearly the top dog. Currently I use open-source MySQL on a daily basis; it's free, stable, and fast enough. Our organization couldn't afford Oracle. Perhaps Hadoop could be an option elsewhere within our organization. I think it's worth investigating.