Hadoop and Cloud

November 13, 2010

When I first heard of Hadoop, it was in conjunction with cloud computing. After reading extensively about cloud technology, I still had to wonder how Hadoop was related to the cloud. To answer that, let me first talk about what Hadoop is.

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
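To make the Map/Reduce idea above a little more concrete, here is a minimal sketch in plain Python. It does not use Hadoop's API at all; the function names (`map_fragment`, `shuffle`, `reduce_counts`, `word_count`) and the sample input are my own, made up for illustration. It only shows the shape of the paradigm: the input is cut into small fragments, each fragment is mapped independently (so in a real cluster any node could run or rerun it), and the results are grouped by key and reduced.

```python
from collections import defaultdict


def map_fragment(fragment):
    """Map step: emit a (word, 1) pair for every word in one fragment of input."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values seen for one key into a single total."""
    return key, sum(values)


def word_count(fragments):
    """Run map over every fragment, then shuffle and reduce (all in one process here)."""
    pairs = []
    for fragment in fragments:
        # Each fragment is processed on its own, so the order does not matter.
        pairs.extend(map_fragment(fragment))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input, already split into small fragments of work.
fragments = [
    "hadoop runs on clusters",
    "clusters of commodity hardware",
    "hadoop stores data in files",
]
counts = word_count(fragments)
print(counts)
```

Because each fragment is mapped without looking at any other, the result is the same no matter which fragment is handled first, and that independence is what lets a framework hand fragments to different nodes or redo one after a failure.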

The main thing to keep in mind, though, is that Hadoop does not replace a database. Hadoop stores data in plain files and does not index them; it handles very large datasets by spreading those files across a distributed filesystem.
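Here is a rough sketch of what "no index" means in practice. This is plain Python, not Hadoop code, and the file names, record layout, and helper names (`scan_files`, `build_index`) are assumptions made up for the example. Without an index, finding one customer's records means reading every line of every file; a database-style index answers the same question with a direct lookup.

```python
def scan_files(files, customer_id):
    """Full scan: read every line of every file, as you must without an index."""
    matches = []
    lines_read = 0
    for lines in files.values():
        for line in lines:
            lines_read += 1
            record_id, _, payload = line.partition(",")
            if record_id == customer_id:
                matches.append(payload)
    return matches, lines_read


def build_index(files):
    """What a database would maintain for you: a map from key to matching records."""
    index = {}
    for lines in files.values():
        for line in lines:
            record_id, _, payload = line.partition(",")
            index.setdefault(record_id, []).append(payload)
    return index


# Hypothetical data files, each a list of "customer_id,event" lines.
files = {
    "part-0": ["c1,login", "c2,purchase"],
    "part-1": ["c3,login", "c1,logout"],
}

matches, lines_read = scan_files(files, "c1")
index = build_index(files)
print(matches, lines_read)
print(index["c1"])
```

The scan touches all four lines to find two records, while the index finds them directly. The scan approach pays off when you want to process the whole dataset anyway, which is the kind of work Hadoop is built for.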

Many of the major cloud players are working with Hadoop in their cloud environments. Yahoo has a great presentation on Hadoop at http://www.slideshare.net/darugar/cloud-computing-hadoop-presentation

A fellow cloud enthusiast raised the same question: what does Hadoop have to do with the cloud?

In short, Hadoop is a great framework that works well in cloud environments and complements them, but it does not depend on the cloud: you can just as well run Hadoop on your own Linux machines.