Apache Hadoop

Hadoop is a top-level Apache project being built and used by a global community of contributors,[2] using the Java programming language. Yahoo! has been the largest contributor[3] to the project, and uses Hadoop extensively across its businesses

Online PR News – 21-August-2010 – – The HDFS filesystem stores large files (an ideal file size is a multiple of 64 MB[9]), across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.

The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a web browser or other client. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.

A filesystem requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[10] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the namenode and downloads a snapshot of the primary Namenode's directory information, which is then saved to a directory. This Secondary Namenode is used together with the edit log of the Primary Namenode to create an up-to-date directory structure.

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace has been developed to address this problem, at least for Linux and some other Unix systems.