In my previous post, I listed the services installed on an “Oracle Big Data Cloud Service – Compute Edition” instance when you select “full” as the deployment profile. In this post, I’ll explain these services and software.
HDFS: HDFS (Hadoop Distributed File System) is a distributed, scalable, and portable file system written in Java for Hadoop. It stores the data, so it is the main component of our cluster. A Hadoop (big data) cluster nominally has a single namenode plus a cluster of datanodes, but because the namenode is so critical, redundancy options are available for it. Namenode and datanode services can run on the same server (although this is not recommended in a production environment). In our small cluster, we have 1 active namenode, 1 standby namenode and 3 datanodes, distributed across 3 servers.
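To make the namenode/datanode split more concrete, here is a minimal Python sketch of the kind of bookkeeping a namenode does: it splits a file into blocks and records which datanodes hold each replica. This is only a toy model, not how HDFS is implemented. The function name `place_blocks`, the datanode names and the round-robin placement are my own assumptions for illustration (real HDFS uses a rack-aware placement policy). The 128 MB block size and replication factor of 3 are the usual HDFS defaults.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (dfs.blocksize) in Hadoop 2.x
REPLICATION = 3                 # default HDFS replication factor (dfs.replication)


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a {block_id: [datanode, ...]} map, like a toy namenode would keep.

    Hypothetical helper: placement is plain round-robin, which is a
    simplification of the real rack-aware HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")

    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block_id in range(n_blocks):
        # Pick `replication` distinct datanodes, starting at a rotating offset.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    # A 300 MB file on a 3-datanode cluster like ours, keeping 2 replicas per block.
    layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3"], replication=2)
    for block_id, nodes in layout.items():
        print(block_id, nodes)
```

With only 3 datanodes and the default replication factor of 3, every datanode ends up holding a copy of every block, which is why small clusters like ours tolerate the loss of a datanode but do not gain much storage capacity from adding replicas.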
YARN + MapReduce (v2): MapReduce is a programming model popularized by Google to process large datasets in a parallel and scalable way. YARN is a framework for cluster resource management and job scheduling. YARN contains a Resource Manager and Node Managers (for redundancy, we can create a standby Resource Manager). The Resource Manager tracks how many live nodes and resources are available on the cluster and coordinates which applications submitted by users should get these resources. Each datanode should run a Node Manager so that it can run MapReduce jobs.
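The MapReduce programming model is easier to grasp with the classic word count example. The sketch below runs in a single Python process and only imitates the three phases (map, shuffle, reduce). It does not use Hadoop or YARN at all, and the function names are hypothetical, chosen just for this illustration. On a real cluster, the map and reduce tasks would run in parallel in containers allocated by YARN.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is what makes the model scale.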