
I am a newbie to Hadoop. Please correct me if I am asking nonsense, and help me solve this problem :).

I installed and configured a two node hadoop cluster (yarn).

  • Master node : 2TB HDD, 4GB RAM
  • Slave node : 500GB HDD, 4GB RAM

DataNode: Master node only (no replicated data kept on the slave node)

Map/Reduce : Master node & Slave node.

Out of 10TB of data, I uploaded 2TB to the master node (DataNode). I am using the slave node for Map/Reduce only (to use 100% of the slave node's CPU for running queries).

My questions:

  1. If I add a new 2TB HDD to the master node and want to upload 2TB more to it, how can I use both HDDs (data on the old HDD and the new HDD on the master)? Is there any way to give multiple HDD paths in hdfs-site.xml?

  2. Do I need to add a 4TB HDD to the slave node (holding all the data on the master) to use 100% of the slave's CPU? Or can the slave access data from the master and run Map/Reduce jobs?

  3. If I add 4TB to the slave and upload data to Hadoop, will that create any replication on the master (duplicates)? Can I access all the data on the primary HDD of the master and the primary HDD of the slave? Will queries use 100% of the CPU of both nodes if I do this?

  4. As a whole, if I have 10TB of data, what is the proper way to configure a two-node Hadoop cluster? What specifications (for the master and the slave) should I use to run Hive queries fast?

I got stuck. I really need your suggestions and help.

Thanks a ton in advance.

1 Answer
Please find the answers below:

  1. Provide a comma-separated list of directories in hdfs-site.xml. Source: https://www.safaribooksonline.com/library/view/hadoop-mapreduce-cookbook/9781849517287/ch02s05.html
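     A minimal hdfs-site.xml sketch of the idea, assuming the old and new disks are mounted at /data/disk1 and /data/disk2 (placeholder paths; substitute your actual mount points). The DataNode spreads blocks across every directory in the list:

     ```xml
     <configuration>
       <!-- Comma-separated list: one directory per physical disk.
            The DataNode round-robins new blocks across these directories. -->
       <property>
         <name>dfs.datanode.data.dir</name>
         <value>/data/disk1/hdfs/datanode,/data/disk2/hdfs/datanode</value>
       </property>
     </configuration>
     ```

     Restart the DataNode after changing this property so it picks up the new directory. (On Hadoop 1.x the property was named dfs.data.dir; dfs.datanode.data.dir is the Hadoop 2.x/YARN name.)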
  2. No, you don't need to add an HDD to the slave to use 100% of its CPU. Under the current configuration, the NodeManager running on the slave will read data from the DataNode running on the master (over the network). This is not efficient in terms of data locality, but it doesn't reduce processing throughput; it only adds latency due to the network transfer.
  3. No. The replication factor (the number of copies to be stored) is independent of the number of data nodes. The default replication factor can be changed in hdfs-site.xml using the property dfs.replication. You can also configure it on a per-file basis.
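     For reference, a minimal hdfs-site.xml fragment setting the default replication factor (1 matches this layout, where only the master stores data):

     ```xml
     <!-- Default number of block copies for newly written files. -->
     <property>
       <name>dfs.replication</name>
       <value>1</value>
     </property>
     ```

     Per file, `hdfs dfs -setrep` overrides the default after the fact (for example `hdfs dfs -setrep -w 2 /path/to/file`, where the path is a placeholder), and `hdfs dfs -D dfs.replication=1 -put ...` applies a specific factor at upload time.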
  4. You will need at least 10TB of storage across your cluster (all data nodes combined, with replication factor 1). For a production system I would recommend a replication factor of 3 (to handle node failure), i.e. 10 × 3 = 30TB of storage across at least 3 nodes. Since 10TB is small in Hadoop terms, have 3 nodes, each with a 2- or 4-core processor and 4 to 8 GB of memory. Configure them as: node1: NameNode + DataNode + NodeManager; node2: ResourceManager + DataNode + NodeManager; node3: DataNode + NodeManager.
