I am a newbie to Hadoop. Please correct me if I am asking nonsense and help me to solve this problem :).
I installed and configured a two node hadoop cluster (yarn).
- Master node : 2TB HDD, 4GB RAM
- Slave node : 500GB HDD, 4GB RAM
Datanode: Master node only (Not keeping replicated data in Slave node)
Map/Reduce : Master node & Slave node.
Out of 10TB data, I uploaded 2TB to Master node(Data node). I am using slave nodes for Map/Reduce only (to use 100% CPU of slave node for running queries).
My questions:
If I add a new 2TB HDD to master node and I want to upload 2TB more to master node, how can I use both the HDD(data in Old HDD and New HDD in master)? Is there any way to give multiple HDD path in hdfs-site.xml?
Do I need to add 4TB HDD in slave node(With all the data in master) to use 100% of CPU of slave? Or Do slave can access data from master and run Map/Reduce jobs?
If I add 4TB to slave and uploading data to hadoop. Will that make any replication in master(duplicates)? Can I get access to all data in the primary HDD of master and primary HDD of slave? Do the queries uses 100% CPU of both nodes if I am doing this?
As a whole, If I have a 10TB data. what is the proper way to configure Hadoop two node cluster? what specification(for master and datanode) should I use to run the Hive queries fast?
I got stuck. I really need your suggestions and help.
Thanks a ton in advance.