I am new to Scala, and I have to use Scala with Spark SQL, MLlib, and GraphX to perform some analysis on a huge data set. The analyses I want to do are:
- Customer Lifetime Value (CLV)
- Centrality measures (degree, eigenvector, edge-betweenness, closeness)

The data is in a CSV file (60 GB, 3 years of transactional data) located on a Hadoop cluster.
My question is about the optimal approach to accessing the data and performing the above calculations:
- Should I load the data from the CSV file into a DataFrame and work on the DataFrame? or
- Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? or
- Is there another approach to access the data and perform the analyses?
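For context, here is a minimal sketch of the DataFrame approach I have been considering. The HDFS path and the column names (`customer_id`, `amount`) are placeholders for my actual schema, not real values from my data set:

```scala
// Sketch of option 1: read the CSV from HDFS into a DataFrame
// and run an aggregation of the kind CLV would need.
// Path and column names below are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CLV sketch")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true") // for 60 GB an explicit schema would presumably be faster
      .csv("hdfs:///data/transactions.csv")

    // Example aggregation toward CLV: total spend per customer
    val spendPerCustomer = df.groupBy("customer_id")
      .agg(sum("amount").as("total_spend"))

    spendPerCustomer.show(10)
    spark.stop()
  }
}
```

I am unsure whether staying at the DataFrame level like this is the right choice, or whether the graph analyses (which need GraphX, i.e. RDDs) make the RDD route more natural.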
Thank you so much in advance for your help.