I have 1 millions records and I want to try spark for this. I have list of items and want to perform lookup in records using this list items.
l = ['domestic',"private"]
text = ["On the domestic front, growth seems to have stalled, private investment and credit off-take is feeble, inflation seems to be bottoming out and turning upward, current account situation is not looking too promising, FPI inflows into debt and equity have slowed, and fiscal deficit situation of states is grim.", "Despite the aforementioned factors, rupee continues to remain strong against the USD and equities continue to outperform.", "This raises the question as to whether the asset prices are diverging from fundamentals and if so when are they expected to fall in line. We examine each of the above factors in a little more detail below.Q1FY18 growth numbers were disappointing with the GVA, or the gross value added, coming in at 5.6 percent. Market participants would be keen to ascertain whether the disappointing growth in Q1 was due to transitory factors such as demonetisation and GST or whether there are structural factors at play. There are silver linings such as a rise in core GVA (GVA excluding agri and public services), a rise in July IIP (at 1.2%), pickup in activity in the cash-intensive sectors, pick up in rail freight and containers handled by ports.However, there is a second school of thought as well, which suggests that growth slowdown could be structural. With demonetisation and rollout of GST, a number of informal industries have now been forced to enter the formal setup."]
res = {}
for rec in text:
for word in l:
if word in rec:
res[rec] = 1
break
print res
This is simple python script and same logic I want to execute using pyspark(Will this same code work?) in distributed manner to reduce the execution time.
Can you please guide me how to do this. I am sorry as I am very new to spark, you help will be much appereciated.