I have 1 million records and I want to try Spark for this. I have a list of items and want to perform a lookup in the records using these items.

l = ['domestic', 'private']
text = ["On the domestic front, growth seems to have stalled, private investment and credit off-take is feeble, inflation seems to be bottoming out and turning upward, current account situation is not looking too promising, FPI inflows into debt and equity have slowed, and fiscal deficit situation of states is grim.", "Despite the aforementioned factors, rupee continues to remain strong against the USD and equities continue to outperform.", "This raises the question as to whether the asset prices are diverging from fundamentals and if so when are they expected to fall in line. We examine each of the above factors in a little more detail below.Q1FY18 growth numbers were disappointing with the GVA, or the gross value added, coming in at 5.6 percent. Market participants would be keen to ascertain whether the disappointing growth in Q1 was due to transitory factors such as demonetisation and GST or whether there are structural factors at play. There are silver linings such as a rise in core GVA (GVA excluding agri and public services), a rise in July IIP (at 1.2%), pickup in activity in the cash-intensive sectors, pick up in rail freight and containers handled by ports.However, there is a second school of thought as well, which suggests that growth slowdown could be structural. With demonetisation and rollout of GST, a number of informal industries have now been forced to enter the formal setup."]
res = {}
for rec in text:
    for word in l:
        if word in rec:
            res[rec] = 1  # flag the record as soon as any keyword matches
            break
print(res)

This is a simple Python script, and I want to execute the same logic using PySpark (will this same code work?) in a distributed manner to reduce the execution time.

Can you please guide me on how to do this? I am sorry, as I am very new to Spark; your help will be much appreciated.

1 Answer

After instantiating a SparkContext and/or a SparkSession, you'll have to convert your list of records to a DataFrame. If you don't already have a session, a minimal sketch (assuming Spark 2.x; the app name is arbitrary) looks like this:
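
from pyspark.sql import SparkSession

# minimal session sketch; "keyword-lookup" is an arbitrary, illustrative app name
spark = SparkSession.builder.appName("keyword-lookup").getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext, used below for parallelize and broadcast

With spark and sc in hand, convert the records: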

df = spark.createDataFrame(
    sc.parallelize(
        [[rec] for rec in text]  # wrap each record so it becomes a one-column row
    ),
    ["text"]
)
df.show()

    +--------------------+
    |                text|
    +--------------------+
    |On the domestic f...|
    |Despite the afore...|
    |This raises the q...|
    +--------------------+
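
As an aside, spark.createDataFrame also accepts a plain Python list of tuples, so the explicit parallelize step is optional; an equivalent construction would be:

df = spark.createDataFrame([(rec,) for rec in text], ["text"])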

Now you can check, for each row, whether any of the words in l is present:

sc.broadcast(l)  # optional here: the returned handle is discarded (see the note below)
res = df.withColumn("res", df.text.rlike('|'.join(l)).cast("int"))
res.show()

    +--------------------+---+
    |                text|res|
    +--------------------+---+
    |On the domestic f...|  1|
    |Despite the afore...|  0|
    |This raises the q...|  0|
    +--------------------+---+
  • rlike performs regex matching; '|'.join(l) builds the pattern domestic|private, so a row is flagged if it contains any of the keywords. If your keywords may contain regex metacharacters, escape them first, e.g. '|'.join(re.escape(w) for w in l).
  • sc.broadcast copies an object to every executor so tasks don't have to fetch it from the driver. As written, though, the broadcast handle is discarded, and rlike bakes the joined pattern into the query plan anyway, so the call is not strictly needed here; a broadcast variable only pays off when l is used inside a UDF, as sketched below.
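
If you need exact substring semantics, matching the original Python loop, a UDF is one way to do it, and it is also where the broadcast variable actually earns its keep. A minimal sketch (contains_any is an illustrative name, not a Spark built-in):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

b_l = sc.broadcast(l)  # executors read b_l.value locally

@udf(IntegerType())
def contains_any(s):
    # plain substring test, mirroring the original nested loop
    return int(any(word in s for word in b_l.value))

res = df.withColumn("res", contains_any(df.text))
res.show()

Bear in mind that a Python UDF is usually slower than a built-in expression like rlike, so prefer rlike when regex semantics are acceptable.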

Hope this helps
