I have 1 million records and I want to try Spark for this. I have a list of items and want to perform a lookup in the records using these items.

l = ['domestic', 'private']
text = ["On the domestic front, growth seems to have stalled, private investment and credit off-take is feeble, inflation seems to be bottoming out and turning upward, current account situation is not looking too promising, FPI inflows into debt and equity have slowed, and fiscal deficit situation of states is grim.", "Despite the aforementioned factors, rupee continues to remain strong against the USD and equities continue to outperform.", "This raises the question as to whether the asset prices are diverging from fundamentals and if so when are they expected to fall in line. We examine each of the above factors in a little more detail below.Q1FY18 growth numbers were disappointing with the GVA, or the gross value added, coming in at 5.6 percent. Market participants would be keen to ascertain whether the disappointing growth in Q1 was due to transitory factors such as demonetisation and GST or whether there are structural factors at play. There are silver linings such as a rise in core GVA (GVA excluding agri and public services), a rise in July IIP (at 1.2%), pickup in activity in the cash-intensive sectors, pick up in rail freight and containers handled by ports.However, there is a second school of thought as well, which suggests that growth slowdown could be structural. With demonetisation and rollout of GST, a number of informal industries have now been forced to enter the formal setup."]
res = {}
for rec in text:
    for word in l:
        if word in rec:
            res[rec] = 1  # flag the record as soon as any keyword matches
            break
print(res)

This is a simple Python script, and I want to execute the same logic using PySpark (will this same code work?) in a distributed manner to reduce the execution time.

Can you please guide me on how to do this? I am sorry, as I am very new to Spark; your help will be much appreciated.

1 Answer

After instantiating a SparkContext and/or a SparkSession, you'll have to convert your list of records to a DataFrame. If you don't already have a session, a minimal sketch (assuming Spark 2.x; the app name is arbitrary) looks like this:
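
from pyspark.sql import SparkSession

# minimal session sketch; "keyword-lookup" is an arbitrary, illustrative app name
spark = SparkSession.builder.appName("keyword-lookup").getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext, used below for parallelize and broadcast

With spark and sc in hand, convert the records: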

df = spark.createDataFrame(
    sc.parallelize(
        [[rec] for rec in text]  # wrap each record so it becomes a one-column row
    ),
    ["text"]
)
df.show()

    +--------------------+
    |                text|
    +--------------------+
    |On the domestic f...|
    |Despite the afore...|
    |This raises the q...|
    +--------------------+
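
As an aside, spark.createDataFrame also accepts a plain Python list of tuples, so the explicit parallelize step is optional; an equivalent construction would be:

df = spark.createDataFrame([(rec,) for rec in text], ["text"])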

Now you can check, for each row, whether any of the words in l is present:

sc.broadcast(l)  # optional here: the returned handle is discarded (see the note below)
res = df.withColumn("res", df.text.rlike('|'.join(l)).cast("int"))
res.show()

    +--------------------+---+
    |                text|res|
    +--------------------+---+
    |On the domestic f...|  1|
    |Despite the afore...|  0|
    |This raises the q...|  0|
    +--------------------+---+
  • rlike performs regex matching; '|'.join(l) builds the pattern domestic|private, so a row is flagged if it contains any of the keywords. If your keywords may contain regex metacharacters, escape them first, e.g. '|'.join(re.escape(w) for w in l).
  • sc.broadcast copies an object to every executor so tasks don't have to fetch it from the driver. As written, though, the broadcast handle is discarded, and rlike bakes the joined pattern into the query plan anyway, so the call is not strictly needed here; a broadcast variable only pays off when l is used inside a UDF, as sketched below.
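
If you need exact substring semantics, matching the original Python loop, a UDF is one way to do it, and it is also where the broadcast variable actually earns its keep. A minimal sketch (contains_any is an illustrative name, not a Spark built-in):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

b_l = sc.broadcast(l)  # executors read b_l.value locally

@udf(IntegerType())
def contains_any(s):
    # plain substring test, mirroring the original nested loop
    return int(any(word in s for word in b_l.value))

res = df.withColumn("res", contains_any(df.text))
res.show()

Bear in mind that a Python UDF is usually slower than a built-in expression like rlike, so prefer rlike when regex semantics are acceptable.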

Hope this helps
