
I need to read a file line by line, split each line into words, and perform operations on the words.

How do I do that?

I wrote the below code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

logFile = "/home/hadoop/spark-2.3.1-bin-hadoop2.7/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp1").getOrCreate()
logData = spark.read.text(logFile).cache()
logData.printSchema()
logDataLines = logData.collect()

# The line variable below seems to be of type Row. How do I perform similar
# operations on a Row, or how do I convert a Row to a string?

for line in logDataLines:
    words = line.select(explode(split(line, "\s+")))
    for word in words:
        print(word)
    print("----------------------------------")
  • By using collect() you pull all of the data onto the driver node; if you process it that way there is no need to use Spark at all. This question shows how to split a column in a dataframe and explode it: stackoverflow.com/questions/38210507/explode-in-pyspark – Commented Aug 28, 2018 at 1:52
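For the dataframe route the comment suggests, the per-line effect of `split(col, "\s+")` followed by `explode` can be sketched in plain Python without a SparkSession; `re.split` here stands in for Spark's regex-based `split`, and the helper name `explode_split` is made up for this illustration:

```python
import re

def explode_split(line):
    # Mirrors split(col("value"), "\s+") followed by explode():
    # one word per output row. Empty tokens (e.g. from leading
    # whitespace) are dropped here for clarity.
    return [w for w in re.split(r"\s+", line) if w]

for word in explode_split("read a file line by line"):
    print(word)
```

In Spark itself the same idea would be a `withColumn`/`select` with `explode(split(...))` on the `value` column, which keeps the work distributed instead of collecting rows to the driver.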

1 Answer

I think you should apply a map function to your rows; inside your own function you can do whatever you like with each row:

data = spark.read.text("/home/spark/test_it.txt").cache()

def someFunction(row):
    # row is a pyspark.sql.Row; row[0] is the line's text (the "value" column)
    wordlist = row[0].split(" ")
    result = list()
    for word in wordlist:
        result.append(word.upper())
    return result

data.rdd.map(someFunction).collect()

Output:

[[u'THIS', u'IS', u'JUST', u'A', u'TEST'], [u'TO', u'UNDERSTAND'], [u'THE', u'PROCESSING']]
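Because `someFunction` only indexes into its argument, you can check its logic without starting a SparkSession: a `pyspark.sql.Row` is indexed like a tuple, so a plain tuple works as a stand-in here.

```python
def someFunction(row):
    # row[0] is the line's text, as in the answer above
    wordlist = row[0].split(" ")
    result = list()
    for word in wordlist:
        result.append(word.upper())
    return result

# A plain tuple stands in for a pyspark.sql.Row, which indexes the same way.
print(someFunction(("this is just a test",)))  # ['THIS', 'IS', 'JUST', 'A', 'TEST']
```

The same function can then be passed unchanged to `data.rdd.map(...)`, where Spark calls it once per Row.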