
I need to read a file line by line, split each line into words, and perform operations on the words.

How do I do that?

I wrote the below code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

logFile = "/home/hadoop/spark-2.3.1-bin-hadoop2.7/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp1").getOrCreate()
logData = spark.read.text(logFile).cache()
logData.printSchema()
logDataLines = logData.collect()

# The line variable below seems to be of type Row. How do I perform similar
# operations on a Row, or how do I convert a Row to a string?

for line in logDataLines:
    words = line.select(explode(split(line, "\s+")))
    for word in words:
        print(word)
    print("----------------------------------")
  • By using collect() you pull all of the data onto the driver node; if you process it that way there is no need to use Spark at all. This question shows how to split a column in a dataframe and explode it: stackoverflow.com/questions/38210507/explode-in-pyspark – Commented Aug 28, 2018 at 1:52
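For the dataframe route the comment suggests, the per-line effect of `split(col, "\s+")` followed by `explode` can be sketched in plain Python without a SparkSession; `re.split` here stands in for Spark's regex-based `split`, and the helper name `explode_split` is made up for this illustration:

```python
import re

def explode_split(line):
    # Mirrors split(col("value"), "\s+") followed by explode():
    # one word per output row. Empty tokens (e.g. from leading
    # whitespace) are dropped here for clarity.
    return [w for w in re.split(r"\s+", line) if w]

for word in explode_split("read a file line by line"):
    print(word)
```

In Spark itself the same idea would be a `withColumn`/`select` with `explode(split(...))` on the `value` column, which keeps the work distributed instead of collecting rows to the driver.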

1 Answer

I think you should apply a map function to your rows; inside your own function you can do whatever you like with each row:

data = spark.read.text("/home/spark/test_it.txt").cache()

def someFunction(row):
    # row is a pyspark.sql.Row; row[0] is the line's text (the "value" column)
    wordlist = row[0].split(" ")
    result = list()
    for word in wordlist:
        result.append(word.upper())
    return result

data.rdd.map(someFunction).collect()

Output:

[[u'THIS', u'IS', u'JUST', u'A', u'TEST'], [u'TO', u'UNDERSTAND'], [u'THE', u'PROCESSING']]
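Because `someFunction` only indexes into its argument, you can check its logic without starting a SparkSession: a `pyspark.sql.Row` is indexed like a tuple, so a plain tuple works as a stand-in here.

```python
def someFunction(row):
    # row[0] is the line's text, as in the answer above
    wordlist = row[0].split(" ")
    result = list()
    for word in wordlist:
        result.append(word.upper())
    return result

# A plain tuple stands in for a pyspark.sql.Row, which indexes the same way.
print(someFunction(("this is just a test",)))  # ['THIS', 'IS', 'JUST', 'A', 'TEST']
```

The same function can then be passed unchanged to `data.rdd.map(...)`, where Spark calls it once per Row.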