
I am studying PySpark, and here is how I set up the environment:

1. Ubuntu in a virtual machine
2. download Spark 2.4.0
3. install PySpark using pip
4. configure the environment variables:
    export SPARK_HOME="/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/"
    export PATH=$SPARK_HOME/bin:$PATH
    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook' 

Then I can use PySpark in Jupyter. The notebook starts with these lines to locate Spark:

import findspark
findspark.init()
import pyspark

In theory, I should be able to use PySpark now. But please look at the following two examples:

Example 1:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd

data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
     'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
     'Sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
     'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}}

data2 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
     'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35},
     'Fare': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
     'Pclass': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3}}

df1_pd = pd.DataFrame(data1, columns=data1.keys())
df2_pd = pd.DataFrame(data2, columns=data2.keys())

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(df1_pd)
df2 = spark.createDataFrame(df2_pd)
df1.show()
df1.filter(df1.Survived == 0).show()

Its result is:

+-----------+--------+--------+------+
|PassengerId|    Name|Survived|   Sex|
+-----------+--------+--------+------+
|          1|    Owen|       0|  male|
|          2|Florence|       1|female|
|          3|   Laina|       1|female|
|          4|    Lily|       1|female|
|          5| William|       0|  male|
+-----------+--------+--------+------+

+-----------+-------+--------+----+
|PassengerId|   Name|Survived| Sex|
+-----------+-------+--------+----+
|          1|   Owen|       0|male|
|          5|William|       0|male|
+-----------+-------+--------+----+

Example 2:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/examples/src/main/resources/people.csv",
                    header=True)
df.show()

df.filter(df.age > 20).show()

The result is:

+------------------+
|      name;age;job|
+------------------+
|Jorge;30;Developer|
|  Bob;32;Developer|
+------------------+

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-429ca6587743> in <module>()
 12 
 13 
---> 14 df.filter(df.age > 20 ).show()

/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
   1298         if name not in self.columns:
   1299             raise AttributeError(
-> 1300                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
   1301         jc = self._jdf.apply(name)
   1302         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'age'

You can see that in these two examples, the function show() gives different results. The first one looks correct while the second does not. And for the function filter(), the second one raises an error while the first one works fine.

I think the only difference between these two examples (though I may be wrong about this) is that the first one uses a small data frame generated in the code while the second one reads the data from a CSV file.

So, what should I do to read the data and analyze it properly?

1 Answer

When reading a CSV file, Spark splits the columns on commas (,) by default. If you check the file, you will find that the columns are separated with ;. In this case, Spark (correctly, given the default delimiter) interprets the whole row as a single column. This is why the result of .show() is different: you only have a single column in the second example, which is also why df.age does not exist and filter() fails.

To read the data properly, change the delimiter when reading:

spark.read.option("delimiter", ";").option("header", "true").csv(file)