I am studying PySpark. Here is how I set up the environment:
1. Ubuntu in a virtual machine
2. downloaded Spark 2.4.0
3. installed pyspark using pip
4. configured the environment variables:
export SPARK_HOME="/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
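To sanity-check that a notebook actually sees these variables, a minimal check (assuming the paths above are the ones in use) can be run in a Jupyter cell:

import os
print(os.environ.get("SPARK_HOME"))            # expect the spark-2.4.0-bin-hadoop2.7 path
print(os.environ.get("PYSPARK_DRIVER_PYTHON")) # expect 'jupyter'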
Then I can use PySpark in Jupyter. The first lines of every notebook are there to discover Spark:
import findspark
findspark.init()
import pyspark
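If SPARK_HOME were not picked up automatically, I believe findspark.init() can also be pointed at the installation directory explicitly (a sketch using my path; I have not needed this so far):

import findspark
findspark.init("/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7")  # pass the Spark install directory directly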
In theory, I should be able to use PySpark now. But please look at the following two examples:
Example 1:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'Sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}}
data2 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35},
         'Fare': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
         'Pclass': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3}}
df1_pd = pd.DataFrame(data1, columns=data1.keys())
df2_pd = pd.DataFrame(data2, columns=data2.keys())
spark = SparkSession.builder.getOrCreate()  # create the SparkSession used by createDataFrame
df1 = spark.createDataFrame(df1_pd)
df2 = spark.createDataFrame(df2_pd)
df1.show()
df1.filter(df1.Survived == 0).show()
Its result is:
+-----------+--------+--------+------+
|PassengerId| Name|Survived| Sex|
+-----------+--------+--------+------+
| 1| Owen| 0| male|
| 2|Florence| 1|female|
| 3| Laina| 1|female|
| 4| Lily| 1|female|
| 5| William| 0| male|
+-----------+--------+--------+------+
+-----------+-------+--------+----+
|PassengerId| Name|Survived| Sex|
+-----------+-------+--------+----+
| 1| Owen| 0|male|
| 5|William| 0|male|
+-----------+-------+--------+----+
Example 2:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/examples/src/main/resources/people.csv",
header=True)
df.show()
df.filter(df.age > 20).show()
The result is:
+------------------+
| name;age;job|
+------------------+
|Jorge;30;Developer|
| Bob;32;Developer|
+------------------+
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-429ca6587743> in <module>()
12
13
---> 14 df.filter(df.age > 20 ).show()
/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
1298 if name not in self.columns:
1299 raise AttributeError(
-> 1300 "'%s' object has no attribute '%s'" %
(self.__class__.__name__, name))
1301 jc = self._jdf.apply(name)
1302 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'age'
You can see that the show() function gives different results in these two examples: the first prints a proper table with one column per field, while the second puts the whole row into a single column. And for filter(), the second example raises an error while the first one works fine.
I think the only difference between the two examples (I may well be wrong about this) is that the first one uses a small DataFrame generated in the code, while the second one reads its data from a CSV file.
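Looking at the show() output in Example 2, the whole row seems to land in a single column named name;age;job, so I suspect the file is semicolon-delimited while Spark splits on commas by default. Below is a sketch of what I guess might help; the sep=";" and inferSchema=True options are my assumption, and I have not confirmed this is the proper way:

df = spark.read.csv("/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/examples/src/main/resources/people.csv",
                    header=True, sep=";", inferSchema=True)  # tell Spark the delimiter and let it guess column types
df.filter(df.age > 20).show()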
So, what should I do to read the data and analyze it properly?