I am studying PySpark. Here is how I set up the environment:
1. Ubuntu in a virtual machine
2. downloaded Spark 2.4.0
3. installed pyspark using pip
4. configured the environment variables:
export SPARK_HOME="/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
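To sanity-check that a notebook actually sees these variables, a minimal check (assuming the paths above are the ones in use) can be run in a Jupyter cell:

import os
print(os.environ.get("SPARK_HOME"))            # expect the spark-2.4.0-bin-hadoop2.7 path
print(os.environ.get("PYSPARK_DRIVER_PYTHON")) # expect 'jupyter'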
Then I can use PySpark in Jupyter. The first lines of every notebook are there to discover Spark:
import findspark
findspark.init()
import pyspark
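If SPARK_HOME were not picked up automatically, I believe findspark.init() can also be pointed at the installation directory explicitly (a sketch using my path; I have not needed this so far):

import findspark
findspark.init("/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7")  # pass the Spark install directory directly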
In theory, I should be able to use PySpark now. But please look at the following two examples:
Example 1:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'Sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}}
data2 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35},
         'Fare': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
         'Pclass': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3}}
df1_pd = pd.DataFrame(data1, columns=data1.keys())
df2_pd = pd.DataFrame(data2, columns=data2.keys())
spark = SparkSession.builder.getOrCreate()  # create the SparkSession used by createDataFrame
df1 = spark.createDataFrame(df1_pd)
df2 = spark.createDataFrame(df2_pd)
df1.show()
df1.filter(df1.Survived == 0).show()
Its result is:
+-----------+--------+--------+------+
|PassengerId| Name|Survived| Sex|
+-----------+--------+--------+------+
| 1| Owen| 0| male|
| 2|Florence| 1|female|
| 3| Laina| 1|female|
| 4| Lily| 1|female|
| 5| William| 0| male|
+-----------+--------+--------+------+
+-----------+-------+--------+----+
|PassengerId| Name|Survived| Sex|
+-----------+-------+--------+----+
| 1| Owen| 0|male|
| 5|William| 0|male|
+-----------+-------+--------+----+
Example 2:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/examples/src/main/resources/people.csv",
header=True)
df.show()
df.filter(df.age > 20).show()
The result is:
+------------------+
| name;age;job|
+------------------+
|Jorge;30;Developer|
| Bob;32;Developer|
+------------------+
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-429ca6587743> in <module>()
12
13
---> 14 df.filter(df.age > 20 ).show()
/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
1298 if name not in self.columns:
1299 raise AttributeError(
-> 1300 "'%s' object has no attribute '%s'" %
(self.__class__.__name__, name))
1301 jc = self._jdf.apply(name)
1302 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'age'
You can see that the show() function gives different results in these two examples: the first prints a proper table with one column per field, while the second puts the whole row into a single column. And for filter(), the second example raises an error while the first one works fine.
I think the only difference between the two examples (I may well be wrong about this) is that the first one uses a small DataFrame generated in the code, while the second one reads its data from a CSV file.
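Looking at the show() output in Example 2, the whole row seems to land in a single column named name;age;job, so I suspect the file is semicolon-delimited while Spark splits on commas by default. Below is a sketch of what I guess might help; the sep=";" and inferSchema=True options are my assumption, and I have not confirmed this is the proper way:

df = spark.read.csv("/home/feng/Downloads/spark-2.4.0-bin-hadoop2.7/examples/src/main/resources/people.csv",
                    header=True, sep=";", inferSchema=True)  # tell Spark the delimiter and let it guess column types
df.filter(df.age > 20).show()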
So, what should I do to read the data and analyze it properly?