
I'm following this installation guide but have the following problem when using graphframes:

from pyspark import SparkContext
sc = SparkContext()
!pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
from graphframes import *

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
in ()
----> 1 from graphframes import *

ImportError: No module named graphframes

I'm not sure whether it is possible to install the package this way, but I would appreciate your advice and help.

3 Comments
  • This might help: community.hortonworks.com/questions/61386/… Commented May 11, 2018 at 6:30
  • Looks like a nice workaround. Definitely worth a try, but I suspect there should be a more general solution. Commented May 11, 2018 at 6:37
  • scala> util.Properties.versionNumberString res0: String = 2.12.4 Commented May 11, 2018 at 13:31

4 Answers


Good question!

Open your .bashrc file and add export SPARK_OPTS="--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11". Once you have saved the file, close it and run source .bashrc.

Finally, open up your notebook and type:

from pyspark import SparkContext
sc = SparkContext()
# Distribute the graphframes jar and add it to the Python search path
sc.addPyFile('/home/username/spark-2.3.0-bin-hadoop2.7/jars/graphframes-0.5.0-spark2.1-s_2.11.jar')

After that, you should be able to run it.
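As a quick smoke test, something like the following should work (a minimal sketch: the toy data and the SparkSession created from sc are illustrative and not part of the original answer, and it assumes the jar path above matches your install):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Reuse the SparkContext created above to get a SparkSession for DataFrames
spark = SparkSession(sc)

# Illustrative toy graph: vertices need an "id" column, edges need "src" and "dst"
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()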


I'm using a Jupyter notebook in Docker, trying to get graphframes working. First, I used the method from https://stackoverflow.com/a/35762809/2202107:

import findspark
findspark.init()
import pyspark
import os

# PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created
SUBMIT_ARGS = "--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)
print(sc._conf.getAll())

Then, following this issue, we are finally able to import graphframes: https://github.com/graphframes/graphframes/issues/172

import sys
# --packages lists the downloaded graphframes archive under spark.submit.pyFiles,
# but it is not added to the driver's sys.path automatically, so add it by hand
pyfiles = str(sc.getConf().get(u'spark.submit.pyFiles')).split(',')
sys.path.extend(pyfiles)
from graphframes import *
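If the import still fails, a couple of illustrative sanity checks (assuming the context created above) can show whether the package actually made it into the configuration and onto the Python path:

# Illustrative checks: --packages should be reflected in spark.jars.packages,
# and the path extension above should have put graphframes on sys.path
print(sc.getConf().get('spark.jars.packages'))
print([p for p in sys.path if 'graphframes' in p.lower()])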

2 Comments

  • This is better than the answer by @Pranav A, since I don't need to fiddle with paths manually.
  • Thank you! This worked for me. Can anyone help me understand what the commands here are doing that makes it work?

The simplest way to start Jupyter with pyspark and graphframes is to start Jupyter from pyspark itself.

Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

The advantage of this is also that if you later want to run your code via spark-submit, you can use the same start command.
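In a notebook launched this way, the pyspark launcher already defines sc and spark for you, so in my experience no extra path handling is needed; a minimal first-cell check might look like this:

# sc and spark are predefined by the pyspark launcher in this setup
print(sc.version)
from graphframes import GraphFrame  # should import without any sys.path tweaks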


I went down a long, painful road to find a solution that works here.

I am working with the native Jupyter server within VS Code. There, I created a .env file:

SPARK_HOME=/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
PYSPARK_SUBMIT_ARGS="--driver-memory 2g --executor-memory 6g --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 pyspark-shell"

Then in my python notebook I have something that looks like the following:

from pyspark.sql.types import *
from graphframes import *

from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName('GraphFrames').getOrCreate()

You should see output showing the dependencies being resolved and fetched, something like this:

:: loading settings :: url = jar:file:/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/adam/.ivy2/cache
The jars for the packages stored in: /home/adam/.ivy2/jars
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-96a3a1f1-4ea4-4433-856b-042d0269ec1a;1.0
    confs: [default]
    found graphframes#graphframes;0.8.2-spark3.2-s_2.12 in spark-packages
    found org.slf4j#slf4j-api;1.7.16 in central
:: resolution report :: resolve 174ms :: artifacts dl 8ms
    :: modules in use:
    graphframes#graphframes;0.8.2-spark3.2-s_2.12 from spark-packages in [default]
    org.slf4j#slf4j-api;1.7.16 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------

After that, I was able to create some code with the relationships:

v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
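To round this out, the matching edge DataFrame and the GraphFrame itself can be built the same way (the edge rows below are illustrative, not from the original answer):

from graphframes import GraphFrame

# Illustrative edges for the vertices above
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()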

It should work fine. Just remember to align all your pyspark versions. I had to install the proper version of graphframes from a forked repo, since the PyPI package is behind on versions, so I used the PHPirates repo to do the install. There, graphframes has been compiled for version 3.2.0 of pyspark.

pip install "git+https://github.com/PHPirates/[email protected]#egg=graphframes&subdirectory=python"
pip install pyspark==3.2.0
