Unable to import graphframes with pyspark

You might hit the error below when trying to import the graphframes module into your pyspark session on an EMR cluster.

 

>>> print(spark.version)
2.1.0
>>> from graphframes import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named graphframes
>>>

 

A few additional steps are needed to make it work.

Follow the steps below to import the module:

1. Download the “0.3.0-spark2.0-s_2.11” JAR from http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
You can run the following command on the master node to download the JAR:

$ wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar

2. Extract the JAR contents:

$ jar xf graphframes-0.3.0-spark2.0-s_2.11.jar

3. Navigate to the “graphframes” directory and zip its contents:

$ zip graphframes.zip -r *

4. Copy the zipped file to your home directory:

$ cp graphframes.zip /home/hadoop/
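
When a zip file is on the PYTHONPATH, Python looks for modules and packages among the entries at the archive root, so it can help to see what actually ended up there before moving on. Below is a minimal sketch, not part of the original steps: the helper name top_level_entries is made up for illustration, and the path in the usage comment assumes the copy location used above.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted names found at the root of the zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the first path component of every archive member.
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})


# Example usage (assumed path, adjust to where your zip lives):
#   print(top_level_entries("/home/hadoop/graphframes.zip"))
```

For "import graphframes" to resolve from the zip itself, the returned list needs to include a graphframes entry.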

5. Set the environment variables.

Add these environment variables to your “/etc/spark/conf/spark-env.sh” file. PySpark will use them.

$ sudo vi /etc/spark/conf/spark-env.sh

and add the lines below:

export PYSPARK_PYTHON=python34
export PYTHONPATH=$PYTHONPATH:/home/hadoop/graphframes.zip:.

6. Then launch pyspark with graphframes:

$ pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11

7. Make sure “graphframes.zip” is in the PYTHONPATH:

>>> import sys
>>> print(sys.path)
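
The printed list can be long, so here is a small sketch that filters it down to the entry you care about. The function name zip_on_path and its parameters are assumptions for illustration; it only relies on sys.path from the standard library.

```python
import sys


def zip_on_path(zip_name="graphframes.zip", path=None):
    """Return the path entries that end with the given zip file name."""
    # Fall back to the interpreter's real search path when none is passed in.
    entries = sys.path if path is None else path
    return [entry for entry in entries if entry.endswith(zip_name)]


# Example usage inside the pyspark shell:
#   print(zip_on_path())
```

An empty list means the zip is not on the search path, in which case the PYTHONPATH line in spark-env.sh is the first thing to re-check.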

8. Now try to import the graphframes package in your pyspark shell.

>>> from graphframes import *
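
If you would rather check that the package can be found without pulling every name into the shell, something like the sketch below may help. It uses importlib.util.find_spec, which is available from Python 3.4 onward; the helper name module_available is hypothetical.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module or package can be located."""
    # find_spec returns None when nothing importable matches the name.
    return importlib.util.find_spec(name) is not None


# Example usage inside the pyspark shell:
#   print(module_available("graphframes"))
```

This only confirms that Python can locate the package; the Scala side still comes from the --packages option used when launching pyspark.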

