Unable to import graphframes with pyspark
You might hit the error below while trying to import the graphframes module into your PySpark session on an EMR cluster.
>>> print(spark.version)
2.1.0
>>> from graphframes import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named graphframes
>>>
A few additional steps are needed to make it work.
Follow the steps below to import the module:
1. Download graphframes-0.3.0-spark2.0-s_2.11.jar from http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
You may run the following command on the master node to download the JAR:
$ wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
2. Extract the JAR contents:
$ jar xf graphframes-0.3.0-spark2.0-s_2.11.jar
3. From the directory where you extracted the JAR, zip the "graphframes" Python package, keeping the graphframes/ directory as the top-level entry in the archive so the package stays importable:
$ zip -r graphframes.zip graphframes
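If the zip utility is not available on the node, the same archive can be built with Python's standard library. A minimal sketch (the zip_dir helper is my own name, not part of any Spark tooling); it archives a directory so that the directory itself remains the top-level entry:

```python
import os
import zipfile

def zip_dir(src_dir, zip_path):
    """Recursively archive src_dir so that src_dir itself is the
    top-level entry in the zip (keeps the package importable)."""
    base = os.path.dirname(os.path.abspath(src_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store entries relative to the parent of src_dir,
                # e.g. "graphframes/__init__.py".
                zf.write(full, os.path.relpath(full, base))
```

For example, `zip_dir("graphframes", "graphframes.zip")` run from the extraction directory produces the same layout as the zip command above.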
4. Copy the zipped file to your home directory:
$ cp graphframes.zip /home/hadoop/
5. Set environment variables.
Add these environment variables to your /etc/spark/conf/spark-env.sh file; PySpark will pick them up.
$ sudo vi /etc/spark/conf/spark-env.sh
and add the lines below:
export PYSPARK_PYTHON=python34
export PYTHONPATH=$PYTHONPATH:/home/hadoop/graphframes.zip:.
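Putting graphframes.zip on PYTHONPATH works because Python can import packages directly from a zip archive that appears on sys.path. A small self-contained sketch (demopkg and its VALUE are made-up names for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip containing a package, mimicking graphframes.zip.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demopkg/__init__.py", "VALUE = 42\n")

# Prepending the zip to sys.path is what the PYTHONPATH entry does.
sys.path.insert(0, zip_path)
import demopkg

print(demopkg.VALUE)  # -> 42
```

This is the same mechanism PySpark relies on when it finds graphframes.zip via the PYTHONPATH entry above.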
6. Then launch pyspark with the graphframes package. The --packages option makes the JVM-side classes available to Spark, while the zip on PYTHONPATH serves the Python side.
$ pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11
7. Make sure "graphframes.zip" is on the PYTHONPATH:
>>> import sys
>>> print(sys.path)
8. Now try to import the graphframes package in your PySpark shell.
>>> from graphframes import *