AWS EMR Spark "No Module named pyspark"
I created a Spark cluster, SSH'd into the master, and launched the shell:
    MASTER=yarn-client ./spark/bin/pyspark
Then I ran the following:
    x = sc.textFile("s3://location/files.*")
    xt = x.map(lambda x: handlejson(x))
    table = sqlCtx.inferSchema(xt)
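(handlejson is never defined in the question; presumably it parses one JSON record per line. Purely as an assumption for readers reproducing this, a minimal stand-in might be:)

    import json

    def handlejson(line):
        # Hypothetical stand-in for the asker's parser: deserialize one
        # JSON document per input line. The real implementation is not shown.
        return json.loads(line)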
I got the following error:
    Error from python worker:
      /usr/bin/python: No module named pyspark
    PYTHONPATH was:
      /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.1.0-hadoop2.4.0.jar
    java.io.EOFException
            at java.io.DataInputStream.readInt(DataInputStream.java:392)
            at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:151)
            at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:78)
            at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:54)
            at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:97)
            at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
            at org.apache.spark.scheduler.Task.run(Task.scala:54)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)
I checked PYTHONPATH:
    >>> os.environ['PYTHONPATH']
    '/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip:/home/hadoop/spark/python/:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar'
and looked inside the jar for pyspark, and it's there:
    jar -tf /home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar | grep pyspark
    pyspark/
    pyspark/shuffle.py
    pyspark/resultiterable.py
    pyspark/files.py
    pyspark/accumulators.py
    pyspark/sql.py
    pyspark/java_gateway.py
    pyspark/join.py
    pyspark/serializers.py
    pyspark/shell.py
    pyspark/rddsampler.py
    pyspark/rdd.py
    ....
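(As a hedged diagnostic of my own, not from the original post: the checks above only cover the driver. You can ask the executors what they see; on a broken cluster the map below dies with the same traceback, which at least confirms the problem is on the worker side, and on a healthy one it prints the PYTHONPATH the YARN containers actually get:)

    import os

    # Driver side looks fine:
    print(os.environ['PYTHONPATH'])

    # Worker side: each YARN container reports its own PYTHONPATH.
    print(sc.parallelize([0])
            .map(lambda _: os.environ.get('PYTHONPATH', ''))
            .collect())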
Has anyone run into this before? Thanks!
You'll want to reference these Spark issues:
https://issues.apache.org/jira/browse/SPARK-3008
https://issues.apache.org/jira/browse/SPARK-1520

The solution (assuming you'd rather not rebuild the jar):
    unzip -d foo spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
    cd foo
    # if you don't have OpenJDK 1.6:
    # yum install -y java-1.6.0-openjdk-devel.x86_64
    /usr/lib/jvm/openjdk-1.6.0/bin/jar cvmf META-INF/MANIFEST.MF ../spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar .
    # don't neglect the dot at the end of the command
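The background, per the linked issues, is that assemblies built with the Java 7 jar tool get written in zip64 format once they exceed 65536 entries, and CPython's zipimport (which is what the workers use to load pyspark from the jar on their PYTHONPATH) can't read zip64 archives; repacking with the Java 6 jar tool avoids that. As a hedged sanity check, assuming the jar path from the question is unchanged, you can ask zipimport directly:

    import zipimport

    jar = '/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar'
    # Raises zipimport.ZipImportError on the unreadable (zip64) assembly;
    # succeeds once the jar has been repacked as above.
    importer = zipimport.zipimporter(jar)
    print(importer.find_module('pyspark'))  # non-None when pyspark is importable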
Tags: python, amazon-web-services, apache-spark, amazon-emr