Spark Nuggets

REPL

Print the full launch command (including the classpath) used to start the shell:

SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell

Show the terms defined in the current Scala REPL session

$intp.definedTerms.foreach(println)

Enter paste mode (press Ctrl-D to finish)

:paste
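
A case where paste mode matters: a class and its companion object have to be compiled together, which line-by-line entry in the REPL does not do. A minimal sketch, where Counter is a hypothetical name that does not come from these notes:

```scala
// Type :paste, enter both definitions, then press Ctrl-D.
// The private constructor is only reachable from a true companion object,
// so the two definitions must be compiled as one unit.
class Counter private (val n: Int)

object Counter {
  def apply(n: Int): Counter = new Counter(n)
}
```

Afterwards Counter(3).n evaluates to 3.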

Misc

User classpath

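spark.yarn.user.classpath.first=true
spark.files.userClassPathFirst=true

Both keys are deprecated as of Spark 1.3; use spark.driver.userClassPathFirst and spark.executor.userClassPathFirst instead.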
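
The same setting can also be applied in code before the SparkContext is created. This is only a sketch: the app name is hypothetical, and spark.executor.userClassPathFirst is the Spark 1.3+ name for the older spark.files.userClassPathFirst key.

```scala
import org.apache.spark.SparkConf

// Ask executors to prefer user-supplied jars over Spark's own copies.
// On releases before 1.3, the key would be "spark.files.userClassPathFirst".
val sparkConf = new SparkConf()
  .setAppName("user-classpath-first-demo") // hypothetical app name
  .set("spark.executor.userClassPathFirst", "true")
```

Passing the key with --conf on the spark-submit command line is an alternative that avoids hard-coding it.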

Building Spark via Maven

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

Other Spark Libraries

  • http://spark-packages.org/

Scala - rename import

import org.apache.spark.mllib.linalg.{Vector => SparkVector}
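
The rename keeps Scala's own Vector usable under its normal name. The same technique can be tried without Spark on the classpath. The sketch below uses java.util types as a stand-in for the MLlib import, and RenameImportDemo and toJava are hypothetical names.

```scala
// Rename on import so two types with the same simple name can coexist.
import java.util.{ArrayList => JArrayList, List => JList}

object RenameImportDemo {
  // JList is java.util.List; the unqualified List is still Scala's immutable List.
  def toJava(xs: List[Int]): JList[Int] = {
    val out = new JArrayList[Int]()
    xs.foreach(x => out.add(x))
    out
  }

  def main(args: Array[String]): Unit = {
    println(toJava(List(1, 2, 3))) // prints [1, 2, 3]
  }
}
```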

Code Snippets

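import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
// conf is assumed to be a Hadoop Job; MailRecord is an Avro-generated record class
val recordsKeyValues = sc.newAPIHadoopRDD(conf.getConfiguration,
  classOf[AvroKeyInputFormat[MailRecord]],
  classOf[AvroKey[MailRecord]],
  classOf[NullWritable])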
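
The resulting RDD holds (AvroKey, NullWritable) pairs, so a typical next step is to pull the record out of each key. A possible sketch, assuming Spark and Avro are on the classpath (for example inside spark-shell). unwrapAvro is a hypothetical helper, not part of Spark or Avro.

```scala
import scala.reflect.ClassTag

import org.apache.avro.mapred.AvroKey
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD

// Hypothetical helper: drop the NullWritable value and take the Avro datum
// out of each AvroKey wrapper.
def unwrapAvro[T: ClassTag](records: RDD[(AvroKey[T], NullWritable)]): RDD[T] =
  records.map { case (key, _) => key.datum() }
```

With the snippet above this would be called as unwrapAvro(recordsKeyValues). Hadoop record readers may reuse the same object across records, so copying each datum before caching or collecting is worth considering.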