Spark’s default build strategy is to assemble a jar including all of its dependencies. This can be cumbersome when doing iterative development. When developing locally, it is possible to create an assembly jar including all of Spark’s dependencies and then re-package only Spark itself when making changes.
$ build/sbt clean package
$ ./bin/spark-shell
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
# ... do some local development ... #
$ build/sbt compile
# ... do some local development ... #
$ build/sbt compile
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell
# You can also use ~ to let sbt do incremental builds on file changes without running a new sbt session every time
$ build/sbt ~compile
Git provides a mechanism for fetching remote pull requests into your own local repository. This is useful when reviewing code or testing patches locally. If you haven’t yet cloned the Spark Git repository, use the following command:
$ git clone https://github.com/apache/spark.git
$ cd spark
To enable this feature you’ll need to configure the git remote repository to fetch pull request
data. Do this by modifying the .git/config file inside of your Spark directory (note that the
remote may be named something other than “origin” if you’ve configured it differently):
[remote "origin"]
url = [email protected]:apache/spark.git
fetch = +refs/heads/*:refs/remotes/origin/*
fetch = +refs/pull/*/head:refs/remotes/origin/pr/* # Add this line
Once you’ve done this, you can fetch remote pull requests:
# Fetch remote pull requests
$ git fetch origin
# Checkout a remote pull request
$ git checkout origin/pr/112
# Create a local branch from a remote pull request
$ git checkout origin/pr/112 -b new-branch
# sbt
$ build/sbt dependency-tree
# Maven
$ build/mvn -DskipTests install
$ build/mvn dependency:tree
# sbt
$ build/sbt package
# Maven
$ build/mvn package -DskipTests -pl assembly
If the following error occurs when running ScalaTest
An internal error occurred during: "Launching XYZSuite.scala".
java.lang.NullPointerException
It is due to an incorrect Scala library in the classpath. To fix it:
Right click on the project and select Build Path | Configure Build Path, then Add Library | Scala Library, and remove the scala-library-2.10.4.jar entry from lib_managed\jars.
In the event of “Could not find resource path for Web UI: org/apache/spark/ui/static”, it’s due to a classpath issue (some classes were probably not compiled). To fix this, it is sufficient to run a test from the command line:
build/sbt "test-only org.apache.spark.rdd.SortingSuite"
When running tests for a pull request on Jenkins, you can add special phrases to the title of your pull request to change testing behavior. This includes:
[test-maven] - signals to test the pull request using Maven
[test-hadoop1.0] - signals to test using Spark’s Hadoop 1.0 profile (other options include Hadoop 2.0, 2.2, and 2.3)
You can use an IntelliJ Imports Organizer from Aaron Davidson to help you organize the imports in your code. It can be configured to match the import ordering from the style guide.
While many of the Spark developers use SBT or Maven on the command line, the most common IDE we
use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get
free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.
To create a Spark project for IntelliJ:
Go to File -> Import Project, locate the spark source directory, and select “Maven Project”.
Build profiles that would be enabled with -P[profile name] on the command line may also be enabled on the
Profiles screen in the Import wizard. For example, if developing for Hadoop 2.4 with YARN support,
enable profiles yarn and hadoop-2.4. These selections can be changed later by accessing the
“Maven Projects” tool window from the View menu, and expanding the Profiles section.
Other tips:
Compilation may fail with an error like the following:
/Users/irashid/github/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
Error:(147, 9) value q is not a member of StringContext
Note: implicit class Evaluate2 is not applicable here because it comes after the application point and it lacks an explicit result type
q"""
^
Eclipse can be used to develop and test Spark. The following configuration is known to work: download the Scala IDE bundle from the Scala IDE download page, which comes pre-installed with ScalaTest. Alternatively, use the Scala IDE update site or Eclipse Marketplace.
SBT can create Eclipse .project and .classpath files. To create these files for each Spark sub
project, use this command:
sbt/sbt eclipse
To import a specific project, e.g. spark-core, select File | Import | Existing Projects into
Workspace. Do not select “Copy projects into workspace”.
If you want to develop on Scala 2.10 you need to configure a Scala installation for the
exact Scala version that’s used to compile Spark.
Since Scala IDE bundles the latest versions (2.10.5 and 2.11.8 at this point), you need to add one
in Eclipse Preferences -> Scala -> Installations by pointing to the lib/ directory of your
Scala 2.10.5 distribution. Once this is done, select all Spark projects and right-click,
choose Scala -> Set Scala Installation and point to the 2.10.5 installation.
This should clear all errors about invalid cross-compiled libraries. A clean build should succeed now.
ScalaTest can execute unit tests by right clicking a source file and selecting Run As | Scala Test.
If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini
in the Eclipse install directory. Increase the following setting as needed:
--launcher.XXMaxPermSize
256M
Packages are built regularly off of Spark’s master branch and release branches. These provide Spark developers access to the bleeding-edge of Spark master or the most recent fixes not yet incorporated into a maintenance release. These should only be used by Spark developers, as they may have bugs and have not undergone the same level of testing as releases. Spark nightly packages are available at:
Spark also publishes SNAPSHOT releases of its Maven artifacts for both master and maintenance branches on a nightly basis. To link against a SNAPSHOT you must add the ASF snapshot repository, http://repository.apache.org/snapshots/, to your build. Note that SNAPSHOT artifacts are ephemeral and may change or be removed.
groupId: org.apache.spark
artifactId: spark-core_2.10
version: 1.5.0-SNAPSHOT
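For example, a minimal sbt sketch, assuming a Scala 2.10 build so that %% resolves to the spark-core_2.10 artifact listed above:
// Sketch only: add the ASF snapshot repository and depend on a nightly SNAPSHOT.
resolvers += "Apache Snapshots" at "http://repository.apache.org/snapshots/"
// %% appends the Scala binary version (2.10 here), giving spark-core_2.10.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0-SNAPSHOT"
An equivalent Maven setup adds the same URL as a <repository> in the pom.xml and uses the coordinates listed above.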
Here are instructions on profiling Spark applications using YourKit Java Profiler.
On a Spark EC2 cluster:
Unpack the YourKit profiler archive on the master node (in /root in our case): tar xvjf yjp-12.0.5-linux.tar.bz2
Copy the expanded YourKit files to each node: ~/spark-ec2/copy-dir /root/yjp-12.0.5
Configure the Spark JVMs to use the YourKit profiling agent by editing ~/spark/conf/spark-env.sh and adding the lines
SPARK_DAEMON_JAVA_OPTS+=" -agentpath:/root/yjp-12.0.5/bin/linux-x86-64/libyjpagent.so=sampling"
export SPARK_DAEMON_JAVA_OPTS
SPARK_JAVA_OPTS+=" -agentpath:/root/yjp-12.0.5/bin/linux-x86-64/libyjpagent.so=sampling"
export SPARK_JAVA_OPTS
Copy the updated configuration to each node: ~/spark-ec2/copy-dir ~/spark/conf/spark-env.sh
Restart the cluster: ~/spark/bin/stop-all.sh and ~/spark/bin/start-all.sh
To connect the YourKit desktop application to the profiler agents, you will need to open their ports in the cluster’s EC2 security groups. In the AWS Management Console, select Security Groups from the Network & Security section on the left side of the page.
Find the security groups corresponding to your cluster; if you launched a cluster named test_cluster,
then you will want to modify the settings for the test_cluster-slaves and test_cluster-master
security groups. For each group, select it from the list, click the Inbound tab, and create a
new Custom TCP Rule opening the port range 10001-10010. Finally, click Apply Rule Changes.
Make sure to do this for both security groups.
Note: by default, spark-ec2 re-uses security groups: if you stop this cluster and launch another
cluster with the same name, your security group settings will be re-used.
Finally, connect the YourKit desktop application to the profiler agent on your Spark master or worker machine, e.g. ec2--.compute-1.amazonaws.com.
Please see the full YourKit documentation for the full list of profiler agent startup options.
When running Spark tests through SBT, add javaOptions in Test += "-agentpath:/path/to/yjp"
to SparkBuild.scala to launch the tests with the YourKit profiler agent enabled.
The platform-specific paths to the profiler agents are listed in the
YourKit documentation.
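A minimal sketch of that setting, reusing the Linux agent path from the EC2 example above (substitute the platform-specific path from the YourKit documentation):
// Add to the test settings in SparkBuild.scala (or a build.sbt) to profile tests with YourKit.
// The agent path here is just the example from above; adjust it for your platform and YourKit version.
javaOptions in Test += "-agentpath:/root/yjp-12.0.5/bin/linux-x86-64/libyjpagent.so=sampling"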