Using Scala 3 with Apache Spark

As of the current release (v3.4.1), Apache Spark does not yet officially support Scala 3. However, a build for Scala 2.13 has been available since v3.2.0, and thanks to cross-building this means you can use Apache Spark from code written in Scala 3.

Prerequisites

If you are new to Apache Spark, first set up an Apache Spark environment together with Hadoop. If you are using Linux (Arch, for example), make sure you download the package marked Pre-built for Apache Hadoop (Scala 2.13); any version from 3.2.0 onwards should be fine.

Choose the correct version

After that, set the SPARK_HOME and HADOOP_HOME environment variables; they should share the same value.

If you are using Windows, things are a bit easier. Here I recommend scoop.sh for installing Spark itself and the other required software.

$ scoop bucket add versions
$ scoop install versions/spark-scala2.13
$ scoop install extras/hadoop-winutils

Then add the environment variables mentioned above in the system settings.

Create SBT Project

sbt (the Scala Build Tool) is quite useful for managing Scala projects. Here I use IntelliJ IDEA with the Scala plugin (nightly) for demonstration. Make sure you choose Scala 3 (that is what this post is written for).

Create Scala Project

After sbt has downloaded all the necessary files, add the dependencies for Apache Spark. Open build.sbt in the root of the project and modify it like this:

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "3.3.0" // the Scala 3 version you choose

lazy val root = (project in file("."))
  .settings(
    name := "name-of-your-project"
  )


val sparkVersion = "3.4.1" // any Spark version from 3.2.0 onwards

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % sparkVersion).cross(CrossVersion.for3Use2_13),
  ("org.apache.spark" %% "spark-sql" % sparkVersion).cross(CrossVersion.for3Use2_13)
)
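
With these two dependencies in place, a quick way to check that everything resolves is a small test program. Below is a minimal sketch; the @main method name, the app name, and the local[*] master are illustrative choices of mine, not anything Spark requires. Note that the tuple and case-class encoders normally pulled in via import spark.implicits._ rely on Scala 2 TypeTag implicits and are not derived under Scala 3; if you need them, the spark-scala3 library [1] provides Scala 3 encoders.

import org.apache.spark.sql.SparkSession

@main def sparkCheck(): Unit =
  val spark = SparkSession
    .builder()
    .appName("scala3-spark-check")
    .master("local[*]") // local mode for a quick test; drop this when submitting to a cluster
    .getOrCreate()

  // spark.range ships with a built-in encoder, so no Scala 2 TypeTag
  // implicits are needed here.
  val df = spark.range(1, 6).toDF("n")
  df.selectExpr("n", "n * n as n_squared").show()

  spark.stop()

If running this fails on Java 17 with a module ... does not export error, see the Fix JVM Options section below.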

If you want to use other Spark modules, for example spark-mllib for machine learning algorithms, look up the artifact name and append it to libraryDependencies, again with CrossVersion.for3Use2_13:

...

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % sparkVersion).cross(CrossVersion.for3Use2_13),
  ("org.apache.spark" %% "spark-sql" % sparkVersion).cross(CrossVersion.for3Use2_13),
  // add mllib
  ("org.apache.spark" %% "spark-mllib" % sparkVersion).cross(CrossVersion.for3Use2_13)
)
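
Once spark-mllib is on the classpath, its transformers and estimators can be used from Scala 3 as usual. Here is a minimal sketch; the names and the toy data are illustrative, and the DataFrame is built from Row objects with an explicit schema rather than Seq(...).toDF for the encoder-related reason mentioned above.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

@main def mllibCheck(): Unit =
  val spark = SparkSession
    .builder()
    .appName("scala3-mllib-check")
    .master("local[*]")
    .getOrCreate()

  // Rows plus an explicit schema avoid the TypeTag-based implicits.
  val schema = StructType(Seq(StructField("x", DoubleType), StructField("y", DoubleType)))
  val rows   = java.util.Arrays.asList(Row(1.0, 2.0), Row(2.0, 4.0), Row(3.0, 6.0))
  val df     = spark.createDataFrame(rows, schema)

  // Assemble the numeric column into the "features" vector that ML estimators expect.
  new VectorAssembler()
    .setInputCols(Array("x"))
    .setOutputCol("features")
    .transform(df)
    .show()

  spark.stop()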

Fix JVM Options

If you are using Java 17, or the build or run fails with an error like module xxx does not export xxxx to unnamed module, here is a simple fix.

Add the following arguments to the JVM options. You may not need all of them, but it does not hurt to add them all at once.

--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
--add-opens=java.base/sun.security.action=ALL-UNNAMED
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED

For IntelliJ, you can edit this directly in the Run/Debug Configurations dialog.

Run/debug configurations

Then it should work.
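
If you run the program through sbt (with sbt run) rather than through the IDE, the same flags can go into build.sbt instead. A small sketch, assuming you want the options applied to the run task; forking is required for javaOptions to take effect, and only a couple of the flags are repeated here for brevity:

// Run the application in a forked JVM so the options below are applied.
Compile / run / fork := true

Compile / run / javaOptions ++= Seq(
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
  // ...plus the rest of the --add-opens flags listed above
)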

References

[1] vincenzobaz, “Spark-Scala3,” GitHub [Source code]. Available: https://github.com/vincenzobaz/spark-scala3. Aug. 22, 2023. [Accessed: Aug. 23, 2023]

[2] H. Miao, “Apache Spark 3.3.0 breaks on Java 17 with `cannot access class sun.nio.ch.DirectBuffer`,” Stack Overflow [Online]. Available: https://stackoverflow.com/a/75795628. Mar. 20, 2023. [Accessed: Aug. 23, 2023]