In the previous blog, Profiling Microsoft Fabric Spark Notebooks with Sparklens, we covered how to run Sparklens to profile and tune the performance of your Spark notebooks in Microsoft Fabric. That blog used a custom Sparklens JAR, because the Sparklens JARs available in the Maven Central repo support only Spark 2.X, which is not compatible with Microsoft Fabric. In this blog, you will learn how to build the Sparklens JAR for Spark 3.X so that it can be used in Microsoft Fabric.
To learn what Sparklens is, how to run it in a Microsoft Fabric Spark notebook, and how to use it to optimize performance, see this blog: Profiling Microsoft Fabric Spark Notebooks with Sparklens
Sparklens is an open-source tool for profiling Spark jobs and notebooks. The latest JARs in the Maven Central repo support Spark 2.X and do not work with Spark 3.X. Here are the modifications you need to make to run it on Spark 3.X.
Note: Sparklens is not owned or maintained by Microsoft, so it is crucial that you implement all necessary security measures, just as you would when using any other package or library. Please review the Sparklens license details before use.
1. Set Up the Build Tool:
Sparklens is developed in Scala. To package a Scala project, you can use build tools like sbt (simple build tool). Ensure you have sbt installed on your local machine. This blog uses sbt version 0.13.18.
2. Clone the Repository:
Clone the Sparklens GitHub repository (qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark) to your local machine:
git clone https://github.com/qubole/sparklens.git
3. Prepare Your Development Environment:
Use your preferred IDE to make the necessary changes. This blog uses Visual Studio Code. Open the terminal and navigate to the cloned sparklens directory:
cd sparklens
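The sbt launcher reads the sbt version to use from project/build.properties, so before building it can help to confirm which version the cloned project pins. The sketch below is one hedged way to check and, if needed, set that value to the 0.13.18 used in this blog. The helper names pinned_sbt_version and pin_sbt_version are hypothetical, and whether the cloned project already contains a build.properties file is an assumption to verify locally.

```shell
# Print the sbt version pinned in <project_dir>/project/build.properties.
# Prints nothing if the file or the sbt.version key is missing.
pinned_sbt_version() {
  [ -f "$1/project/build.properties" ] || return 0
  sed -n 's/^[[:space:]]*sbt\.version[[:space:]]*=[[:space:]]*//p' \
    "$1/project/build.properties" | head -n 1
}

# Pin the sbt version for <project_dir>: drop any existing sbt.version entry,
# keep all other entries, and append the requested version.
pin_sbt_version() {
  mkdir -p "$1/project"
  touch "$1/project/build.properties"
  grep -v '^[[:space:]]*sbt\.version' "$1/project/build.properties" \
    > "$1/project/build.properties.tmp" || true
  echo "sbt.version=$2" >> "$1/project/build.properties.tmp"
  mv "$1/project/build.properties.tmp" "$1/project/build.properties"
}

# Example: run from the root of the cloned repository.
if [ -f project/build.properties ]; then
  echo "pinned sbt version: $(pinned_sbt_version .)"
fi
```

To match this blog, you could then run pin_sbt_version . 0.13.18 from the repository root. This only changes which sbt version the launcher selects; it does not install sbt itself.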
4. Modify plugins.sbt:
Update the plugins.sbt file to comment out the existing sbt-spark-package plugin line (addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")), so that the file reads:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
resolvers += "Spark Package Main Repo" at "https://dl.bintray.com/spark-packages/maven"
// addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")
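If you would rather script this edit than make it by hand, the following is a minimal sketch. It assumes the file lives at project/plugins.sbt (the conventional sbt location, not a path stated in this blog), and the function name comment_out_spark_package_plugin is hypothetical.

```shell
# Comment out the sbt-spark-package plugin line in the given plugins.sbt.
# Lines that are already commented out do not match the pattern,
# so the function is safe to run more than once.
comment_out_spark_package_plugin() {
  sed 's|^[[:space:]]*addSbtPlugin("org.spark-packages"|// &|' "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}

# Example: run from the root of the cloned sparklens repository.
# The path below is an assumption; adjust it if your checkout differs.
if [ -f project/plugins.sbt ]; then
  comment_out_spark_package_plugin project/plugins.sbt
fi
```

The sketch writes to a temporary file and moves it into place instead of using sed -i, because the in-place flag behaves differently between GNU and BSD sed.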
5. Update build.sbt:
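Make the following changes to the build.sbt file: set the Scala version to 2.12.0, replace the sbt-spark-package settings (spName, sparkVersion, spAppendScalaVersion) with plain vals, set the Spark version to 3.0.0, and add the spark-sql dependency.
Here are the updated sections of build.sbt: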
name := "sparklens"
organization := "com.qubole"
scalaVersion := "2.12.0"
crossScalaVersions := Seq("2.10.6", "2.12.0")
// spName := "qubole/sparklens"
// sparkVersion := "2.0.0"
// spAppendScalaVersion := true
val spName = "qubole/sparklens"
val sparkVersion = "3.0.0"
val spAppendScalaVersion = true
// libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion.version % "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"
6. Update QuboleJobListener.scala:
In QuboleJobListener.scala (src/main/scala/com/qubole/sparklens/QuboleJobListener.scala), change attemptId to attemptNumber() as shown in this code snippet:
override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
  val stageTimeSpan = stageMap(stageCompleted.stageInfo.stageId)
  if (stageCompleted.stageInfo.completionTime.isDefined) {
    stageTimeSpan.setEndTime(stageCompleted.stageInfo.completionTime.get)
  }
  if (stageCompleted.stageInfo.submissionTime.isDefined) {
    stageTimeSpan.setStartTime(stageCompleted.stageInfo.submissionTime.get)
  }
  if (stageCompleted.stageInfo.failureReason.isDefined) {
    // stage failed
    val si = stageCompleted.stageInfo
    failedStages += s""" Stage ${si.stageId} attempt ${si.attemptNumber()} in job ${stageIDToJobID(si.stageId)} failed.
                    Stage tasks: ${si.numTasks}
                    """
    stageTimeSpan.finalUpdate()
  } else {
    val jobID = stageIDToJobID(stageCompleted.stageInfo.stageId)
    val jobTimeSpan = jobMap(jobID)
    jobTimeSpan.addStage(stageTimeSpan)
    stageTimeSpan.finalUpdate()
  }
}
7. Update HDFSConfigHelper.scala:
In HDFSConfigHelper.scala (src/main/scala/com/qubole/sparklens/helper/HDFSConfigHelper.scala), the code relies on the SparkHadoopUtil class, which has been changed to a private class in Spark 3. Modify the helper to obtain the Hadoop configuration from a SparkSession instead, as shown below:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HDFSConfigHelper {

  // SparkHadoopUtil is private in Spark 3, so its import is dropped and the
  // Hadoop configuration is read from the active (or newly created) SparkSession.
  def getHadoopConf(sparkConfOptional: Option[SparkConf]): Configuration = {
    if (sparkConfOptional.isDefined) {
      // Build (or reuse) a session configured with the supplied SparkConf.
      val spark = SparkSession.builder.config(sparkConfOptional.get).getOrCreate()
      spark.sparkContext.hadoopConfiguration
    } else {
      // No SparkConf supplied: fall back to the default session.
      val spark = SparkSession.builder.getOrCreate()
      spark.sparkContext.hadoopConfiguration
    }
  }
}
8. Compile the Revised Code: Run "sbt compile" to compile the project.
9. Package the Compiled Code: Run "sbt package" to package the project as a JAR file.
10. Use the JAR: You can now use the JAR (target/scala-2.12/sparklens_2.12-0.3.2.jar) to run profiling in a Microsoft Fabric notebook, as described in Profiling Microsoft Fabric Spark Notebooks with Sparklens.
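The JAR path above follows sbt's usual naming for packaged artifacts: target/scala-<Scala binary version>/<name>_<Scala binary version>-<project version>.jar. As a quick sanity check before uploading, the sketch below rebuilds that path and lists the listener class inside the archive. The function name expected_jar_path is hypothetical, and the check assumes the unzip utility is installed (jar tf would work similarly).

```shell
# Build the path where sbt package is expected to write the JAR.
# Arguments: <project name> <Scala binary version> <project version>
expected_jar_path() {
  echo "target/scala-${2}/${1}_${2}-${3}.jar"
}

# Values taken from the JAR name mentioned in this blog.
jar_path="$(expected_jar_path sparklens 2.12 0.3.2)"

# Example: run from the root of the sparklens repository after sbt package.
if [ -f "$jar_path" ]; then
  # A JAR is a zip archive, so unzip can list its entries.
  unzip -l "$jar_path" | grep 'com/qubole/sparklens/QuboleJobListener.class'
else
  echo "JAR not found at $jar_path; run sbt package first"
fi
```

If the grep prints nothing, the listener class edited in step 6 did not make it into the package, which usually points to a failed or stale build.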
qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com)
Profiling Microsoft Fabric Spark Notebooks with Sparklens | Microsoft Fabric Blog | Microsoft Fabric