Joshrodgers123
Advocate V

Spark XML does not work with pyspark

Has anyone been able to read XML files in a notebook using PySpark yet? I loaded the spark-xml_2.12-0.16.0.jar library and am trying to run the code below, but it does not seem to recognize the package. I have the same configuration in an Azure Synapse notebook and it works perfectly. The interesting thing is that this does work in Fabric if I read the XML file using Scala instead.

 

I just tried this on the new 2.2 runtime as well, with no luck.

 

Code:

df = spark.read.format("xml").option("rowTag", "BillOfLading").load("Files/Freight/kls/raw/KACC20230724.xml")
 
Error: 
Py4JJavaError: An error occurred while calling o5568.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: xml. Please find packages at `https://spark.apache.org/third-party-projects.html`.
1 ACCEPTED SOLUTION
Fgarcia1986
Regular Visitor

 

One way that I found is:

1 - Create an environment


2 - Upload the file spark-xml_2.12-0.17.0.jar


3 - Open your notebook, choose Spark (Scala) as the language, and then place the code below:

%%configure -f
{"conf": {"spark.jars.packages": "com.databricks:spark-xml_2.12:0.16.0"}}

 

IMPORTANT: This must be the first code run in the session. You can use the Workspace Default environment; you don't have to use the environment you created. I don't know why, but it worked.
 
Then you can change your language to PySpark (Python) and read the XML.
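 
A rough sketch of that PySpark cell, reusing the format string, rowTag, and file path already shown elsewhere in this thread (adjust the path for your own files):

# Read the XML with the spark-xml data source pulled in by the %%configure cell.
df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "BillOfLading")
    .load("Files/Freight/kls/raw/KACC20230724.xml")
)
df.show(10)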
 
It takes 2 to 3 minutes to execute.
 
Let me know if you have any doubts.
 
I hope this works for everyone.
 
Cheers


16 REPLIES
Joshrodgers123
Advocate V

That is where I have been loading it. 

Hi  @Joshrodgers123 

Can you please share your workspace ID and artifact ID? We'd like to check whether this is an issue on our side or the error is by design. It would be great if you could also share the code snippet along with it, so we can understand why there is an issue.

You can send us this information via email to AzCommunity[at]Microsoft[dot]com with the details below:

Email subject: <Attn - v-nikhilan-msft: Spark XML does not work with pyspark>

 

Thanks.

 

Hi @v-nikhilan-msft, I have emailed all of the requested details. Thanks.

Hi @Joshrodgers123 ,
Thanks for providing the information. I have given the details to the internal team. I will update you once I hear back from them.
Appreciate your patience.

Hi  @Joshrodgers123 

To work with XML files in PySpark, we have to use the spark-xml package (Link1).

Try using the Scala API

%%spark
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("file:///synfs/nb_resource/builtin/demo.xml")

df.show(10)

 

 You can find the tutorial here: Link2

 

So basically, the format must be .format("com.databricks.spark.xml").

 

 

I have already installed that package. The code you provided is Scala, which does work. PySpark does not work, though.

Hi @Joshrodgers123 
Apologies for the delay in response.

I would request that you go ahead with Microsoft support for this. Please raise a support ticket via this link: https://support.fabric.microsoft.com/en-US/support/.

Also, once you have opened the support ticket, please share the support case # here so that we can keep an eye on it.

Thanks

Here is the support ticket: 2311150040007106

@Josh Did you get a reply on how to use spark-xml with PySpark in Fabric? Thanks

It doesn't seem to be supported with PySpark. I got it working by loading the data with Scala and then doing my transformations with PySpark.
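 
A minimal sketch of that mixed-cell pattern, assuming the Scala cell registers a temp view; the view name bol_raw and the output table name are illustrative, not from the thread:

# Assumes a preceding Scala cell loaded the XML and exposed it, e.g.:
#   %%spark
#   val df = spark.read.format("com.databricks.spark.xml")
#     .option("rowTag", "BillOfLading")
#     .load("Files/Freight/kls/raw/KACC20230724.xml")
#   df.createOrReplaceTempView("bol_raw")

# Scala and PySpark cells share the same Spark session, so the temp view
# created above is visible from Python.
df = spark.table("bol_raw")

# Continue transformations in PySpark and land the result as a Delta table.
df.dropDuplicates().write.mode("overwrite").format("delta").saveAsTable("bill_of_lading")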

My workaround is loading the XML into a pandas DataFrame and then converting it to a PySpark DataFrame before writing to Delta tables.
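 
A minimal sketch of that pandas route, assuming pandas >= 1.3 with lxml available for read_xml, and assuming the default lakehouse is mounted at /lakehouse/default; the output table name is illustrative:

import pandas as pd

# Parse the XML with pandas; the xpath targets the repeating BillOfLading
# element that was used as rowTag in the original question.
pdf = pd.read_xml(
    "/lakehouse/default/Files/Freight/kls/raw/KACC20230724.xml",
    xpath=".//BillOfLading",
)

# Convert to a Spark DataFrame and write it out as a Delta table.
sdf = spark.createDataFrame(pdf)
sdf.write.mode("overwrite").format("delta").saveAsTable("bill_of_lading")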

Hi @Joshrodgers123 
Thanks for the details. We expect you to keep using this forum and also motivate others to do the same.
Thanks

v-nikhilan-msft
Community Support

Hi @Joshrodgers123 ,

Thanks for using Fabric Community.

Apologies for the issue you have been facing. 

We are reaching out to the internal team to get more information related to your query and will get back to you as soon as we have an update.

Appreciate your patience.

Hi @Joshrodgers123 

Could you please try uploading the .jar file in library management, installing it, and then using it in the notebook?


Please upload the .jar file there and try running the PySpark code.
Hope this helps. Please let us know if you have any further questions.

Can you provide a link to the .jar file or to the webpage where we can download it, please?
