Joshrodgers123
Advocate III

Spark XML does not work with pyspark

Has anyone been able to read XML files in a notebook using PySpark yet? I loaded the spark-xml_2.12-0.16.0.jar library and am trying to run the code below, but it does not seem to recognize the package. I have the same configuration in an Azure Synapse notebook, where it works perfectly. Interestingly, this does work in Fabric if I read the XML file using Scala instead.

I also tried this on the new 2.2 runtime, with no luck.

Code:

df = spark.read.format("xml").option("rowTag", "BillOfLading").load("Files/Freight/kls/raw/KACC20230724.xml")
 
Error: 
Py4JJavaError: An error occurred while calling o5568.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: xml. Please find packages at `https://spark.apache.org/third-party-projects.html`.
Joshrodgers123
Advocate III

That is where I have been loading it. 

Hi @Joshrodgers123

Could you please share your workspace ID and artifact ID? We would like to check whether this is an issue on our side or an error by design. It would also help if you could share the code snippet, so that we can understand why the issue occurs.

You can email this information to AzCommunity[at]Microsoft[dot]com with the following details:

Email subject: <Attn - v-nikhilan-msft  :Spark XML does not work with pyspark>

 

Thanks.

 

Hi @v-nikhilan-msft, I have emailed all of the requested details. Thanks.

Hi @Joshrodgers123 ,
Thanks for providing the information. I have given the details to the internal team. I will update you once I hear back from them.
Appreciate your patience.

Hi @Joshrodgers123

To work with XML files from PySpark, you need the spark-xml package: Link1

Try using the Scala API:

%%spark

// Read the XML file with the spark-xml data source, one row per <book> element
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("file:///synfs/nb_resource/builtin/demo.xml")

df.show(10)

 

You can find the tutorial here: Link2

In short, the format must be specified as .format("com.databricks.spark.xml").

I have already installed that package. The code you provided is Scala, which does work. PySpark does not work, though.

Hi @Joshrodgers123
Apologies for the delayed response.

Please take this up with Microsoft Support by raising a support ticket at this link: https://support.fabric.microsoft.com/en-US/support/.

Also, once you have opened the support ticket, please share the support case number here so that we can keep an eye on it.

Thanks

Here is the support ticket: 2311150040007106

@Josh Did you get a reply on how to use spark-xml with PySpark in Fabric? Thanks

It doesn't seem to be supported with PySpark. I got it working by loading the data with Scala and then doing my transformations with PySpark.
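
For anyone trying the Scala-then-PySpark route, here is a rough sketch of one way the handoff could look. The thread does not say how the data was actually passed between languages; this assumes that both cells share the same Spark session and that a temporary view is used as the bridge. The view name bol_raw and the helper load_from_scala_view are hypothetical.

```python
# Cell 1 would be a Scala cell, shown here as comments because it is not Python.
# It reuses the format string and rowTag quoted earlier in this thread:
#
#   %%spark
#   val raw = spark.read
#     .format("com.databricks.spark.xml")
#     .option("rowTag", "BillOfLading")
#     .load("Files/Freight/kls/raw/KACC20230724.xml")
#   raw.createOrReplaceTempView("bol_raw")   // hypothetical view name
#
# Cell 2 (PySpark) then picks the data up by view name.


def load_from_scala_view(spark, view_name="bol_raw"):
    """Return the DataFrame that the Scala cell registered as a temp view.

    Assumes the Scala and PySpark cells run in the same Spark session,
    so the temporary view is visible from both.
    """
    return spark.table(view_name)
```

In the notebook, df = load_from_scala_view(spark) would then give a regular PySpark DataFrame for the transformations.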

My workaround is loading the XML into a pandas DataFrame and then converting it to a PySpark DataFrame before writing to Delta tables.
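
A minimal sketch of that pandas route, under assumptions the thread does not confirm: each row element (BillOfLading in the original question) has flat child elements, and spark is the notebook's existing session. The sample XML, its field names, and both helper functions are hypothetical. The rows are parsed with the standard library here; pandas.read_xml is an alternative way to get the same pandas DataFrame.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file layout is not shown in the thread.
SAMPLE_XML = """<Shipments>
  <BillOfLading><Id>1</Id><Carrier>KLS</Carrier></BillOfLading>
  <BillOfLading><Id>2</Id><Carrier>ACME</Carrier></BillOfLading>
</Shipments>"""


def xml_to_rows(xml_text, row_tag):
    """Turn every <row_tag> element into a dict of its direct children's text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in element}
        for element in root.iter(row_tag)
    ]


def rows_to_spark(spark, rows):
    """Build a pandas DataFrame, then convert it to a Spark DataFrame.

    Needs pandas and pyspark; `spark` is the notebook's SparkSession.
    """
    import pandas as pd  # imported lazily so the parser above has no dependencies

    pandas_df = pd.DataFrame(rows)
    return spark.createDataFrame(pandas_df)


rows = xml_to_rows(SAMPLE_XML, "BillOfLading")
print(rows)
```

In a notebook, rows_to_spark(spark, rows).write.format("delta").saveAsTable("bills_of_lading") would write the result out, where the table name is made up for illustration. All values arrive as strings and the whole file is parsed on the driver, so this suits small files and will usually need explicit casts afterwards.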

Hi @Joshrodgers123 
Thanks for the details. We hope you will keep using this forum and encourage others to do the same.
Thanks

v-nikhilan-msft
Community Support

Hi @Joshrodgers123 ,

Thanks for using Fabric Community.

Apologies for the issue you have been facing. 

We are reaching out to the internal team to get more information related to your query and will get back to you as soon as we have an update.

Appreciate your patience.

Hi @Joshrodgers123 

Could you please try uploading the .jar file in library management, installing it, and then using it in the notebook?

Please upload the .jar file there and try running the PySpark code again.
Hope this helps. Please let us know if you have any further questions.

Can you provide a link to the .jar file or to the webpage where we can download it please? 
