Amir_m_h
New Member

XML parsing in Fabric notebooks

Hi,


TLDR:

I have trouble parsing an XML file in a Fabric lakehouse. The Spark XML parser somehow misses the _VALUE of one of the tags.

The abstract portion of the XML looks like the snippet below, and I have trouble extracting the description:

<abstract id="abstract">
<p id="p-0001" num="0000">The invention involves a system and method for plant agriculture. The system includes a sheet of barrier material upon which soil is placed. A drip tube is placed in the soil so that the soil can be shaped, and the sheet material is placed over the top of the soil so that the edges of the sheet overlap. A seal, which may be soil, is placed on the overlapping edges of the sheet to hold the sheet material in a closed position. The crop is planted through the sheet material, and growth occurs through the aperture created during planting.</p>
</abstract>

 



The file actually contains around 8K XML documents concatenated together.
I need to load this into the Fabric lakehouse and process it.

My approach has been:

  • Load it into a dataframe
  • Remove all per-document XML and DOCTYPE declarations, i.e. lines like
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE us-patent-application SYSTEM "us-patent-application-v46-2022-02-17.dtd" [ ]>
  • Keep one declaration at the top of the file in order to treat it as one big XML file

Then I have written the dataframe as text back to the lakehouse.
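For reference, a minimal sketch of that cleaning step, assuming the raw concatenated dump sits at a made-up path (only the declaration pattern reflects the lines shown above):

from pyspark.sql import functions as F

# Read the concatenated dump as plain text, one line per row
raw = spark.read.text("Files/Raw.txt")  # hypothetical path

# Drop every per-document <?xml ...?> and <!DOCTYPE ...> line
cleaned = raw.filter(~F.col("value").rlike(r"^\s*(<\?xml|<!DOCTYPE)"))

# Write back as text; a single declaration can be prepended afterwards
cleaned.write.mode("overwrite").text("Files/Cleaned.txt")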

Later on I'm using the Spark parser to read this file:

df = spark.read.format("xml").option("rowTag", "us-patent-application").load("Files/Cleaned.txt")

My problem is that the abstract column does not show its _VALUE.
(screenshot: dataframe preview showing the abstract column without its _VALUE)
If I extract the content manually and load just one of the documents, the XML parser captures it correctly, but when all 8K documents are together it has trouble. Any idea what could be the reason?

Thanks


10 REPLIES
v-shamiliv
Community Support

Hi @Amir_m_h 
I wanted to check if you had the opportunity to review the information provided by @spencer_sa. Please feel free to contact us if you have any further questions. If your issue has been resolved, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems solve them faster.

Thank you.

 

v-shamiliv
Community Support

Hi @Amir_m_h 

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems solve them faster.

Thank you.

spencer_sa
Super User

What happens if you print the schema of the dataframe? Does _VALUE appear in the nested schema?

df_xml.printSchema()

I made a very hokey nested XML file and the _VALUE still appears, so there may be something afoot in how you combined the files.

(screenshot: printSchema output from the test file, with _VALUE present in the nested schema)


Also, 'amen' to JSON over XML. JSON converts directly into a Python structure; if you screw up the schema, you soon know about it when you call json.loads().
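For example, malformed JSON fails fast (a two-line illustration, not from the thread):

import json

json.loads('{"abstract": "text"')  # missing closing brace: raises json.JSONDecodeError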

I think the problem could be that abstract does not have a consistent schema across different files, which I honestly don't know how to deal with.

So if I take a single file whose abstract is parsed correctly, the schema looks like this:

root
 |-- _country: string (nullable = true)
 |-- _date-produced: long (nullable = true)
 |-- _date-publ: long (nullable = true)
 |-- _dtd-version: string (nullable = true)
 |-- _file: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- _lang: string (nullable = true)
 |-- _status: string (nullable = true)
 |-- abstract: struct (nullable = true)
 |    |-- _id: string (nullable = true)
 |    |-- p: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _num: long (nullable = true)
 |-- claims: struct (nullable = true)
 |    |-- _id: string (nullable = true)
 |    |-- claim: array (nullable = true)

 

but when the files are combined and I print the schema, it looks like the output below and loses the _VALUE even for the file that worked individually before being combined with the rest:

root
 |-- _country: string (nullable = true)
 |-- _date-produced: long (nullable = true)
 |-- _date-publ: long (nullable = true)
 |-- _dtd-version: string (nullable = true)
 |-- _file: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- _lang: string (nullable = true)
 |-- _status: string (nullable = true)
 |-- abstract: struct (nullable = true)
 |    |-- _id: string (nullable = true)
 |    |-- p: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _id: string (nullable = true)
 |    |    |    |-- _num: long (nullable = true)
 |    |    |    |-- b: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- br: string (nullable = true)
 |    |    |    |-- chemistry: struct (nullable = true)
 |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |-- _num: long (nullable = true)
 |    |    |    |    |-- img: struct (nullable = true)
 |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |    |-- _alt: string (nullable = true)
 |    |    |    |    |    |-- _file: string (nullable = true)
 |    |    |    |    |    |-- _he: string (nullable = true)
 |    |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |    |-- _img-content: string (nullable = true)
 |    |    |    |    |    |-- _img-format: string (nullable = true)
 |    |    |    |    |    |-- _wi: string (nullable = true)
 |    |    |    |-- i: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- img: array (nullable = true)



How I personally would approach loading 8k files with inconsistent schemas (a sketch follows this list):

  1. If I can't identify the schema from the filename/metadata, process each file separately.
  2. Convert each file to flat pyspark/pandas data frames - the schemas I've seen so far are hierarchical, which can be a bit painful to merge.
  3. Once everything has been flattened I'd then consider writing to either a single table or a single pyspark data frame, making sure I allow the process to add new columns (easier when appending to a table, but not hard).
  4. I should now have a table with 1 row per entry, and null values where columns don't exist in one row but do in another.
  5. From this point onwards it's now a data processing/analysis problem 😊
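A rough sketch of steps 1-3, under the assumption that the documents have been split into separate files (file_paths and the selected columns are hypothetical):

from functools import reduce
from pyspark.sql import functions as F

flat_frames = []
for path in file_paths:  # hypothetical list of per-document XML paths
    doc = (spark.read.format("xml")
           .option("rowTag", "us-patent-application")
           .load(path))
    # Flatten the nested fields we need into plain columns
    flat = doc.select(
        F.col("_file").alias("file"),
        F.col("abstract._id").alias("abstract_id"),
    )
    flat_frames.append(flat)

# Union with schema alignment: columns absent in one frame become nulls
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True),
                flat_frames)

One spark.read per file is slow with 8K files, but it matches step 1's "process each file separately" and sidesteps the schema merge entirely.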
Abhilash_K
New Member

Hi @Amir_m_h 
May I know what packages or libraries you have installed to read XML files?

This library:

(screenshot of the installed spark-xml library)

and I start the session with:

%%configure -f
{"conf": {"spark.jars.packages": "com.databricks:spark-xml_2.12:0.18.0"}}

Thanks @Amir_m_h 

v-shamiliv
Community Support

Hi @Amir_m_h 
Thank you for reaching out to the Microsoft Fabric community forum.

  • It looks like your XML structure has <abstract> nested inside <us-patent-application>, but Spark’s XML parser might not be handling that well, especially with large, combined XML files.
  • The spark-xml library usually extracts attributes correctly, but when it comes to text inside deeply nested elements (like <p> inside <abstract>), it might miss the actual content if the structure isn’t processed properly.
  • Also, if you’ve simply concatenated 8,000 XML documents into one big file without wrapping them in a proper root element, Spark might struggle to parse them. Ideally, each XML document should be treated as a separate row (see the sketch after this list).
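For illustration, a minimal sketch of that wrapping idea; the root element name and paths are made up, and this assumes the default lakehouse is mounted at /lakehouse/default and the cleaned text is a single file:

# Hypothetical: wrap the concatenated documents in one synthetic root
src = "/lakehouse/default/Files/Cleaned.txt"
dst = "/lakehouse/default/Files/Wrapped.xml"

with open(src, "r", encoding="utf-8") as f_in, \
     open(dst, "w", encoding="utf-8") as f_out:
    f_out.write('<?xml version="1.0" encoding="UTF-8"?>\n<applications>\n')
    for line in f_in:        # stream to avoid loading 8K documents at once
        f_out.write(line)
    f_out.write("</applications>\n")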

If this solution helps, please consider giving us Kudos and accepting it as the solution so that it may assist other members in the community.

Thank you.

 

In my case I have included a root element at the top and bottom of the document, but it still didn't solve the problem.

What is strange is that individual documents are parsed correctly, but when put together this problem happens. So I also tried splitting the file into 8,000 separate XML files and then loading them all into the data frame, and the problem still persists.

Another thing I noticed is that the abstract descriptions sometimes have <b> and <i> tags for bold and italic. I tried removing those and the problem is still there. I don't know what other options I need to try. (Makes me think I'd prefer JSON over XML any day 🙂)
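A sketch of that tag-stripping step, with an illustrative path (and regexes on XML are a blunt instrument):

from pyspark.sql import functions as F

raw = spark.read.text("Files/Cleaned.txt")  # illustrative path
# Remove <b>, </b>, <i>, </i> so the <p> text is plain character data
no_inline = raw.withColumn("value",
                           F.regexp_replace("value", r"</?(b|i)>", ""))
no_inline.write.mode("overwrite").text("Files/NoInline.txt")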
