Amir_m_h
New Member

XML parsing in Fabric notebooks

Hi,


TLDR:

I have trouble parsing an XML file in a Fabric lakehouse. The Spark XML parser somehow misses the _VALUE of one of the tags.

The abstract portion of the XML looks like the snippet below, and I have trouble extracting the description:

<abstract id="abstract">
<p id="p-0001" num="0000">The invention involves a system and method for plant agriculture. The system includes a sheet of barrier material upon which soil is placed. A drip tube is placed in the soil so that the soil can be shaped, and the sheet material is placed over the top of the soil so that the edges of the sheet overlap. A seal, which may be soil, is placed on the overlapping edges of the sheet to hold the sheet material in a closed position. The crop is planted through the sheet material, and growth occurs through the aperture created during planting.</p>
</abstract>

 



The file actually contains around 8K XML documents concatenated together.
I need to load this into the Fabric lakehouse and process it.

My approach has been:

  • Load it into a dataframe
  • Remove all per-document XML and DOCTYPE declarations, i.e. lines like
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE us-patent-application SYSTEM "us-patent-application-v46-2022-02-17.dtd" [ ]>
  • Keep one declaration at the top of the file in order to treat it as one big XML file

Then I have written the dataframe as text back to the lakehouse.
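For reference, a minimal sketch of that cleaning step, assuming the raw concatenated dump sits at a made-up path (only the declaration pattern reflects the lines shown above):

from pyspark.sql import functions as F

# Read the concatenated dump as plain text, one line per row
raw = spark.read.text("Files/Raw.txt")  # hypothetical path

# Drop every per-document <?xml ...?> and <!DOCTYPE ...> line
cleaned = raw.filter(~F.col("value").rlike(r"^\s*(<\?xml|<!DOCTYPE)"))

# Write back as text; a single declaration can be prepended afterwards
cleaned.write.mode("overwrite").text("Files/Cleaned.txt")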

Later on I'm using the Spark parser to read this file:

df = spark.read.format("xml").option("rowTag", "us-patent-application").load("Files/Cleaned.txt")

My problem is that the abstract column does not show its _VALUE.
(screenshot: dataframe preview showing the abstract column without its _VALUE)
If I extract the content manually and load just one of the documents, the XML parser captures it correctly, but when all 8K documents are together it has trouble. Any idea what could be the reason?

Thanks


10 REPLIES
v-shamiliv
Community Support

Hi @Amir_m_h 
I wanted to check if you had the opportunity to review the information provided by @spencer_sa. Please feel free to contact us if you have any further questions. If your issue has been resolved, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems solve them faster.

Thank you.

 

v-shamiliv
Community Support

Hi @Amir_m_h 

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems solve them faster.

Thank you.

spencer_sa
Super User

What happens if you print the schema of the dataframe? Does _VALUE appear in the nested schema?

df_xml.printSchema()

I made a very hokey nested XML file and the _VALUE still appears, so there may be something afoot in how you combined the files.

(screenshot: printSchema output from the test file, with _VALUE present in the nested schema)


Also, 'amen' to JSON over XML. JSON converts directly into a Python structure; if you screw up the schema, you soon know about it when you call json.loads().
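For example, malformed JSON fails fast (a two-line illustration, not from the thread):

import json

json.loads('{"abstract": "text"')  # missing closing brace: raises json.JSONDecodeError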

I think the problem could be that abstract does not have a consistent schema across different files, which I honestly don't know how to deal with.

So if I take a single file whose abstract is parsed correctly, the schema looks like this:

root
 |-- _country: string (nullable = true)
 |-- _date-produced: long (nullable = true)
 |-- _date-publ: long (nullable = true)
 |-- _dtd-version: string (nullable = true)
 |-- _file: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- _lang: string (nullable = true)
 |-- _status: string (nullable = true)
 |-- abstract: struct (nullable = true)
 |    |-- _id: string (nullable = true)
 |    |-- p: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _num: long (nullable = true)
 |-- claims: struct (nullable = true)
 |    |-- _id: string (nullable = true)
 |    |-- claim: array (nullable = true)

 

but when the files are combined and I print the schema, it looks like the output below and loses the _VALUE even for the file that worked individually before being combined with the rest:

root
 |-- _country: string (nullable = true)
 |-- _date-produced: long (nullable = true)
 |-- _date-publ: long (nullable = true)
 |-- _dtd-version: string (nullable = true)
 |-- _file: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- _lang: string (nullable = true)
 |-- _status: string (nullable = true)
 |-- abstract: struct (nullable = true)
 |    |-- _id: string (nullable = true)
 |    |-- p: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _id: string (nullable = true)
 |    |    |    |-- _num: long (nullable = true)
 |    |    |    |-- b: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- br: string (nullable = true)
 |    |    |    |-- chemistry: struct (nullable = true)
 |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |-- _num: long (nullable = true)
 |    |    |    |    |-- img: struct (nullable = true)
 |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |    |-- _alt: string (nullable = true)
 |    |    |    |    |    |-- _file: string (nullable = true)
 |    |    |    |    |    |-- _he: string (nullable = true)
 |    |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |    |-- _img-content: string (nullable = true)
 |    |    |    |    |    |-- _img-format: string (nullable = true)
 |    |    |    |    |    |-- _wi: string (nullable = true)
 |    |    |    |-- i: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- img: array (nullable = true)



How I personally would approach loading 8k files with inconsistent schemas (a sketch follows this list):

  1. If I can't identify the schema from the filename/metadata, process each file separately.
  2. Convert each file to flat pyspark/pandas data frames - the schemas I've seen so far are hierarchical, which can be a bit painful to merge.
  3. Once everything has been flattened I'd then consider writing to either a single table or a single pyspark data frame, making sure I allow the process to add new columns (easier when appending to a table, but not hard).
  4. I should now have a table with 1 row per entry, and null values where columns don't exist in one row but do in another.
  5. From this point onwards it's now a data processing/analysis problem 😊
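A rough sketch of steps 1-3, under the assumption that the documents have been split into separate files (file_paths and the selected columns are hypothetical):

from functools import reduce
from pyspark.sql import functions as F

flat_frames = []
for path in file_paths:  # hypothetical list of per-document XML paths
    doc = (spark.read.format("xml")
           .option("rowTag", "us-patent-application")
           .load(path))
    # Flatten the nested fields we need into plain columns
    flat = doc.select(
        F.col("_file").alias("file"),
        F.col("abstract._id").alias("abstract_id"),
    )
    flat_frames.append(flat)

# Union with schema alignment: columns absent in one frame become nulls
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True),
                flat_frames)

One spark.read per file is slow with 8K files, but it matches step 1's "process each file separately" and sidesteps the schema merge entirely.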
Abhilash_K
New Member

Hi @Amir_m_h 
May I know what packages or libraries you have installed to read XML files?

This library:

(screenshot of the installed spark-xml library)

and I start the session with:

%%configure -f
{"conf": {"spark.jars.packages": "com.databricks:spark-xml_2.12:0.18.0"}}

Thanks @Amir_m_h 

v-shamiliv
Community Support

Hi @Amir_m_h 
Thank you for reaching out to the Microsoft Fabric community forum.

  • It looks like your XML structure has <abstract> nested inside <us-patent-application>, but Spark’s XML parser might not be handling that well, especially with large, combined XML files.
  • The spark-xml library usually extracts attributes correctly, but when it comes to text inside deeply nested elements (like <p> inside <abstract>), it might miss the actual content if the structure isn’t processed properly.
  • Also, if you’ve simply concatenated 8,000 XML documents into one big file without wrapping them in a proper root element, Spark might struggle to parse them. Ideally, each XML document should be treated as a separate row (see the sketch after this list).
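For illustration, a minimal sketch of that wrapping idea; the root element name and paths are made up, and this assumes the default lakehouse is mounted at /lakehouse/default and the cleaned text is a single file:

# Hypothetical: wrap the concatenated documents in one synthetic root
src = "/lakehouse/default/Files/Cleaned.txt"
dst = "/lakehouse/default/Files/Wrapped.xml"

with open(src, "r", encoding="utf-8") as f_in, \
     open(dst, "w", encoding="utf-8") as f_out:
    f_out.write('<?xml version="1.0" encoding="UTF-8"?>\n<applications>\n')
    for line in f_in:        # stream to avoid loading 8K documents at once
        f_out.write(line)
    f_out.write("</applications>\n")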

If this solution helps, please consider giving us Kudos and accepting it as the solution so that it may assist other members in the community.

Thank you.

 

In my case I have included a root element at the top and bottom of the document, but it still didn't solve the problem.

What is strange is that individual documents are parsed correctly, but when put together this problem happens. So I also tried splitting the file into 8,000 separate XML files and then loading them all into the data frame, and the problem still persists.

Another thing I noticed is that the abstract descriptions sometimes have <b> and <i> tags for bold and italic. I tried removing those and the problem is still there. I don't know what other options I need to try. (Makes me think I'd prefer JSON over XML any day 🙂)
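A sketch of that tag-stripping step, with an illustrative path (and regexes on XML are a blunt instrument):

from pyspark.sql import functions as F

raw = spark.read.text("Files/Cleaned.txt")  # illustrative path
# Remove <b>, </b>, <i>, </i> so the <p> text is plain character data
no_inline = raw.withColumn("value",
                           F.regexp_replace("value", r"</?(b|i)>", ""))
no_inline.write.mode("overwrite").text("Files/NoInline.txt")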
