Power BI is turning 10, and we’re marking the occasion with a special community challenge. Use your creativity to tell a story, uncover trends, or highlight something unexpected.
Get startedJoin us at FabCon Vienna from September 15-18, 2025, for the ultimate Fabric, Power BI, SQL, and AI community-led learning event. Save €200 with code FABCOMM. Get registered
Hi,
TLDR:
I have trouble in parsing an xml file in fabric lakehouse. Spark xml parser somehow misses _value of one of the tags.
The xml file related to abstract looks like below and I have trouble in extracting the description
<abstract id="abstract">
<p id="p-0001" num="0000">The invention involves a system and method for plant agriculture. The system includes a sheet of barrier material upon which soil is placed. A drip tube is placed in the soil so that the soil can be shaped, and the sheet material is placed over the top of the soil so that the edges of the sheet overlap. A seal, which may be soil, is placed on the overlapping edges of the sheet to hold the sheet material in a closed position. The crop is planted through the sheet material, and growth occurs through the aperture created during planting.</p>
</abstract>
Inside the xml file, it has around 8K xml files concatenated to each other.
I need to load this in fabric lakehouse and process it.
My approach has been :
Then I have written the dataframe as text back to the lakehouse.
Later on I'm usking the spark parser to read this file
<abstract id="abstract">
<p id="p-0001" num="0000">The invention involves a system and method for plant agriculture. The system includes a sheet of barrier material upon which soil is placed. A drip tube is placed in the soil so that the soil can be shaped, and the sheet material is placed over the top of the soil so that the edges of the sheet overlap. A seal, which may be soil, is placed on the overlapping edges of the sheet to hold the sheet material in a closed position. The crop is planted through the sheet material, and growth occurs through the aperture created during planting.</p>
</abstract>
If I extract the content manually and load one of the files, then xml parser captures it correctly but when all the 8k files are together, it has trouble. Any idea what could be the reason?
Thanks
Solved! Go to Solution.
How I personally would approach loading 8k files with inconsistent schemas;
Hi @Amir_m_h
I wanted to check if you had the opportunity to review the information provided @spencer_sa . Please feel free to contact us if you have any further questions. If your issue has been resolved ,please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.
Thank you.
Hi @Amir_m_h
May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.
Thank you.
What happens if you print the schema of the dataframe? Does _VALUE appear in the nested schema?
df_xml.printSchema()
I made a very hooky nested xml file and the _VALUE still appears, so there may be something afoot in how you combined the files
Also, 'amen' to JSON over XML. JSON converts directly into a python structure - you screw the schema up, you soon know about it when you json.loads().
I think the problem could be that abstract does not have a consistent schema across different files which I honestly don't know how to deal with it.
So if I take a single file that abstract schema works correctly, the schema looks like this:
root
|-- _country: string (nullable = true)
|-- _date-produced: long (nullable = true)
|-- _date-publ: long (nullable = true)
|-- _dtd-version: string (nullable = true)
|-- _file: string (nullable = true)
|-- _id: string (nullable = true)
|-- _lang: string (nullable = true)
|-- _status: string (nullable = true)
|-- abstract: struct (nullable = true)
| |-- _id: string (nullable = true)
| |-- p: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _id: string (nullable = true)
| | |-- _num: long (nullable = true)
|-- claims: struct (nullable = true)
| |-- _id: string (nullable = true)
| |-- claim: array (nullable = true)
but when files are combined and I print the schema, it looks like below and loses the _value even for the file that individually worked before being combined with the rest:
root
|-- _country: string (nullable = true)
|-- _date-produced: long (nullable = true)
|-- _date-publ: long (nullable = true)
|-- _dtd-version: string (nullable = true)
|-- _file: string (nullable = true)
|-- _id: string (nullable = true)
|-- _lang: string (nullable = true)
|-- _status: string (nullable = true)
|-- abstract: struct (nullable = true)
| |-- _id: string (nullable = true)
| |-- p: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _id: string (nullable = true)
| | | |-- _num: long (nullable = true)
| | | |-- b: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- br: string (nullable = true)
| | | |-- chemistry: struct (nullable = true)
| | | | |-- _id: string (nullable = true)
| | | | |-- _num: long (nullable = true)
| | | | |-- img: struct (nullable = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _alt: string (nullable = true)
| | | | | |-- _file: string (nullable = true)
| | | | | |-- _he: string (nullable = true)
| | | | | |-- _id: string (nullable = true)
| | | | | |-- _img-content: string (nullable = true)
| | | | | |-- _img-format: string (nullable = true)
| | | | | |-- _wi: string (nullable = true)
| | | |-- i: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- img: array (nullable = true)
How I personally would approach loading 8k files with inconsistent schemas;
Hi @Amir_m_h
May I know what are the packages or libararies that you have installed to read xml files?
This library
and I start the session with
%%configure -f
{"conf": {"spark.jars.packages": "com.databricks:spark-xml_2.12:0.18.0"}}
Hi @Amir_m_h
Thank you for reaching out microsoft fabric community forum.
If this solution helps, please consider giving us Kudos and accepting it as the solution so that it may assist other members in the community.
Thank you.
In my case I have included a root element as header and footer of the document but still it didn't solve the problem.
what is strange is that individual documents are parsed correctly but when put together this problem happens. So instead I also tried an approach to split the file to 8000 different xml files and then load them all in the data frame and the problem still persists.
Another thing I noticed is that abstract descriptions sometimes have <b> <i> tags for bold and italic. I tried to remove those and still the problem is there. I don't know what other option I need to try. (makes me think I prefer JSON over xml any day 🙂 )
This is your chance to engage directly with the engineering team behind Fabric and Power BI. Share your experiences and shape the future.
Check out the June 2025 Fabric update to learn about new features.