I am trying to get data from an XML-based API into a data lake. Using web activities I succeeded in retrieving the data, but I couldn't find a way to load that output into the data lake. I tried Copy data with a REST connection as the source, which resulted in a 500 error. Most of the solutions I could find involve options I don't see in my UI (I will attach some screenshots).
The most important thing is simply to get the data into the data lake; it doesn't matter which activities I use. If possible, I'd like to use the output of a web activity (which, as mentioned, does retrieve the data).
Solved! Go to Solution.
Here is what I did to test that I could get it to work.
from io import StringIO # https://docs.python.org/3/library/io.html#io.StringIO
import pandas as pd # https://pandas.pydata.org/docs/reference/index.html
import requests # https://requests.readthedocs.io/en/latest/
# Setup Request Variables.
base_url = 'https://vsac.nlm.nih.gov/vsac/svs/'
end_point = 'RetrieveMultipleValueSets'
parameters = {'id':'2.16.840.1.113883.3.117.1.7.1.226','release':'eCQM Update 2020-05-07'}
# Request - Basic Authentication values have been removed
response = requests.get(f"{base_url}{end_point}", params=parameters, auth=('userName', 'password'))
# pandas warns when given a literal XML string directly, so wrap the response text in StringIO as suggested
xml_io_concept = StringIO(response.text)
# Define namespaces
namespaces = {
'ns0': 'urn:ihe:iti:svs:2008',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}
# Parse the XML and extract Concept elements
pandas_df = pd.read_xml( # https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html
xml_io_concept,
xpath=".//ns0:Concept",
namespaces=namespaces
)
# Convert Pandas to PySpark
spark_df = spark.createDataFrame(pandas_df)
# Write to table in default Lakehouse
spark_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Tables/test_value_set")
# Query Lakehouse Table
df = spark.sql("SELECT * FROM ValueSet_Lake.test_value_set LIMIT 1000")
display(df)
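As an alternative to pandas.read_xml, the same Concept extraction can be sketched with the standard-library ElementTree. The sample XML below is made up for illustration (the real RetrieveMultipleValueSets response requires credentials), but it mimics the SVS namespace and Concept attributes:

```python
import xml.etree.ElementTree as ET

# Made-up sample mimicking the shape of an SVS RetrieveMultipleValueSets response
sample_xml = """<?xml version="1.0"?>
<ns0:RetrieveMultipleValueSetsResponse xmlns:ns0="urn:ihe:iti:svs:2008">
  <ns0:DescribedValueSet ID="2.16.840.1.113883.3.117.1.7.1.226">
    <ns0:ConceptList>
      <ns0:Concept code="12345" codeSystemName="SNOMEDCT" displayName="Example concept A"/>
      <ns0:Concept code="67890" codeSystemName="SNOMEDCT" displayName="Example concept B"/>
    </ns0:ConceptList>
  </ns0:DescribedValueSet>
</ns0:RetrieveMultipleValueSetsResponse>"""

namespaces = {"ns0": "urn:ihe:iti:svs:2008"}
root = ET.fromstring(sample_xml)
# Each Concept's attributes become one dict (one row)
concepts = [c.attrib for c in root.findall(".//ns0:Concept", namespaces)]
print(concepts[0]["displayName"])  # → Example concept A
```

This yields a list of dicts that spark.createDataFrame can also consume directly, which avoids the pandas dependency if you only need the attribute columns.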
I solved the Copy data issue by using an HTTP connection with basic auth instead of a REST connection.
This was not obvious, because the API uses tokens for auth and an anonymous HTTP connection failed.
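For reference, basic auth is just an Authorization header. This sketch (placeholder credentials, request built but not sent) shows the header the HTTP connection adds, which is what an anonymous connection lacks:

```python
import base64
import requests

# Build (but do not send) a request with Basic authentication.
# The URL and credentials are placeholders for illustration.
req = requests.Request(
    "GET",
    "https://vsac.nlm.nih.gov/vsac/svs/RetrieveMultipleValueSets",
    params={"id": "2.16.840.1.113883.3.117.1.7.1.226"},
    auth=("userName", "password"),
).prepare()

# Basic auth is base64("user:password") in the Authorization header
expected = "Basic " + base64.b64encode(b"userName:password").decode()
assert req.headers["Authorization"] == expected
print(req.headers["Authorization"])
```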
Hi @Jorn_vb
To retrieve XML data from an API and load it into a Microsoft Fabric Lakehouse, you can use a combination of Web Activity and Copy Activity in your data pipeline. Here's a step-by-step approach to accomplish this:
Please accept this solution and give kudos if this helps.
Thanks
Can you explain "Choose 'REST' as the source type"? How can I set the request body for a GET request? It only shows up for me when I switch the method to POST.
The screenshots on this page do not look like my UI (see the screenshot in my post). From what I know, it does not make sense to put the response (the data I want) into the body of a request, but that is probably just because what I see mismatches what you see. Do you know anything that can help me?
The screenshots do not match what I see either, and I was at a loss on how to create a dummy request; I am new to this. I ended up using a notebook with the requests and pandas libraries to import the XML API data into my lakehouse tables.