Getting data from XML API fails in copy data but works in web activity
I am trying to get data from an XML-based API into a data lake. Using a Web activity I succeeded in retrieving the data, but I couldn't find a way to load that output into the data lake. I tried a Copy Data activity with a REST connection as the source, but this resulted in a 500 error. Most of the solutions I could find rely on options I don't see in my UI (I will attach some screenshots).
The most important thing is just to get the data into the data lake; it doesn't matter which activities I use. If possible, I'd like to use the output of the Web activity (which, as mentioned, does retrieve the data).
Here is what I did to test that I could get it to work.
from io import StringIO # https://docs.python.org/3/library/io.html#io.StringIO
import pandas as pd # https://pandas.pydata.org/docs/reference/index.html
import requests # https://requests.readthedocs.io/en/latest/
# Setup Request Variables.
base_url = 'https://vsac.nlm.nih.gov/vsac/svs/'
end_point = 'RetrieveMultipleValueSets'
parameters = {'id':'2.16.840.1.113883.3.117.1.7.1.226','release':'eCQM Update 2020-05-07'}
# Request - Basic Authentication values have been removed
response = requests.get(f"{base_url}{end_point}",params=parameters,auth=('userName','password'))
# pandas warns against passing a literal XML string directly, so wrap response.text in StringIO as suggested
xml_io_concept = StringIO(response.text)
# Define namespaces
namespaces = {
'ns0': 'urn:ihe:iti:svs:2008',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}
# Parse the XML and extract Concept elements
pandas_df = pd.read_xml( # https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html
xml_io_concept,
xpath=".//ns0:Concept",
namespaces=namespaces
)
# Convert Pandas to PySpark
spark_df = spark.createDataFrame(pandas_df)
# Write to table in default Lakehouse
spark_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Tables/test_value_set")
# Query Lakehouse Table
df = spark.sql("SELECT * FROM ValueSet_Lake.test_value_set LIMIT 1000")
display(df)
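To see how the namespaces and xpath arguments of pd.read_xml interact without calling the live API, here is a minimal, self-contained sketch. The sample XML below is hypothetical; it only mimics the urn:ihe:iti:svs:2008 ValueSet structure that the real RetrieveMultipleValueSets response uses.

```python
from io import StringIO
import pandas as pd

# Hypothetical minimal ValueSet response, mimicking the urn:ihe:iti:svs:2008 namespace
sample_xml = """<?xml version="1.0"?>
<ns0:RetrieveMultipleValueSetsResponse xmlns:ns0="urn:ihe:iti:svs:2008">
  <ns0:DescribedValueSet ID="2.16.840.1.113883.3.117.1.7.1.226">
    <ns0:ConceptList>
      <ns0:Concept code="1234" codeSystemName="SNOMEDCT" displayName="Example concept"/>
      <ns0:Concept code="5678" codeSystemName="SNOMEDCT" displayName="Another concept"/>
    </ns0:ConceptList>
  </ns0:DescribedValueSet>
</ns0:RetrieveMultipleValueSetsResponse>"""

namespaces = {"ns0": "urn:ihe:iti:svs:2008"}

# The xpath selects every Concept element; its attributes become DataFrame columns.
# parser="etree" uses the standard library so no lxml install is required.
df = pd.read_xml(
    StringIO(sample_xml),
    xpath=".//ns0:Concept",
    namespaces=namespaces,
    parser="etree",
)
print(df[["code", "displayName"]])
```

Each matched Concept element becomes one row, which is why the accepted-solution code ends up with one row per code in the value set.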
I solved the Copy Data issue by using an HTTP connector with basic auth instead of a REST connection.
This was not obvious because the API uses tokens for auth, and an HTTP connection with anonymous authentication failed.
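For anyone debugging a similar auth mismatch: basic authentication is just an Authorization header carrying base64("user:password"). A quick way to inspect what such a request would send, without actually calling the API, is to prepare it with the requests library. The URL and credentials below are placeholders, not working values.

```python
import base64
import requests

# Build (but do not send) a GET request with basic auth.
# URL and credentials are placeholders for illustration only.
req = requests.Request(
    "GET",
    "https://vsac.nlm.nih.gov/vsac/svs/RetrieveMultipleValueSets",
    params={"id": "2.16.840.1.113883.3.117.1.7.1.226"},
    auth=("userName", "password"),
).prepare()

# Basic auth is a base64-encoded "user:password" pair in the Authorization header
print(req.headers["Authorization"])
```

Comparing this header with what the HTTP connector sends can confirm whether the connector's basic-auth setting is configured the way the API expects.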
Hi @Jorn_vb
To retrieve XML data from an API and load it into a Microsoft Fabric Lakehouse, you can use a combination of Web Activity and Copy Activity in your data pipeline. Here's a step-by-step approach to accomplish this:
Use Web Activity to Retrieve XML Data
- Add a Web Activity to your pipeline to make the API call and retrieve the XML data.
- Configure the Web Activity with the appropriate URL, method, headers, and any required authentication.
- The Web Activity will successfully retrieve the XML data as you've mentioned.
Process Web Activity Output
- Add a Set Variable Activity after the Web Activity.
- Configure the Set Variable Activity to store the Web Activity's output in a variable.
- Use the following expression to capture the Web Activity's output:
@activity('Web1').output.Response
(Replace 'Web1' with your Web Activity's name)
Use Copy Activity to Load Data into Lakehouse
- Add a Copy Activity after the Set Variable Activity.
- For the source:
- Choose "REST" as the source type.
- Set the Request Method to "GET".
- In the Additional Headers, add a Content-Type header with value "application/xml".
- In the Request Body, use the variable you set in the previous step.
- For the sink:
- Choose your Lakehouse as the destination.
- Select the appropriate file format (e.g., XML, JSON, or Parquet, depending on your needs).
- In the Copy Activity's settings, enable "First row as header" if applicable.
Please accept this solution and give kudos if it helps.
Thanks
Can you explain "Choose 'REST' as the source type"? How can I set the Request Body for a GET request? It only shows up for me when I switch the method to POST.
The screenshots on this page do not look like my UI (see the screenshot in my post). From what I know, it does not make sense to put the response (the data I want) into the body of a request. But that is probably just because what I see mismatches what you see. Do you know anything that can help me?
The screenshots do not match what I see either, and I was at a loss on how to create a dummy request; I am new to this. I ended up using a notebook with the requests and pandas libraries to import the XML API data into my Lakehouse tables.