Getting data from XML API fails in copy data but works in web activity
I am trying to get data from an XML-based API into a data lake. Using a Web activity I succeeded in retrieving the data, but I couldn't find a way to load that output into the data lake. I tried a Copy Data activity with a REST connection as the source, but this resulted in a 500 error. Most of the solutions I could find rely on options I don't see in my UI (I will attach some screenshots).
The most important thing is just to get the data into the data lake; it doesn't matter which activities I use. If possible, I'd like to use the output of the Web activity (which, as mentioned, does retrieve the data).
Here is what I did to test that I could get it to work.
from io import StringIO # https://docs.python.org/3/library/io.html#io.StringIO
import pandas as pd # https://pandas.pydata.org/docs/reference/index.html
import requests # https://requests.readthedocs.io/en/latest/
# Setup Request Variables.
base_url = 'https://vsac.nlm.nih.gov/vsac/svs/'
end_point = 'RetrieveMultipleValueSets'
parameters = {'id':'2.16.840.1.113883.3.117.1.7.1.226','release':'eCQM Update 2020-05-07'}
# Request - Basic Authentication values have been removed
response = requests.get(f"{base_url}{end_point}",params=parameters,auth=('userName','password'))
# pandas warns against passing a literal XML string directly, so wrap response.text in StringIO as suggested
xml_io_concept = StringIO(response.text)
# Define namespaces
namespaces = {
'ns0': 'urn:ihe:iti:svs:2008',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}
# Parse the XML and extract Concept elements
pandas_df = pd.read_xml( # https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html
xml_io_concept,
xpath=".//ns0:Concept",
namespaces=namespaces
)
# Convert Pandas to PySpark
spark_df = spark.createDataFrame(pandas_df)
# Write to table in default Lakehouse
spark_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Tables/test_value_set")
# Query Lakehouse Table
df = spark.sql("SELECT * FROM ValueSet_Lake.test_value_set LIMIT 1000")
display(df)
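To see how the namespaces and xpath arguments of pd.read_xml interact without calling the live API, here is a minimal, self-contained sketch. The sample XML below is hypothetical; it only mimics the urn:ihe:iti:svs:2008 ValueSet structure that the real RetrieveMultipleValueSets response uses.

```python
from io import StringIO
import pandas as pd

# Hypothetical minimal ValueSet response, mimicking the urn:ihe:iti:svs:2008 namespace
sample_xml = """<?xml version="1.0"?>
<ns0:RetrieveMultipleValueSetsResponse xmlns:ns0="urn:ihe:iti:svs:2008">
  <ns0:DescribedValueSet ID="2.16.840.1.113883.3.117.1.7.1.226">
    <ns0:ConceptList>
      <ns0:Concept code="1234" codeSystemName="SNOMEDCT" displayName="Example concept"/>
      <ns0:Concept code="5678" codeSystemName="SNOMEDCT" displayName="Another concept"/>
    </ns0:ConceptList>
  </ns0:DescribedValueSet>
</ns0:RetrieveMultipleValueSetsResponse>"""

namespaces = {"ns0": "urn:ihe:iti:svs:2008"}

# The xpath selects every Concept element; its attributes become DataFrame columns.
# parser="etree" uses the standard library so no lxml install is required.
df = pd.read_xml(
    StringIO(sample_xml),
    xpath=".//ns0:Concept",
    namespaces=namespaces,
    parser="etree",
)
print(df[["code", "displayName"]])
```

Each matched Concept element becomes one row, which is why the accepted-solution code ends up with one row per code in the value set.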
I solved the Copy Data issue by using an HTTP connector with basic auth instead of a REST connection.
This was not obvious because the API uses tokens for auth, and an HTTP connection with anonymous authentication failed.
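For anyone debugging a similar auth mismatch: basic authentication is just an Authorization header carrying base64("user:password"). A quick way to inspect what such a request would send, without actually calling the API, is to prepare it with the requests library. The URL and credentials below are placeholders, not working values.

```python
import base64
import requests

# Build (but do not send) a GET request with basic auth.
# URL and credentials are placeholders for illustration only.
req = requests.Request(
    "GET",
    "https://vsac.nlm.nih.gov/vsac/svs/RetrieveMultipleValueSets",
    params={"id": "2.16.840.1.113883.3.117.1.7.1.226"},
    auth=("userName", "password"),
).prepare()

# Basic auth is a base64-encoded "user:password" pair in the Authorization header
print(req.headers["Authorization"])
```

Comparing this header with what the HTTP connector sends can confirm whether the connector's basic-auth setting is configured the way the API expects.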
Hi @Jorn_vb
To retrieve XML data from an API and load it into a Microsoft Fabric Lakehouse, you can use a combination of Web Activity and Copy Activity in your data pipeline. Here's a step-by-step approach to accomplish this:
Use Web Activity to Retrieve XML Data
- Add a Web Activity to your pipeline to make the API call and retrieve the XML data.
- Configure the Web Activity with the appropriate URL, method, headers, and any required authentication.
- The Web Activity will successfully retrieve the XML data as you've mentioned.
Process Web Activity Output
- Add a Set Variable Activity after the Web Activity.
- Configure the Set Variable Activity to store the Web Activity's output in a variable.
- Use the following expression to capture the Web Activity's output:
@activity('Web1').output.Response
(Replace 'Web1' with your Web Activity's name)
Use Copy Activity to Load Data into Lakehouse
- Add a Copy Activity after the Set Variable Activity.
- For the source:
- Choose "REST" as the source type.
- Set the Request Method to "GET".
- In the Additional Headers, add a Content-Type header with value "application/xml".
- In the Request Body, use the variable you set in the previous step.
- For the sink:
- Choose your Lakehouse as the destination.
- Select the appropriate file format (e.g., XML, JSON, or Parquet, depending on your needs).
- In the Copy Activity's settings, enable "First row as header" if applicable.
Please accept this solution and give kudos if it helps.
Thanks
Can you explain "Choose 'REST' as the source type"? How can I set the Request Body for a GET request? It only shows up for me when I switch the method to POST.
The screenshots on this page do not look like my UI (see the screenshot in my post). From what I know, it does not make sense to put the response (the data I want) into the body of a request. But that is probably just because what I see mismatches what you see. Do you know anything that can help me?
The screenshots do not match what I see either, and I was at a loss on how to create a dummy request; I am new to this. I ended up using a notebook with the requests and pandas libraries to import the XML API data into my Lakehouse tables.