Jorn_vb
Regular Visitor

Getting data from XML API fails in copy data but works in web activity

I am trying to get the data from an XML-based API into a data lake. Using Web activities I succeeded in retrieving the data, but I couldn't find a way to load that output into the data lake. I tried a Copy Data activity with a REST connection as source, which resulted in a 500 error. Most of the solutions I could find rely on options I don't see in my UI (I will attach some screenshots).
The most important thing is just to get the data into the data lake; it doesn't matter which activities I use. If possible, I'd like to use the output of a Web activity (which, as mentioned, does retrieve the data).

(Screenshot attached: SourceUI2.png)


9 REPLIES
Jorn_vb
Regular Visitor

I solved the Copy Data step by using an HTTP connector with basic auth instead of a REST connection.

This was not obvious, because the API uses tokens for auth and an anonymous HTTP connection failed.
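For anyone hitting the same wall: the HTTP connector's basic auth simply attaches an Authorization header to every request, which is why the anonymous connection was rejected. A minimal sketch of what that header contains (the credentials here are placeholders):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    # HTTP basic auth is just "Basic " + base64("user:password")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

# An anonymous request omits this header entirely.
print(basic_auth_header("userName", "password"))
```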

nilendraFabric
Community Champion

Hi @Jorn_vb 

To retrieve XML data from an API and load it into a Microsoft Fabric Lakehouse, you can use a combination of Web Activity and Copy Activity in your data pipeline. Here's a step-by-step approach to accomplish this:

Use Web Activity to Retrieve XML Data

  1. Add a Web Activity to your pipeline to make the API call and retrieve the XML data.
  2. Configure the Web Activity with the appropriate URL, method, headers, and any required authentication.
  3. The Web Activity will successfully retrieve the XML data as you've mentioned.

Process Web Activity Output

  1. Add a Set Variable Activity after the Web Activity.
  2. Configure the Set Variable Activity to store the Web Activity's output in a variable.
  3. Use the following expression to capture the Web Activity's output:
    @activity('Web1').output.Response
    (Replace 'Web1' with your Web Activity's name)

Use Copy Activity to Load Data into Lakehouse

  1. Add a Copy Activity after the Set Variable Activity.
  2. For the source:
    • Choose "REST" as the source type.
    • Set the Request Method to "GET".
    • In the Additional Headers, add a Content-Type header with value "application/xml".
    • In the Request Body, use the variable you set in the previous step.
  3. For the sink:
    • Choose your Lakehouse as the destination.
    • Select the appropriate file format (e.g., XML, JSON, or Parquet, depending on your needs).
  4. In the Copy Activity's settings, enable "First row as header" if applicable.
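Assuming a pipeline String variable named xmlResponse (a hypothetical name) holds the Web Activity's output from the step above, the two expressions involved would look like this; a sketch of the pipeline expression language only, not a full pipeline definition:

```
Set Variable (value):          @activity('Web1').output.Response
Copy Activity (request body):  @variables('xmlResponse')
```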

Please accept this solution and give kudos if it helps.

Thanks

Can you explain "Choose 'REST' as the source type"? How can I set the request body for a GET request? It only shows up for me when I switch the method to POST.

The screenshots on this page do not look like my UI (see the screenshot in my post). With what I know, it does not make sense to put the response (the data I want) into the body of a request; but that is probably just because what I see mismatches what you see. Do you know anything that can help me?

 

The screenshots do not match what I see either, and I was at a loss on how to create a dummy request; I am new to this. I ended up using a notebook with the requests and pandas libraries to import the XML API data into my lakehouse tables.

Hello @Salmatador 

 

Could you please share your approach here. 

Here is what I did to test that I could get it to work. 

from io import StringIO # https://docs.python.org/3/library/io.html#io.StringIO
import pandas as pd # https://pandas.pydata.org/docs/reference/index.html
import requests # https://requests.readthedocs.io/en/latest/

# Setup Request Variables.

base_url = 'https://vsac.nlm.nih.gov/vsac/svs/'
end_point = 'RetrieveMultipleValueSets'
parameters = {'id':'2.16.840.1.113883.3.117.1.7.1.226','release':'eCQM Update 2020-05-07'}

# Request - basic authentication values have been removed

response = requests.get(f"{base_url}{end_point}", params=parameters, auth=('userName', 'password'))

# pandas warns when given a raw string, so wrap response.text in StringIO as suggested
xml_io_concept = StringIO(response.text)

# Define namespaces
namespaces = {
    'ns0': 'urn:ihe:iti:svs:2008',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}

# Parse the XML and extract Concept elements
pandas_df = pd.read_xml( # https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html
    xml_io_concept,
    xpath=".//ns0:Concept",
    namespaces=namespaces
)

# Convert Pandas to PySpark 

spark_df = spark.createDataFrame(pandas_df)

# Write to table in default Lakehouse 

spark_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("Tables/test_value_set")

# Query Lakehouse Table

df = spark.sql("SELECT * FROM ValueSet_Lake.test_value_set LIMIT 1000")
display(df)
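The namespaces dictionary is what makes the xpath above match anything: ns0 is only a local alias for the urn:ihe:iti:svs:2008 namespace URI. A self-contained sketch of the same lookup using only the standard library, against a made-up miniature of the SVS response (structure and values invented for illustration):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the SVS response; structure and values are invented.
xml_text = """<ns0:RetrieveMultipleValueSetsResponse xmlns:ns0="urn:ihe:iti:svs:2008">
  <ns0:DescribedValueSet ID="2.16.840.1.113883.3.117.1.7.1.226">
    <ns0:ConceptList>
      <ns0:Concept code="1234" displayName="Example concept"/>
    </ns0:ConceptList>
  </ns0:DescribedValueSet>
</ns0:RetrieveMultipleValueSetsResponse>"""

namespaces = {"ns0": "urn:ihe:iti:svs:2008"}
root = ET.fromstring(xml_text)

# Same XPath idea as pd.read_xml: find every Concept anywhere in the tree.
concepts = [c.attrib for c in root.findall(".//ns0:Concept", namespaces)]
print(concepts)  # one attribute dict per Concept element
```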

 

Thanks for sharing @Salmatador 

@Jorn_vb see if this is helpful 
