rockper
Frequent Visitor

Import multiple files from external source

Source: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023/index.html

Destination: my_lakehouse_root/files/ais/2023/

Lakehouse SQL Endpoint: xyz.datawarehouse.fabric.microsoft.com

 

I have successfully downloaded this in a VM on another cloud with:

$ wget -q -np -r -nH -L --cut-dirs=3 https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023/

 

I tried "Copy Data" in a pipeline, but it only grabs the webpage itself (index.html), not the zip files.

I only need to copy the files; I will extract and process the zip files in Spark later.

1 ACCEPTED SOLUTION
Anonymous
Not applicable

I would download them via a Python notebook instead.

Create an iterator that generates the needed URLs, from 01_01 through 09_30, then download each URL, e.g.:

https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023/AIS_2023_01_01.zip
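The iterator idea above can be sketched like this (a minimal sketch; the one-file-per-day `AIS_2023_MM_DD.zip` naming pattern is inferred from the example URL):

```python
from datetime import date, timedelta

# Base URL of the NOAA AIS 2023 directory.
base = "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023"

def ais_urls(start=date(2023, 1, 1), end=date(2023, 9, 30)):
    """Yield one download URL per day, following the AIS_2023_MM_DD.zip pattern."""
    d = start
    while d <= end:
        yield f"{base}/AIS_2023_{d:%m_%d}.zip"
        d += timedelta(days=1)

urls = list(ais_urls())
```

Each URL from the generator can then be fetched in a loop with `requests`, as in the sample code below.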

 

If you need inspiration on how to download files in Python, use the existing samples that you can install for free. The machine detection one, for example, downloads its own data.

 

They all contain a code block that downloads a file and unzips it into the lakehouse. Here is the example from the uplift sample:

 

 

```
if not IS_CUSTOM_DATA:
    # Download demo data files into the lakehouse if they do not already exist
    import os, gzip, requests

    download_file = "criteo-research-uplift-v2.1.csv.gz"
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError("Default lakehouse not found, please add a lakehouse and restart the session.")
    os.makedirs(download_path, exist_ok=True)
    if not os.path.exists(f"{download_path}/{DATA_FILE}"):
        r = requests.get(remote_url, timeout=30)
        with open(f"{download_path}/{download_file}", "wb") as f:
            f.write(r.content)
        # The sample file is gzip-compressed; decompress it next to the download
        with gzip.open(f"{download_path}/{download_file}", "rb") as fin:
            with open(f"{download_path}/{DATA_FILE}", "wb") as fout:
                fout.write(fin.read())
    print("Downloaded demo data files into lakehouse.")
```
 
 
Hope this helps you get started with downloading data and loading your lakehouse.


2 REPLIES

Thanks,

This worked. My error was in using my Lakehouse name as the destination path. I got the correct destination by going to the folder in the Lakehouse navigation sidebar, clicking the "...", and selecting "Copy File API Path".

```
import os, requests

remote_url = "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023"
download_file = "AIS_2023_01_01.zip"
download_path = "/lakehouse/default/Files/ais/2023"
os.makedirs(download_path, exist_ok=True)

r = requests.get(f"{remote_url}/{download_file}", timeout=30)
with open(f"{download_path}/{download_file}", "wb") as f:
    f.write(r.content)
```
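Since the AIS files are .zip archives rather than .gz like the uplift sample, extracting them takes the `zipfile` module instead of `gzip`. A minimal sketch if you prefer to extract in the notebook before handing off to Spark (the helper name and the throwaway paths are my own):

```python
import os
import tempfile
import zipfile

def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and list the results."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))

# Demonstration with a throwaway archive; in the notebook you would point
# extract_zip at f"{download_path}/{download_file}" instead.
tmp = tempfile.mkdtemp()
sample = os.path.join(tmp, "AIS_2023_01_01.zip")
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr("AIS_2023_01_01.csv", "MMSI,BaseDateTime\n")

extracted = extract_zip(sample, os.path.join(tmp, "out"))
```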

 
