rockper
Frequent Visitor

Import multiple files from external source

Source: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023/index.html

Destination: my_lakehouse_root/files/ais/2023/

Lakehouse SQL Endpoint: xyz.datawarehouse.fabric.microsoft.com

 

I have successfully downloaded this in a VM on another cloud with:

$ wget -q -np -r -nH -L --cut-dirs=3 https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023/

 

I tried "Copy Data" in a pipeline, but it only grabs the webpage itself (index.html), and not the zip files.

I only need to copy the files, I will extract and process these zip files in Spark later.

1 ACCEPTED SOLUTION
alxdean
Advocate V

I would download them via a Python notebook instead.

Create an iterator that generates the needed URLs, from 01_01 through 09_30, and download each one, e.g.:

https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023/AIS_2023_01_01.zip
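
For example, here is a minimal sketch of that loop, assuming the files all follow the AIS_2023_MM_DD.zip naming shown above, that a default lakehouse is attached (so it is mounted at /lakehouse/default), and using the destination folder from the question:

```
import os
import requests
from datetime import date, timedelta

remote_url = "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023"
download_path = "/lakehouse/default/Files/ais/2023"
os.makedirs(download_path, exist_ok=True)

# Iterate over every date from 01_01 through 09_30 and fetch the matching zip
day = date(2023, 1, 1)
while day <= date(2023, 9, 30):
    name = f"AIS_2023_{day:%m_%d}.zip"
    target = f"{download_path}/{name}"
    if not os.path.exists(target):  # skip files already downloaded
        r = requests.get(f"{remote_url}/{name}", timeout=60)
        r.raise_for_status()
        with open(target, "wb") as f:
            f.write(r.content)
    day += timedelta(days=1)
```

The os.path.exists check makes the loop safe to re-run if the session is interrupted partway through the nine months of files.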

 

If you need inspiration on how to download files in Python, use the existing samples that you can install for free. The Machine detection one, for example, downloads its own data.


 

They all contain a code block that downloads a file and unzips it into the lakehouse. Here is the example from the uplift sample:

 

 

if not IS_CUSTOM_DATA:
    # Download demo data files into the lakehouse if they don't already exist
    import os, gzip, requests

    download_file = "criteo-research-uplift-v2.1.csv.gz"
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError("Default lakehouse not found, please add a lakehouse and restart the session.")
    os.makedirs(download_path, exist_ok=True)
    if not os.path.exists(f"{download_path}/{DATA_FILE}"):
        r = requests.get(f"{remote_url}", timeout=30)
        with open(f"{download_path}/{download_file}", "wb") as f:
            f.write(r.content)
        # Decompress the .gz archive to the final CSV name
        with gzip.open(f"{download_path}/{download_file}", "rb") as fin:
            with open(f"{download_path}/{DATA_FILE}", "wb") as fout:
                fout.write(fin.read())
    print("Downloaded demo data files into lakehouse.")
Hope this helps you get started with downloading data and loading your lakehouse.


2 REPLIES

Thanks,

This worked. My error was in using my Lakehouse name as the destination path. I got the correct destination by going to the folder in the Lakehouse navigation sidebar, clicking the "..." menu, and selecting "Copy File API Path".

```
import os, requests

# Base URL for the 2023 AIS files on the NOAA server
remote_url = "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023"
download_file = "AIS_2023_01_01.zip"
download_path = "/lakehouse/default/Files/ais/2023"  # File API path from the sidebar

os.makedirs(download_path, exist_ok=True)
r = requests.get(f"{remote_url}/{download_file}", timeout=60)
with open(f"{download_path}/{download_file}", "wb") as f:
    f.write(r.content)
```
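
Since the NOAA archives are .zip files rather than .gz, extracting them in the notebook would use the standard-library zipfile module instead of the gzip approach in the sample above. A small sketch, assuming a file downloaded as shown:

```
import zipfile

download_path = "/lakehouse/default/Files/ais/2023"
archive = f"{download_path}/AIS_2023_01_01.zip"

# Extract the CSV next to the archive so Spark can read it later
with zipfile.ZipFile(archive) as z:
    z.extractall(download_path)
```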

 
