I have a Lakehouse in Fabric, and under Files there is a main folder called X.
Inside X, there are 56 subfolders, but I only want to process folders starting from the 8th one onward (i.e., skip the first 7).
Each of those folders may contain:
Excel files directly, or
Nested subfolders that contain the Excel files.
I need to find all Excel files whose filename contains “Maturity” (case-insensitive).
There may be multiple such files in a single folder because they differ by FYXX Hx (e.g., FY24 H1, FY23 H2).
For every matching Excel file:
Extract metadata and tag every row with:
The main folder name (the folder directly under X); the tag is the part of the folder name before the '-' delimiter
The quarter tag (FYXX Hx), extracted from the filename
The source filename
Each Excel workbook has 20 sheets:
I need the 3rd sheet; this is the Summary sheet (I need all columns)
I need sheets 4 to 20; these are Detail sheets (I need only 3 specific columns)
If a folder under X contains multiple qualifying Excel files, I should end up with multiple sets of extracted rows.
At the end, I want two final tables:
Summary table (combined from all 3rd sheets)
Detail table (combined from sheets 4–20)
Each row is tagged with the folder name and quarter.
I would like to implement this logic using Python or PySpark in a Fabric Notebook so I can automate everything within the Lakehouse environment.
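For the quarter tag described above, a small regex does the job. This is just an illustrative sketch of the pattern, assuming tags always look like FYXX H1/H2 (possibly separated by spaces, underscores, or dashes); the example filenames are made up:

```python
import re

def extract_quarter_tag(filename):
    """Illustrative: find 'FYXX Hx' in a filename, tolerating '_', '-' and spaces."""
    norm = filename.replace("_", " ").replace("-", " ")
    m = re.search(r"FY\s*(\d{2})\s*H\s*([12])", norm, re.IGNORECASE)
    return f"FY{m.group(1)} H{m.group(2)}" if m else None

print(extract_quarter_tag("Maturity_Assessment_FY24 H1.xlsx"))  # FY24 H1
print(extract_quarter_tag("maturity-fy23h2-final.xlsx"))        # FY23 H2
```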
Solved! Go to Solution.
Hi @datagig ,
Thank you for the update. Based on your description, the problem isn't with your loop or how you index sheets. Instead, it's that Fabric notebooks can't reliably open Excel files using standard file system paths. Files stored in Lakehouse / Files typically aren't accessible as local OS files within the notebook, so functions like pd.read_excel(path) or pd.ExcelFile(path) may not work as expected.
To access workbooks stored in a Lakehouse, the recommended approach is:
1. Use mssparkutils.fs.open() to read the file and load the bytes into a BytesIO buffer.
2. Pass that buffer to pandas.ExcelFile(), then read each sheet by name rather than by index.
This method generally resolves the "cannot read sheet" errors.
Here’s an example you can try in your Fabric notebook. I can’t test this in your environment, but this follows the recommended pattern for accessing Excel files from Lakehouse Files.
from mssparkutils import fs
import pandas as pd
import io

def open_workbook(path):
    # Read the raw bytes from Lakehouse Files into an in-memory buffer
    with fs.open(path, "rb") as f:
        raw = f.read()
    bio = io.BytesIO(raw)
    # Pick the engine from the file extension
    lp = path.lower()
    engine = "openpyxl" if lp.endswith((".xlsx", ".xlsm")) else "xlrd"
    return pd.ExcelFile(bio, engine=engine)

xls = open_workbook("/lakehouse/default/Files/X/Subfolder/sample.xlsx")
print("Sheets found:", xls.sheet_names)
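To see the by-name reading from step 2 in isolation, here is a self-contained sketch that builds a tiny workbook in memory (a stand-in for the bytes you would read from Lakehouse Files) and then reads a specific sheet by name from the BytesIO buffer. The sheet and column names here are invented for the demo:

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="Summary", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="Detail1", index=False)
buf.seek(0)

# Open from the buffer, then read a sheet by name rather than by index
xls = pd.ExcelFile(buf, engine="openpyxl")
print(xls.sheet_names)                         # ['Summary', 'Detail1']
df = pd.read_excel(xls, sheet_name="Summary")
print(df.shape)                                # (2, 1)
```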
Could you give this a try and let me know if you get any error messages? That will help determine whether the problem is related to permissions, the workbook format, or the engine.
Thanks for your time.
Any particular reason why you are not orchestrating part of this with Data Pipelines? For example, use it to get the metadata of all the files and then loop through with the condition that the name matches your requirements.
Hi @datagig ,
Could you please let us know if the issue is resolved, or if you need any further assistance?
Thank you.
Hi @datagig ,
Have you had a chance to review the response I shared? Was it helpful, or are you facing any difficulties? Please let us know if you need any clarification.
Thanks.
OK, try this. Again it's AI generated, but I think you'll appreciate that your request is hard to replicate and work through. A tip for working with AI-generated content: split the code into different cells so that you can identify what works and what doesn't, and use display(df) often to see where you are at any point:
import os
import re
import pandas as pd

# =========================
# Config – adjust as needed
# =========================
ROOT_DIR = "/lakehouse/default/Files/X"  # change X to your main folder name under Files
SKIP_FIRST_N_SUBFOLDERS = 7              # skip first 7 subfolders alphabetically
DETAIL_COLUMNS = [
    # e.g., "Customer", "Product", "Amount"
]
SUMMARY_TABLE_NAME = "maturity_summary"
DETAIL_TABLE_NAME = "maturity_detail"

# =========================
# Small helpers
# =========================
def main_folder_tag(folder_name: str) -> str:
    """Take part before first '-' else full name."""
    return folder_name.split("-", 1)[0].strip()

def extract_quarter_tag(filename: str) -> str | None:
    """Find 'FYxx Hx' with flexible spacing/underscores/dashes, case-insensitive."""
    norm = filename.replace("_", " ").replace("-", " ")
    m = re.search(r"(FY)\s*(\d{2})\s*H\s*([12])", norm, re.IGNORECASE)
    return f"FY{m.group(2)} H{m.group(3)}" if m else None

def excel_engine_for(path: str) -> str:
    """
    Choose engine by extension:
      - .xlsx / .xlsm -> openpyxl
      - .xls          -> xlrd
    """
    lp = path.lower()
    if lp.endswith(".xlsx") or lp.endswith(".xlsm"):
        return "openpyxl"
    elif lp.endswith(".xls"):
        return "xlrd"
    else:
        raise ValueError(f"Unsupported Excel extension: {path}")

def select_columns_ci(df: pd.DataFrame, wanted: list[str]) -> pd.DataFrame:
    if not wanted:
        raise ValueError("DETAIL_COLUMNS is empty. Populate with your 3 column names.")
    # case-insensitive map
    lower_map = {c.lower(): c for c in df.columns}
    actual = []
    missing = []
    for w in wanted:
        k = w.lower()
        if k in lower_map:
            actual.append(lower_map[k])
        else:
            missing.append(w)
    if missing:
        raise KeyError(f"Detail columns not found (case-insensitive): {missing}\nAvailable: {list(df.columns)}")
    return df[actual]

# =========================
# Scan folders & process
# =========================
summary_parts = []
detail_parts = []

# deterministic subfolder order (alphabetical)
subfolders = sorted([f for f in os.listdir(ROOT_DIR) if os.path.isdir(os.path.join(ROOT_DIR, f))])
to_process = subfolders[SKIP_FIRST_N_SUBFOLDERS:]
print(f"[INFO] Subfolders found: {len(subfolders)}; skipping first {SKIP_FIRST_N_SUBFOLDERS}; processing {len(to_process)}.")

for sf in to_process:
    sf_path = os.path.join(ROOT_DIR, sf)
    tag = main_folder_tag(sf)
    print(f"\n[Folder] {sf} -> tag: '{tag}'")

    # Find Excel files that contain 'Maturity' in filename (case-insensitive)
    excel_files = []
    for root, _, files in os.walk(sf_path):
        for fn in files:
            if fn.lower().endswith((".xlsx", ".xlsm", ".xls")) and "maturity" in fn.lower():
                excel_files.append(os.path.join(root, fn))

    if not excel_files:
        print("  [INFO] No matching Excel files.")
        continue
    print(f"  [INFO] {len(excel_files)} matching file(s).")

    for fpath in excel_files:
        fname = os.path.basename(fpath)
        quarter = extract_quarter_tag(fname)
        engine = excel_engine_for(fpath)
        print(f"  [File] {fname} | quarter: {quarter} | engine: {engine}")

        # Use ExcelFile to list sheets safely before reading
        try:
            xls = pd.ExcelFile(fpath, engine=engine)
            sheet_names = xls.sheet_names  # ordered
            sheet_count = len(sheet_names)
        except Exception as e:
            print(f"  [WARN] Cannot open workbook: {e}")
            continue

        # ---- Summary: 3rd sheet (index 2) if it exists
        if sheet_count >= 3:
            sheet3_name = sheet_names[2]
            try:
                df_sum = pd.read_excel(xls, sheet_name=sheet3_name)
                if not df_sum.empty:
                    df_sum["main_folder"] = tag
                    df_sum["quarter"] = quarter
                    df_sum["source_file"] = fname
                    summary_parts.append(df_sum)
                else:
                    print("  [INFO] Summary sheet (3rd) is empty.")
            except Exception as e:
                print(f"  [WARN] Failed reading Summary (sheet 3='{sheet3_name}'): {e}")
        else:
            print("  [INFO] Workbook has fewer than 3 sheets; Summary skipped.")

        # ---- Detail: sheets 4..20 (indexes 3..19) if they exist
        if sheet_count >= 4:
            start = 3
            end = min(19, sheet_count - 1)  # cap to last available
            for idx in range(start, end + 1):
                sname = sheet_names[idx]
                try:
                    df_det = pd.read_excel(xls, sheet_name=sname)
                    if df_det.empty:
                        continue
                    # select only needed columns (case-insensitive)
                    try:
                        df_sel = select_columns_ci(df_det, DETAIL_COLUMNS)
                    except Exception as col_err:
                        # If columns fail for just one sheet, skip that sheet but continue others
                        print(f"  [WARN] Sheet {idx+1} ('{sname}') column selection error: {col_err}")
                        continue
                    df_sel["main_folder"] = tag
                    df_sel["quarter"] = quarter
                    df_sel["source_file"] = fname
                    detail_parts.append(df_sel)
                except Exception as e:
                    print(f"  [WARN] Failed reading Detail sheet {idx+1} ('{sname}'): {e}")
        else:
            print("  [INFO] Workbook has fewer than 4 sheets; no Detail sheets to process.")

# =========================
# Combine & Save to Delta
# =========================
summary_df = pd.concat(summary_parts, ignore_index=True) if summary_parts else pd.DataFrame()
detail_df = pd.concat(detail_parts, ignore_index=True) if detail_parts else pd.DataFrame()

print("\n[RESULT] Combined shapes:")
print(f"  Summary: {summary_df.shape}")
print(f"  Detail : {detail_df.shape}")

if not summary_df.empty:
    spark.createDataFrame(summary_df).write.format("delta").mode("overwrite").saveAsTable(SUMMARY_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {SUMMARY_TABLE_NAME}")
else:
    print("[SAVE] Summary empty; no table written.")

if not detail_df.empty:
    spark.createDataFrame(detail_df).write.format("delta").mode("overwrite").saveAsTable(DETAIL_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {DETAIL_TABLE_NAME}")
else:
    print("[SAVE] Detail empty; no table written.")
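A quick way to sanity-check the case-insensitive column selection before running the full scan is to call it on a toy DataFrame. This is a standalone rewrite of the same idea as select_columns_ci above; the column names are invented for the demo:

```python
import pandas as pd

def select_columns_ci(df, wanted):
    # Match the wanted names against the actual columns case-insensitively
    lower_map = {c.lower(): c for c in df.columns}
    return df[[lower_map[w.lower()] for w in wanted]]

df = pd.DataFrame({"Customer": ["A"], "PRODUCT": ["B"], "Amount": [1]})
print(list(select_columns_ci(df, ["customer", "product", "amount"]).columns))
# ['Customer', 'PRODUCT', 'Amount']
```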
--------------------------------
I hope this helps, please give kudos and mark as solved if it does!
Connect with me on LinkedIn.
Subscribe to my YouTube channel for Fabric/Power Platform related content!
I really appreciate your effort. Thank you! Yes, trying to get through it.
Hi @wardy912, I think the script is not able to read or open workbooks and sheets with some Python or Spark methods I tried using. Do you know a function that will work in this case when I am trying to open a workbook from a notebook and the workbooks are present in lakehouse/files? I tried using a couple of them, but nothing works.
Hi @wardy912, my script didn’t complete successfully. It reaches each folder and detects the Excel files correctly, but it fails when trying to read the sheets from 3 to 20. For every file and folder, it returns an error saying it cannot read those sheets.
Hi @datagig
The following code is AI generated and unconfirmed but I hope it helps!
# Fabric Notebook: Extract Maturity Excel Sheets -> Summary & Detail Lakehouse tables
import os
import re
import pandas as pd

# ----------------------------
# Config – update these values
# ----------------------------
ROOT_DIR = "/lakehouse/default/Files/X"  # change X to your main folder name under Files
SKIP_FIRST_N_SUBFOLDERS = 7              # skip first 7 subfolders under X
DETAIL_COLUMNS = [
    # TODO: put your 3 column names EXACTLY as they appear in the Detail sheets
    # e.g., "Customer", "Product", "Amount"
]
SUMMARY_TABLE_NAME = "maturity_summary"  # Lakehouse table name for the Summary
DETAIL_TABLE_NAME = "maturity_detail"    # Lakehouse table name for the Detail
OUTPUT_FILES_DIR = "/lakehouse/default/Files/outputs/maturity"  # optional file outputs

# ----------------------------
# Helpers
# ----------------------------
def list_top_level_subfolders(path: str) -> list[str]:
    """Return a sorted list of immediate subfolders under path."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Root path not found: {path}")
    subfolders = [
        f for f in os.listdir(path)
        if os.path.isdir(os.path.join(path, f))
    ]
    # deterministic ordering: alphabetical
    subfolders.sort()
    return subfolders

def main_folder_tag(folder_name: str) -> str:
    """Tag = part before first '-' if present, else whole folder name."""
    return folder_name.split("-", 1)[0].strip()

def extract_quarter_tag(filename: str) -> str | None:
    """
    Extract 'FYXX Hx' (e.g. FY24 H1) from filename, case-insensitive.
    Accepts optional spaces/underscores/dashes between parts.
    """
    # Normalize
    fname = filename.replace("_", " ").replace("-", " ")
    m = re.search(r"(FY)\s*(\d{2})\s*H\s*([12])", fname, re.IGNORECASE)
    if m:
        yy = m.group(2)
        h = m.group(3)
        return f"FY{yy} H{h}"
    return None

def is_excel_file(path: str) -> bool:
    return path.lower().endswith((".xlsx", ".xls"))

def matches_maturity(filename: str) -> bool:
    return "maturity" in filename.lower()

def select_columns_case_insensitive(df: pd.DataFrame, wanted: list[str]) -> pd.DataFrame:
    """
    Select columns from df matching wanted names case-insensitively.
    Preserves order of wanted.
    """
    if not wanted:
        raise ValueError("DETAIL_COLUMNS is empty. Please set the 3 detail column names.")
    lower_map = {c.lower(): c for c in df.columns}
    selected_actual = []
    missing = []
    for w in wanted:
        key = w.lower()
        if key in lower_map:
            selected_actual.append(lower_map[key])
        else:
            missing.append(w)
    if missing:
        raise KeyError(f"Detail columns not found (case-insensitive): {missing}\nAvailable: {list(df.columns)}")
    return df[selected_actual]

def read_sheet_by_index(file_path: str, sheet_index: int) -> pd.DataFrame | None:
    """
    Read a single sheet by 0-based index.
    Returns None if the sheet index is out of range or if reading fails.
    """
    try:
        if file_path.lower().endswith(".xlsx"):
            df = pd.read_excel(file_path, sheet_name=sheet_index, engine="openpyxl")
        else:
            df = pd.read_excel(file_path, sheet_name=sheet_index, engine="xlrd")
        return df
    except Exception as e:
        # Common reasons: sheet index out of range, bad workbook, etc.
        print(f"    [WARN] Cannot read sheet {sheet_index+1} from {os.path.basename(file_path)}: {e}")
        return None

def ensure_output_dirs():
    try:
        os.makedirs(OUTPUT_FILES_DIR, exist_ok=True)
    except Exception as e:
        print(f"[WARN] Could not create output dir {OUTPUT_FILES_DIR}: {e}")

# ----------------------------
# Processing
# ----------------------------
summary_rows = []  # list of pandas DataFrames
detail_rows = []   # list of pandas DataFrames

subfolders = list_top_level_subfolders(ROOT_DIR)
if len(subfolders) <= SKIP_FIRST_N_SUBFOLDERS:
    print(f"[INFO] Only {len(subfolders)} subfolders under {ROOT_DIR}. Nothing to process after skipping {SKIP_FIRST_N_SUBFOLDERS}.")
else:
    to_process = subfolders[SKIP_FIRST_N_SUBFOLDERS:]
    print(f"[INFO] Found {len(subfolders)} subfolders. Skipping {SKIP_FIRST_N_SUBFOLDERS}, processing {len(to_process)}:")
    for sf in to_process:
        tag = main_folder_tag(sf)
        sf_path = os.path.join(ROOT_DIR, sf)
        print(f"\n[Folder] {sf} -> tag: '{tag}'")

        # recurse into this subfolder to find matching Excel files
        matched_files = []
        for root, dirs, files in os.walk(sf_path):
            for fn in files:
                if is_excel_file(fn) and matches_maturity(fn):
                    matched_files.append(os.path.join(root, fn))

        if not matched_files:
            print("  [INFO] No matching 'Maturity' Excel files in this folder.")
            continue
        print(f"  [INFO] {len(matched_files)} matching workbook(s) found.")

        for fpath in matched_files:
            fname = os.path.basename(fpath)
            quarter = extract_quarter_tag(fname)
            if not quarter:
                print(f"  [WARN] Could not extract quarter tag from filename '{fname}'. Row will have quarter=None.")
            print(f"  [File] {fname} | quarter: {quarter}")

            # ---- Summary: 3rd sheet (index 2)
            df_sum = read_sheet_by_index(fpath, sheet_index=2)
            if df_sum is not None and not df_sum.empty:
                # Tagging columns
                df_sum["main_folder"] = tag
                df_sum["quarter"] = quarter
                df_sum["source_file"] = fname
                summary_rows.append(df_sum)
            else:
                print("  [INFO] Summary sheet (3rd) missing or empty.")

            # ---- Details: sheets 4..20 (indexes 3..19)
            dfs_detail_this_file = []
            for idx in range(3, 20):
                df_det = read_sheet_by_index(fpath, sheet_index=idx)
                if df_det is None or df_det.empty:
                    continue
                try:
                    # select only desired columns
                    df_sel = select_columns_case_insensitive(df_det, DETAIL_COLUMNS)
                except Exception as e:
                    print(f"    [WARN] Skipping sheet {idx+1} due to column selection error: {e}")
                    continue
                # Tagging columns
                df_sel["main_folder"] = tag
                df_sel["quarter"] = quarter
                df_sel["source_file"] = fname
                dfs_detail_this_file.append(df_sel)
            if dfs_detail_this_file:
                detail_rows.append(pd.concat(dfs_detail_this_file, ignore_index=True))
            else:
                print("  [INFO] No usable Detail sheets (4–20) in this workbook.")

# ----------------------------
# Combine & Save
# ----------------------------
summary_df = pd.concat(summary_rows, ignore_index=True) if summary_rows else pd.DataFrame()
detail_df = pd.concat(detail_rows, ignore_index=True) if detail_rows else pd.DataFrame()

print("\n[RESULT] Combined shapes:")
print(f"  Summary: {summary_df.shape}")
print(f"  Detail : {detail_df.shape}")

# Convert to Spark DataFrames and write Delta tables
if summary_df is not None and not summary_df.empty:
    summary_sdf = spark.createDataFrame(summary_df)
    summary_sdf.write.format("delta").mode("overwrite").saveAsTable(SUMMARY_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {SUMMARY_TABLE_NAME}")
else:
    print("[SAVE] Summary is empty. No table written.")

if detail_df is not None and not detail_df.empty:
    detail_sdf = spark.createDataFrame(detail_df)
    detail_sdf.write.format("delta").mode("overwrite").saveAsTable(DETAIL_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {DETAIL_TABLE_NAME}")
else:
    print("[SAVE] Detail is empty. No table written.")

# Optional: also write files to /Files for ad-hoc access
ensure_output_dirs()
try:
    if summary_df is not None and not summary_df.empty:
        summary_df.to_parquet(os.path.join(OUTPUT_FILES_DIR, "maturity_summary.parquet"), index=False)
    if detail_df is not None and not detail_df.empty:
        detail_df.to_parquet(os.path.join(OUTPUT_FILES_DIR, "maturity_detail.parquet"), index=False)
    print(f"[SAVE] Parquet exports written under {OUTPUT_FILES_DIR}")
except Exception as e:
    print(f"[WARN] Could not write Parquet exports: {e}")
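As a minimal illustration of the tag-and-concatenate pattern the script relies on, here is a toy example (the folder names, filenames, and Score column are all made up for the demo):

```python
import pandas as pd

parts = []
for folder, fname, df in [
    ("ABC", "Maturity FY24 H1.xlsx", pd.DataFrame({"Score": [1, 2]})),
    ("XYZ", "Maturity FY23 H2.xlsx", pd.DataFrame({"Score": [3]})),
]:
    # Tag every row with its origin before pooling the pieces
    df["main_folder"] = folder
    df["quarter"] = fname.split("Maturity ")[1].rsplit(".", 1)[0]
    df["source_file"] = fname
    parts.append(df)

combined = pd.concat(parts, ignore_index=True)
print(combined.shape)  # (3, 4)
```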
--------------------------------
Hi, I am trying this out now. I will update if it works. If I get stuck at something, I will post it here. Thanks for the help!