I have a Lakehouse in Fabric, and under Files there is a main folder called X.
Inside X, there are 56 subfolders, but I only want to process folders starting from the 8th one onward (i.e., skip the first 7).
Each of those folders may contain:
Excel files directly, or
Nested subfolders that contain the Excel files.
I need to find all Excel files whose filename contains “Maturity” (case-insensitive).
There may be multiple such files in a single folder because they differ by FYXX Hx (e.g., FY24 H1, FY23 H2).
For every matching Excel file:
Extract metadata and tag every row with:
The main folder name (the folder directly under X); the tag is the part of the folder name before the '-' delimiter
The quarter tag (FYXX Hx), extracted from the filename
The source filename
Each Excel workbook has 20 sheets:
I need the 3rd sheet; this is the Summary sheet (I need all columns)
I need sheets 4 to 20; these are Detail sheets (I need only 3 specific columns)
If a folder under X contains multiple qualifying Excel files, I should end up with multiple sets of extracted rows.
At the end, I want two final tables:
Summary table (combined from all 3rd sheets)
Detail table (combined from sheets 4–20)
Each row is tagged with the folder name and quarter.
I would like to implement this logic using Python or PySpark in a Fabric Notebook so I can automate everything within the Lakehouse environment.
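For the quarter tag described above, a small regex does the job. This is just an illustrative sketch of the pattern, assuming tags always look like FYXX H1/H2 (possibly separated by spaces, underscores, or dashes); the example filenames are made up:

```python
import re

def extract_quarter_tag(filename):
    """Illustrative: find 'FYXX Hx' in a filename, tolerating '_', '-' and spaces."""
    norm = filename.replace("_", " ").replace("-", " ")
    m = re.search(r"FY\s*(\d{2})\s*H\s*([12])", norm, re.IGNORECASE)
    return f"FY{m.group(1)} H{m.group(2)}" if m else None

print(extract_quarter_tag("Maturity_Assessment_FY24 H1.xlsx"))  # FY24 H1
print(extract_quarter_tag("maturity-fy23h2-final.xlsx"))        # FY23 H2
```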
Solved! Go to Solution.
Hi @datagig ,
Thank you for the update. Based on your description, the problem isn't with your loop or how you index sheets. Instead, it's that Fabric notebooks can't reliably open Excel files using standard file system paths. Files stored in Lakehouse / Files typically aren't accessible as local OS files within the notebook, so functions like pd.read_excel(path) or pd.ExcelFile(path) may not work as expected.
To access workbooks stored in a Lakehouse, the recommended approach is:
1. Use mssparkutils.fs.open() to read the file and load the bytes into a BytesIO buffer.
2. Pass that buffer to pandas.ExcelFile(), then read each sheet by name rather than by index.
This method generally resolves the "cannot read sheet" errors.
Here’s an example you can try in your Fabric notebook. I can’t test this in your environment, but this follows the recommended pattern for accessing Excel files from Lakehouse Files.
from mssparkutils import fs
import pandas as pd
import io

def open_workbook(path):
    # Read the raw bytes from Lakehouse Files into an in-memory buffer
    with fs.open(path, "rb") as f:
        raw = f.read()
    bio = io.BytesIO(raw)
    # Pick the engine from the file extension
    lp = path.lower()
    engine = "openpyxl" if lp.endswith((".xlsx", ".xlsm")) else "xlrd"
    return pd.ExcelFile(bio, engine=engine)

xls = open_workbook("/lakehouse/default/Files/X/Subfolder/sample.xlsx")
print("Sheets found:", xls.sheet_names)
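To see the by-name reading from step 2 in isolation, here is a self-contained sketch that builds a tiny workbook in memory (a stand-in for the bytes you would read from Lakehouse Files) and then reads a specific sheet by name from the BytesIO buffer. The sheet and column names here are invented for the demo:

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="Summary", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="Detail1", index=False)
buf.seek(0)

# Open from the buffer, then read a sheet by name rather than by index
xls = pd.ExcelFile(buf, engine="openpyxl")
print(xls.sheet_names)                         # ['Summary', 'Detail1']
df = pd.read_excel(xls, sheet_name="Summary")
print(df.shape)                                # (2, 1)
```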
Could you give this a try and let me know if you get any error messages? That will help determine whether the problem is related to permissions, the workbook format, or the engine.
Thanks for your time.
Any particular reason why you are not orchestrating part of this with Data Pipelines? For example, use it to get the metadata of all the files and then loop through with the condition that the name matches your requirements.
Hi @datagig ,
Could you please let us know if the issue is resolved, or if you need any further assistance?
Thank you.
Hi @datagig ,
Have you had a chance to review the response I shared? Was it helpful, or are you facing any difficulties? Please let us know if you need any clarification.
Thanks.
OK, try this. Again it's AI generated, but I think you'll appreciate that your request is hard to replicate and work through. A tip for working with AI-generated content: split the code into different cells so that you can identify what works and what doesn't, and use display(df) often to see where you are at any point:
import os
import re
import pandas as pd

# =========================
# Config – adjust as needed
# =========================
ROOT_DIR = "/lakehouse/default/Files/X"  # change X to your main folder name under Files
SKIP_FIRST_N_SUBFOLDERS = 7              # skip first 7 subfolders alphabetically
DETAIL_COLUMNS = [
    # e.g., "Customer", "Product", "Amount"
]
SUMMARY_TABLE_NAME = "maturity_summary"
DETAIL_TABLE_NAME = "maturity_detail"

# =========================
# Small helpers
# =========================
def main_folder_tag(folder_name: str) -> str:
    """Take part before first '-' else full name."""
    return folder_name.split("-", 1)[0].strip()

def extract_quarter_tag(filename: str) -> str | None:
    """Find 'FYxx Hx' with flexible spacing/underscores/dashes, case-insensitive."""
    norm = filename.replace("_", " ").replace("-", " ")
    m = re.search(r"(FY)\s*(\d{2})\s*H\s*([12])", norm, re.IGNORECASE)
    return f"FY{m.group(2)} H{m.group(3)}" if m else None

def excel_engine_for(path: str) -> str:
    """
    Choose engine by extension:
      - .xlsx / .xlsm -> openpyxl
      - .xls          -> xlrd
    """
    lp = path.lower()
    if lp.endswith(".xlsx") or lp.endswith(".xlsm"):
        return "openpyxl"
    elif lp.endswith(".xls"):
        return "xlrd"
    else:
        raise ValueError(f"Unsupported Excel extension: {path}")

def select_columns_ci(df: pd.DataFrame, wanted: list[str]) -> pd.DataFrame:
    if not wanted:
        raise ValueError("DETAIL_COLUMNS is empty. Populate with your 3 column names.")
    # case-insensitive map
    lower_map = {c.lower(): c for c in df.columns}
    actual = []
    missing = []
    for w in wanted:
        k = w.lower()
        if k in lower_map:
            actual.append(lower_map[k])
        else:
            missing.append(w)
    if missing:
        raise KeyError(f"Detail columns not found (case-insensitive): {missing}\nAvailable: {list(df.columns)}")
    return df[actual]

# =========================
# Scan folders & process
# =========================
summary_parts = []
detail_parts = []

# deterministic subfolder order (alphabetical)
subfolders = sorted([f for f in os.listdir(ROOT_DIR) if os.path.isdir(os.path.join(ROOT_DIR, f))])
to_process = subfolders[SKIP_FIRST_N_SUBFOLDERS:]
print(f"[INFO] Subfolders found: {len(subfolders)}; skipping first {SKIP_FIRST_N_SUBFOLDERS}; processing {len(to_process)}.")

for sf in to_process:
    sf_path = os.path.join(ROOT_DIR, sf)
    tag = main_folder_tag(sf)
    print(f"\n[Folder] {sf} -> tag: '{tag}'")

    # Find Excel files that contain 'Maturity' in filename (case-insensitive)
    excel_files = []
    for root, _, files in os.walk(sf_path):
        for fn in files:
            if fn.lower().endswith((".xlsx", ".xlsm", ".xls")) and "maturity" in fn.lower():
                excel_files.append(os.path.join(root, fn))

    if not excel_files:
        print("  [INFO] No matching Excel files.")
        continue
    print(f"  [INFO] {len(excel_files)} matching file(s).")

    for fpath in excel_files:
        fname = os.path.basename(fpath)
        quarter = extract_quarter_tag(fname)
        engine = excel_engine_for(fpath)
        print(f"  [File] {fname} | quarter: {quarter} | engine: {engine}")

        # Use ExcelFile to list sheets safely before reading
        try:
            xls = pd.ExcelFile(fpath, engine=engine)
            sheet_names = xls.sheet_names  # ordered
            sheet_count = len(sheet_names)
        except Exception as e:
            print(f"  [WARN] Cannot open workbook: {e}")
            continue

        # ---- Summary: 3rd sheet (index 2) if it exists
        if sheet_count >= 3:
            sheet3_name = sheet_names[2]
            try:
                df_sum = pd.read_excel(xls, sheet_name=sheet3_name)
                if not df_sum.empty:
                    df_sum["main_folder"] = tag
                    df_sum["quarter"] = quarter
                    df_sum["source_file"] = fname
                    summary_parts.append(df_sum)
                else:
                    print("  [INFO] Summary sheet (3rd) is empty.")
            except Exception as e:
                print(f"  [WARN] Failed reading Summary (sheet 3='{sheet3_name}'): {e}")
        else:
            print("  [INFO] Workbook has fewer than 3 sheets; Summary skipped.")

        # ---- Detail: sheets 4..20 (indexes 3..19) if they exist
        if sheet_count >= 4:
            start = 3
            end = min(19, sheet_count - 1)  # cap to last available
            for idx in range(start, end + 1):
                sname = sheet_names[idx]
                try:
                    df_det = pd.read_excel(xls, sheet_name=sname)
                    if df_det.empty:
                        continue
                    # select only needed columns (case-insensitive)
                    try:
                        df_sel = select_columns_ci(df_det, DETAIL_COLUMNS)
                    except Exception as col_err:
                        # If columns fail for just one sheet, skip that sheet but continue others
                        print(f"  [WARN] Sheet {idx+1} ('{sname}') column selection error: {col_err}")
                        continue
                    df_sel["main_folder"] = tag
                    df_sel["quarter"] = quarter
                    df_sel["source_file"] = fname
                    detail_parts.append(df_sel)
                except Exception as e:
                    print(f"  [WARN] Failed reading Detail sheet {idx+1} ('{sname}'): {e}")
        else:
            print("  [INFO] Workbook has fewer than 4 sheets; no Detail sheets to process.")

# =========================
# Combine & Save to Delta
# =========================
summary_df = pd.concat(summary_parts, ignore_index=True) if summary_parts else pd.DataFrame()
detail_df = pd.concat(detail_parts, ignore_index=True) if detail_parts else pd.DataFrame()

print("\n[RESULT] Combined shapes:")
print(f"  Summary: {summary_df.shape}")
print(f"  Detail : {detail_df.shape}")

if not summary_df.empty:
    spark.createDataFrame(summary_df).write.format("delta").mode("overwrite").saveAsTable(SUMMARY_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {SUMMARY_TABLE_NAME}")
else:
    print("[SAVE] Summary empty; no table written.")

if not detail_df.empty:
    spark.createDataFrame(detail_df).write.format("delta").mode("overwrite").saveAsTable(DETAIL_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {DETAIL_TABLE_NAME}")
else:
    print("[SAVE] Detail empty; no table written.")
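A quick way to sanity-check the case-insensitive column selection before running the full scan is to call it on a toy DataFrame. This is a standalone rewrite of the same idea as select_columns_ci above; the column names are invented for the demo:

```python
import pandas as pd

def select_columns_ci(df, wanted):
    # Match the wanted names against the actual columns case-insensitively
    lower_map = {c.lower(): c for c in df.columns}
    return df[[lower_map[w.lower()] for w in wanted]]

df = pd.DataFrame({"Customer": ["A"], "PRODUCT": ["B"], "Amount": [1]})
print(list(select_columns_ci(df, ["customer", "product", "amount"]).columns))
# ['Customer', 'PRODUCT', 'Amount']
```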
--------------------------------
I hope this helps, please give kudos and mark as solved if it does!
Connect with me on LinkedIn.
Subscribe to my YouTube channel for Fabric/Power Platform related content!
I really appreciate your effort. Thank you! Yes, trying to get through it.
Hi @wardy912, I think the script is not able to read or open workbooks and sheets with some Python or Spark methods I tried using. Do you know a function that will work in this case when I am trying to open a workbook from a notebook and the workbooks are present in lakehouse/files? I tried using a couple of them, but nothing works.
Hi @wardy912, my script didn’t complete successfully. It reaches each folder and detects the Excel files correctly, but it fails when trying to read the sheets from 3 to 20. For every file and folder, it returns an error saying it cannot read those sheets.
Hi @datagig
The following code is AI generated and unconfirmed but I hope it helps!
# Fabric Notebook: Extract Maturity Excel Sheets -> Summary & Detail Lakehouse tables
import os
import re
import pandas as pd

# ----------------------------
# Config – update these values
# ----------------------------
ROOT_DIR = "/lakehouse/default/Files/X"  # change X to your main folder name under Files
SKIP_FIRST_N_SUBFOLDERS = 7              # skip first 7 subfolders under X
DETAIL_COLUMNS = [
    # TODO: put your 3 column names EXACTLY as they appear in the Detail sheets
    # e.g., "Customer", "Product", "Amount"
]
SUMMARY_TABLE_NAME = "maturity_summary"  # Lakehouse table name for the Summary
DETAIL_TABLE_NAME = "maturity_detail"    # Lakehouse table name for the Detail
OUTPUT_FILES_DIR = "/lakehouse/default/Files/outputs/maturity"  # optional file outputs

# ----------------------------
# Helpers
# ----------------------------
def list_top_level_subfolders(path: str) -> list[str]:
    """Return a sorted list of immediate subfolders under path."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Root path not found: {path}")
    subfolders = [
        f for f in os.listdir(path)
        if os.path.isdir(os.path.join(path, f))
    ]
    # deterministic ordering: alphabetical
    subfolders.sort()
    return subfolders

def main_folder_tag(folder_name: str) -> str:
    """Tag = part before first '-' if present, else whole folder name."""
    return folder_name.split("-", 1)[0].strip()

def extract_quarter_tag(filename: str) -> str | None:
    """
    Extract 'FYXX Hx' (e.g. FY24 H1) from filename, case-insensitive.
    Accepts optional spaces/underscores/dashes between parts.
    """
    # Normalize
    fname = filename.replace("_", " ").replace("-", " ")
    m = re.search(r"(FY)\s*(\d{2})\s*H\s*([12])", fname, re.IGNORECASE)
    if m:
        yy = m.group(2)
        h = m.group(3)
        return f"FY{yy} H{h}"
    return None

def is_excel_file(path: str) -> bool:
    return path.lower().endswith((".xlsx", ".xls"))

def matches_maturity(filename: str) -> bool:
    return "maturity" in filename.lower()

def select_columns_case_insensitive(df: pd.DataFrame, wanted: list[str]) -> pd.DataFrame:
    """
    Select columns from df matching wanted names case-insensitively.
    Preserves order of wanted.
    """
    if not wanted:
        raise ValueError("DETAIL_COLUMNS is empty. Please set the 3 detail column names.")
    lower_map = {c.lower(): c for c in df.columns}
    selected_actual = []
    missing = []
    for w in wanted:
        key = w.lower()
        if key in lower_map:
            selected_actual.append(lower_map[key])
        else:
            missing.append(w)
    if missing:
        raise KeyError(f"Detail columns not found (case-insensitive): {missing}\nAvailable: {list(df.columns)}")
    return df[selected_actual]

def read_sheet_by_index(file_path: str, sheet_index: int) -> pd.DataFrame | None:
    """
    Read a single sheet by 0-based index.
    Returns None if the sheet index is out of range or if reading fails.
    """
    try:
        if file_path.lower().endswith(".xlsx"):
            df = pd.read_excel(file_path, sheet_name=sheet_index, engine="openpyxl")
        else:
            df = pd.read_excel(file_path, sheet_name=sheet_index, engine="xlrd")
        return df
    except Exception as e:
        # Common reasons: sheet index out of range, bad workbook, etc.
        print(f"    [WARN] Cannot read sheet {sheet_index+1} from {os.path.basename(file_path)}: {e}")
        return None

def ensure_output_dirs():
    try:
        os.makedirs(OUTPUT_FILES_DIR, exist_ok=True)
    except Exception as e:
        print(f"[WARN] Could not create output dir {OUTPUT_FILES_DIR}: {e}")

# ----------------------------
# Processing
# ----------------------------
summary_rows = []  # list of pandas DataFrames
detail_rows = []   # list of pandas DataFrames

subfolders = list_top_level_subfolders(ROOT_DIR)
if len(subfolders) <= SKIP_FIRST_N_SUBFOLDERS:
    print(f"[INFO] Only {len(subfolders)} subfolders under {ROOT_DIR}. Nothing to process after skipping {SKIP_FIRST_N_SUBFOLDERS}.")
else:
    to_process = subfolders[SKIP_FIRST_N_SUBFOLDERS:]
    print(f"[INFO] Found {len(subfolders)} subfolders. Skipping {SKIP_FIRST_N_SUBFOLDERS}, processing {len(to_process)}:")
    for sf in to_process:
        tag = main_folder_tag(sf)
        sf_path = os.path.join(ROOT_DIR, sf)
        print(f"\n[Folder] {sf} -> tag: '{tag}'")

        # recurse into this subfolder to find matching Excel files
        matched_files = []
        for root, dirs, files in os.walk(sf_path):
            for fn in files:
                if is_excel_file(fn) and matches_maturity(fn):
                    matched_files.append(os.path.join(root, fn))

        if not matched_files:
            print("  [INFO] No matching 'Maturity' Excel files in this folder.")
            continue
        print(f"  [INFO] {len(matched_files)} matching workbook(s) found.")

        for fpath in matched_files:
            fname = os.path.basename(fpath)
            quarter = extract_quarter_tag(fname)
            if not quarter:
                print(f"  [WARN] Could not extract quarter tag from filename '{fname}'. Row will have quarter=None.")
            print(f"  [File] {fname} | quarter: {quarter}")

            # ---- Summary: 3rd sheet (index 2)
            df_sum = read_sheet_by_index(fpath, sheet_index=2)
            if df_sum is not None and not df_sum.empty:
                # Tagging columns
                df_sum["main_folder"] = tag
                df_sum["quarter"] = quarter
                df_sum["source_file"] = fname
                summary_rows.append(df_sum)
            else:
                print("  [INFO] Summary sheet (3rd) missing or empty.")

            # ---- Details: sheets 4..20 (indexes 3..19)
            dfs_detail_this_file = []
            for idx in range(3, 20):
                df_det = read_sheet_by_index(fpath, sheet_index=idx)
                if df_det is None or df_det.empty:
                    continue
                try:
                    # select only desired columns
                    df_sel = select_columns_case_insensitive(df_det, DETAIL_COLUMNS)
                except Exception as e:
                    print(f"    [WARN] Skipping sheet {idx+1} due to column selection error: {e}")
                    continue
                # Tagging columns
                df_sel["main_folder"] = tag
                df_sel["quarter"] = quarter
                df_sel["source_file"] = fname
                dfs_detail_this_file.append(df_sel)
            if dfs_detail_this_file:
                detail_rows.append(pd.concat(dfs_detail_this_file, ignore_index=True))
            else:
                print("  [INFO] No usable Detail sheets (4–20) in this workbook.")

# ----------------------------
# Combine & Save
# ----------------------------
summary_df = pd.concat(summary_rows, ignore_index=True) if summary_rows else pd.DataFrame()
detail_df = pd.concat(detail_rows, ignore_index=True) if detail_rows else pd.DataFrame()

print("\n[RESULT] Combined shapes:")
print(f"  Summary: {summary_df.shape}")
print(f"  Detail : {detail_df.shape}")

# Convert to Spark DataFrames and write Delta tables
if summary_df is not None and not summary_df.empty:
    summary_sdf = spark.createDataFrame(summary_df)
    summary_sdf.write.format("delta").mode("overwrite").saveAsTable(SUMMARY_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {SUMMARY_TABLE_NAME}")
else:
    print("[SAVE] Summary is empty. No table written.")

if detail_df is not None and not detail_df.empty:
    detail_sdf = spark.createDataFrame(detail_df)
    detail_sdf.write.format("delta").mode("overwrite").saveAsTable(DETAIL_TABLE_NAME)
    print(f"[SAVE] Lakehouse table written: {DETAIL_TABLE_NAME}")
else:
    print("[SAVE] Detail is empty. No table written.")

# Optional: also write files to /Files for ad-hoc access
ensure_output_dirs()
try:
    if summary_df is not None and not summary_df.empty:
        summary_df.to_parquet(os.path.join(OUTPUT_FILES_DIR, "maturity_summary.parquet"), index=False)
    if detail_df is not None and not detail_df.empty:
        detail_df.to_parquet(os.path.join(OUTPUT_FILES_DIR, "maturity_detail.parquet"), index=False)
    print(f"[SAVE] Parquet exports written under {OUTPUT_FILES_DIR}")
except Exception as e:
    print(f"[WARN] Could not write Parquet exports: {e}")
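As a minimal illustration of the tag-and-concatenate pattern the script relies on, here is a toy example (the folder names, filenames, and Score column are all made up for the demo):

```python
import pandas as pd

parts = []
for folder, fname, df in [
    ("ABC", "Maturity FY24 H1.xlsx", pd.DataFrame({"Score": [1, 2]})),
    ("XYZ", "Maturity FY23 H2.xlsx", pd.DataFrame({"Score": [3]})),
]:
    # Tag every row with its origin before pooling the pieces
    df["main_folder"] = folder
    df["quarter"] = fname.split("Maturity ")[1].rsplit(".", 1)[0]
    df["source_file"] = fname
    parts.append(df)

combined = pd.concat(parts, ignore_index=True)
print(combined.shape)  # (3, 4)
```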
--------------------------------
Hi, I am trying this out now. I will update if it works. If I get stuck at something, I will post it here. Thanks for the help!