charlie77
Helper II

How to read MS Word files without downloading

I'm new to Fabric. I have a bunch of MS Word files uploaded to a lakehouse. For security reasons they are not allowed to be downloaded locally. I've tried a few Python/PySpark scripts in a notebook (in Fabric), but to no avail. Any advice, please?

1 ACCEPTED SOLUTION
tayloramy
Community Champion

Hi @charlie77
 
You can use Spark to load Word files as binary, then a small UDF to extract text with python-docx. In Fabric notebooks, the default lakehouse is mounted at /lakehouse/default, so you can read from /lakehouse/default/Files directly (How to use a notebook to load data into your lakehouse). Spark supports whole-file ingestion via the binaryFile format (Binary file data source - Apache Spark), and python-docx can parse Word content from an in-memory stream (python-docx API - Document).

 

Steps: 

  1. In a new notebook cell, install the dependency:
    %pip install python-docx
  2. In the next cell, import the libraries:
    from io import BytesIO
    from docx import Document
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType
  3. Set the input path to your Word files:
    word_path = "/lakehouse/default/Files/WordDocs/*.docx"
  4. Read the files as binary with Spark:
    df = (spark.read.format("binaryFile").option("pathGlobFilter", "*.docx").load(word_path))
    Columns include: path, modificationTime, length, content.
  5. Define a function to extract text from each file and wrap it as a UDF:
    def extract_text(content: bytes) -> str:
        try:
            doc = Document(BytesIO(content))
            return "\n".join(p.text for p in doc.paragraphs)
        except Exception as e:
            return f"ERROR: {e}"

    extract_udf = udf(extract_text, StringType())
  6. Apply the UDF and keep only the file path and extracted text:
    out = df.select("path", extract_udf(col("content")).alias("text"))
  7. Preview the results in the notebook:
    display(out)
  8. Save the results to a Delta table in the lakehouse:
    out.write.mode("overwrite").saveAsTable("word_text")
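A side note on step 1: if `%pip install` is restricted in your environment, the text extraction can also be done with the standard library alone, since a .docx file is just a ZIP archive whose body text lives in word/document.xml. A minimal sketch (the name extract_text_stdlib is made up here, and this version ignores tables, headers, and footers):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraph (w:p) and text-run (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_text_stdlib(content: bytes) -> str:
    """Extract paragraph text from raw .docx bytes using only the stdlib."""
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        # Join all text runs inside one paragraph into a single line.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{W_NS}t")))
    return "\n".join(paragraphs)
```

This function can be wrapped in a UDF exactly like extract_text in step 5.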

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.

 

charlie77
Helper II

Thank you, tayloramy. It worked with a bit of tweaking, including using the absolute path instead of the relative path.
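For readers who hit the same path issue: Spark in Fabric typically wants an absolute OneLake (ABFS) path rather than the local /lakehouse/default mount used by file-system APIs. A sketch of the absolute form, where MyWorkspace and MyLakehouse are placeholders for the real workspace and lakehouse names:

```python
# Hypothetical absolute OneLake path; substitute your own workspace
# and lakehouse names before passing it to spark.read.load().
word_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Files/WordDocs/*.docx"
)
```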

Glad to hear it. Happy I could help!

charlie77
Helper II

Thank you, tayloramy, for the prompt advice. I'll give it a try.

