charlie77
Helper II

How to read MS Word files without downloading

I'm new to Fabric. I have a bunch of MS Word files uploaded to a lakehouse. For security reasons, they can't be downloaded locally for use. I've tried a few Python/PySpark scripts in a Fabric notebook, but to no avail. Any advice, please?

1 ACCEPTED SOLUTION
tayloramy
Solution Sage

Hi @charlie77
 
You can use Spark to load Word files as binary, then a small UDF to extract text with python-docx. In Fabric notebooks, the default lakehouse is mounted at /lakehouse/default, so you can read from /lakehouse/default/Files directly (How to use a notebook to load data into your lakehouse). Spark supports whole-file ingestion via the binaryFile format (Binary file data source - Apache Spark), and python-docx can parse Word content from an in-memory stream (python-docx API - Document). A consolidated sketch of the whole flow follows the steps below.

 

Steps: 

  1. In a new notebook cell, install the dependency:
    %pip install python-docx
  2. In the next cell, import the libraries:
    from io import BytesIO
    from docx import Document
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType
  3. Set the input path to your Word files:
    word_path = "/lakehouse/default/Files/WordDocs/*.docx"
  4. Read the files as binary with Spark:
    df = (spark.read.format("binaryFile").option("pathGlobFilter", "*.docx").load(word_path))
    Columns include: path, modificationTime, length, content.
  5. Define a function to extract text from each file and wrap it as a UDF:
    def extract_text(content: bytes) -> str:
        try:
            doc = Document(BytesIO(content))
            return "\n".join(p.text for p in doc.paragraphs)
        except Exception as e:
            return f"ERROR: {e}"
    
    extract_udf = udf(extract_text, StringType())
  6. Apply the UDF and keep only the file path and extracted text:
    out = df.select("path", extract_udf(col("content")).alias("text"))
  7. Preview the results in the notebook:
    display(out)
  8. Save the results to a Delta table in the lakehouse:
    out.write.mode("overwrite").saveAsTable("word_text")
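
For convenience, here is the whole flow as a single cell (run the %pip install from step 1 in its own cell first). This is a minimal sketch assuming the default lakehouse is attached and your files sit under Files/WordDocs; note that doc.paragraphs only returns body paragraphs, so the commented lines show one way to pick up table text as well.

    from io import BytesIO
    from docx import Document
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    # Folder name is an example; point this at wherever your .docx files live.
    word_path = "/lakehouse/default/Files/WordDocs/*.docx"

    def extract_text(content: bytes) -> str:
        try:
            doc = Document(BytesIO(content))
            parts = [p.text for p in doc.paragraphs]
            # doc.paragraphs does not descend into tables; uncomment to include them:
            # for table in doc.tables:
            #     for row in table.rows:
            #         parts.extend(cell.text for cell in row.cells)
            return "\n".join(parts)
        except Exception as e:
            # Keep going if one file is corrupt; the error lands in the output table.
            return f"ERROR: {e}"

    extract_udf = udf(extract_text, StringType())

    # Whole-file ingestion: one row per .docx, with the bytes in the "content" column.
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.docx").load(word_path)
    out = df.select("path", extract_udf(col("content")).alias("text"))
    out.write.mode("overwrite").saveAsTable("word_text")

Each row of word_text ends up holding a file path and its extracted text, so downstream queries never need to touch the original files.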

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.

 


4 REPLIES
charlie77
Helper II

Thank you, tayloramy. It worked with a bit of tweaking, including using the absolute path instead of the relative path.

Glad to hear it. Happy I could help.
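
For anyone else who needs the absolute-path tweak: if /lakehouse/default isn't available (for example, the lakehouse isn't attached as the notebook default), Spark can read via the OneLake ABFS URI instead. The workspace and lakehouse names below are placeholders; substitute your own.

    # Placeholders: replace <workspace_name> and <lakehouse_name> with yours.
    word_path = "abfss://<workspace_name>@onelake.dfs.fabric.microsoft.com/<lakehouse_name>.Lakehouse/Files/WordDocs/*.docx"
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.docx").load(word_path)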

charlie77
Helper II

Thank you, tayloramy, for the prompt advice. I'll give it a try.
