charlie77
Helper II

How to read MS Word files without downloading

I'm new to Fabric. I have a bunch of MS Word files uploaded to a lakehouse. For security reasons they are not allowed to be downloaded locally. I've tried a few Python/PySpark scripts in a notebook (in Fabric), but to no avail. Any advice, please?

1 ACCEPTED SOLUTION
tayloramy
Community Champion

Hi @charlie77
 
You can use Spark to load Word files as binary, then a small UDF to extract text with python-docx. In Fabric notebooks, the default lakehouse is mounted at /lakehouse/default, so you can read from /lakehouse/default/Files directly (How to use a notebook to load data into your lakehouse). Spark supports whole-file ingestion via the binaryFile format (Binary file data source - Apache Spark), and python-docx can parse Word content from an in-memory stream (python-docx API - Document).

 

Steps: 

  1. In a new notebook cell, install the dependency:
    %pip install python-docx
  2. In the next cell, import the libraries:
    from io import BytesIO
    from docx import Document
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType
  3. Set the input path to your Word files:
    word_path = "/lakehouse/default/Files/WordDocs/*.docx"
  4. Read the files as binary with Spark:
    df = (spark.read.format("binaryFile").option("pathGlobFilter", "*.docx").load(word_path))
    Columns include: path, modificationTime, length, content.
  5. Define a function to extract text from each file and wrap it as a UDF:
    def extract_text(content: bytes) -> str:
        try:
            doc = Document(BytesIO(content))
            return "\n".join(p.text for p in doc.paragraphs)
        except Exception as e:
            return f"ERROR: {e}"

    extract_udf = udf(extract_text, StringType())
  6. Apply the UDF and keep only the file path and extracted text:
    out = df.select("path", extract_udf(col("content")).alias("text"))
  7. Preview the results in the notebook:
    display(out)
  8. Save the results to a Delta table in the lakehouse:
    out.write.mode("overwrite").saveAsTable("word_text")
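A side note on step 1: if `%pip install` is restricted in your environment, the text extraction can also be done with the standard library alone, since a .docx file is just a ZIP archive whose body text lives in word/document.xml. A minimal sketch (the name extract_text_stdlib is made up here, and this version ignores tables, headers, and footers):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraph (w:p) and text-run (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_text_stdlib(content: bytes) -> str:
    """Extract paragraph text from raw .docx bytes using only the stdlib."""
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        # Join all text runs inside one paragraph into a single line.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{W_NS}t")))
    return "\n".join(paragraphs)
```

This function can be wrapped in a UDF exactly like extract_text in step 5.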

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.

 

charlie77
Helper II

Thank you, tayloramy. It worked with a bit of tweaking, including using the absolute path instead of the relative path.
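For readers who hit the same path issue: Spark in Fabric typically wants an absolute OneLake (ABFS) path rather than the local /lakehouse/default mount used by file-system APIs. A sketch of the absolute form, where MyWorkspace and MyLakehouse are placeholders for the real workspace and lakehouse names:

```python
# Hypothetical absolute OneLake path; substitute your own workspace
# and lakehouse names before passing it to spark.read.load().
word_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Files/WordDocs/*.docx"
)
```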

Glad to hear it. Happy I could help!

charlie77
Helper II

Thank you, tayloramy, for the prompt advice. I'll give it a try.

