I'm new to Fabric. I have a bunch of MS Word files uploaded to a lakehouse. For security reasons, they are not allowed to be downloaded locally. I've tried a few Python/PySpark scripts in a Fabric notebook to read them, but to no avail. Any advice, please?
Hi @charlie77,
You can use Spark to load Word files as binary, then a small UDF to extract text with python-docx. In Fabric notebooks, the default lakehouse is mounted at /lakehouse/default, so you can read from /lakehouse/default/Files directly (How to use a notebook to load data into your lakehouse). Spark supports whole-file ingestion via the binaryFile format (Binary file data source - Apache Spark), and python-docx can parse Word content from an in-memory stream (python-docx API - Document).
Steps:
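    from io import BytesIO

    from docx import Document  # provided by the python-docx package
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def extract_text(content: bytes) -> str:
        # Parse the raw .docx bytes in memory and join the paragraph text.
        try:
            doc = Document(BytesIO(content))
            return "\n".join(p.text for p in doc.paragraphs)
        except Exception as e:
            # Return the error as text so one bad file does not fail the job.
            return f"ERROR: {e}"

    # Wrap the function so it can be applied to the binary content column.
    extract_udf = udf(extract_text, StringType())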
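For reference, here is a rough sketch of how the binaryFile read and the UDF might fit together end to end. The helper names `docx_bytes_to_text` and `read_docx_folder` are made up for illustration, and the sketch assumes python-docx is installed in the notebook session (for example via `%pip install python-docx`). The folder path is left as a parameter because the right form (relative to the default lakehouse or absolute) depends on your setup.

```python
from io import BytesIO


def docx_bytes_to_text(content: bytes) -> str:
    """Return the paragraph text of one .docx file given its raw bytes."""
    try:
        # Imported lazily so the function can be shipped to Spark workers;
        # assumes python-docx is available there.
        from docx import Document

        doc = Document(BytesIO(content))
        return "\n".join(p.text for p in doc.paragraphs)
    except Exception as e:
        # Keep going on unreadable files and surface the reason in the output.
        return f"ERROR: {e}"


def read_docx_folder(spark, folder: str):
    """Load every .docx under `folder` and add a `text` column (hypothetical helper)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    extract_udf = udf(docx_bytes_to_text, StringType())
    files = (
        spark.read.format("binaryFile")          # one row per file
        .option("pathGlobFilter", "*.docx")      # skip non-Word files
        .option("recursiveFileLookup", "true")   # include subfolders
        .load(folder)
    )
    # binaryFile exposes path, modificationTime, length and content columns.
    return files.select("path", extract_udf("content").alias("text"))
```

In a Fabric notebook the `spark` session already exists, so you would call `read_docx_folder(spark, <your folder>)` and inspect the resulting `path` and `text` columns. Rows whose `text` starts with `ERROR:` point to files that could not be parsed.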
If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.
Thank you, tayloramy. It worked with a bit of tweaking, including using the absolute path instead of the relative path.
Glad to hear it. Happy I could help.
Thank you, tayloramy, for the prompt advice. I'll give it a try.