Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get certified in Microsoft Fabric—for free! For a limited time, get a free DP-600 exam voucher to use by the end of 2024. Register now

Reply

How to read excel file in a notebook in fabric without using datapipeline

In Databricks to read a excel file we will use com.crealytics.spark.excel by installing libraries . In case of Fabric notebook how can we read an excel file with out using data pipeline

2 ACCEPTED SOLUTIONS

Hi @poojitha_yeluru,

 

Yes, it should be possible to install libraries into Fabric as well. You can manage them via an environment, https://learn.microsoft.com/en-us/fabric/data-engineering/environment-manage-library

 



Happy to help out 🙂

I share #PowerAutomate and #SharePointOnline content on my Blog, Bluesky profile or Youtube Channel

View solution in original post

Hi @poojitha_yeluru 
As mentioned by @Expiscornovus we can use Environments in Fabric. and also Pandas library to read an excel file. These are the available options for now. 

Please let me know if you have any further questions.

View solution in original post

9 REPLIES 9
Calebstimothy
New Member

It make Microsoft fabric stand out as the ultimate analytics tool for Data Engineers

_maclura_
Frequent Visitor

The easiest general-purpose solution I found until now to process Excel files in Fabric using only notebooks (PySpark) is illustrated below.

 

Conditions:

- The Excel source file has been uploaded in a lakehouse not attached to the notebook.
- A specific range in an Excel spreadsheet must be loaded into a dataframe for processing.
- The processed dataframe must be saved in a delta table in another lakehouse not attached to the notebook.

 

Strategy:

Use the Pandas Python library to read the source Excel file content.

 

Challenges:

The Pandas library (as well as other binary I/O activities in Python, like copy an Excel File from Sharepoint to lakehouse, but this is another topic) can't manage absolute lakehouse "abfss" file paths, only locally mounted filesystem paths.

 

Solution:

  • Mount the source lakehouse in the notebook local filesystem (use Notebookutils)
  • Get the detailed locally mounted filesystem full path (use Notebookutils)
  • Read the Excel file in a Pandas dataframe (use Pandas)
  • Do whatever you have to do with that dataframe (use Pandas)
  • Convert the Pandas dataframe in a Spark dataframe (use pyspark.sql.SparkSession)
    • because, in my case,  Pandas can't write a dataframe in a delta table using an absolute path
  • Write the Spark dataframe to the destination lakehouse e.g. in a delta table (use pyspark.sql.SparkSession)

 

Details:

import notebookutils as nu  # required to mount lakehouse filesystem for Pandas
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() # you can omit this row since the shell automatically creates the session in the variable spark for users.

# Source parameters
src_ws_id  = 'any_workspace_id1'
src_lh_id  = 'any_lakhouse_id1'
src_folder = 'any_subfolder' # (in the form 'subdir/.../subdir' or empty string '')
src_file   = 'any_Excel_file_name'

# Destination parameters
dst_ws_id  = 'any_workspace_id2'
dst_lh_id  = 'any_lakhouse_id2'
dst_table  = 'any_table_name'

def mount_lakehouse(workspace_id, lakehouse_id, mount_point='/mnt'):
    """
    Mount a remote lakehouse filesystem to the local filesystem of the notebook.

    :param workspace_id: ID of the lakehouse workspace.
    :param lakehouse_id: ID of the lakehouse.
    :param mount_point: Mount point in the local filesystem (optional, default is '/mnt').
    :return: The mount point in the local filesystem.
    """
    try:
        # Construct the URL of the remote filesystem
        remote_url = f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}'
        
        # Mount the remote filesystem to the local filesystem
        nu.fs.mount(remote_url, mount_point)
    except Exception as e:
        # Handle exceptions and exit the notebook in case of an error
        exit_value = f'Error type: {type(e).__name__}\nError description: {e}'
        nu.notebook.exit(exit_value)
    else:
        print('Lakehouse filesystem successfully mounted!')
        return mount_point

def unmount_lakehouse(mount_point='/mnt'):
    """
    Unmount a lakehouse filesystem from the local filesystem.

    :param mount_point: Mount point in the local filesystem (optional, default is '/mnt').
    :return: Result of the unmount operation.
    """
    try:
        # Unmount the filesystem from the local mount point
        result = nu.fs.unmount(mount_point)
    except Exception as e:
        # Handle exceptions and exit the notebook in case of an error
        exit_value = f'Error type: {type(e).__name__}\nError description: {e}'
        nu.notebook.exit(exit_value)
    else:
        print('Lakehouse filesystem successfully unmounted!')
        return result

# Mount source lakehouse (because Pandas can't read abfss absolute paths)
src_mount_point = mount_lakehouse(src_ws_id, src_lh_id, src_mount_point)

# Get full local fs path for source lakehouse
src_path = f'{nu.fs.getMountPath(src_mount_point)}/Files/{src_folder}/{src_file}'

# Read an Excel file into a Pandas dataframe
# in this case I read the first 4 columns in the second sheet skipping the first 4 rows and using the first row (row n#5 because 4 are skipped) as header
df = pd.read_excel(f'{src_path}', sheet_name = 2, skiprows = 4, usecols = 'A:D')

# at this point do what you have to do with this dataframe
# and then...

# Convert pandas df in spark df
spark_df = spark.createDataFrame(df) 

# Construct the URL of the remote filesystem
delta_table_abfss_path = f'abfss://{dst_ws_id}@onelake.dfs.fabric.microsoft.com/{dst_lh_id}/Tables/{dst_table}'

# Write the Spark dataframe in a delta table
spark_df.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(delta_table_abfss_path)

# Exit notebook reporting success
exit_value = f'The notebook succeeded!'
nu.notebook.exit(exit_value)

 

The above solution is a general-purpose solution working in any situation, please make only sure you have permission to access the source and destination lakehouses for write operations.

 

I hope thi may help

Expiscornovus
Impactful Individual
Impactful Individual

Hi @poojitha_yeluru,

 

In this thread I have seen somebody use Pandas library for that:

https://community.fabric.microsoft.com/t5/General-Discussion/How-to-load-an-excel-file-from-Sharepoi...

 

Below is an example

 

import pandas as pd
df = pd.read_excel("abfss://<id>@onelake.dfs.fabric.microsoft.com/<id>/Files/Excel/TestBook.xlsx")
display(df)

 

readexcel_notebook.png



Happy to help out 🙂

I share #PowerAutomate and #SharePointOnline content on my Blog, Bluesky profile or Youtube Channel

Pandas can't read from absolute abfss:// paths.
In fact, in your screenshot we read:

_maclura__0-1733403718180.png

I hope this may help

Hi @Expiscornovus ,

Thanks for sharing , I had already gone to above way but pandas is a costly operation. In databricks we will install library ie. (com.crealytics:spark-excel_2.12:0.13.5), is there any way like to install a library in fabric.So that we can read excel file..

 

Hi @poojitha_yeluru,

 

Yes, it should be possible to install libraries into Fabric as well. You can manage them via an environment, https://learn.microsoft.com/en-us/fabric/data-engineering/environment-manage-library

 



Happy to help out 🙂

I share #PowerAutomate and #SharePointOnline content on my Blog, Bluesky profile or Youtube Channel

Hi @Expiscornovus ,
Yes  From public library we have to select PYPI and we have to use xlrd it is also pointing to pandas. Is there any other way.

Hi @poojitha_yeluru 
As mentioned by @Expiscornovus we can use Environments in Fabric. and also Pandas library to read an excel file. These are the available options for now. 

Please let me know if you have any further questions.

I want to add to this by saying that installing libraries as .jar files for Spark use is not supported via Environments. So the previously mentioned library to read excel as Spark is not possible via Environment.

Manage Apache Spark libraries - Microsoft Fabric | Microsoft Learn

Helpful resources

Announcements
November Carousel

Fabric Community Update - November 2024

Find out what's new and trending in the Fabric Community.

Live Sessions with Fabric DB

Be one of the first to start using Fabric Databases

Starting December 3, join live sessions with database experts and the Fabric product team to learn just how easy it is to get started.

November Update

Fabric Monthly Update - November 2024

Check out the November 2024 Fabric update to learn about new features.

Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early Bird pricing ends December 9th.