Starting December 3, join live sessions with database experts and the Microsoft product team to learn just how easy it is to get started
Learn moreGet certified in Microsoft Fabric—for free! For a limited time, get a free DP-600 exam voucher to use by the end of 2024. Register now
In Databricks to read a excel file we will use com.crealytics.spark.excel by installing libraries . In case of Fabric notebook how can we read an excel file with out using data pipeline
Solved! Go to Solution.
Hi @poojitha_yeluru,
Yes, it should be possible to install libraries into Fabric as well. You can manage them via an environment, https://learn.microsoft.com/en-us/fabric/data-engineering/environment-manage-library
Hi @poojitha_yeluru
As mentioned by @Expiscornovus we can use Environments in Fabric. and also Pandas library to read an excel file. These are the available options for now.
Please let me know if you have any further questions.
It make Microsoft fabric stand out as the ultimate analytics tool for Data Engineers
The easiest general-purpose solution I found until now to process Excel files in Fabric using only notebooks (PySpark) is illustrated below.
Conditions:
- The Excel source file has been uploaded in a lakehouse not attached to the notebook.
- A specific range in an Excel spreadsheet must be loaded into a dataframe for processing.
- The processed dataframe must be saved in a delta table in another lakehouse not attached to the notebook.
Strategy:
Use the Pandas Python library to read the source Excel file content.
Challenges:
The Pandas library (as well as other binary I/O activities in Python, like copy an Excel File from Sharepoint to lakehouse, but this is another topic) can't manage absolute lakehouse "abfss" file paths, only locally mounted filesystem paths.
Solution:
Details:
import notebookutils as nu # required to mount lakehouse filesystem for Pandas
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() # you can omit this row since the shell automatically creates the session in the variable spark for users.
# Source parameters
src_ws_id = 'any_workspace_id1'
src_lh_id = 'any_lakhouse_id1'
src_folder = 'any_subfolder' # (in the form 'subdir/.../subdir' or empty string '')
src_file = 'any_Excel_file_name'
# Destination parameters
dst_ws_id = 'any_workspace_id2'
dst_lh_id = 'any_lakhouse_id2'
dst_table = 'any_table_name'
def mount_lakehouse(workspace_id, lakehouse_id, mount_point='/mnt'):
"""
Mount a remote lakehouse filesystem to the local filesystem of the notebook.
:param workspace_id: ID of the lakehouse workspace.
:param lakehouse_id: ID of the lakehouse.
:param mount_point: Mount point in the local filesystem (optional, default is '/mnt').
:return: The mount point in the local filesystem.
"""
try:
# Construct the URL of the remote filesystem
remote_url = f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}'
# Mount the remote filesystem to the local filesystem
nu.fs.mount(remote_url, mount_point)
except Exception as e:
# Handle exceptions and exit the notebook in case of an error
exit_value = f'Error type: {type(e).__name__}\nError description: {e}'
nu.notebook.exit(exit_value)
else:
print('Lakehouse filesystem successfully mounted!')
return mount_point
def unmount_lakehouse(mount_point='/mnt'):
"""
Unmount a lakehouse filesystem from the local filesystem.
:param mount_point: Mount point in the local filesystem (optional, default is '/mnt').
:return: Result of the unmount operation.
"""
try:
# Unmount the filesystem from the local mount point
result = nu.fs.unmount(mount_point)
except Exception as e:
# Handle exceptions and exit the notebook in case of an error
exit_value = f'Error type: {type(e).__name__}\nError description: {e}'
nu.notebook.exit(exit_value)
else:
print('Lakehouse filesystem successfully unmounted!')
return result
# Mount source lakehouse (because Pandas can't read abfss absolute paths)
src_mount_point = mount_lakehouse(src_ws_id, src_lh_id, src_mount_point)
# Get full local fs path for source lakehouse
src_path = f'{nu.fs.getMountPath(src_mount_point)}/Files/{src_folder}/{src_file}'
# Read an Excel file into a Pandas dataframe
# in this case I read the first 4 columns in the second sheet skipping the first 4 rows and using the first row (row n#5 because 4 are skipped) as header
df = pd.read_excel(f'{src_path}', sheet_name = 2, skiprows = 4, usecols = 'A:D')
# at this point do what you have to do with this dataframe
# and then...
# Convert pandas df in spark df
spark_df = spark.createDataFrame(df)
# Construct the URL of the remote filesystem
delta_table_abfss_path = f'abfss://{dst_ws_id}@onelake.dfs.fabric.microsoft.com/{dst_lh_id}/Tables/{dst_table}'
# Write the Spark dataframe in a delta table
spark_df.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(delta_table_abfss_path)
# Exit notebook reporting success
exit_value = f'The notebook succeeded!'
nu.notebook.exit(exit_value)
The above solution is a general-purpose solution working in any situation, please make only sure you have permission to access the source and destination lakehouses for write operations.
I hope thi may help
Hi @poojitha_yeluru,
In this thread I have seen somebody use Pandas library for that:
Below is an example
import pandas as pd
df = pd.read_excel("abfss://<id>@onelake.dfs.fabric.microsoft.com/<id>/Files/Excel/TestBook.xlsx")
display(df)
Pandas can't read from absolute abfss:// paths.
In fact, in your screenshot we read:
I hope this may help
Hi @Expiscornovus ,
Thanks for sharing , I had already gone to above way but pandas is a costly operation. In databricks we will install library ie. (com.crealytics:spark-excel_2.12:0.13.5), is there any way like to install a library in fabric.So that we can read excel file..
Hi @poojitha_yeluru,
Yes, it should be possible to install libraries into Fabric as well. You can manage them via an environment, https://learn.microsoft.com/en-us/fabric/data-engineering/environment-manage-library
Hi @Expiscornovus ,
Yes From public library we have to select PYPI and we have to use xlrd it is also pointing to pandas. Is there any other way.
Hi @poojitha_yeluru
As mentioned by @Expiscornovus we can use Environments in Fabric. and also Pandas library to read an excel file. These are the available options for now.
Please let me know if you have any further questions.
I want to add to this by saying that installing libraries as .jar files for Spark use is not supported via Environments. So the previously mentioned library to read excel as Spark is not possible via Environment.
Manage Apache Spark libraries - Microsoft Fabric | Microsoft Learn
Starting December 3, join live sessions with database experts and the Fabric product team to learn just how easy it is to get started.
Check out the November 2024 Fabric update to learn about new features.
User | Count |
---|---|
16 | |
12 | |
9 | |
9 | |
4 |