Someleze

Achieve an End-to-End Data Analysis for your data in Microsoft Fabric (Part 1)

Data plays an important role in decision making, and organizations that can harness its power can gain a competitive edge in the market. However, data alone is not enough. To extract value from data, organizations need to build end-to-end data analysis pipelines that can handle the entire data lifecycle, from data ingestion, cleaning, and transformation through to visualization and analysis. In this blog post, we will explore why end-to-end data analysis is important for organizations and how it can help them achieve their goals.

 

A Scenario of End-to-End Data Analysis for Contoso Cuisines

Contoso Cuisines, a fictional company that sells food products around the world, wants to analyze their revenue performance based on their different products, the performance of their employees, and the impact of the number of days to ship a product on the revenue. To do this, they need to build an end-to-end data analysis pipeline that can handle the following steps:

  1. Data Ingestion & Transformation
  2. Data Cleaning
  3. Visualization

By following these steps, Contoso Cuisines can leverage their data to gain a better understanding of their business performance and identify opportunities for improvement or growth.

1 – Data Ingestion & Transformation

The foundation of Microsoft Fabric is the Lakehouse, which is built on top of the OneLake scalable storage layer and uses Apache Spark and SQL compute engines for big data processing. A Lakehouse is a unified platform that combines the flexible, scalable storage of a data lake with the querying and analysis capabilities of a data warehouse.

  • In the Lakehouse explorer, you will see options to load data into the Lakehouse. Select New Dataflow Gen2. If you haven’t created a Lakehouse yet, follow these steps to create one.
  • On the new dataflow pane, select Get data from another source.
  • Search for the OData source type using the search input on the Choose data source pop-up.
  • Next, specify the URL your data is coming from. Enter this URL: https://services.odata.org/V4/Northwind/Northwind.svc/ and then select Next.
  • On the next page you will see a list of tables you can use. For this solution, choose these tables: Order_Details, Categories and Suppliers, then select Create.
  • In your Dataflow, switch to the diagram view to see your chosen tables. Once done, select the Categories table, rename it to Products, and remove the Picture column. Then select the Extend button next to the Products column and choose these columns:
    • ProductID
    • ProductName
    • SupplierID
    • UnitPrice
    • UnitsInStock
    • UnitsOnOrder

       

      Note: Make sure that the Data Destination is set to your Lakehouse for all your tables/queries as this will send the data into your Lakehouse once you publish the Dataflow.

  • Select the Order_Details table and extend Order by selecting the Extend button. Choose the following columns:
    • OrderDate (transform this to the Date data type)
    • ShippedDate (transform this to the Date data type)
    • Customer
    • Employee
  • Then extend Customer by selecting the Extend button and choose the following columns:
    • CompanyName
    • ContactName
    • ContactTitle
    • City
    • Country
  • Then extend Employee by selecting the Extend button and choose the following columns:
    • LastName
    • FirstName
    • Title
  • Lastly, extend Shipper by selecting the Extend button, choose the following column, and rename it to Shipper_CompanyName:
    • CompanyName
  • Remove the Product column in the Order_Details table.
  • The company wants to calculate the revenue generated from the orders it has received from customers. To achieve this, add these custom columns and make sure they are rounded off (where needed) and of the right data types (a PySpark sketch of the same calculations follows this list):
    • Gross Revenue – calculate this by multiplying UnitPrice by Quantity
    • Discount Value – calculate this by multiplying the Gross Revenue by the Discount column
    • Net Revenue – calculate this by getting the difference between the Gross Revenue and the Discount Value
    • Days to Ship – calculate this by getting the difference between the ShippedDate and the OrderDate
    • Month – use the expression Date.MonthName([OrderDate]) to get the month name from the OrderDate
  • Select the Suppliers table, keep only these columns: SupplierID, CompanyName, ContactName, ContactTitle, City and Country, and remove the rest.
  • Once done, select Publish to publish your data tables to your Lakehouse.
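
If you would like to sanity-check these calculations in code, below is a minimal PySpark sketch of the same column logic. It assumes the extended Order_Details data is already available as a Spark DataFrame named df_orders with UnitPrice, Quantity, Discount, OrderDate and ShippedDate columns (the name df_orders is illustrative; adjust it to match your own query, and note the sketch assumes Discount is stored as a fraction).

# Minimal sketch of the revenue calculations (assumes a DataFrame named df_orders
# with UnitPrice, Quantity, Discount, OrderDate and ShippedDate columns)
from pyspark.sql import functions as F

df_orders = (
    df_orders
    .withColumn("Gross_Revenue", F.round(F.col("UnitPrice") * F.col("Quantity"), 2))
    .withColumn("Discount_Value", F.round(F.col("Gross_Revenue") * F.col("Discount"), 2))
    .withColumn("Net_Revenue", F.round(F.col("Gross_Revenue") - F.col("Discount_Value"), 2))
    .withColumn("Days_to_Ship", F.datediff(F.col("ShippedDate"), F.col("OrderDate")))
    .withColumn("Month", F.date_format(F.col("OrderDate"), "MMMM"))
)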

2 – Cleaning your Data

Now that you have ingested your data, you need to do a bit of cleaning before you visualize it, so that your analysis for this scenario is accurate.

To achieve this, you can use Notebooks by following these steps:

  • Navigate to your Lakehouse and select Open Notebook then choose New Notebook and select the Open button.
  • Once your Notebook is open, load the Order_Details table into a DataFrame using the script below:

 

 

# Load the full Order_Details table from your Lakehouse into a Spark DataFrame
# (no LIMIT clause, so every row is cleaned and written back later)
df_order_details = spark.sql("SELECT * FROM northwind_lkh.Order_Details")

 

 

  • Before you clean your data, double-check whether it needs cleaning, for example whether it has any missing values or duplicates. You can display a summary of your newly created DataFrame by running the script below in a new code cell.

 

 

# Display Dataframe summary
display(df_order_details, summary=True)

 

 

  • Notice that the ShippedDate and Days_to_Ship columns have missing values. You can remove any rows that have missing values by running the below script in a new code cell.

 

 

df_order_details_pandas = df_order_details.toPandas()

def clean_data(df_order_details_pandas):
    # Drop rows with missing data in column: 'ShippedDate'
    df_order_details_pandas = df_order_details_pandas.dropna(subset=['ShippedDate'])
    return df_order_details_pandas

df_order_details_clean = clean_data(df_order_details_pandas.copy())
display(df_order_details_clean)

 

 

Note: Dropping the rows with a missing ShippedDate automatically removes the missing values for Days to Ship as well, because Days_to_Ship is derived from ShippedDate. If you prefer to avoid the pandas conversion, a Spark-only alternative is sketched below.
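
Below is a minimal Spark-only sketch of the same cleaning step, assuming the df_order_details DataFrame from the earlier cell. It avoids the pandas round trip, which can matter for larger tables.

# Spark-only alternative: drop rows with a missing ShippedDate without converting to pandas
df_order_details_clean_spark = df_order_details.dropna(subset=["ShippedDate"])
display(df_order_details_clean_spark)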

  • Once that’s done, you can use the cleaned DataFrame to create a new table in your Lakehouse. You can achieve this by running this script in a new code cell.

 

 

table_name = "Order_Details_clean"

# Convert the cleaned pandas DataFrame back to a Spark DataFrame and save it as a Delta table
sparkDF = spark.createDataFrame(df_order_details_clean)
sparkDF.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")

 

 

  • Once that’s done, refresh the list of tables in your Lakehouse and you will see the newly created table. You can also verify it from the notebook, as shown in the optional sketch below.
  • Save your Notebook as Data Cleaning.
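
As an optional check, here is a minimal sketch that reads the new Delta table back from the Lakehouse and counts its rows (it assumes the Lakehouse used above is attached as the notebook's default Lakehouse):

# Optional: read the cleaned Delta table back and confirm the row count
df_check = spark.read.format("delta").load("Tables/Order_Details_clean")
print(df_check.count())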

3 – Ready your data for analysis

An additional requirement for this scenario is to ensure that the company is able to see the revenue generated per order made.

To achieve this, you will need measures which you will add by following these steps:

  • Navigate to your Lakehouse, and at the top right corner of the screen switch from Lakehouse to SQL analytics endpoint.
  • Once done, select the Order_Details_clean table and add the following measures (a PySpark sketch of the same math follows the note below):
    • Orders – a distinct count of orders, based on OrderID
    • Net Revenue per Order – the sum of Net Revenue divided by the number of orders made (hint: use the Orders measure), rounded off to 2 decimal places
  • Once done, you need to create relationships between the tables so that building visualizations goes smoothly. Relate the tables by dragging and dropping the common ID columns between them:
    • ProductID – in Order_Details_clean and in Products (many to one)
    • SupplierID – in Suppliers and Products (one to many)

Note: You have to switch from the Data view to the Model view to achieve this.
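
For intuition, this is a minimal PySpark sketch of the same measure math run against the cleaned table. The actual measures are defined in the semantic model, not in a notebook, and the column names below assume the space-to-underscore naming seen earlier (e.g. Net_Revenue); adjust them to match your own table.

# Sketch of the measure logic: distinct order count and net revenue per order
from pyspark.sql import functions as F

df = spark.table("northwind_lkh.Order_Details_clean")
orders = df.select("OrderID").distinct().count()
net_revenue_per_order = round(df.agg(F.sum("Net_Revenue")).first()[0] / orders, 2)
print(orders, net_revenue_per_order)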

  • Once done, on the top navigation bar select Reporting and then select New semantic model and configure the following:
    • Name – your chosen name
    • Workspace – leave as is
    • Objects/tables – select Products, Suppliers and Order_Details_clean


 

  • Done! Creating this semantic model will allow you to access your tables and data in Power BI, where you will create a new dashboard for this scenario.

 

End-to-end data analysis can help organizations achieve their strategic goals, such as increasing revenue, reducing costs, improving efficiency, enhancing customer satisfaction, and innovating new products and services. By building end-to-end data analysis pipelines, organizations can treat their data as a strategic asset and a source of competitive advantage.

Stay tuned for Part 2!!

Homework!!

Now that you have gone through an example of building a simple end-to-end solution with Fabric, you’re tasked with:

  1. Using the same dataset to train machine learning models with Data Science in Microsoft Fabric to predict future sales performance and customer behavior.
  2. Following the same solution structure with your own dataset this time, and tagging me on LinkedIn to share your solution!
  3. Getting certified with the Microsoft Fabric Analytics Engineer (DP-600) exam.