This time we’re going bigger than ever. Fabric, Power BI, SQL, AI and more. We're covering it all. You won't want to miss it.
Learn moreDid you hear? There's a new SQL AI Developer certification (DP-800). Start preparing now and be one of the first to get certified. Register now
Ever wondered how cutting-edge technology can transform the way we explore and understand vast datasets? Join me on a journey through the code as we harness the power of OpenAI to unlock insights from a rich dataset of the city of London.
Our exploration begins with the London Workplace Zone Classification dataset information available on London Datastore. To make our exploration seamless, I've converted the original XLS file to a CSV format, paving the way for easier processing and analysis.
This post is based on research that I was doing and it uses the following as a reference:
Azure OpenAI Service embeddings tutorial - Azure OpenAI | Microsoft Learn
Azure OpenAI Service embeddings - Azure OpenAI - embeddings and cosine similarity | Microsoft Learn
Azure-OpenAI-Docs-Samples-embedding_billsum(github.com)
Follow the docs to proceed with the request:
What is Azure OpenAI Service? - Azure AI services | Microsoft Learn
Request Access to Azure OpenAI Service (microsoft.com)
Azure OpenAI Service embeddings tutorial - Azure OpenAI | Microsoft Learn
Api_key and the Service Endpoint. Docs for more information: Connect using API keys - Azure Cognitive Search | Microsoft Learn
Fabric_Change_the_Game_strong_Unleashing_the_Power_of_Microsoft_Fabric_and_OpenA
4. Still inside of the Azure Open AI services in the Azure Portal click on Model Deployments -> Manage Deployment. As Fig 2 - Manage Deployment, shows:
Fabric_Change_the_Game_strong_Unleashing_the_Power_of_Microsoft_Fabric_and_OpenA
5. Create a new deployment and choose the mode text-embedding-ada-002 version 2, this model refers to a specific language model provided by OpenAI for text embeddings. Fig 3 - Ada, shows:
Fabric_Change_the_Game_strong_Unleashing_the_Power_of_Microsoft_Fabric_and_OpenA
6. Fabric - Create a workspace, if you do not have one. Create a Lakehouse and create a new notebook. If you need more details on how to do it, follow the docs: Data warehouse tutorial - create a workspace - Microsoft Fabric | Microsoft Learn, Create a lakehouse with OneLake - Microsoft Fabric | Microsoft Learn, How to use notebooks - Microsoft Fabric | Microsoft Learn. Once you have done that, proceed to upload the dataset using the CSV format through the Onelake as Fig 4 - upload shows:
Fabric_Change_the_Game_strong_Unleashing_the_Power_of_Microsoft_Fabric_and_OpenA
7 - Inside Fabric Notebook, lets code - You will need to install the following libraries:
Code: python
============================================
%pip install openai --upgrade num2words matplotlib plotly scipy scikit-learn pandas
============================================
8 - Use the following configuration for the variables. Please note you will need the information that you collected at Step 3.
Code: python
=============================================
import openai
openai.api_key = "replace the API Key"
endpoint = "https://<your_service_name>.openai.azure.com/"
openai.api_type = "azure"
embedding_deployed_model = "text-embedding-ada-002"
===============================================
Cleaning the Dataset:
utils.embeddings_utils was removed, so the embeddings will be created with a customized function. This is taken from the docs as most of the code used for this example: EmbeddingsCode: python
===========================================================================
import re
def normalize_text(s, sep_token = " \n "):
s = re.sub(r'\s+', ' ', s).strip()
s = re.sub(r". ,","",s)
# remove all instances of multiple spaces
s = s.replace("..",".")
s = s.replace(". .",".")
s = s.replace("\n", "")
s = s.strip()
return s
===========================================================================
Generate Embeddings and Search functions:
Code: python
===========================================================================
import pandas as pd
import openai
from openai import AzureOpenAI
import numpy as np
from ast import literal_eval
def get_embedding(text, model="text-embedding-ada-002"):
return client.embeddings.create(input=[text], model=model).data[0].embedding
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def search_docs(df, user_query, top_n=4, to_print=True):
embedding = get_embedding(user_query, model="text-embedding-ada-002")
df['ada_embedding'] = df['ada_embedding'].apply(literal_eval)
df = df[df['ada_embedding'].apply(lambda x: isinstance(x, np.ndarray))]
# Calculate cosine similarity for each row in the DataFrame
df["similarities"] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
# Sort and get the top N results
res = df.sort_values("similarities", ascending=False).head(top_n)
if to_print:
display(res)
return res
===========================================================================
get_embedding and again use the same function to take an input and return the embedding vector from that. Follow this, use the cosine_similarity function to calculate the cosine between two vectors a and b.In other words: Cross the information using the embedding created in the file, with the text we are trying to filter in a vector format and use the help of cosine calculation to find similarities in between.
Note:
Fabric_Change_the_Game_strong_Unleashing_the_Power_of_Microsoft_Fabric_and_OpenA
Code: python
===========================================================================
client = AzureOpenAI( api_key = openai.api_key, api_version = "2023-05-15", azure_endpoint = endpoint )
input_datapath = "/lakehouse/default/Files/Files_CSV/LWZC Classification.csv" #File path, do not forget to replace by your own path
df = pd.read_csv(input_datapath, index_col=0, nrows=20) # Read only the first 20 rows
text_columns = ['LA name', 'Group description', 'Subgroup description']
df['combined'] = df[text_columns].astype(str).apply(lambda row: ' '.join(row), axis=1)
df['combined'] = df["combined"].apply(lambda x : normalize_text(x))
df['ada_embedding'] = df["combined"].apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('/lakehouse/default/Files/Files_CSV/output/LWZC Classification_embeddindgs_20rows.csv', index=False) #File path, do not forget to replace by your own path
===========================================================================
Code: python
===========================================================================
# Load data into pandas DataFrame from Onelake
df = pd.read_csv("/lakehouse/default/" + "Files/Files_CSV/output/LWZC Classification_embeddindgs_20rows.csv")
df['ada_embedding'] = df.ada_embedding.apply(eval).apply(np.array)
df.head(20)
===========================================================================
Fabric_Change_the_Game_strong_Unleashing_the_Power_of_Microsoft_Fabric_and_OpenA
12. Search -"Professional, retail & leisure in dynamic" and I want the dataset to be filtered accordingly. Hence, the function will convert my sentence into a vector format and the cosine similarity will find the similarity between the vector that was created in the dataset.
Code: python
===========================================================================
# Assuming df is your DataFrame and it is using 'ada_embedding' as my example
embedding = get_embedding("Professional, retail & leisure in dynamic", model="text-embedding-ada-002")
df["similarities"] = df['ada_embedding'].apply(lambda x: cosine_similarity(x, embedding))
res = df.sort_values("similarities", ascending=False).head(4)
# Display the results
res.head()
Follow the results - Fig 7 Result:
Fabric_Change_the_Game_strong_Unleashing_the_Power_of_Microsoft_Fabric_and_OpenA
===========================================================================
In summary, I utilized Microsoft Fabric to establish a connection with the OpenAI service. Through this connection, I generated embeddings, which serve as numerical representations of words, phrases, or documents within a continuous vector space. These embeddings are specifically crafted to capture semantic relationships between words, enhancing the capability of algorithms to comprehend and process textual data more efficiently. Furthermore, I conducted a search within a dataset using this concept, leveraging the embeddings to enhance the accuracy and relevance of the search results.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.