In this new post of our ongoing series, we'll explore setting up Azure Cosmos DB for NoSQL, leveraging the Vector Search capabilities of AI Search Services through Microsoft Fabric's Lakehouse features. Additionally, we'll explore the integration of Cosmos DB Mirror, highlighting the seamless integration with Microsoft Fabric. It's important to note that this approach harnesses the search services' capabilities, with Python coding facilitated through Lakehouse. This is just one of the myriad possibilities available within Fabric, particularly useful if your data resides in Cosmos DB and you wish to utilize Fabric's integration capabilities for search or data mirroring. Whether it's for search enhancement or data replication, Fabric stands ready for integration, offering flexibility and efficiency.
For Azure Cosmos DB for NoSQL specifically, configuring Vector Search involves Azure OpenAI and the Azure AI Search (formerly Cognitive Search) service.
From the search service you will need the Api_key and the Service Endpoint. Ref: Connect using API keys – Azure Cognitive Search | Microsoft Learn
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
3. Also, look for Keys inside your Cosmos DB account and copy the Primary Key into a notepad, as Fig 2 - Keys shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
4. With the information gathered above, let's create the container from a notebook in the Fabric Lakehouse. Alternatively, you can create the container through the Cosmos DB UI.
%pip install azure-cosmos
from azure.cosmos import exceptions, CosmosClient, PartitionKey

cosmos_db_api_endpoint = "COPY THE URI HERE"
cosmos_db_api_key = "COPY THE KEY HERE"
database_name = "Vector_DB"  # this is your database name
text_table_name = "text_sample"  # this is your container name

# Initialize the Cosmos DB client
client = CosmosClient(cosmos_db_api_endpoint, credential=cosmos_db_api_key)
database = client.create_database_if_not_exists(id=database_name)

# Create the container, partitioned on /id, if it does not exist yet
try:
    container = database.create_container_if_not_exists(
        id=text_table_name,
        partition_key=PartitionKey(path="/id"),
    )
    print(f"Container {text_table_name} created successfully")
except Exception as e:
    print(f"Error: {e}")
5. Upload the data into Cosmos DB.
Data: azure-vector-database-samples/code_samples/data/text/product_docs_embeddings.json at main · Azure-Sa...
When it comes to inserting or uploading the data, you are free to choose your preferred method. The repositories I mentioned earlier provide Python examples, and the Microsoft documentation offers some Bash examples as well. To keep things simple, I'll insert the embedding file directly from OneLake / Fabric Lakehouse.
import pandas as pd

cosmosdb_container_name = text_table_name
container = database.get_container_client(cosmosdb_container_name)

# Read data from the JSON file
text_df = pd.read_json('/API PATH/product_docs_embeddings.json')
records = text_df.to_dict(orient='records')

# Iterate through the data and insert the items with their embeddings into the container
for item in records:
    try:
        # Mark the item with the Azure AI Search upload action
        item['@search.action'] = 'upload'
        # Convert the 'id' attribute to a string
        item['id'] = str(item['id'])
        # Insert the item into the container
        container.create_item(body=item)
    except exceptions.CosmosResourceExistsError as e:
        print(f"Document with ID {item['id']} already exists in {cosmosdb_container_name}...")
        print(f"Error: {e}")
    except Exception as e:
        # Handle other exceptions
        print(f"Error: {e}")
print(f"Data items inserted into the Cosmos DB container {cosmosdb_container_name}")
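Before uploading, it can help to check the records first. The following is only a minimal sketch, not part of the original sample: the helper name prepare_items is hypothetical, and it assumes each record carries title_vector and content_vector lists with 1536 dimensions (the size produced by text-embedding-ada-002). It converts the id to a string, since Cosmos DB expects string ids, and skips records whose vectors are missing or the wrong length.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Assumption: text-embedding-ada-002 embeddings have 1536 dimensions.
EXPECTED_DIM = 1536
VECTOR_FIELDS = ("title_vector", "content_vector")


def prepare_items(records: Iterable[Dict[str, Any]],
                  expected_dim: int = EXPECTED_DIM,
                  vector_fields: Sequence[str] = VECTOR_FIELDS) -> List[Dict[str, Any]]:
    """Return copies of the records that look safe to insert (hypothetical helper)."""
    prepared = []
    for record in records:
        item = dict(record)  # shallow copy so the caller's data is not mutated
        item["id"] = str(item["id"])  # Cosmos DB ids must be strings
        vectors_ok = all(
            isinstance(item.get(field), list) and len(item[field]) == expected_dim
            for field in vector_fields
        )
        if vectors_ok:
            prepared.append(item)
        else:
            print(f"Skipping item {item['id']}: missing or malformed vector")
    return prepared


# Tiny illustrative records with 2-dimensional vectors
sample = [
    {"id": 1, "title_vector": [0.1, 0.2], "content_vector": [0.3, 0.4]},
    {"id": 2, "title_vector": [0.1], "content_vector": [0.3, 0.4]},
]
ready = prepare_items(sample, expected_dim=2)
print([item["id"] for item in ready])
```

The list returned by prepare_items could then be passed to the insert loop in place of records.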
6. Let's use Azure AI services for the Search. Note: Vector database - Azure Cosmos DB | Microsoft Learn
Create the DataSource
First, let's create the DataSource for Azure Cosmos DB, using the Search Service in the Azure Portal, as Fig 3 - Datasource shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
Connection string for my example database, Vector_DB: "AccountEndpoint=URI;AccountKey=YOURKEY==;Database=Vector_DB;"
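If you prefer to assemble that connection string in a notebook rather than typing it by hand, a small sketch could look like the one below. Both function names are hypothetical, and the endpoint and key values are placeholders. The parser splits each pair on the first "=" only, because account keys usually end in "=" padding.

```python
from typing import Dict


def build_cosmos_connection_string(endpoint: str, key: str, database: str) -> str:
    """Assemble the data source connection string in the format shown above."""
    if not (endpoint and key and database):
        raise ValueError("endpoint, key and database are all required")
    return f"AccountEndpoint={endpoint};AccountKey={key};Database={database};"


def parse_connection_string(connection_string: str) -> Dict[str, str]:
    """Split the string back into its parts (first '=' only, keys may contain '=')."""
    parts = {}
    for pair in connection_string.split(";"):
        if pair:
            name, _, value = pair.partition("=")
            parts[name] = value
    return parts


# Placeholder values for illustration only
conn = build_cosmos_connection_string(
    "https://myaccount.documents.azure.com:443/", "YOURKEY==", "Vector_DB"
)
print(conn)
print(parse_connection_string(conn)["Database"])
```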
Create the index
As shown in Figures 4 and 5 (Index and Fields, respectively), let's continue within the Search Service interface. Utilizing the UI, we'll configure the index. Since this process is performed via the UI, you'll need to add each field individually.
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
Please note that title_vector and content_vector require one extra step: creating the vector profile, as Fig 6 - profile shows:
Note:
We are using HNSW. "Hierarchical Navigable Small World (HNSW): HNSW is a leading ANN algorithm optimized for high-recall, low-latency applications where data distribution is unknown or can change frequently." Ref: VectorSearch
About the Vector Size: "For each vector field, Azure AI Search constructs an internal vector index using the algorithm parameters specified on the field. Each vector is usually an array of single-precision floating-point numbers, in a field of type Collection(Edm.Single)." Ref: Vector Size
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
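To make the vector size note more concrete, here is a rough back-of-the-envelope sketch. It only counts the raw vector data (4 bytes per single-precision float) and deliberately ignores any overhead from the HNSW structure or the service itself, so treat it as a lower bound. The document count is a made-up example, and 1536 is the dimension of text-embedding-ada-002 embeddings.

```python
BYTES_PER_SINGLE = 4  # one single-precision (32-bit) float


def raw_vector_bytes(num_documents: int, dimensions: int, vector_fields: int = 1) -> int:
    """Raw size of the vector data only, excluding any index overhead."""
    return num_documents * dimensions * BYTES_PER_SINGLE * vector_fields


# Hypothetical example: 1,000 documents, two vector fields (title_vector, content_vector)
size = raw_vector_bytes(num_documents=1000, dimensions=1536, vector_fields=2)
print(f"{size} bytes = {size / 1024**2:.2f} MiB")
```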
Once the fields are created, as in Fig 5 - Fields (above), just define a name for the index and hit the Create button.
Create the indexer
Still inside the Search Service, create the indexer using the DataSource and the index created previously, then save and run it, as Fig 7 - indexer shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
For more information on how the Vector Search service works, see: VectorSearch Works
"On the indexing side, Azure AI Search takes vector embeddings and uses a nearest neighbors algorithm to place similar vectors close together in an index. Internally, it creates vector indexes for each vector field."
So, all the configuration is done, now let's Search!!
Libraries:
%pip install azure-cosmos openai --upgrade azure-search-documents==11.4.0
import json
import datetime
import time
from ast import literal_eval
from typing import List

import numpy as np
import pandas as pd
import openai
from azure.core.exceptions import AzureError
from azure.core.credentials import AzureKeyCredential
from azure.cosmos import exceptions, CosmosClient, PartitionKey
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType,
)
Functions for the vector search:
def get_embedding(text, model="text-embedding-ada-002"):
    # Uses the AzureOpenAI client created in the search script
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def cosine_similarity(a, b):
    # Convert the inputs to numpy arrays
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Return 0.0 for empty or all-zero vectors to avoid dividing by zero
    if np.all(a == 0) or np.all(b == 0):
        return 0.0
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity
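To see what the cosine similarity is used for, here is a small self-contained sketch of an exact (brute-force) nearest-neighbor search over a toy set of vectors. It is written with the standard library only so it runs anywhere; the toy vectors and the exact_nearest helper are illustrative assumptions, not part of the sample repository. An ANN algorithm such as HNSW approximates this kind of ranking without comparing the query against every vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_nearest(query: Sequence[float],
                  vectors: Dict[str, Sequence[float]],
                  k: int = 2) -> List[Tuple[str, float]]:
    """Score every vector against the query and keep the k most similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" for illustration only
toy_index = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
top = exact_nearest([1.0, 0.1], toy_index, k=2)
print([doc_id for doc_id, _ in top])
```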
Initialize the Connection:
database_name = "Vector_DB"
text_table_name = "YOUR CONTAINER NAME"  # mine is text_sample3
cosmos_db_api_endpoint = "URI"
cosmos_db_api_key = "YOUR KEY"

# Configure Azure Cognitive Search
cog_search_endpoint = "https://YOURSERVICENAME.search.windows.net"
cog_search_key = "KEY of your service"
index_name = "YOUR Index Name"  # my example is index_textsample3
credential = AzureKeyCredential(str(cog_search_key))

# Azure OpenAI settings, passed to the AzureOpenAI client in the search script
openai.api_type = "azure"
openai.api_key = "YOUR open AI Key"
openai.api_base = "https://YOUROpenAIService.openai.azure.com/"

# Connect to Cosmos DB
cosmos_client = CosmosClient(cosmos_db_api_endpoint, cosmos_db_api_key)
database = cosmos_client.get_database_client(database_name)
Script for the Search:
from openai import AzureOpenAI

container_name = text_table_name
# Assumption: this is the name of your Azure OpenAI embedding deployment
model = "text-embedding-ada-002"

client = AzureOpenAI(
    api_key=openai.api_key,
    api_version="2023-05-15",
    azure_endpoint=openai.api_base,
)
container = database.get_container_client(container_name)
search_client = SearchClient(cog_search_endpoint, index_name, credential)

query = "tools for software development"  # example
query_vector = get_embedding(query, model=model)

# Perform Azure Cognitive Search query
search_results = search_client.search(
    search_text=query,
    select=["title", "content", "category", "title_vector", "content_vector"],
)
for result in search_results:
    result_vector = result.get("content_vector", None)
    if result_vector is not None and len(result_vector) > 0:
        similarity_score = cosine_similarity(query_vector, result_vector)
        print(f"Title: {result['title']}")
        print(f"Score: {result['@search.score']}")
        print(f"Content: {result['content']}")
        print(f"Category: {result['category']}")
        print(f"Cosine Similarity: {similarity_score}\n")
    else:
        print("Skipping result with empty or missing vector.\n")
Results - Fig 8- Search:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
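The script above sends the query as text and computes the cosine similarity on the client side. As a complementary sketch, the snippet below builds (but does not send) a request that asks the search service itself to run a vector query on content_vector. It assumes the Search Documents REST API with api-version 2023-11-01 and its vectorQueries body; the helper name, the index name and the three-element vector are placeholders, and in practice the vector would be the 1536-dimension embedding returned by get_embedding.

```python
import json
import urllib.request
from typing import Sequence


def build_vector_search_request(endpoint: str, index_name: str, api_key: str,
                                query_vector: Sequence[float], k: int = 3,
                                vector_field: str = "content_vector",
                                select: Sequence[str] = ("title", "content", "category"),
                                api_version: str = "2023-11-01") -> urllib.request.Request:
    """Build a POST request for a pure vector query (hypothetical helper)."""
    url = f"{endpoint.rstrip('/')}/indexes/{index_name}/docs/search?api-version={api_version}"
    body = {
        "select": ",".join(select),
        "vectorQueries": [
            {"kind": "vector", "vector": list(query_vector), "fields": vector_field, "k": k}
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder values for illustration only
request = build_vector_search_request(
    "https://YOURSERVICENAME.search.windows.net",
    "index_textsample3",
    "KEY of your service",
    query_vector=[0.1, 0.2, 0.3],
)
print(request.full_url)
# To actually send it: json.load(urllib.request.urlopen(request))["value"]
```

Building the request separately from sending it makes it easy to inspect the payload before spending a call against the service.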
As for Vector Search, if you are interested, I encourage you to check the repositories I mentioned at the beginning of this post, where you will see the many options and implementations. The Python code can be reused inside the Lakehouse with a few changes.
Our earlier example showed how to build Vector Search using the AI Search service with Cosmos DB for NoSQL, running the Python code in the Lakehouse instead of as standalone Python scripts. Now, let's explore another option: mirroring the database into Fabric. Once mirroring is set up, shortcuts pointing to the mirror can be created across Microsoft Fabric workspaces. Furthermore, the SQL Endpoint can be used to write queries: you can run T-SQL commands that query data objects but cannot modify the data in the SQL Endpoint, as it's a read-only copy.
Note: Mirrors can be stopped at any given time.
Review the doc to understand the solution: Microsoft Fabric mirrored databases from Azure Cosmos DB (Preview) - Microsoft Fabric | Microsoft Le...
1 - Inside Fabric - Choose Mirror Azure Cosmos DB.
As Fig 9 - Cosmos DB option illustrates:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
2 - Name the Mirror that will be created, as Fig 10 - Name mirror, shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
3 - Choose Cosmos DB for NoSQL, currently in preview, as Fig 11 - Cosmos option shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
4 - Inside the Azure Portal, look for your Cosmos DB for NoSQL account, open it, and copy the URI into a notepad, as Fig 12 - URI shows. Mine, for example, is https://lilem.documents.azure.com:443/. (This is the same step I did for Vector Search.)
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
5 - Look for Keys inside your Cosmos DB account and copy the Primary Key into the notepad, as Fig 13 - key shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
6 - Use the information you copied earlier in step 4 and step 5 and input it into their respective fields as shown in Figure 14 - Mirror Fields.
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
7 - Next connect -> Select the database and start to mirror:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
There are some preliminary steps missing in the mirror configuration. The error message indicates: "The database cannot be mirrored to Fabric due to the following error: Continuous backup must be enabled before you mirror an Azure Cosmos DB database to Fabric. Please enable 7-day or 30-day continuous backup on your Azure Cosmos DB account from the Azure portal."
Therefore, before proceeding with the mirror setup, ensure that continuous backup is enabled on your Azure Cosmos DB account for either 7-day or 30-day retention period via the Azure portal.
According to the docs: Microsoft Fabric mirrored databases from Azure Cosmos DB (Preview) - Microsoft Fabric | Microsoft Le... "When you enable mirroring on your Azure Cosmos DB database, inserts, update, and delete operations on your online transaction processing (OLTP) data continuously replicates into Fabric OneLake for analytics consumption. The continuous backup feature is a prerequisite for mirroring."
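If you want to check this prerequisite before starting the mirror, one option is to inspect the account description as JSON. The sketch below is only an illustration under an assumption: that the exported account JSON (for example, the output of the Azure CLI command az cosmosdb show) exposes a backupPolicy object with a type of "Periodic" or "Continuous". Please verify the field names against your own output; the helper name and the sample documents are hypothetical.

```python
import json


def continuous_backup_enabled(account: dict) -> bool:
    """True if the account description reports a continuous backup policy.

    Assumption: the policy is exposed as account["backupPolicy"]["type"].
    """
    policy = account.get("backupPolicy") or {}
    return str(policy.get("type", "")).lower() == "continuous"


# Hypothetical, trimmed account descriptions for illustration only
periodic = json.loads('{"name": "my-account", "backupPolicy": {"type": "Periodic"}}')
continuous = json.loads('{"name": "my-account", "backupPolicy": {"type": "Continuous"}}')

for account in (periodic, continuous):
    status = "ready to mirror" if continuous_backup_enabled(account) else "enable continuous backup first"
    print(f"{account['backupPolicy']['type']}: {status}")
```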
So, let's fix!!
Reopen the Azure Portal for Cosmos DB, locate the database you intend to mirror, and navigate to the Backup and Restore section. Select the continuous backup option, as indicated in the message. Refer to Figure 16 - Continuous, which illustrates this configuration.
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
After making this change, please wait for a moment. You'll notice that the Point in Time Restore option mentioned in the documentation (Migrate an Azure Cosmos DB account from periodic to continuous backup mode | Microsoft Learn) will become available. If you select this option, you'll see a message stating, "The Backup Policy is migrating," as shown in Figure 17 - Policy. Hence, while the migration is in progress, please wait until it's completed before attempting to restart the mirror in Fabric.
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
Once the backup policy has finished migrating, you can go back to Fabric and hit the Mirror button, as Fig 18 - Mirror shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
8 - Now you can query Cosmos DB for No SQL from the SQL Endpoint, as Fig 19 - CosmosSQL shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
9 - You could even open Cosmos from SSMS by connecting to the SQL Endpoint. Copy the SQL Connection string as Fig 20- Endpoint and open SSMS as Fig 21 -SSMS:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
And, as mentioned before, shortcuts can be created from Lakehouses in different workspaces (given the right permissions) to access the Cosmos DB mirrored data, as Fig 22 - Shortcuts shows:
Fabric_Change_the_Game_Embracing_Azure_Cosmos_DB_for_NoSQL
This post explores the diverse options available when integrating Cosmos DB for NoSQL with Microsoft Fabric. It has delved into configuring Azure Cosmos DB for NoSQL with the Vector Search services, leveraging Microsoft Fabric's Lakehouse capabilities. We've also explored the Cosmos DB Mirror and how it integrates with Microsoft Fabric. It's essential to recognize that this approach relies on the search services' capabilities, with the Python code running through the Lakehouse. This represents just one of the many possibilities within Fabric, particularly beneficial if your data resides in Cosmos DB and you want to use Fabric's integration capabilities for search or data mirroring. Whether it's for enhancing search functionality or replicating data, Fabric offers a versatile and efficient integration solution.