Hello!
I'm collecting my company's PBI metadata via the PBI API (Scan Jobs, to be precise).
Following the documentation, I call the PostWorkspaceInfo endpoint with these additional parameters:
getArtifactUsers=True&lineage=True&datasourceDetails=True&datasetSchema=True&datasetExpressions=True

SCAN_GETINFO_PARAMS = {
    "getArtifactUsers": "true",
    "lineage": "true",
    "datasourceDetails": "true",
    "datasetSchema": "true",
    "datasetExpressions": "true",
}
PBI_SCOPE = "https://analysis.windows.net/powerbi/api/.default"


class PowerBIOAuth2ClientCredentialsAuth(AuthBase):
    """HttpHook auth adapter around dlt OAuth2 client credentials."""

    def __init__(self, login: str, password: str) -> None:
        tenant_id = "..."
        client_id = "..."
        if not password:
            raise AirflowException("Power BI client secret is empty")
        self._auth = OAuth2ClientCredentials(
            access_token_url=f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
            client_id=client_id,
            client_secret=password,
            access_token_request_data={"scope": PBI_SCOPE},
        )

    def __call__(self, request: PreparedRequest) -> PreparedRequest:
        return self._auth(request)


def create_scan(
    batch_index: int,
    ch_conn_id: str,
    select_ids_sql: str,
    powerbi_conn_id: str,
    batch_size: int = 100,
) -> str:
    """Create a Power BI scan for one mapped workspace batch.

    Args:
        batch_index: Batch number produced by dynamic task mapping.
        ch_conn_id: Airflow ClickHouse connection id used to fetch workspace ids.
        select_ids_sql: SQL query that returns workspace ids and supports
            `limit` and `offset` parameters.
        powerbi_conn_id: Airflow HTTP connection id for Power BI API access.
        batch_size: Number of workspace ids requested from ClickHouse per batch.

    Returns:
        The created Power BI `scan_id`.

    Raises:
        AirflowException: If the batch is empty, API response is non-202, or the
            response payload contains an API error or misses the `id` field.
    """
    ...
    http = HttpHook(
        method="POST",
        http_conn_id=powerbi_conn_id,
        auth_type=PowerBIOAuth2ClientCredentialsAuth,
    )
    response = http.run(
        endpoint=f"v1.0/myorg/admin/workspaces/getInfo?{urlencode(SCAN_GETINFO_PARAMS)}",
        data=json.dumps({"workspaces": chunk}),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        extra_options={"check_response": False},
    )
    body: dict[str, Any] = response.json()
    if body.get("error"):
        raise AirflowException(f"getInfo API error for batch {batch_index}: {body['error']}")
    scan_id = body.get("id")
    if not scan_id:
        raise AirflowException(f"getInfo response missing `id`: {body}")
    logging.info("Created scan %s for batch %s (%s workspaces)", scan_id, batch_index, len(chunk))
    return scan_id

The POST request is shown below (I tried both 'true' and "True", with no change):
INFO - Calling Power BI getInfo endpoint for batch 0: v1.0/myorg/admin/workspaces/getInfo?getArtifactUsers=True&lineage=True&datasourceDetails=True&datasetSchema=True&datasetExpressions=True

In the response to GET GetScanResult, I expected a "datasourceInstances" list that I could use later instead of calling Datasets GetDatasourcesAsAdmin for exactly the same information, but I didn't get this list. The only related data I have is the "datasourceInstanceId" in the "datasourceUsages" list of each entry in the "datasets" list.
"datasets": [
  {
    "id": "",
    "name": "",
    "tables": [],
    "configuredBy": "",
    "configuredById": "",
    "isEffectiveIdentityRequired": false,
    "isEffectiveIdentityRolesRequired": false,
    "refreshSchedule": {
      "days": [
        "Sunday",
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday"
      ],
      "times": [
        "05:30"
      ],
      "enabled": true,
      "localTimeZoneId": "UTC",
      "notifyOption": "MailOnFailure"
    },
    "targetStorageMode": "Abf",
    "createdDate": "",
    "contentProviderType": "",
    "datasourceUsages": [
      {
        "datasourceInstanceId": "005f54d6-6e55-496c-93da-e2a96356b72a"
      }
    ],
    "tags": []
  },
I've read that the datasourceDetails=True parameter might be the issue, but I do pass it in the request URL.
Could you please help me figure out why the 'datasourceInstances' list is not in the GetScanResult response?
Hi @1ng4lipt ,
Thank you for providing the detailed request flow. I have reviewed your implementation, and your PostWorkspaceInfo call appears to be correct, with the parameters set appropriately. The use of true versus True will not affect the response in this context.
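If you would still prefer to send the lowercase form used in the documentation, a minimal sketch is below. It only illustrates how Python's `urlencode` renders booleans; `scan_query_string` is a hypothetical helper name, not part of the Power BI API or of the original code.

```python
from urllib.parse import urlencode


def scan_query_string(flags: dict[str, bool]) -> str:
    """Serialize boolean scan options as lowercase 'true'/'false'."""
    # urlencode() calls str() on each value, so a Python bool would become
    # "True"/"False"; lowercasing first gives the documented spelling.
    return urlencode({name: str(value).lower() for name, value in flags.items()})


flags = {
    "getArtifactUsers": True,
    "lineage": True,
    "datasourceDetails": True,
    "datasetSchema": True,
    "datasetExpressions": True,
}
query = scan_query_string(flags)
print(query)
```

Passing the booleans straight to `urlencode` would produce the capitalized `True` seen in the log line from the question.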
Notably, your scan result returns datasourceUsages with a valid datasourceInstanceId, indicating that the scan is successfully identifying the relationship between the dataset and its datasource. If there were an issue with the request, this information would typically not be present.
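To make that relationship concrete, here is a hedged sketch of how the ids could be joined once the list is available. It assumes the layout of the documented sample response, where `datasourceInstances` sits at the top level of the scan result next to `workspaces` (not inside each dataset) and each instance is keyed by `datasourceId`. The function name and the `unresolved` marker are illustrative only.

```python
from typing import Any


def resolve_datasources(scan_result: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Map each dataset id to the datasource entries its usages point at."""
    # Index the top-level instances by id (key name assumed: "datasourceId").
    instances = {
        inst["datasourceId"]: inst
        for inst in scan_result.get("datasourceInstances", [])
    }
    resolved: dict[str, list[dict[str, Any]]] = {}
    for workspace in scan_result.get("workspaces", []):
        for dataset in workspace.get("datasets", []):
            entries = []
            for usage in dataset.get("datasourceUsages", []):
                instance_id = usage["datasourceInstanceId"]
                # Keep unmatched ids visible instead of silently dropping them.
                entries.append(
                    instances.get(
                        instance_id,
                        {"datasourceId": instance_id, "unresolved": True},
                    )
                )
            resolved[dataset["id"]] = entries
    return resolved


# Made-up sample data shaped like a scan result.
sample = {
    "workspaces": [
        {
            "datasets": [
                {"id": "ds-1", "datasourceUsages": [{"datasourceInstanceId": "a"}]},
                {"id": "ds-2", "datasourceUsages": [{"datasourceInstanceId": "b"}]},
            ]
        }
    ],
    "datasourceInstances": [{"datasourceId": "a", "datasourceType": "Sql"}],
}
result = resolve_datasources(sample)
```

With this sample, `ds-1` resolves to the `Sql` instance, while `ds-2` is flagged as unresolved, which is the situation described in the question.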
The absence of the datasourceInstances section generally occurs when the scan API cannot fully resolve the underlying datasource metadata. As outlined in the Microsoft documentation for GetScanResult, the API may only return certain properties depending on the metadata available in the Power BI service. As a result, the response can vary based on datasource type, connection configuration, gateway setup, or how the dataset was created.
This behavior is common with specific cloud connections, custom connectors, parameterized connections, dataflows, and certain Fabric-related sources, where the relationship is detected but complete datasource details are not provided in the scan response.
Since your request is properly configured and the scan completes successfully, this appears to be related to the metadata the API returns for that datasource, rather than an issue with your code.
If you require comprehensive datasource connection details, I recommend using the Get Datasources As Admin endpoint, which is generally more reliable when datasourceInstances is missing.
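A minimal sketch of that fallback call is below, using only the standard library. It assumes you already hold a valid bearer token for the Power BI scope; the function names are hypothetical, and error handling and throttling are left out.

```python
import json
import urllib.request
from typing import Any

API_ROOT = "https://api.powerbi.com/v1.0/myorg"


def datasources_as_admin_url(dataset_id: str) -> str:
    """Build the Get Datasources As Admin URL for one dataset."""
    return f"{API_ROOT}/admin/datasets/{dataset_id}/datasources"


def fetch_datasources_as_admin(dataset_id: str, access_token: str) -> list[dict[str, Any]]:
    """Call the endpoint and return the datasource entries from 'value'."""
    request = urllib.request.Request(
        datasources_as_admin_url(dataset_id),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # The documented response wraps the datasources in a "value" list.
    return payload.get("value", [])
```

This issues one request per dataset, so for a large tenant it is considerably more expensive than reading a single scan result.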
For further reference, please see the following Microsoft documentation:
GetScanResult API documentation
https://learn.microsoft.com/rest/api/power-bi/admin/workspace-info-get-scan-result
PostWorkspaceInfo API documentation
https://learn.microsoft.com/rest/api/power-bi/admin/workspace-info-post-workspace-info
Get Datasources As Admin documentation
https://learn.microsoft.com/rest/api/power-bi/admin/datasets-get-datasources-as-admin
Thank you.
Hello @v-tejrama ,
Thank you for this detailed explanation.
I want to add that the dataRetrievalState column contains the values
UpstreamLineageErrors; DatasetSchemaDisabledByAdmin; DatasetExpressionsDisabledByAdmin
and
DatasetSchemaDisabledByAdmin; DatasetExpressionsDisabledByAdmin
which is obviously caused by the situation you mentioned.
Still, I want to ask: does it make sense to go to the PBI admin/app settings to check which metadata is allowed to be collected, if that is possible at all? Unfortunately, I am not a PBI admin and have no idea what is inside the PBI admin panel. If so, could you please suggest which settings I should ask our admin to check?
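For anyone wanting to see how widespread these flags are before talking to an admin, here is a small sketch that tallies them. It assumes dataRetrievalState arrives as a semicolon-separated string, as in the values quoted above; the helper names are illustrative. Judging by their names alone, the two `...DisabledByAdmin` flags appear to correspond to the "Enhance admin APIs responses" tenant settings in the admin portal, but that is an inference worth confirming with your admin.

```python
from collections import Counter
from typing import Iterable


def split_retrieval_state(value: str) -> list[str]:
    """Split one dataRetrievalState string into its individual flags."""
    return [part.strip() for part in value.split(";") if part.strip()]


def count_retrieval_flags(values: Iterable[str]) -> Counter:
    """Count how often each flag occurs across many workspaces."""
    counts = Counter()
    for value in values:
        counts.update(split_retrieval_state(value))
    return counts


# The two values reported in the post above.
states = [
    "UpstreamLineageErrors; DatasetSchemaDisabledByAdmin; DatasetExpressionsDisabledByAdmin",
    "DatasetSchemaDisabledByAdmin; DatasetExpressionsDisabledByAdmin",
]
flag_counts = count_retrieval_flags(states)
```

A flag that appears on every workspace points to a tenant-wide setting rather than a problem with individual datasets.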
Hello @lbendlin ,
Thanks for the screenshot. I'll ask our admin to check these settings and come back to you with more questions if you don't mind 🙂