arpost
Kudo Collector

Do pipelines replicate data or just metadata about the source dataset?

I'm interested in using the new Deployment Pipelines feature but want to be sure I'm correct in my understanding of the nature of pipelines. When I deploy via a pipeline from a Dev workspace, I see that my dataflows and datasets now appear as separate artifacts or "objects" in the Test workspace, which concerned me as some of our datasets may require imports of large volumes of data.

However, the linked section of this Microsoft KB entry seems to suggest that pipelines only copy metadata and not the actual data contained in the referenced data source. Is this correct?

 

For example, let's say a Dev workspace had the following:

  1. 2 dataflows connected to a SQL database;
  2. 1 dataset that models the 2 dataflows, which result in 500,000 imported records;
  3. 1 report that uses the dataset above.

If this is then deployed to the Test stage, does this mean that the 500,000 records are only stored ONCE or am I now working with the original 500,000 and duplicated 500,000 records for a total of 1 million across my workspaces?


3 REPLIES
jeffshieldsdev
Solution Sage

Correct, data is not copied. Each Workspace only has the data contained within it--be it dataflows or datasets. The idea is to just work with assets within one environment: dev, test or prod.

 

When promoting to Test, existing data may be retained if there are no structural changes; otherwise, you'll have to refresh the dataflows and datasets in Test to populate them.

Thanks for the reply, @jeffshieldsdev. Is there a recommended method, then, of working with a single dataset in a central workspace rather than having each workspace contain data in a dataset? I thought parameterization might be a viable option, but it sounds like the pipeline would actually "undo" that by requiring each workspace to contain its own dataset.

 

Just trying to think about long-term performance and not consuming data storage unnecessarily. 

 

Oh, and to confirm I'm understanding, when you said the following:


@jeffshieldsdev wrote:

Correct, data is not copied. Each Workspace only has the data contained within it--be it dataflows or datasets. The idea is to just work with assets within one environment: dev, test or prod.

 


you were saying "the idea is to ensure all assets are contained within a single workspace," as opposed to saying "the idea is to store data in one workspace and have the other stages reference the data in that workspace"?

The intent is that you only use the prod workspace for consumption: dev is for development and test is for testing. All downstream dataflows, datasets, and reports should always consume from the prod workspace. Only promote to prod what has been tested.

 

There is no additional cost associated with storing data in all three environments. Depending on your data source there may be other costs associated with extraction, but using pipelines enables you to test without overwriting prod.

 

An option to limit data is to create a parameter in your dataset (like "DevelopmentMode", set to "dev", "test", or "prod"). Have your queries check this parameter: when "dev", only import a small number of rows (I use 10); if "test", import a medium-high amount (I use 10,000); and if "prod" or blank, don't impose a filter at all.
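A minimal sketch of how that check might look in Power Query M, assuming a text parameter named DevelopmentMode and a hypothetical dbo.Orders table (the server and database names are placeholders):

let
    // Connect to the SQL source (placeholder server/database names)
    Source = Sql.Database("myserver", "mydb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Cap the row count based on the DevelopmentMode parameter;
    // "prod" or a blank value falls through to the full table
    Result =
        if DevelopmentMode = "dev" then Table.FirstN(Orders, 10)
        else if DevelopmentMode = "test" then Table.FirstN(Orders, 10000)
        else Orders
in
    Result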

 

You can then assign deployment rules to set the parameter in each stage's Workspace in the pipeline settings.
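For reference, the parameter itself is just a small query inside the dataset. Viewed in the Advanced Editor it typically looks something like the sketch below (the literal value is what a deployment rule overrides per stage; the exact metadata Power BI Desktop generates may differ slightly):

"dev" meta [IsParameterQuery = true, List = {"dev", "test", "prod"}, DefaultValue = "dev", Type = "Text", IsParameterQueryRequired = true]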
