Anonymous
Not applicable

Prefer CSV or JSON for my dataflows

Hi,

 

I am new to Power BI and would appreciate any help with my question.

 

I have created multiple dataflows that extract data from folder locations in my file system through an on-premises gateway. Each folder contains multiple (~60) JSON files (about 15-20 MB each), which are combined and then transformed in the dataflows. These dataflows are refreshed every day.
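Outside Power Query, the "combine files from a folder" step described above can be sketched in plain Python. This is an illustration only, with throwaway sample data standing in for the real folder and files:

```python
import json
import tempfile
from pathlib import Path

# Create a throwaway folder with a few small JSON files standing in
# for the ~60 real ones (illustrative data only).
folder = Path(tempfile.mkdtemp())
for i in range(3):
    records = [{"file": i, "row": r} for r in range(2)]
    (folder / f"part{i}.json").write_text(json.dumps(records), encoding="utf-8")

# Combine every JSON file in the folder into one list of records --
# roughly what the dataflow's combine step does before transformations run.
combined = []
for path in sorted(folder.glob("*.json")):
    combined.extend(json.loads(path.read_text(encoding="utf-8")))

print(len(combined))  # 3 files x 2 rows each = 6 records
```

The same pattern applies to CSV files; only the per-file parsing step changes.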

 

I also have the alternative of using CSV files instead of JSON in my dataflows.

 

To anyone with experience using both JSON and CSV files in dataflows: which file format should I prefer in order to:

       1) minimize the use of Power BI resources/capacity as much as possible.

       2) minimize the refresh time for the dataflows.

       3) reduce refresh failures as much as possible.

 

I am aware that the transformation queries I am using will affect the points mentioned above, but I wanted to know whether the choice of file format can also make a difference.

 

Regards,

Diptanshu Lal

4 REPLIES
bcdobbs
Super User

I can't find it at the moment, but I did see a blog comparing load times, and I think CSV came out on top. Under the hood, when you run a dataflow it stores the data as CSV in Azure Data Lake Storage, so it's very optimised for that sort of thing.

 

My hunch would be that you're unlikely to notice a huge difference. (Worth testing!)



Ben Dobbs

LinkedIn | Twitter | Blog

Did I answer your question? Mark my post as a solution! This will help others on the forum!
Appreciate your Kudos!!
Anonymous
Not applicable

Hi @bcdobbs ,

 

Do you mean that, according to that blog, CSV took the least time? It would be great if you could point me to that blog or any other source discussing this issue.

 

Also, yes, I have made separate dataflows to perform the extraction and transformation operations. That does help a lot.

 

Best Regards,

Diptanshu Lal

Sorry, the blog was CSV vs Parquet:

https://www.datalineo.com/post/parquet-and-csv-querying-processing-in-power-bi

Both CSV and JSON are plain-text formats, so there won't be much in it. However, JSON carries all the field names for every record, so for the same volume of data the files will be larger. I would therefore expect CSV to load marginally faster than JSON (a lot also depends on how flat the JSON files are).
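The size overhead from repeated field names is easy to see outside Power BI. Below is a minimal Python sketch (illustrative data, not from this thread) serializing the same records as JSON and as CSV; JSON writes every field name once per record, while CSV writes the header once:

```python
import csv
import io
import json

# Hypothetical sample records standing in for one of the ~60 files.
records = [{"id": i, "name": f"item{i}", "value": i * 1.5} for i in range(1000)]

# JSON repeats "id", "name", and "value" for every single record...
json_bytes = json.dumps(records).encode("utf-8")

# ...while CSV writes the header row once, then only the values.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "value"])
writer.writeheader()
writer.writerows(records)
csv_bytes = buf.getvalue().encode("utf-8")

print(len(json_bytes), len(csv_bytes))  # JSON output is noticeably larger
```

How much larger depends on field-name length relative to value length, and on how nested ("flat") the JSON is; deeply nested JSON also needs flattening steps in the dataflow that CSV does not.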

 

Honestly, I don't think you'll find a huge performance impact either way.

One thing that will make a difference if you're reading in lots of CSV (or JSON) files is to separate the ingest from the transformations in your dataflows (assuming you have Premium or Premium Per User).

 

Create a first dataflow that simply pulls in the raw data.

Then create a second that performs the transformations on the data ingested by the first.



