smpa01
Super User

Programmatically write files in Delta

How can I programmatically write files (create if not exists + write/overwrite daily) in the Files section of the lakehouse? The following did not work, where I am trying to create a .txt file from a string (not a dataframe):

import os

# content to write
content = 'lorem ipsum dolores'
# base path
base_path = "Files/materialized/"
# desired file name
query_file = "daily.txt"
# full file path
file_path = base_path + query_file
# create the folder if it does not exist
os.makedirs(os.path.dirname(file_path), exist_ok=True)
# write the content
with open(file_path, 'w') as file:
    file.write(content)

Also, which file extension is optimal (from a Delta Lake compression perspective, and hence more performant)? I need to log a string to the file daily and then read that string back from the file.

 

I have also tried the following, but I realized that it creates a folder called f1 with multiple part files and a _SUCCESS file. Is there any way to control the file name so the output is a single f1.txt? The Copy activity in a pipeline creates a single file with the exact desired name at the destination. Can I do in a notebook what is achievable in a pipeline?

 

from pyspark.sql import Row

# content to write
content = 'lorem ipsum dolores'
# build a single-row dataframe from the string
df = spark.createDataFrame([Row(value=content)])
# write the single-row dataframe
df.write.mode("overwrite").parquet("Files/f1")

Thank you in advance.

 

4 REPLIES
frithjof_v
Super User

"The file name control is an important aspect of my workflow, hence I cant stick writing to files using codes in notebook. I don't know if there is a lakehouse api that lets you write(upload) a file with developer created content such that the file name could be exactly same as dev desired."

 

This can be done with code similar to what I used in the previous comment.

 

It can also be done with Pandas. 

 

I think the ADLS Gen2 API can also be used; it supposedly works with OneLake. Here is an example of someone who used the API to connect to OneLake from Power Automate. I guess you can use the API from any client, not just Power Automate. https://www.linkedin.com/pulse/how-call-onelake-api-from-power-automate-enterprise-app-nigel-smith-4...
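
To illustrate, here is a rough sketch of my own (not from the linked article) of what an upload through the ADLS Gen2 API against the OneLake endpoint could look like in Python. It assumes the azure-storage-file-datalake and azure-identity packages are installed; the workspace and lakehouse names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake exposes an ADLS Gen2-compatible DFS endpoint
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)

# The "file system" is the workspace; the path goes through the lakehouse item.
# "MyWorkspace" and "MyLakehouse" are placeholder names.
file_system = service.get_file_system_client("MyWorkspace")
file_client = file_system.get_file_client("MyLakehouse.Lakehouse/Files/materialized/daily.txt")

content = "lorem ipsum dolores"
file_client.upload_data(content, overwrite=True)  # the file keeps exactly this name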

 

PowerShell might also be an option:

https://learn.microsoft.com/en-us/fabric/onelake/onelake-powershell

 

There is also the OneLake explorer, where we can interact with OneLake as a folder structure on our local machine.

smpa01
Super User

@frithjof_v it worked. Many thanks for this; I just tried it out. This is exactly what I had in mind.

frithjof_v
Super User

I was able to use this code to write to a simple text file:

 

 

import os

# Define the sentence you want to write
sentence = "This is the sentence that will be written to the text file."

# Specify the folder path and file name
folder_base_path = "/lakehouse/default/Files/"
folder_relative_path = "sentence_files"
file_name = "output.txt"
folder_path = os.path.join(folder_base_path, folder_relative_path)

# Combine the folder path and file name
file_path = os.path.join(folder_path, file_name)

# Create the directory if it doesn't exist
os.makedirs(folder_path, exist_ok=True)

# Open the file in write mode and write the sentence
with open(file_path, "w") as file:
    file.write(sentence)

print(f"Sentence written to {file_path}")

 

 

The folder_base_path will depend on whether your notebook has a default lakehouse or if you are just mounting lakehouses to your notebook.

In the code I show above, the folder_base_path assumes that the notebook has a default lakehouse.

If you don't want to use a default lakehouse, then you will need to mount a lakehouse instead. However, if you don't have any specific requirements, I would say just use a default lakehouse for your notebook.

 

https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-notebook-explore#switch-lakehous...

 

https://fabric.guru/how-to-mount-a-lakehouse-and-identify-the-mounted-lakehouse-in-fabric-notebook
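
If you do go the mounting route, a minimal sketch (my addition; notebookutils is preloaded in Fabric notebooks, and the workspace, lakehouse and mount point names are placeholders) could look like this:

import os

# Mount a lakehouse at a chosen mount point (placeholder workspace/lakehouse names)
notebookutils.fs.mount(
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse",
    "/mylakehouse",
)

# Resolve the local path of the mount so plain Python file I/O can be used
local_base = notebookutils.fs.getMountPath("/mylakehouse")
file_path = os.path.join(local_base, "Files", "sentence_files", "output.txt")

os.makedirs(os.path.dirname(file_path), exist_ok=True)
with open(file_path, "w") as file:
    file.write("written through a mounted lakehouse")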

 

By default, PySpark creates a folder with multiple files. I guess this is because PySpark uses distributed processing on multiple worker nodes.

 

If you want to write a (not too big) dataframe to a single file, I think the easiest way is to use Pandas.

 

https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-I-just-write-a-CSV-file-to-a-lakeh...

 

https://www.reddit.com/r/MicrosoftFabric/s/J38eNFH1gw
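
As a small sketch of that Pandas approach (my addition; it assumes the notebook has a default lakehouse mounted at /lakehouse/default and that df is the small Spark dataframe from the question):

import os

# Target folder for the single output file (assumes a default lakehouse)
target_folder = "/lakehouse/default/Files/materialized"
os.makedirs(target_folder, exist_ok=True)

# Convert the (small) Spark dataframe to Pandas and write exactly one file
# with the exact name we want
pdf = df.toPandas()
pdf.to_csv(os.path.join(target_folder, "f1.csv"), index=False)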

smpa01
Super User

I usually write to delta tables, not to files. Writing to a table creates the table with the exact desiredTblName. I was hoping the same would happen with files, but to my surprise it creates a folder in the Files section named desiredName (which I was hoping would be the file name), and the actual file names are Spark-generated.

 

I have also observed that if you manually upload a file, or use a pipeline to copy a parquet (or any other available format) file to a sink/destination, the system writes it with the exact desiredName.

 

Control over the file name is an important aspect of my workflow, hence I can't stick to writing files using code in a notebook. I don't know if there is a lakehouse API that lets you write (upload) a file with developer-created content such that the file name is exactly what the developer wants. If there is, I want to try it out, but for now I have split my code to let the pipeline handle that part.

