smeetsh
Helper V

Ingestion of JSON files is inconsistent

Hi All,

We use two pipelines that pull data from an API; the data gets written to several JSON files. In the next step of the pipeline, we ingest those JSON files into a lakehouse "raw" table. Once the JSON data is in the raw table, the next step is an ETL to the production tables.

The problem we are facing is inconsistency in the ingestion of the JSON files into the raw table. The pipelines run automatically at 5 am and 6 am, but we see that some data in the JSON is not actually ingested into the raw table, while other data is. (We have kept the runs an hour apart to rule out one pipeline interfering with the other.)

When I look at the actual JSON file, I can see the data is there in the file itself, so there is no reason for it not to be in the raw table. The really weird thing is that if I run the pipeline manually later, say at 11 am, new files are created, which then get ingested into the raw table correctly.

I have compared the 5 am and the 11 am files and they are identical, except of course for the date they were created.

 

The pipeline itself completes without any error. One of the pipelines we use has concurrency, ingesting up to 5 JSON files at a time; the other doesn't. This would, imho, rule out concurrency being the issue.

Since there seems to be no logic to why it works in one run and not in another, I have no clue where to even start troubleshooting this.

 

Does anyone have any idea where to look for more clues, or is this a bug? (If it is, it's a serious one!)

Warm regards
Hans.

3 REPLIES
smeetsh
Helper V

What happened to all the other replies to this topic, including the link to the MS article and the workaround we discovered? I only see three replies.

Anonymous
Not applicable

Hi @smeetsh 

 

Have you solved your problem? Could you please mark the helpful post as Answered? It will help others in the community find the solution easily if they face the same problem. Thank you.

 

Best Regards    

Zhengdong Xu

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Shravan133
Super User

It sounds like a challenging issue. Here are a few areas to investigate that might help troubleshoot the problem:

1. Pipeline Logs and Monitoring

  • Logs: Check the logs for both successful and failed pipeline runs. Look for any warnings or errors that might provide hints about why some data isn't being ingested.
  • Monitoring Tools: If you have monitoring tools set up, review any alerts or anomalies that coincide with the times of the issues.

2. Concurrency and Resource Limits

  • Pipeline Concurrency: Even though you mentioned that one of the pipelines doesn’t have concurrency and the other does, ensure that resource limits and quotas are not being hit during the scheduled runs.
  • Resource Allocation: Verify if there are resource constraints or throttling issues during the scheduled times.

3. File System and Data Storage

  • File System Checks: Ensure there are no issues with the file system where the JSON files are stored. Sometimes issues with file locking or permissions can cause intermittent problems.
  • Data Storage: Check if the raw table has any data ingestion limits or quotas that might affect its ability to ingest data correctly.

4. File Consistency

  • File Integrity: Even if the files appear identical when compared, there could be timing issues with file availability. Check if the files are fully written and not being updated or locked during the ingestion process.
  • File Naming and Handling: Ensure that there are no issues with file naming conventions or file handling that might cause inconsistencies.
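
On the file-integrity point, one simple check is to confirm a file has stopped growing before ingestion starts. This is only a minimal, generic Python sketch; the folder path and polling interval are placeholders, not anything taken from your actual pipeline.

```python
import os
import time

def wait_until_stable(path: str, interval_s: float = 5.0, max_checks: int = 12) -> bool:
    """Return True once the file size stops changing between two polls
    (i.e. the writer has most likely finished); False if it never settles."""
    last_size = -1
    for _ in range(max_checks):
        size = os.path.getsize(path)
        if size == last_size and size > 0:
            return True
        last_size = size
        time.sleep(interval_s)
    return False

# Hypothetical usage: only hand files to the ingestion step once they are stable.
# incoming_dir = "/lakehouse/default/Files/api_dump"   # placeholder path
# ready = [f for f in os.listdir(incoming_dir)
#          if wait_until_stable(os.path.join(incoming_dir, f))]
```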

5. Pipeline Configuration

  • Data Mapping: Double-check the mapping and transformations in the pipelines to ensure that there are no misconfigurations causing data to be missed.
  • Pipeline Timing: Confirm that the pipelines are correctly handling data written during the time of the scheduled runs.

6. ETL Process

  • ETL Logs: Review the ETL process logs for any issues or errors in data transformation that might affect the final output.
  • Data Validation: Implement additional data validation steps in the ETL process to ensure all data is correctly processed.
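
For the data-validation step, a quick sanity check in a notebook could be to count the records in the landed JSON files and compare that against the row count in the raw table. A rough PySpark sketch, assuming a hypothetical Files path and table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders -- substitute your own Files folder and raw table name.
json_path = "Files/api_dump/2025-08-05/*.json"
raw_table = "raw_api_data"

source_count = spark.read.json(json_path).count()   # records in the landed JSON files
table_count = spark.read.table(raw_table).count()   # rows actually in the raw table

if source_count != table_count:
    print(f"Mismatch: {source_count} JSON records vs {table_count} rows in {raw_table}")
else:
    print(f"OK: {source_count} records in both source files and raw table")
```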

7. Retry Mechanisms

  • Retry Logic: If not already implemented, consider adding retry mechanisms to handle intermittent issues with data ingestion.
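
If the ingestion step is wrapped in a notebook or script, a small retry helper with backoff can paper over transient failures while you hunt for the root cause. A generic Python sketch; ingest_file below is a stand-in for whatever routine your pipeline actually calls:

```python
import time

def with_retries(fn, attempts: int = 3, backoff_s: float = 30.0):
    """Call fn(); on failure wait, double the delay, and retry up to `attempts` times."""
    delay = backoff_s
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2

# Hypothetical usage: ingest_file loads one JSON file into the raw table.
# with_retries(lambda: ingest_file("Files/api_dump/orders_0001.json"))
```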

8. Manual vs. Scheduled Runs

  • Environment Differences: Verify if there are any differences between the environments or configurations used during manual runs versus scheduled runs.

By examining these areas, you might uncover the root cause of the issue and find a solution. If the problem persists and seems like it could be a bug, consider reaching out to the support team of the tool or service you are using.
