Deduplication troubles

Anonymous · ‎06-19-2023

I'm sure I'm missing something simple, but I'd appreciate the help. I'm importing a big XML file that has some duplicates in it.

I changed some info for privacy, but the main point is that I get duplicates on the IP and ID columns. For example, in the first two lines you have ID 105146 twice. In the tracking column, one has AGENT and one has IP. I want to deduplicate all of these, removing the line with IP in the tracking column.

Help would be appreciated. Thanks.

Anonymous · ‎06-25-2023

BUMP

Anonymous · ‎06-29-2023

Hi @Anonymous,

Can you please share some dummy data that keeps the raw data structure, along with the expected results? It would help us clarify your scenario and test a formula.

Regards,

Xiaoxin Sheng

Anonymous · ‎06-29-2023

The raw data looks like this:

Name	Tracking	Owner	Application	IP	QID
server1	Agent	john	windows	1.2.3.4	112334
server1	IP	john	windows	1.2.3.4	112334
server2	Agent	john	linux	5.6.7.8	113445
server2	IP	john	linux	5.6.7.8	113445
server3	Agent	john	sql	10.11.12.13	115667
server3	IP	john	sql	10.11.12.13	115667

There are rows which are duplicates except for the Tracking column, which differs between the two duplicate rows.

I need to find the duplicate rows and remove the row with IP in the Tracking column. The end result would be this:

Name	Tracking	Owner	Application	IP	QID
server1	Agent	john	windows	1.2.3.4	112334
server2	Agent	john	linux	5.6.7.8	113445
server3	Agent	john	sql	10.11.12.13	115667
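
As an illustration only, the transformation from the raw table to the expected result can be sketched in Python using the sample rows above. The function name `drop_ip_duplicates` is made up for this sketch, and it assumes that "duplicate" means every column except Tracking matches.

```python
import csv
import io

# Sample rows copied from the post (tab-separated).
RAW = """Name\tTracking\tOwner\tApplication\tIP\tQID
server1\tAgent\tjohn\twindows\t1.2.3.4\t112334
server1\tIP\tjohn\twindows\t1.2.3.4\t112334
server2\tAgent\tjohn\tlinux\t5.6.7.8\t113445
server2\tIP\tjohn\tlinux\t5.6.7.8\t113445
server3\tAgent\tjohn\tsql\t10.11.12.13\t115667
server3\tIP\tjohn\tsql\t10.11.12.13\t115667
"""


def drop_ip_duplicates(rows):
    """Drop a row tracked as 'IP' when another row matches it on all other columns."""

    def key(row):
        # Assumption: rows are duplicates when everything but Tracking is equal.
        return tuple(value for name, value in row.items() if name != "Tracking")

    # Keys that also appear on at least one row not tracked as 'IP'.
    keys_with_other_tracking = {key(r) for r in rows if r["Tracking"] != "IP"}

    return [
        r for r in rows
        if not (r["Tracking"] == "IP" and key(r) in keys_with_other_tracking)
    ]


rows = list(csv.DictReader(io.StringIO(RAW), delimiter="\t"))
result = drop_ip_duplicates(rows)

for r in result:
    print("\t".join(r.values()))
```

Only the IP-tracked row of each matching pair is dropped, so the order of the remaining rows is unchanged.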

Anonymous · ‎06-20-2023

Hi @Anonymous,

I think this may be related to your source data structure. For this scenario, you can add a filter on the category to filter records equal to blank, or filter on the tracking field if it equals IP.

Regards,

Xiaoxin Sheng

Anonymous · ‎06-20-2023

You're right, the source data contains the duplication, but I am pulling it from an API and I have no ability to clean it up before ingesting the data.

I considered just filtering out rows where the tracking field equals IP; however, there are plenty of rows where IP is not a duplicate, so I need to keep them in the data. I need to find duplicate rows based on IP and ID matching and remove the line which has IP in the tracking column.
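
To show why a plain filter on the tracking column is not enough, here is a rough Python sketch of the rule keyed on IP and ID only. The function name `dedupe_on_ip_and_id` and the `server4` row are invented for illustration; that row stands in for a machine that is only tracked by IP and therefore has to stay.

```python
def dedupe_on_ip_and_id(rows):
    """Remove IP-tracked rows only when the same (IP, QID) pair also has a non-IP row."""
    # (IP, QID) pairs that appear on at least one row not tracked as 'IP'.
    pairs_with_other_tracking = {
        (r["IP"], r["QID"]) for r in rows if r["Tracking"] != "IP"
    }
    return [
        r for r in rows
        if r["Tracking"] != "IP"
        or (r["IP"], r["QID"]) not in pairs_with_other_tracking
    ]


rows = [
    {"Name": "server1", "Tracking": "Agent", "IP": "1.2.3.4", "QID": "112334"},
    {"Name": "server1", "Tracking": "IP", "IP": "1.2.3.4", "QID": "112334"},
    # Hypothetical row, not from the thread: tracked by IP only, so it is kept.
    {"Name": "server4", "Tracking": "IP", "IP": "9.9.9.9", "QID": "119999"},
]

kept = dedupe_on_ip_and_id(rows)
for r in kept:
    print(r["Name"], r["Tracking"])
```

In this sketch an IP-tracked row is removed only if a row with different tracking shares its IP and ID, so rows that exist only with IP tracking pass through untouched.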