This quick ETL tutorial has the purposed of showing how to categorize the newest records in a table when there are duplicated items in a dataset.
Below is an example using an employee’s table as a starting point:
For some reason, John McGuire and Mary Campbell records are duplicated:
SELECT * FROM EMPLOYEES;
In this example, the goal is to keep only the newest records for each person.
In order to do so, a simple row_number() function needs to be applied to the columns NAME and AGE. See below:
select NAME, AGE, ETL_INSERT_DATE,
row_number() over(partition by NAME, AGE order by ETL_INSERT_DATE desc) asRowNum
from EMPLOYEES
When using the ETL_INSERT_DATE as a parameter, the RowNumber = 1 will always be unique and the newest inserted record, which is the one we want to keep in this exercise.
We just need to filter the records with RowNum = 1, and delete the other records:
with CTE as (
select NAME, AGE, ETL_INSERT_DATE,
row_number() over(partition by NAME, AGE order by ETL_INSERT_DATE desc) asRowNum
from EMPLOYEES
)
select *
from CTE
where RowNum = 1
Since Mark and Lincoln are not duplicated, they were not affected. So RowNumber will be 1 for all employees.