This quick ETL tutorial has the purposed of showing how to categorize the newest records in a table when there are duplicated items in a dataset.

Below is an example using an employee’s table as a starting point:

For some reason, John McGuire and Mary Campbell records are duplicated:


In this example, the goal is to keep only the newest records for each person.

In order to do so, a simple row_number() function needs to be applied to the columns NAME and AGE. See below:

      row_number() over(partition by NAME, AGE order by ETL_INSERT_DATE desc) asRowNum

When using the ETL_INSERT_DATE as a parameter, the RowNumber = 1 will always be unique and the newest inserted record, which is the one we want to keep in this exercise.

We just need to filter the records with RowNum = 1, and delete the other records:

with CTE as (
          row_number() over(partition by NAME, AGE order by ETL_INSERT_DATE desc) asRowNum
       from EMPLOYEES
select *
     from CTE
          where RowNum = 1

Since Mark and Lincoln are not duplicated, they were not affected. So RowNumber will be 1 for all employees.