Where is the duplicate

I have often said  ‘One test is worth a thousand speculations.’    I never assume that one method is right or even right in every case.   I look at what might work and then test.   So there is a forum I follow,  where there was a question on how to delete duplicates.   The general consensus was the way I have done it in the past:

DELETE
  FROM tab
 WHERE ROWID NOT IN
    (SELECT   MIN (ROWID)
       FROM tab
      GROUP BY <dup column list> )

But there were also some new ideas, to me at least,  on alternative solutions.  The example was how to delete duplicate firstname and lastname columns in the emp table.  The first suggestions were variations on the template above:

/* del_by_group.sql */

DELETE FROM emp WHERE ROWID NOT IN (
SELECT MIN (ROWID) FROM emp GROUP BY firstname , lastname)

The next was using an analytic function:

/* del_by_analytic.sql */ 
delete emp where rowid in  (
select lag(rowid) 
 over(partition by firstname, lastname order by empid) from emp ) 
/

the last was inspired by using analytics and is using TOP-N logic.

/* del_by_rownumber.sql */
DELETE
  FROM EMP t
 WHERE t.ROWID IN
    ( SELECT   dupl.RID
       FROM ( SELECT row_number() 
               OVER (PARTITION BY FirstName, 
                                  LastName 
                      ORDER BY EmpID) AS ROWNR,
            FirstName  ,
            LastName   ,
            ROWID AS RID
       FROM EMP
   ORDER BY FirstName,
            LastName ) dupl
  WHERE dupl.ROWNR > 1 ) /

So I created an emp table with empid, firstname, and lastname by building data from all_objects,  then tested  the various methods.  Here is the script to create the test data.

/* del_dup_test */
truncate table emp;
drop table emp;
create table emp as ( select object_id as empid, OWNER firstname , OBJECT_NAME lastname from all_objects ) ;
insert into  emp  ( select object_id+26000,  initcap( owner ) ,initcap( OBJECT_NAME ) from all_objects ) ;
insert into  emp  ( select object_id+52000,  initcap( owner ) , ( OBJECT_NAME ) from all_objects ) ;
insert into  emp  ( select object_id+78000,  initcap( owner ) , ( OBJECT_NAME ) from all_objects where owner <> 'SYS'  ) ;
insert into  emp  ( select object_id+104000,  initcap( owner ) , initcap( OBJECT_NAME ) from all_objects where owner <> 'SYS'  ) ;
commit;
select count(*) from emp ; 
@del_by_group
select count(*) from emp ;
roll
/
@del_by_analytic
select count(*) from emp ;
roll
/
@del_by_rownumber
select count(*) from emp ;
roll
/
exec dbms_output.put_line( ' del_by_group');
set autotrace traceonly exp stat ;
@del_by_group
SET autotrace off ;
roll
exec dbms_output.put_line( ' del_by_analytic');
set autotrace traceonly exp stat ;
@del_by_analytic
SET autotrace off ;
roll
exec dbms_output.put_line( ' del_by_rownumber')
set autotrace traceonly exp stat ;
@del_by_rownumber
SET autotrace off ;
roll

If you try this you may need to change the multiple of 26000 to prevent duplicate  empid’s.   I create three rows for every row in all_objects, the base row, a row where both first and last names are initcap’ed and the third row where only the first is initcap’ed.   Then I create duplicates for those extra row excluding the ‘SYS’ schema.

The rest of the script calls the other three scripts, first to show they all produce the same result and then calls them again with tracing on (there are other ways this can be evaluated but this gives you the idea).

Now here’s the spooled output from running the script.  This is with feed and timing off to reduce the clutter.   The times were all sub second.

SQL> @del_dup_test

    COUNT(*)
------------
       87865

    COUNT(*)
------------
       58239

    COUNT(*)
------------
       87865

    COUNT(*)
------------
       58239

    COUNT(*)
------------
       87865

    COUNT(*)
------------
       58239

    COUNT(*)
------------
       87865
 del_by_group

Execution Plan
----------------------------------------------------------
Plan hash value: 3239269824

------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT      |          | 46243 |  1083K|       |   366   (3)| 00:00:05 |
|   1 |  DELETE               | EMP      |       |       |       |            |          |
|*  2 |   HASH JOIN ANTI      |          | 46243 |  1083K|  1088K|   366   (3)| 00:00:05 |
|   3 |    TABLE ACCESS FULL  | EMP      | 46243 |   541K|       |   127   (1)| 00:00:02 |
|   4 |    VIEW               | VW_NSO_1 | 46243 |   541K|       |   131   (4)| 00:00:02 |
|   5 |     SORT GROUP BY     |          | 46243 |  2077K|       |   131   (4)| 00:00:02 |
|   6 |      TABLE ACCESS FULL| EMP      | 46243 |  2077K|       |   127   (1)| 00:00:02 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(ROWID="MIN(ROWID)")

Note
-----
   - dynamic sampling used for this statement (level=2)

Statistics
----------------------------------------------------------
         36  recursive calls
      30884  db block gets
        940  consistent gets
          0  physical reads
    8308824  redo size
       1141  bytes sent via SQL*Net to client
       1346  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      29626  rows processed

 del_by_analytic

Execution Plan
----------------------------------------------------------
Plan hash value: 1245347917

-------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT       |          |    38M|   873M|       |  2501  (15)| 00:00:31 |
|   1 |  DELETE                | EMP      |       |       |       |            |          |
|*  2 |   HASH JOIN            |          |    38M|   873M|  1448K|  2501  (15)| 00:00:31 |
|   3 |    VIEW                | VW_NSO_1 | 61779 |   723K|       |  1892   (1)| 00:00:23 |
|   4 |     SORT UNIQUE        |          | 61779 |  3559K|  4384K|  1892   (1)| 00:00:23 |
|   5 |      WINDOW SORT       |          | 61779 |  3559K|  4384K|  1892   (1)| 00:00:23 |
|   6 |       TABLE ACCESS FULL| EMP      | 61779 |  3559K|       |   127   (1)| 00:00:02 |
|   7 |    TABLE ACCESS FULL   | EMP      | 61779 |   723K|       |   127   (1)| 00:00:02 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(ROWID="$kkqu_col_1")

Note
-----
   - dynamic sampling used for this statement (level=2)

Statistics
----------------------------------------------------------
         36  recursive calls
      30863  db block gets
        940  consistent gets
          0  physical reads
    8303216  redo size
       1141  bytes sent via SQL*Net to client
       1362  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
      29626  rows processed

 del_by_rownumber

Execution Plan
----------------------------------------------------------
Plan hash value: 3671588261

-----------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT             |          |     1 |    24 |   342   (2)| 00:00:05 |
|   1 |  DELETE                      | EMP      |       |       |            |          |
|   2 |   NESTED LOOPS               |          |     1 |    24 |   342   (2)| 00:00:05 |
|   3 |    VIEW                      | VW_NSO_1 | 46251 |   542K|   127   (1)| 00:00:02 |
|   4 |     SORT UNIQUE              |          |     1 |  1129K|            |          |
|*  5 |      VIEW                    |          | 46251 |  1129K|   127   (1)| 00:00:02 |
|   6 |       WINDOW SORT            |          | 46251 |  2664K|   127   (1)| 00:00:02 |
|   7 |        TABLE ACCESS FULL     | EMP      | 46251 |  2664K|   127   (1)| 00:00:02 |
|   8 |    TABLE ACCESS BY USER ROWID| EMP      |     1 |    12 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("DUPL"."ROWNR">1)

Note
-----
   - dynamic sampling used for this statement (level=2)

Statistics
----------------------------------------------------------
         36  recursive calls
      30883  db block gets
        940  consistent gets
          0  physical reads
    8309716  redo size
       1141  bytes sent via SQL*Net to client
       1499  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      29626  rows processed

So there were some surprises for me.   First that the tried and true way did so well.   But also the one that looked the most complex produced the best trace.    But even so, I’ll stick with the way I have always done delete of dupes as I can remember how to do it off the top of my head,  and my databases do not have duplicates very often.   In regards to the group by, changing min to max will easily delete different entries.   This is also not a complete test.  The duplicates are based on only two columns, and the percentage of duplicates is fairly high.   Sometimes there are duplicates and other columns  have a bearing on which records are to be removed.

Bookmark the permalink.

Leave a Reply