This is the third article in my 12c Adaptive Dynamic Sampling series. In Part 1 we saw how ADS works for a single table (including a group by clause), and in Part 2 how it works for joins. As Jonathan Lewis has explained in his book, join cardinality is calculated as:
Join Selectivity =
    ((num_rows(t1) - num_nulls(t1.c1)) / num_rows(t1)) *
    ((num_rows(t2) - num_nulls(t2.c2)) / num_rows(t2)) /
    greater(num_distinct(t1.c1), num_distinct(t2.c2))

Join Cardinality =
    Join Selectivity *
    filtered cardinality(t1) * filtered cardinality(t2)
So, to estimate join cardinality you need three factors: the cardinality of each of the two tables in the join, and the join selectivity. Can this mechanism benefit from dynamic sampling? Estimating the cardinality of both tables obviously can, but what about join selectivity? The answer is yes. Using dynamic sampling, the database can estimate the column statistics it needs, namely the number of distinct values and the number of null values, and that information is enough to estimate join cardinality. In the previous articles of this series we used the t2 and t3 tables. We will use them again here, but this time they are considerably larger: I increased the segment sizes with several insert into ... select ... from statements. Our starting point is:
SEGMENT_NAME         MB    (user_segments)
------------    -------
T2              3110.00
T3              3589.00

TABLE_NAME      NUM_ROWS    BLOCKS    (user_tables)
----------    ----------  --------
T3              14208000    231391
T2              11810048    204150

TABLE_NAME  COLUMN_NAME  NUM_DISTINCT  HISTOGRAM    (user_tab_col_statistics)
----------  -----------  ------------  ---------
T2          OBJECT_NAME         53744  NONE
T3          TABLE_NAME           9004  NONE

TABLE_NAME  STALE_STATS    (user_tab_statistics)
----------  -----------
T2          YES
T3          YES
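Before looking at what the optimizer actually does, it is worth plugging these dictionary statistics into the formula above. Assuming num_nulls = 0 for both join columns (an assumption on my part; neither column contains nulls in this data), and with no filter predicates, the predicted join cardinality works out as follows (GREATEST plays the role of greater() in the formula):

-- sanity check: join selectivity and cardinality from the dictionary stats,
-- assuming num_nulls = 0 for t2.object_name and t3.table_name
select 1/greatest(53744, 9004)                            as join_sel,
       round(11810048 * 14208000 / greatest(53744, 9004)) as join_card
from   dual;

-- JOIN_SEL ≈ 1.8607e-05, JOIN_CARD ≈ 3122156185 (about 3.1 billion rows)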
And we will examine the following query:
select
count(*)
from
t2
,t3
where
--t2.owner=t3.owner and
t2.object_name=t3.table_name
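In my environment ADS kicks in automatically because the object statistics are stale. If it does not trigger on its own in yours, adaptive dynamic sampling can be requested explicitly (a sketch, assuming a 12c database, where level 11 is the AUTO level):

alter session set optimizer_dynamic_sampling = 11;

-- or per statement, via the hint:
select /*+ dynamic_sampling(11) */ count(*)
from   t2, t3
where  t2.object_name = t3.table_name;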
Now, I have decided to set the NDV of t2.object_name to 15000 in the dictionary, although in reality it is 53744. In other words, the number of distinct values recorded in the dictionary will be lower than the actual number. Note that although the object statistics are stale, the dictionary NDV of 53744 still reflects reality, because the repeated insert into ... select ... from statements did not change the number of distinct values in the object_name column. Why change this column statistic at all? You will see shortly. I set it as follows:
DECLARE
  l_distcnt NUMBER;
  l_density NUMBER;
  l_nullcnt NUMBER;
  l_srec    DBMS_STATS.StatRec;
  l_avgclen NUMBER;
BEGIN
  -- read the current column statistics for t2.object_name ...
  DBMS_STATS.get_column_stats (
    ownname => 'sh',
    tabname => 't2',
    colname => 'object_name',
    distcnt => l_distcnt,
    density => l_density,
    nullcnt => l_nullcnt,
    srec    => l_srec,
    avgclen => l_avgclen);

  -- ... then overwrite the NDV and the matching density
  l_distcnt := 15000;
  l_density := 1/15000;

  DBMS_STATS.set_column_stats (
    ownname => 'sh',
    tabname => 't2',
    colname => 'object_name',
    distcnt => l_distcnt,
    density => l_density,
    nullcnt => l_nullcnt,
    srec    => l_srec,
    avgclen => l_avgclen);
END;
/
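To confirm the change, a quick look at the dictionary should now report the fabricated values:

select num_distinct, density
from   user_tab_col_statistics
where  table_name = 'T2'
and    column_name = 'OBJECT_NAME';

-- NUM_DISTINCT    DENSITY
-- ------------  ---------
--        15000  .0000667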
First, I want to note that Oracle completely ignored the dynamic sampling statistics when computing the cardinality estimates here. For the Q3 query, the 10046 trace file contains the following (similar DS_SVC statements appear for both tables, but I am showing only the one for T3):
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
optimizer_features_enable(default) no_parallel */ sum(vsize(C1))/count(*) ,
substrb(dump(max(substrb(C2,1,32)), 16,0,32), 1,120) ,
substrb(dump(min(substrb(C3,1,32)), 16,0,32), 1,120) , SUM(C4),
COUNT(DISTINCT C5)
FROM
(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T3") */ "T3"."TABLE_NAME"
AS C1, "T3"."TABLE_NAME" AS C2, "T3"."TABLE_NAME" AS C3, CASE WHEN
("T3"."TABLE_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T3"."TABLE_NAME" AS
C5 FROM "T3" SAMPLE BLOCK(0.174483, 8) SEED(1) "T3") innerQuery
Sometimes we see that the sample size is increased (generally doubled) on a retry, as below; note that the sample percentage, 0.348966, is exactly twice the 0.174483 used above, and the SEED has changed from 1 to 2:
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
optimizer_features_enable(default) no_parallel */ sum(vsize(C1))/count(*) ,
substrb(dump(max(substrb(C2,1,32)), 16,0,32), 1,120) ,
substrb(dump(min(substrb(C3,1,32)), 16,0,32), 1,120) , SUM(C4),
COUNT(DISTINCT C5)
FROM
(SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T3") */ "T3"."TABLE_NAME"
AS C1, "T3"."TABLE_NAME" AS C2, "T3"."TABLE_NAME" AS C3, CASE WHEN
("T3"."TABLE_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T3"."TABLE_NAME" AS
C5 FROM "T3" SAMPLE BLOCK(0.348966, 8) SEED(2) "T3") innerQuery
What does this SQL mean? It estimates the statistics of the columns involved in the join (the same approach can also be applied to filter predicates), and it is the third method for estimating join cardinality. Here COUNT(DISTINCT C5) gives the number of distinct values of the join (or filter) column, and SUM(C4) gives the number of null values. I removed everything from the SQL except the NDV expression and ran it for the t2 table:
SQL> SELECT
2 COUNT(DISTINCT C5) as num_dist
3 FROM
4 (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T2") */ "T2"."OBJECT_NAME"
5 AS C1, "T2"."OBJECT_NAME" AS C2, "T2"."OBJECT_NAME" AS C3, CASE WHEN
6 ("T2"."OBJECT_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T2"."OBJECT_NAME" AS
7 C5 FROM "T2" SAMPLE BLOCK(0.402778, 8) SEED(2) "T2") innerQuery;
  NUM_DIST
----------
     37492
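For comparison, the true NDV of this column can be obtained directly (a full-scan check, acceptable on a test system); as the dictionary listing at the start of the article showed, it is 53744:

-- the actual number of distinct values, for reference
select count(distinct object_name) as true_ndv from t2;

--   TRUE_NDV
-- ----------
--      53744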
So at this point the sampled NDV is 37492, the real value is 53744, and the dictionary holds the fabricated 15000. With those statistics in place, the plan was:
--------------------------------------------
| Id  | Operation           | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |
|   1 |  SORT AGGREGATE     |      |     1 |
|*  2 |   HASH JOIN         |      |   11G |
|   3 |    TABLE ACCESS FULL| T2   |   11M |
|   4 |    TABLE ACCESS FULL| T3   |   14M |
--------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T2"."OBJECT_NAME"="T3"."TABLE_NAME")

Note
-----
   - Dynamic statistics used: dynamic sampling (level=AUTO)
And from the 10053 trace file:
Best NL cost: 2037555914706.766602
resc: 2037555914706.766602 resc_io: 2030576990663.999756 resc_cpu: 75477063522522944
resp: 2037555914706.766602 resp_io: 2030576990663.999756 resc_cpu: 75477063522522944
SPD: Return code in qosdDSDirSetup: NOCTX, estType = JOIN
Join Card: 11186477465.600000 = outer (11810048.000000) * inner (14208000.000000) * sel (6.6667e-05)
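The selectivity 6.6667e-05 is exactly 1/15000, that is, 1/greater(num_distinct(t2.object_name), num_distinct(t3.table_name)) using the fabricated dictionary NDV of 15000 rather than the sampled 37492. The arithmetic is easy to reproduce:

-- reproducing the optimizer's numbers from the 10053 trace
select 1/greatest(15000, 9004)                     as join_sel,   -- 6.6667e-05
       11810048 * 14208000 / greatest(15000, 9004) as join_card   -- 11186477465.6
from   dual;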
As you can see, the optimizer did not use the number of distinct values computed by the sampling; instead it used num_distinct from the dictionary. The mechanism exists, but it is not used as you would expect. This is despite the fact that the object statistics in the dictionary are STALE, and that even the sampled number of distinct values (37492) is greater than dba_tab_col_statistics.num_distinct (15000). In other words, in this case dynamic sampling gives us more accurate information than the dictionary, yet the optimizer ignores it. That should not happen.
Conclusion
ADS helps when your tables are small; otherwise it may not help, and can even hurt. Depending on the size of the tables involved in the join and on the predicate type (join/filter), dynamic sampling may be ignored completely. Also, in some cases the optimizer estimates column statistics for join selectivity via sampling but then cannot take advantage of them. I hope this will be fixed in an upcoming release. In addition, ADS increases statement parse time, so it can cause additional concurrency overhead in OLTP environments, such as contention on latches and mutexes.