Real data often shows correlations between column values. Because the optimizer does not know these real-world relationships, it will have difficulty estimating the cardinality and may pick sub-optimal execution plans for your queries. For example, it may join tables in an inefficient order or ignore the most efficient index etc.
Let’s create a very simple table to demonstrate how optimizer miscalculates the cardinality when there are correlated columns in WHERE clause:
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE TABLE sampletable( id NUMBER, hotelname VARCHAR2(50), city VARCHAR2(50), country VARCHAR2(20)); INSERT INTO sampletable VALUES( 1, 'Pullman','Barcelona','Spain' ); INSERT INTO sampletable VALUES( 2, 'Gran Melia Fenix','Madrid','Spain' ); INSERT INTO sampletable VALUES( 3, 'Melia Castilla','Madrid','Spain' ); INSERT INTO sampletable VALUES( 4, 'Trump International','New York','US' ); INSERT INTO sampletable VALUES( 5, 'Four Seasons','New York','US' ); INSERT INTO sampletable VALUES( 6, 'Ambasciatori Palace','Rome','Italy' ); COMMIT; |
We’ll gather statistics of our table:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
EXEC SYS.DBMS_STATS.GATHER_TABLE_STATS ( OwnName => 'GOKHAN', TabName => 'sampletable', Method_Opt => 'FOR ALL COLUMNS', Cascade => TRUE ); SELECT COLUMN_NAME, NUM_DISTINCT FROM USER_TAB_COL_STATISTICS WHERE TABLE_NAME = 'SAMPLETABLE'; COLUMN_NAME NUM_DISTINCT ------------------------------ ------------ ID 6 HOTELNAME 6 CITY 4 COUNTRY 3 |
We gathered stats and the optimizer will show the correct cardinality for queries with single predicates. “E-Rows” means estimated rows (cardinality):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
EXPLAIN PLAN FOR SELECT * FROM sampletable WHERE city = 'Madrid'; SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY( format => 'ALLSTATS' )); -------------------------------------------------- | Id | Operation | Name | E-Rows | -------------------------------------------------- | 0 | SELECT STATEMENT | | 2 | |* 1 | TABLE ACCESS FULL| SAMPLETABLE | 2 | -------------------------------------------------- EXPLAIN PLAN FOR SELECT * FROM sampletable WHERE country = 'Spain'; SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY( format => 'ALLSTATS' )); -------------------------------------------------- | Id | Operation | Name | E-Rows | -------------------------------------------------- | 0 | SELECT STATEMENT | | 3 | |* 1 | TABLE ACCESS FULL| SAMPLETABLE | 3 | -------------------------------------------------- |
In next query, we will use two predicates. Logically one of them is ineffective but optimizer will think that both predicates will filter the rows:
1 2 3 4 5 6 7 8 9 10 11 12 |
EXPLAIN PLAN FOR SELECT * FROM sampletable WHERE city = 'Madrid' and country = 'Spain'; SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY( format => 'ALLSTATS' )); -------------------------------------------------- | Id | Operation | Name | E-Rows | -------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | |* 1 | TABLE ACCESS FULL| SAMPLETABLE | 1 | -------------------------------------------------- |
As you see, the optimizer assumes each predicates will filter the rows so it expects only one returns from full table scan. In Oracle 11g, we can create extended statistics, so the optimizer can include the correlation factor between the columns when estimating the cardinality.
Here’s the function to create extended statistics:
DBMS_STATS.CREATE_EXTENDED_STATS(
ownname VARCHAR2,
tabname VARCHAR2,
extension VARCHAR2)
RETURN VARCHAR2;
Extension can be either a column group or an expression. Suppose the specified table has two columns c1, c2. An example column group can be “(c1, c2)” and an example expression can be “(c1 + c2)”.
This function creates a column statistics entry in the system for a user specified column group or an expression in a table. Statistics for this extension will be gathered when user or auto statistics gathering job gathers statistics for the table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
SELECT SYS.DBMS_STATS.CREATE_EXTENDED_STATS ('GOKHAN','sampletable','(city,country)' ) FROM DUAL; EXEC SYS.DBMS_STATS.GATHER_TABLE_STATS ( OwnName => 'GOKHAN', TabName => 'sampletable', Method_Opt => 'FOR ALL COLUMNS', Cascade => TRUE ); EXPLAIN PLAN FOR SELECT * FROM sampletable WHERE country = 'Spain' and city = 'Madrid' ; SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY( format => 'ALLSTATS' )); -------------------------------------------------- | Id | Operation | Name | E-Rows | -------------------------------------------------- | 0 | SELECT STATEMENT | | 2 | |* 1 | TABLE ACCESS FULL| SAMPLETABLE | 2 | -------------------------------------------------- |
Now the optimizer knows that there’s a correlation between these columns, and makes a better estimation for the cardinality.
Here are some notes about extended statistics:
1) An extension cannot contain a virtual column.
2) You can not create extended statistics on the tables owned by SYS.
3) You can not create extended statistics on cluster tables, index organized tables, temporary tables, external tables.
4) COMPATIBLE parameter needs to be 11.0.0.0.0 or greater.
5) Extended statistics only works with equality predicates and in-lists.
It’s possible to drop the extended statistics:
PROCEDURE DROP_EXTENDED_STATS(
ownname VARCHAR2,
tabname VARCHAR2,
extension VARCHAR2);