Oracle SQL | Category Archives: adaptive serial direct path reads

Triaging Smart Scan

Posted on April 8, 2021 by Roger MacNicol Posted in adaptive serial direct path reads, cell_offload, oracle, SmartScan, trace 2,810 Page views Leave a comment

This document is my attempt to bring together the available options that can be used to determine the root cause of an issue in order to create a roadmap to help support engineers narrow down the cause of concern.

It is a living document and will be edited and amended as time goes by. Please do check back again in the future.

Warning: these parameters should only be used in conjunction with an Oracle Support Engineer and are not intended for DBAs to self-triage; also they should not be left set after triage without discussion with an Oracle Support Engineer.

Continue reading→

Serial Scans failing to offload

Posted on October 13, 2020 by Roger MacNicol Posted in adaptive serial direct path reads, cell_offload, oracle, SmartScan 1,870 Page views 1 Comment

Very Large Buffer Cache

We’ve observed databases with very large buffer caches where Serial Scans don’t make use of Smart Scan when that would have executed faster: improvements to the decision making for Serial Scans have been made under bug 31626438. This fix is back-portable.

A key difference between PQ and Serial is that as part of granule generation PQ sums the sizes of all the partitions that have not been pruned and passes that total size to the buffer cache decision making logic. Because the entire size to be scanned is considered, we make an accurate determination of smart scan benefits and the risk of cache thrashing.

Serial Scans on partitioned tables do not involved the coordinator and have no opportunity to get the larger picture, instead they start work immediately so each partition is considered one at a time and only that one partition’s size is considered by the decision for using Buffer Cache or Direct Read (and hence offload). In the presence of very large buffer caches any given partition can fail the “Is Medium” test (or even the “Is Small” test) and so not get offloaded.

In order to avoid this situation an upper bound of 100MB for using a buffer cache scan has been implemented for any serially scanned segment that:

isn’t using Automatic Big Table Caching (ABTC).
hasn’t had the Small Table parameter changed to a non-default value.

Any partitions larger than 100 MB will now automatically use Direct Read and hence offload on Exadata.

NSMTIO: kxfxghwm:[HWM_NOT_FOUND]

Another case to watch out for is when NSMTIO tracing shows HWM_NOT_FOUND and then choosing a Buffer Cache scan when a Direct Read offloaded scan would have been faster. This can happen when a PQ query gets executed serially (NB: this is NOT the downgrade to serial case, this is still PQ but on a single thread). In this case the coordinator again does not have the opportunity to process all the partitions and as part of that gather the High Water Mark (HWM) for each segment and checkpoint them so we fall back on buffer cache scans. A fix for this is currently being investigated.

Mixed Block Sizes

I have consistently advised against mixing block sizes in a database without a compelling reason backed up by empirical evidence, but for those who must the “Is Medium Table” logic for whether to use buffer cache or direct read has been improved when the database has more than one block size in use. This is tracked by bug 24655250 and fixed in 20.1.

Smart Scan and Recursive queries

Posted on March 5, 2020 by Roger MacNicol Posted in adaptive serial direct path reads, CBO, hints, oracle, parallel, SmartScan, trace, troubleshooting 1,715 Page views Leave a comment

Since Christmas I have been asked to investigate two different “failures to use Smart Scan”. It turns out they both fell into the same little known restriction on the use of Direct Read. Smart Scan critically depends on Direct Read in order to read the synthetic output blocks into private buffers in PGA so with Direct Read disabled Smart Scan is also disabled. In these two cases the restriction is on using Direct Read on Serial Recursive queries.

Case 1: Materialized View Refresh

A customer asked me to investigate why his MView refresh was running slowly and was failing to use Smart Scan. He had used 'trace[NSMTIO] disk=highest' which showed the cause as:

Direct Read for serial qry: disabled(::recursive_call::kctfsage:::)

Continue reading→

Correct syntax for the table_stats hint

Posted on April 16, 2019 by Roger MacNicol Posted in adaptive serial direct path reads, CBO, hints, oracle, SmartScan, trace, troubleshooting, undocumented 2,490 Page views 3 Comments

A friend contacted me to ask why they were having problems using the table_stats hint to influence optimizer decision making and also to influence the decision to use direct read or buffer cache scan so this is just a quick blog post to clarify the syntax as it is not well documented.

table_stats(<table_name> <method> {<keyword>=<value>} )

Method is one of: DEFAULT, SET, SCALE, SAMPLE

Keyword is one of: BLOCKS, ROWS, ROW_LENGTH

Continue reading→

Top-N again: fetch first N rows only vs rownum

Posted on December 30, 2018 by Sayan Malakshinov Posted in adaptive serial direct path reads, CBO, oracle, query optimizing, SQL, troubleshooting 7,440 Page views Leave a comment

Three interesting myths about rowlimiting clause vs rownum have recently been posted on our Russian forum:

TopN query with rownum<=N is always faster than "fetch first N rows only" (ie. row_number()over(order by ...)<=N)
“fetch first N rows only” is always faster than rownum<=N
“SORT ORDER BY STOPKEY” stores just N top records during sorting, while “WINDOW SORT PUSHED RANK” sorts all input and stores all records sorted in memory.

Interestingly that after Vyacheslav posted first statement as an axiom and someone posted old tests(from 2009) and few people made own tests which showed that “fetch first N rows” is about 2-3 times faster than the query with rownum, the final decision was that “fetch first” is always faster.

First of all I want to show that statement #3 is wrong and “WINDOW SORT PUSHED RANK” with row_number works similarly as “SORT ORDER BY STOPKEY”:
It’s pretty easy to show using sort trace:
Let’s create simple small table Tests1 with 1000 rows where A is in range 1-1000 (just 1 block):

create table test1(a not null, b) as
  select level, level from dual connect by level<=1000;

alter session set max_dump_file_size=unlimited;
ALTER SESSION SET EVENTS '10032 trace name context forever, level 10';

ALTER SESSION SET tracefile_identifier = 'rownum';
select * from (select * from test1 order by a) where rownum<=10;

ALTER SESSION SET tracefile_identifier = 'rownumber';
select * from test1 order by a fetch first 10 rows only;

And we can see from the trace files that both queries did the same number of comparisons:

rownum:

[sourcecode language=”sql” highlight=”7″]
—– Current SQL Statement for this session (sql_id=bbg66rcbt76zt) —–
select * from (select * from test1 order by a) where rownum<=10

—- Sort Statistics ——————————
Input records 1000
Output records 10
Total number of comparisons performed 999
Comparisons performed by in-memory sort 999
Total amount of memory used 2048
Uses version 1 sort
—- End of Sort Statistics ———————–
[/sourcecode]

[collapse]

row_number

[sourcecode language=”sql” highlight=”7″]
—– Current SQL Statement for this session (sql_id=duuy4bvaz3d0q) —–
select * from test1 order by a fetch first 10 rows only

[collapse]

Ie. each row (except first one) was compared with the biggest value from top 10 values and since they were bigger than top 10 value, oracle doesn’t compare it with other TopN values.

And if we change the order of rows in the table both of these queries will do the same number of comparisons again:

from 999 to 0

[sourcecode language=”sql”]
create table test1(a not null, b) as
select 1000-level, level from dual connect by level<=1000;

alter session set max_dump_file_size=unlimited;
ALTER SESSION SET EVENTS ‘10032 trace name context forever, level 10’;

ALTER SESSION SET tracefile_identifier = ‘rownum’;
select * from (select * from test1 order by a) where rownum<=10;

ALTER SESSION SET tracefile_identifier = ‘rownumber’;
select * from test1 order by a fetch first 10 rows only;
[/sourcecode]

[collapse]

rownum

[sourcecode language=”sql” highlight=”7″]
—– Current SQL Statement for this session (sql_id=bbg66rcbt76zt) —–
select * from (select * from test1 order by a) where rownum<=10

—- Sort Statistics ——————————
Input records 1000
Output records 1000
Total number of comparisons performed 4976
Comparisons performed by in-memory sort 4976
Total amount of memory used 2048
Uses version 1 sort
—- End of Sort Statistics ———————–
[/sourcecode]

[collapse]

row_number

[sourcecode language=”sql” highlight=”7″]
—– Current SQL Statement for this session (sql_id=duuy4bvaz3d0q) —–
select * from test1 order by a fetch first 10 rows only

[collapse]

We can see that both queries required much more comparisons(4976) here, that’s because each new value is smaller than the biggest value from the topN and even smaller than lowest value, so oracle should get right position for it and it requires 5 comparisons for that (it compares with 10th value, then with 6th, 3rd, 2nd and 1st values from top10). Obviously it makes less comparisons for the first 10 rows.

Now let’s talk about statements #1 and #2:
We know that rownum forces optimizer_mode to switch to “first K rows”, because of the parameter “_optimizer_rownum_pred_based_fkr”

SQL> @param_ rownum

NAME                               VALUE  DEFLT  TYPE      DESCRIPTION
---------------------------------- ------ ------ --------- ------------------------------------------------------
_optimizer_rownum_bind_default     10     TRUE   number    Default value to use for rownum bind
_optimizer_rownum_pred_based_fkr   TRUE   TRUE   boolean   enable the use of first K rows due to rownum predicate
_px_rownum_pd                      TRUE   TRUE   boolean   turn off/on parallel rownum pushdown optimization

while fetch first/row_number doesn’t (it will be changed after the patch #22174392) and it leads to the following consequences:
1. first_rows disables serial direct reads optimization(or smartscan on Exadata), that’s why the tests with big tables showed that “fetch first” were much faster than the query with rownum.
So if we set “_serial_direct_read”=always, we get the same performance in both tests (within the margin of error).

2. In cases when index access (index full scan/index range scan) is better, CBO differently calculates the cardinality of underlying INDEX FULL(range) SCAN:
the query with rownum is optimized for first_k_rows and the cardinality of index access is equal to K rows, but CBO doesn’t reduce cardinality for “fetch first”, so the cost of index access is much higher, compare them:

rownum

[sourcecode language=”sql” highlight=”13″]
SQL> explain plan for
2 select *
3 from (select * from test order by a,b)
4 where rownum<=10;

——————————————————————————————–
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
——————————————————————————————–
| 0 | SELECT STATEMENT | | 10 | 390 | 4 (0)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 10 | 390 | 4 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID| TEST | 1000K| 12M| 4 (0)| 00:00:01 |
| 4 | INDEX FULL SCAN | IX_TEST_AB | 10 | | 3 (0)| 00:00:01 |
——————————————————————————————–

Predicate Information (identified by operation id):
—————————————————

1 – filter(ROWNUM<=10)
[/sourcecode]

[collapse]

fetch first

[sourcecode language=”sql” highlight=”13″]
SQL> explain plan for
2 select *
3 from test
4 order by a,b
5 fetch first 10 rows only;

—————————————————————————————–
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
—————————————————————————————–
| 0 | SELECT STATEMENT | | 10 | 780 | | 5438 (1)| 00:00:01 |
|* 1 | VIEW | | 10 | 780 | | 5438 (1)| 00:00:01 |
|* 2 | WINDOW SORT PUSHED RANK| | 1000K| 12M| 22M| 5438 (1)| 00:00:01 |
| 3 | TABLE ACCESS FULL | TEST | 1000K| 12M| | 690 (1)| 00:00:01 |
—————————————————————————————–

Predicate Information (identified by operation id):
—————————————————

1 – filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=10)
2 – filter(ROW_NUMBER() OVER ( ORDER BY "TEST"."A","TEST"."B")<=10)
[/sourcecode]

[collapse]

fetch first + first_rows

[sourcecode language=”sql” highlight=”14″]
SQL> explain plan for
2 select/*+ first_rows */ *
3 from test
4 order by a,b
5 fetch first 10 rows only;

——————————————————————————————–
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
——————————————————————————————–
| 0 | SELECT STATEMENT | | 10 | 780 | 27376 (1)| 00:00:02 |
|* 1 | VIEW | | 10 | 780 | 27376 (1)| 00:00:02 |
|* 2 | WINDOW NOSORT STOPKEY | | 1000K| 12M| 27376 (1)| 00:00:02 |
| 3 | TABLE ACCESS BY INDEX ROWID| TEST | 1000K| 12M| 27376 (1)| 00:00:02 |
| 4 | INDEX FULL SCAN | IX_TEST_AB | 1000K| | 2637 (1)| 00:00:01 |
——————————————————————————————–

Predicate Information (identified by operation id):
—————————————————

1 – filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=10)
2 – filter(ROW_NUMBER() OVER ( ORDER BY "TEST"."A","TEST"."B")<=10)
[/sourcecode]

[collapse]

fetch first + index

[sourcecode language=”sql” highlight=”14″]
SQL> explain plan for
2 select/*+ index(test (a,b)) */ *
3 from test
4 order by a,b
5 fetch first 10 rows only;

Predicate Information (identified by operation id):
—————————————————

1 – filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=10)
2 – filter(ROW_NUMBER() OVER ( ORDER BY "TEST"."A","TEST"."B")<=10)
[/sourcecode]

[collapse]

So in this case we can add hints “first_rows” or “index”, or install the patch #22174392.

ps. I thought to post this note later, since I hadn’t time enough to add other interesting details about the different TopN variants, including “with tie”, rank(), etc, so I’ll post another note with more details later.

Adaptive serial direct path read decision ignores object statistics since 12.1

Posted on November 19, 2017 by Sayan Malakshinov Posted in 12c, adaptive serial direct path reads, oracle 1,904 Page views 1 Comment

On versions 11.2.0.2 – 11.2.0.4 Oracle uses objects’ statistics to make direct path reads decision (of course, if “_direct_read_decision_statistics_driven” haven’t changed it to “false”), and we can force serial direct reads on statement level using sql profiles with hints INDEX_STATS/TABLES_STATS, but since at least 12.1.0.2 this decision ignores statistics.

Btw, thanks to Jure Bratina, we know now, that we need to repeat hint TABLE_STATS at least twice to make it work 🙂 And from the following test case we know that it takes parameters from second one 🙂

Compare trace files of the following test cases:

table_stats2.sql

[sourcecode language=”sql”]
drop table t1;

create table t1 as select * from dual;
—————————-
pro ######################################;
exec dbms_stats.gather_table_stats(”,’T1′);
exec dbms_stats.set_table_stats(user,’T1′,numblks => 33333333);

col value new_val oldval noprint;
select value from v$statname n, v$mystat s
where n.statistic#=s.statistic# and n.name =’physical reads direct’;

—————————-
alter session set tracefile_identifier=’table_stats2′;
–alter session set events ‘10053 trace name context forever, level 1’;
alter session set events ‘trace[nsmtio] disk highest’;
select/*+ table_stats(t1, scale, blocks=66666666 rows=2222222222)
table_stats(t1, scale, blocks=99999999 rows=4444444444)
*/
*
from t1;
select value-&oldval directreads, value from v$statname n, v$mystat s
where n.statistic#=s.statistic# and n.name =’physical reads direct’;

[/sourcecode]

[collapse]

You can see that our hint successfully changed number of blocks and forced direct path reads on 11.2.0.4:

Oracle 11.2.0.4:

[sourcecode language=”sql” highlight=”3″]
NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=797445
Object# = 78376, Object_Size = 66666666 blocks
SqlId = 7naawntkc57yx, plan_hash_value = 3617692013, Partition# = 0
[/sourcecode]

[collapse]

But on 12.1.0.2 and 12.2.0.1 we can see 2 lines with “NSMTIO: kcbism” with the different types(2 and 3) and different number of blocks, and the direct path read decision was based on second one – from segment header:

Oracle 12.1.0.2:

[sourcecode language=”sql” highlight=”1,2″]
NSMTIO: kcbism: islarge 1 next 0 nblks 66666666 type 3, bpid 65535, kcbisdbfc 0 kcbnhl 4096 kcbstt 983 keep_nb 0 kcbnbh 59010 kcbnwp 1
NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 4096 kcbstt 983 keep_nb 0 kcbnbh 59010 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect’s size: 4 (blocks), Threshold: MTT(4917 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: 7naawntkc57yx, plan_hash_value: 3617692013, Object#: 302342, Parition#: 0 DW_scan: disabled
[/sourcecode]

[collapse]

Oracle 12.2.0.1:

[sourcecode language=”sql” highlight=”1,2″]
NSMTIO: kcbism: islarge 1 next 0 nblks 66666666 type 3, bpid 65535, kcbisdbfc 0 kcbnhl 4096 kcbstt 1214 keep_nb 0 kcbnbh 45026 kcbnwp 1
NSMTIO: kcbism: islarge 0 next 0 nblks 4 type 2, bpid 3, kcbisdbfc 0 kcbnhl 4096 kcbstt 1214 keep_nb 0 kcbnbh 45026 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect’s size: 4 (blocks), Threshold: MTT(6072 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(:::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: 7naawntkc57yx, plan_hash_value: 3617692013, Object#: 174411, Parition#: 0 DW_scan: disabled
[/sourcecode]

[collapse]

And similar example, but for IFFS(index fast full scan):
nb: I set the number of index blocks using dbms_stats to 33333000 and hinted the query with 77777700

index_stats2.sql

[sourcecode language=”sql”]
drop table t2 purge;
ALTER SESSION SET optimizer_dynamic_sampling = 0;
ALTER SESSION SET "_optimizer_use_feedback" = FALSE;
ALTER SESSION SET optimizer_adaptive_features = FALSE;
ALTER SESSION SET optimizer_adaptive_plans=FALSE;

create table t2(x) as select level from dual connect by level<=1000;
create index t2_ix on t2(1,x,rpad(x,100));

begin
dbms_stats.gather_table_stats(”,’T2′,cascade => true);
dbms_stats.set_table_stats(user,’T2′ ,numblks => 33333333);
dbms_stats.set_index_stats(user,’T2_IX’,numlblks => 33333000);
end;
/

col value new_val oldval noprint;
select value from v$statname n, v$mystat s
where n.statistic#=s.statistic# and n.name =’physical reads direct’;

alter session set tracefile_identifier=’index_stats2′;
alter session set events ‘trace[nsmtio] disk highest’;
select/*+ index_stats(t2, t2_ix, scale, blocks=7777700)
index_ffs(t2 t2_ix)
dynamic_sampling(0)
*/
count(*) cnt2
from t2;
select value-&oldval directreads, value from v$statname n, v$mystat s
where n.statistic#=s.statistic# and n.name =’physical reads direct’;

disc;
[/sourcecode]

[collapse]

You can see that on 11.2.0.4 oracle gets number of blocks from the hint (7777700)

Oracle 11.2.0.4 - index_stats:

[sourcecode language=”sql” highlight=”2″]
NSMTIO: qerixFetchFastFullScan:DirectRead[OBJECT_SIZE > VLOT]:
NSMTIO: AdditionalInfo: Object_size: 7777700 (blocks), vlot=797445
SqlId=by2zv0k566hj5, plan_hash_value=3419274230,Object#=78375
[/sourcecode]

[collapse]

From the first line we can see that kcbism takes the hinted number of blocks, but later kcbivlo rewrites it with the number from segment header:

Oracle 12.1.0.2 - index_stats:

[sourcecode language=”sql” highlight=”2″]
NSMTIO: kcbism: islarge 1 next 0 nblks 7777700 type 3, bpid 65535, kcbisdbfc 0 kcbnhl 4096 kcbstt 983 keep_nb 0 kcbnbh 59010 kcbnwp 1
NSMTIO: kcbivlo: nblks 22 vlot 500 pnb 49175 kcbisdbfc 0 is_large 0
NSMTIO: qerixFetchFastFullScan:[MTT < OBJECT_SIZE < VLOT]:NSMTIO: AdditionalInfo: Object_size: 22 (blocks), vlot=245875
SqlId=by2zv0k566hj5, plan_hash_value=3419274230,Object#=302347
[/sourcecode]

[collapse]

Oracle 12.2.0.1 - index_stats:

[sourcecode language=”sql” highlight=”2″]
NSMTIO: kcbism: islarge 1 next 0 nblks 7777700 type 3, bpid 65535, kcbisdbfc 0 kcbnhl 4096 kcbstt 1214 keep_nb 0 kcbnbh 45026 kcbnwp 1
NSMTIO: kcbivlo: nblks 22 vlot 500 pnb 60729 kcbisdbfc 0 is_large 0
NSMTIO: qerixFetchFastFullScan:[MTT < OBJECT_SIZE < VLOT]:NSMTIO: AdditionalInfo: Object_size: 22 (blocks), vlot=303645
SqlId=by2zv0k566hj5, plan_hash_value=3419274230,Object#=174409
[/sourcecode]

[collapse]

So we have 2 options to force direct reads:
1. to execute alter session set “_serial_direct_read”=’ALWAYS’;
2. or to force parallel plan to get parallel direct path reads (we can do it with even with dop=1)

Revisiting buffer cache Very Large Object Threshold

Posted on November 15, 2017 by Roger MacNicol Posted in adaptive serial direct path reads, oracle, SmartScan 1,941 Page views Leave a comment

If you turn on NSMTIO tracing you will see references to VLOT:

qertbFetch:[MTT &lt; OBJECT_SIZE &lt; VLOT]: Checking cost to read from caches (local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

I had said you could ignore VLOT and Frits Hoogland pointed out that tracing showed it had some impact, so let me clarify:

VLOT is the absolute upper bound that cached reads can even be considered.

This defaults to 500% of the number of buffers in the cache i.e.

_very_large_object_threshold = 500

While this number is not used in any calculations, it is used in two places as a cutoff to consider those calculations

1) Can we consider using Automatic Big Table Caching (a.k.a. DWSCAN) for this object?

2) Should we do a cost analysis for Buffer Cache scan vs Direct Read scan on tables larger than the MTT?

The logic for tables above the calculated medium table threshold (MTT) and that are NOT part of searched DMLs and are NOT on Exadata with statistics based storage reduction factor enabled (_statistics_based_srf_enabled) is:

If _serial_direct_read == ALWAYS, use Direct Read
If _serial_direct_read == NEVER, use Buffer Cache
If _serial_direct_read == AUTO and #blocks in table < VLOT, use cost model
Else use Direct Read “qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]”

In practice 5X buffer cache is so large the cost based decision will come to the same conclusion anyway – the default VLOT simply saves time spent doing the analysis.

For example, I got a quick count of the number of blocks in non-partitioned TPC_H Scale 1 lineitem

select segment_name,sum(blocks),sum(bytes) from user_extents where segment_name='LINEITEM'

and created my buffer cache to be exactly the same size. With this setup, setting _very_large_object_threshold=100 bypassed the cost model and went straight to DR scan, while setting it to 200 forced the use of the cost model.

The moral of this is that the default value of VLOT rarely changes the decisions made unless you reduce VLOT to a much smaller multiplier of the cache size and can start to see it cause a few more of your larger buffer cache scans move to direct read when they are no longer eligible for cost analysis. If you wish to stop some of the largest buffer cache scans from happening you would need to set _very_large_object_threshold less than 200.

The beginners guide to Oracle Table Scans

Posted on August 2, 2017 by Roger MacNicol Posted in adaptive serial direct path reads, cell_offload, inmemory, oracle, SmartScan, TurboScan 1,876 Page views Leave a comment

I was asked a question yesterday that reminded me there are always people completely new to the topic who need an introduction – somewhere to start before the other articles make sense. So, here’s my brief write-up of everything you need to know about the basic of Oracle Table Scans.

Oracle has four main ways of scanning a table: the pre-9ir2 table scan, the 9ir2 TurboScan, the 11.1.0.1 Exadata SmartScan, and the 12.1.0.1 In-Memory Scan. Before we summarize each one, the other fundamental piece of information is the Oracle dictum that all blocks much be self-describing: a table scan routine should be able to inspect a block and understand what object it belongs, whether it needs an undo applying, and how the data is laid out without reference to any external structures or secondary storage.

The original table scan routine

Oracle uses a “dataflow” query engine which means a query plan is built from nodes like a sausage machine that have three basic operations: Open, Next, Close. ‘Open’ means you ask the next node in the chain to prepare to do some work including acquiring any resources it may need, ‘Next’ means you fetch one unit of work from your child e.g. a row, and ‘Close’ means to tell your child node to shut down and release any resources it may be holding. You build a query by connecting the right kinds of nodes together in the order you want: one node just sorts, another groups, another does hash joins. The end of the sausage machine is the node seen on query plans as “Table Access Full”

This node would ask the data layer to fetch a block from disk then get rows one at a time from the data layer. This is the work horse table scan: it can scan any kind of data and do SCN manipulations like row versions but it is not the fastest way to scan a table.

9ir2 TurboScan

In 9ir2 we introduced a much faster way of scanning tables called TurboScan. The data layer function which had been handing out rows one at a time was replaced by one that stays in a tight loop retrieving rows from disk and pushing them into a callback supplied by “Table Access Full”. An automation tool was used to generate several versions of this routine that optimized out common choices that has to be made: does the user need rowids to be projected? do they need predicates applying? is the data compressed or? is the data column-major or row-major? etc etc Every time a CPU reaches a branch in the code it tries to guess which side of the branch will be taken but if it guess wrong there can be a considerable stall during which no work gets done. By removing most of the branches, the code runs much much more quickly.

TurboScan is used for all queries which do not use RAW datatypes and which do not need special SCN processing.

Both pre-9ir2 scan and TurboScan can use the buffer cache to get blocks (typically small to medium tables) or use Direct Read to get blocks (typically medium to large tables).

See: When bloggers get it wrong – part 1

TurboScan can be disabled for triage purposes by setting:

SQL> alter session set events='12099 trace name context forever, level 1';

or specifically you can disable it only for HCC tables by setting:

SQL> alter session set "_arch_comp_dbg_scan"=1;

Exadata SmartScan

In 11.1.0.1 we introduced Exadata SmartScan intelligent storage. This is where a thin layer of database processing is embedded in the storage cells and the table scan routine offloads simple search criteria and a list of the columns it needs to storage and the storage cells pre-process the blocks to remove rows that fail the search criteria and remove columns which are not needed by the table scan. If all the rows are removed, the block doesn’t have to be sent back at all.

SmartScan can drastically reduce the amount of data returned on the Interconnect and put on the RDBMS memory bus and the space used in SGA by the returned data. An additional significant benefit is gained when the CPU fetches the reduced blocks into the CPU cache since only relevant information exists on the block there is not space wasted by unwanted columns interspersing the wanted columns meaning more relevant data can fit in memory and the CPU prefetch can do a better job of predicting which memory cache line to fetch next.

Only TurboScan Direct Read scans can use this offload capability. You can disable SmartScan for triage purposes by setting:

SQL> alter session set cell_offload_processing=FALSE;

SQL> select /*+ opt_param('cell_offload_processing','false') */  <col> from <tab> where <predicate>;

In-Memory Scans

In-Memory scans were introduced in 12.1.0.1 and brought a revolutionary increase in table scan speeds. With In-Memory scans the table or partition is loaded into a in-memory tablespace in SGA known as the inmemory-area. Data is stored in compressed columnar format typically up to 500,000 values in each columnar compression unit. This tablespace is kept transactionally consistent with the data on disk via means of an invalidation bitmap.

Just like with SmartScan, only TurboScan can use In-Memory scans with In-Memory objects. Instead of getting a block from disk, the specialized version of the scan routines fetches a column run from each column of interest, process the search criteria, then returns column runs with the failing rows removed to the “Table Access Full” node.

If any rows have been modified and committed by other users or the users own transaction has modified any rows the scan will see these rows set in the invalidation bitmap. These rows are removed from the columnar results and the additional rows required are fetched from the buffer cache before moving on to the next set of column runs. This works well because the most recently modified blocks are the ones most likely to still be in the buffer cache.

When bloggers get it wrong – part 2

Posted on May 4, 2017 by Roger MacNicol Posted in adaptive serial direct path reads, oracle, SmartScan, trace, troubleshooting 2,056 Page views 1 Comment

In Part 2 we are going to look at making use of the trace events that show what was discussed in Part 1.
NB: Oracle no longer adds new numeric trace events, going forward new trace events use the Unified Tracing Service whose grammer is much simpler. The elements we need are:

trace[[x.]y] disk = [ lowest | low | medium | high | highest ]

For example Table Scan tracing is in the DATA hierachy:

[1] DATA
[2] KDS    “Kernel Data Scan”
[3] KDSFTS  “Full Table Scan”
[3] KDSRID  “RowID”

‘trace[KDSFTS] disk low’ – only trace full table scans
‘trace[KDSRID] disk low’ – only trace fetch by rowid
‘trace[KDS.*] disk low’ – trace both table scans and fetch by rowid
NB: don’t use ‘lowest’ with KDS – it is used for memory tracing only

Tracing Full Table Scans: KDSFTS

At the beginning of a granule we see if it is possible to use Turbo Scan (which is a prerequisite for using Exadata Smart Scan) and the data object number being scanned:

Continue reading→

When bloggers get it wrong – part 1

Posted on May 4, 2017 by Roger MacNicol Posted in adaptive serial direct path reads, oracle, SmartScan 2,379 Page views 3 Comments

I’ve read a number of blog entries from various people who’ve clearly put great diligence into trying to understand how the decision to use the buffer cache for a table scan or whether it is better to use direct read for the scan is made. Since this is critical decision from a performance perspective, I’ve decided to write a more definitive account. Part 1 deals with a few principles and part 2 (which will follow very shortly) will show how this works with tracing and clear up a few misunderstandings. Part 3 will deal with PQ in more detail as soon as I have time.

Serial Scans

Small Table Threshold (STT)

Serial scans use direct path reads when the object is not a small table and, since 11.2.0.2, the size used to determine whether it is “small” has come stats rather than the size from segment header.

_small_table_threshold defaults to the 2% buffer cache or to 20 blocks whichever is bigger (NB: this parameter ignores the effects of compression). An object is small enough for buffer cache reads to be more effective than direct path read if the number of blocks it contains is less than or equal to the value of _small_table_threshold. Such scans will bump the stat “table scans (short tables)”

Medium Table Threshold (MTT)

So what happens when an object is bigger than STT? We fall into the MTT case: MTT is not directly configurable and defaults to 10% of the buffer cache blocks. But it is also depends on a cost based decision that includes how busy the cache is.
NB: PQ and Exadata have different rules.
NB: MTT also depends on _object_statistics being enabled.

For STT < #blocks < MTT, if the cost based decision means a buffer cache scan is chosen, the stats “table scans (short tables)” will still get bumped.

So when is a medium table not worth caching?

First, when MTT < #blocks (ignore VLOT in blogs, it hasn’t been used for many years but persists in the trace and blogs for historic reasons).

Second, if the scan is not a select for update: different rules apply because cache scans are more useful if the segment being updated is first the subject of a query. Table larger than MTT also do factor in storage reduction factor. The ratios used are 10X for HCC and 3.3.X for OLTP.

Third, a cost based analysis is done comprising the % currently in cache, the % buffers are dirty (check-pointing cost), the current I/O bottleneck, and any RAC access costs. If the buffer cache is still used the stat “table scans (long tables)” will get bumped. But, if Direct Read is chosen the stat “table scans (direct read)” will get bumped.

Automatic Big Table Caching (ABTC)

So does this mean the buffer cache never helps with large tables? Not at all, large tables can still use the buffer cache for a portion of their scan based on the public parameter DB_BIG_TABLE_CACHE_PERCENT_TARGET which sets aside a percentage of the buffer cache for larger tables

On a single instance, ABTC works with both Serial and PQ scans BUT mixing Serial and PQ may lead to fewer cache hits. On RAC, it only works with PQ and requires that PARALLEL_DEGREE_POLICY set to AUTO or ADAPTIVE.

ABTC works by tracking the heat of medium and large tables and is designed to take advantage of any runs of buffers currently cached. Then if the decision to use ABTC is made, it periodically checks the cache for how many buffers in the next section are already in memory to see whether Direct Read or Buffer Cache scan would be better for the next chunk. Let’s see how the heat tracking works which looking at the v$bt_scan_obj_temps view.

SQL> select DATAOBJ#,SIZE_IN_BLKS,TEMPERATURE from V$BT_SCAN_OBJ_TEMPS; 

DATAOBJ#   SIZE_IN_BLKS TEMPERATURE
---------- ------------ -----------    
     79042         1689        3000

Each time we scan a medium or larger table it will bump that object’s temperature by 1000. Then, the object with the highest temperature will get absolute priority for caching and, if needed, will evict any previously cached ABTC blocks of other objects.
If the hottest object only partially fits in the buffer cache (which is quite likely), the first section that fits will be read through the buffer cache and cached then the rest of the scan will switch to Direct Read when the ABTC allocation in the cache is fully utilized.
NB: The object will still be check-pointed at the beginning so that it is safe to switch to Direct Read at any time.

Part 2 will include tracing to how how to make use of this feature.