Oracle SQL
  • LICENSE

What you need to know about Offload Quarantines

Posted on March 24, 2016 by Roger MacNicol Posted in oracle, SmartScan 1 Comment

Several questions have couple up in the last few weeks about offload quarantines which means a blog post on this topic is overdue. We work hard to stress test every new rpm that is released but on rare occasions customers can encounter an issue with the thin database layer that exists in the offload server. This layer is known externally as “Smart Scan” and internally as “FPLIB” (a.k.a. Filter Projection Library).

A crash in the thin database layer could because of either an issue with some aspect of the sql_id (for example, with the predicates) or because of an issue with the data on some region of disk (for example, with the OLTP compression symbol table). The worst, and rarest, form of crashes are where striping leads to every offload server failing simultaneously: these are known colloquially as “Railroad Crashes”). The most important thing is to make sure the retry mechanism doesn’t immediately resubmit the query and re-crash the offload server causing a halt to the business operating. In a hospital, the floor nurse would call a code and the crash team would come running with a crash cart to stabilize the patient. Two members of my family are nurses and I’m reminded that nurses are a lot like technical support engineers in that while doing their job they sometimes have to deal with abuse from frustrated patients (customers): please remember that both groups work hard to resolve your issues and be kind to them!

What are Quarantines?

The option of calling a crash cart is not available to us here so starting in early 11.2.0.3, we created a quarantine system where, after a crash, the exception handler remembers both the sql_id and the disk region being processed and creates a persistent quarantine for both. When a sql_id or a disk region is quarantined any Smart Scan operations on them will be executed in passthru mode.

Currently it is hard to see when this is happening, but an upcoming release has a new stat to make this easier. If an operation has been quarantined you can see its effects by monitoring:

  • cell num bytes in passthru due to quarantine

Listing Quarantines 

When a quarantine has been created you can look at it in detail using CellCLI:

CellCLI> list quarantine 1 detail
     name:                      1
         clientPID:             12798
         crashReason:           ORA-600[17099]
         creationTime:          2011-02-07T17:18:13-08:00
         dbUniqueID:            2022407934
         dbUniqueName:          YAMA
         incidentID:            16
         planLineID:            37
         quarantineReason:      Crash
         quarantineType:        "SQL PLAN"
         remoteHostName:        sclbndb02.us.oracle.com
         rpmVersion:            OSS_11.2.0.3.0_LINUX.X64_1012
         sqlID:                 1jw05wutgpfyf
         sqlPlanHashValue:      3386023660

The ‘list detail’ version of the command gives us everything we would need to know about exactly what has been quarantined and why it was quarantined. CellCLI also supports manually creating a quarantine using the attributes shown by ‘list detail’.

Quarantine Escalation

This is the topic that has caused the most confusion: if three new quarantines are generated within a 24 hour period the quarantine is escalated to a database quarantine. Using the ‘list detail’ option we would then see:

quarantinePlan:         SYSTEM
quarantineReason:       Escalated
quarantineType:         DATABASE

Note: the number of quarantines in 24 hours before escalation is configurable via a cellinit param: please contact Technical Support if you feel you have a valid need to change this.

The final level of escalation is where if more than one database has been escalated to a database quarantine, the system will escalate to a complete offload quarantine where Smart Scan is disabled completely and all I/O goes through regular block I/O. I’m glad to say that I’ve have never seen this happen.

Dropping Quarantines 

The next question is how and when are the quarantines removed. Any quarantine can be removed manually using CellCLI. Quarantines are also automatically dropped by certain operations:

  • Plan Step Quarantines:
    • Dropped on rpm upgrades
  • Disk Region Quarantines
    • Dropped on rpm upgrades, or on successful writes to the quarantined disk region
  • Database quarantines
    • Dropped on rpm upgrades
  • Offload quarantines
    • Dropped on rpm upgrades



What about CDB?

In 12.1, we changed the architecture of cellsrv to support multiple RDBMS versions running at the same time by introducing the concept of offload servers. When a new rpm is installed it typically contains offload servers for 11.2.0.4, 12.1.0.1, 12.1.0.2, (and 12.2.0.1). This is known internally as multi-DB. Any given operation is tagged with the RDBMS version it is coming from and routed to the offload server for that version. A crash in Smart Scan typically means that only the offload server has to restart and not the central cellsrv that maintains Storage Index and does Smart IO. A side effect of this is that all operations for that RDBMS version can revert to Block IO while the offload server restarts minimizing disruption.

The architecture change necessitated a change to the way quarantines are created, checked, and dropped. In multi-DB, installation of a new rpm no longer drops all quarantines. Instead, system created quarantines now record the offload server’s rpm version. Manually created quarantines can also optionally specify offload rpm they are to effect. In multi-DB, a quarantine is observed if the offload rpm specified matches the actual offload rpm the operation will be sent to or if no offload rpm is specified regardless of offloadgroup.
Multi-DB quarantines are dropped if the matching offload rpm is uninstalled or a new rpm installed for that offload version. Multi-DB quarantines with no offload rpm specified must be dropped manually.

Please let me know if you have any questions. 

 Roger

Print PDF eBook
Offload Quarantine oracle Roger MacNicol SmartScan
« WINDOW NOSORT STOPKEY + RANK()
Maven: how to copy files after a build into several distribution directories »
photo Sayan Malakshinov

Oracle Database Developer Choice Award winner
Oracle ACE Oracle ACE
Oracle performance tuning expert
UK / Aylesbury / Transmedia Dynamics
sayan@orasql.org

Simple Android Oracle client

Get it on Google Play

Recent Posts

  • Where does the commit or rollback happen in PL/SQL code?
  • :1 and SP2-0553: Illegal variable name “1”.
  • ORA exceptions that can’t be caught by exception handler
  • Another interesting troubleshooting case
  • SQL*Plus tips #9: Reading traces and incident files

Recent Comments

  • 255 column catalogue | Oracle Scratchpad on Intra-block row chaining optimization in 12.2
  • 255 column catalogue | Oracle Scratchpad on row pieces, 255 columns, intra-block row chaining in details
  • opt_estimate catalogue | Oracle Scratchpad on Correct syntax for the table_stats hint
  • Oracle SQL | Triaging Smart Scan on Tracing Hybrid Columnar Compression in an offload server
  • Oracle SQL | Create Quarantine on What you need to know about Offload Quarantines
  • Oracle SQL | Shining some light on Database In-Memory vs the Exadata Columnar Cache in 12.1.0.2 on Tracing Hybrid Columnar Compression in an offload server
  • Oracle SQL | Triaging Smart Scan on Using trace events with an offload server
  • Oracle SQL | More on tracing the offload server on Tracing Hybrid Columnar Compression in an offload server
  • Oracle SQL | Tracing Hybrid Columnar Compression in an offload server on Using trace events with an offload server
  • Oracle SQL | Serial Scans failing to offload on Random thoughts on block sizes

Blogroll

  • Alex Fatkulin
  • Alexander Anokhin
  • Andrey Nikolaev
  • Charles Hooper
  • Christian Antognini
  • Coskan Gundogar
  • David Fitzjarrell
  • Igor Usoltsev
  • Jonathan Lewis
  • Karl Arao
  • Mark Bobak
  • Martin Bach
  • Martin Berger
  • Neil Chandler
  • Randolf Geist
  • Richard Foote
  • Riyaj Shamsudeen
  • Tanel Poder
  • Timur Akhmadeev
  • Valentin Nikotin

Categories

Aggregated by OraNA Aggregated by OraFAQ

Meta

  • Log in
  • Entries feed
  • Comments feed
  • WordPress.org
©Sayan Malakshinov. Oracle SQL