I noted in my first post that using the highest level of tracing caused timeout issues with the offload server heartbeat monitor. Heartbeat issues can also occur with expensive (and badly formed) regular expressions. By default the heartbeat monitor is set to 6 seconds, which is the maximum time permitted to process 1MB of data in the offload server and mark the task completed, and is far more time than that work should reasonably be expected to take.
Operations such as expensive tracing to disk or badly formed regular expressions that cause that time period to be exceeded lead to entries like this in the alert log:
State dump signal delivered to CELLOFLSRV<10180> by pid - 9860, uid - 3318
Thu Mar 5 12:26:31 2015 561 msec
State dump completed for CELLOFLSRV<10180>
Clean shutdown signal delivered to CELLOFLSRV<10180> by pid - 9860, uid - 3318
CELLOFLSRV <10180> is exiting with code 1
where the restart server bounces the offload server to clear the perceived hang. Increasing the timeout via:
CellCLI> alter cell events = "immediate cellsrv.cellsrv_setparam('_cell_oflsrv_heartbeat_timeout_sec','60')"
enables the tracing to proceed without causing the restart server to bounce the offload server.
My point in writing this entry was to provide a work-around for when tracing is needed, but also to address a couple of blog posts I’d seen that recommend leaving it set at 60 or 90 seconds. This is not a good idea. The heartbeat exists to catch genuine but rare issues, and leaving it set to an increased value will delay the offload server being restarted so it can resume work. This is one parameter that should be reset to the default when the work-around is no longer needed, unless otherwise directed by support.
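As a sketch of that reset, assuming the same cellsrv_setparam call used to raise the timeout also accepts the 6-second default mentioned earlier (that is my assumption, so confirm with support if in doubt), putting the value back would look like:
CellCLI> alter cell events = "immediate cellsrv.cellsrv_setparam('_cell_oflsrv_heartbeat_timeout_sec','6')"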