The Scenario

It's 09:47 on a Tuesday. Your monitoring fires. The EBS production server is at load average 120. There are 247 active Oracle sessions. Users are calling saying nothing works. Your manager is calling. Your manager's manager is about to call.

This is the moment where having a practiced triage methodology is the difference between a 45-minute incident and a 4-hour one.

First 60 Seconds: OS-Level Triage

Do not log into Oracle yet. Start at the OS.

# Load average: how many processes are runnable or waiting for I/O?
uptime

What is actually consuming CPU?

top -b -n 1 | head -30

Is this CPU or I/O pressure?

vmstat 2 5

If wa is high, identify the I/O pattern

iostat -x 2 5

Memory: are we swapping?

free -h

High CPU, low I/O wait: Bad SQL plans, parsing storm, latch contention.
High I/O wait: Full table scan on large table, UNDO or TEMP I/O.
Swapping: Oracle SGA/PGA sized too large for available RAM.

Oracle Session Analysis

-- How many sessions? What state are they in?
SELECT STATUS, COUNT(*) FROM V$SESSION GROUP BY STATUS;

-- Top 25 active sessions — most waited first
SELECT s.sid, s.serial#, s.username, s.status,
s.event, s.wait_class, s.seconds_in_wait, s.sql_id, s.blocking_session
FROM V$SESSION s
WHERE s.status = 'ACTIVE' AND s.username IS NOT NULL
ORDER BY s.seconds_in_wait DESC
FETCH FIRST 25 ROWS ONLY;

The wait_class column is your first filter:


Finding the Blocking Chain

-- Find blocker and all blocked sessions
SELECT
    l1.sid AS blocker_sid, s1.username AS blocker_user,
    s1.sql_id AS blocker_sql, s1.seconds_in_wait AS blocker_secs,
    l2.sid AS blocked_sid, s2.username AS blocked_user
FROM V$LOCK l1
    JOIN V$LOCK l2 ON l1.id1 = l2.id1 AND l1.id2 = l2.id2
    JOIN V$SESSION s1 ON l1.sid = s1.sid
    JOIN V$SESSION s2 ON l2.sid = s2.sid
WHERE l1.block = 1 AND l2.request > 0
ORDER BY blocker_secs DESC;

Finding Top CPU Consumers

-- Top 15 SQL statements by CPU in the shared pool
SELECT sql_id,
       ROUND(cpu_time/1000000, 1) cpu_secs,
       ROUND(elapsed_time/1000000, 1) ela_secs,
       executions,
       SUBSTR(sql_text, 1, 80) sql_preview
FROM V$SQL
WHERE executions > 0
ORDER BY cpu_time DESC
FETCH FIRST 15 ROWS ONLY;

ASH: The Crisis Investigator

-- What have sessions been waiting on in the last 10 minutes?
SELECT event, wait_class, COUNT(*) samples,
       ROUND(COUNT() / SUM(COUNT()) OVER () * 100, 1) pct
FROM V$ACTIVE_SESSION_HISTORY
WHERE sample_time > SYSDATE - 10/1440 AND session_type = 'FOREGROUND'
GROUP BY event, wait_class
ORDER BY samples DESC
FETCH FIRST 15 ROWS ONLY;

Emergency Interventions

Kill a Blocking Session

ALTER SYSTEM KILL SESSION '123,4567' IMMEDIATE;

Terminate a Runaway SQL

ALTER SYSTEM CANCEL SQL 'SID=123, SERIAL=4567, SQL_ID=abc123def';

As a Last Resort: Flush Shared Pool

-- Do NOT do this without understanding the impact
ALTER SYSTEM FLUSH SHARED_POOL;

Root Cause Analysis After the Crisis

-- Find AWR snapshot IDs covering the incident
SELECT snap_id, TO_CHAR(begin_interval_time,'YYYY-MM-DD HH24:MI') snap_time
FROM DBA_HIST_SNAPSHOT
WHERE begin_interval_time > SYSDATE - 4/24
ORDER BY snap_id;

-- Check optimizer statistics freshness
SELECT owner, table_name, last_analyzed,
ROUND((SYSDATE - last_analyzed)*24,1) hours_since_analyze
FROM DBA_TABLES
WHERE owner IN ('APPS','APPLSYS') AND last_analyzed < SYSDATE - 7
ORDER BY hours_since_analyze DESC
FETCH FIRST 20 ROWS ONLY;

Prevention

  1. Automated statistics jobs — Run FND_STATS on a schedule.
  2. SQL Plan Baselines — Pin correct plans: DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE.
  3. Resource Manager — Cap runaway queries.
  4. Redo log sizing — Less than one switch every 15–20 minutes.
  5. Monitoring thresholds — Alert on active session count, CPU%, log switch frequency.