Oracle Database Performance Crisis: Triage, Analysis, and Resolution

The Scenario

It's 09:47 on a Tuesday. Your monitoring fires. The EBS production server is at load average 120. There are 247 active Oracle sessions. Users are calling saying nothing works. Your manager is calling. Your manager's manager is about to call.

This is the moment where having a practiced triage methodology is the difference between a 45-minute incident and a 4-hour one.

I have been in this exact situation more than once. Here is the methodology I follow, in order, every time.

First 60 Seconds: OS-Level Triage

Do not log into Oracle yet. Start at the OS.

# Load average: how many processes are runnable or waiting for I/O?
uptime
# 09:47:13 up 312 days, load average: 121.34, 98.22, 75.61

# What is actually consuming CPU?
top -b -n 1 | head -30

# Is this CPU or I/O pressure? vmstat tells you
vmstat 2 5
# procs: r=runnable, b=blocked waiting for I/O
# us=userspace CPU, sy=kernel CPU, wa=I/O wait
# A high 'wa' means I/O pressure. High 'us' means CPU contention.

# If wa is high, identify the I/O pattern
iostat -x 2 5
# %util near 100% on a device = that device is saturated and is the likely bottleneck

# Memory: are we swapping?
free -h
grep -E "MemAvailable|SwapFree|Cached" /proc/meminfo

From these commands you will know within 60 seconds whether you are looking at a CPU problem, an I/O problem, or memory pressure. This determines your next move.

High CPU, low I/O wait: Bad SQL plans, parsing storm, latch contention, or a runaway background job.

High I/O wait: Full table scans on large tables, heavy UNDO or TEMP I/O, or insufficient I/O bandwidth.

Swapping: Oracle SGA/PGA sized too large for available RAM, or a memory leak in a non-Oracle process.
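
The decision above can be expressed as a small shell helper. This is a rough sketch: classify_pressure is a hypothetical function name, and the thresholds (20% I/O wait, 70% combined user and system CPU) are illustrative assumptions rather than fixed rules, so tune them to your own baseline.

```shell
# classify_pressure US SY WA SI SO
# Arguments are integer values taken from one vmstat sample line.
# Thresholds are illustrative assumptions, not Oracle guidance.
classify_pressure() {
    us=$1; sy=$2; wa=$3; si=$4; so=$5
    if [ $((si + so)) -gt 0 ]; then
        echo "memory"      # pages are moving to or from swap
    elif [ "$wa" -ge 20 ]; then
        echo "io"          # CPUs are mostly idle waiting on I/O
    elif [ $((us + sy)) -ge 70 ]; then
        echo "cpu"         # CPUs are busy running code
    else
        echo "unclear"     # nothing obvious at the OS level
    fi
}

classify_pressure 85 10 2 0 0     # prints cpu
classify_pressure 10 5 60 0 0     # prints io
```

On Linux with procps vmstat, the us, sy, wa, si and so values are typically columns 13, 14, 16, 7 and 8 of a sample line; check the header on your platform before wiring this to live output.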

Oracle Session Analysis

With the OS picture clear, connect to Oracle immediately:

-- How many sessions? What state are they in?
SELECT STATUS, COUNT(*) FROM V$SESSION GROUP BY STATUS;

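-- Top 25 active sessions, longest wait first
-- STATE = 'WAITING' means the event is current; any other state means the session is on CPU
SELECT s.sid, s.serial#, s.username, s.status,
       s.state, s.event, s.wait_class,
       s.seconds_in_wait,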
       s.sql_id,
       s.blocking_session,
       s.module,
       s.action
FROM   V$SESSION s
WHERE  s.status = 'ACTIVE'
  AND  s.username IS NOT NULL
ORDER BY s.seconds_in_wait DESC
FETCH FIRST 25 ROWS ONLY;

The wait_class column is your first filter:

Concurrency — latch or library cache contention (parsing storm, hot block)
User I/O — SQL causing heavy reads
Application — lock waits (someone has a row-level or table-level lock)
System I/O — control file or redo log I/O
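
There is no CPU wait class. A session that is running on CPU (or queued for it) shows a STATE other than WAITING in V$SESSION, while its event and wait_class columns still hold the last wait it completed. ASH records the same condition as session_state = 'ON CPU'.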

If you see 200 sessions all waiting on the same event — say, enq: TX - row lock contention — you have a blocking chain. If they're all on different events you have a broader load problem.
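
One quick way to tell the two cases apart is to measure how much of the active session list a single event accounts for. The sketch below assumes you have spooled the event column of the query above to a file, one event name per line; dominant_event is a hypothetical helper name.

```shell
# dominant_event: reads one wait event name per line on stdin and prints
# "<percent> <event>" for the most frequent event (percent is truncated).
dominant_event() {
    awk 'NF { n[$0]++; total++ }
         END {
             if (total == 0) exit 1          # no input: nothing to report
             for (e in n) if (n[e] > max) { max = n[e]; top = e }
             printf "%d %s\n", (100 * max) / total, top
         }'
}

# Intended use (events.txt is a hypothetical spool file):
#   dominant_event < events.txt
```

A top share around 90 for enq: TX - row lock contention points to a blocking chain; a top share of 15 or 20 suggests broader load. Where to draw that line is a judgment call.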

Finding the Blocking Chain

Row-level lock contention is one of the most common causes of cascading waits during a crisis:

-- Find each blocker and the sessions it blocks
SELECT
    l1.sid AS blocker_sid,
    s1.serial# AS blocker_serial,   -- SERIAL# lives in V$SESSION, not V$LOCK
    s1.username AS blocker_user,
    NVL(s1.sql_id, s1.prev_sql_id) AS blocker_sql,  -- an idle blocker has no current SQL
    s1.event AS blocker_wait,
    s1.seconds_in_wait AS blocker_secs,
    l2.sid AS blocked_sid,
    s2.username AS blocked_user
FROM
    V$LOCK l1
    JOIN V$LOCK l2 ON l1.type = l2.type AND l1.id1 = l2.id1 AND l1.id2 = l2.id2
    JOIN V$SESSION s1 ON l1.sid = s1.sid
    JOIN V$SESSION s2 ON l2.sid = s2.sid
WHERE l1.block = 1
  AND l2.request > 0
ORDER BY blocker_secs DESC;

-- How many sessions does each blocker hold up directly?
SELECT blocking_session, COUNT(*) blocked_sessions
FROM V$SESSION
WHERE blocking_session IS NOT NULL
GROUP BY blocking_session
ORDER BY 2 DESC;

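If one session is blocking 180 others, that session is your root cause. Get its SQL and decide: is it a legitimate long-running transaction, or is it a stuck or hung process? A blocker that is sitting idle inside an open transaction has no current sql_id; in that case look at prev_sql_id in V$SESSION instead.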

-- Get the actual SQL being executed by the blocker
SELECT sq.sql_text, sq.executions, sq.cpu_time/1000000 cpu_s,
       sq.elapsed_time/1000000 ela_s, sq.buffer_gets, sq.disk_reads
FROM V$SQL sq
WHERE sq.sql_id = '&blocker_sql_id';
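
Blockers can themselves be blocked, so the per-session count from V$SESSION covers only direct waiters. As a rough sketch, assuming you have spooled sid and blocking_session pairs (one blocked session per line) to a file, the hypothetical root_blockers helper below walks each chain up to the session that is not waiting on anyone and totals the sessions stuck behind it.

```shell
# root_blockers: stdin has "<sid> <blocking_sid>" per blocked session.
# Prints "<root_sid> <sessions blocked directly or indirectly>", busiest first.
root_blockers() {
    awk 'NF == 2 { parent[$1] = $2 }
         END {
             for (s in parent) {
                 r = parent[s]; hops = 0
                 # walk up until we reach a session that is not itself blocked;
                 # the hop limit guards against a cycle in the input
                 while ((r in parent) && hops++ < 1000) r = parent[r]
                 count[r]++
             }
             for (r in count) print r, count[r]
         }' | sort -k2,2nr -k1,1n
}

# Intended use (pairs.txt is a hypothetical spool file):
#   root_blockers < pairs.txt
```

With the input lines "20 10", "30 20" and "40 10", session 10 is reported as the root of all three waiters even though session 30 is directly blocked by session 20.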

Finding Top CPU Consumers

If it is not a blocking chain but overall CPU pressure:

-- Top 15 SQL statements by cumulative CPU since each cursor was loaded into the shared pool
SELECT sql_id,
       ROUND(cpu_time/1000000, 1) cpu_secs,
       ROUND(elapsed_time/1000000, 1) ela_secs,
       executions,
       ROUND(cpu_time/NULLIF(executions,0)/1000000, 3) avg_cpu_per_exec,
       buffer_gets,
       disk_reads,
       SUBSTR(sql_text, 1, 80) sql_preview
FROM V$SQL
WHERE executions > 0
ORDER BY cpu_time DESC
FETCH FIRST 15 ROWS ONLY;

-- Also check for high-parse statements (soft or hard parse storms)
SELECT sql_id, parse_calls, executions,
       ROUND(parse_calls/NULLIF(executions,0)*100,1) parse_ratio_pct,
       SUBSTR(sql_text, 1, 80) sql_preview
FROM V$SQL
WHERE parse_calls > 100
ORDER BY parse_calls DESC
FETCH FIRST 15 ROWS ONLY;

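A parse_ratio_pct close to 100% on frequently executed SQL means the application issues a parse call for almost every execution, typically because it closes and reopens cursors instead of reusing them, or because the cursor keeps aging out of the shared pool and is re-parsed. Missing bind variables look different: each literal value produces its own sql_id with only a handful of executions, so look for many statements in V$SQL that share one FORCE_MATCHING_SIGNATURE.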

ASH: The Crisis Investigator

Active Session History (ASH) is invaluable during a performance crisis because it samples every active session once per second into an in-memory buffer, so recent activity is available immediately without waiting for an AWR snapshot:

-- What have sessions been waiting on in the last 10 minutes?
SELECT NVL(event, 'ON CPU') event, wait_class, COUNT(*) samples,
       ROUND(COUNT(*) / 600, 1) avg_active_sessions,  -- one sample per second over 600 seconds
       ROUND(COUNT(*) / SUM(COUNT(*)) OVER () * 100, 1) pct
FROM V$ACTIVE_SESSION_HISTORY
WHERE sample_time > SYSDATE - 10/1440
  AND session_type = 'FOREGROUND'
GROUP BY event, wait_class
ORDER BY samples DESC
FETCH FIRST 15 ROWS ONLY;

-- Top SQL in ASH for the last 10 minutes
SELECT sql_id, COUNT(*) ash_samples,
       ROUND(COUNT(*) / SUM(COUNT(*)) OVER () * 100, 1) pct_activity,
       MIN(TO_CHAR(sample_time, 'HH24:MI:SS')) first_seen,
       MAX(TO_CHAR(sample_time, 'HH24:MI:SS')) last_seen
FROM V$ACTIVE_SESSION_HISTORY
WHERE sample_time > SYSDATE - 10/1440
  AND sql_id IS NOT NULL
  AND session_type = 'FOREGROUND'
GROUP BY sql_id
ORDER BY ash_samples DESC
FETCH FIRST 10 ROWS ONLY;

Both V$ACTIVE_SESSION_HISTORY and DBA_HIST_ACTIVE_SESS_HISTORY require the Diagnostics Pack license, which is available only on Enterprise Edition. Confirm that you are licensed before querying either view.
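
The average-active-sessions arithmetic is worth being able to do by hand: it is the number of ASH samples, times the seconds each sample represents, divided by the length of the window. A minimal sketch, assuming one sample per second for V$ACTIVE_SESSION_HISTORY and one retained sample in ten for DBA_HIST_ACTIVE_SESS_HISTORY; avg_active_sessions is a hypothetical helper name.

```shell
# avg_active_sessions SAMPLES WINDOW_SECONDS [SECONDS_PER_SAMPLE]
# SECONDS_PER_SAMPLE defaults to 1 (in-memory ASH); pass 10 for DBA_HIST data.
avg_active_sessions() {
    awk -v n="$1" -v w="$2" -v i="${3:-1}" \
        'BEGIN { printf "%.1f\n", n * i / w }'
}

avg_active_sessions 72000 600      # prints 120.0
avg_active_sessions 7200 600 10    # prints 120.0
```

An average well above the number of CPU cores on the host usually means sessions are queuing rather than working, which matches the load picture from the OS triage.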

Latch Contention

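Latch contention shows up as "latch: ..." wait events and usually indicates either a very hot block (a segment header or common index block getting massive concurrent access) or library cache contention from a parsing storm. On 11g and later, library cache contention mostly appears as mutex waits such as library cache: mutex X or cursor: pin S rather than as latch waits.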

-- Check latch hit ratios (a hit is a get that did not miss)
SELECT name, misses, sleeps, ROUND((1 - (misses/NULLIF(gets,0)))*100, 2) hit_ratio_pct
FROM V$LATCH
WHERE gets > 0
  AND sleeps > 1000
ORDER BY sleeps DESC
FETCH FIRST 10 ROWS ONLY;

-- Find hot blocks (high buffer busy waits)
-- Join current_obj# to DBA_OBJECTS.OBJECT_ID to name the segment
SELECT current_obj#, current_file#, current_block#,
       COUNT(*) waits
FROM V$ACTIVE_SESSION_HISTORY
WHERE event = 'buffer busy waits'
  AND sample_time > SYSDATE - 5/1440
GROUP BY current_obj#, current_file#, current_block#
ORDER BY waits DESC;

Redo Log Issues

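Excessive log switch frequency causes "log file switch" waits (checkpoint incomplete or archiving needed) when LGWR has to reuse a log group that DBWR or the archiver has not finished with, and every switch triggers checkpoint work that adds to the I/O load: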

-- Log switch frequency per hour over the last 24 hours
SELECT TO_CHAR(FIRST_TIME,'YYYY-MM-DD HH24') hour_bucket,
       COUNT(*) switches
FROM V$LOG_HISTORY
WHERE FIRST_TIME > SYSDATE - 1
GROUP BY TO_CHAR(FIRST_TIME,'YYYY-MM-DD HH24')
ORDER BY 1;

-- Current redo log sizing
SELECT l.group#, l.members, l.status, lf.member filename,
       ROUND(l.bytes/1024/1024) size_mb
FROM V$LOG l JOIN V$LOGFILE lf ON l.group# = lf.group#
ORDER BY l.group#;

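If you're seeing 100+ log switches per hour, your redo logs are too small. The lasting fix is larger redo logs; online logs cannot be resized in place, so add new, larger groups and drop the old ones once they are INACTIVE. The emergency fix during a crisis is to force a checkpoint so that ACTIVE log groups become reusable sooner.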

-- Do not do this lightly in production — it causes a brief I/O spike
ALTER SYSTEM CHECKPOINT;
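
A back-of-the-envelope way to pick the new size: if switches are driven purely by the logs filling up, the required size scales linearly with the switch rate. The sketch below makes that assumption (it ignores forced switches from ARCHIVE_LAG_TARGET or manual ALTER SYSTEM SWITCH LOGFILE) and defaults to a target of four switches per hour, i.e. one every 15 minutes; redo_size_mb is a hypothetical helper name.

```shell
# redo_size_mb CURRENT_MB SWITCHES_PER_HOUR [TARGET_SWITCHES_PER_HOUR]
# Prints the log size in MB that would bring the switch rate down to the
# target, assuming redo volume stays the same. Rounds up to a whole MB.
redo_size_mb() {
    current_mb=$1
    observed=$2
    target=${3:-4}
    echo $(( (current_mb * observed + target - 1) / target ))
}

redo_size_mb 512 100     # prints 12800
redo_size_mb 1024 30 3   # prints 10240
```

Treat the result as a starting point and size against peak redo generation, not the quietest hour of the day.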

Emergency Interventions

In a production crisis, you may need to take action while still investigating:

Kill a Blocking Session

-- Kill the top blocker (get SID and SERIAL# from the blocking query above)
ALTER SYSTEM KILL SESSION '123,4567' IMMEDIATE;

-- If that doesn't work immediately (session holds resources)
-- Find the OS process ID
SELECT p.spid OS_PID FROM V$SESSION s JOIN V$PROCESS p ON s.paddr = p.addr
WHERE s.sid = 123;

-- Kill at OS level (run from the shell; only for a dedicated server process, never an ora_* background process)
kill -9 <spid>
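
Before sending kill -9, it is worth confirming that the PID really is a client server process. Killing a background process such as PMON or LGWR takes the whole instance down. The sketch below relies on the usual Unix naming convention, where dedicated server processes appear as oracleSID (LOCAL=NO) or oracleSID (DESCRIPTION=...) and background processes as ora_name_SID; is_foreground_server is a hypothetical helper name, and the convention should be checked on your platform.

```shell
# is_foreground_server "CMDLINE"
# Succeeds only when the command line looks like a dedicated server process.
is_foreground_server() {
    case "$1" in
        ora_*|asm_*)               return 1 ;;  # background process: do not kill
        oracle*"(LOCAL="*)         return 0 ;;  # e.g. oracleORCL (LOCAL=NO)
        oracle*"(DESCRIPTION="*)   return 0 ;;  # local bequeath connection
        *)                         return 1 ;;  # anything else: leave it alone
    esac
}

# Intended use (spid comes from the V$PROCESS query):
#   is_foreground_server "$(ps -o args= -p "$spid")" && kill -9 "$spid"
```

The helper defaults to refusing, so an unrecognised command line never results in a kill.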

Terminate a Runaway SQL

-- Cancel a specific SQL without killing the session (Oracle 18c and later)
-- Format: 'SID, SERIAL#[, @INST_ID][, SQL_ID]'
ALTER SYSTEM CANCEL SQL '123, 4567, abc123def';

Resource Manager Throttle

If you need to limit a runaway user or module without killing sessions:

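-- Temporarily cap CPU for a specific consumer group.
-- Directive changes must go through a pending area; the plan and group names are examples.
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA;
  DBMS_RESOURCE_MANAGER.UPDATE_PLAN_DIRECTIVE(
    plan                  => 'DEFAULT_PLAN',
    group_or_subplan      => 'LOW_GROUP',
    new_utilization_limit => 5);  -- hard cap, as a percentage of CPU
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA;
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA;
END;
/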

As a Last Resort: Flush Shared Pool

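Flushing the shared pool forces re-parsing of all SQL, which can clear a bad plan but will spike CPU briefly as everything re-parses. Use it only if you have confirmed library cache corruption or a bad plan you cannot remove any other way. For a single statement, purging just that cursor with DBMS_SHARED_POOL.PURGE is far less disruptive. If you do need the full flush: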

-- Do NOT do this without understanding the impact
-- It will cause a brief but significant CPU spike as all SQL re-parses
ALTER SYSTEM FLUSH SHARED_POOL;

Root Cause Analysis After the Crisis

Once load returns to normal, investigate systematically:

Review AWR for the Crisis Window

-- Find AWR snapshot IDs covering the incident
SELECT snap_id, TO_CHAR(begin_interval_time,'YYYY-MM-DD HH24:MI') snap_time
FROM DBA_HIST_SNAPSHOT
WHERE begin_interval_time > SYSDATE - 4/24
ORDER BY snap_id;

-- Generate AWR report covering the incident window
-- From SQL*Plus:
-- @?/rdbms/admin/awrrpt.sql
-- Or use OEM/Grid Control

Check for Bad Plan Capture

-- Which plans were first captured in the last 24 hours?
-- A new plan_hash_value for an existing sql_id indicates a plan change
SELECT sql_id, plan_hash_value, timestamp, operation, options
FROM DBA_HIST_SQL_PLAN
WHERE timestamp > SYSDATE - 1
  AND id = 0  -- root node of the plan
ORDER BY timestamp DESC
FETCH FIRST 20 ROWS ONLY;

Check Optimizer Statistics Freshness

Stale statistics are a common cause of a sudden bad plan:

SELECT owner, table_name, last_analyzed,
       ROUND((SYSDATE - last_analyzed)*24,1) hours_since_analyze,
       num_rows
FROM DBA_TABLES
WHERE owner IN ('APPS','APPLSYS')
  AND last_analyzed < SYSDATE - 7
  AND num_rows > 100000
ORDER BY hours_since_analyze DESC
FETCH FIRST 20 ROWS ONLY;

Prevention: What to Put in Place

After an incident like this, the work is building systems that either prevent recurrence or reduce detection time:

Automated statistics jobs: EBS ships with FND_STATS, but many sites disable or misconfigure it. Schedule the Gather Schema Statistics concurrent program (FND_STATS.GATHER_SCHEMA_STATISTICS) rather than relying on ad hoc runs.

SQL Plan Baselines: After identifying the correct plans for your critical SQL, pin them with DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE.

Resource Manager: Configure consumer groups to cap runaway queries before they saturate the box.

Redo log sizing: Size logs so that they switch no more than once every 15–20 minutes under normal load.

Monitoring thresholds: Set alerts for active session count, CPU%, and log switch frequency, not just disk space.
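
As a minimal sketch of such a check, the hypothetical over_threshold helper below compares a sampled value against a limit and emits an alert line. The metric names and limits in the example calls are placeholders, not recommended values.

```shell
# over_threshold NAME VALUE LIMIT
# Prints an alert line and returns 0 when VALUE exceeds LIMIT; otherwise
# prints nothing and returns 1.
over_threshold() {
    awk -v name="$1" -v v="$2" -v lim="$3" 'BEGIN {
        if (v + 0 > lim + 0) {
            printf "ALERT %s=%s limit=%s\n", name, v, lim
            exit 0
        }
        exit 1
    }'
}

over_threshold active_sessions 247 150     # prints: ALERT active_sessions=247 limit=150
over_threshold log_switches_per_hour 3 4 || true   # below the limit: prints nothing
```

Run it from cron or your monitoring agent, feeding it values from uptime, the V$SESSION status count, and the V$LOG_HISTORY query.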

TuneVault and Performance Visibility

TuneVault's performance checks surface the conditions that lead to this kind of crisis before it happens: stale optimizer statistics, top-CPU SQL, blocking session history, redo log sizing, and wait event trends. After an incident, running a health check gives you a structured view of what was already degraded heading into the crisis. The goal is never to debug live — it is to know which levers need pulling before load spikes.