Oracle RAC Troubleshooting — Interconnect, Voting Disk, and CRS

Oracle RAC (Real Application Clusters) distributes a single database across multiple nodes for high availability and scalability. When it works, it's invisible. When it doesn't, the failure modes are unique to RAC and require specific diagnostic approaches. This article covers the most common RAC issues — interconnect problems, voting disk failures, CRS stack failures, and instance evictions — with exact diagnostic commands and fixes.

RAC Architecture Quick Reference

Before troubleshooting, understand the layers:

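CRS (Cluster Ready Services) — the CRSD daemon that manages cluster resources (databases, listeners, VIPs, services); from 11gR2 onward it runs under OHASD (Oracle High Availability Services Daemon), which starts the rest of the stack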
OCSSD (Oracle Cluster Synchronization Services Daemon) — manages CSS, the cluster heartbeat mechanism
Voting disks — shared storage that CSS uses to break split-brain scenarios (odd number required, typically 3 or 5)
OCR (Oracle Cluster Registry) — stores cluster configuration (resource definitions, node membership)
Interconnect — the private network between nodes, used for cache fusion (block shipping between instances) and for the CSS network heartbeat

# Check overall CRS stack status
crsctl check cluster -all

# Check individual CRS components
crsctl check crs
crsctl check css
crsctl check ctss

# Check all cluster resources
crsctl status resource -t

# Check node membership
olsnodes -s -t

Interconnect Diagnostics

The interconnect is the most common source of RAC performance problems. Cache fusion — the mechanism that ships dirty blocks between nodes rather than forcing them to disk — depends entirely on low-latency interconnect.

# Check interconnect configuration
oifcfg getif

# Verify the right interface is used for cluster interconnect
# Should show cluster_interconnect, not public
oifcfg getif | grep cluster_interconnect

# Check interconnect statistics
# On Linux, check the network interface stats
netstat -i | grep <interconnect_interface>  # e.g., eth1, bond1

# In Oracle, check interconnect performance

-- Check interconnect statistics from Oracle
SELECT inst_id, name, value
FROM gv$sysstat
WHERE name IN (
  'gc cr blocks served',
  'gc cr blocks received',
  'gc current blocks served',
  'gc current blocks received',
  'gc cr block receive time',
  'gc current block receive time'
)
ORDER BY inst_id, name;

-- Average CR block receive latency per instance (should be < 1ms)
-- Receive-time statistics are in centiseconds, so multiply by 10 to get milliseconds
SELECT a.inst_id,
       ROUND(b.value * 10 / DECODE(a.value,0,1,a.value), 2) avg_cr_get_ms
FROM gv$sysstat a, gv$sysstat b
WHERE a.name = 'gc cr blocks received'
  AND b.name = 'gc cr block receive time'
  AND a.inst_id = b.inst_id
ORDER BY a.inst_id;

Interconnect latency above 1ms causes significant cache fusion overhead. Above 5ms, RAC overhead can make the cluster perform worse than a single instance.
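
The unit conversion is the easiest part of this calculation to get wrong: the gc receive-time statistics are reported in centiseconds, not milliseconds. The following is a minimal shell sketch of the arithmetic, assuming you have already pulled the two counters for one instance. The function names are hypothetical, and the 1 ms and 5 ms cut-offs are simply the thresholds quoted in this article.

```shell
#!/bin/sh
# Hypothetical helper: average global cache receive latency in milliseconds.
# $1 = blocks received, $2 = receive time in centiseconds (as stored in gv$sysstat).
avg_gc_latency_ms() {
  awk -v blocks="$1" -v cs="$2" 'BEGIN {
    if (blocks == 0) { print "0.00"; exit }   # avoid division by zero
    printf "%.2f\n", (cs * 10) / blocks       # 1 centisecond = 10 ms
  }'
}

# Classify a latency using the thresholds from this article (1 ms and 5 ms).
classify_gc_latency() {
  awk -v ms="$1" 'BEGIN {
    if (ms <= 1)      print "ok"
    else if (ms <= 5) print "degraded"
    else              print "critical"
  }'
}

# Sample counters (made-up values for illustration only)
latency=$(avg_gc_latency_ms 200000 12000)
echo "avg cr receive latency: ${latency} ms ($(classify_gc_latency "$latency"))"
```

Cumulative counters since instance startup smooth over short spikes, so in practice you would probably feed this with deltas between two samples rather than raw totals.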

Interconnect packet loss

# Ping the interconnect from each node to every other node
# Use the interconnect IP addresses specifically, not hostnames
ping -I <interconnect_interface> -c 1000 <other_node_interconnect_ip>

# Check for errors on the interconnect interface
ip -s link show <interconnect_interface>
# Look for RX errors, TX errors, dropped packets

Any packet loss on the interconnect is a critical finding. Even 0.1% loss causes significant GC (global cache) wait events.
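
To apply the zero-loss rule across many node pairs, it helps to pull the loss figure out of the ping summary rather than reading it by eye. This is a rough sketch that assumes the summary line format printed by Linux iputils ping ("... N% packet loss ..."); other platforms may word it differently. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print the packet-loss
# percentage from the summary line (assumes the Linux iputils wording).
ping_loss_pct() {
  sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Succeeds (exit 0) only when a summary line was found and the loss is exactly zero.
interconnect_loss_ok() {
  loss=$(ping_loss_pct)
  [ -n "$loss" ] && awk -v l="$loss" 'BEGIN { exit (l == 0) ? 0 : 1 }'
}

# Sample summary line (illustrative, not captured from a real cluster)
summary='1000 packets transmitted, 999 received, 0.1% packet loss, time 999ms'
if echo "$summary" | interconnect_loss_ok; then
  echo "interconnect clean"
else
  echo "packet loss detected: $(echo "$summary" | ping_loss_pct)%"
fi
```

In real use the input would come from the ping command shown above, for example by piping its output straight into interconnect_loss_ok.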

Instance Eviction Diagnostics

Node eviction (CSS forcibly removing a node from the cluster) is the most alarming RAC event. The evicted node usually reboots, although from 11.2.0.2 onward Clusterware first tries to restart only its own stack. This is jarring but correct: CSS evicts nodes to prevent split-brain data corruption.

# Find eviction reason in CSS log
grep -i "evict\|reboot" /u01/app/grid/<version>/log/<hostname>/cssd/ocssd.log | tail -50

# Check the CRSD log
tail -200 /u01/app/grid/<version>/log/<hostname>/crsd/crsd.log

# Check OS messages around the eviction time
grep -i "oracle\|evict\|reboot" /var/log/messages | grep "<date>"

Common eviction causes

Missed CSS heartbeat — the node missed too many consecutive heartbeats. CSS tracks two of them: a network heartbeat over the interconnect, which must arrive within misscount (30 seconds by default on Linux), and a disk heartbeat written to the voting disks, which must complete within disktimeout (200 seconds by default). Caused by:

I/O latency to voting disks (the disk heartbeat did not complete within disktimeout)
OS swapping (when the system swaps, CSS daemons can be delayed)
CPU starvation (CSS daemon starved of CPU)

# List voting disks and their state
crsctl query css votedisk

# Network heartbeat timeout (default 30 seconds)
crsctl get css misscount

# Voting disk I/O timeout (default 200 seconds)
# If voting disk writes take longer than this, eviction occurs
crsctl get css disktimeout

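Network partition — the interconnect failed, so nodes stopped receiving each other's network heartbeats. Once misscount expires, CSS uses the voting disks to decide which sub-cluster survives and evicts the other, normally the smaller one.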

Hung I/O — OS I/O hung waiting for storage, starving the CSS daemon.

Voting Disk Issues

# List voting disks and their status
crsctl query css votedisk

# Check voting disk accessibility
dd if=/dev/sdc of=/dev/null bs=512 count=1  # replace /dev/sdc with your voting disk

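# Replace a failed voting disk (requires a majority of voting disks to be online)
# Add the replacement first so the disk count never drops, then delete the failed one
crsctl add css votedisk <new_disk_path>
crsctl delete css votedisk <failed_disk_path>

# If the voting disks are stored in ASM, move them as a set instead
crsctl replace votedisk +<diskgroup>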

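Avoid an even number of voting disks. A node must be able to access a strict majority (more than half) of the voting disks to stay in the cluster. With 2 voting disks, losing one leaves exactly 50%, which is not a majority, so two disks tolerate no more failures than one. Use 1, 3, or 5 voting disks.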
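
The majority rule is simple integer arithmetic, and tabulating it shows why an extra even-numbered disk buys nothing. The sketch below uses hypothetical function names and only models the "more than half" rule described here; it does not query the cluster.

```shell
#!/bin/sh
# Voting disks a node must be able to access: a strict majority of those configured.
votedisk_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Voting disk failures that can be tolerated while still holding that majority.
votedisk_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Tabulate the rule for 1 to 5 configured voting disks
for n in 1 2 3 4 5; do
  echo "disks=$n quorum=$(votedisk_quorum "$n") tolerated_failures=$(votedisk_tolerated_failures "$n")"
done
```

Going from 1 to 2 disks, or from 3 to 4, raises the quorum without raising the number of failures tolerated, which is why the odd counts are the useful ones.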

OCR Issues

# Check OCR integrity
ocrcheck

# Backup OCR manually (Oracle auto-backs up every 4 hours)
ocrconfig -manualbackup

# List OCR backups
ocrconfig -showbackup

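# Restore OCR from backup (run as root; Clusterware must be down on every node)
crsctl stop crs      # run on each node
ocrconfig -restore <backup_file>
crsctl start crs     # run on each node
# If OCR is stored in ASM, the restore needs ASM up: start one node with
# "crsctl start crs -excl -nocrs" before ocrconfig -restore, then restart normally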

CRS Stack Startup Issues

# Start CRS stack manually
crsctl start crs

# If CRS won't start, check the log
tail -200 /u01/app/grid/<version>/log/<hostname>/crsd/crsd.log
tail -200 /u01/app/grid/<version>/log/<hostname>/cssd/ocssd.log

# Check and start individual lower-stack components (11gR2 and later)
crsctl status resource -t -init
crsctl start resource ora.cssd -init

ORA-29702: error in Cluster Group Service operation

Usually means OCSSD isn't running or can't reach voting disks:

# Check OCSSD and start it through Clusterware (do not launch ocssd.bin by hand)
crsctl check css
crsctl start resource ora.cssd -init

# Check if OCSSD can access voting disks
ls -la /dev/sd*  # check raw devices if using raw voting disks

RAC Wait Events

RAC introduces specific wait events that don't exist in single-instance:

-- Top RAC-specific wait events per instance (time_waited and average_wait are in centiseconds)
SELECT inst_id, event, total_waits, time_waited, average_wait
FROM gv$system_event
WHERE event LIKE 'gc%'
  OR event LIKE 'global cache%'
ORDER BY time_waited DESC
FETCH FIRST 15 ROWS ONLY;

-- Find hot blocks being shipped between instances (block contention)
SELECT p1 file#, p2 block#, COUNT(*) waits
FROM gv$session_wait
WHERE event IN ('gc buffer busy acquire', 'gc buffer busy release')
GROUP BY p1, p2
ORDER BY waits DESC;

-- Identify the object for a hot block
SELECT owner, segment_name, segment_type
FROM dba_extents
WHERE file_id = <file#>
  AND <block#> BETWEEN block_id AND block_id + blocks - 1;

gc buffer busy — multiple sessions on different nodes requesting the same block simultaneously. Usually indicates a hot table or index that needs partitioning or application-level redesign (sequence caching, for example).

gc cr block lost — blocks shipped over the interconnect are being lost and have to be requested again. This points to packet loss or buffer overflows on the interconnect rather than an application problem.

Sequence Cache and Insert Contention

In RAC, sequences without adequate caching cause significant gc waits because each instance must coordinate with others for the next sequence value:

-- Find sequences with small cache (contention risk in RAC)
SELECT sequence_owner, sequence_name, cache_size, last_number
FROM dba_sequences
WHERE cache_size < 100
  AND sequence_owner NOT IN ('SYS','SYSTEM')
ORDER BY cache_size;

-- Increase cache size
ALTER SEQUENCE apps.fnd_concurrent_requests_s CACHE 1000;

In EBS environments, many FND sequences ship with small cache sizes tuned for single-instance. Increase cache to 1000+ for RAC.
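
With dozens of small-cache sequences, writing the ALTER statements by hand gets tedious. One possible approach, sketched here under assumptions, is to spool the query output as whitespace-separated "OWNER SEQUENCE_NAME CACHE_SIZE" rows and generate the DDL from that. The function name and the input layout are hypothetical, and the generated statements should be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: read "OWNER SEQUENCE_NAME CACHE_SIZE" rows on stdin and
# print an ALTER SEQUENCE statement for every sequence whose cache is below
# the target size ($1, default 1000). It only prints DDL; it executes nothing.
gen_seq_cache_ddl() {
  target="${1:-1000}"
  awk -v target="$target" 'NF >= 3 && $3 + 0 < target {
    printf "ALTER SEQUENCE %s.%s CACHE %d;\n", $1, $2, target
  }'
}

# Sample rows (illustrative values, not taken from a real system)
printf '%s\n' \
  'APPS FND_CONCURRENT_REQUESTS_S 20' \
  'APPS SOME_WELL_CACHED_SEQ 5000' |
  gen_seq_cache_ddl 1000
```

Sequences that the application expects to be gap-free or strictly ordered need a closer look before their cache is raised, since a larger cache means larger gaps after an instance restart.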

Log Files Reference

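# Clusterware paths below follow the 11gR2 layout under the Grid home; 12c and later write
# these traces to $ORACLE_BASE/diag/crs/<hostname>/crs/trace/ (ocssd.trc, crsd.trc, evmd.trc)

# CRS daemon log
/u01/app/grid/<version>/log/<hostname>/crsd/crsd.log

# CSS (heartbeat) log
/u01/app/grid/<version>/log/<hostname>/cssd/ocssd.log

# EVMD (event manager) log
/u01/app/grid/<version>/log/<hostname>/evmd/evmd.log

# Oracle ASM alert log (ADR, under the Grid owner's ORACLE_BASE)
$ORACLE_BASE/diag/asm/+asm/+ASM<N>/trace/alert_+ASM<N>.log

# Database alert log (ADR, under the database owner's ORACLE_BASE)
$ORACLE_BASE/diag/rdbms/<db_name>/<sid>/trace/alert_<sid>.log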

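# Collect all cluster logs for a time range (Trace File Analyzer, 12c and later)
tfactl diagcollect -from "2026-07-01 09:00:00" -to "2026-07-01 11:00:00"

# On older releases without TFA, collect the Clusterware logs with diagcollection.pl
diagcollection.pl --collect --crshome $GRID_HOME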

Quick Diagnostic Sequence for Any RAC Issue

crsctl check cluster -all — is CRS healthy on all nodes?
crsctl status resource -t — which resources are offline?
olsnodes -s -t — are all nodes still members?
Check alert logs on all nodes around the time of the issue
Check ocssd.log for any eviction or heartbeat messages
Check interconnect latency with ping between nodes
Check voting disk accessibility
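
The command-based steps in this sequence can be wrapped in a small script so that the first pass is always run the same way. The following is a minimal sketch: run_check and rac_triage are hypothetical names, and the log and ping steps are left out because their paths and addresses are site-specific. A step whose command is missing is reported as skipped rather than failing the whole run.

```shell
#!/bin/sh
# Run one diagnostic step: print a header, then the command's output,
# or a "skipped" note when the command is not on PATH.
run_check() {
  label="$1"; shift
  echo "== $label =="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exit status $?)"
  else
    echo "skipped: $1 not found on PATH"
  fi
}

# First-pass triage using the commands from this article, in the same order.
rac_triage() {
  run_check "CRS health on all nodes" crsctl check cluster -all
  run_check "Resource state"          crsctl status resource -t
  run_check "Node membership"         olsnodes -s -t
  run_check "Voting disks"            crsctl query css votedisk
}

rac_triage
```

Run it as the Grid Infrastructure owner or root with the Grid home's bin directory on PATH, and redirect the output to a timestamped file so it can be attached to an incident record.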

TuneVault's health checks include RAC-specific diagnostics — interconnect latency, GC wait ratios, voting disk status, and sequence cache analysis — all surfaced automatically in the health check report.