
Using Snapshot Standby for DR drills

Snapshot Standby Database functionality (included with the Enterprise Edition license) can be used for yearly DR drills on 11g/12c databases. Here are the technical steps we used.
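If the Data Guard broker manages the configuration, the conversion in each direction can also be done with a single DGMGRL command; a quick sketch (the database name here is illustrative, use the db_unique_name of your standby):

DGMGRL> convert database 'NASCXTHZ_STBY' to snapshot standby;
DGMGRL> convert database 'NASCXTHZ_STBY' to physical standby;

The steps below use plain SQL*Plus, for configurations without the broker.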

Convert the Physical Standby Database to a Snapshot Standby database.

First, defer the log_archive_dest_state_2 parameter for the standby destination (run this on the primary):

SQL> alter system set log_archive_dest_state_2=DEFER scope=both sid='*';

$ sqlplus / as sysdba

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;  (make sure MRP is stopped)
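To confirm the MRP has stopped before continuing, check v$managed_standby on the standby; a quick sketch (no MRP0 row should remain):

SQL> select process, status, sequence# from v$managed_standby;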

SQL> select flashback_on from v$database;

SQL> alter database flashback on;

Database altered.

SQL> select flashback_on from v$database;

FLASHBACK_ON
------------------
YES
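Flashback logs are written to the fast recovery area, so confirm the FRA is configured with enough headroom before the drill; a hedged check (output will vary):

SQL> show parameter db_recovery_file_dest

SQL> select name, space_limit/1024/1024 limit_mb, space_used/1024/1024 used_mb from v$recovery_file_dest;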

Activate the Physical Standby Database

$ sqlplus / as sysdba

SQL> select database_role,name,controlfile_type from v$database;

DATABASE_ROLE    NAME      CONTROL
---------------- --------- -------
PHYSICAL STANDBY NASCXTHZ   STANDBY

SQL> alter database convert to snapshot standby; (converts the physical standby database to a snapshot standby)

SQL> alter database open;
Database altered.

SQL> select database_role,name,controlfile_type,flashback_on from v$database;

DATABASE_ROLE    NAME      CONTROL FLASHBACK_ON
---------------- --------- ------- ------------------
SNAPSHOT STANDBY NASCXTHZ   CURRENT RESTORE POINT ONLY

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ WRITE

SQL> set pages 200 lines 200
SQL> select * from v$restore_point;
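The conversion implicitly creates a guaranteed restore point (named like SNAPSHOT_STANDBY_REQUIRED_<timestamp>), which is what Oracle flashes back to when the snapshot standby is converted back. A more targeted version of the same check:

SQL> select name, guarantee_flashback_database, time from v$restore_point;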

The two queries below show how much flashback data has been generated.

SQL> select * from v$flashback_database_log;

OLDEST_FLASHBACK_SCN OLDEST_FL RETENTION_TARGET FLASHBACK_SIZE ESTIMATED_FLASHBACK_SIZE
-------------------- --------- ---------------- -------------- ------------------------
6.7014E+12 23-APR-15             1440       63750144                        0

SQL> select * from v$flashback_database_stat;

BEGIN_TIM END_TIME  FLASHBACK_DATA    DB_DATA  REDO_DATA ESTIMATED_FLASHBACK_SIZE
--------- --------- -------------- ---------- ---------- ------------------------
23-APR-17 23-APR-17          49152     622592          0                        0

SQL> show parameter db_rec   (check the fast recovery area location and size, since flashback logs are stored there)

SQL> select database_role,name,controlfile_type,flashback_on from v$database;

DATABASE_ROLE    NAME      CONTROL FLASHBACK_ON
---------------- --------- ------- ------------------
SNAPSHOT STANDBY NASCXTHZ   CURRENT RESTORE POINT ONLY

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ WRITE

Convert the Snapshot Standby Database back to a Physical Standby database.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.

SQL> startup mount
ORACLE instance started.
Total System Global Area  830930944 bytes
Fixed Size                  2217912 bytes
Variable Size             603981896 bytes
Database Buffers          222298112 bytes
Redo Buffers                2433024 bytes
Database mounted.
SQL>

Next, convert the snapshot standby to a physical standby.

SQL> alter database convert to physical standby;
Database altered.

SQL> select * from v$restore_point;

no rows selected

SQL> select * from v$flashback_database_log;

no rows selected

SQL> select * from v$flashback_database_stat;

no rows selected

SQL> alter database recover managed standby database disconnect from session;
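Redo shipping was deferred on the primary at the start of the drill, so re-enable it now; then the apply processes can be watched from the standby. A sketch:

On the primary:

SQL> alter system set log_archive_dest_state_2=ENABLE scope=both sid='*';

On the standby:

SQL> select process, status, thread#, sequence# from v$managed_standby;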

Post DG Sync Validation

Run the following query on Primary DB

SQL> select thread#, max(sequence#) "Last Primary Seq Generated"
from v$archived_log val, v$database vdb
where val.resetlogs_change# = vdb.resetlogs_change#
group by thread# order by 1;

Run the following query on Standby DB

SQL> select al.thrd "Thread", almax "Last Seq Received", lhmax "Last Seq Applied"
from (select thread# thrd, max(sequence#) almax
from v$archived_log
where resetlogs_change#=(select resetlogs_change# from v$database)
group by thread#) al,
(select thread# thrd, max(sequence#) lhmax
from v$log_history
where resetlogs_change#=(select resetlogs_change# from v$database)
group by thread#) lh
where al.thrd = lh.thrd;

Ensure "Last Primary Seq Generated" on the primary matches "Last Seq Applied" on each standby.
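As an additional hedged check on the standby, v$archive_gap should return no rows once the standby has caught up:

SQL> select * from v$archive_gap;

no rows selected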

ORA-00600: internal error code, arguments: [ktbdchk1: bad dscn]

Problem Description:

We observed the following ORA-00600 in the alert log of an 11.2.0.4 database; fortunately, the database stayed up and running:

Dump file /u01/app/oracle/diag/rdbms/nascxthzprod/NASCXTHZPROD/incident/incdir_1517411/NASCXTHZPROD_ora_94908_i1517411.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0.4/dbhome_1
System name: Linux
Node name: nasodaprd051
Release: 2.6.39-400.290.1.el6uek.x86_64
Version: #1 SMP Mon Nov 7 12:59:33 PST 2016
Machine: x86_64
Instance name: NASCXTHZPROD
Redo thread mounted by this instance: 1
Oracle process number: 669
Unix process pid: 94908, image: oracle@nasodaprd051


*** 2017-07-29 12:05:19.586

Dump continued from file: /u01/app/oracle/diag/rdbms/nascxthzprod/NASCXTHZPROD/trace/NASCXTHZPROD_ora_94908.trc
ORA-00600: internal error code, arguments: [ktbdchk1: bad dscn], [], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 1517411 (ORA 600 [ktbdchk1: bad dscn]) ========
----- Beginning of Customized Incident Dump(s) -----

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error/Stack/SQL/Version from file "NASCXTHZPROD_ora_94908_i1517411.trc"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Error: ORA-00600 [ktbdchk1: bad dscn]

Error Stack: ORA-600[ktbdchk1: bad dscn]
Main Stack:
ktbValidateDependentScn <- ktbgcl1 <- ktbgtl0 <- kdiins0 <- kdiinsp <- kauxsin
<- qesltcLoadIndexList <- qesltcLoadIndexes <- qerltcNoKdtBufferedInsRowCBK
<- qerltcSingleRowLoad <- qerltcFetch <- insexe <- opiexe <- kpoal8 <- opiodr <- ttcpip
<- opitsk <- opiino <- opiodr <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main

Current SQL: INSERT INTO CASES (IC_MEMBER_ID, DATE_CREATED, IC_RULE_ID, ACTION, REPORTING_CATEGORY, CERTAINTY, BRIEF_JUSTIFICATION_ID, COMPR_JUSTIFICATION_ID, JUSTIFICATION_ID, MEMBER_EMPLOYEE_FLAG, STATE, DENY_ADJUSTMENT_CODE, ADD_ADJUSTMENT_CODE, SUPPORT_ADJUSTMENT_CODE, SAVINGS_PERCENTAGE, MODULE_NAME, MSG, H…

Cause:

Bug 22241601 : ORA-600 [KDSGRP1] IN ADG AFTER FAILOVER

Solution:

Install the patch for Bug 22241601 (see MOS note 22241601.8).

Workaround 

Install the fix to prevent this issue from being introduced; by default, installing the fix
also tries to repair existing invalid ITL commit SCNs (healing). There is no need to set
any parameter for this.

Databases that already have the init.ora parameter _ktb_debug_flags=8 can remove the
parameter after the fix is installed, since _ktb_debug_flags=8 is now the default and
healing is therefore enabled by default. Note that the fix is not disabled by setting
_ktb_debug_flags to 0; a value that does not include 8 only disables the healing of
already affected ITL SCNs, while the fix still prevents new invalid ITL SCNs from being
introduced in the index, as that part does not depend on any parameter.

Sometimes the fix may not repair a block whose invalid SCN already exists on disk; in that
case, repair the issue with one of the following:

1. Recreate the affected index in the primary database (a sketch follows after option 2).
Note that CREATE INDEX may also fail with the same ORA-600 [ktbdchk1: bad dscn] if Oracle
builds the new index from another index with the same corruption; in that case, drop all
the affected indexes on the table and recreate them.

(OR)

2. If the issue is in the standby database, refresh the affected file from the primary to the standby.
If the error is in the primary database, refresh the file from the standby to the primary.
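As an illustration of option 1, a minimal sketch with hypothetical owner/index/column names (identify the real index from the incident trace and DBA_INDEXES):

-- hypothetical names; run on the primary
SQL> drop index APP_OWNER.CASES_IDX1;
SQL> create index APP_OWNER.CASES_IDX1 on APP_OWNER.CASES (IC_MEMBER_ID);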

ORA-00603: ORACLE server session terminated by fatal error

Problem Description:

The following errors appeared in the alert log when the database went down:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
kgefec: fatal error 0

This appeared after the ODA was upgraded from 12.1.2.10.0 to 12.1.2.11.0, which is a major OS change, including a kernel upgrade from 2.6.39-400 to 4.1.12-61.

The issue is caused mainly by OS memory fragmentation. If you clear the OS cache or reboot the server, the fragmentation is gone and the issue will not appear, until fragmentation builds up again.

This is not an RDBMS problem; it is related to OS-level memory page allocation to processes.
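One quick, hedged way to see fragmentation at the OS level is /proc/buddyinfo: each column further to the right counts larger contiguous free blocks, so zeros on the right-hand side of a zone indicate fragmentation:

$ cat /proc/buddyinfo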


Solution

Please refer to the MOS document below to fix this issue:

Oracle Linux: ORA-27301: OS Failure Message: No Buffer Space Available (Doc ID 2041723.1)
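As a sketch of the kind of OS-level mitigation discussed there, reserving more free memory via vm.min_free_kbytes reduces fragmentation-related allocation failures (take the actual recommended value for your platform from the MOS document):

# check the current value
$ sysctl vm.min_free_kbytes
# raise it per the MOS note, then persist the setting in /etc/sysctl.conf
$ sysctl -w vm.min_free_kbytes=<value from the note>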

Oracle Database Appliance was getting rebooted with reboot errors in system OS logs

Problem Description:

[root@nasodacxt0101d1 ~]# last reboot | head -3
reboot system boot 4.1.12-94.3.9.el Sun Feb 18 20:34 - 03:00 (1+06:25) >>>>>>>>> this is the issue time
reboot system boot 4.1.12-94.3.9.el Sun Feb 18 02:13 - 21:36 (19:22)
reboot system boot 4.1.12-94.3.9.el Fri Jan 12 12:27 - 21:36 (37+09:08)

$ cat messages_nasodacxt0101d1 | grep -i signal
Feb 18 21:36:27 nasodacxt0101d1 init: initoak main process (3599) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: dcliagent main process (3601) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: odaBaseAgent main process (3602) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: oracle-ohasd main process (3604) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: oracle-tfa main process (5643) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: tty (/dev/tty1) main process (5647) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: tty (/dev/tty2) main process (5649) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: tty (/dev/tty3) main process (5655) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: tty (/dev/tty4) main process (5661) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: tty (/dev/tty5) main process (5667) killed by TERM signal
Feb 18 21:36:27 nasodacxt0101d1 init: tty (/dev/tty6) main process (5669) killed by TERM signal
Feb 18 21:39:31 nasodacxt0101d1 snmpd[5211]: Received TERM or STOP signal… shutting down…
Feb 18 21:39:31 nasodacxt0101d1 rpc.mountd[5017]: Caught signal 15, un-registering and exiting.
Feb 18 21:39:32 nasodacxt0101d1 ntpd[5265]: ntpd exiting on signal 15
Feb 18 21:39:34 nasodacxt0101d1 rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
Feb 18 21:39:35 nasodacxt0101d1 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="4568" x-info="http://www.rsyslog.com"] exiting on signal 15.

Cause

Someone pressed the Ctrl-Alt-Del key sequence via the ILOM remote console, and the server rebooted.

Solution

Take a backup of the /etc/init/control-alt-delete.conf file.

Then disable the Ctrl-Alt-Del handling by completing the following (a sketch of the commands follows this list):

  1. Create an /etc/init/control-alt-delete.override file.
  2. Add the line "exec /bin/true" to that file.
  3. Run "initctl reload-configuration" so Upstart picks up the change.
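A sketch of those steps as shell commands, assuming the stock Upstart layout on the ODA:

# back up the original job definition
$ cp -p /etc/init/control-alt-delete.conf /root/control-alt-delete.conf.bak
# override the job so Ctrl-Alt-Del becomes a no-op
$ echo 'exec /bin/true' > /etc/init/control-alt-delete.override
# tell Upstart to re-read its job configuration
$ initctl reload-configuration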

To verify the fix, try to reproduce the reboot by pressing Ctrl-Alt-Del via the ILOM remote console again; the server should no longer reboot.

ORA-00068: invalid value 0 for parameter _query_execution_time_limit, must be between 1952541791 and 6252643

Problem Description:

Users were seeing the following error on the application servers:

03/05/2018 06:01:43 AM : 0 : ClaimVersionDao : 0 : 2 : Generic TPS Error. Check system application log for additional details. ADOException exception (could not execute query

[ SELECT distinct IC_CLAIM_ID
FROM claim_version
WHERE 1=1
AND CLAIM_ID = :p0 ]
Name:claimId – Value:20149913944300
[SQL: SELECT distinct IC_CLAIM_ID
FROM claim_version
WHERE 1=1
AND CLAIM_ID = :p0] : ORA-00068: invalid value 0 for parameter _query_execution_time_limit, must be between 1952541791 and 6252643)
03/05/2018 06:01:43 AM : 0 :  :  :  : An error occurred — attempting to run the last workflow step.

Troubleshooting and Solution:

Reviewing the alert log and patch details for the 12.1.0.2 database involved showed that the
one-off patch had been applied while the database was up and running, which appears to be the
cause of this issue; a restart fixed it. The "_disable_image_check" messages also started as
soon as the patching finished on the live databases. Instances need to be down when one-off
patches are applied; if a one-off is applied while an instance is up, a restart should clear
the resulting errors. In this case the restart had already been done, so no further action is
needed. The root cause is that patch 27589110 was accidentally applied while the instances
were running.
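A hedged way to confirm which one-off patches are present on the home (standard OPatch commands; the grep target is this incident's patch number):

$ $ORACLE_HOME/OPatch/opatch lspatches
$ $ORACLE_HOME/OPatch/opatch lsinventory | grep 27589110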