Don Seiler

Subscribe to Don Seiler feed
Chronicling Life, Love, Linux and Database Administration.Don Seilernoreply@blogger.comBlogger389125
Updated: 8 hours 34 min ago

Sending notifications from Oracle Enterprise Manager to VictorOps

Thu, 2016-01-28 11:55
We use VictorOps for our paging/notification system, and we're pretty happy with it so far. On the DBA team, we've just been using a simple email gateway to send notifications from Oracle Enterprise Manager (EM) to VictorOps. Even then, we can only send the initial notification and not really send an automated recovery without more hacking than its worth. Not a big deal, but would be nice to have some more functionality.

So yesterday I decided I'd just sort it all out since VictorOps has a nice REST API and Enterprise Manager has a nice OS script notification method framework. The initial result can be found on my github: entmgr_to_victorops.sh.

It doesn't do anything fancy, but will handle the messages sent by your notification rules and pass them on to VictorOps. It keys on the incident ID to track which events it is sending follow-up (ie RECOVERY) messages for.

Please do let me know if you have any bugs, requests, suggestions for it.

Many thanks to Sentry Data Systems (my employer) for allowing me to share this code. It isn't mind-blowing stuff but should save you a few hours of banging your head against a wall.
Categories: DBA Blogs

Just XFS Things

Tue, 2015-11-10 16:03
$ uptime

16:36:42 up 4 days, 12:28, 6 users, load average: 1029.20, 995.42, 865.77


See https://www.centos.org/forums/viewtopic.php?f=47&t=52412 and https://access.redhat.com/solutions/532663 (requires RedHat subscription) for details. And defrag your XFS volumes.
Categories: DBA Blogs

The Leap Second is No Laughing Matter (if you have java on an older Linux kernel)

Tue, 2015-07-21 16:50
Earlier this month we began getting frequent email warnings from our EM12c server that some agents were experiencing read time outs. Then we saw that the emagent java process was using A LOT of CPU, regularly around 500% but sometimes as high as 800% as seen from "top". Restarting the agent did nothing.

I opened an SR with Oracle Support, where I was first instructed to apply a JDBC patch and then a PSU agent patch. No change in behavior.

Courtney Llamas from the Oracle EM team reached out and suggested it might be due to the leap second, directing me to these MOS docs:

  • Enterprise Manager Management Agent or OMS CPU Use Is Excessive near Leap Second Additions on Linux (Doc ID 1472651.1)
  • Leap Second Hang - CPU Can Be Seen at 100% (Doc ID 1472421.1)
The workaround is to restart ntpd (or reboot the server):

# /etc/init.d/ntpd stop
# date -s "`date`" (reset the system clock)
# /etc/init.d/ntpd start


I monitored top while my system admin restarted ntpd and reset the clock. As soon as he did, java CPU usage dropped like a rock.



While I'm incredibly grateful that Courtney provided the solution in basically 5 minutes, I'm even more upset that Oracle Support had me doing everything but for the 20 days that my original SR has been open.

Of course the real joke is on me, since I first reported the error on July 1 and we all joked on twitter how it was probably due the leap second. The fault also lies with me since I failed to notice that our kernel version (2.6.32-220) was still vulnerable to this (fixed in 2.6.32-279). See Maris Elsins' great write-up (which I apparently skimmed too lightly).
Categories: DBA Blogs

Upgrade to Oracle 12c, Get the Huge Trace Files for Free!

Tue, 2015-07-21 12:12
Last week we began testing a copy of our production database on Oracle 12c (12.1.0.2). This past weekend we were alerted that the disk holding our ADR diagnostic directory was near full. We noticed some pretty big (and recent) trace files there. Then it happened twice more. This last time filled the disk before we could get to it (thankfully only dev, and during the evening), meaning it filled up fast. The largest file was over 18GB and it only took 6 hours to get that big.

I saved the three biggest trace files to a large NFS mount and did a trace file purge just to get our dev database back up. When I looked at those files, I saw they were all sqlplus sessions running the same DELETE statement after reports like this:


----- Cursor Obsoletion Dump sql_id=7q0kj0sp5k779 -----
Parent cursor obsoleted 1 time(s). maxchild=1024 basephd=0xae2a2ca308 phd=0xae2a2ca308
----- Dump Cursor sql_id=7q0kj0sp5k779 xsc=0x7ffbf191fd50 cur=0x7ffbf2702670 -----


At first I thought it was a user session setting some oddball trace event. However our team found DocID 1955319.1 (Huge Trace Files Created Containing "----- Cursor Obsoletion Dump sql_id=%s -----")

Long story short, it's an unpublished bug introduced in 12.1.0.2 with the cursor obsoletion diagnostic dump "ehancement". I don't think they intended this though, even though they did say "Huge". The workaround is to disable it completely, via this hidden parameter:


alter system set "_kks_obsolete_dump_threshold" = 0;


There is this note at the end though:
Note: The underlying cursor sharing problem should always be highlighted and investigated to ensure that the reason for the non-sharing is known and fully understood.Which is definitely good advice.
Categories: DBA Blogs

Returning Error Codes from sqlplus to Shell Scripts

Tue, 2015-01-06 15:32
When I have to run SQL scripts, I prefer to do it via executable ksh scripts (I use ksh because it's usually an Oracle pre-requisite so I can rely on it being installed). One thing that I've just recently started adding into my scripts is exiting sqlplus on error and returning the ORA error code. This is done via the WHENEVER SQLERROR feature of sqlplus:

sqlplus / as sysdba <<EOF

        whenever sqlerror exit sql.sqlcode
        alter tablespace foo
                rename to foo_old;

        create tablespace foo
                datafile size 100m;

        alter table foo move tablespace $TABLESPACE_NAME nocompress;
EOF

RETURN_CODE=$?
if [ $RETURN_CODE -ne 0 ]; then
        echo "*** Tablespace renaming error code $RETURN_CODE. ***"
        exit $RETURN_CODE;
fi

In this example we rename a tablespace and then create a new tablespace in its place and move a table there. It seems rather contrived but this is actually what I'm doing to move tables to an uncompressed and unencrypted tablespace to test storage vendor deduplication claims. But I digress ...

To test this, I used a tablespace that doesn't exist. This results in an ORA-00959 error when the tablespace does not exist. However, the return code I get was 191. I spent a good portion of the day testing and retesting with all sorts of debug output until I stumbled across this comment. Since Linux (and I'm told Unix) return codes only go up to 255, the ORA sqlcode value of 959 is wrapped until there is a remainder. The value of 191 is the difference, achieved simply by modulo operation:

959 % 256 = 191

 And suddenly the skies were cleared.

As always, hope this helps!
Categories: DBA Blogs

Making Copies of Copies with Oracle RMAN

Wed, 2014-11-12 10:13
I recently had need to make a copy of an image copy in Oracle rman. Since it wasn't immediately obvious to me, I thought it was worth sharing once I had it sorted out. I was familiar with making a backup of a backup, but had never thought about making a copy of a copy.

First you need to create an image copy of your database or tablespace. For the sake of example, I'll make a copy of the FOO tablespace. The key is to assign a tag to it that you can use for later reference. I'll use the tag "DTSCOPYTEST":

backup as copy 
    tablespace foo 
    tag 'DTSCOPYTEST'
    format '+DG1';

So I have my image copy in the DG1 tablespace. Now say we want to make copy of that for some testing purpose and put it in a different diskgroup. For that, we need the "BACKUP AS COPY COPY" command, and we'll want to specify the copy we just took by using the tag that was used:

backup as copy
    copy of tablespace foo
    from tag 'DTSCOPYTEST'
    tag 'DTSCOPYTEST2'
    format '+DG2';

As you'd guess, RMAN makes a copy of the first copy, writing it to the specified format location.

As always, hope this helps!
Categories: DBA Blogs

The Importance of Backups (A Cautionary Block Recovery Tale)

Tue, 2014-11-04 22:28
Just wanted to share a quick story with everyone. As I was in the airport waiting to fly to Oracle OpenWorld this year, I noticed a flurry of emails indicating that part of our storage infrastructure for our standby production database had failed. Long story short, my co-workers did a brilliant job of stabilizing things and keeping recovery working. However, we ended up with more than a few block corruptions.

Using the RMAN command "validate database", we could then see the list of corrupt blocks in the v$database_block_corruption view. All that was needed was to run "recover corruption list" in RMAN, which will dig into datafile copies and backups to do what it can to repair or replace the corrupt blocks and then recover the datafiles. Of course, nothing is ever that easy for us!

The storage we were writing our weekly backups to had been having problems and the latest weekly had failed. We ended up having to back 2 weeks into backups to get the datafile blocks and archivelogs to eventually complete the corruption recovery. I also immediately moved our backups to more reliable storage as well so that we're never in the situation of wondering whether or not we have the backups we need.

So, triple-check your backup plan, validate your backups and TEST RECOVERY SCENARIOS! You can't say your backups are valid until you use them to perform a restore/recovery, and you don't want to find out the hard way that you forgot something.
Categories: DBA Blogs

ORA-16534 When Converting to/from Snapshot Standby with DataGuard Broker

Wed, 2014-10-22 23:00
We here at Seilerwerks Industries (not really) have been using snapshot standby databases to refresh an array of unit test databases from a common primary. During the business day, these would be converted to snapshot standby databases for testing, then overnight they are converted back to physical standby and recovered up to the master again.

However we ran into one problem the other week. I noticed that the test3 database was still in physical standby mode well into the business day. Trying to manually convert returned this error:

DGMGRL> convert database test3 to snapshot standby
Converting database "test3" to a Snapshot Standby database, please wait...
Error:
ORA-16534: switchover, failover or convert operation in progress
ORA-06512: at "SYS.DBMS_DRS", line 157
ORA-06512: at line 1

A quick search of MOS yielded bug 13716797 (ORA-16534 from the broker when setting apply-off), which simply suggested restarting the problem database when encountering that error. However doing so did not get me any further. That's when the I checked the Data Guard Broker configuration:

DGMGRL> show configuration;

Configuration - testdb

  Protection Mode: MaxPerformance
  Databases:
    test1 - Primary database
    test5 - Physical standby database
    test6 - Snapshot standby database
    test3 - Physical standby database
    test4 - Snapshot standby database

Fast-Start Failover: DISABLED

Configuration Status:
ORA-16610: command "CONVERT DATABASE test6" in progress
DGM-17017: unable to determine configuration status

Looks like I have two databases stuck in physical standby mode, test3 and also test6. And the configuration is specifically complaining about test6. So I restarted that database and, sure enough, I was then able to convert both back to snapshots:

DGMGRL> show configuration;

Configuration - testdb

  Protection Mode: MaxPerformance
  Databases:
    test1 - Primary database
    test5 - Snapshot standby database
    test6 - Snapshot standby database
    test3 - Snapshot standby database
    test4 - Snapshot standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

It was very interesting to me to see one member of the Data Guard configuration prevent me from performing an operation on a different member. Hopefully this helps one of you in the future.

Categories: DBA Blogs

Advanced Queue Quickie: Errors and Privileges

Wed, 2014-10-01 23:07
File this one under the misleading-errors department. One of my developers was working with a new queue. He pinged me when he got this error trying to create a job that used the queue:

ERROR at line 1:
ORA-27373: unknown or illegal event source queue
ORA-06512: at "SYS.DBMS_ISCHED", line 124
ORA-06512: at "SYS.DBMS_SCHEDULER", line 314
ORA-06512: at line 2

The CREATE_JOB statement was:

BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'foo.bar_q_job',
job_type => 'PLSQL_BLOCK',
job_action => 'begin foo.bar_pkg.consume_bar_queue(); end;',
queue_spec => 'BAR.BAR_Q, FOO_BAR_AGENT',
enabled => true,
comments => 'This is a job to consume the bar.bar_q entries that affect foo.');
END;
/

After a few minutes of banging our heads, it became obvious that this was a permissions problem. The queue was owned by BAR, the job was being created as FOO. The ORA error message could/should have made this more obvious, in my opinion.

Anyway, the fix was simply to grant access to FOO:

DBMS_AQADM.GRANT_QUEUE_PRIVILEGE(
        privilege  => 'ALL',
        queue_name => 'bar.bar_q',
        grantee    => 'foo' );

Hope this saves some banged heads for others.
Categories: DBA Blogs

NOFILENAMECHECK Parameter Causes DUPLICATE DATABASE To Ignore DB_CREATE_FILE_DEST?

Wed, 2014-09-10 23:06
Last week I was creating a new testing database from a backup of our demo database, both under Oracle 11.2.0.3. I grabbed one of my old scripts to handle the duplicate function, which looked similar to this:

connect auxiliary /;
run {

        duplicate database to testdb
                backup location '$BACKUPDIR'
                nofilenamecheck;

}

One important difference between the demo database and this new test database is that the original demo database lives on a filesystem and the new database was to go onto ASM on a different host. I had copied the syntax from the old script and kicked it off. I made sure that the db_create_file_dest was set to the ASM diskgroup. However the restore would fail as RMAN tried to write to the filesystem path used by the original demo database, which didn't exist on this host, instead of the ASM diskgroup.



Definitely puzzling to me. I double-checked the documentation for NOFILENAMECHECK, which only suggested that it did a check for matching filenames, but didn't state that it would cause the issue I was seeing. The summary saying that it would prevent RMAN from checking for a name collision, which it does to prevent from overwriting existing files on the same host.

However what I found is that having NOFILENAMECHECK in my command resulted in RMAN restoring the files to the original path, ignoring my db_create_file_dest specifications. When I removed the NOFILENAMECHECK specification from the RMAN command, the files were restored to the ASM diskgroup as intended.

MOS Support documents 1375864.1 and 874352.1 suggest using the DB_FILE_NAME_CONVERT parameter but I found this was not necessary when I set the DB_FILE_CREATE_DEST parameter, as long as I didn't use NOFILENAMECHECK. I couldn't find anything in MOS about NOFILENAMECHECK forcing the restore to use a certain location though.
Categories: DBA Blogs

ORA-14048 When Adding Composite Partition

Thu, 2014-08-21 15:35
Fresh off the heels of my earlier composite partitioning post, I just ran into this confusing issue:

SQL> alter table p_objects
  2          add partition p201410
  3                  values less than (to_date('2014/11/01','yyyy/mm/dd'))
  4          (
  5                  subpartition p201410_spdts values ('DTS')
  6                  , subpartition p201410_spfoo values ('FOO')
  7                  , subpartition p201410_spbar values ('BAR')
  8                  , subpartition p201410_spsys values ('SYS')
  9                  , subpartition p201410_spsysaux values ('SYSAUX')
 10          )
 11  tablespace tbs1
 12  ;
tablespace tbs1
*
ERROR at line 11:
ORA-14048: a partition maintenance operation may not be combined with other
operations

The error doesn't quite make it easy to determine what the problem is. Turns out that the tablespace/storage clause of the ALTER TABLE ... ADD PARTITION has to come prior to the subpartitions definition. Simply moving that part of the statement a few rows up yields success:

SQL> alter table p_objects
  2          add partition p201410
  3                  values less than (to_date('2014/11/01','yyyy/mm/dd'))
  4          tablespace tbs1
  5          (
  6                  subpartition p201410_spdts values ('DTS')
  7                  , subpartition p201410_spfoo values ('FOO')
  8                  , subpartition p201410_spbar values ('BAR')
  9                  , subpartition p201410_spsys values ('SYS')
 10                  , subpartition p201410_spsysaux values ('SYSAUX')
 11          )
 12  ;

Table altered.

Again, probably obvious to most of you. It wasn't as obvious to decipher when I was trying to add a partition with over 6,300 subpartitions. As always:


Categories: DBA Blogs

Adding New Partitions with Custom Subpartition Definition (Range-List)

Thu, 2014-08-21 14:28
As part of a project for work I wanted to create a script that would create a new range partition but also pre-create all the list subpartitions. By default the subpartitions would be created based on the subpartition template. However for various reasons which I won't get into we don't update or use the subpartition template. I wanted to define the subpartition list as part of the ALTER TABLE ... ADD PARTITION statement. I assumed it was perfectly acceptable but didn't see any obvious examples in my hasty web searching, so I thought I'd share on myself.

I start by creating my composite range-list partitioned table:

SQL> create table p_objects
  2  tablespace tbs1
  3  partition by range(rdate)
  4  subpartition by list(owner)
  5  subpartition template
  6  (
  7          subpartition spsys values ('SYS')
  8          , subpartition spsysaux values ('SYSAUX')
  9  )
 10  (
 11          partition p201301 values less than (to_date('2013/02/01','YYYY/MM/DD')),
 12          partition p201302 values less than (to_date('2013/03/01','YYYY/MM/DD')),
 13          partition p201303 values less than (to_date('2013/04/01','YYYY/MM/DD')),
 14          partition p201304 values less than (to_date('2013/05/01','YYYY/MM/DD')),
 15          partition p201305 values less than (to_date('2013/06/01','YYYY/MM/DD')),
 16          partition p201306 values less than (to_date('2013/07/01','YYYY/MM/DD')),
 17          partition p201307 values less than (to_date('2013/08/01','YYYY/MM/DD')),
 18          partition p201308 values less than (to_date('2013/09/01','YYYY/MM/DD')),
 19          partition p201309 values less than (to_date('2013/10/01','YYYY/MM/DD')),
 20          partition p201310 values less than (to_date('2013/11/01','YYYY/MM/DD')),
 21          partition p201311 values less than (to_date('2013/12/01','YYYY/MM/DD')),
 22          partition p201312 values less than (to_date('2014/01/01','YYYY/MM/DD'))
 23  )
 24  as select object_id
 25          , owner
 26          , object_name
 27          , object_type
 28          , to_date(trunc(dbms_random.value(
 29                  to_char(to_date('2013/01/01','YYYY/MM/DD'),'J'),
 30                  to_char(to_date('2013/12/31','YYYY/MM/DD'),'J')
 31                  )),'J') rdate
 32  from all_objects
 33  where owner in ('SYS','SYSAUX');

Table created.

This creates the partitions with 2 subpartitions each, per my defined template. For example:

PARTITION_NAME                 SUBPARTITION_NAME
------------------------------ ------------------------------
P201301                        P201301_SPSYS
                               P201301_SPSYSAUX
P201302                        P201302_SPSYS
                               P201302_SPSYSAUX
P201303                        P201303_SPSYS
                               P201303_SPSYSAUX

Next I'll add two more partitions, one with no subpartition definition specified and another with a custom definition:

SQL> alter table p_objects
  2          add partition p201408
  3                  values less than (to_date('2014/09/01','yyyy/mm/dd'));

Table altered.

SQL> alter table p_objects
  2          add partition p201409
  3                  values less than (to_date('2014/10/01','yyyy/mm/dd'))
  4          (
  5                  subpartition p201409_spdts values ('DTS')
  6                  , subpartition p201409_spsys values ('SYS')
  7                  , subpartition p201409_spsysaux values ('SYSAUX')
  8          )
  9  ;

Table altered.

The results:
PARTITION_NAME                 SUBPARTITION_NAME
------------------------------ ------------------------------
P201408                        P201408_SPSYS
                               P201408_SPSYSAUX
P201409                        P201409_SPDTS
                               P201409_SPSYS
                               P201409_SPSYSAUX

You can see that the first statement just used the subpartition template as defined in the original CREATE TABLE statement. The second uses the list I defined in the ALTER TABLE statement. I could have only defined the SPDTS subpartition and it would have only used that one and not used SYS or SYSAUX subpartitions at all.

Hopefully this post helps a few of you in your searching when faced with a similar task. This was probably obvious to most of you but the mind starts to slip for some of us.
Categories: DBA Blogs

ASM Startup Fails With ORA-04031 After Adding CPUs

Sun, 2014-07-13 15:11
A few weeks ago we upgraded one of our production server, adding another CPU tray. This brought the number of CPU cores from 80 to 160, and took us from 2Tb of RAM to 4Tb (just in time for Oracle to announce the in-memory database in 12.1.0.2!).

However when I went to start things up, ASM wasn't starting up, giving me these errors:

ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KKSSP^550","kglseshtSegs")
ORA-04031: unable to allocate 392 bytes of shared memory ("shared pool","unknown object","sga heap(1,1)","kdlwss")

ORA-04031: unable to allocate 560 bytes of shared memory ("shared pool","unknown object","sga heap(1,1)","KKSSP")
RBAL (ospid: 59427): terminating the instance due to error 4031




A quick search of MOS turned up this gem:

Unable To Start ASM (ORA-00838 ORA-04031) On 11.2.0.3/11.2.0.4 If OS CPUs # > 64. (Doc ID 1416083.1), with this cause:
In 11.2.0.3/11.2.0.4, the "PROCESSES" parameter will be default to "available CPU cores * 80 + 40" (in the ASM spfile). As the default value for "MEMORY_TARGET" is based on "PROCESSES", it can be insufficient if there is a large number of CPU cores or large number of diskgroups which could cause issues (i.e. Grid Infrastructure stack fails to stop with ORA-04031 etc) per Bug:13605735 & Bug:12885278, it is recommended to increase the value of MEMORY_MAX_TARGET & MEMORY_TARGET before upgrading/installing to 11.2.0.3/11.2.0.4 (does not apply to 10g ASM).
We followed the workaround of increasing the ASM memory_target (and memory_max_size) to 4Gb (from 500Mb) and things started up just fine.

We've since actually increased the memory even more in ASM to see even better performance but I haven't taken the time yet to precisely understand where that is coming from. I'll be sure to write another post to detail that.

For now we're definitely enjoying the bigger shared pool and buffer caches in 11.2. We're definitely excited to dig into the in-memory options in the upcoming 12.1.0.2.
Categories: DBA Blogs

Beware April 2014 PSU and Golden Gate Integrated Capture Users

Tue, 2014-06-03 14:55
When the latest quarterly Patch Set Update (PSU) came out from Oracle, we planned to get it in place as soon as reasonable due to our need to stay current with security patches, and my need to apply what I had assumed were well-tested bug fixes for our 11.2.0.3 installations. However we were in for an unpleasant surprise.

We first applied the April 2014 PSU on our staging & development database hosts and things ran fine. After two weeks, we put it into production. The next morning we noticed that our Golden Gate extracts were abending with a message like this:

2014-05-13 01:41:44 ERROR OGG-02077 Extract encountered a read error in the asynchronous reader thread and is abending: Error code 600, error message:
ORA-00600: internal error code, arguments: [knlogcPackColList:101], [1], [], [], [],[], [], [], [], [], [], []. 

Obviously the PSU was the only change from the day before when these GoldenGate extracts ran fine. The error itself seemed to match Bug 16306373 (OGG Lightweight capture fails with ORA-600 [knlogcPackColList:101]), which affects integrated capture extracts, which was what we were trying to use.

So we had two questions to answer:

  1. Was it really the PSU, and if so do we need to rollback?
  2. Why didn't we see this in development/staging?




The answer to #2 came pretty quickly: the extracts had been stopped prior to applying the PSU in development and were never restarted. Our use of GoldenGate is for a new project that is still not quite solid in form, and so the extracts might be shut down for weeks at a time. However the DBA team was not aware of this (something that will be added to the checklist in the future) and so that part was never tested. We decided to turn the extracts on and, sure enough, we saw the problem within 30 minutes.

As far as #1, we opened an SR with Oracle Support. At first they assured us it wasn't the PSU, but within a few hours had started backing away from that confidence and by the end of the day were suggesting we roll back. However we decided to hold off on that due to the luxury of not needing those extracts to run quite yet (we had just been capturing data in production to make sure we could handle the workload). That gave us and Oracle Support some breathing room to get it fixed.

Eventually Oracle Support did confirm that it was Bug 16306373. The patch for this bug, however, conflicted with another patch we had installed. So they created a new merge patch MLR 18689166 for us. This was applied that evening to just our development database home, and immediately we saw more problems.

First, regularly-scheduled materialized view refreshes were failing with:

ORA-06508: PL/SQL: could not find program unit being called: "SYS.DBMS_SNAPSHOT" 

Then we saw DataPump exports failing with:

ORA-06508: PL/SQL: could not find program unit being called: "SYS.DBMS_LOGREP_UTIL" 

On top of that, GoldenGate gave us these errors trying to unregister the extract:

GGSCI (stagingdb) 14> unregister extract ext1ol database
ERROR: Cannot register or unregister EXTRACT EXTFOO because of the following SQL error: OCI Error ORA-01775: looping chain of synonyms (status = 1775). 

It definitely looked like something in that patch had hosed the catalog. Oracle quickly found the problem:

The problem with the invalid dictionary objects after the installation of patch 18689166 is due to the execution of script e1102000.sql that is being called from postinstall.sql e1102000.sql is a downgrade scripts and removes some objects from the data dictionary resulting in other dependent objects to become invalid. 
It is here that I should point out that Oracle had this merge patch in their QA for over 4 days. I could have seriously hurt myself from simultaneously rolling my eyes and face-palming as violently as I did.

The fix was to restart the instance into UPGRADE mode and run the catupgrd.sql script from $ORACLE_HOME/rdbms/admin, which rebuilds the catalog. We confirmed that this did fix the problem in DEV, and proceeded to patch the other databases, this time commenting out the call to e1102000.sql n the postinstall.sql script, per Oracle's workaround.

So, not only did the April 2014 PSU introduce a bug, but then the patch for that bug made things incredibly worse. We were fortunate in that we did not yet need the GoldenGate extracts in production. However many others will not have that luxury and will want to double-check that bug and the PSU if they're using GoldenGate and integrated capture.
Categories: DBA Blogs

archive_lag_target Works in SE

Mon, 2014-05-12 15:34
TL;DR: The archive_lag_target parameter will force log archiving in Standard Edition.

Just a quick note here that I wanted to share since I didn't see anything directly confirming this when I was searching around.

I have an Oracle 11gR2 Standard Edition (SE) database that I'm also maintaining a manual standby for, since Oracle Data Guard is not available in SE. I created a metric extension in EM12c to alert me if the standby is more than 1 hour behind the primary. However since this is a very low-activity database, archive logs were not switching even once an hour. Obviously, I could include a command to force a log switch/archive in the script that I use to push archivelogs to the standby. However we all know that with Data Guard on Enterprise Edition (EE), one would use the archive_lag_target initialization parameter to set the desired maximum standby lag. Oracle enforces this by performing a log switch at most every X seconds, where X is the number specified by the archive_lag_target value. By default this is set to 0, which disables the feature.

I had assumed that archive_lag_target would only work in EE but decided to give it a try and was pleasantly surprised to see that it does work as intended in SE. So I can set archive_lag_target=900 to specify a 15 minute maximum log archiving (it would be more frequent if the database activity warranted an earlier switch).
Categories: DBA Blogs

Don't Fear the EM12c Metric Extensions

Fri, 2014-04-25 11:24
A few weeks ago, our customer support team asked us to automate part of their checklist that looks at the number of active sessions in our production database. In EM12c, this seemed like a no-brainer with the Average Active Sessions metric. So I added this to my production incident ruleset and went back to another project. Over the next few days we'd get pinged by EM12c but the support folks would say it shouldn't. After taking a look we realized that we should be looking at USER sessions, excluding the BACKGROUND sessions Oracle creates to run the instance (like DBWn and LGWR).

The trouble was that I couldn't find a metric for just the user sessions. I had resolved myself to having a scheduler job or script run the SQL that I wanted and send an email if it was over my critical threshold. On a whim I put a message out to twitter, and thankfully my friend Leighton answered. He suggested I look at adding a metric extension, something still foreign to me. I had seen the term in the EM12c interface but it sounded like some kind of plugin interface. Turns out it's simply another way of saying "user-defined metrics". Honestly, if they were labelled that way I would have started playing with them much sooner (subtle criticism).

So a quick search turned up a great video that showed just how simple it was to create a metric extension based on a SQL query. In just a few minutes I had the metric extension created, tested and published. You can create many different types of metric extensions all target types, but in my case a simple SQL query for a database instance was all I needed:

select count(*) from v$session
where type='USER' and status='ACTIVE';

I then define the warning and critical thresholds for the count and it's done! I added it to my ruleset (removing the stock Average Active Sessions metric) and haven't looked back.

Since then I've created a few other metric extensions, for example a standby lag check on a standard edition physical standby. My only regret is not taking the time to learn about these sooner. I suggest anyone using EM12c do so sooner rather than later.
Categories: DBA Blogs

Supplemental Logging and Securefiles Causing DBWn to Block

Thu, 2014-04-17 09:00
A few weeks back, we began making changes to prepare for using Oracle Golden Gate. One of the first steps required is to enable "minimal supplemental logging" at the database level. We did this during an evening maintenance window. However by the time the morning workload picked up, we started seeing a lot of sessions blocking, and the root blocker was one of the DB Writer (DBWn) processes.

Looking at the blocked sessions, a query similar to this was a common theme:

UPDATE foo
SET foo_data = :data, foo_time = systimestamp
WHERE foo_id = :id

This statement was run by many sessions from our webservers as part of a page load process. Very high-frequency call rate. We knew that the only change in the database was the minimal supplemental logging. Obviously we were preparing to turn it off, but took some time to look into it. It is important to note that for the most part the contention was only with sessions running this update statement.



Looking at the table involved, one unique feature was that foo_data field is an encrypted CLOB, which Oracle refers to as a securefile in 11g.

SQL> desc foo
Name            Null?    Type
--------------- -------- ----------------------------
FOO_ID          NOT NULL VARCHAR2(64)
FOO_DATA                 CLOB ENCRYPT
FOO_TIME        NOT NULL TIMESTAMP(6)

We opened an SR with Oracle and they pointed us to this unpublished bug:


Bug 9351684 : SECUREFILE - CACHE NOLOGGING CAUSES HIGH WRITE COMPLETE WAITS

They described it for us as:

Confirmed as "not a bug" in this bug.

It was stated: "write complete waits in this case are unfortunately, expected and can not be avoided/tuned. Even for NOLOGGING case there is a short invalidation redo that must be generated, and for correct crash recovery, dbwr must wait for redo to be written to disk first before data blocks can be written."Basically telling us that this is working as intended and there is no workaround if you're using securefiles. For us it was important that we move forward with Golden Gate, so we would need to have a solution that let us keep minimal supplemental logging on. Looking closer, we knew that this table was on an encrypted tablespace already, so we felt comfortable changing the table so that it used a regular "basicfile" CLOB:

SQL> desc foo
Name            Null?    Type
--------------- -------- ----------------------------
FOO_ID          NOT NULL VARCHAR2(64)
FOO_DATA                 CLOB
FOO_TIME        NOT NULL TIMESTAMP(6)


Since making this change, the problems have gone away. Obviously we were lucky in that we could change the table to not use securefiles. If you have a table that sees a lot of DML with securefiles, you're probably going to have a painful experience with supplemental logging. Beware!
Categories: DBA Blogs

RMAN Redundancy is not a Viable Retention Policy

Tue, 2014-04-15 14:07

Originally posted by me on the Pythian blog. This is an older post that I somehow forgot to post on my own blog, but another recent redundancy foul-up reminded me of it.

The story you are about to read is based on actual events. Names and paths have been changed to protect the innocent. I call this scenario “The Perfect Storm” because it took just the right combination of events and configurations. Sadly, this doesn’t make it an unlikely occurrence, so I’m posting it here in hopes that you’ll be able to save yourselves before it’s too late.

I have always had a preternatural dislike for using REDUNDANCY as a retention policy for Oracle RMAN, greatly preferring RECOVERY WINDOW instead, simply because REDUNDANCY doesn’t really guarantee anything valuable to me, whereas RECOVERY WINDOW guarantees that I’ll be able to do a point-in-time recovery to anytime within the past x days. Plus, I had already been burned once by a different client using REDUNDANCY. With the story I’m about to tell, this dislike has turned into violent hatred. I’m going to be light on the technical details, but I hope you’ll still feel the full pain.



First some table setting:
  • Standalone 10.2.0.2 instance (no RAC, no DataGuard/Standby)
  • RMAN retention policy set to REDUNDANCY 2
  • Backups stored in the Flash Recovery Area (FRA)
A few months ago, we had a datafile corruption on this relatively new instance (data had been migrated from an old server about a week prior). The on-call DBA followed up the page by checking for corruptions in the datafile with this command:

RMAN> backup check logical datafile '/path/to/foobar_data.dbf';

This, my friends, led to the major fall, though we did not know it for many hours. You see, the FRA was already almost full. This causes the FRA to automatically delete obsolete files to free up space. That last backup command, while only intended to check for logical corruption, did actually perform a backup of the file, and rendered the earliest backup of the file obsolete since there were two newer copies. That earliest file happened to be from the level 0 backup from which we would later want to restore.

Of course, at first we didn’t know why the file was missing. Logs showed that it was on disk no less than two hours before the problem started. Later, scanning the alert log for the missing backup filename yielded this:

Deleted Oracle managed file /path/to/flash_recovery_area/FOO_DB/backupset/2008_12_01/o1_xxxx.bkp

Oracle deleted the one backup file that we needed!

Even worse, it wasn’t until this time on a Monday night that we realized that the level 0 taken the previous weekend had failed to push the backup files to tape because of a failure on the NetBackup server. The problem was reported as part of Monday morning’s routine log checks, but the missing files had not yet been pushed to tape.

In the end, we were able to drop and restore the tablespace to a previous point in time on a test instance from another backup file and exp/imp data back over. It was ugly, but it got things back online. Many DBAs better than myself gave their all on this mission.

To summarize, the ingredients:
  • Oracle RMAN
  • CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
  • Flash Recovery Area near full, obediently deleting obsolete files.
  • Tape backup failure
Add in an innocent backup command and . . . BOOM! Failure Surprise.

The two biggest points to take away are:

  • Tape backup failures are still serious backup failures and should be treated as such, even if you backup to disk first.
  • REDUNDANCY is not a viable retention policy. In my house, it is configuration non grata.
Categories: DBA Blogs

Migrating (and Upgrading!) Your EM12c Repository Database

Mon, 2014-04-07 08:00
This week I migrated our EM12c repository database to a new server as part of its promotion to production status. Just to make it a little more exciting, the migration also involved an in-flight upgrade from 11.2.0.3 to 11.2.0.4. Much of this post is directly inspired by Martin Bach's post on the same subject. I ran into a few other snags that weren't mentioned so I thought it would be worthwhile to document the experience here for your benefit.

I'm assuming you have all the software installed (and patched to the latest PSU, right?). Alright then, let's begin!

Stop OMS
We want to make sure there are no more changes coming, and nothing needs to access the repository database, so be sure to stop all OMS instances:

$ emctl stop oms -all

Backup PFILE
We need to get the pfile for the current repo and copy it into place on new host:

SQL> create pfile='/mnt/emrepo_backup/initemrepo.ora' from spfile;

I use /mnt/emrepo_backup here because that is the directory that I'll be backing the database up to and copying to the new host after. If you create your pfile somewhere else, be sure to copy it to the new host under $ORACLE_HOME/dbs/
Backup Repo Database
Next we backup the repo database. Here's a snippet from my ksh script that I used:


#!/bin/ksh

BACKUPDIR=/mnt/emrepo_backup
LOGFILE=backup_emrepo.log

mkdir -p $BACKUPDIR

rman log=$LOGFILE <<EOF
connect target /
set echo on


run {

        allocate channel c1 device type disk format '$BACKUPDIR/%U';
        allocate channel c2 device type disk format '$BACKUPDIR/%U';
        allocate channel c3 device type disk format '$BACKUPDIR/%U';
        allocate channel c4 device type disk format '$BACKUPDIR/%U';

        backup as compressed backupset database
                include current controlfile
                plus archivelog;
}
EOF

When the backup is finished, review the RMAN log and make note of which backup piece contains the controlfile backup. We'll need to refer to it by name as part of the restore process.

If your backup directory is an NFS mount, then you can simply unmount it from here and mount it to the new server. Otherwise, be sure to copy the files there after the backup is complete, for example:

$ scp -r /mnt/emrepo_backup newhost:/path/to/emrepo_backup

After this, it should be safe to shutdown the old repository database.

$ sqlplus / as sysdba
SQL> shutdown immediate

If you use Oracle Restart:

$ srvctl stop database -d emrepo
$ srvctl disable database -d emrepo

Prepare New Host for Repo DB
Now we need to set things up on the new host for the emrepo DB.

Create oratab Entry
First let's create an entry in /etc/oratab for this DB under the new 11.2.0.4 home. For example:

emrepo:/oracle/app/product/11.2.0.4:N

Edit PFILE and Create SPFILE
Then let's copy that parameter file into place.

$ . oraenv
ORACLE_SID = [oracle] ? emrepo
The Oracle base has been set to /oracle/app
$ cd $ORACLE_HOME/dbs/
$ cp /mnt/emrepo_backup/initemrepo.ora .

Now edit that file and make sure you update the parameters that require updating. In my case, I'm using Oracle Managed Files (OMF) so I set db_create_file_dest and db_create_online_log_dest_1. I also set db_recovery_file_dest for the FRA. I then set the control_files parameter to specify where I want the control file(s) restored to from the backup when I get to that point.

Now, Martin Bach noted in his blog post that he did not have to specify a db_file_name_convert or log_file_name_convert. I was having some difficulty during the restore phase, and added these parameters out of pure speculation. They didn't help the problem, but I left them in for the duration of my process. I only mention this as an FYI if you end up comparing your settings to mine.

Once you have all your parameters set as desired, create the SPFILE:

$ sqlplus / as sysdba
SQL> create spfile from pfile;

Now, let us restore ourselves the database.
Restore Repo DB on New Host
The restore was done largely as part of a ksh script, which I'll reference snippets of here. Let's start by defining some variables:

BACKUPDIR=/mnt/emrepo_backup
DESTDIR=/oracle/app/oradata/data/EMREPO


Restore Controlfile and Mount Database
From the script, we call RMAN to start the instance in nomount mode, restore the controlfile from the specified backuppiece and mount the database:

rman log=$LOGFILE <<EOF
connect target /
set echo on
startup force nomount;
restore controlfile from '$BACKUPDIR/1abcd123_1_1';
alter database mount;

catalog start with '$BACKUPDIR' noprompt;
EOF

We end by cataloging the backup files, as you can see.

Generate SET NEWNAME Script
Here I dip into sqlplus to generate an script for RMAN to call SET NEWNAME for each of the datafiles. Without this, RMAN would try to restore the datafiles to their old paths on the original host. Here I set them for the path that OMF will use:


sqlplus -s /nolog <<EOF
connect / as sysdba
set head off pages 0 feed off echo off verify off
set lines 200
spool rename_datafiles.rman
select 'set newname for datafile ' || FILE# || ' to ''' || '$DESTDIR/datafile/' || substr(name,instr(name,'/',-1)+1) || ''';' from v\$datafile;
select 'set newname for tempfile ' || FILE# || ' to ''' || '$DESTDIR/tempfile/' || substr(name,instr(name,'/',-1)+1) || ''';' from v\$tempfile;
spool off
EOF

Restore & Recover Database
Now we're ready to restore the database and perform recovery. Again, we call RMAN and run this:

run {
  allocate channel c1 device type disk;
  allocate channel c2 device type disk;
  allocate channel c3 device type disk;
  allocate channel c4 device type disk;
  @rename_datafiles.rman
  restore database;
  switch datafile all;
  switch tempfile all;
  recover database;
}


At this point we're done with the restore and recovery. Normally I would OPEN RESETLOGS, but remember that we're restoring this to an 11.2.0.4 home, so we still need to UPGRADE the database!


Open and Upgrade Database
First we still call OPEN RESETLOGS, but with the UPGRADE option. This replaces the "STARTUP UPGRADE" command you would find in the manual upgrade instructions.

$ sqlplus / as sysdba
SQL> alter database open resetlogs upgrade;

Now we follow the rest of the manual upgrade instructions, I'll just post the commands here, but you should definitely review the documentation:

$ cd $ORACLE_HOME/rdbms/admin
$ sqlplus / as sysdba
SQL> spool upgrade.log
SQL> @catupgrd.sql

-- Start database again
SQL> startup;

-- Check status of components, some will be fixed by utlrp.sql
SQL> @utlu112s.sql

-- Rebuild everything
SQL> @catuppst.sql
SQL> @utlrp.sql

-- Confirm everything is OK now
SQL> SELECT count(*) FROM dba_invalid_objects;
SQL> SELECT distinct object_name FROM dba_invalid_objects;
SQL> @utlu112s.sql


The utlu112s.sql should now report all components as VALID. If not, you'll want to refer to the upgrade documentation for troubleshooting.

At the point the database is upgraded and open. Make sure you have a listener running and that the new database is registered. The only thing  left is the tell your OMS servers to look for the repository database in its new location.

Update OMS Repository Settings
First we need to start just the administration server:

$ emctl start oms -admin_only

This is necessary if you used the "-all" option when stopping OMS earlier. If you did not use "-all" then the admin server should still be running.

Now, update the store_repos_details setting in the OMS configuration:

$ emctl config oms -store_repos_details -repos_port 1521 \
  -repos_sid emrepo -repos_host newhost.mydomain.com \
  -repos_user sysman -repos_pwd xxx

Repeat this step for all your OMS servers (emctl should remind you to do so when changing the config). Then on each, completely shutdown and restart OMS:

$ emctl stop oms -all
$ emctl start oms

And that should be it! Don't forget to drop/delete the database from the original server when you're comfortable doing so.
Categories: DBA Blogs

ORA-00600 [kkzlpllg:5] When Dropping MView Log

Sun, 2014-03-30 18:43
This week a co-worker and I have been doing some rapid-fire testing to improve fast-refresh performance on an old materialized view, which does some summary aggregations on a 1.9 billion row (and growing) master table. One of the things we tested was using the new-in-11gR2 COMMIT SCN feature. There is a great blog post describing the benefits of this feature by Alberto Dell'Era. To quickly summarize and over-simplify, it provides a much faster way to update rows in the materialized view log that are eligible for refresh and purging. This definitely sounds like something we'd want, so let's roll!

Well we quickly hit a snag when testing our creation script the second time around, when it wouldn't let us drop the materialized view log:
SQL> DROP MATERIALIZED VIEW LOG ON FOO.BAR
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kkzlpllg:5], [], [], [], [], [],[], [], [], [], [], []


We found that we also could no longer perform fast refreshes of the materialized view, getting the same ORA-00600 error. Our initial MOS search turned up Doc ID 14158012.8, which indicates Bug 14158012 (Orphan rows in SNAP_LOGDEP$ causes ORA-600 [kkzlpllg:5]). The bug description is:
With this bug, when there are orphan rows in SNAP_LOGDEP$ with RSCN=NULL,
a CREATE MATERIALIZED VIEW or DROP MATERIALIZED VIEW statement will report ORA-600 [kkzlpllg:5].


I verified that we did have such NULL RSCN values in SNAP_LOGDEP$ related to the master table here.

SQL> select d.tableobj#, o.name
  2  from sys.snap_logdep$ d, sys.obj$ o
  3  where d.tableobj# = o.obj#
  4  and o.name='BAR'
  5  and d.rscn is null;

 TABLEOBJ# NAME
---------- ------------------------------
    605668 BAR
    605668 BAR

Unfortunately, that doc also said there was no workaround other than to apply a one-off patch (otherwise fixed in 11.2.0.4 and 12c)! Not exactly the quick fix we were hoping for. 

However, we did some more searching and found Doc 1310296.1: Internal Error ORA-00600 [kkzlpllg:5] When Executing Drop Materialized View Log

Pretty much the same thing, only this gives us a workaround:

Since the information stored in the Materialized View log is bogus in any case, the Materialized View Log is not useable anymore. The only option is to drop this object and re-create it. To make this possible, a dummy non-NULL value needs to be written into the RSCN column for that table.So we update those rows to set RSCN to a dummy value (9999):

SQL> UPDATE snap_logdep$
  2  SET RSCN = 9999
  3  WHERE tableobj# = 605668

  4  AND RSCN IS NULL;

And we were able to drop the materialized view log and resume testing afterwards.

Hopefully this article saves someone from hitting the same initial roadblock that I did. Especially nice to see the Oracle support docs contradicting themselves!

We also made another interesting find with this particular materialized view that I'll be blogging about later. Definitely an eye-opener, face-palmer and forehead-smacker all in one.

UPDATE - 8 April 2014

Nothing mind blowing, but a little time saver if you know the object name. This doesn't take the owner into account, so be careful out there.

update snap_logdep$
set rscn = 9999
where tableobj# = (
        select distinct d.tableobj#
        from sys.snap_logdep$ d, sys.obj$ o
        where d.tableobj# = o.obj#
        and d.rscn is null
        and o.name='BAR'
)
and rscn is null;


commit;

Categories: DBA Blogs

Pages