How PARANOID are you with data? Or are there DIRTY READs in Oracle???

Some weeks ago, a friend and I were discussing an issue he had seen in some Oracle databases: what looked like lost data, or in other words, sessions seeing UNCOMMITTED data that later disappeared.

Allowing a session to read uncommitted data being changed by another transaction is a transaction isolation level called READ UNCOMMITTED, which is not supported by Oracle. Such a read is known as a DIRTY READ.

Many database scholars argue that a dirty read should never be allowed under any circumstances, and I tend to agree. Oracle seems to agree as well, since by design it does not support this isolation level, unlike some other databases.

So, the idea of Oracle allowing such thing sounds crazy, right? Or maybe a bug?

Well, after some research, my friend found a post by Jonathan Lewis explaining a very rare but possible situation in which these DIRTY READS can take place in Oracle and cause data confusion.

And I was able to reproduce the issue in several different versions, including 23c FREE.

In this post I will explain in detail how to reproduce the issue, and then I will present the solution that will make the title of this post make sense.

Basically, the problem happens when the LGWR process hangs, preventing the COMMIT from completing successfully. By default, Oracle executes the COMMIT steps in memory first, writing the redo entries to the Redo Log Buffer, and only then calls LGWR to write them to the Redo Log Files on disk. In the rare case where LGWR is hung for whatever reason but the instance is still working, another session may end up querying and seeing the changed data even though the COMMIT was never persisted to disk.
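As a quick way to observe this two-stage mechanism, you can watch the redo statistics: "redo entries" counts what server processes write to the Redo Log Buffer in memory, while "redo writes" counts LGWR's writes to disk. The view and statistic names below are standard; this query is just a suggestion for exploring the mechanism, not part of the reproduction itself:

select name, value
  from v$sysstat
 where name in ('redo entries', 'redo size', 'redo writes');

With a healthy LGWR, "redo writes" keeps growing as sessions commit; with LGWR hung, only the in-memory side of the mechanism keeps moving.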

In order to reproduce the behavior, we will follow the steps exactly as Jonathan Lewis shows:

Initial Preparation

To prepare I will just create a table and insert a row of data:

create table t1(n1 number);

Table created.

insert into t1 values(1);

1 row created.

commit;

Commit complete.

Step 1 – Session 1 changes some data

Now let’s change some data:

col con_name for a10
col sid for a10
select SYS_CONTEXT ('userenv','con_name') con_name, SYS_CONTEXT ('userenv','sid') sid from dual;

CON_NAME   SID
---------- ----------
FREEPDB1   51

update t1 set n1 = 2;

1 row updated.

select * from t1;

	    N1
----------
	     2

Step 2 – Session 2 connects (or is already connected)

This is the session that will see the dirty read. It needs to be connected before the LGWR fails, because after that no new session will be able to connect.

col con_name for a10
col sid for a10
select SYS_CONTEXT ('userenv','con_name') con_name, SYS_CONTEXT ('userenv','sid') sid from dual;

CON_NAME   SID
---------- ----------
FREEPDB1   271

Step 3 – LGWR hangs

Let’s find the LGWR PID# and use it with oradebug to force a hang (putting the process in a suspended state for the tests).

In order to use oradebug, I will need a third session connected as SYSDBA at the CDB level:

col con_name for a10
col sid for a10
select SYS_CONTEXT ('userenv','con_name') con_name, SYS_CONTEXT ('userenv','sid') sid from dual;

CON_NAME   SID
---------- ----------
CDB$ROOT   237

select * from session_privs where privilege = 'SYSDBA';

PRIVILEGE
----------------------------------------
SYSDBA

Then, I will identify the LGWR PID:

select pid from v$process where pname='LGWR';

       PID
----------
	    22

And finally I will put the process to sleep:

oradebug setorapid 22
Oracle pid: 22, Unix process pid: 88, NID: 4026532473, image: oracle@bcadfe8ca782 (LGWR)
oradebug suspend
Statement processed.

I won’t show it here, but you can test it yourself: from this moment on, any attempt to open a new session will hang.

Step 4 – Session 1 issues a COMMIT

When the COMMIT is issued, the server process serving that session first works in memory, writing the COMMIT redo entry, and then calls LGWR to write the redo entries from the Redo Log Buffer into the Redo Log File on disk.

select SYS_CONTEXT ('userenv','con_name') con_name, SYS_CONTEXT ('userenv','sid') sid from dual;

CON_NAME   SID
---------- ----------
FREEPDB1   51

commit;
-- THE SESSION WILL HANG

Since the LGWR is hung, the COMMIT does not finish and the session hangs as well. BUT the work in memory is already done, so from the memory perspective the COMMIT has already succeeded.
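If you want to confirm what Session 1 is stuck on, you can check its wait event from the SYSDBA session. A session waiting for LGWR to flush its commit record waits on the well-known "log file sync" event (SID 51 is the one from my test; adjust it for yours):

select sid, event, state
  from v$session
 where sid = 51;

While LGWR is suspended, this should show the session waiting on "log file sync" indefinitely.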

Step 5 – Session 2 issues a SELECT against the data and receives a DIRTY READ in response

select SYS_CONTEXT ('userenv','con_name') con_name, SYS_CONTEXT ('userenv','sid') sid from dual;

CON_NAME   SID
---------- ----------
FREEPDB1   271

select * from t1;

        N1
----------
         2

See that? Even with the COMMIT not confirmed and the redo data not persisted on disk, Session 2 was able to see the modified data, which should NOT be allowed.

So, this is a VERY VERY rare situation that only happened because a background process (the LGWR) stopped working properly and, right after, without the instance failing, one session changed data and another session read it. But although it’s rare, it can happen, as demonstrated here.

Simulating an instance failure

What happens now if the instance fails? Well, since the COMMIT was never persisted on disk, the transaction should be rolled back after the instance comes back up, so theoretically the data change never happened.

Let’s see it in action: in the SYSDBA session, without resuming the LGWR, I will force a restart of the instance:

startup force
ORACLE instance started.

Total System Global Area 1603679416 bytes
Fixed Size                  5313720 bytes
Variable Size             973078528 bytes
Database Buffers          620756992 bytes
Redo Buffers                4530176 bytes
Database mounted.
Database opened.

Now that the instance is up again, I will reconnect as a normal user and check the data:

col con_name for a10
col sid for a10
select SYS_CONTEXT ('userenv','con_name') con_name, SYS_CONTEXT ('userenv','sid') sid from dual;

CON_NAME   SID
---------- ----------
FREEPDB1   33

select * from t1;

        N1
----------
         1

As expected, the transaction was lost and the original value is back.

Now, the solution

So, as I mentioned in the beginning, I was able to reproduce the same behavior in different Oracle versions: 12, 19 and 23.

BUT, in 23c there is a new parameter called paranoid_concurrency_mode that controls this behavior. I don’t know exactly how this parameter works internally, since at this point I could not find any documentation about it.

Its original value is FALSE, and when it is changed to TRUE, the behavior demonstrated here does not happen: in that case, Session 2 also hangs when trying to issue a SELECT against the changed data.

So, this is a solution for the problem, and also a very rare case in which a SELECT can hang in Oracle.

Some important notes here:

• I was only able to make the parameter work by changing it at the CDB level and restarting the instance (it’s not modifiable online). When I changed it at the PDB level, the problem persisted, which makes sense: the LGWR is a background process of the instance, so to change its behavior it’s expected that the parameter be changed at the CDB level.
• I don’t know what the disadvantages of changing this parameter to TRUE are, what else it changes, or even what the performance penalty for the instance would be. Although it’s not a hidden parameter (it doesn’t start with an underscore), since there is no documentation at this point, you should ask My Oracle Support before changing it.
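For reference, the change I tested was along these lines, executed in the CDB root (again: the parameter is undocumented, so treat this as experimental and check with MOS before using it anywhere that matters):

alter system set paranoid_concurrency_mode = true scope = spfile;
shutdown immediate
startup

After the restart, repeating Steps 1 to 5 makes Session 2 hang on the SELECT instead of returning the uncommitted value.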

So that’s it for now.

This is just a case study, so PLEASE do not test or make any change in a Production database without proper testing and, in this particular case, without the OK from MOS.

I hope you liked it.

See you in the next post!

4 thoughts on “How PARANOID are you with data? Or are there DIRTY READs in Oracle???”

1. Frank Wilhoit

One of my employers was a big DB2 shop, and the DBAs forced us to use dirty reads for everything. The applications were poorly indexed, so (we were told) any joins caused a cascade of table locks that in one case reached all the way up to the system catalog, preventing even the nightly backup from running. Again, this is what we were told.

1. Eduardo Claro Post author

Yeah Frank, this is one of the basic things that explains why Oracle is superior to most other databases: the locking mechanism is much better implemented, so Oracle NEVER needs to escalate locks. And when a SELECT is issued against data being modified by another session’s transaction, these other DBs usually block the SELECT statement until the transaction is completed. Oracle NEVER blocks a SELECT; it uses the UNDO to present consistent data even with ongoing transactions. Or, to simplify: the concurrency controls in Oracle are much better.
