RAC (Real Application Clusters): February 2019

Learn RAC from the basics to in depth:

I will cover RAC part by part, from the basics to installation. [New 12c features will be covered later on.]

RAC ARCHITECTURE

So why RAC?

<< Single instance database drawbacks >>

Instance failure (no high availability): If the single instance crashes or fails, the service stops and your database is not available.

Server failure: The server can fail for many reasons, e.g. power failure, hardware failure, etc.

Limited capacity (no scalability): A single instance has limited capacity. Once the server reaches its limit (for example, its memory or the number of users that can connect to the instance), you cannot grow it any further.

RAC needs a cluster with shared storage

In simple terms, clusterware is software.
It groups two or more servers so that they act as a single server. [Physically they are different servers, but logically they are grouped together as one server. This is covered in depth in the clusterware topic.]

Shared storage is storage that is shared by every server in the cluster.

RAC instances:

The database is created on shared storage.
A RAC database has multiple instances.
Each instance runs on a physically different server.

Working of RAC

RAC provides availability: your database is still available even if node 1 or node 2 fails. [At least one node must be up at all times.]
To elaborate, if node 1 fails, the clients connected to node 1 can be redirected to node 2 automatically in RAC. [This is TAF; I will explain it later on.]

[To fully protect your database against failure, the MAA architecture should be used (RAC + Data Guard).]
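
The redirection idea can be pictured with a small Python sketch. This is only a toy model of "try the preferred node, fall back to a surviving one"; it is not how TAF is configured or implemented, and the names `Instance` and `connect` are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    up: bool = True


def connect(instances, preferred):
    """Return the name of the instance that ends up serving the client.

    The preferred instance is tried first, then the others in list order.
    """
    ordered = [i for i in instances if i.name == preferred]
    ordered += [i for i in instances if i.name != preferred]
    for inst in ordered:
        if inst.up:
            return inst.name
    raise ConnectionError("no instance is up; the database is unavailable")


nodes = [Instance("node1"), Instance("node2")]
first = connect(nodes, "node1")    # node1 is up, so it serves the client
nodes[0].up = False                # node1 fails
second = connect(nodes, "node1")   # the client is redirected to node2
print(first, second)
```

As long as one node is up, `connect` returns a usable instance; only when every node is down does the client get an error.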

Advantages of RAC:

High availability: protects against instance or server failure.

Scalability: more instances support more workload.

Workload management: the workload can be distributed among multiple instances.

Major components of RAC

Cluster (Oracle Clusterware)

Shared storage

Interconnect [private network between servers]

RAC software

Shared storage >

It stores the Oracle Clusterware files and the Oracle database files.
It should be cluster-aware storage [meaning any node can access it].
It can be DAS, NAS, SAN, or NFS.
It must be configured with a cluster file system (CFS) or ASM [we need CFS or ASM because a regular file system will not allow multiple nodes to access the data at the same time].

Cluster Interconnect >

It is a private network between the servers, used for inter-node communication and Oracle RAC cache fusion [cache fusion is the transfer of data blocks from the cache memory of one node to another node].
It should be high bandwidth and low latency [to avoid performance bottlenecks, minimum 1 Gbps].

RAC Networks [minimum 2 networks required]

Each node must have multiple network adapters.

Public network adapter [LAN- for client]
Private network adapter (interconnect adapter)
ASM Network (For 12c - optional)

RAC detailed architecture (we will discuss it in detail later on)

RAC provides additional performance from multiple instances by enabling the following features:

Cache fusion: Cache fusion is a cache coherency mechanism to transfer requests for specific blocks via the cluster interconnect, thus improving performance.
Sequence generators: RAC provides a clustered database solution with the database shared between two or more instances. All objects, including sequence numbers in the shared database, are accessible from one or more instances simultaneously.
System change number (SCN): The system change number generated on one instance is communicated via the cluster interconnect to the other instances, providing a single view of the transaction status across all instances. This communication of the SCN across the interconnect takes place without any additional overhead, by piggybacking on any message that is passed across the cluster interconnect. An SCN is at least 48 bits long. Thus SCNs can be allocated at a rate of 16,384 per second for over 534 years without running out of them.
Failover: A clustered configuration consists of two or more nodes participating as a collective configuration. In a clustered database, this type of configuration provides application failover by allowing reconnection to the database using another active instance in case the connection to the original instance is broken.
Scalability: By allowing members in a cluster to leave (in case of node failures, or for maintenance) or join the cluster (in case new nodes are added to the cluster), RAC provides scalability. Scalability helps to add additional configurations based on increased user requirements.
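
The SCN figure quoted in the list above is easy to check with a few lines of Python. The sketch assumes exactly 48 bits and a 365-day year; the variable names are only for this illustration.

```python
SCN_BITS = 48
SCNS_PER_SECOND = 16_384               # 2**14
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year

total_scns = 2 ** SCN_BITS
seconds_to_exhaust = total_scns // SCNS_PER_SECOND   # 2**34 seconds
years_to_exhaust = seconds_to_exhaust / SECONDS_PER_YEAR

print(f"{total_scns:,} SCNs last about {years_to_exhaust:.1f} years")
```

With these assumptions the result is roughly 544 years, which agrees with the "over 534 years" statement.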

Various components of the cluster hardware configuration

Cluster manager (CM) : CM is part of the clustered operating system that is responsible for providing cluster integrity. A high-speed interconnect is used to provide communication between nodes in the cluster. The CM uses the interconnect to process heartbeat messages between nodes. The function of the heartbeat messaging system is to determine which nodes are logical members of the cluster and to update the membership information on the nodes. Basically, the heartbeat messaging system enables the CM to understand how many members are in the cluster at any given point of time.

The communication layer manages the communication between the nodes. Its responsibility is to configure and pass messages across the interconnect to the other nodes in the cluster. While the CM uses the messages returned by the heartbeat mechanism, it is the responsibility of the communication layer to ensure the transmission of the message to the CM.
Interprocess communication protocol (IPC) in a clustered configuration is responsible for packaging the Oracle messages and passing them to and from the communication layer for the interconnect access.
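
The heartbeat idea described for the cluster manager can be sketched in Python. This is a toy model only: the class name `ClusterManager`, the `timeout` value, and the use of plain numbers as timestamps are assumptions for the illustration, not Oracle internals.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterManager:
    timeout: float                                  # seconds of silence before a node is dropped
    last_seen: dict = field(default_factory=dict)   # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat message received from a node."""
        self.last_seen[node] = now

    def members(self, now):
        """Nodes whose last heartbeat is recent enough to count as cluster members."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout)


cm = ClusterManager(timeout=3.0)
for node in ("node1", "node2", "node3"):
    cm.heartbeat(node, now=0.0)
members_at_start = cm.members(now=1.0)

cm.heartbeat("node1", now=4.0)
cm.heartbeat("node2", now=4.0)        # node3 stays silent
members_later = cm.members(now=5.0)
print(members_at_start, members_later)
```

The membership list shrinks as soon as a node stops sending heartbeats for longer than the timeout, which is the information the CM needs to know how many members are in the cluster at any point in time.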

Background Processes in RAC:

Oracle 11g uses the Global Cache Service to coordinate activity. A lock is treated as a held resource. RAC is a multi-instance database: multiple instances access the same database concurrently. In terms of structure, the difference between a RAC instance and a stand-alone Oracle instance is minuscule. Besides all the usual Oracle background processes, many special processes are spawned to coordinate inter-instance communication and to facilitate resource sharing among nodes in a cluster. The Oracle documentation goes through all of the processes if you are interested in knowing more. Here is a brief description of some of the main ones:

ACMS: The Atomic Controlfile to Memory Service (ACMS) is an agent on a per-instance basis that helps to ensure a distributed SGA memory update is globally committed on success and globally aborted on failure.
LMON: The Global Enqueue Service Monitor (LMON) monitors the entire cluster to manage global enqueues and resources. LMON manages instance and process expirations and the associated recovery for the Global Cache Service.
LMD: The Global Enqueue Service Daemon (LMD) is the lock agent process that manages enqueue manager service requests for Global Cache Service enqueues to control access to global enqueues and resources. The LMD process also handles deadlock detection and remote enqueue requests.
LMSn: The Global Cache Service processes (LMSn) are the processes for the Global Cache Service (GCS). RAC software provides for up to ten Global Cache Service processes. The number of LMSn processes varies depending on the amount of messaging traffic among nodes in the cluster. The LMSn processes do these things:
- Handle blocking interrupts from the remote instance for Global Cache Service resources.
- Manage resource requests and cross-instance call operations for shared resources.
- Build a list of invalid lock elements and validate lock elements during recovery.
- Handle global lock deadlock detection and monitor lock conversion timeouts.
LCK0: The Instance Enqueue Process (LCK0) manages global enqueue requests and cross-instance broadcasts. It manages non-cache-fusion requests such as library cache and row cache requests.
RMSn: The RAC management processes (RMSn) perform tasks such as the creation of resources as nodes are added.
RSMN: The Remote Slave Monitor (RSMN) performs remote instance tasks for a coordinating process.
GTX0-j: The Global Transaction Process supports global XA transactions.

Global Cache Service (GCS) and Global Enqueue Service (GES) [important for interviews]

GCS and GES (which are basically RAC processes) play the key role in implementing Cache Fusion. GCS ensures a single system image of the data even though the data is accessed by multiple instances. The GCS and GES are integrated components of Real Application Clusters that coordinate simultaneous access to the shared database and to shared resources within the database and database cache. GES and GCS together maintain a Global Resource Directory (GRD) to record information about resources and enqueues. GRD remains in memory and is stored on all the instances. Each instance manages a portion of the directory. This distributed nature is a key point for fault tolerance of the RAC.
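
To picture "each instance manages a portion of the directory", here is a minimal Python sketch that spreads resources over instances with a hash. The hashing scheme (`zlib.crc32` modulo the instance count), the resource naming, and the function name `master_for` are assumptions made purely for the illustration; the actual mastering algorithm is internal to Oracle.

```python
import zlib
from collections import Counter


def master_for(resource_id, instances):
    """Pick the instance that masters a resource (hypothetical hashing scheme)."""
    h = zlib.crc32(resource_id.encode())
    return instances[h % len(instances)]


instances = ["RAC1", "RAC2", "RAC3", "RAC4"]
resources = [f"file1-block{n}" for n in range(1000)]

masters = {r: master_for(r, instances) for r in resources}
share = Counter(masters.values())     # how many resources each instance masters
print(dict(share))
```

Because no single instance holds the whole directory, losing one instance only affects the portion it mastered, which is the fault-tolerance point made above.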

The coordination of concurrent tasks within a shared cache server is called synchronization. Synchronization uses the private interconnect and heavy message transfers. The following types of resources require synchronization: data blocks and enqueues. GCS maintains the modes for blocks in the global role and is responsible for block transfers between instances. LMS processes handle the GCS messages and do the bulk of the GCS processing.

An enqueue is a shared memory structure that serializes access to database resources. It can be local or global. Oracle uses enqueues in three modes: null (N) mode, share (S) mode, and exclusive (X) mode. Blocks are the primary structures for reading and writing into and out of buffers. An enqueue is often the most requested resource.
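
The three modes can be read as a compatibility table. The Python sketch below assumes the usual lock semantics (null conflicts with nothing, share is compatible with share, exclusive is compatible only with null); the names `COMPATIBLE` and `can_grant` are made up for the illustration.

```python
# (mode already held, mode requested) -> can both exist at the same time?
COMPATIBLE = {
    ("N", "N"): True,  ("N", "S"): True,  ("N", "X"): True,
    ("S", "N"): True,  ("S", "S"): True,  ("S", "X"): False,
    ("X", "N"): True,  ("X", "S"): False, ("X", "X"): False,
}


def can_grant(requested, held_modes):
    """True if the requested mode is compatible with every mode already held."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)


print(can_grant("S", ["S", "N"]))   # readers can share
print(can_grant("X", ["S"]))        # a writer must wait for readers
```

A request that cannot be granted has to wait (or the holders have to downgrade), which is exactly the serialization an enqueue provides.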

GES maintains or handles the synchronization of the dictionary cache, library cache, transaction locks, and DDL locks. In other words, GES manages enqueues other than data blocks. To synchronize access to the data dictionary cache, latches are used in exclusive (X) mode and in single-node cluster databases. Global enqueues are used in cluster database mode.

Cache Fusion and Resource Coordination [ will explain in detail below]

Because each node in a Real Application Cluster has its own memory (cache) that is not shared with other nodes, RAC must coordinate the buffer caches of different nodes while minimizing additional disk I/O that could reduce performance. Cache Fusion is the technology that uses high-speed interconnects to provide cache-to-cache transfers of data blocks between instances in a cluster. Cache Fusion functionality allows direct memory writes of dirty blocks to alleviate the need to force a disk write and reread (or ping) of the committed blocks. This is not to say that disk writes do not occur; disk writes are still required for cache replacement and when a checkpoint occurs. Cache Fusion addresses the issues involved in concurrency between instances: concurrent reads on multiple nodes, concurrent reads and writes on different nodes, and concurrent writes on different nodes.

Oracle only reads data blocks from disk if they are not already present in the buffer caches of any instance. Because data block writes are deferred, they often contain modifications from multiple transactions. The modified data blocks are written to disk only when a checkpoint occurs. Before I go further, you need to be familiar with a couple of concepts introduced in Oracle 9i RAC: resource modes and resource roles. Because the same data blocks can concurrently exist in multiple instances, there are two identifiers that help to coordinate these blocks:

Resource mode: The modes are null, shared, and exclusive. The block can be held in different modes, depending on whether a resource holder intends to modify data or merely read it.
Resource role: The roles are locally managed and globally managed.

Global Resource Directory (GRD) is not a database. It is a collection of internal structures and is used to find the current status of the data blocks. Whenever a block is transferred out of a local cache to another instance’s cache, the GRD is updated. The following information about a resource is available in GRD:

Data block addresses (DBA)
Location of most current versions
Data blocks modes (N, S, X)
Data block roles (local or global)
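
The four items above can be pictured as one record per block. The Python sketch below is a simplified model, not the real GRD layout: the class `GrdEntry`, its field names, and the rule that the receiving instance becomes the recorded location are assumptions for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GrdEntry:
    dba: int                                    # data block address
    location: str                               # instance holding the most current version
    modes: dict = field(default_factory=dict)   # instance -> "N", "S" or "X"
    roles: dict = field(default_factory=dict)   # instance -> "local" or "global"

    def record_transfer(self, to_instance, mode):
        """Update the entry when the block is shipped to another instance's cache."""
        for inst in self.roles:
            self.roles[inst] = "global"         # the block now lives in more than one cache
        self.modes[to_instance] = mode
        self.roles[to_instance] = "global"
        self.location = to_instance


entry = GrdEntry(dba=500, location="RAC3", modes={"RAC3": "S"}, roles={"RAC3": "local"})
entry.record_transfer("RAC2", "S")
print(entry)
```

After the transfer, the entry shows the block in two caches with a global role, which is the status information the GRD is used to look up.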

Past Image

To maintain data integrity, a new concept of past image was introduced in the 9i version of RAC. A past image (PI) of a block is kept in memory before the block is sent and serves as an indication of whether it is a dirty block. In the event of failure, GCS can reconstruct the current version of the block by reading PIs. This PI is different from a CR block, which is needed to reconstruct read-consistent images. The CR version of a block represents a consistent snapshot of the data at a point in time.

Cache fusion in detail:

Cache fusion is a technology that uses a high-speed interprocess communication interconnect to provide cache-to-cache transfer of data blocks between instances in a cluster. This way of transferring data across nodes through the interconnect became a viable option as the bandwidth of interconnects increased and the transport mechanism improved. The cache fusion architecture is revolutionary in an industry sense because it treats the physically distinct RAM of each cluster node logically as one large database SGA, with the interconnect providing the physical transport among them.

Prior to Oracle 9i RAC, transferring a data block from one node to another involved writing the block from the database buffer cache of the holding node to the shared disk storage. The requesting node read the data block from disk into its own cache.
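
The difference between the old disk-based transfer and a cache-to-cache transfer can be shown by counting disk I/O in a toy Python model. The class `SharedDisk` and both transfer functions are hypothetical names for this sketch; caches are plain dictionaries.

```python
class SharedDisk:
    """Toy shared storage that counts its reads and writes."""

    def __init__(self):
        self.blocks = {}
        self.reads = 0
        self.writes = 0

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.writes += 1

    def read(self, block_id):
        self.reads += 1
        return self.blocks[block_id]


def transfer_via_disk(disk, holder_cache, requester_cache, block_id):
    """Pre-cache-fusion style: the holder writes the block, the requester reads it back."""
    disk.write(block_id, holder_cache[block_id])
    requester_cache[block_id] = disk.read(block_id)


def transfer_via_interconnect(holder_cache, requester_cache, block_id):
    """Cache-fusion style: the block is copied cache to cache, with no disk I/O."""
    requester_cache[block_id] = holder_cache[block_id]


disk = SharedDisk()
node1, node2, node3 = {500: "row data"}, {}, {}
transfer_via_disk(disk, node1, node2, 500)
io_after_disk_transfer = (disk.reads, disk.writes)
transfer_via_interconnect(node2, node3, 500)
io_after_cache_transfer = (disk.reads, disk.writes)
print(io_after_disk_transfer, io_after_cache_transfer)
```

The disk-based transfer costs one write and one read on shared storage, while the interconnect transfer leaves the I/O counters unchanged.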

Cache fusion, a natural evolution of the Oracle Parallel Server (OPS) architecture, implements cache synchronization using a write-back model. The GCS and GES processes on each node manage the synchronization by using the cluster interconnect for data block movement between nodes. Cache fusion addresses transaction concurrency between instances. This section gives a brief introduction to the different scenarios of inter-cluster sharing of blocks and explores how they work.

Concurrent reads on multiple nodes: This occurs when two or more instances participating in the clustered configuration need to read the same block of information. The requested block is shared between the instances via the cluster interconnect. The first instance that reads the block is the owner of the block, and subsequent instances that need access to the same block request it via the cluster interconnect.

Concurrent reads and writes on different nodes: This is the common type of concurrency that would be noticed, a mixture of read/write operations against the same block. In this scenario the architecture is similar to that of a single instance except that they happen across the cluster interconnect through a different set of background processes. A block can be read from the current version or from the read-consistent previous version of the block.

Concurrent writes on different nodes: This kind of operation could be classified, or incorporated, into what was just discussed. This is a situation where multiple instances request modification of the same data block frequently. Again, the outcome, or the solution, to complete these requests is via the cluster interconnect.

In all of these block transfer requests between instances over the interconnect, the GCS process plays a significant role as the master/keeper of all requests between instances. GCS tracks the location and status of data blocks as well as the access privileges of various instances. Oracle uses the GCS for cache coherency when the current version of a data block is in one instance's buffer cache and another instance requests that block for modification.

In this example, instance RAC3 has read block 500 from the database and is currently holding the block. Now instance RAC2 requires the same block and makes a request to retrieve it from the database. Instance RAC2 during this process communicates with the GCS process, which could be resident (depending on the resource master) on any of the nodes. For this example, we will place it on instance RAC4. Instance RAC4 understands that instance RAC3 is currently the holder of the block and requests instance RAC3 to transfer the block via the cluster interconnect to instance RAC2.
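
The three-party exchange in this example (requester RAC2, resource master on RAC4, holder RAC3) can be walked through with a short Python sketch. It is a simplified model: the classes `Instance` and `ResourceMaster` and the `request` method are hypothetical, and the "interconnect" is just a dictionary copy.

```python
class Instance:
    def __init__(self, name):
        self.name = name
        self.cache = {}               # block id -> block contents


class ResourceMaster:
    """Hypothetical stand-in for the GCS role played by the mastering instance."""

    def __init__(self, name):
        self.name = name
        self.holder_of = {}           # block id -> Instance currently holding the block
        self.log = []

    def request(self, requester, block_id):
        holder = self.holder_of[block_id]
        self.log.append(
            f"{self.name}: ask {holder.name} to ship block {block_id} to {requester.name}"
        )
        requester.cache[block_id] = holder.cache[block_id]   # transfer over the interconnect
        self.holder_of[block_id] = requester                 # master records the new holder
        return holder.name


rac2, rac3 = Instance("RAC2"), Instance("RAC3")
rac4 = ResourceMaster("RAC4")

rac3.cache[500] = "contents of block 500"   # RAC3 has read block 500 from the database
rac4.holder_of[500] = rac3                  # the master knows RAC3 holds it

shipped_by = rac4.request(rac2, 500)        # RAC2 asks the master for block 500
print(rac4.log[0])
```

RAC2 never touches the disk here: it asks the master, and the master has the current holder ship the block directly.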

[The resource master for a specific data file is obtained by querying GV$GCSHVMASTER_INFO.]

SQL> SELECT INST_ID, HV_ID, CURRENT_MASTER, PREVIOUS_MASTER, REMASTER_CNT
       FROM GV$GCSHVMASTER_INFO
      WHERE REMASTER_CNT > 0;

The cache-to-cache data transfer is done through the high-speed IPC interconnect. This virtually eliminates any disk I/O to achieve cache coherency.
The GCS tracks one or more past images (PI) for a block in addition to the traditional GCS resource roles and modes. (A past image is the copy of a block retained after the holding instance has shipped the block to the requesting node.) The amount of work that the GCS is required to perform is proportional to the number of instances participating in the clustered configuration.

When multiple instances require access to a block, and a different node masters the block, it is the GCS resources that track the movement of blocks through the cluster. As a result of block transfers between instances, multiple copies of the same block could be on different nodes. These blocks in different instances have different resource characteristics. These characteristics are identified by the following factors:

Resource mode
Resource role

Resource mode: The resource mode is determined by various factors, such as who the original holder of the block is, what operation the block was acquired for, what operation the requesting instance intends to perform, what the outcome of the operation is, etc. [null, shared, exclusive]

Resource role: While the resource modes are being maintained between the instances, a resource can be held by the local instance only, where the requirement is of a local nature, or it can be used by more than one instance, where the requirement is of a global nature. [local and global]

Global resource directory [ In Depth ]

The global resource directory (GRD) contains information about the current status of all shared resources. It is maintained by the GCS and GES to record information about resources and enqueues held on these resources. GRD resides in memory and is used by GCS and GES to manage the global resource activity. It is distributed throughout the cluster to all nodes. Each node participates in managing global resources and manages a portion of the GRD.

When an instance reads a data block for the very first time, its existence is local, that is, no other instance in the cluster has a copy of the same block. The block in this state is called a current state block (XI). Therefore, the behavior of this block in memory is similar to any single instance configuration, with the exception that GCS keeps track of the block even in a local mode. Multiple transactions within the instance have access to the data block. Once another instance has requested the same block, the GCS process updates the GRD, taking the state of the data block from a local role to a global role.

The structure and function of the GRD is similar to a redo log buffer. The redo log buffer contains the current and past images of the rows being modified, while the GRD contains information at a higher level, specifically the current and past image of the blocks being modified by the various instances in the cluster.

Database block address (DBA): This is the basic address of the block that is being modified. An example would be block 600. This indicates that block 600 is accessed by a user on its current instance. Other values, such as mode (null, shared, or exclusive) and role (local or global), determine whether the current instance is the original holder or a requester of the block.

Location: This indicates the location of the current version of the data block. (This value is only present if multiple nodes share the block.)

Mode: This indicates the mode in which the data blocks are being held by instances.

Role: This indicates the role in which the data block is being held by each instance.

System change number (SCN): The system change number is required in a single instance configuration to serialize activities such as block changes, redo entries, and replay of redo logs during a recovery operation. The SCN has a more robust role in a RAC environment.

Depending on the type of block request, each instance participating in the transaction maintains a PI of the block being requested. This enables Oracle to generate redo logs in an orderly manner and to accommodate subsequent recovery processing. Hence, the SCNs generated by all instances have to be synchronized globally.

To make the activity less complex, separate SCNs are generated by each instance. However, with the database being a commonly shared database, and in an effort to keep the transactions in a serial order, these SCNs have to resynchronize their own SCN to the highest SCN known in the cluster.

The method used by Oracle to resynchronize its own SCN to the highest SCN in the cluster is a broadcasting mechanism known as Lamport generation. Under this scheme, SCNs are generated in parallel on all instances and Oracle piggybacks an instance's current SCN onto any message being sent via the cluster interconnect to another instance. This allows the SCN to be propagated between instances without incurring any additional message overhead. Once propagated, the GCS process will manage the SCN synchronization process. The default interval is based on the platform-specific message threshold value of 7 seconds.
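
The piggybacking scheme can be sketched as a Lamport-style clock in Python. This is a simplified illustration: the class `Instance`, the message format, and the rule "take the maximum of the local and received SCN" are assumptions for the sketch, not Oracle's exact implementation.

```python
class Instance:
    def __init__(self, name, scn=0):
        self.name = name
        self.scn = scn

    def local_change(self):
        """Generate the next SCN locally, without talking to other instances."""
        self.scn += 1
        return self.scn

    def send(self, payload):
        """Any interconnect message carries the sender's current SCN along with it."""
        return {"from": self.name, "payload": payload, "scn": self.scn}

    def receive(self, message):
        """Resynchronize to the highest SCN seen so far."""
        self.scn = max(self.scn, message["scn"])


rac1 = Instance("RAC1", scn=100)
rac2 = Instance("RAC2", scn=40)

rac1.local_change()                               # RAC1 moves to 101
rac2.receive(rac1.send("block 500 request"))      # RAC2 catches up to 101
next_scn_on_rac2 = rac2.local_change()            # RAC2's next change gets 102
print(rac1.scn, rac2.scn, next_scn_on_rac2)
```

No extra message is needed for the synchronization: the SCN rides along with traffic that is being sent anyway, and changes made after the message always get a higher SCN than the ones made before it.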

Past image (PI): The past image is a copy of a globally dirty block image maintained in cache. It is saved when a dirty block is served to another instance after setting the resource role to global. A PI must be maintained by an instance until it, or a later version of the block, is written to disk. The GCS is responsible for informing an instance that its PI is no longer needed, after another instance writes the block. A PI can be discarded when an instance writes a current block image to disk. With cache fusion, such writes occur to satisfy checkpoint requests and not to transfer blocks between instances.

Current image (XI): The current image is a copy of a block held by the last (current) instance in the chain of instances that requested and transferred an image of the block.

The GRD tracks consistent read block images with a local resource in NULL mode. Once tracked, GRD does not have to retain any information about a resource being held in NULL mode by an instance. However, once it has some kind of global allocation, global block resource information is stored in the GRD to manage the history of block transfers even if the resource mode is NULL. With local resources, the GCS discards resource allocation information for instances which downgrade a resource to NULL mode.

Lock structure

When a block held by an instance is requested by another instance, the original holder makes a copy of the block, which becomes the past (previous) image of the block. To keep track of these PIs, Oracle uses locking of blocks. These locks are enforced through roles and modes.

When a lock is acquired for the first time by an instance, only that specific instance has a copy of the block. Therefore, the block is acquired with a local role. If the block was acquired from a remote instance, the holding instance will make a copy of the original block before transmitting the block to the requesting instance. This copy is the past image. There are therefore two possibilities: one where a PI is maintained and one where a PI is not maintained. These possibilities are represented by a Boolean value of "yes" or "no". Together, the mode, the role, and the presence of a PI make up the lock structure.

Representation

The lock structure is written as three characters: one for the mode, one for the role, and one for whether a PI is maintained.

A block held by a single instance has only a single copy, and that instance is in control of its modification and of writing the changes to disk. However, when another instance requests the same block, its status changes from a local to a global state. Only in this situation, where one or more other instances also contain a copy of the current block, does the original instance need to keep the original, or PI, of the block.
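
A three-part lock state built from mode, role, and PI can be modeled in Python as below. The concrete encoding used here (N/S/X for the mode, L/G for the role, 0/1 for the PI flag) is an assumption made for the illustration, as are the names `LockState` and `parse`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockState:
    mode: str       # "N" (null), "S" (shared) or "X" (exclusive)
    role: str       # "L" (local) or "G" (global)
    has_pi: bool    # is a past image being kept?

    def code(self):
        """Encode the state as three characters (assumed encoding)."""
        return f"{self.mode}{self.role}{int(self.has_pi)}"


def parse(code):
    """Decode a three-character lock string back into a LockState."""
    mode, role, pi = code           # raises ValueError unless exactly 3 characters
    if mode not in "NSX" or role not in "LG" or pi not in "01":
        raise ValueError(f"not a valid lock code: {code!r}")
    return LockState(mode, role, pi == "1")


first_holder = LockState("X", "L", False)   # block held exclusively by one instance
after_shipping = LockState("N", "G", True)  # hypothetical state once the block was shipped and a PI kept
print(first_holder.code(), after_shipping.code())
```

The PI flag can only be meaningful together with a global role, since a PI is kept only once another instance also has the block.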

Global cache management

Cache fusion brings about inter-instance transfer of data blocks between the buffer cache of one instance to the buffer cache of another instance.

It is the GCS that tracks the locations, modes, and roles of data blocks and makes updates to the GRD. By tracking these states, the GCS plays an important role in global cache management, acquiring resources at a cluster-wide level and providing cache coherency when the current version of a data block is in one instance's buffer cache and another instance requests that block for modifications. GCS, in its ultimate wisdom of managing the global cache, also ensures that only one instance could modify a block at a given time.

When a user process from another instance makes a request for a block of data that the current instance is holding, the LMD process builds the initial block and passes it to the GRD. If the GRD contains the block information (based on the type of request), it creates a PI, assigns an SCN, and passes the block to the LMS process. In turn, the LMS process returns the block to the requesting instance, and it finally reaches the user process.

When an instance needs to write a block to disk upon a checkpoint request, the instance checks the role of the resource covering the block. If the role is global, the instance must inform the GCS of the write requirement. The GCS is responsible for finding the most current block image and informing the instance holding the image to perform the block write. The GCS then informs all holders of the global resource that they can release the buffers holding the PI copies of the block, thus allowing the global resources to be released.

(Just to remember the concept.)
In simple words, suppose our team has a WhatsApp group with 4 members, 4 contact numbers, and 1 admin. Now one member changes his number (say Nitin changes his number), so he calls the admin (the GCS) to say that a change is required.
The admin then updates the number and informs everyone to delete the old number (the PI).
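
The checkpoint write of a globally held block can also be sketched in Python. This is a toy model under stated assumptions: the class `Gcs`, its method names, and the bookkeeping with dictionaries are invented for the illustration and leave out modes, SCNs, and redo.

```python
class Gcs:
    """Toy coordinator tracking the current image and the PIs of each block."""

    def __init__(self):
        self.current_holder = {}   # block id -> instance with the current image
        self.pi_holders = {}       # block id -> set of instances keeping a PI
        self.disk = {}             # block id -> image written to disk

    def ship(self, block_id, from_inst, to_inst):
        """The holder keeps a PI and the requester becomes the current holder."""
        self.pi_holders.setdefault(block_id, set()).add(from_inst)
        self.current_holder[block_id] = to_inst

    def write_request(self, block_id, image):
        """Have the current holder write the block, then release every PI."""
        writer = self.current_holder[block_id]
        self.disk[block_id] = image
        released = sorted(self.pi_holders.pop(block_id, set()))
        return writer, released


gcs = Gcs()
gcs.current_holder[500] = "RAC1"
gcs.ship(500, "RAC1", "RAC2")      # RAC1 keeps a PI, RAC2 holds the current image
gcs.ship(500, "RAC2", "RAC3")      # RAC2 keeps a PI, RAC3 holds the current image

writer, released = gcs.write_request(500, image="latest image of block 500")
print(writer, released)
```

In the WhatsApp picture, `write_request` is the admin updating the number, and the returned `released` list is everyone who is told to delete the old one.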

A block written record (BWR) is placed in the redo log buffer when an instance writes a block covered by a global resource, or when it is told it can free up a PI buffer. Recovery processes use this record to indicate that redo information for the block is not needed prior to this point.

During inter instance transfer of blocks, there could be a situation when an instance receives a current copy of a block for which it already has a PI copy. If, during the write request operation, an instance receives a current copy of the block for which it already has a PI copy, the instance will keep both the copies of the block. The receiving instance has to serve the block to another instance and the GCS includes an indication of whether a write is in progress that would free the PI.

If a write is not occurring, the instance replaces the old PI with a new PI created from the current image. This creates a single string of redo for the block, terminated by just one BWR when the block is finally written to disk. Under such circumstances the instance creates a new PI from the current image.

When the current image is served, a write-in-progress bit is set in the block if the block is holding an exclusive mode resource. This is required to synchronize block writes when the serving instance holds the original local role resource. The SCN of the PI is used during instance recovery to reconstruct the current and consistent read version of the block.

To be continued...

Wednesday, February 13, 2019
