10 Plan an Oracle Data Guard Deployment
Analyze your specific requirements, including both the technical and operational aspects of your IT systems and business processes, understand the availability impact for the Oracle Data Guard architecture options, and consider the impact of your application and network.
Oracle Data Guard Architectures
The Gold MAA reference architecture provides you with four architecture patterns, using Oracle Active Data Guard to eliminate single point of failure. The patterns vary from a single remote active standby with Fast Start Failover and HA Obeserver, to including far sync instances, multiple standbys, and reader farms.
When planning your Gold MAA Reference Architecture, see High Availability Reference Architectures for an overview of each Gold architecture pattern, and choose the elements to incorporate based on your requirements.
Application Considerations for Oracle Data Guard Deployments
As part of planning your Oracle Data Guard deployment, consider the resources required and application availability requirements in a fail over scenario.
Deciding Between Full Site Failover or Seamless Connection Failover
The first step is to evaluate which failover option best meets your business and application requirements when your primary database or primary site is inaccessible or lost due to a disaster.
The following table describes various conditions for each outage type and recommends a failover option in each scenario.
Table 10-1 Recommended Failover Options for Different Outage Scenarios
Outage Type | Condition | Recommended Failover Option |
---|---|---|
Primary Site Failure (including all application servers) | Primary site contains all existing application servers (or mid-tier servers) that were connected to the failed primary database. | Full site failover is required |
Primary Site Failure (with some application servers surviving) |
Some or all application servers are not impacted and the surviving application servers can reconnect to new primary database in a secondary disaster recovery site. Application performance and throughput is still acceptable with different network latency between application servers and new primary database in a secondary disaster recovery site. Typically analytical or reporting applications can tolerate higher network latency between client and database without any noticeable performance impact, while OLTP applications performance may suffer more significantly if there is an increase in network latency between the application server and database. |
Seamless connection failover is recommended to minimize downtime and enable automatic application and database failover. |
Complete Primary Database or Primary Server Failure |
Application servers are not impacted and users can reconnect to new primary database in a secondary disaster recovery site. Application performance and throughput is still acceptable with different network latency between application servers and new primary database in a secondary disaster recovery site. Typically analytical or reporting applications can tolerate higher network latency between client and database without any noticeable performance impact, while OLTP applications performance may suffer more significantly if there is an increase in network latency between the application server and database. |
If performance is acceptable, seamless connection failover is recommended to minimize downtime and enable automatic application and database failover. Otherwise, full site failover is required. |
Full Site Failover Best Practices
A full site failover means that the complete site fails over to another site with a new set of application tiers and a new primary database.
Complete site failure results in both the application and database tiers becoming unavailable. To maintain availability, application users must be redirected to a secondary site that hosts a redundant application tier and a synchronized copy of the production database.
Consider the two figures below. The first figure shows the network routes before failover. Client or application requests enter the Primary site at the client tier, and are routed to the application server and database server tiers on the primary site.
Figure 10-1 Network Routes Before Site Failover
Description of "Figure 10-1 Network Routes Before Site Failover"
The second figure, below, illustrates the network routes after a complete site failover. Client or application requests enter the Secondary site at the client tier and follow the same path on the secondary site that they followed on the primary site.
Figure 10-2 Network Routes After Site Failover
Description of "Figure 10-2 Network Routes After Site Failover"
MAA best practice is to maintain a running application tier at the standby site to avoid incurring start-up time, and to use Oracle Data Guard to maintain a synchronized copy of the production database. Upon site failure, a WAN traffic manager is used to execute a DNS failover (either manually or automatically) to redirect all users to the application tier at standby site while a Data Guard failover transitions the standby database to the primary production role.
Use Oracle Active Data Guard Fast-Start Failover to automate the database failover. Application server and non-database failovers can be automated and coordinated by using Oracle Site Guard. Oracle Site Guard orchestrates and automates any operations, such as starting up application servers on the secondary site, resynchronizing non-database meta data as Data Guard fails over automatically.
For more information about Oracle Site Guard, see the Oracle Site Guard Administrator's Guide.
Configuring Seamless Connection Failover
Automating seamless client failover in an Oracle Data Guard configuration includes relocating database services to the new primary database as part of a Data Guard failover, notifying clients that a failure has occurred to break them out of TCP timeout, and redirecting clients to the new primary database.
In the following figure, a database request is interrupted by an outage or timeout (1), so the session reconnects to the Oracle RAC cluster (2) (or standby) (2), the database request replays automatically on the alternate node (3), and the result from the database request is returned to the user (4).
To achieve seamless connection failover, configure Fast Connection Failover (FCF) as a best practice to fully benefit from fast instance and database failover and switchover with Oracle RAC and Oracle Data Guard. FCF enables clients, mid-tier applications, or any program that connects directly to a database to fail over quickly and seamlessly to an available database service when a database service becomes unavailable.
- Use an Oracle Clusterware managed service that is not the default database service (the default service has the same name as the database or PDB). The services that you create provide location transparency and high availability features.
-
Use the recommended connection string as discussed in the Application Continuity white paper with built in timeouts, retries, and delays so that incoming connections do not see errors during outages.
-
Fast Application Notification (FAN) is a mandatory component to initiate draining, to break out of failures, and to rebalance sessions when services resume and when load imbalances occur. For outages such as node and network failures, fast failover of the application does not happen if the client is not interrupted by FAN. FAN applies to all failover solutions.
-
When maintenance starts, drain application connections from the instances or nodes targeted for maintenance. Enable FAN with Universal Connection Pools (UCP) or ODP.NET connection pools or connection tests (or both). Connection pools with FAN are the best solution because pools provide a full life cycle of session movement, that is, the draining and rebalancing of application connection as maintenance progresses.
- The standard solution for failing over sessions is to use Transparent Application Continuity (TAC). Use Transparent Application Failover (TAF) if your application is read-only and does not change Oracle session state in the session after the initial setup. Alternatively, use Application Continuity (AC) if you want to customize how the application connection fails over with side effects or callbacks, or if you have an application that uses state, such as temporary tables, and never cleans up.
When implementing your failover solution, along with the above best practices, follow the instructions in the white paper, Continuous Availability - Application Checklist for Continuous Service for MAA Solutions.
Assessing Network Performance
Oracle Data Guard relies on the underlying network to send redo from the primary and standby databases, and to ensure near zero or zero redo transport lag.
Your network assessment should evaluate the following.
-
Network is reliable
-
Sufficient available bandwidth to accommodate the maximum redo generation rate and any existing activity sharing the same network
-
If using
SYNC
orFAR SYNC
transport, minimal latency between the primary database and target standby (or far sync server) may be necessary to achieve a tolerable performance impact
Use the oratcptest
tool to measure and evaluate network bandwidth and
latency. The oratcptest
tool was specifically designed to help you
assess network resources to help tune and optimize Data Guard configurations, optimize
Data Guard instantiation, help evaluate if SYNC
transport is viable,
and minimize redo transport lags. Using oratcptest
you can:
-
Assess whether you have enough bandwidth available for your Data Guard activity
-
Determine optimal TCP socket buffer sizes
-
Determine optimal Oracle Net settings
-
Tune operating system limits on socket buffer sizes
-
Tune Oracle Net session data unit SDU for use with
SYNC
The following example shows you how oratcptest
highlights the potential
Data Guard asynchronous redo transport bandwidth and the increase in bandwidth by
adjusting operating system socket buffer sizes.
First, get the baseline.
Client (primary):
$ java -jar oratcptest.jar standby_server -port=port_number -mode=async -duration=120 -interval=20s -sockbuf=2097152
[Requesting a test]
Message payload = 1 Mbyte
Payload content type = RANDOM
Delay between messages = NO
Number of connections = 1
Socket send buffer = 2 Mbytes
Transport mode = ASYNC
Disk write = NO
Statistics interval = 20 seconds
Test duration = 2 minutes
Test frequency = NO
Network Timeout = NO
(1 Mbyte = 1024x1024 bytes)
(11:39:16) The server is ready.
Throughput
(11:39:36) 71.322 Mbytes/s
(11:39:56) 71.376 Mbytes/s
(11:40:16) 72.104 Mbytes/s
(11:40:36) 79.332 Mbytes/s
(11:40:56) 76.426 Mbytes/s
(11:41:16) 68.713 Mbytes/s
(11:41:16) Test finished.
Socket send buffer = 2097152
Avg. throughput = 73.209 Mbytes/s
Now that you have a baseline with a 2MB buffer, increase the socket buffer size to 4MB to assess any gain in throughput with a larger buffer size.
$ java -jar oratcptest.jar -server -port=port_number -sockbuf=4194305
Client (primary):
$ java -jar oratcptest.jar test.server.address.com -port=<port number> -mode=async -duration=60s -sockbuf= 4194305
[Requesting a test]
Message payload = 1 Mbyte
Payload content type = RANDOM
Delay between messages = NO
Number of connections = 1
Socket send buffer = 4194303 bytes
Transport mode = ASYNC
Disk write = NO
Statistics interval = 10 seconds
Test duration = 1 minute
Test frequency = NO
Network Timeout = NO
(1 Mbyte = 1024x1024 bytes)
(11:15:06) The server is ready.
Throughput
(11:15:16) 113.089 Mbytes/s
(11:15:26) 113.185 Mbytes/s
(11:15:36) 113.169 Mbytes/s
(11:15:46) 113.169 Mbytes/s
(11:15:56) 113.168 Mbytes/s
(11:16:06) 113.171 Mbytes/s
(11:16:06) Test finished.
Socket send buffer = 4 Mbytes
Avg. throughput = 113.149 Mbytes/s
The socket buffer size increase yields a 54% improvement in throughput.
See Oracle Support document Assessing and Tuning Network Performance for Data Guard and RMAN
(Doc ID 2064368.1) for information about the oratcptest
tool.
Determining Oracle Data Guard Protection Mode
Oracle Data Guard can run in three different protection modes, which cater to different performance, availability, and data loss requirements. Use this guide to determine which protection mode fits your business requirements and your potential environmental constraints.
Maximum Protection mode guarantees that no data loss will occur if the primary database fails, even in the case of multiple failures (for example, the network between the primary and standby fails, and then at a later time, the primary fails). This policy is enforced by never signaling commit success for a primary database transaction until at least one synchronous Data Guard standby has acknowledged that redo has been hardened to disk. Without such an acknowledgment the primary database will stall and eventually shut down rather than allow unprotected transactions to commit.
To maintain availability in cases where the primary database is operational but the standby database is not, the best practice is to always have a minimum of two synchronous standby databases in a Maximum Protection configuration. Primary database availability is not impacted if it receives acknowledgment from at least one synchronous standby database.
Choose this protection mode if zero data loss is more important than database
availability. Workload impact analysis is recommended to measure whether any overhead is
acceptable when enabling SYNC
transport.
Maximum Availability mode guarantees that no data loss will occur in cases where
the primary database experiences the first failure to impact the configuration. Unlike
the Maximum Protection mode, Maximum Availability will wait a maximum of
NET_TIMEOUT
seconds for an acknowledgment from any of the standby
databases, after which it will signal commit success to the application and move to the
next transaction. Primary database availability (thus the name of the mode) is not
impacted by an inability to communicate with the standby (for example, due to standby or
network outages). Data Guard will continue to ping the standby and automatically
re-establish connection and resynchronize the standby database when possible, but during
the period when primary and standby have diverged there will be data loss should a
second failure impact the primary database.
For this reason, it is a best practice to monitor protection level, which is simplest using Enterprise Manager Grid Control, and quickly resolve any disruption in communication between the primary and standby before a second failure can occur. This is the most common zero data loss database protection mode.
Choose this protection mode if zero data loss is very important but you want the primary
database to continue to be available even with the unlikely case that all standby
databases are not reachable. You can complement this solution by integrating multiple
standby databases or using Far Sync instances to implement a zero data loss standby
solution across a WAN. Workload impact analysis is recommended to measure whether any
overhead is acceptable when enabling SYNC
transport.
Maximum Performance mode is the default Data Guard mode, and it provides the highest level of data protection that is possible without affecting the performance or the availability of the primary database. This is accomplished by allowing a transaction to commit as soon as the redo data needed to recover that transaction is written to the local online redo log at the primary database (the same behavior as if there were no standby database). Data Guard transmits redo concurrently to 1) the standby database directly from the primary log buffer and 2) to the local online redo log write asynchronously enabling a very low potential data loss if the primary site is lost. There is never any wait for standby acknowledgment but the potential data loss for this data protection mode can still be near zero..
Similar to Maximum Availability mode, it is a best practice to monitor the protection level using Enterprise Manager Grid Control, and quickly resolve any disruption in communication between primary and standby before a second failure can occur.
Choose this mode if minimum data loss is acceptable and zero performance impact on the primary is required.
Offloading Queries to a Read-Only Standby Database
Offloading queries and reporting workloads to read-only standby databases can free up your primary database system resources, giving you the ability to add more users, workloads, or even databases.
When you leverage both primary and standby database resources, your business and your applications benefit with higher total system usage, and potentially higher application throughput.
Offload appropriate workloads by following these steps.
-
Identify which application modules are read-only or read-mostly.
-
Evaluate whether you have application services or modules that are read-only.
-
Small and short read-only queries are good candidates to offload to the standby database.
-
Short DMLs, especially those that are response-time sensitive, should not be offloaded to the standby.
-
Large reports or analytic reports are good candidates to offload.
-
Reports that are primarily reads, and that may have an infrequent DML, typically at the start or end of a report, may be good candidates to offload.
To enable DML Redirection, see ADG_REDIRECT_DML.
-
-
Gather information about the expected application performance, throughput, response time, or elapsed time service levels for each offload candidate.
-
Once you have determined which queries and reports are good candidates to offload, find out the required expected and maximum response time or elapsed time for each of them. For example some large analytic reports must complete within a 2 hour time span.
-
For short queries, determine the expected response time and throughput expectations.
-
These requirements are sometimes referred to as application performance Service Level Agreements, which you need for the next step.
-
-
Test the performance of each candidate on the standby, and determine whether it meets your requirements.
-
Even though the primary and standby databases have essentially identical data, they are independent databases, independent machines, independent configurations, and have different workloads. For example, an Active Data Guard read-only standby database has a redo apply workload plus the queries that are offloaded, while the primary database may have OLTP, batch, and query workloads.
-
Reported elapsed times, query response time, and workload performance may vary between the primary and standby due to these system, configuration, and workload differences.
- Tuning requires that you understand system resources, SQL plans, and individual query CPU and wait profile. The tuning recommendations are applicable for both primary and standby databases. See Diagnosing and Tuning Database Performance .
-
-
Offload a subset of the queries that meet your performance requirements, freeing up resources on the primary database for additional processing capacity.
- Once you have determined which queries and reports can be offloaded, and the performance of those activities are acceptable, then slowly offload some of the workload and monitor it.
- Do not oversubscribe and offload too much workload to the standby such that redo apply cannot keep pace after tuning. If the standby falls behind. then you lose that standby as a viable role transition target, and in most cases a standby that lags cannot be used to offload queries.
What if a specific query does not meet your requirements?
-
Consult with a performance engineer and follow the recommendations in Database Performance Tuning Guide.
-
A particular query response time or throughput or report elapsed time is not guaranteed to be the same on the standby system as it was on the primary. Analyze the system resources, SQL plans, overall CPU work time and wait times.
For example, you may see
standby query scn advance wait
is contributing to a much longer elapsed time in one of your short queries. This wait increase is attributed to Active Data Guard redo apply. If a query sees a certain row in a data block and needs to roll it back because the transaction has not committed as of the query System Commit Number (SCN), it needs to apply corresponding undo to get a consistent read for that query. If the redo for the corresponding undo change has not been applied by redo apply yet, the query needs to wait. The presence of such wait is itself not an issue, and typically may be a couple of milliseconds, but it will vary by workload and may be higher in Real Application Cluster database systems.