
19c ADG and Cloning PDBs

Preface

Well, it's been quite a while since my last blog post ^.^

Recently, while running an upgrade drill on a customer system, I hit a few DG problems. On 19c we had previously either built the DG environment on a fresh database and then created new application PDBs, or had some application PDBs in place first and then built the highly available DG setup on top; both scenarios behaved fine. This time the requirement was to remote hot clone a PDB into the primary of an already-built DG environment, and the newly cloned PDB broke on the standbys. This post records the case.


1. Remote hot clone of a PDB on the ADG primary fails

-- Fault scenario:
Primary and standby are both 19.3 RAC with the 19.4 RU applied, and there are 4 ADG standbys.
Creating and dropping PDBs on the primary replicates to the standbys normally. But when the primary clones a PDB into the CDB over a dblink, the primary is fine and the standbys keep receiving redo, yet the standby alert log reports errors and the newly cloned PDB cannot be recovered.

What made this case special: the source was an 11.2.0.4 database upgraded to a 19c non-CDB, which was then cloned into the primary RAC as the target PDB and converted to a regular PDB with the noncdb_to_pdb script. The PDB was about 300 GB; the compressed RMAN backup was 36 GB, the RMAN restore took about 40 minutes, the upgrade to 19c about an hour, and the conversion to a PDB a bit over 30 minutes. Actual timings depend on database size and hardware.
-- Spotting the problem
After the clone finished on the primary, the standby apply lag showed zero. I did not open the PDB on the standby or check its alert log, and assumed all was well. The next day, the PDB on the standby failed to open read only with an 'unnamed' datafile error, and only then did the standby problem surface. DB_FILE_NAME_CONVERT on the standby was correct, and all 4 standbys showed the same symptom, so DG parameters were unlikely to be the cause; more likely, cloning a PDB on the primary behaves differently from creating one. A Google search turned up a similar article, which I followed to repair all 4 standbys.
-- https://codeleading.com/article/15241304236
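
The sanity checks on the standby boil down to a few parameter lookups (a minimal sketch; all three are standard parameters, run in SQL*Plus on each standby):

show parameter db_file_name_convert
show parameter log_file_name_convert
show parameter standby_file_management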

Does a 19c ADG primary simply not support hot cloning a PDB into the CDB? I tested it myself afterwards: whether the source PDB was set read write or read only, it failed either way. The database was not large, but limited network bandwidth made testing painful. Reading the official documentation carefully, the High Availability chapter already warns explicitly that cloning a PDB under DG needs an extra parameter: before 19c you had to fix the standby datafiles by hand, whereas 19c can complete the standby clone automatically via a parameter. Details below.

-- Troubleshooting

-- On the standby, the PDB freshly cloned on the primary looks normal at first glance: apply lag is 0. Little did I know the alert log was already full of errors (the problem is the same whether the standby CDB is mounted or open).
-- Because the faulty PDB on the standby is treated as offline and excluded from recovery, the CDB-wide apply lag shows 0; the alert log states the real situation plainly.
SQL> show pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 TEST_PDB                       MOUNTED


SQL> select value from v$dataguard_stats;

VALUE
----------------------------------------------------------------
+00 00:00:00
+00 00:00:00
+00 00:00:00.000


SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ ONLY WITH APPLY
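
The CDB-wide lag can hide a per-PDB problem, so a per-container check is worth adding (a sketch; V$PDBS carries a RECOVERY_STATUS column from 12.2 onward — whether the broken clone shows DISABLED here depends on how it was flagged, but it is a cheap check next to v$dataguard_stats):

select name, open_mode, recovery_status from v$pdbs;
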
-- Opening the PDBs on the standby produces the first visible error
SQL> alter pluggable database all open;
alter pluggable database all open
*
ERROR at line 1:
ORA-01111: name for data file 40 is unknown - rename to correct file
# Check the alert log (MRP starts normally, but the faulty PDB cannot be recovered, so it cannot be opened)
2020-08-29T09:17:30.653161+08:00
stopping change tracking
Warning: Datafile 40 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/system.344.1049672939) is offline during full database recovery and will not be recovered
Warning: Datafile 41 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/sysaux.345.1049672939) is offline during full database recovery and will not be recovered
Warning: Datafile 42 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/undotbs1.325.1049672939) is offline during full database recovery and will not be recovered
Warning: Datafile 43 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/users.342.1049672937) is offline during full database recovery and will not be recovered
Warning: Datafile 44 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/undotbs2.323.1049672939) is offline during full database recovery and will not be recovered
Warning: Datafile 45 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.328.1049672941) is offline during full database recovery and will not be recovered
Warning: Datafile 46 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.329.1049672941) is offline during full database recovery and will not be recovered
Warning: Datafile 47 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.338.1049672943) is offline during full database recovery and will not be recovered
Warning: Datafile 48 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.334.1049672943) is offline during full database recovery and will not be recovered
Warning: Datafile 49 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.327.1049672941) is offline during full database recovery and will not be recovered
Warning: Datafile 50 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.326.1049672939) is offline during full database recovery and will not be recovered
Warning: Datafile 51 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.324.1049672939) is offline during full database recovery and will not be recovered
Warning: Datafile 52 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbdata.339.1049672939) is offline during full database recovery and will not be recovered
Warning: Datafile 53 (+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/test_pdbindex.322.1049672939) is offline during full database recovery and will not be recovered
PR00 (PID:124906): Media Recovery Waiting for T-1.S-382 (in transit)
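
The offline files the alert log complains about can be cross-checked from the dictionary (a sketch using standard views):

-- files flagged as needing recovery, with the reason
select file#, error from v$recover_file;
-- datafiles in an abnormal state, with the owning container
select con_id, file#, status, name from v$datafile where status not in ('ONLINE', 'SYSTEM');
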
-- Querying datafile names on the standby: every file of the newly cloned PDB is UNNAMED, and none of these UNNAMED files actually exist under dbs
SQL> select name from v$datafile;

NAME
----------------------------------------------------------------------------------------------------
+DATADG/ORCLDG/DATAFILE/system.257.1046268257
+DATADG/ORCLDG/DATAFILE/sysaux.258.1046268259
+DATADG/ORCLDG/DATAFILE/undotbs1.259.1046268259
+DATADG/ORCLDG/A379BC465D2FAAE1E0538387040AC9B3/DATAFILE/system.260.1046268259
+DATADG/ORCLDG/A379BC465D2FAAE1E0538387040AC9B3/DATAFILE/sysaux.261.1046268267
+DATADG/ORCLDG/DATAFILE/users.262.1046268281
+DATADG/ORCLDG/A379BC465D2FAAE1E0538387040AC9B3/DATAFILE/undotbs1.263.1046268297
+DATADG/ORCLDG/DATAFILE/undotbs2.264.1046268301
+DATADG/ORCLDG/DATAFILE/sysaux.265.1046268307
+DATADG/ORCLDG/DATAFILE/undotbs1.266.1046268343
+DATADG/ORCLDG/DATAFILE/undotbs2.267.1046268343

/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00040
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00041
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00042
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00043
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00044
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00045
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00046
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00047
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00048
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00049
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00050
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00051
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00052
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00053
/u01/app/oracle/product/19.4.0/db_1/dbs/UNNAMED00054
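
To isolate just the broken files and the container they belong to (a sketch):

select con_id, file#, name from v$datafile where name like '%UNNAMED%';
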
# Going back to the standby alert log from when the primary ran the clone:
# Because the source was open read write during the clone (the same applies to a refresh clone, or to a clone issued with standbys=none), the PDB ends up as nostandby on the standby, or its tablespaces and datafiles end up offline. The PDB therefore cannot be recovered normally, and the filenames are never converted into ASM; they point at UNNAMED files under dbs instead.
2020-08-28T23:54:22.012162+08:00
Recovery created pluggable database TEST_PDB
TEST_PDB(3):Tablespace-SYSTEM during PDB create skipped since source is in r/w mode or this is a refresh clone
TEST_PDB(3):File #40 added to control file as 'UNNAMED00040'. Originally created as:
TEST_PDB(3):'+DATADG/ORCL/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/system.341.1049672939'
TEST_PDB(3):because the pluggable database was created with nostandby
TEST_PDB(3):or the tablespace belonging to the pluggable database is
TEST_PDB(3):offline.
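
For reference, this standby state can also be produced on purpose: a clone issued with the STANDBYS clause triggers the same handling (a sketch; the names are examples):

create pluggable database pdb_x from pdb_src@src_link standbys=none;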

2. Repairing the standby PDB broken by the primary-side clone

-- On the standby: stop the MRP process and disable recovery for the faulty PDB
conn / as sysdba
show parameter standby  -- standby_file_management is normally set to AUTO
recover managed standby database cancel;
ho ps -ef|grep mrp

alter session set container = test_pdb;
alter pluggable database test_pdb disable recovery;

# As the grid user, use asmcmd on the standby to create a datafile directory for the faulty PDB matching the primary's (same PDB GUID)
+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE
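
# The commands would look roughly like this (a sketch; asmcmd mkdir creates one level at a time)
asmcmd mkdir +DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468
asmcmd mkdir +DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE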

# Now restore the faulty PDB
# The CDB can be mounted or open; the PDB itself is mounted, since the fault prevents it from opening, and needs the restore
# FROM SERVICE takes the primary's net service name; this style of restore is available from 12c onward
rman target /
run {
allocate channel ch00 type disk;
allocate channel ch01 type disk;
allocate channel ch02 type disk;
allocate channel ch03 type disk;
allocate channel ch04 type disk;
allocate channel ch05 type disk;
set newname for pluggable database test_pdb to '+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/%U';
restore pluggable database test_pdb from service orcl;
switch datafile all;
}
# Restart the database
shu immediate
startup mount
-- Re-enable recovery for the PDB
alter session set container = test_pdb;
alter pluggable database test_pdb enable recovery;

-- restart redo apply
conn /as sysdba
recover managed standby database disconnect;

-- Once recovery has roughly caught up, open the CDB and the PDBs
alter database open;
alter pluggable database all open;

-- Check PDB status and the DG lag
show pdbs
select value from v$dataguard_stats;

-- Check the PDB's tempfiles; if any are missing they can be added manually, kept consistent with the primary
-- At this point the faulty PDB is repaired and back in line with the primary

-- Summary:
There are two ways to recover: the direct RESTORE ... FROM SERVICE shown above, or backing up the PDB on the primary, shipping the backup to the standby, and restoring there. I tested both successfully, but I recommend the first: whatever Oracle can do for you, let it. Backing up, transferring, and restoring by hand is slower and more fiddly.
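
For completeness, the second method would look roughly like this (a hedged sketch; the format string and staging paths are examples only):

-- on the primary: back up the PDB to a staging area
rman target /
backup pluggable database test_pdb format '/tmp/test_pdb_%U.bkp';

-- ship the pieces (scp) to the standby, then on the standby:
rman target /
catalog start with '/tmp/test_pdb_' noprompt;
run {
set newname for pluggable database test_pdb to '+DATADG/ORCLDG/ADF2FB440E69A7B3E0538387040A6468/DATAFILE/%U';
restore pluggable database test_pdb;
switch datafile all;
}
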
3. The correct way to remote hot clone a PDB under 19c ADG
-- 1. Add the same TNS alias on the primary and every DG standby
-- The alias is the one the dblink will use. Note that some standbys may sit on a different network segment from the primary, so check that each can actually reach the source.
orcl1522 =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.131)(PORT = 1522))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = orcl)
)
)

-- 2. On the primary, create a dblink to the source CDB
create public database link dblink_cdb1522 connect to system identified by oracle using 'orcl1522';

-- 3. Grant the privilege inside the source non-CDB or PDB (note: grant to SYSTEM or your clone user inside the PDB, not in the CDB root):
grant create pluggable database to system;

-- 4. Verify the dblink on the primary and every standby
select name, db_unique_name from v$database@dblink_cdb1522;

-- 5. Open every standby and set standby_pdb_source_file_dblink to the dblink created and verified above
alter system set standby_pdb_source_file_dblink = dblink_cdb1522;

-- 6. Open the source non-CDB or PDB to be cloned in read only mode
shu immediate;
startup mount;
alter database open read only;
-- or
-- alter pluggable database orcl open read only;

-- 7. On the primary, clone the target PDB into the CDB (each standby also clones from the source over the dblink, and usually finishes later than the primary)
create pluggable database orcl_test from orcl@dblink_cdb1522;

-- 8. Switch into the newly cloned target PDB
alter session set container = orcl_test;

-- If the source was a non-CDB, run this script inside the PDB:
-- @$ORACLE_HOME/rdbms/admin/noncdb_to_pdb.sql

-- 9. Open the target PDB:
alter database open;
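
-- Once the standby-side clone has finished, a quick check on each standby is worth running (a sketch; 'apply lag' is the metric name in v$dataguard_stats):
show pdbs
select name, value from v$dataguard_stats where name = 'apply lag';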

4. Standby restarted mid-clone: one file left UNNAMED needs repair

-- Continuing the story: while the primary was cloning the PDB, the standby was restarted for unrelated reasons (a scheduled job this time; the shutdown hung for 1.5 hours before the restart finally completed), while the standby was still copying data from the clone source and had not finished. Unfortunately, after the restart one file had not been synchronized properly and was left UNNAMED, requiring a manual fix.

-- On the standby, give the missing UNNAMED file recorded in the controlfile a proper name
alter system set standby_file_management = manual;
alter session set container = orcl_test;
select name from v$dbfile;
alter database create datafile '+DATADG/MUST_RENAME_THIS_DATAFILE_123.4294967295.4294967295' as '+DATADG/ORCLDG/AE3BE5CEC5283544E0538387040A76F2/DATAFILE/TEST_OGG.dbf';
select name from v$dbfile;

-- Copy the file out of the clone source. Note: not the DG primary, but the source database the primary cloned from, because during a normal ADG clone the standby also pulls datafiles from the source over the dblink, just like the primary, rather than from the already-cloned primary (whether for performance or by design, I am not sure). If many files are affected, an RMAN backup/restore works too.
asmcmd cp +DATADG/ORCL/AE3BE5CEC5283544E0538387040A76F2/DATAFILE/TEST_OGG.373.1049988991 /tmp/TEST_OGG.373.1049988991
scp /tmp/TEST_OGG.373.1049988991 192.168.100.203:/tmp/

-- On the standby, copy the file into the PDB's ASM datafile location by hand
asmcmd cp /tmp/TEST_OGG.373.1049988991 +DATADG/ORCLDG/AE3BE5CEC5283544E0538387040A76F2/DATAFILE/TEST_OGG.dbf

-- On the standby, recover this file
conn /as sysdba
alter system set standby_file_management = auto;
recover managed standby database disconnect;
recover managed standby database cancel;
alter database open;
alter pluggable database all open;
recover managed standby database disconnect;

-- check dg
select open_mode from v$database
union all
select value from v$dataguard_stats;

5. Dropping a broken PDB on a 19c standby

-- On the standby, disable MRP recovery for the faulty PDB
alter session set container = TEST_R17_PDB1_RD;
alter pluggable database TEST_R17_PDB1_RD disable recovery;

-- On the standby, cancel MRP
conn / as sysdba
recover managed standby database using current logfile disconnect from session;

-- On the primary
drop pluggable database TEST_R17_PDB1_RD including datafiles;

-- Summary:
When a PDB on the standby is broken, do not reach for controlfile hacks or other tricks to delete it. Under normal DG operation all DDL is shipped to the standby, so PDBs created or dropped on the primary are synchronized automatically. The right move is to repair the DG recovery apply itself; once MRP apply is healthy again, the standby's PDB falls back in line with the primary.

Summary

During the drill, time was tight and I was working alone; the same PDB was broken on all 4 standbys, and rebuilding them would have taken too long and invited new mistakes, so repairing them came first.

I took a wrong turn at first and lost quite a bit of time. The standby alert log was enormous (a temporary system with no log cleanup configured, tens of millions of lines, constantly scrolling), and the UNNAMED entries looked familiar, so I suspected a bad convert parameter or path preventing the primary's datafile names from being translated. My plan was to fix it on the standby with ALTER DATABASE CREATE DATAFILE, but surprisingly the UNNAMED files recorded in the controlfile did not exist under $ORACLE_HOME/dbs/. After consulting two experts, I tried the CREATE anyway to see whether apply could recover; CREATE ... AS took a very long time on a 300+ GB database. It seems CREATE may require the UNNAMED datafile to actually exist so it can be copied over: if it does not, forcing CREATE ... AS leaves apply unable to recover, and neither RESTORE nor RECOVER ... FROM SERVICE works afterwards, likely because the result is just an empty file. In the end I deleted those ASM files by hand and restored from a manual backup taken on the primary, which finally fixed it. The lesson: when things break, read the logs and find the root cause first; acting on symptoms alone can turn a simple problem into a complicated, even unrecoverable one.

On 19c with ADG, an ordinary CREATE PLUGGABLE DATABASE on the primary is fine, but a clone causes trouble. In 12c and 18c, after the primary clones a PDB, the standby needs manual intervention to restore the PDB's datafiles. Fortunately we jumped from 11g straight to 19c, which offers two new parameters and new RMAN commands to support DG.

  • STANDBY_PDB_SOURCE_FILE_DBLINK: for a remote hot clone of a PDB on the primary, open the source read only, then open every standby and set this parameter to the same dblink the primary's clone statement uses (with the matching TNS entries in place; the standbys must be able to reach the clone source over the network). Each standby then clones the target PDB from the source itself over the dblink and stays in sync with the primary afterwards, with no manual work on the standbys; without it, the standbys break and need manual repair. The parameter can be changed or unset after the clone completes, or ahead of the next PDB clone.

  • STANDBY_PDB_SOURCE_FILE_DIRECTORY: used when the primary plugs in a PDB; the standby must have the required datafiles staged in this directory, otherwise the plugged-in PDB breaks on the standby and needs manual repair.

  • 12c: RESTORE … FROM SERVICE: from 12c onward a database, PDB, tablespace, or datafile can be restored directly over the network, instead of the 11g routine of manual backups or BACKUP INCREMENTAL FROM SCN to bring DG back in sync.

  • 18c: RECOVER STANDBY DATABASE: from 18c onward all standby files can be refreshed and resynchronized directly (note that the controlfile is refreshed as well), again without the 11g manual backup / BACKUP INCREMENTAL FROM SCN routine; see the sketch after this list.
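
The 18c refresh in outline (a hedged sketch; orcl stands for the primary's net service name, with MRP stopped and the standby CDB mounted):

rman target /
recover standby database from service orcl;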

Finally, when you hit a problem, it is worth checking the official documentation for its advice and warnings, and keeping up with new-version features in general; RECOVER TABLE, for instance, makes table-level recovery very simple.