Pitfalls encountered while configuring MongoDB replication
I. Background
I had two Linux servers that I wanted to set up as a MongoDB replication (replica set) deployment. I tried over and over, and it simply refused to work.
II. Reproducing the failure, analysis, and fix
0 Details of the two machines
| IP address | Role | hostname |
|---|---|---|
| 172.16.17.8 | mongoprimary | guoxin8 |
| 172.16.19.110 | mongostandby | localhost |
1 MongoDB is already installed on each machine
The installation path on both machines is /root/3.4.6/, so MongoDB can be started directly via /root/3.4.6/mongod.
[root@localhost 3.4.6]# pwd
/root/3.4.6
[root@localhost 3.4.6]# ll
总用量 253628
-rwxr-xr-x 1 root root 29870760 6月 2 11:39 mongo
-rwxr-xr-x 1 root root 54389832 6月 2 11:40 mongod
-rwxr-xr-x 1 root root 12744868 6月 2 11:40 mongodump
-rwxr-xr-x 1 root root 10756395 6月 2 11:40 mongoexport
-rwxr-xr-x 1 root root 10639650 6月 2 11:40 mongofiles
-rwxr-xr-x 1 root root 10914443 6月 2 11:40 mongoimport
-rwxr-xr-x 1 root root 10407075 6月 2 11:40 mongooplog
-rwxr-xr-x 1 root root 53753832 6月 2 11:40 mongoperf
-rwxr-xr-x 1 root root 14100552 6月 2 11:40 mongorestore
-rwxr-xr-x 1 root root 30539424 6月 2 11:40 mongos
-rwxr-xr-x 1 root root 10972416 6月 2 11:40 mongostat
-rwxr-xr-x 1 root root 10604661 6月 2 11:40 mongotop
[root@localhost 3.4.6]#
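Before wiring the two machines together, it can be worth a quick sanity check that both run the same MongoDB build. A minimal sketch, assuming the same /root/3.4.6 layout shown above:

```bash
# Sketch: confirm the MongoDB version on each machine; both members should match.
/root/3.4.6/mongod --version
/root/3.4.6/mongo --version
```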
2 The configuration file has been created on each machine
[root@localhost 3.4.6]# cat /etc/mongodb.conf
dbpath=/data/mongodb/data
logpath=/data/mongodb/log/mongodb.log
logappend=true
port=27017
fork=true
# The replica set name is rs0; for a standalone MongoDB instance, simply comment this option out
replSet=rs0
#bind_ip = 0.0.0.0
#bind_ip = 127.0.0.1
#bind_ip_all=true
# For now MongoDB allows login without authentication; if auth is needed, uncomment this and restart MongoDB
#auth=true
[root@localhost 3.4.6]#
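Both members must use the same replica set name and port, so a quick way to compare the effective (non-comment) settings on each node is something like this sketch:

```bash
# Sketch: show the active settings on this node; replSet and port must match on every member.
grep -Ev '^[[:space:]]*#|^[[:space:]]*$' /etc/mongodb.conf
```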
3 The required paths have been created on each machine
[root@localhost 3.4.6]# ll /data/mongodb
总用量 0
drwxr-xr-x 2 root root 6 6月 2 17:39 data
drwxr-xr-x 2 root root 6 6月 2 17:39 log
[root@localhost 3.4.6]#
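If the paths do not exist yet, they have to be created before mongod will start with this configuration; for example:

```bash
# Sketch: create the data and log directories referenced in /etc/mongodb.conf.
mkdir -p /data/mongodb/data /data/mongodb/log
```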
4 Start MongoDB on mongoprimary
[root@guoxin8 ~]# /root/3.4.6/mongod -f /etc/mongodb.conf
about to fork child process, waiting until server is ready for connections.
forked process: 19749
child process started successfully, parent exiting
[root@guoxin8 ~]#
5 Start MongoDB on mongostandby
[root@localhost ~]# /root/3.4.6/mongod -f /etc/mongodb.conf
about to fork child process, waiting until server is ready for connections.
forked process: 15956
child process started successfully, parent exiting
[root@localhost ~]#
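On each machine, a quick check that the forked mongod is actually running and listening on 27017 might look like this (a sketch; `ss` ships with iproute2, `netstat` would work just as well):

```bash
# Sketch: confirm the mongod process is running and bound to port 27017.
ps -ef | grep '[m]ongod'
ss -lntp | grep 27017
```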
6 Run rs.initiate() on mongoprimary
[root@guoxin8 ~]# /root/3.4.6/mongo
MongoDB shell version v3.4.6
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.4.6
Server has startup warnings:
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
> rs.status()
{
    "info" : "run rs.initiate(...) if not yet done for the set",
    "ok" : 0,
    "errmsg" : "no replset config has been received",
    "code" : 94,
    "codeName" : "NotYetInitialized"
}
> rs.initiate()
{
    "info2" : "no configuration specified. Using a default configuration for the set",
    "me" : "guoxin8:27017",
    "ok" : 1
}
rs0:SECONDARY>
rs0:PRIMARY>
rs0:PRIMARY>
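Note the "me" : "guoxin8:27017" in the output: with no argument, rs.initiate() builds a default configuration that names the member after the server's hostname. Passing an explicit configuration that uses the IP address instead would sidestep the hostname-resolution pitfall described below; a hedged sketch:

```bash
# Sketch: initiate the set with an explicit member host (IP:port) instead of the default hostname.
/root/3.4.6/mongo --eval 'rs.initiate({ _id: "rs0", members: [ { _id: 0, host: "172.16.17.8:27017" } ] })'
```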
Up to this point, every step above works without any problem. The trouble starts as soon as I add the second member to the replica set.
7 Run rs.add() on mongoprimary
[root@guoxin8 ~]# /root/3.4.6/mongo
MongoDB shell version v3.4.6
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.4.6
Server has startup warnings:
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2022-06-02T17:50:56.162+0800 I CONTROL [initandlisten]
> rs.status()
{
    "info" : "run rs.initiate(...) if not yet done for the set",
    "ok" : 0,
    "errmsg" : "no replset config has been received",
    "code" : 94,
    "codeName" : "NotYetInitialized"
}
> rs.initiate()
{
    "info2" : "no configuration specified. Using a default configuration for the set",
    "me" : "guoxin8:27017",
    "ok" : 1
}
rs0:SECONDARY>
rs0:PRIMARY>
rs0:PRIMARY> rs.add("172.16.19.110:27017");
{ "ok" : 1 }
rs0:PRIMARY>
rs0:PRIMARY> rs.status()
{
    "set" : "rs0",
    "date" : ISODate("2022-06-02T09:53:47.745Z"),
    "myState" : 1,
    "term" : NumberLong(1),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1654163617, 1), "t" : NumberLong(1) },
        "appliedOpTime" : { "ts" : Timestamp(1654163619, 1), "t" : NumberLong(1) },
        "durableOpTime" : { "ts" : Timestamp(1654163619, 1), "t" : NumberLong(1) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "guoxin8:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 172,
            "optime" : { "ts" : Timestamp(1654163619, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2022-06-02T09:53:39Z"),
            "electionTime" : Timestamp(1654163495, 1),
            "electionDate" : ISODate("2022-06-02T09:51:35Z"),
            "configVersion" : 2,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "172.16.19.110:27017",
            "health" : 1,
            "state" : 0,
            "stateStr" : "STARTUP",
            "uptime" : 8,
            "optime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
            "optimeDurable" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2022-06-02T09:53:47.536Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : NumberLong(1),
            "configVersion" : -2
        }
    ],
    "ok" : 1
}
rs0:PRIMARY>
> rs.status()
2022-06-02T17:53:57.789+0800 I NETWORK [thread1] trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2022-06-02T17:53:57.789+0800 I NETWORK [thread1] reconnect 127.0.0.1:27017 (127.0.0.1) ok
{
    "set" : "rs0",
    "date" : ISODate("2022-06-02T09:53:57.790Z"),
    "myState" : 2,
    "term" : NumberLong(1),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1654163617, 1), "t" : NumberLong(1) },
        "appliedOpTime" : { "ts" : Timestamp(1654163619, 1), "t" : NumberLong(1) },
        "durableOpTime" : { "ts" : Timestamp(1654163619, 1), "t" : NumberLong(1) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "guoxin8:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 182,
            "optime" : { "ts" : Timestamp(1654163619, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2022-06-02T09:53:39Z"),
            "infoMessage" : "could not find member to sync from",
            "configVersion" : 2,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "172.16.19.110:27017",
            "health" : 1,
            "state" : 0,
            "stateStr" : "STARTUP",
            "uptime" : 8,
            "optime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
            "optimeDurable" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2022-06-02T09:53:54.538Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : NumberLong(1),
            "configVersion" : -2
        }
    ],
    "ok" : 1
}
>
Right after running rs.add("172.16.19.110:27017"), the current node is still the primary and everything looks fine. However, the newly added node 172.16.19.110 reports "stateStr" : "STARTUP", which is not right. Sure enough, when we run rs.status() again moments later, the node that was primary has dropped to secondary.
That is a real problem: the primary has become a secondary, while the would-be secondary is stuck in STARTUP, so the replica set has no primary at all. How is that supposed to be usable?
After wrestling with it for quite a while, I finally found a clue via Google.
8 The root cause
1 On the primary machine, the hostname in /etc/hosts is not mapped to the 127.0.0.1 loopback address;
2 The output of the rs.initiate() command also hints at this (the member was registered as guoxin8:27017):
{ "info2" : "no configuration specified. Using a default configuration for the set", "me" : "guoxin8:27017", "ok" : 1 }
3 The MongoDB log on the standby node reports the corresponding errors:
[root@localhost 3.4.6]# cat /data/mongodb/log/mongodb.log
....
2022-06-02T18:02:17.160+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to guoxin8:27017
2022-06-02T18:02:27.161+0800 I REPL [ReplicationExecutor] Error in heartbeat request to guoxin8:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2022-06-02T18:02:27.172+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to guoxin8:27017 - HostUnreachable: HostUnreachable
2022-06-02T18:02:27.172+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to guoxin8:27017 due to failed operation on a connection
2022-06-02T18:02:29.161+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to guoxin8:27017
2022-06-02T18:02:39.162+0800 I REPL [ReplicationExecutor] Error in heartbeat request to guoxin8:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2022-06-02T18:02:39.174+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to guoxin8:27017 - HostUnreachable: HostUnreachable
2022-06-02T18:02:39.174+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to guoxin8:27017 due to failed operation on a connection
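The log makes the failure mode clear: the standby is told to replicate from guoxin8:27017 but cannot turn that name into a reachable address. You can confirm this directly from the standby with something like the following sketch:

```bash
# Sketch, run on the standby: does the name resolve, and can the primary actually be reached under it?
getent hosts guoxin8
/root/3.4.6/mongo --host guoxin8 --port 27017 --eval 'db.runCommand({ ping: 1 })'
```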
9 The fix is simple
In the /etc/hosts file on the standby machine 172.16.19.110, add an entry mapping guoxin8 to the primary node's IP address:
[root@localhost ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
#mongo
172.16.17.8 guoxin8
[root@localhost ~]#
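Equivalently, the entry can be appended and verified straight from the shell (a sketch):

```bash
# Sketch, run on the standby: map the primary's hostname to its real IP and verify resolution.
echo '172.16.17.8 guoxin8' >> /etc/hosts
getent hosts guoxin8
```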
After that, the MongoDB log on the standby node shows:
[root@localhost 3.4.6]# tailf /data/mongodb/log/mongodb.log
2022-06-02T18:03:41.306+0800 I REPL [replication-1] Finished fetching oplog during initial sync: CallbackCanceled: Callback canceled. Last fetched optime and hash: { ts: Timestamp 1654163619000|1, t: 1 }[-221148826281795956]
2022-06-02T18:03:41.306+0800 I REPL [replication-1] Initial sync attempt finishing up.
2022-06-02T18:03:41.306+0800 I REPL [replication-1] Initial Sync Attempt Statistics: { failedInitialSyncAttempts: 0, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1654164221217), initialSyncAttempts: [], fetchedMissingDocs: 0, appliedOps: 0, initialSyncOplogStart: Timestamp 1654163619000|1, initialSyncOplogEnd: Timestamp 1654163619000|1, databases: { databasesCloned: 1, admin: { collections: 1, clonedCollections: 1, start: new Date(1654164221271), end: new Date(1654164221304), elapsedMillis: 33, admin.system.version: { documentsToCopy: 1, documentsCopied: 1, indexes: 2, fetchedBatches: 1, start: new Date(1654164221272), end: new Date(1654164221304), elapsedMillis: 32 } } } }
2022-06-02T18:03:41.310+0800 I REPL [replication-1] initial sync done; took 0s.
2022-06-02T18:03:41.310+0800 I REPL [replication-1] Starting replication fetcher thread
2022-06-02T18:03:41.310+0800 I REPL [replication-1] Starting replication applier thread
2022-06-02T18:03:41.311+0800 I REPL [replication-1] Starting replication reporter thread
2022-06-02T18:03:41.311+0800 I REPL [rsSync] transition to RECOVERING
2022-06-02T18:03:41.311+0800 I REPL [rsBackgroundSync] could not find member to sync from
2022-06-02T18:03:41.313+0800 I REPL [rsSync] transition to SECONDARY
2022-06-02T18:03:51.315+0800 I REPL [ReplicationExecutor] Member guoxin8:27017 is now in state PRIMARY
2022-06-02T18:03:56.317+0800 I REPL [rsBackgroundSync] sync source candidate: guoxin8:27017
2022-06-02T18:03:56.320+0800 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to guoxin8:27017
2022-06-02T18:03:56.322+0800 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to guoxin8:27017, took 2ms (2 connections now open to guoxin8:27017)
Then, running rs.status() again in the mongo shell on the primary node, everything is back to normal.
> rs.status()
{
    "set" : "rs0",
    "date" : ISODate("2022-06-02T10:07:58.329Z"),
    "myState" : 1,
    "term" : NumberLong(2),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1654164471, 1), "t" : NumberLong(2) },
        "appliedOpTime" : { "ts" : Timestamp(1654164471, 1), "t" : NumberLong(2) },
        "durableOpTime" : { "ts" : Timestamp(1654164471, 1), "t" : NumberLong(2) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "guoxin8:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 1023,
            "optime" : { "ts" : Timestamp(1654164471, 1), "t" : NumberLong(2) },
            "optimeDate" : ISODate("2022-06-02T10:07:51Z"),
            "electionTime" : Timestamp(1654164230, 1),
            "electionDate" : ISODate("2022-06-02T10:03:50Z"),
            "configVersion" : 2,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "172.16.19.110:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 848,
            "optime" : { "ts" : Timestamp(1654164471, 1), "t" : NumberLong(2) },
            "optimeDurable" : { "ts" : Timestamp(1654164471, 1), "t" : NumberLong(2) },
            "optimeDate" : ISODate("2022-06-02T10:07:51Z"),
            "optimeDurableDate" : ISODate("2022-06-02T10:07:51Z"),
            "lastHeartbeat" : ISODate("2022-06-02T10:07:56.445Z"),
            "lastHeartbeatRecv" : ISODate("2022-06-02T10:07:57.878Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "guoxin8:27017",
            "configVersion" : 2
        }
    ],
    "ok" : 1
}
>
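For day-to-day checks, the full rs.status() document is more than you usually need; a compact sketch that just prints each member's name and state:

```bash
# Sketch: print one line per replica set member with its current state.
/root/3.4.6/mongo --quiet --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr); })'
```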
III. Summary
I had configured MongoDB replication on Linux before without a hitch. This time I ran into this pitfall, and it took a good deal of effort to track down the cause.
Reference: https://stackoverflow.com/questions/13603096/starting-over-with-replica-configuration-in-mongodb
IV. Addendum
When I deployed another MongoDB replica set whose primary already held a fairly large amount of data, the newly added member showed "stateStr" : "STARTUP2" after being added. In that case, go to the standby node and check its log, and watch the size of its data directory: as long as it keeps changing, replication is busy with the initial data sync. Once the sync finishes, the state returns to normal.
rs0:PRIMARY> rs.status()
{
    "set" : "rs0",
    "date" : ISODate("2022-06-02T10:51:06.074Z"),
    "myState" : 1,
    "term" : NumberLong(1),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1654166752, 1), "t" : NumberLong(1) },
        "appliedOpTime" : { "ts" : Timestamp(1654167063, 1), "t" : NumberLong(1) },
        "durableOpTime" : { "ts" : Timestamp(1654167063, 1), "t" : NumberLong(1) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "k8-minion-ident-test2-164:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 806,
            "optime" : { "ts" : Timestamp(1654167063, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2022-06-02T10:51:03Z"),
            "electionTime" : Timestamp(1654166280, 2),
            "electionDate" : ISODate("2022-06-02T10:38:00Z"),
            "configVersion" : 2,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "172.16.19.110:27017",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",    # the state is not normal yet
            "uptime" : 309,
            "optime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
            "optimeDurable" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2022-06-02T10:51:05.192Z"),
            "lastHeartbeatRecv" : ISODate("2022-06-02T10:51:04.225Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "k8-minion-ident-test2-164:27017",
            "configVersion" : 2
        }
    ],
    "ok" : 1
}
rs0:PRIMARY>
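While a member sits in STARTUP2, you can watch the initial sync make progress on the secondary itself, roughly like this (a sketch, assuming the same dbpath/logpath layout as the config earlier in this post):

```bash
# Sketch, run on the secondary: the data directory keeps growing and the log reports initial sync activity.
du -sh /data/mongodb/data
grep -i 'initial sync' /data/mongodb/log/mongodb.log | tail -n 5
```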
After the sync completes, the standby's log contains entries like the following:
2022-06-02T20:38:13.934+0800 I INDEX [InitialSyncInserters-fdp_bill_uat.t_fdp_bill_reality_bill0] building index using bulk method; build may temporarily use up to 33 megabytes of RAM
2022-06-02T20:38:13.939+0800 I INDEX [InitialSyncInserters-fdp_bill_uat.t_fdp_bill_reality_bill0] build index on: fdp_bill_uat.t_fdp_bill_reality_bill properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "fdp_bill_uat.t_fdp_bill_reality_bill" }
2022-06-02T20:38:13.939+0800 I INDEX [InitialSyncInserters-fdp_bill_uat.t_fdp_bill_reality_bill0] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2022-06-02T20:38:31.967+0800 I REPL [replication-12] Applying operations until { : Timestamp 1654173503000|1 } before initial sync can complete. (starting at { : Timestamp 1654166756000|1 })
2022-06-02T20:38:32.144+0800 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host k8-minion-ident-test2-164:27017 due to bad connection status; 1 connections to that host remain open
2022-06-02T20:38:32.144+0800 I REPL [replication-15] Finished fetching oplog during initial sync: CallbackCanceled: Callback canceled. Last fetched optime and hash: { ts: Timestamp 1654173503000|1, t: 1 }[-7280547347428778685]
2022-06-02T20:38:32.144+0800 I REPL [replication-15] Initial sync attempt finishing up.
....
2022-06-02T20:38:32.148+0800 I REPL [replication-12] initial sync done; took 6755s.
2022-06-02T20:38:32.148+0800 I REPL [replication-12] Starting replication fetcher thread
2022-06-02T20:38:32.148+0800 I REPL [replication-12] Starting replication applier thread
2022-06-02T20:38:32.149+0800 I REPL [rsBackgroundSync] could not find member to sync from
2022-06-02T20:38:32.149+0800 I REPL [replication-12] Starting replication reporter thread
2022-06-02T20:38:32.149+0800 I REPL [rsSync] transition to RECOVERING
2022-06-02T20:38:32.151+0800 I REPL [rsSync] transition to SECONDARY
Reference: https://dba.stackexchange.com/questions/58740/why-is-mongo-stuck-in-startup2