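1. Introduction
A master-slave architecture like Alluxio's inevitably has a single point of failure, so configuring fault tolerance for the master is a must.
If you don't have Alluxio installed yet, please see my other articles.
This article assumes you already have an HA-enabled HDFS set up.
I haven't written an article on configuring Hadoop HA; I recommend the official documentation: HDFS High Availability
The test setup we use is as follows: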
hostname | role |
---|---|
a1(10.8.12.16) | alluxio-master,alluxio-worker,hadoop master,hadoop slave |
a2(10.8.12.17) | alluxio-standby-master,alluxio-worker,hadoop master-standby,hadoop slave |
a3(10.8.12.18) | alluxio-worker,hadoop slave |
Note: the nodes set up as master and standby master must be able to log in to each other without a password~~~
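A quick way to verify the passwordless login is sketched below. This is only a minimal check, assuming OpenSSH's `ssh` client is installed; the function name `can_ssh_without_password` is made up for this example.

```shell
# Returns 0 if ssh can log in to host $1 without prompting for anything.
# BatchMode=yes makes ssh fail instead of asking for a password.
can_ssh_without_password() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Example, run on a1 (hostnames are the ones from the table above):
#   can_ssh_without_password a2 || echo "a1 -> a2 still asks for a password"
```

Run it in both directions (a1 to a2 and a2 to a1). If it fails, `ssh-keygen` followed by `ssh-copy-id <host>` is the usual way to install a key.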
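2. Preparation
This assumes you have already installed the JDK, an HA-enabled HDFS, and a ZK cluster.
If you skip this, then when the workers in your Alluxio cluster are given a fairly large amount of memory, the workers will fail to start, and worker.log shows errors like these: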
2016-09-21 16:57:24,174 WARN util.NativeCodeLoader (NativeCodeLoader.java:<clinit>) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-21 16:57:24,513 ERROR logger.type (MetricsConfig.java:loadConfigFile) - Error loading metrics configuration file.
2016-09-21 16:57:24,754 INFO server.Server (Server.java:doStart) - jetty-7.x.y-SNAPSHOT
2016-09-21 16:57:24,769 INFO handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.s.ServletContextHandler{/metrics/json,null}
2016-09-21 16:57:24,814 INFO handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.w.WebAppContext{/,file:/home/appadmin/alluxio-1.2.0/core/server/src/main/webapp/},/home/appadmin/alluxio-1.2.0/core/server/src/main/webapp
2016-09-21 16:57:27,864 INFO server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChannelConnector@0.0.0.0:30000
2016-09-21 16:57:27,864 INFO logger.type (UIWebServer.java:startWebServer) - Alluxio Worker Web service started @ 0.0.0.0/0.0.0.0:30000
2016-09-21 16:57:27,865 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:27,869 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (0) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:27,919 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:27,920 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (1) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:27,970 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:27,970 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (2) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:28,021 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:28,021 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (3) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:28,171 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:28,173 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (4) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:28,723 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:28,723 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (5) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:29,724 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:29,724 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (6) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:30,725 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:30,725 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (7) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:31,725 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:31,726 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (8) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:32,726 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:32,726 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (9) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:33,726 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:33,727 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (10) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:34,727 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:34,727 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (11) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:35,728 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:35,728 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (12) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:36,728 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:36,729 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (13) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:37,729 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:37,729 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (14) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:38,729 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:38,730 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (15) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:39,730 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:39,731 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (16) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:40,731 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:40,731 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (17) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:41,732 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:41,732 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (18) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:42,732 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:42,732 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (19) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:43,733 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:43,733 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (20) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:44,733 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:44,734 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (21) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:45,734 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:45,735 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (22) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:46,735 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:46,735 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (23) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:47,736 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:47,736 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (24) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:48,736 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:48,737 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (25) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:49,737 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:49,737 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (26) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:50,738 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:50,738 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (27) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:51,738 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:51,739 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (28) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:52,739 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:52,739 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (29) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:52,740 ERROR logger.type (BlockWorker.java:start) - Failed to get a worker id from block master
alluxio.exception.ConnectionFailedException: Failed to connect to BlockMasterWorker master @ /10.8.12.16:19998 after 29 attempts
at alluxio.AbstractClient.connect(AbstractClient.java:186)
at alluxio.AbstractClient.retryRPC(AbstractClient.java:291)
at alluxio.worker.block.BlockMasterClient.getId(BlockMasterClient.java:109)
at alluxio.worker.WorkerIdRegistry.registerWithBlockMaster(WorkerIdRegistry.java:60)
at alluxio.worker.block.BlockWorker.start(BlockWorker.java:168)
at alluxio.worker.AlluxioWorker.startWorkers(AlluxioWorker.java:354)
at alluxio.worker.AlluxioWorker.start(AlluxioWorker.java:326)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:84)
2016-09-21 16:57:52,741 ERROR logger.type (AlluxioWorker.java:main) - Uncaught exception while running Alluxio worker, stopping it and exiting.
java.lang.RuntimeException: alluxio.exception.ConnectionFailedException: Failed to connect to BlockMasterWorker master @ /10.8.12.16:19998 after 29 attempts
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at alluxio.worker.block.BlockWorker.start(BlockWorker.java:171)
at alluxio.worker.AlluxioWorker.startWorkers(AlluxioWorker.java:354)
at alluxio.worker.AlluxioWorker.start(AlluxioWorker.java:326)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:84)
Caused by: alluxio.exception.ConnectionFailedException: Failed to connect to BlockMasterWorker master @ /10.8.12.16:19998 after 29 attempts
at alluxio.AbstractClient.connect(AbstractClient.java:186)
at alluxio.AbstractClient.retryRPC(AbstractClient.java:291)
at alluxio.worker.block.BlockMasterClient.getId(BlockMasterClient.java:109)
at alluxio.worker.WorkerIdRegistry.registerWithBlockMaster(WorkerIdRegistry.java:60)
at alluxio.worker.block.BlockWorker.start(BlockWorker.java:168)
... 3 more
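If you look at the node information in ZK, you will also find that the relevant znodes were never created. The ZK log contains errors as well, reporting connection refused.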
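Since both the worker log and the ZK log boil down to "connection refused", it can save time to check that the ports are reachable before starting anything. The following is a rough sketch, assuming `bash` and the coreutils `timeout` command are available; `port_open` and `check_endpoints` are names invented for this example.

```shell
# Returns 0 if a TCP connection to host $1, port $2 succeeds within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Checks every "host:port" entry of a comma-separated list. Prints the
# unreachable entries and returns non-zero if any entry is down.
check_endpoints() {
    failed=0
    for endpoint in $(printf '%s' "$1" | tr ',' ' '); do
        host=${endpoint%:*}
        port=${endpoint##*:}
        if ! port_open "$host" "$port"; then
            echo "unreachable: $endpoint"
            failed=1
        fi
    done
    return "$failed"
}

# Example with the addresses used in this article:
#   check_endpoints "10.8.12.16:2181,10.8.12.17:2181,10.8.12.18:2181"   # ZK quorum
#   check_endpoints "10.8.12.16:19998"                                  # master RPC port
```

If the ZK entries are unreachable, fix ZK first; the worker cannot register with a master that never managed to come up.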
3. Add the Hadoop and HDFS configuration files
Copy Hadoop's configuration files core-site.xml and hdfs-site.xml into $ALLUXIO_HOME/conf.
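The copy step can be wrapped in a small helper so that a missing file is noticed right away. This is just a sketch; `copy_hadoop_conf` is a made-up name, and the Hadoop path in the example is the one that appears in section 5.4.

```shell
# Copies core-site.xml and hdfs-site.xml from directory $1 into directory $2.
# Stops with an error message if either file is missing.
copy_hadoop_conf() {
    src=$1
    dst=$2
    for f in core-site.xml hdfs-site.xml; do
        if [ ! -f "$src/$f" ]; then
            echo "missing: $src/$f" >&2
            return 1
        fi
        cp "$src/$f" "$dst/" || return 1
    done
}

# Example:
#   copy_hadoop_conf /home/appadmin/hadoop-2.7.2/etc/hadoop "$ALLUXIO_HOME/conf"
```

Run it on every Alluxio node, and again whenever the HDFS configuration changes.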
4. Client configuration
Clients need the following extra Java options (each one is passed to the JVM as a -D system property):
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=[zookeeper_hostname1]:2181,[zookeeper_hostname2]:2181,[zookeeper_hostname3]:2181
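As a sketch of what that looks like on a command line, the helper below turns a ZK quorum string into the two `-D` flags. The function name `alluxio_ha_java_opts` is invented for this example, and `MyApp` / `my-app.jar` in the usage comment are placeholders, not part of this setup.

```shell
# Prints the JVM options a client needs in order to find the leader via ZK.
# $1 is a comma-separated ZooKeeper quorum such as "zk1:2181,zk2:2181".
alluxio_ha_java_opts() {
    printf '%s %s\n' \
        "-Dalluxio.zookeeper.enabled=true" \
        "-Dalluxio.zookeeper.address=$1"
}

# Example:
#   java $(alluxio_ha_java_opts "10.8.12.16:2181,10.8.12.17:2181,10.8.12.18:2181") \
#        -cp my-app.jar MyApp
```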
5. Full configuration at a glance
5.1 alluxio-env.sh (master)
ALLUXIO_MASTER_HOSTNAME=${ALLUXIO_MASTER_HOSTNAME:-"10.8.12.16"}
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"100GB"}
ALLUXIO_RAM_FOLDER=${ALLUXIO_RAM_FOLDER:-"/home/appadmin/ramdisk"}
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://ns/alluxio/"}
5.2 alluxio-env.sh (standby master-1)
ALLUXIO_MASTER_HOSTNAME=${ALLUXIO_MASTER_HOSTNAME:-"10.8.12.17"}
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"100GB"}
ALLUXIO_RAM_FOLDER=${ALLUXIO_RAM_FOLDER:-"/home/appadmin/ramdisk"}
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://ns/alluxio/"}
5.3 alluxio-env.sh (alluxio worker)
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"100GB"}
ALLUXIO_RAM_FOLDER=${ALLUXIO_RAM_FOLDER:-"/home/appadmin/ramdisk"}
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://ns/alluxio/"}
5.4 alluxio-site.properties (all nodes)
alluxio.underfs.hdfs.configuration=/home/appadmin/hadoop-2.7.2/etc/hadoop/core-site.xml
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=10.8.12.16:2181,10.8.12.17:2181,10.8.12.18:2181
alluxio.master.journal.folder=hdfs://ns/alluxio/journal
# Set the socket timeout a bit higher as well, otherwise you can run into closed-port problems when using ZK
alluxio.security.authentication.socket.timeout.ms=3000000
# Set the heartbeat timeout higher, otherwise workers may not switch to the new master in time during an Alluxio failover
alluxio.worker.block.heartbeat.timeout.ms=60000
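Because this file has to be identical on all nodes, a small sanity check can catch a node where one of the HA keys was forgotten. The sketch below only looks at the three keys listed above that matter for HA; `get_prop` and `check_ha_props` are names made up for this example, and the parser is deliberately simple (plain `key=value` lines only).

```shell
# Prints the value of key $2 from the .properties file $1.
# Comment lines starting with # do not match.
get_prop() {
    key=$(printf '%s' "$2" | sed 's/\./\\./g')   # escape dots for the regex
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Returns 0 only if ZK is enabled and both the ZK address and the journal
# folder are set; otherwise prints what is missing.
check_ha_props() {
    file=$1
    status=0
    [ "$(get_prop "$file" alluxio.zookeeper.enabled)" = "true" ] || {
        echo "alluxio.zookeeper.enabled is not true"; status=1; }
    for k in alluxio.zookeeper.address alluxio.master.journal.folder; do
        [ -n "$(get_prop "$file" "$k")" ] || { echo "$k is not set"; status=1; }
    done
    return "$status"
}

# Example:
#   check_ha_props "$ALLUXIO_HOME/conf/alluxio-site.properties"
```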
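6. Testing fault tolerance
6.1 How to start the cluster
- Run the following on the master and on the standby master to start the master process
# the master on every node will be started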
alluxio-start.sh master
- Run the following on each worker to start the worker process
alluxio-start.sh worker SudoMount
It seems the workers really do have to be started one by one with this command, otherwise things go wrong.
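If logging in to every node by hand gets tedious, the start order can be written down as a list of ssh commands first. The sketch below only prints the commands and executes nothing. It assumes passwordless SSH from the machine you run it on to every node, that `alluxio-start.sh` is on the remote PATH, and that SudoMount can get sudo rights without a prompt; `print_start_commands` is a name invented for this example.

```shell
# Prints the ssh commands that would start the cluster, masters first and
# then workers, one command per line. Nothing is executed.
# $1 = space-separated master hosts, $2 = space-separated worker hosts.
print_start_commands() {
    masters=$1
    workers=$2
    for h in $masters; do
        # -n keeps ssh from reading stdin, so the list can be piped to sh
        echo "ssh -n $h alluxio-start.sh master"
    done
    for h in $workers; do
        echo "ssh -n $h alluxio-start.sh worker SudoMount"
    done
}

# Example: review the plan first, then run it.
#   print_start_commands "a1 a2" "a1 a2 a3"
#   print_start_commands "a1 a2" "a1 a2 a3" | sh
```

Each worker is still started by its own `alluxio-start.sh worker SudoMount` call on its own node, which is the part that seemed to matter here.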
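6.2 Processes at a glance
Processes on node a1 (10.8.12.16):
Processes on node a2 (10.8.12.17):
Processes on node a3 (10.8.12.18): (I have an extra AlluxioMaster here because I also set up passwordless login between a3 and the other nodes, so this one got started too; it isn't actually needed)
6.3 ZK at a glance
We can see that 16 is the one acting as leader.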
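For reference, here is one way to list the leader znode from the command line. This is a sketch under assumptions: it assumes ZooKeeper's `zkCli.sh` is on the PATH and that Alluxio uses `/leader` as its leader path (I believe that is the default, but check `alluxio.zookeeper.leader.path` in your setup). `znode_children` is a helper name made up for this example; it only parses the `[child1, child2]` line that `zkCli.sh` prints for `ls`.

```shell
# Reads zkCli.sh output on stdin and prints the children found on the last
# "[child1, child2]" line, one per line.
znode_children() {
    grep '^\[.*]$' | tail -n 1 | sed 's/^\[//; s/]$//' | tr ',' '\n' | sed 's/^ *//; /^$/d'
}

# Example (the /leader path is an assumption, see above):
#   zkCli.sh -server 10.8.12.16:2181 ls /leader 2>/dev/null | znode_children
```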
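Opening the web UI, we find that only 16 shows any worker information:
17 can be opened too, but the page is empty!
Since we did not set ALLUXIO_MASTER_HOSTNAME in alluxio-env.sh on 18, it will not become a standby master, even though the process has been started!!
PS: if you start 3 masters at the same time, only 2 of them actually get picked, at random, as primary and standby, so extra ones are of no use.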
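6.4 NameNode failover
First we test whether one NameNode going down, and the failover it triggers, affects the Alluxio cluster.
Let's start by looking at the NameNode status:
We go to the nn2 node, look up the NameNode process, and kill it (O.O scary~~)
After the kill, check the status of nn1:
Then we check whether Alluxio is still healthy, simply by running the alluxio runTests command:
Nasty, isn't it? One NameNode went down and Alluxio did not notice; it still goes to the dead NameNode to fetch datanode information. Luckily the error gives us a hint: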
The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
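So this turns out to be an HDFS configuration issue. The cause is that we only have 3 nodes and also happen to use 3 replicas, so once 1 datanode goes down, writes can no longer succeed. Solutions:
6.4.1 Make the replication factor smaller than the number of datanodes (recommended)
<!-- Originally 3 nodes with 3 replicas; now changed to 3 nodes with 2 replicas -->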
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
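The reasoning behind this setting can be spelled out with a bit of arithmetic. The sketch below is a simplified model, not the actual HDFS DEFAULT policy (which has further conditions): it just counts whether, after one datanode in the write pipeline dies, any live datanode is left outside the pipeline to replace it. `explain_replacement` is a name made up for this example.

```shell
# $1 = total number of datanodes, $2 = dfs.replication
explain_replacement() {
    total=$1
    replication=$2
    survivors=$((replication - 1))     # pipeline nodes left after one failure
    live=$((total - 1))                # datanodes still alive
    candidates=$((live - survivors))   # live nodes not already in the pipeline
    if [ "$candidates" -gt 0 ]; then
        echo "ok: $candidates replacement candidate(s)"
    else
        echo "stuck: no replacement candidate"
    fi
}

# Example:
#   explain_replacement 3 3   # the original setup
#   explain_replacement 3 2   # after lowering dfs.replication
```

With 3 datanodes and 3 replicas there is nobody left to step in, which matches the write failure above. To confirm which value a node actually picks up, `hdfs getconf -confKey dfs.replication` prints it.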
6.4.2 Change the replace policy
Edit hdfs-site.xml and add the following option:
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
<value>NEVER</value>
</property>
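Shut down the Alluxio cluster and the DFS cluster, restart them, and repeat the steps above. I recommend the approach in 6.4.1, because that is the one I have verified, and it works~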
6.4.3 Re-copy the new HDFS files into the Alluxio configuration directory
Don't forget this step!!
6.5 Alluxio failover test
First confirm in the ZK client that the leader is nn1.
Now let's try killing the Alluxio master process on 16 O.O
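To find the process to kill, `jps` lists Java processes as `<pid> <MainClass>` lines. A small sketch for picking out the PID follows; `jps_pid` is a name invented for this example, and AlluxioMaster is the process name seen in section 6.2.

```shell
# Reads jps output on stdin and prints the PID(s) whose class name is $1.
jps_pid() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Example, run on the current leader (10.8.12.16):
#   pid=$(jps | jps_pid AlluxioMaster)
#   [ -n "$pid" ] && kill $pid
```

The same helper works for the NameNode kill in section 6.4 by passing NameNode instead.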
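Checking ZK again, we find that 17 has been added, but the old node has not been deleted (I don't think Alluxio handles this very well, since that node is already dead. I looked around and found that back in the Tachyon days someone had already submitted a PR to fix this, so why does Alluxio have the problem again? See [TACHYON-961] Delete previous leader znodes before taking leadership). Looks like I need to go file an issue too.
Then, on a2, which was the standby master, run the alluxio runTests command again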