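1. Introduction

Alluxio's master-slave architecture inevitably makes the master a single point of failure, so configuring fault tolerance for the master is a must.

If you have not installed Alluxio yet, please refer to my other articles.

This article assumes you already have an HA-enabled HDFS set up.

I have not written an article on configuring Hadoop HA; I recommend the official documentation: HDFS High Availability

The experimental setup used here is as follows: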

hostname          role
a1 (10.8.12.16)   alluxio-master, alluxio-worker, hadoop master, hadoop slave
a2 (10.8.12.17)   alluxio-standby-master, alluxio-worker, hadoop standby master, hadoop slave
a3 (10.8.12.18)   alluxio-worker, hadoop slave

Note: the hosts designated as master and standby master must be able to log in to each other over SSH without a password.

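2. Preparation

This assumes you have already installed the JDK, an HA-enabled HDFS, and a ZooKeeper (ZK) cluster.

One point deserves special emphasis: when configuring the ZK cluster, set initLimit in zoo.cfg to a larger value. The default is 10; I set it to 100.

If you do not, then when the workers in your Alluxio cluster have a lot of memory, the workers fail to start, and worker.log shows errors like these: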

2016-09-21 16:57:24,174 WARN  util.NativeCodeLoader (NativeCodeLoader.java:<clinit>) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-21 16:57:24,513 ERROR logger.type (MetricsConfig.java:loadConfigFile) - Error loading metrics configuration file.
2016-09-21 16:57:24,754 INFO  server.Server (Server.java:doStart) - jetty-7.x.y-SNAPSHOT
2016-09-21 16:57:24,769 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.s.ServletContextHandler{/metrics/json,null}
2016-09-21 16:57:24,814 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.w.WebAppContext{/,file:/home/appadmin/alluxio-1.2.0/core/server/src/main/webapp/},/home/appadmin/alluxio-1.2.0/core/server/src/main/webapp
2016-09-21 16:57:27,864 INFO  server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChannelConnector@0.0.0.0:30000
2016-09-21 16:57:27,864 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio Worker Web service started @ 0.0.0.0/0.0.0.0:30000
2016-09-21 16:57:27,865 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:27,869 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (0) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:27,919 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:27,920 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (1) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:27,970 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:27,970 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (2) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:28,021 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:28,021 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (3) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:28,171 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:28,173 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (4) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:28,723 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:28,723 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (5) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:29,724 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:29,724 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (6) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:30,725 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:30,725 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (7) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:31,725 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:31,726 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (8) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:32,726 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:32,726 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (9) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:33,726 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:33,727 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (10) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:34,727 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:34,727 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (11) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:35,728 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:35,728 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (12) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:36,728 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:36,729 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (13) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:37,729 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:37,729 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (14) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:38,729 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:38,730 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (15) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:39,730 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:39,731 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (16) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:40,731 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:40,731 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (17) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:41,732 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:41,732 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (18) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:42,732 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:42,732 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (19) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:43,733 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:43,733 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (20) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:44,733 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:44,734 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (21) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:45,734 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:45,735 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (22) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:46,735 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:46,735 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (23) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:47,736 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:47,736 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (24) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:48,736 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:48,737 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (25) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:49,737 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:49,737 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (26) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:50,738 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:50,738 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (27) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:51,738 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:51,739 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (28) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:52,739 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.2.0) is trying to connect with BlockMasterWorker master @ /10.8.12.16:19998
2016-09-21 16:57:52,739 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (29) to BlockMasterWorker master @ /10.8.12.16:19998 : java.net.ConnectException: Connection refused
2016-09-21 16:57:52,740 ERROR logger.type (BlockWorker.java:start) - Failed to get a worker id from block master
alluxio.exception.ConnectionFailedException: Failed to connect to BlockMasterWorker master @ /10.8.12.16:19998 after 29 attempts
    at alluxio.AbstractClient.connect(AbstractClient.java:186)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:291)
    at alluxio.worker.block.BlockMasterClient.getId(BlockMasterClient.java:109)
    at alluxio.worker.WorkerIdRegistry.registerWithBlockMaster(WorkerIdRegistry.java:60)
    at alluxio.worker.block.BlockWorker.start(BlockWorker.java:168)
    at alluxio.worker.AlluxioWorker.startWorkers(AlluxioWorker.java:354)
    at alluxio.worker.AlluxioWorker.start(AlluxioWorker.java:326)
    at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:84)
2016-09-21 16:57:52,741 ERROR logger.type (AlluxioWorker.java:main) - Uncaught exception while running Alluxio worker, stopping it and exiting.
java.lang.RuntimeException: alluxio.exception.ConnectionFailedException: Failed to connect to BlockMasterWorker master @ /10.8.12.16:19998 after 29 attempts
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at alluxio.worker.block.BlockWorker.start(BlockWorker.java:171)
    at alluxio.worker.AlluxioWorker.startWorkers(AlluxioWorker.java:354)
    at alluxio.worker.AlluxioWorker.start(AlluxioWorker.java:326)
    at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:84)
Caused by: alluxio.exception.ConnectionFailedException: Failed to connect to BlockMasterWorker master @ /10.8.12.16:19998 after 29 attempts
    at alluxio.AbstractClient.connect(AbstractClient.java:186)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:291)
    at alluxio.worker.block.BlockMasterClient.getId(BlockMasterClient.java:109)
    at alluxio.worker.WorkerIdRegistry.registerWithBlockMaster(WorkerIdRegistry.java:60)
    at alluxio.worker.block.BlockWorker.start(BlockWorker.java:168)
    ... 3 more

If you inspect the znodes in ZK, you will also find that the relevant nodes were never created. The ZK log shows errors as well, reporting connection refused.
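For reference, initLimit is counted in ticks, so the window ZooKeeper allows for the initial connect-and-sync phase is initLimit × tickTime. Here is a minimal Python sketch of that arithmetic. The helper names and the zoo.cfg fragments are made up for illustration, and tickTime=2000 is simply the value from ZooKeeper's sample config, not necessarily what your cluster uses.

```python
def parse_zoo_cfg(text):
    """Parse simple key=value lines from zoo.cfg text, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def init_timeout_ms(text):
    """Return initLimit * tickTime, the initial sync window in milliseconds."""
    cfg = parse_zoo_cfg(text)
    return int(cfg["initLimit"]) * int(cfg["tickTime"])


# Hypothetical zoo.cfg fragments: only initLimit differs between them.
BEFORE = "tickTime=2000\ninitLimit=10\nsyncLimit=5\n"
AFTER = "tickTime=2000\ninitLimit=100\nsyncLimit=5\n"

print(init_timeout_ms(BEFORE))  # 20000
print(init_timeout_ms(AFTER))   # 200000
```

With these assumed values, raising initLimit from 10 to 100 stretches the window from 20 seconds to 200 seconds.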

3. Add the Hadoop and HDFS configuration files

Copy Hadoop's configuration files core-site.xml and hdfs-site.xml into $ALLUXIO_HOME/conf.
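If you would rather script the copy (you will need to repeat it whenever the HDFS configuration changes), something like the following Python sketch should do. The function name copy_hadoop_conf is my own, and the directories mentioned in the note after the code are inferred from paths that appear elsewhere in this article, so adjust them to your installation.

```python
import shutil
from pathlib import Path

HADOOP_CONF_FILES = ("core-site.xml", "hdfs-site.xml")


def copy_hadoop_conf(hadoop_conf_dir, alluxio_conf_dir):
    """Copy the two Hadoop client configs into Alluxio's conf directory.

    Returns the list of destination paths that were written.
    """
    src_dir = Path(hadoop_conf_dir)
    dst_dir = Path(alluxio_conf_dir)
    copied = []
    for name in HADOOP_CONF_FILES:
        src = src_dir / name
        if not src.is_file():
            # Fail loudly rather than leave Alluxio with a partial config.
            raise FileNotFoundError(f"missing Hadoop config: {src}")
        dst = dst_dir / name
        shutil.copy2(src, dst)  # copy2 also preserves the file timestamps
        copied.append(dst)
    return copied
```

On my layout the call would presumably be copy_hadoop_conf("/home/appadmin/hadoop-2.7.2/etc/hadoop", "/home/appadmin/alluxio-1.2.0/conf"), run on every Alluxio node.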

4. Client configuration

Clients need the following additional Java options when they run:

alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=[zookeeper_hostname1]:2181,[zookeeper_hostname2]:2181,[zookeeper_hostname3]:2181
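Java options like these are normally handed to the JVM as -Dkey=value system properties. The Python sketch below just renders the two properties into that form. The hostnames zk1..zk3 are placeholders, and how you pass the resulting string to your own application depends on your launcher, which is not shown here.

```python
def to_java_opts(properties):
    """Render a dict of properties as -Dkey=value JVM flags."""
    return " ".join(f"-D{key}={value}" for key, value in properties.items())


# Placeholder hostnames zk1..zk3 stand in for your real ZooKeeper hosts.
zk_hosts = ["zk1", "zk2", "zk3"]
client_props = {
    "alluxio.zookeeper.enabled": "true",
    "alluxio.zookeeper.address": ",".join(f"{host}:2181" for host in zk_hosts),
}

print(to_java_opts(client_props))
```

The ZooKeeper address has to be one comma-separated string with no spaces, which is why the hosts are joined before the flag is rendered.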

5. Full configuration at a glance

5.1 alluxio-env.sh (master)

ALLUXIO_MASTER_HOSTNAME=${ALLUXIO_MASTER_HOSTNAME:-"10.8.12.16"}
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"100GB"}
ALLUXIO_RAM_FOLDER=${ALLUXIO_RAM_FOLDER:-"/home/appadmin/ramdisk"}
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://ns/alluxio/"}

5.2 alluxio-env.sh (standby master-1)

ALLUXIO_MASTER_HOSTNAME=${ALLUXIO_MASTER_HOSTNAME:-"10.8.12.17"}
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"100GB"}
ALLUXIO_RAM_FOLDER=${ALLUXIO_RAM_FOLDER:-"/home/appadmin/ramdisk"}
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://ns/alluxio/"}

5.3 alluxio-env.sh (Alluxio worker)

ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"100GB"}
ALLUXIO_RAM_FOLDER=${ALLUXIO_RAM_FOLDER:-"/home/appadmin/ramdisk"}
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://ns/alluxio/"}

5.4 alluxio-site.properties (all nodes)

alluxio.underfs.hdfs.configuration=/home/appadmin/hadoop-2.7.2/etc/hadoop/core-site.xml
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=10.8.12.16:2181,10.8.12.17:2181,10.8.12.18:2181
alluxio.master.journal.folder=hdfs://ns/alluxio/journal
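# Raise the socket timeout a bit as well; otherwise you hit closed-port problems when using ZK
alluxio.security.authentication.socket.timeout.ms=3000000
# Raise the heartbeat timeout; otherwise workers may not have time to switch to the new master during an Alluxio failover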
alluxio.worker.block.heartbeat.timeout.ms=60000
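Since this file has to be identical on every node, a quick sanity check before starting the cluster can save some head-scratching. Below is a rough Python sketch. The function names are mine, and the set of keys it checks is simply the HA-related ones from the listing above, not an official list of required settings.

```python
# Keys from the listing above that must be present and non-empty.
REQUIRED_KEYS = (
    "alluxio.zookeeper.address",
    "alluxio.master.journal.folder",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_ha_properties(text):
    """Return a list of problems found; an empty list means the checks pass."""
    props = parse_properties(text)
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not props.get(key)]
    if props.get("alluxio.zookeeper.enabled", "").lower() != "true":
        problems.append("alluxio.zookeeper.enabled is not true")
    return problems


SITE = """\
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=10.8.12.16:2181,10.8.12.17:2181,10.8.12.18:2181
alluxio.master.journal.folder=hdfs://ns/alluxio/journal
"""

print(check_ha_properties(SITE))  # []
```

The parser is deliberately naive (no line continuations or escapes), which is enough for a flat file like this one.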

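6. Testing fault tolerance

6.1 Starting the cluster

  1. On the master and on the standby master, run the following to start the master process
# the master on every node gets started
alluxio-start.sh master
  2. On each worker, run the following to start the worker process
alluxio-start.sh worker SudoMount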

It seems the workers really do have to be started one by one with this command; otherwise things go wrong.
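To keep the per-host steps straight, the start order can be written down as data. This Python sketch only prints which command to run where, using the two commands above and the role table from section 1. It does not execute anything remotely, and the names ROLES, COMMANDS and start_plan are illustrative.

```python
# Alluxio roles per host, taken from the table in section 1.
ROLES = {
    "a1": ("master", "worker"),
    "a2": ("master", "worker"),  # the standby runs the same master process
    "a3": ("worker",),
}

COMMANDS = {
    "master": "alluxio-start.sh master",
    "worker": "alluxio-start.sh worker SudoMount",
}


def start_plan(roles):
    """Return (host, command) pairs: every master first, then every worker."""
    plan = []
    for role in ("master", "worker"):
        for host in sorted(roles):
            if role in roles[host]:
                plan.append((host, COMMANDS[role]))
    return plan


for host, command in start_plan(ROLES):
    print(f"{host}: {command}")
```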

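6.2 Processes at a glance

Processes on node a1 (10.8.12.16):

Processes on node a2 (10.8.12.17):

Processes on node a3 (10.8.12.18): (there is an extra AlluxioMaster here because I also set up passwordless login between a3 and the other nodes, so it got started too; it is not actually needed)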

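6.3 ZK at a glance

It turns out that the .16 host is the leader.

Opening the web UI, we find that only the .16 host shows worker information:

The .17 host's UI is reachable, but it is empty!

On the .18 host we did not set ALLUXIO_MASTER_HOSTNAME in alluxio-env.sh, so it will not become a standby master even though the process has started!!

PS: if you start three masters at the same time, only two of them actually get picked, at random, as primary and standby, so extra masters are of no use.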

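6.4 NameNode failover

First we test whether losing one NameNode, and the failover it triggers, affects the Alluxio cluster.

Start by checking the state of the NameNodes:

We go to the nn2 node, look up the NameNode process, and kill it (O.O scary~~)

After the kill, check the state of nn1:

Then we check whether Alluxio still works, using a simple test with the alluxio runTests command:

Nasty, isn't it? One NameNode went down, yet Alluxio did not notice and kept asking the dead NameNode for DataNode information. Luckily the error message gave us a hint: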

The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

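So this is actually an HDFS configuration problem. The cause is that we have only 3 nodes and happen to have replication set to 3, so as soon as one DataNode goes down, writes can fail. Solutions:

6.4.1 Make the replication factor smaller than the number of DataNodes (recommended)

<!-- Originally 3 nodes with 3 replicas; now 3 nodes with 2 replicas -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>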

6.4.2 Change the replace-datanode policy

Add the following option to hdfs-site.xml:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

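Shut down the Alluxio cluster and the DFS cluster, restart them, and repeat the steps above. I recommend the method in 6.4.1, because that is the one I have verified to work~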
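To see why either fix helps, here is a small Python model of the DEFAULT policy as I understand its description in hdfs-default.xml: with replication r and n DataNodes left in the write pipeline, a replacement is requested only if r >= 3 and either floor(r/2) >= n, or r > n and the block was hflushed or appended. Treat this as a sketch, not as HDFS's actual code. The function names are mine, and the default hflushed_or_appended=True is an assumption I picked because it reproduces the failure described above.

```python
def default_policy_wants_replacement(r, n, hflushed_or_appended=True):
    """Model of the DEFAULT replace-datanode-on-failure policy.

    r: replication factor; n: DataNodes still alive in the write pipeline.
    """
    if r < 3:
        return False
    return (r // 2 >= n) or (r > n and hflushed_or_appended)


def write_survives(r, live_datanodes, policy="DEFAULT"):
    """True if a write can continue after one pipeline DataNode fails."""
    n = r - 1                    # the pipeline had r nodes and one just failed
    spare = live_datanodes - n   # live nodes that are not already in the pipeline
    if policy == "NEVER" or not default_policy_wants_replacement(r, n):
        return n >= 1            # carry on with the nodes that remain
    return spare >= 1            # a replacement is required, so a spare must exist


# Three DataNodes in total, one just died, so two are still alive.
print(write_survives(r=3, live_datanodes=2))                  # False
print(write_survives(r=2, live_datanodes=2))                  # True
print(write_survives(r=3, live_datanodes=2, policy="NEVER"))  # True
```

With 3 replicas on 3 nodes there is no spare DataNode to swap in, so the write fails; dropping to 2 replicas (6.4.1) or setting the policy to NEVER (6.4.2) both avoid asking for a replacement.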

6.4.3 Copy the updated HDFS files to the Alluxio configuration directory again

Do not forget this step!!

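6.5 Alluxio failover test

First confirm in the ZK client that the leader is the nn1 host.

Now let's try killing the Alluxio master process on the .16 host O.O


Checking ZK again, a new entry for .17 has appeared, but the old node was not deleted. (I think Alluxio handles this poorly, since that node is already dead. I looked it up and found that back in the Tachyon days someone had already submitted a PR to fix this, so why does Alluxio have the problem again? See [TACHYON-961] Delete previous leader znodes before taking leadership.) Looks like I will have to go and file an issue too.

Then, on a2 (the former standby master), rerun the alluxio runTests command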