1. Chandy lamport算法简介

1.1 补充知识

先普及个概念(取自维基百科):

Snaphot algorithm:

A snapshot algorithm is used to create a consistent snapshot of the global state of a distributed system. Due to the lack of globally shared memory and a global clock, this isn't trivially possible.

今天讨论的chandy and larmport算法也是分布式快照算法的一种,用于在分布式系统中记录一个全局一致的快照。

1.2 基本算法流程

原始论文可以参考Distributed Snapshots: Determining Global States of Distributed System

算法有些前提如下,简单总结就是传输是基本可靠的,不会发送消息丢包,或者无法通信等问题。
The assumptions of the algorithm are as follows:

  • There are no failures and all messages arrive intact and only once
  • The communication channels are unidirectional and FIFO ordered
  • There is a communication path between any two processes in the system
  • Any process may initiate the snapshot algorithm
  • The snapshot algorithm does not interfere with the normal execution of the processes
  • Each process in the system records its local state and the state of its incoming channels

基本流程:

论文中通过p,q两个节点传输令牌token来模拟算法流程


伪代码如下:

// 进程p行为,通过向q发出Marker,发起snapshot
begin
       p record its state;
then
       send one Marker along c after p records its state and before p sends further messages along c
end

//进程q接受Marker后的行为,q记录自身状态,并记录通道c的状态
if q has not recorded its state then
        begin
              q records its state;
              q records the state c as the empty sequence
        end
else q records the state of c as the sequence of messages received along c after q’s state was recorded and before q received the marker along c. 

算法总体上比较好理解,重点是通过传递marker来记录整条链路的全局状态。后续恢复的时候每个节点都可以从自己之前记录的checkpoint中恢复出来。

1.3 主要应用场景

该算法主要用于一些分布式系统中,利用全局一致性快照可以保证消息传递的exactly once语义

2. Flink中的应用

这里简单扯下流计算的发展。流计算框架包含两个核心问题,即计算和状态管理。发展历程也比较相似:
storm(需要自己管理状态)->Samza(内部利用leveldb和kafka管理数据,但是不支持exactly once语义)->Spark streaming(支持exactly once但是mini batch的思路延迟相对较高)

Flink实现exactly once方式的核心思想是基于chandy and lamport,具体细节可以参考论文Lightweight Asynchronous Snapshots for Distributed Dataflows

相比lamport的论文,这篇论文更加以实际工程角度说明了在流计算场景中如何实现exactly once语义。主要方式就是引入一个barrier,把input stream分为preshot records和post records。只有等到所有上游barrier才会继续往下处理。收到所有上游barrier的时候做一个snapshot。


主要缺点:图上也可以看出来,先收到的barrier的数据流可能由于要等待另外一个输入流的barrier而一直被阻塞,可能会产生较大的延迟

3. spark streaming

参考Spark Streaming Programming Guide可知,现在spark streaming 支持exactly once主要需要保证处理数据流程的每一步均满足exactly once:

  1. receive data: 假设数据只会传一次,并且传送一定成功
  2. transform: RDD计算结果满足确定性,重复结算结果是一样的
  3. output : spark streaming输出结果到外部系统的时候采用一个唯一的id来判断是否要写出数据,参考Semantics of output operations

4. 关于End to End exactly总结

如果data source ,data sink本身满足exactly once,并且数据传输时也满足exactly once,那么可以认为整个系统是end-to-end exactly once的。大部分分布式系统现在都可以利用chandy lamport算法的思想来满足自己系统的exactly once。但是在对接sink的时候往往需要一些额外工作,例如flink输出到kafka需要用二阶段提交,spark streaming输出时使用唯一id判断是否已经写过,这些都是一些实现上的细节差异。

参考资料:

  1. 分布式Snapshot和Flink Checkpointing简介
  2. An Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache Kafka, too!)