Design and Implementation of CloudCanal Communication Layer

Preface

CloudCanal, a product of our previous startup, used RSocket as the communication layer between its console and sidecar nodes. This article shares the reasoning behind that technology choice and the implementation details of CloudCanal's unified communication layer built on RSocket.

Core motivation and trade-off

Generally speaking, a team choosing a communication layer would first consider well-known, mature frameworks such as gRPC or Dubbo. At the time, however, we had few options that fit our requirements, so we opted for a relatively new framework called RSocket. The main reason was that we needed a framework that supported fully peer-to-peer communication, meaning that either node in a connection could act as the server or the client. This let us decide freely which side opens a listening port while still communicating in full duplex, which was crucial for making CloudCanal more secure. Frameworks like Dubbo and gRPC are designed around application-layer (OSI layer 7) request/response semantics, in which sending and receiving are tied to fixed server and client roles. That model does not fit CloudCanal's scenario well.

In CloudCanal's architecture, the console plays the traditional client role as the initiator of requests, while the sidecar is closer to a server. For security reasons, however, CloudCanal requires the sidecar to initiate the connection to the console. In this mode the sidecar does not need to open any ports, which was very important for the secure SaaS data integration platform we wanted to build at the time. Traditional communication layers built on HTTP (an OSI layer 7 protocol) could not meet this requirement, so we chose RSocket.

Disadvantages of RSocket

After adopting RSocket, we ran into quite a few problems. The main ones are as follows:

  • Lack of service registration and discovery: In CloudCanal's architecture, the console node initiates requests to nodes in the sidecar cluster. Core RSocket does not include service registration, so we built this capability ourselves.
  • Lack of load balancing: Core RSocket is mainly point-to-point and offers no control over load balancing across service nodes.
  • Lack of observability: Core RSocket provides nothing for service observability, which makes troubleshooting difficult when problems occur. Application developers have to fill this gap themselves.
  • Bugs in early versions: One serious issue was that RSocket, when processing a ByteBuf internally, did not copy it and instead modified and released the shared memory block directly. As a result, authentication based on the setup frame failed on server restarts and client reconnections, both of which depend on setup frame information.
  • Not really an RPC framework: RSocket is closer to a general communication framework like Netty than to a framework designed for RPC. The ecosystem later added the rsocket-rpc package, and Spring Boot gained RSocket support (springboot-rsocket), which provides basic annotation-based routing. Even so, remote method invocation remains cumbersome. When we expose our encapsulated communication layer to other developers, we want remote calls to be as simple and transparent as local method calls, and this is one of the goals of the communication layer design.
  • Reactive style conflicts with team habits: RSocket is designed for applications written in a reactive programming style. Developers used to synchronous calls can easily misuse remote calls and block the communication channel directly. We therefore needed a middle layer that lets developers write synchronous-style code while RSocket still communicates asynchronously and without blocking.

Communication layer design

To address the problems of native RSocket, we further encapsulated and optimized it so that it could serve as a universal communication layer for all of our future enterprise products (mainly CloudCanal and CloudDM).

Core architecture

On top of the core communication capabilities provided by RSocket, we added another layer of encapsulation that extends its abilities and makes it easier to use. The core architecture is as follows:
[Figure: core architecture of the communication layer]

RequestManager

RequestManager is a class that both the sender and the receiver depend on. Its core function is to let upper-layer code make RSocket remote calls in a synchronous style. This matters a great deal. Suppose A sends a request to B and waits for the response, and B handles the request with a slow I/O operation. If A waits synchronously through RSocket's own interface, it blocks the NIO thread and therefore all communication on that long-lived RSocket connection, making the communication layer unavailable. RSocket is built on the Reactor model, and its key property is asynchronous, non-blocking I/O. RequestManager therefore provides an asynchronous, non-blocking middle layer that decouples application-level calls from data transmission, so that the whole communication layer can run in a fully asynchronous, non-blocking way.

The core process is:

  1. When sending a request, register it with RequestManager under its requestId.
  2. Registered requests are stored in a Map whose key is the requestId and whose value is a Guava SettableFuture, which is filled in when the asynchronous result returns.
  3. The remote method call blocks and waits for the SettableFuture result on a business thread, never on an NIO thread.
  4. Once the asynchronous result is set on the SettableFuture, the upper-level caller receives the response.
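The four steps above can be sketched in Java roughly as follows. This is a minimal illustration, not CloudCanal's code: it substitutes the JDK's CompletableFuture for Guava's SettableFuture so that it has no external dependencies, and the class shape, the String payload type, and the method names register, await, complete, and fail are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: pairs asynchronous responses with waiting business threads by requestId. */
final class RequestManager {
    private final AtomicLong ids = new AtomicLong();
    // requestId -> future filled by the response callback (CompletableFuture stands in for SettableFuture).
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Steps 1-2: allocate a requestId and register an empty future under it. */
    long register() {
        long requestId = ids.incrementAndGet();
        pending.put(requestId, new CompletableFuture<>());
        return requestId;
    }

    /** Step 3: runs on the business thread, so only that thread blocks, never the NIO thread. */
    String await(long requestId, long timeoutMs)
            throws InterruptedException, ExecutionException, TimeoutException {
        CompletableFuture<String> future = pending.get(requestId);
        if (future == null) {
            throw new IllegalStateException("unknown requestId " + requestId);
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            pending.remove(requestId); // drop the entry on success, failure, or timeout
        }
    }

    /** Step 4: called from the response callback; completing a future never blocks. */
    void complete(long requestId, String response) {
        CompletableFuture<String> future = pending.get(requestId);
        if (future != null) {
            future.complete(response);
        }
    }

    /** Error path: the waiting caller's get() throws an ExecutionException wrapping this cause. */
    void fail(long requestId, Throwable error) {
        CompletableFuture<String> future = pending.get(requestId);
        if (future != null) {
            future.completeExceptionally(error);
        }
    }
}
```

The design point is that complete and fail only set a value and never block, so they are safe to call from an RSocket/NIO callback, while await does its blocking on the business thread that made the call. Removing the entry in a finally block keeps timed-out requests from accumulating in the map.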

Sender Module

The sending module hides the details of sending over RSocket. Callers see the API exposed by a remote node through a shared JAR dependency and call remote methods just like local ones. The core techniques are dynamic proxies and custom bean injection: Spring offers rich extension points, which make it easy to inject a JDK dynamic proxy into a service field as a bean and then call its methods directly. The following figure shows an example, where the caller injects the service object xxRService and completes a remote call as if it were calling an ordinary method.

[Figure: injecting xxRService and calling a remote method]

The general process of completing transparent proxy and bean decoration is as follows:
[Figure: transparent proxy and bean decoration process]
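A stripped-down version of the proxy half of this process might look like the following. It assumes a hypothetical EchoRService interface (standing in for a service such as xxRService) and a hypothetical RemoteTransport that stands in for the RSocket send path; the route naming scheme "Interface.method" is also an assumption. In a Spring application, a proxy like this would typically be exposed as a bean through an extension point such as a FactoryBean, which is omitted here.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

/** Hypothetical remote API published to callers in a shared JAR. */
interface EchoRService {
    String echo(String message);
}

/** Hypothetical stand-in for the RSocket send path: route and arguments in, result out. */
interface RemoteTransport {
    Object call(String route, Object[] args);
}

final class RemoteProxyFactory {
    private RemoteProxyFactory() {
    }

    /** Creates a JDK dynamic proxy that turns each interface method call into a remote call. */
    @SuppressWarnings("unchecked")
    static <T> T create(Class<T> api, RemoteTransport transport) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Keep java.lang.Object methods local instead of sending them over the network.
            if (method.getDeclaringClass() == Object.class) {
                switch (method.getName()) {
                    case "equals":
                        return proxy == args[0];
                    case "hashCode":
                        return System.identityHashCode(proxy);
                    default:
                        return api.getSimpleName() + "(remote proxy)";
                }
            }
            // Route name is assumed to be "Interface.method".
            String route = api.getSimpleName() + "." + method.getName();
            return transport.call(route, args == null ? new Object[0] : args);
        };
        return (T) Proxy.newProxyInstance(api.getClassLoader(), new Class<?>[] {api}, handler);
    }
}
```

Because the proxy implements the interface from the shared JAR, the caller's code looks exactly like a local call; only the RemoteTransport implementation knows that a network hop happens.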

The sender module has another important responsibility: load balancing. CloudCanal stores its state, including node health information, in the database, so the sender can apply custom routing based on node statistics. For example, by default it prefers the node with the lowest load, and when a request involves binding information, it routes point-to-point to the bound node.
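The routing policy described above (prefer a bound node, otherwise pick the healthy node with the lowest load) could be sketched as follows. NodeStat, SidecarRouter, the load score, and the task-to-node binding map are hypothetical stand-ins for the statistics CloudCanal reads from its database.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical snapshot of one sidecar node, as loaded from the database. */
final class NodeStat {
    final String nodeId;
    final boolean healthy;
    final double load; // illustrative load score: lower means less busy

    NodeStat(String nodeId, boolean healthy, double load) {
        this.nodeId = nodeId;
        this.healthy = healthy;
        this.load = load;
    }
}

final class SidecarRouter {
    private SidecarRouter() {
    }

    /**
     * Returns the node a request for taskId should go to: the bound node if a binding
     * exists (point-to-point), otherwise the healthy node with the lowest load.
     */
    static Optional<String> route(String taskId, Map<String, String> bindings, List<NodeStat> nodes) {
        String bound = bindings.get(taskId);
        if (bound != null) {
            return Optional.of(bound);
        }
        return nodes.stream()
                .filter(n -> n.healthy)
                .min(Comparator.comparingDouble((NodeStat n) -> n.load))
                .map(n -> n.nodeId);
    }
}
```

A real router would likely also confirm that a bound node is still online before using it; that check is left out to keep the sketch short.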

Receive Module

The receiving end's main responsibility is to accept network requests arriving at the current machine and dispatch them to specific interfaces on that machine. If selecting the service node is the first-level route, then locating the service on the local machine is the second-level route. The core workflow is as follows:
[Figure: receive module workflow]

The key implementation point of the receive module is that the dispatcher must hand each request to an asynchronous thread pool instead of processing it on the NIO thread, so that slow handlers never block the NIO thread.
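One way to satisfy this rule is shown below: the NIO thread only looks up the handler for the route (the second-level route) and hands the work to a separate executor. Dispatcher, its String-to-String handler signature, and the pool size of 4 are hypothetical choices for illustration.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical second-level router: route name -> local handler, executed off the NIO thread. */
final class Dispatcher implements AutoCloseable {
    private final Map<String, Function<String, String>> handlers = new ConcurrentHashMap<>();
    // Business thread pool; the size is an arbitrary illustrative value.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    void register(String route, Function<String, String> handler) {
        handlers.put(route, handler);
    }

    /** Called on the NIO thread: only looks up the handler and hands the work off. */
    CompletableFuture<String> dispatch(String route, String payload) {
        Function<String, String> handler = handlers.get(route);
        if (handler == null) {
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(new IllegalArgumentException("no handler for route " + route));
            return failed;
        }
        // The possibly slow handler runs on a worker thread, never on the calling thread.
        return CompletableFuture.supplyAsync(() -> handler.apply(payload), workers);
    }

    @Override
    public void close() {
        workers.shutdown();
    }
}
```

In an RSocket responder, the returned CompletableFuture could be adapted with Reactor's Mono.fromFuture, so the NIO thread is released immediately and the response is written when the worker finishes.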

Functionality

This module provides the supporting functional features, including:

  • Authentication: Verifies the permissions of nodes that access the service. Service nodes are partitioned by tenant.
  • SSL: Supports SSL-encrypted communication.
  • Session Manager: Manages the communication layer's own service nodes, including node registration, authentication, logout, health checks, and load balancing.
  • Log Aspect: All RPC requests pass through the encapsulated communication layer, so it acts as an aspect. Adding log points in this layer makes problems easier to trace. The logs include common request logs (requestId, communication direction, route name), exception logs, and slow-communication logs that can be hooked into the system's own alerting.
  • Metric: A built-in RService collects node statistics, which routing uses to determine whether each node is online or offline and whether it is healthy.
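As an illustration of the Log Aspect item, the wrapper below logs requestId, direction, and route for each call, logs exceptions, and flags calls slower than a threshold. It is a sketch under assumptions: RpcLogAspect, the around method, and the threshold parameter are hypothetical, and it uses java.util.logging only to stay dependency-free, where a production system would more likely use its existing logging framework.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical logging aspect wrapped around every RPC call in the communication layer. */
final class RpcLogAspect {
    private static final Logger LOG = Logger.getLogger(RpcLogAspect.class.getName());

    private final long slowThresholdMs;
    private final AtomicLong slowCalls = new AtomicLong();

    RpcLogAspect(long slowThresholdMs) {
        this.slowThresholdMs = slowThresholdMs;
    }

    /** Runs the call and logs it; direction is a free-form label such as "console->sidecar". */
    <T> T around(String requestId, String direction, String route, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            T result = call.get();
            long costMs = (System.nanoTime() - start) / 1_000_000;
            if (costMs >= slowThresholdMs) {
                // Slow-communication log: the natural place to hook in an alert.
                slowCalls.incrementAndGet();
                LOG.warning(String.format("SLOW rpc id=%s dir=%s route=%s cost=%dms",
                        requestId, direction, route, costMs));
            } else {
                LOG.info(String.format("rpc id=%s dir=%s route=%s cost=%dms",
                        requestId, direction, route, costMs));
            }
            return result;
        } catch (RuntimeException e) {
            // Exception log with the same identifying fields, then rethrow unchanged.
            LOG.log(Level.SEVERE, String.format("rpc failed id=%s dir=%s route=%s",
                    requestId, direction, route), e);
            throw e;
        }
    }

    long slowCalls() {
        return slowCalls.get();
    }
}
```

The slowCalls counter is included only so the slow-call branch is easy to observe; in practice that branch would feed the system's alerting instead.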

commons

Utility classes, annotation definitions, model classes, and top-level interfaces.

Some design thoughts on RSocket itself

  • Richer interaction models: RSocket supports four interaction models (request-response, fire-and-forget, request-stream, and request-channel), which cover streaming and push semantics. Peer-to-peer communication follows the same idea of offering richer interaction, and together the two greatly increase RSocket's applicability and flexibility. Communication layers built on HTTP are limited to its request-response model and cannot achieve fully peer-to-peer bidirectional communication natively.
  • Application-level flow control: Protocols built on HTTP/2, such as gRPC, support byte-based flow control, where the receiver specifies a window size in bytes. RSocket instead emphasizes application-level flow control: a REQUEST_N frame tells the sender how many more messages the receiver can accept. Counting messages rather than bytes can reduce buffer bloat to some extent (buffer bloat is also one reason traditional congestion control algorithms perform poorly). Application-level flow control can also work alongside TCP-level flow control. Still, I think gRPC's dynamic flow control based on BDP estimation is the future, because it performs well without users having to manage flow control themselves.
  • High-level protocol: RSocket is generally classified as an OSI layer 5/6 protocol. It can run over various lower-level transports, such as TCP, WebSocket, and Aeron; in Java, the TCP and WebSocket transports are implemented on Netty.
  • Binary transmission: Data is packed into binary frames before being passed to the lower layer. As with HTTP/2, this is more efficient and easier to parse.
  • Multiplexing: Borrowing HTTP/2's design, multiple streams can be multiplexed over a single connection, avoiding head-of-line blocking between requests at the application layer.
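To make the difference between message-based and byte-based flow control concrete, here is a toy model of REQUEST_N-style credits. It is not the RSocket API; CreditedStream and its methods are hypothetical. The receiver grants credit for n messages, and the sender delivers buffered messages only while credit remains.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of REQUEST_N-style, message-counted flow control (not the RSocket API). */
final class CreditedStream {
    private final Queue<String> backlog = new ArrayDeque<>();
    private final List<String> delivered = new ArrayList<>();
    private long credits; // how many more messages the receiver has agreed to accept

    /** Sender side: buffer the message, then deliver as much as current credit allows. */
    void offer(String message) {
        backlog.add(message);
        drain();
    }

    /** Receiver side: grant n more messages, analogous to sending a REQUEST_N frame. */
    void request(long n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // Saturate instead of overflowing when credit grows very large.
        credits = (Long.MAX_VALUE - credits < n) ? Long.MAX_VALUE : credits + n;
        drain();
    }

    private void drain() {
        while (credits > 0 && !backlog.isEmpty()) {
            delivered.add(backlog.poll());
            credits--;
        }
    }

    List<String> delivered() {
        return delivered;
    }

    int pending() {
        return backlog.size();
    }
}
```

Because credit is counted in messages, the receiver's grant maps directly to how many items its application code is ready to process, regardless of how large each message is.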

Summary

Unless you have a strong requirement such as CloudCanal's need for peer-to-peer communication, gRPC is generally a good choice. As cloud native increasingly becomes the foundation of internet software, gRPC has essentially become the de facto standard for the communication layer, and RSocket may find it hard to compete. Having Google as a backer is a huge advantage for gRPC, and it now also offers Reactor-style reactive programming support and better flow control based on BDP estimation.