Thialfi: A Client Notification Service for Internet-Scale Applications

In Norse mythology, Thialfi, a swift runner, is the attendant of Thor, the god of thunder. That swiftness is perhaps why the folks at Google named their notification delivery system after him (it delivers notifications with sub-second latency!).

The Case for another Notification Service
It's quite common for applications to share their data across users and devices. These “clients” of the data usually maintain a local copy of it, for many different reasons. This creates the need to keep the data fresh and in sync across those users and devices. For example, if you alter a calendar entry on your mobile phone, you expect the change to be synced across the calendars of all the other attendees almost immediately. This ability forms the core of a notification service.

In the absence of a general notification service, applications settle for custom notification mechanisms. These take the form of either a) frequent polling or b) push notifications. Both come with their own problems. Polling, while conceptually simple and easy to implement, creates a tension between resource consumption and timeliness: poll often and you waste resources, poll rarely and the data goes stale.
With push notifications, ensuring reliable message delivery is very difficult, especially in a heterogeneous landscape like the internet.

If we were to conceive a generic notification service, then it must offer:
a) The ability to map different clients to different pieces of data based on interest.
b) Reliable notifications, without necessitating low-frequency backup polling as a fallback.
c) Assurance that messages make it through the last mile successfully into the client’s address space.
d) Delivery to multiple applications written in different languages over multiple communication channels/protocols.

Reliable Signaling over Reliable Messaging – An important tradeoff
The authors of Thialfi make some very good arguments for choosing to go with Reliable Signaling as opposed to what we typically find in messaging systems, which is the idea of Reliable Messaging.

Why doesn’t Reliable Messaging apply here?
Firstly, because it becomes very cumbersome to manage when your clients are often unavailable for long durations and in some cases may never return. If you go the reliable messaging route in such a scenario, messages remain queued up, consuming more and more resources, and eventually flood the client when it does come back.
Secondly, message delivery tends to be very application specific. Data that is delivered to an application tends to have bespoke security, privacy and content format requirements.
So instead of a one-size-that-fits-all messaging system, Thialfi is consciously designed as a system that provides reliable signaling (it collapses all notifications for an object into a single message). This enables Thialfi to remain more loosely coupled with the applications.
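To make the collapsing idea concrete, here is a minimal sketch in Python. It is not Thialfi's code; the class name SignalQueue, its methods and the object names are all made up for illustration. The only thing it assumes from the text is that each object carries a version number and that only the latest one matters.

```python
class SignalQueue:
    """Hypothetical sketch: holds at most one pending signal per object."""

    def __init__(self):
        self.pending = {}  # object_id -> latest known version

    def publish(self, object_id, version):
        # Keep only the highest version; older updates are collapsed away.
        if version > self.pending.get(object_id, -1):
            self.pending[object_id] = version

    def drain(self):
        # What a returning client would receive: one signal per object.
        signals = sorted(self.pending.items())
        self.pending.clear()
        return signals


queue = SignalQueue()
for v in (1, 2, 3, 7):  # four updates while the client is offline
    queue.publish("calendar/entry-42", v)
queue.publish("contacts/alice", 5)
signals = queue.drain()
print(signals)
```

However long the client stays away, the state held for it is bounded by the number of objects it cares about, not by the number of updates it missed.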

System Architecture
Thialfi models data in terms of object identifiers and their version numbers. The objects themselves are stored within the application. Thialfi maintains no copy of the actual data. Each object is given a name which is unique within the scope of the owner application. These objects are versioned by the owning application. The application allocates a version number to each object and this information is passed along to Thialfi as part of a notification. Version numbers are constrained by the applications to be monotonically increasing as this is necessary for reliable delivery.

Thialfi Client
Thialfi has a client library that provides applications the ability to register for updates to any shared object and receive notification callbacks from the server. It speaks the Thialfi protocol.
As part of a notification that is delivered to the client Thialfi sends across the shared object’s identifier and the latest known version of the object. Using this information the client then accesses the application responsible for changing the object directly and thus synchronizes the state of the object. Thialfi does not take on the responsibility of syncing up the state.
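The client side of this contract can be sketched roughly as follows. This is an illustration under assumptions, not the real client library: AppBackend, Client, on_notification and fetch are hypothetical names, and the backend is an in-memory stand-in for the owning application.

```python
class AppBackend:
    """Stand-in for the owning application, which holds the real data."""

    def __init__(self):
        # object_id -> (version, data)
        self.store = {"doc-1": (3, "third draft")}

    def fetch(self, object_id):
        return self.store[object_id]


class Client:
    def __init__(self, backend):
        self.backend = backend
        self.cache = {}   # object_id -> (version, data)
        self.fetches = 0  # counts round trips to the application

    def on_notification(self, object_id, version):
        # The signal carries only an identifier and a version, never the data.
        cached_version = self.cache.get(object_id, (-1, None))[0]
        if version <= cached_version:
            return  # stale or duplicate signal, nothing to do
        self.fetches += 1
        # Syncing the state is the application's job, so go to it directly.
        self.cache[object_id] = self.backend.fetch(object_id)


client = Client(AppBackend())
client.on_notification("doc-1", 3)
client.on_notification("doc-1", 3)  # a duplicate delivery is harmless
print(client.cache["doc-1"], client.fetches)
```

Because versions only ever increase, a repeated or out-of-date signal can be dropped with a simple comparison.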

The Journey of a Notification on Thialfi’s servers
The Thialfi server is notified of changes as and when an application modifies the state of a shared object. Applications that want to notify Thialfi use a publisher library to do so. Thialfi supports channels like XMPP, HTTP and RPC to cater to different applications.

The notification is intercepted first by a set of servers known as Bridge servers. These are stateless, randomly load-balanced tasks that consume a feed of application specific update messages from Google’s infrastructure pub/sub service, translate them into a standard notification format, and assemble them into batches for delivery to another set of servers called Matchers.

Matchers then consume the notification, match it with the set of registered clients, and forward it to the Registrar for reliable delivery to clients. Matchers are partitioned over the set of objects and maintain a view of state indexed by object ID.

Registrars track clients, process registrations, and reliably deliver the notification using a view of state indexed by client ID (they are partitioned by client).
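The flow through the three server roles can be sketched in a few lines of Python. This is a single-process toy with invented names (bridge, Matcher, Registrar, the "id"/"ver" fields); it leaves out partitioning, batching and persistence, and it registers clients directly with the Matcher purely to keep the example short.

```python
from collections import defaultdict


def bridge(raw_updates):
    """Translate app-specific updates into a standard (object_id, version) batch."""
    return [(u["id"], u["ver"]) for u in raw_updates]


class Registrar:
    """View of state indexed by client ID."""

    def __init__(self):
        self.pending = defaultdict(dict)  # client_id -> {object_id: version}

    def enqueue(self, client_id, object_id, version):
        self.pending[client_id][object_id] = version


class Matcher:
    """View of state indexed by object ID."""

    def __init__(self, registrar):
        self.registrar = registrar
        self.latest = {}                    # object_id -> latest version
        self.interested = defaultdict(set)  # object_id -> client_ids

    def register(self, client_id, object_id):
        self.interested[object_id].add(client_id)

    def consume(self, batch):
        for object_id, version in batch:
            if version <= self.latest.get(object_id, -1):
                continue  # not newer than what is already known
            self.latest[object_id] = version
            for client_id in self.interested.get(object_id, ()):
                self.registrar.enqueue(client_id, object_id, version)


registrar = Registrar()
matcher = Matcher(registrar)
matcher.register("alice", "doc-1")
matcher.register("bob", "doc-1")
matcher.consume(bridge([
    {"id": "doc-1", "ver": 1},
    {"id": "doc-1", "ver": 2},
    {"id": "doc-2", "ver": 9},  # nobody registered for this one
]))
delivered = {c: dict(p) for c, p in registrar.pending.items()}
print(delivered)
```

The two indexes are the point of the split: the Matcher answers "who cares about this object?" while the Registrar answers "what does this client still need to hear about?".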

Achieving Reliable Delivery
The authors define the property of reliable delivery as –
“If a well-behaved client registers for an object X, Thialfi ensures that the client will always eventually learn of the latest version of X.”
Thialfi achieves end-to-end reliability by ensuring that state changes in one component eventually propagate to all other relevant components of the system.
Thialfi operates on two kinds of state: registration state (i.e., which clients care about which objects) and notification state (the latest known version of each object).

Synchronizing Registration State
Registration state moves through three components: the client, the Registrar and the Matcher. Every message from the client to the Registrar comes along with a digest summarizing the client's entire registration state. If this digest does not match the state at the Registrar, Thialfi runs a Registration Sync Protocol: the client resends its registrations to the server, either on its own initiative (when the client detects the discrepancy) or at the server's request (when the server detects it). Thus the client and the Registrar are synchronized.
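A digest comparison of this kind might look like the sketch below. The hash choice (SHA-256 over the sorted object IDs) and the names digest, RegistrarView, handle_message and resync are assumptions made for the example; the summary above does not say how Thialfi actually computes its digest.

```python
import hashlib


def digest(registrations):
    """Order-independent summary of a set of registered object IDs."""
    h = hashlib.sha256()
    for object_id in sorted(registrations):
        h.update(object_id.encode("utf-8") + b"\x00")  # separator avoids ambiguity
    return h.hexdigest()


class RegistrarView:
    """Hypothetical server-side view of one client's registrations."""

    def __init__(self):
        self.registrations = set()

    def handle_message(self, client_digest):
        # True means in sync; False means "please resend your registrations".
        return client_digest == digest(self.registrations)

    def resync(self, client_registrations):
        self.registrations = set(client_registrations)


client_registrations = {"doc-1", "calendar/entry-42"}
server = RegistrarView()  # e.g. the server has lost this client's state

in_sync_before = server.handle_message(digest(client_registrations))
if not in_sync_before:
    server.resync(client_registrations)  # the client resends everything
in_sync_after = server.handle_message(digest(client_registrations))
print(in_sync_before, in_sync_after)
```

The digest keeps the common case cheap: a short hash rides along with every message, and the full registration list is only resent when the two sides disagree.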
When the Registrar commits a registration state change, a pending work marker is also set atomically. This marker is cleared only after all dependent writes to the Matcher have completed successfully. All writes are retried by the Registrar Propagator if any failure occurs. It's safe to do this as all the operations on the Thialfi server are idempotent by design.
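The pending-work-marker pattern can be sketched like this. The classes FlakyMatcher and PropagatingRegistrar are invented for the example, the two in-memory set updates stand in for what would be one atomic commit in a real store, and the failure is simulated.

```python
class FlakyMatcher:
    """Fails the first `failures` writes, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.registrations = set()

    def write(self, client_id, object_id):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("matcher unavailable")
        # Adding to a set is idempotent, so repeating this write is harmless.
        self.registrations.add((client_id, object_id))


class PropagatingRegistrar:
    def __init__(self, matcher):
        self.matcher = matcher
        self.registrations = set()
        self.pending = set()  # pending work markers

    def register(self, client_id, object_id):
        # Assumed to be a single atomic commit in a real store.
        self.registrations.add((client_id, object_id))
        self.pending.add((client_id, object_id))

    def propagate(self):
        for entry in list(self.pending):
            try:
                self.matcher.write(*entry)
            except ConnectionError:
                continue  # marker stays set, so the write is retried later
            self.pending.discard(entry)  # clear only after success


matcher = FlakyMatcher(failures=2)
registrar = PropagatingRegistrar(matcher)
registrar.register("alice", "doc-1")

attempts = 0
while registrar.pending:
    registrar.propagate()
    attempts += 1
print(attempts, matcher.registrations)
```

If the process crashed between the write and clearing the marker, the retry would simply repeat a write that already landed, which is why idempotence matters here.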

Synchronizing Notification State
Notification state that comes from the publisher moves through four components: the Bridge, the Matcher, the Registrar and the client.
Notifications are removed from the update feed by the Bridge only after they have been successfully written to the Matcher’s persistent store. A periodic task in the Bridge reads the Matcher’s temporary store and resends the notifications to the Matcher when required.
When a notification is written to the Matcher database, a pending work marker is used to ensure eventual propagation similar to the Registrar to Matcher propagation of registration state.
To ensure the last leg of the journey, the Registrar retains a notification for a client until either the client acknowledges it or a subsequent notification supersedes it.
Additionally the Registrar periodically retransmits any outstanding notifications while the client is online, ensuring eventual delivery.
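The retain-until-acknowledged rule, together with periodic retransmission, can be sketched as below. DeliveryState and its methods are hypothetical names, and the "network" is just a callback that records what was sent.

```python
class DeliveryState:
    """Hypothetical per-client delivery state held by a Registrar."""

    def __init__(self):
        self.outstanding = {}  # object_id -> version awaiting acknowledgement

    def notify(self, object_id, version):
        # A newer notification supersedes the one currently retained.
        if version > self.outstanding.get(object_id, -1):
            self.outstanding[object_id] = version

    def ack(self, object_id, version):
        # Only an ack for the retained version clears it; an ack for an
        # older, superseded version must not.
        if self.outstanding.get(object_id) == version:
            del self.outstanding[object_id]

    def retransmit(self, send):
        # Called periodically while the client is online.
        for object_id, version in sorted(self.outstanding.items()):
            send(object_id, version)


state = DeliveryState()
state.notify("doc-1", 1)
state.notify("doc-1", 2)  # supersedes version 1
state.ack("doc-1", 1)     # late ack for the old version: ignored

sent = []
state.retransmit(lambda object_id, version: sent.append((object_id, version)))
state.ack("doc-1", 2)     # the client finally acknowledges
print(sent, state.outstanding)
```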

Taken together, these mechanisms work in tandem to provide reliable delivery of notifications.

Previewing from http://research.google.com/pubs/archive/37474.pdf
