Building an efficient, lightweight message bus

In software development there is often a decision to be made on whether to ‘build or buy’: should we create X from scratch, or use or adapt a product built by someone else? I came across this decision recently when it became apparent that the platform I was architecting needed some sort of messaging system. After assessing some of the available products it became obvious that, while there were many really good messaging systems available, they tended to be either very high-end (and hence expensive) or, more often, far too complex for the platform’s needs. This suggested we should build our own lightweight version.

A simple, easy-to-implement solution would be to enable systems or users to ‘subscribe’ to messages or message topics and give them a method, possibly via an API call, to look into the messaging repository every few seconds. This way they can see whether any new messages have arrived, but the approach is highly inefficient and architecturally unsatisfactory.

Imagine you had 1000 clients polling the system every two seconds. This would result in 30,000 database requests per minute, and maybe only one or two of these clients would have received a message!
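
The arithmetic behind that figure is easy to check. A minimal sketch, in which the function name is purely illustrative:

```python
def polls_per_minute(clients: int, interval_seconds: float) -> float:
    """Requests per minute when every client polls on a fixed interval."""
    return clients * (60 / interval_seconds)


# 1000 clients, each polling once every two seconds
print(polls_per_minute(1000, 2))  # 30000.0
```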

One of my engineers suggested we could use the natural ‘timeout’ function of IIS to terminate a ‘long-polling’ request. This is quite a complex idea, so let me break it down. Assume for a moment that our messages have already been published to the messaging system and that our subscribers have registered and subscribed to the topics of their choice. Let us also say that the subscriber is a desktop application: the user has just logged in, and the application is about to make a ‘peek’ request to the messaging system to enquire whether there are any messages for them. There are two possible scenarios: either the user has some messages waiting, or there are no messages available.

1. There are messages

In this scenario the polling application makes a request via the API call and the system says “you have 10 messages”. The subscriber then receives each message in turn (or maybe as a batch, though this may introduce concurrency issues) until there are none left. Simple.

2. There are no messages

This scenario is reached either when the waiting messages have all been received or when there are no messages available for that poller, and it is precisely when there are no messages that the ‘timeout’ functionality becomes useful. The API call is sent by the subscriber, but instead of immediately sending a response stating ‘no messages’, the system that has received the request waits, and it continues to wait until one of two things happens:

  • a message is received – in this situation the API response, carrying information about the new message, is sent back to the requestor as soon as the system detects that the message has arrived. The poller can then process the message and send a new request, which waits until the next message is received.
  • no message is received – if no message is received in the given ‘timeout’ period then the response is sent back saying no messages were received and the poller can then send a new request.
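
The two outcomes above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory queue stands in for the messaging repository, and the function name `peek` and the shape of the response are hypothetical rather than taken from the platform described here.

```python
import queue


def peek(inbox: "queue.Queue[str]", timeout_seconds: float) -> dict:
    """Answer one peek request for a single subscriber.

    Returns as soon as a message is available, or reports 'no messages'
    once timeout_seconds has elapsed with nothing arriving.
    """
    try:
        # Blocks while the inbox is empty, for at most timeout_seconds.
        message = inbox.get(timeout=timeout_seconds)
    except queue.Empty:
        return {"status": "no_messages"}
    return {"status": "message", "body": message}


inbox = queue.Queue()
inbox.put("order 42 shipped")
print(peek(inbox, timeout_seconds=0.1))  # a message is waiting, so it returns at once
print(peek(inbox, timeout_seconds=0.1))  # nothing arrives, so the request times out
```

In both cases the subscriber simply issues the next peek request as soon as it has handled the response.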

Query Efficiency

As you have probably guessed, the key to the system’s efficiency is in the timeout period and a little trick in reading data from the message repository.

A timeout period could be set in numerous ways:

  • the value could be sent with the peek request
  • it could be configured within the application
  • or it could be taken from an IIS setting

There are pros and cons to each. The point is that the value can be changed depending on multiple factors: the number of subscribers, the number of messages in the system, the size of the messages, the load on the web and database servers, and so on. The longer the period, the more concurrent sessions will be open and the fewer ‘peek’ requests subscribers will make, but it should not be so long that it risks causing the subscriber to time out. IIS has a natural timeout period of ninety seconds, which seems to suit most situations, but it can be reduced if, for example, a high volume of traffic is expected, or increased if the number and type of subscribers is fairly stable.
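
One way to combine these options is to let a value sent with the request take precedence over an application default, and to cap the result below an upper limit. The sketch below assumes exactly that; the constant names, the sixty-second default and the five-second safety margin are hypothetical, and the ninety-second ceiling is simply the figure quoted above.

```python
from typing import Optional

CEILING_SECONDS = 90.0        # assumed upper limit, using the figure quoted above
APP_DEFAULT_SECONDS = 60.0    # hypothetical application-level setting
SAFETY_MARGIN_SECONDS = 5.0   # hypothetical margin: answer before the ceiling is hit


def resolve_timeout(requested: Optional[float] = None) -> float:
    """Pick the wait period for one peek request.

    A positive value sent with the request wins; otherwise the application
    default is used. The result is always capped below the ceiling.
    """
    chosen = requested if requested is not None and requested > 0 else APP_DEFAULT_SECONDS
    return min(chosen, CEILING_SECONDS - SAFETY_MARGIN_SECONDS)


print(resolve_timeout())      # 60.0 (application default)
print(resolve_timeout(30))    # 30 (value sent with the request)
print(resolve_timeout(300))   # 85.0 (capped below the ceiling)
```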

The second aspect of query efficiency involves querying the message repository. All of the above is well and good, but whatever the mechanism for delivering the messages, there still needs to be some sort of query into the messaging repository. Once a peek request is received, the code needs to query the database. If there is a message for the subscriber it can be served, but if not we need a mechanism that polls the database until either a message arrives or the peek request times out. Again, the wait period between database reads can be configured. A reasonable value might be two seconds, which is a compromise between the speed at which the subscriber receives its messages (a maximum delay of two seconds) and the number of reads of the database.
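
The server-side wait described here might look like the sketch below. It is not the platform's actual code: `fetch` is a hypothetical callable standing in for the database query, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time
from typing import Callable, List


def wait_for_messages(
    fetch: Callable[[], List[str]],
    timeout_seconds: float,
    interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll the repository until messages appear or the peek request times out."""
    deadline = clock() + timeout_seconds
    while True:
        messages = fetch()              # one read of the repository
        if messages:
            return messages
        remaining = deadline - clock()
        if remaining <= 0:
            return []                   # timed out: the caller answers 'no messages'
        sleep(min(interval_seconds, remaining))


# A fake repository that is empty for two reads and then has a message.
reads = iter([[], [], ["hello"]])
print(wait_for_messages(lambda: next(reads), timeout_seconds=1, interval_seconds=0.01))
# ['hello']
```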

A final, and probably the most important, contributor to the overall efficiency of the messaging system is to combine the requests from all subscribers with active peek requests into a single database query. Obviously it would be overkill to query for messages for every registered subscriber, so tracking ‘active’ subscribers becomes necessary. Once this is included in the mix, we have a system that makes a maximum of thirty database queries per minute (one every two seconds) regardless of how many subscribers are active, which is a huge saving on the thirty thousand we started with.
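
As an illustration of the combined query, the sketch below uses SQLite from Python's standard library. The `messages` table, its columns and the function name are assumptions made for the example; the platform's real schema and SQL dialect are not described here.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable, List


def fetch_for_active(conn: sqlite3.Connection, active: Iterable[int]) -> Dict[int, List[str]]:
    """Fetch pending messages for every active subscriber with one query."""
    ids = list(active)
    if not ids:
        return {}                       # nobody is waiting, so skip the query
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT subscriber_id, body FROM messages "
        f"WHERE subscriber_id IN ({placeholders}) ORDER BY id",
        ids,
    )
    pending: Dict[int, List[str]] = defaultdict(list)
    for subscriber_id, body in rows:
        pending[subscriber_id].append(body)
    return dict(pending)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, subscriber_id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO messages (subscriber_id, body) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (1, "c"), (3, "d")],
)
# Subscribers 1 and 3 have open peek requests; subscriber 2 does not.
print(fetch_for_active(conn, [1, 3]))  # {1: ['a', 'c'], 3: ['d']}
```

With a very large number of active subscribers, joining against a table of active subscriber ids would probably scale better than a long IN list, but the principle of one query per polling interval is the same.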

And so a highly efficient messaging system, capable of supporting thousands or tens of thousands of subscribers, can be built in just a few days using common technologies such as an API, SQL and IIS.