Sunday, February 24, 2013

Eventual consistency: dual software models, software with gaps

Maintaining consistency in parallel, distributed, high-performance, and stream-oriented programming is all about accepting one thing: when it comes to consistency, that is, dealing with failures, don't promise too much too soon! And that is the whole notion of eventual consistency: you sometimes need to wait for things to get resolved. How long you wait typically depends on how much money you have invested in infrastructure, yet it also depends on how finely your system captures failures.

Let's start with eventual consistency. In traditional transactional models, consistency, the knowledge that your system is "doing well", is assured by approaching each operation "transactionally": if an operation fails, make sure nothing is left messed up and that consistency is maintained. Yet in the "new" world of massive parallelism, the need for high performance, and the likes of streaming models, it is often simply impossible to maintain consistency this way. The simple explanation is that as operations are split across multiple cores and computers, it takes "more" time to clean things up when something goes wrong. The key theory is that the "cleanup time" after failed operations is "not plannable" (look up the "CAP theorem"). Said differently, a failed operation can get "eventually" cleaned up, if your system puts in the effort!
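To make the idea concrete, here is a minimal sketch (names and the last-writer-wins merge rule are my own illustrative choices, not anything from the post): two replicas can temporarily disagree after a write, and a reconciliation pass is what "eventually" restores consistency.

```python
import time

class Replica:
    """A toy replica: stores a (value, timestamp) pair per key."""
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = (value, time.time())

    def read(self, key):
        entry = self.store.get(key)
        return entry[0] if entry else None

def reconcile(a, b):
    """Last-writer-wins merge: after this runs, both replicas agree.
    Until it runs, reads may disagree -- that is the 'eventual' part."""
    for key in set(a.store) | set(b.store):
        ea = a.store.get(key, (None, 0.0))
        eb = b.store.get(key, (None, 0.0))
        winner = ea if ea[1] >= eb[1] else eb
        a.store[key] = winner
        b.store[key] = winner

r1, r2 = Replica(), Replica()
r1.write("x", 1)          # the write lands on r1 only
print(r2.read("x"))       # prints None: the replicas temporarily disagree
reconcile(r1, r2)         # the "cleanup" eventually happens
print(r2.read("x"))       # prints 1: consistent again
```

How often `reconcile` runs is exactly the "how long you wait" trade-off above: more infrastructure lets you run it more often, shrinking the inconsistency window.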

Eventual consistency is often brought up in the context of distributed databases. Yet in my world of real-time streaming systems, system design is even more constrained than with DBs. The simple statement is that streaming systems are one-dimensional. And that often means that you only have one way to look at things. Or again, you can view things through different models, but often you cannot bring these views back together at a reasonable cost.

One pattern to manage failure is simply to kill the process in which the failure was detected. The idea is then to let that process, or another, recover the operation. I remember once arguing with another developer that "life" is not always that simple. It is OK to use this pattern if the scope of "operations" is small enough. In fact, the pattern is also economically attractive because it simplifies the whole automated QA approach. Yet if the operations you are working with have too long a lifespan, the cost of killing processes and restarting the failed operations may simply be too high. In these cases you can add lots of redundancy, but that too will cost you, and your solution might simply be non-competitive with one that uses a smaller footprint. The last resort is to refine the scope of what you kill: only kill a process if it is corrupted, kill only the broken part of an operation, keep as much of the operation as is "still good", and incrementally recover the rest.
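The "refine the scope of what you kill" idea can be sketched as follows (the `process` function and the failure condition are hypothetical stand-ins): instead of restarting a whole batch when one operation fails, keep the results that are still good and quarantine only the broken operations for incremental recovery.

```python
def process(op):
    """Hypothetical per-operation worker; raises on a 'corrupted' input."""
    if op < 0:
        raise ValueError(f"operation {op} is corrupted")
    return op * 2

def run_batch(ops):
    """Fine-grained failure scope: rather than killing everything on the
    first error (the coarse kill-and-restart pattern), keep the work
    that is still good and report only what must be recovered."""
    done, failed = {}, []
    for op in ops:
        try:
            done[op] = process(op)
        except ValueError:
            failed.append(op)   # quarantine just the broken operation
    return done, failed

done, failed = run_batch([1, -2, 3])
# done == {1: 2, 3: 6}; failed == [-2], to be recovered incrementally
```

The coarse version of the pattern would be `run_batch` raising on the first error and the caller re-running the whole batch; for long-lived operations, that rerun is the cost the paragraph above warns about.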

Some of you may know that duality is still one of my favorite subjects. In the context of linear programming, a dual model is one where variables "measure" the "slack" associated with each constraint. In a software system, the constraints can be seen as the transactions, or the transactional streams, being processed. The dual model of a software system is then a model in which you track "how wrong" transactional operations are. That is what I call the "gap model". In a stream-oriented model, the gap model is itself stream-oriented. The finer the gap model, the more precisely failures can be described, and the more you can limit the corrective measures needed to fix things. Yet as with all good things, if your model is too "fine", it simply becomes too expensive to work with. So compromises are needed.
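As a toy illustration of the gap model (the `Op` record with its expected/observed fields is my own assumption, since the post does not fix a representation): alongside the stream of operations, derive a second stream that carries only each operation's "slack", so corrective work can target exactly the operations whose gap is non-zero.

```python
from dataclasses import dataclass

@dataclass
class Op:
    op_id: int
    expected: float   # what the transaction should produce
    observed: float   # what actually happened downstream

def gap_stream(ops, tolerance=0.0):
    """Dual view of the operation stream: for each op, emit its 'slack'
    (expected - observed). Ops within tolerance need no correction;
    the rest name exactly what must be fixed."""
    for op in ops:
        gap = op.expected - op.observed
        if abs(gap) > tolerance:
            yield op.op_id, gap

ops = [Op(1, 10.0, 10.0), Op(2, 5.0, 3.5), Op(3, 7.0, 7.0)]
print(list(gap_stream(ops)))   # prints [(2, 1.5)]: only op 2 carries a gap
```

The `tolerance` parameter is where the fineness trade-off shows up: a tighter tolerance describes failures more precisely but makes the gap stream, and the corrective work behind it, more expensive.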

(PS: my blog output will be considerably lower this year, as I decided that it was more fun to write books (and big theories) than blogs. I will still try to keep you entertained from time to time. And as ever, requests are taken seriously! So do write comments.)