Ode To Queues

If you have a producer with an uneven rate of production and a consumer which cannot keep pace with the producer at its peak, use a queue.

If you have a workload which need not be addressed synchronously, use a queue.

If your customer-facing application is riddled with workloads that can be deferred, move these to a queue, thus keeping the customer-facing application lean and mean.

Think of a queue as a shock absorber.

There are workloads which need to be processed immediately with sub-millisecond latency, and then there are ones where you have the luxury of taking time. It is advisable not to mix these in an application. The second kind of workload can be addressed by moving it to a queue and having a consumer process the messages.

For example, consider a scenario where you are consuming messages and persisting them in a data store. These messages come in at a variable rate, and at its peak, the data store cannot handle the load. You have two options: scale the data store to meet the peak load, or slap a queue in between to absorb the shock. A queue solves this problem in a KISS manner.
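
As a minimal in-process sketch of the shock-absorber idea, here is a Java version using a bounded BlockingQueue. The names and the persist stand-in are hypothetical; a real deployment would use a durable, out-of-process queue (SQS, RabbitMQ, etc.) so that messages survive a consumer crash.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ShockAbsorber {
    // Bounded buffer between the bursty producer and the slow consumer
    private static final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    public static void main(String[] args) {
        // Producer: messages arrive at a variable, sometimes spiky rate
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; ; i++) {
                    queue.put("message-" + i); // blocks only when the buffer is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: drains at the steady rate the data store can handle
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    persist(queue.take()); // the slow data store write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }

    // Hypothetical stand-in for the data store write
    private static void persist(String message) throws InterruptedException {
        Thread.sleep(5);
    }
}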

Queues enable applications to be highly available while giving enough room to manoeuvre. As long as the queue is highly available, the chance of message loss is almost nil. Since the queue is durable, you need not perfect your consumer's high availability; you get the leeway to bring a consumer back up without losing messages.

With applications embracing the microservices paradigm, there is a lot of API back and forth. Not all API consumption has to be in real time. Whatever can be deferred should use a queue as the transport mechanism.

A queue introduces a bit more complexity into an application, but the advantages it brings to the table make it a worthwhile investment.

Anti-features

When evaluating a new technology, framework, or library, a lot of importance is given to the salient features. While it is very important to know the positives, the negatives usually tend to be glossed over. Being aware of the shortcomings of a framework gives one the ability to anticipate problems down the road.

For example, take NoSQL databases. A lot of time is spent singing paeans to the scalability, malleability, etc., of NoSQL databases while hardly thinking about the negatives that come with them.

Two simple techniques give good visibility into anti-features:
1. The very obvious one: Google for the shortcomings. Someone will have written a blog post on the interwebs highlighting how a framework or technology let them down. For example, take this post by Uber on how Postgres did not work as expected for them.
2. Comb through GitHub and/or JIRA, peeking at the bugs raised and the enhancements requested.

Both of the above will paint a good picture of the shortcomings. If you are evaluating a closed-source, proprietary technology, the above may not be feasible.

Once a mental note is made of the negatives, ponder the scenarios where these might affect your usage. It helps to spend quality time on this, as it will save you a lot of future trouble.

If you think about it, this might sound very obvious, but it tends to be highly neglected. We get so caught up in the positives of something that the negatives tend to be ignored, and this usually comes back to bite us later.

Testing legacy applications

When contemplating introducing automated testing in legacy applications, it is easy to get bogged down in terminology: unit testing, integration testing, regression testing, black box testing, white box testing, stress testing, etc. Quite a bit of time is spent in debates on unit testing versus integration testing; I have written about this before too.

A practical way to approach testing legacy applications is to first scope out the intention behind the test. Is it to test the behavior of a particular method, an API response, or how an application behaves after an HTTP form submit? The next step is to jot down everything that has to be done to enable this. For example, if a database is involved, it can be mocked, or a test database with bootstrapped data can be used.

The gamut of changes needed to inject testability into an application that has never seen testing before should never be underestimated. The way you structure testable code is vastly different from code written without testing in mind.

Take a look at the code below. How would you unit test the getUser method without creating a database connection?

public class Foo {
    DbConnection connection = null;

    public Foo() {
        // Object creation is coupled to establishing a db connection
        connection = <establish db connection>;
    }

    public User getUser(int id) {
        // Query db and get user data
        User user = new User();
        // Fill user with data from db
        return user;
    }
}

To mould this into testable code, DbConnection creation needs to be decoupled from object creation, like below:

public class Foo {
    DbConnection dbConnection = null;

    // The connection is injected, so tests can pass in a mock
    public Foo(DbConnection dbConnection) {
        this.dbConnection = dbConnection;
    }

    public User getUser(int id) {
        // Query db and get user data
        User user = new User();
        // Fill user with data from db
        return user;
    }
}

Since the DbConnection is independent of object creation, it can be mocked to unit test any method in the class, as in the sketch below. An application written without testing in mind will be replete with code like the first version above. Code patterns like these are one of the biggest hurdles in testing legacy applications.
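
As a rough sketch of what the injected version buys you, assuming a JUnit/Mockito-style setup (FooTest, the test name, and the assertion are illustrative, not from the original):

import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.mockito.Mockito.mock;

import org.junit.jupiter.api.Test;

public class FooTest {
    @Test
    public void getUserWorksWithoutARealDatabase() {
        // Hand in a mock instead of a live connection
        DbConnection mockConnection = mock(DbConnection.class);
        Foo foo = new Foo(mockConnection);

        User user = foo.getUser(42);
        assertNotNull(user);
    }
}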

The next step is to eliminate the resistance to testing. This means setting up all the infrastructure and libraries needed to carry out testing and making a reference readily available to follow. Bunch test cases into categories: unit tests, tests that need a mocked object, tests that need a mocked database, tests that need a database seeded with data, tests that need a web server, etc. After this, implement one test case for each of these categories. This serves a dual purpose: the setup is ready for each category, and a reference is readily available for others to emulate.

One aspect that is usually neglected is the effect of testing on the product release cycle. As a result of testing, more code, dependencies, and infrastructure are introduced, all of which need to be maintained. Along with working on new features, writing tests for them also has to be taken into account. While refactoring, it is not just the code that has to be refactored; the test cases have to be refactored too. This is a tradeoff between time to market on one side and maintainability and reliability on the other.

Testing is no longer the chore it used to be; testing tools and frameworks have grown by leaps and bounds. With the advent of Docker, headless browsers, Selenium, etc., testing is very much within reach of most teams, provided the intention is there and the effort is put in.

SQS versus Kinesis

A lot of people are confused between SQS and Kinesis. In some ways, both act like a queue, but there is a huge difference between the two.

SQS is a queue; it promises at-least-once delivery, and ordering is best effort on a standard queue (strict FIFO ordering comes only with the separate FIFO queue type).

Kinesis is a distributed message stream handler. A simplistic and hand-wavy way to think of Kinesis is as one huge log file, with the items you write to the stream as lines in this log file. When you want to process the stream, you get a pointer into the log file. When you read a line, the pointer moves to the next line.

Kinesis is stateless, in the sense that it does not maintain the pointer for you; it is up to your reading process to maintain it. What this means is that if you are reading off a Kinesis stream and your process goes down, when you bring the reader process up again, it will start processing from the start, not from the last line before the crash. There is no concept of popping items out of Kinesis; the data stays in the stream until it expires (by default after 24 hours, extendable up to 7 days), and you only manipulate the pointer into this data. Hence, if you want to reprocess the stream, you can replay it, i.e., start from the beginning and do the whole thing over again. AWS provides a client library for Kinesis which maintains this state for you; the client library uses DynamoDB to persist the pointer.
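
A toy Java model of the log-file analogy might look like the following. This is not the real Kinesis API, just an illustration of the reader, not the stream, owning and persisting the pointer; Checkpointer here plays the role the client library delegates to DynamoDB.

import java.util.List;

// Conceptual model of the log-file analogy, not the actual Kinesis API
public class StreamReader {
    private final List<String> log;          // the "huge log file"
    private final Checkpointer checkpointer; // what KCL does with DynamoDB
    private long pointer;                    // the reader, not the stream, owns this

    public StreamReader(List<String> log, Checkpointer checkpointer) {
        this.log = log;
        // Resume from the last persisted pointer; with no checkpoint we start
        // at the beginning, which is exactly what happens after a crash if the
        // reader never persisted its position
        this.pointer = checkpointer.load();
    }

    public void processNext() {
        String record = log.get((int) pointer);
        handle(record);
        pointer++;                  // advance the pointer...
        checkpointer.save(pointer); // ...and persist it ourselves
    }

    private void handle(String record) { /* business logic */ }

    interface Checkpointer {
        long load();
        void save(long pointer);
    }
}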

This should give you a fair idea of when to use Kinesis and when to opt in for SQS.

Nothing is sacrosanct

There is an interesting bug opened against Kafka. For those of you too lazy to click on the link and read through the description, I am reproducing it here in full.

It appears that validation of configuration properties is performed in the ConsumerConfig and ProducerConfig constructors. This is generally bad practice as it couples object construction and validation. It also makes it difficult to mock these objects in unit tests.

Ideally validation of the configuration properties should be separated from object construction and initiated by those that rely/use these config objects.

http://misko.hevery.com/code-reviewers-guide/flaw-constructor-does-real-work/

It links to an article by Misko Hevery on writing testable code. If you have not read posts by Misko Hevery, I urge you to. For an open source project like Kafka, it might make sense (Jay Kreps, the person behind Kafka, does not agree with the bug, as visible in the comments) to follow all the guidelines of writing testable code, but for your project it might not. If you are a small company with a two- or three-person team, do not blindly follow practices just because someone on the internet says so.

Follow rules and guidelines only if they help you make your code more secure, performant, easier to maintain, etc.; do not ape guidelines without understanding why they were laid out in the first place. To go back in history, checked exceptions were all the rage in Java land a couple of years ago, but these days, after frameworks like Spring sprang up, people look down upon checked exceptions. Same with TDD. TDD was touted as the best thing since sliced bread, but now programmers are raising doubts about it.

A lot of the time, mocking objects, interfaces, etc. takes more work than writing the actual functionality. In many projects, there might not be an ROI in writing and maintaining this elaborate test framework/infrastructure. It is true that injecting dependencies into an object makes it easier to test, but it also comes with the downside of having to take care of injecting the dependencies every time you create that object. If you inject dependencies by hand, object creation becomes an elaborate exercise each time; otherwise, you have to delegate this to a framework like Spring or Guice, and now your project is bloated. Maybe you should sidestep dependency injection and create the object with the dependencies inside it, as in the second style sketched below.
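
To make the tradeoff concrete, here is a hedged sketch of the two styles; ReportService and ConnectionFactory are hypothetical names, and DbConnection is the type from the testing section above.

// Style 1: dependency injected; easy to hand in a mock, but every caller
// must supply the connection, by hand or via a framework
public class ReportService {
    private final DbConnection connection;

    public ReportService(DbConnection connection) {
        this.connection = connection;
    }
}

// Style 2: dependency created inside; harder to unit test, but callers
// just say `new ReportServiceSimple()` with no wiring at all
public class ReportServiceSimple {
    private final DbConnection connection = ConnectionFactory.create();
}

// Hypothetical factory; stands in for however the connection gets made
class ConnectionFactory {
    static DbConnection create() { return null; /* wire up the real thing */ }
}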

The situations and background under which these rules/guidelines crop up might be radically different from those in your organisation/team. Taking the call on what to follow and what not to is more of an art than a science. Your instincts, past experiences, etc. help you in making these calls.

Designing for failure

In the world of software, failure is a certainty. Servers go kaput, databases go down, processes run out of memory; things break all the time. You can categorize software as good or bad based on how it behaves in these adverse scenarios. I am not trying to imply that software has to be resilient to all of these; on the contrary, I believe that it is perfectly fine to crap out when shit hits the fan. The question is how you handle this crapping out.

Whenever architecting components, devote an ample amount of time to failure scenarios. Let us take the case of a piece of code which interacts with an external, third-party API. What are the questions you should be asking when designing this component? What happens if the API suddenly stops responding one day? Can I hold my main thread hostage to the API response time? What happens if the API takes eons to respond? In case there is an exception, am I logging enough data to debug? If there are performance issues, do I have enough diagnostic data? Diagnostic data might be in terms of graphing the API response time, the number of times the code path was executed, etc. Do I need to send out an alert when something goes wrong? All these questions revolve around failure handling, and they should be second nature to you as a software engineer.
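
As an illustrative sketch of these questions baked into code, using the JDK's built-in HttpClient; the timeouts, the logger, and the ApiCaller name are all assumptions, not a prescription.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.logging.Logger;

public class ApiCaller {
    private static final Logger log = Logger.getLogger(ApiCaller.class.getName());

    // Never let the third-party API hold our thread hostage: bound both
    // the connect time and the total request time
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    public String fetch(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .build();
        long start = System.nanoTime();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Diagnostic data: response time and status for every call
            log.info("api=" + url + " status=" + response.statusCode()
                    + " elapsedMs=" + elapsedMs);
            return response.body();
        } catch (Exception e) {
            // Log enough context to debug, then fail loudly (or alert)
            log.severe("api=" + url + " failed: " + e);
            throw new RuntimeException("External API call failed", e);
        }
    }
}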

I have seen a tendency among developers to devote an inordinate amount of time to making their code adhere to the latest programming fad or to using the best possible library, but not to failure scenarios. Logging data might not be as sexy as debating which design pattern to use, but once things break, logs are your only friend. Next time you are furiously pounding on the keyboard, take a step back and ask these questions. The developer who maintains the code you write today will thank you for it.

Mental model of systems

One fine Sunday evening, our Quartz jobs running inside a Tomcat server started to freeze. At the same moment, Tomcat went kaput. I sshed into the server and started poking around the logs. No errors in the logs. Hmm, OK. Checked the system health; again, the stats looked hale and hearty. Now what do I do?

I started to reason about the problem based on my mental model of the inner workings of the various components in our application. Jobs are freezing, and at the same time Tomcat stops serving requests. It cannot be a Tomcat issue, as I am not getting a Tomcat connection error, so Tomcat has enough threads to spare and its thread pool is not exhausted. The implication is that new threads are being spawned but are not able to proceed. Hmm, OK, so the threads are getting blocked. Now, how do I figure out where the threads are getting blocked? Can I hook something into the JVM and get the state of all the threads running inside the container? Oh yes, JMX gives me that ability.

Restarted Tomcat with JMX configured, hooked into JMX, and found that the threads were getting blocked while trying to fetch database connections from the connection pooler. Our servers were not under any undue traffic spike, hence it cannot be a scaling problem. That means some query is running rogue. How do I find the query that is running rogue? I can do two things: check the process list in the MySQL server, or check the stack traces of the threads currently running and identify the point at which they are stuck. Took the thread approach, zeroed in on the code where the query was getting executed, fixed the query, and things were back to normal.
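
For the curious, a minimal sketch of the kind of visibility JMX gives you, using the platform ThreadMXBean; it is run in-process here for brevity, but the same MBean is what you reach over a remote JMX connection.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Dump every live thread along with the monitors/locks it holds or waits on
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            if (info.getThreadState() == Thread.State.BLOCKED
                    || info.getThreadState() == Thread.State.WAITING) {
                System.out.println(info); // the stack trace shows where it is stuck
            }
        }
    }
}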

If you are a full stack developer, as the current state of things stands, it is virtually impossible to develop a deep understanding of each and every component in the ecosystem. Let us list the things you need to be on top of if you are running a web application in Java. You have to know Java, the J2EE servlet spec, the application server you are using, your MVC framework of choice, the components the MVC framework uses (maybe Spring for dependency injection or Hibernate for persistence, etc.), SQL and the peculiarities of your database, HTML, CSS, JavaScript, JavaScript frameworks, CSS frameworks, the OS on which you are hosting your app, the administration of that OS, physical hosting; the list goes on and on. To make matters worse, there are so many options available for each of these components, and these components have additional dependencies of their own. Getting intimate with the intricacies and idiosyncrasies of each and every component of the application is a herculean task better suited for immortals. This is where having a generic mental model helps.

Coming back to the problem of the crons stalling: even though I was not an expert in the internals of Tomcat, nor JMX, nor DB connection pooling, I had a very good understanding of how these systems work in general, and I could fit the problem into that model, reason through it, and fix it. Let us say you are using an evented HTTP server and you see that the throughput is not up to the mark. If you know how an evented server works in general, you can think of at least two loopholes in your system that might be causing this: either you are doing blocking calls on the event loop or running some CPU-intensive tasks on it. This knowledge is transferable irrespective of whether you use Netty or Node.js or Tornado or some other new kid on the block, as the toy model below illustrates.
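
Here is a toy Java model of that reasoning, with a single-threaded executor standing in for the event loop; the framework does not matter, since any blocking call on the loop starves everything queued behind it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EventLoopStall {
    public static void main(String[] args) {
        // A single-threaded executor stands in for an evented server's loop
        ExecutorService loop = Executors.newSingleThreadExecutor();

        loop.submit(() -> {
            try {
                Thread.sleep(5_000); // a blocking call on the loop thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Every other "request" is starved until the blocking call returns
        loop.submit(() -> System.out.println("finally handled"));
        loop.shutdown();
    }
}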

I am not trying to discourage you from developing a deep understanding of particular frameworks or technologies; on the contrary, in addition to that, try to abstract your understanding out of the specifics and into generalities, so that tomorrow, when a shiny new technology shows up, you can reason about it based on your past experience.