Mental model of systems
One fine Sunday evening, the Quartz jobs running inside our Tomcat server started to freeze. At the same moment, Tomcat went kaput. I sshed into the server and started poking around the logs. No errors in the logs. Hmm, ok. I checked the system health, and again the stats looked hale and hearty. Now what do I do?
I started to reason about the problem based on my mental model of the inner workings of the various components in our application. Jobs are freezing and, at the same time, Tomcat stops serving requests. It cannot be a Tomcat issue as such, because I am not getting a Tomcat connection error, which means Tomcat has threads to spare and its thread pool is not exhausted. The implication is that new threads are being spawned but are not able to proceed. Hmm, ok, so the threads are getting blocked. Now how do I figure out where they are getting blocked? Can I hook something into the JVM and get the state of all the threads running inside the container? Oh yes, JMX gives me that ability. I restarted Tomcat with JMX configured, hooked into JMX and found that the threads were getting blocked while trying to fetch database connections from the connection pool. Our servers were not under any unusual traffic spike, so it could not be a scaling problem. That means some query is running rogue and holding on to connections. How do I find the rogue query? I can do two things: check the process list on the MySQL server, or check the stack traces of the currently running threads and identify the point at which they are stuck. I took the thread approach, zeroed in on the code where the query was getting executed, fixed the query, and things were back to normal.
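The "get the state of all the threads" step can be sketched in a few lines of Java. This is only an illustration, not what I ran that evening: it reads the JVM's standard ThreadMXBean from inside the same process, whereas I attached to Tomcat remotely over JMX. The class name StuckThreads and the method describeStuckThreads are made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists threads that are not making progress.
class StuckThreads {

    // Returns one entry per thread that is BLOCKED, WAITING or TIMED_WAITING,
    // with its name, state, the lock it waits on (if any) and its top frames.
    static List<String> describeStuckThreads(int maxFrames) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip entries for threads that are gone
            }
            Thread.State state = info.getThreadState();
            if (state != Thread.State.BLOCKED
                    && state != Thread.State.WAITING
                    && state != Thread.State.TIMED_WAITING) {
                continue; // only interested in threads that are parked somewhere
            }
            StringBuilder sb = new StringBuilder();
            sb.append('"').append(info.getThreadName()).append("\" ").append(state);
            if (info.getLockName() != null) {
                sb.append(" on ").append(info.getLockName());
            }
            StackTraceElement[] stack = info.getStackTrace();
            for (int i = 0; i < Math.min(maxFrames, stack.length); i++) {
                sb.append("\n    at ").append(stack[i]);
            }
            report.add(sb.toString());
        }
        return report;
    }

    public static void main(String[] args) {
        // Print the top 8 frames of every stuck thread in this JVM.
        for (String entry : describeStuckThreads(8)) {
            System.out.println(entry);
            System.out.println();
        }
    }
}
```

In a situation like mine, many request threads would show up waiting with the connection pool's "get a connection" call near the top of their stacks, which is the hint that the pool, and not Tomcat, is the bottleneck. Idle pool threads also show up as waiting, so the stack frames matter more than the count.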
Looking back at the stalled crons: even though I was not an expert in the internals of Tomcat, JMX or database connection pooling, I had a good general understanding of how these systems work, and I could fit the problem into that understanding, reason it through and fix it. Let us say that you are using an evented HTTP server and you see that the throughput is not up to the mark. If you know how an evented server works in general, you can think of at least two culprits in your system that might be causing this: either you are making blocking calls on the event loop, or you are running CPU-intensive tasks on it. This knowledge is transferable irrespective of whether you use Netty, Node.js, Tornado or some other new kid on the block.
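To make the blocking-call case concrete, here is a toy sketch in Java. It is not how Netty, Node.js or Tornado are implemented; it only assumes that an event loop is, at its core, one thread working through a queue of handlers. The names EventLoopDemo, blockingCall and delayBehindSlowHandler are invented for this example, and Thread.sleep stands in for something like a synchronous database call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model: a single-threaded executor plays the role of the event loop.
class EventLoopDemo {

    // Stand-in for a handler that blocks, e.g. a synchronous DB call.
    static void blockingCall(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Submits a slow handler followed by a trivial event, and returns how many
    // milliseconds the trivial event waited before the loop thread ran it.
    // With offload=true the slow work is handed to a worker pool instead.
    static long delayBehindSlowHandler(long blockMillis, boolean offload) {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        ExecutorService workers = Executors.newFixedThreadPool(4);
        try {
            if (offload) {
                // The loop thread only schedules the work and moves on.
                loop.execute(() -> workers.execute(() -> blockingCall(blockMillis)));
            } else {
                // The loop thread itself is stuck for blockMillis.
                loop.execute(() -> blockingCall(blockMillis));
            }
            long submitted = System.nanoTime();
            CompletableFuture<Long> ranAt = new CompletableFuture<>();
            loop.execute(() -> ranAt.complete(System.nanoTime()));
            return TimeUnit.NANOSECONDS.toMillis(ranAt.join() - submitted);
        } finally {
            loop.shutdown();
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking on the loop: "
                + delayBehindSlowHandler(200, false) + " ms wait");
        System.out.println("offloaded to workers: "
                + delayBehindSlowHandler(200, true) + " ms wait");
    }
}
```

When the slow handler runs on the loop thread, every event queued behind it waits for roughly the full blocking time; when the work is offloaded, the loop stays responsive. A CPU-intensive handler hurts in the same way, since it too keeps the single loop thread busy.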
I am not trying to discourage you from developing a deep understanding of particular frameworks or technologies. On the contrary, in addition to that, try to abstract your understanding away from the specifics and into general principles, so that tomorrow, when a shiny new technology shows up, you can reason about it based on your past experience.