Tips and tricks to write resilient batch jobs.
You have a database table. At the end of the day, you want to copy all the records for the day from this table to another table. Let us assume that you use the second table for generating reports.
You want to do this quick and dirty. So, you write a batch job to do this. You script your job. You schedule the job (probably using cron) to run a couple of minutes after midnight, say at 12:05 AM every day. You do this so that the job can copy all the records of the previous day.
The job needs a start and an end timestamp to copy the records from the source table to the destination table. The job can pick these timestamps in one of two ways.
First approach: the batch job has to copy the records of the previous day. Hence, it uses the start of the previous day and the end of the previous day as its timestamps.
Second approach: the batch job checks the timestamp of the last record in the destination table. It uses this timestamp as the start and the current time as the end.
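As a rough sketch, the two ways of choosing the window might look like this in Python. The table name `report_records`, the `created_at` column, ISO-8601 text timestamps, and the use of SQLite are all assumptions made for illustration; none of them come from the scenario itself.

```python
import sqlite3
from datetime import datetime, timedelta


def previous_day_window(now):
    """First approach: the fixed window covering the previous calendar day."""
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return end - timedelta(days=1), end


def watermark_window(conn, now):
    """Second approach: start from the newest record already copied."""
    # report_records and created_at are hypothetical names.
    row = conn.execute("SELECT MAX(created_at) FROM report_records").fetchone()
    # Empty destination: fall back to the earliest representable time.
    start = datetime.fromisoformat(row[0]) if row[0] else datetime.min
    return start, now
```

If the last copied record is from January 1 and the job then misses two runs, `previous_day_window` on January 4 still covers only January 3, while `watermark_window` starts at the last copied record and so covers the whole gap.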
Why is the second approach better than the first?
The second approach is auto-healing in the face of failures.
In the second approach, if the script fails to run one day, the next run automatically copies all the missing records. In the first approach, you have to intervene and copy the missing records manually.
All or nothing
What happens if the job dies mid-way while executing?
Your destination table will have half-baked data. You can prevent this by using database transactions in your script to either copy all the records or none. This keeps your destination table consistent.
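A minimal sketch of an all-or-nothing copy, assuming SQLite and hypothetical tables `source_records` and `report_records` with `created_at` stored as ISO-8601 text. It relies on the fact that a `sqlite3` connection used as a context manager commits when the block succeeds and rolls back when it raises.

```python
import sqlite3


def copy_window(conn, start, end):
    """Copy records with start < created_at <= end, all or nothing."""
    # Table and column names are hypothetical.
    rows = conn.execute(
        "SELECT id, created_at, payload FROM source_records "
        "WHERE created_at > ? AND created_at <= ?",
        (start, end),
    ).fetchall()
    # Commit if every insert succeeds; roll back all of them if one fails.
    with conn:
        for row in rows:
            conn.execute(
                "INSERT INTO report_records (source_id, created_at, payload) "
                "VALUES (?, ?, ?)",
                row,
            )
    return len(rows)
```

If an insert fails part-way through, the exception still propagates to the caller, but the rows inserted before it are rolled back, so the destination table is left as it was before the run.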
Make it idempotent
Make the copy operation idempotent by deduplicating the records in the destination table: after copying, delete any duplicates. If the script is idempotent, you can blindly retry after a failure without worrying about duplicate records in the destination table.
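One way this could look, again assuming SQLite and hypothetical `source_records` and `report_records` tables, where `report_records.source_id` holds the id of the source row. The copy and the cleanup run in the same transaction, so a retry never leaves duplicates behind.

```python
import sqlite3


def copy_and_dedupe(conn, start, end):
    """Copy the window, then delete duplicate copies of any source record."""
    with conn:
        conn.execute(
            "INSERT INTO report_records (source_id, created_at, payload) "
            "SELECT id, created_at, payload FROM source_records "
            "WHERE created_at >= ? AND created_at <= ?",
            (start, end),
        )
        # Keep the earliest copy of each source record; delete later copies.
        conn.execute(
            "DELETE FROM report_records WHERE rowid NOT IN "
            "(SELECT MIN(rowid) FROM report_records GROUP BY source_id)"
        )
```

Running `copy_and_dedupe` twice over the same window leaves the destination table in the same state as running it once.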
The above scenario is an example. This pattern keeps popping up when you write batch jobs; adapt it to your context.
The two modes of failure with batch jobs are:
- The job misses a schedule.
- The job dies mid-way while running.
If you follow the above principles, your jobs will become resilient in the face of failures.