This article introduces a new testing methodology for 500+ batch applications at eBay. These batch applications contain 10,000+ jobs that run every day to handle offline eBay business. For example, the monthly invoice jobs are responsible for recording all transactions on a customer's primary account for subscriptions. In this piece, we explore the challenges of testing batch applications during upgrades and provide a solution that resolves the pain points of testing.
A batch application differs from a service system because it has no web container inside an instance. A batch pool waits for a job command from an upstream job scheduler, which runs jobs on a given schedule. Once the job command is complete, the application process quits and returns the exit code and exit state to the scheduler.
Batch testing is always a challenge. Generally, a batch application is not idempotent, since most manipulate files, read databases, and so on. Those data sources can be updated frequently, so a batch job may not produce the same results even with the same input parameters.
The traditional way of testing new changes, such as a framework upgrade, for a batch application is to deploy the new code directly into production for online verification. This is a risky approach, and better testing is needed to cover the following use cases:
- Testing jobs that can only run once.
- Some batch applications do not have enough end-to-end test cases to support testing. Thus, there is no baseline for existing behavior that can validate behavior on the new platform.
- There is no automatic, generic way to identify the batch run result across jobs.
The batch testing system is a brand-new testing solution, designed as a centralized yet independent system. It supports running the same job on the same instances against different versions of the code and provides automatic end-to-end comparison.
The basic idea is that the batch framework offers a baseline job by default. Since all eBay site applications must use the batch framework, the framework job it provides can run natively without any extra application code. For instance, when the current application code is working well, the framework job can run twice, once on each version of the code. The metrics before and after can then be compared to reveal any exceptional cases.
This approach resolves the pain points mentioned above:
- The framework job can be repeatedly run in production but has no site impact since there is no business logic inside it.
- The framework job offers default end-to-end testing. It is especially helpful for applications that have no tests but require quality assurance after code changes.
- The framework job can be called in any batch site application, so it is a common solution that can be fully automated.
The system provides three types of comparison to solve the challenges in testing: batch response, batch log, and execution time.
Batch response contains two parts: the exit code and the exit state. The exit code is the value the job returns to the scheduler, and it is the key signal for workflows. A workflow can be defined in a scheduler, with each batch pool treated as a workflow node. The scheduler dispatches the next job to a flow branch based on the exit code of the job that ran in the previous pool. The exit state marks whether the job succeeded.
Batch log reflects the application's running logic. If a new exception appears in the log, or the count of an existing exception increases sharply, the new application code most likely has a problem.
Execution time equals application startup time plus framework job execution time. The same framework job should take roughly the same time to execute. If the execution time increases significantly, the new code carries a performance risk.
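To make the role of the exit code concrete, here is a minimal Python sketch of a scheduler choosing the next flow branch and of a response comparison between two runs. The branch table, the pool names, and the `FAILED` state are hypothetical; only the exit code, the exit state, and the `SUCCESS` value come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchResponse:
    exit_code: int   # value the job returns to the scheduler
    exit_state: str  # "SUCCESS", or a hypothetical "FAILED"


# Hypothetical workflow node: previous pool's exit code -> next pool.
BRANCHES = {0: "pool-b", 1: "pool-c"}


def next_pool(response, branches=BRANCHES, default="pool-alert"):
    """Pick the next flow branch from the previous job's exit code."""
    return branches.get(response.exit_code, default)


def same_response(old, new):
    """Response comparison: both exit code and exit state must match."""
    return old == new
```

Because the exit code drives branching, a changed exit code after an upgrade could silently reroute a workflow, which is why the response is compared first.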
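The log and execution-time comparisons could look roughly like the sketch below. The thresholds are illustrative assumptions (a threefold rise counts as "sharp", a 1.5x slowdown as "significant"), and so is the regex used to pull exception names out of log lines. None of these values comes from the actual system.

```python
import re
from collections import Counter

# Assumed pattern: a dotted identifier ending in "Exception" or "Error".
EXCEPTION_RE = re.compile(r"\b([A-Za-z_][\w.]*(?:Exception|Error))\b")


def count_exceptions(log_lines):
    """Count occurrences of each exception name in the log."""
    counts = Counter()
    for line in log_lines:
        counts.update(EXCEPTION_RE.findall(line))
    return counts


def compare_logs(old_lines, new_lines, sharp_factor=3.0):
    """Report exceptions that are new or whose count rose sharply."""
    old, new = count_exceptions(old_lines), count_exceptions(new_lines)
    findings = []
    for name, n in sorted(new.items()):
        if name not in old:
            findings.append(f"new exception: {name} x{n}")
        elif n >= old[name] * sharp_factor:
            findings.append(f"sharp increase: {name} {old[name]} -> {n}")
    return findings


def compare_duration(old_seconds, new_seconds, max_ratio=1.5):
    """True if the new run is not significantly slower than the old one."""
    return new_seconds <= old_seconds * max_ratio
```

An empty list from `compare_logs` together with `True` from `compare_duration` would mean the new code shows no obvious regression on these two dimensions.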
Batch Testing System Workflow
The workflow of the batch testing system is shown below.
The detailed mirror steps are:
- Trigger the scheduler to run the framework job on an old manifest;
- Trigger the scheduler to run the framework job on a new manifest; and
- Check the comparison result.
Inside the batch testing system, several components work together. The controller is the key component: it acts as an orchestrator and processes the following sequence.
- Read the configuration file for comparison rules.
- Decide whether the old manifest needs to be deployed. In most cases the test runs in the same batch pool, so it is not necessary to deploy the old manifest.
- Check whether the schedule to trigger the framework job is created in the scheduler system or not. If not, it will trigger the scheduler client to create the schedule.
- Trigger the framework job and collect the running data, such as exit code, status and logs.
- Trigger the deployment client to deploy the new manifest.
- Trigger the framework job and collect the running data again.
- Following the rules defined in the configuration file, compare the running data and output the result.
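The controller sequence above can be sketched in Python as a single function. Every collaborator (configuration loader, scheduler client, deployment client, comparator) is passed in as a callable with a hypothetical signature, since the real client interfaces are not described in this article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunData:
    exit_code: int
    state: str
    logs: List[str]


def run_mirror_test(
    load_rules: Callable[[], dict],
    schedule_exists: Callable[[], bool],
    create_schedule: Callable[[], None],
    deploy: Callable[[str], None],
    trigger_job: Callable[[], RunData],
    compare: Callable[[dict, RunData, RunData], bool],
    deploy_old_first: bool = False,
) -> bool:
    rules = load_rules()               # 1. read the comparison rules
    if deploy_old_first:               # 2. usually skipped (same pool)
        deploy("old")
    if not schedule_exists():          # 3. make sure the schedule exists
        create_schedule()
    baseline = trigger_job()           # 4. run on the old manifest
    deploy("new")                      # 5. deploy the new manifest
    candidate = trigger_job()          # 6. run again on the new manifest
    return compare(rules, baseline, candidate)  # 7. apply the rules
```

Injecting the clients as callables keeps the orchestration order testable without a real scheduler or deployment system.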
The configuration file is the key input for deciding whether the test is green. The rule settings depend on each application's needs. For instance, if an application warms up caches when starting up, the allowed execution time should be set longer accordingly.
The controller reads the configuration file, which breaks down the testing rules, from the Git repository. Here is a sample file, followed by an explanation of how to set up the rules.
3 timeout: 5
6 validatorType: batch-execution
8 returnCode: 0
9 durationMax: 3
10 state: SUCCESS
12 validatorType: batch-log-allowlist
15 - "~.*.*::2"
- “FrameworkSmokeTestJob” at line one is the framework job name to be run.
- “timeout: 5” at line three defines five minutes as the maximum batch execution time. If the batch job runs for more than five minutes, the test fails with a timeout exception.
- Lines five to 10 define the execution comparison rules. Lines eight and 10 mean that the return code should be zero and the job should end in a successful state, both of which the framework job requires. Line nine defines the rule that execution time must not exceed three minutes.
- Lines 11 to 15 define the log comparison rules. The regex string in line 15 describes the rule: if any exception appears more than two times, the job is considered failed. Two occurrences are allowed in order to filter out noise. For example, if the application throws a harmless exception twice during startup, that exception should be ignored and defined in the allowlist section.
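As an illustration of how such rules might be evaluated, here is a small Python sketch. It assumes that an allowlist entry has the form `<regex>::<max occurrences>` with an optional leading `~`, which is an inference from the sample line and its explanation rather than a documented format. The function names are hypothetical. The rule keys (`returnCode`, `state`, `durationMax`) are taken from the sample file.

```python
import re


def parse_allowlist_entry(entry):
    """Split an assumed '<regex>::<limit>' entry, dropping a leading '~'."""
    pattern, _, limit = entry.rpartition("::")
    return re.compile(pattern.lstrip("~")), int(limit)


def validate_execution(rule, return_code, state, duration_minutes):
    """Check one run against the batch-execution rule; return failures."""
    failures = []
    if return_code != rule["returnCode"]:
        failures.append(f"return code {return_code} != {rule['returnCode']}")
    if state != rule["state"]:
        failures.append(f"state {state} != {rule['state']}")
    if duration_minutes > rule["durationMax"]:
        failures.append(f"took {duration_minutes} min > {rule['durationMax']}")
    return failures


def validate_log(allowlist, exception_counts):
    """Fail any exception seen more often than its allowlist limit."""
    parsed = [parse_allowlist_entry(e) for e in allowlist]
    failures = []
    for name, count in exception_counts.items():
        limits = [limit for rx, limit in parsed if rx.fullmatch(name)]
        allowed = max(limits, default=0)  # not allowlisted: zero tolerance
        if count > allowed:
            failures.append(f"{name} seen {count} times, allowed {allowed}")
    return failures
```

With the sample entry `"~.*.*::2"`, an exception seen twice passes and one seen three times fails, matching the explanation of line 15.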
As the example file shows, the user of the testing tool has the flexibility to tune the rules according to the application's real running state. Moreover, if other jobs of the application are suitable as test jobs, they can be configured as well, since the configuration file supports multiple jobs. Even if there is no such job, the framework job is configured by default.
The batch testing system provides a successful testing solution for the batch framework upgrade at eBay. It overcomes the challenges of missing end-to-end test cases for legacy batch applications and of “instance to instance” comparison. It provides a non-online verification approach for quality assurance. Although the current scenario is a framework upgrade, it can also help verify application code changes and OS upgrades. It provides a common measurement for a batch application.