At eBay, product teams can choose from multiple stacks (Java with Spring/Spring Boot, Node.js, Python, etc.) to implement eBay’s 3000+ front-end UI, microservice, batch and messaging applications. The application platform team is a central team that operationalizes open source projects for use at eBay by integrating horizontal capabilities such as monitoring and observability, logging, tracing, security protections, authentication and authorization, and more. The application platform team also provides the underlying containers and environment on which the applications run. A bad platform release can break many applications when they upgrade to the latest version of the platform. Therefore, platform release certification is of utmost importance to the application platform team. This article describes how we have used Kubernetes operators, Helm Charts and Jenkins Pipelines to achieve full automation of software quality certification and automated test result comparison.
The types of changesets that need to be certified frequently:
- Framework releases:
  - Sitewide upgrade releases
  - Micro version patches
  - Cross-team contributions
- Runtime updates:
  - JDK, Tomcat, Envoy and Node.js runtime patches
- OS updates:
  - Framework certifications with new container images
  - OS certification: Kernel + OS
  - eBay application container certification
We have architected the certification solution to be efficient and to avoid any manual testing and analysis of results. We also designed the solution to be self-service and capable of performing any type of certification.
The certification solution offers a standard automation template and can orchestrate and handle multiple complex certification requests and combinations. Well-defined certification suites are offered for different types of certification requests.
A certification unit is a standalone unit that performs one part of a certification. The unit types include:
- Simple: Only the test case
- Test case + Test app + pipelines
- Performance test
- Traffic Mirror with response comparison
Certification Helm Charts
Each certification category (Image Certification, Framework Release Certification, Kernel Upgrade Certification, etc.) has its own Helm Chart, which contains the templates of the certification unit and certification instances. Triggering the certification involves installing the corresponding Helm Chart with user-provided parameters into a Kubernetes cluster.
Certification and Chart CRD instances are created based on the Helm Chart and the user input parameters. A Certification instance defines the Jenkins pipeline git repository and pipeline parameters, while a Chart CRD instance defines common parameters, groups of certification units and their dependency relationships.
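To make the idea of dependency relationships between certification units concrete, here is a minimal sketch of how a set of units could be grouped into ordered execution waves. It is an illustration only, not the actual Chart CRD schema or controller logic: the unit names and the `execution_waves` helper are hypothetical, and the sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical certification units, each mapped to the units it depends on.
UNIT_DEPENDENCIES = {
    "deploy-test-app": set(),
    "functional-tests": {"deploy-test-app"},
    "performance-test": {"deploy-test-app"},
    "traffic-mirror": {"functional-tests", "performance-test"},
}


def execution_waves(dependencies):
    """Group units into waves; units in the same wave have no pending
    dependencies on each other and could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(UNIT_DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

A real controller would mark a unit as done only after its pipeline run succeeds, rather than immediately as this sketch does.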
Chart CRD Controller and Certification Controller
Controllers are the main orchestrators of the overall automation solution. The certification chart controller is responsible for managing all certification unit instances: it enforces dependencies, pauses, resumes and aborts certifications, and aggregates the status and result of each unit instance. The certification controller drives the whole lifecycle of a pipeline job run through the certification service. It also remediates each failed job through back-off retry in order to improve the probability of success.
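The back-off retry mentioned above can be sketched roughly as follows. This is a generic exponential back-off loop, not the certification controller's actual code: the function name, the default attempt count and the delays are all assumptions chosen for illustration, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def run_with_backoff(job, max_attempts=4, base_delay=1.0, factor=2.0,
                     sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is reached.

    Returns (result, attempts_used). The last exception is re-raised
    when every attempt fails.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            sleep(delay)     # wait before the next attempt
            delay *= factor  # grow the wait exponentially
```

In a Kubernetes controller the wait would more typically be expressed by requeueing the reconcile request with a delay than by sleeping in-process; the loop above only shows the retry schedule.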
The certification service is the coordination service between the certification controller and the backend Jenkins service. It exposes RESTful APIs that the controllers use to trigger actions and query their status, and it delegates each requested action to the backend Jenkins server for execution. All the Jenkins pipelines are created and deleted on demand. The certification service also saves the job logs before deleting the pipeline.
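The on-demand lifecycle described above (create the pipeline, run it, save its log, delete it) could look roughly like the sketch below. Everything here is hypothetical: `InMemoryJenkins` is a stand-in written for this example and is not a Jenkins client, and the class and method names are not taken from the actual service.

```python
class InMemoryJenkins:
    """Stand-in backend for illustration; not a real Jenkins API."""

    def __init__(self):
        self.pipelines = {}

    def create(self, name, repo, params):
        self.pipelines[name] = {"repo": repo, "params": params, "log": ""}

    def run(self, name):
        job = self.pipelines[name]
        job["log"] = f"ran pipeline from {job['repo']} with {job['params']}"
        return "SUCCESS"

    def log(self, name):
        return self.pipelines[name]["log"]

    def delete(self, name):
        del self.pipelines[name]


class CertificationService:
    """Creates a pipeline on demand, runs it, archives its log, deletes it."""

    def __init__(self, backend):
        self.backend = backend
        self.saved_logs = {}

    def execute(self, name, repo, params):
        self.backend.create(name, repo, params)
        try:
            return self.backend.run(name)
        finally:
            # Archive the log before the pipeline disappears,
            # even when the run itself raised an error.
            self.saved_logs[name] = self.backend.log(name)
            self.backend.delete(name)
```

Putting the archive-then-delete step in a `finally` block is one way to ensure that logs survive failed runs, which is when they are most needed.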
Modularized Jenkins Pipeline Scripts
The pipeline flow varies with each individual certification unit type. We have built modularized pipeline scripts as standalone steps so that different pipeline flows can be constructed by reusing them.
We have also developed pipeline modules that forward a percentage of live traffic to a target host running the n+1 code and compare the responses, modules that execute load and performance runs (using JMeter and Taurus), and modules that automatically compare the resulting metrics (transaction time, TPS, GC, memory, etc.).
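Automated metric comparison can be sketched as a check of each candidate metric against a baseline with a per-metric tolerance. The metric names, the threshold values and the rule that only TPS is "higher is better" are assumptions for this example, not the values used in the actual pipelines. The sketch also assumes that baseline values are non-zero.

```python
def find_regressions(baseline, candidate, thresholds,
                     higher_is_better=("tps",)):
    """Return {metric: relative_regression} for every metric that got
    worse by more than its allowed threshold."""
    regressions = {}
    for metric, limit in thresholds.items():
        base = baseline[metric]
        change = (candidate[metric] - base) / base  # relative change
        if metric in higher_is_better:
            change = -change  # a drop in throughput counts as a regression
        if change > limit:
            regressions[metric] = round(change, 4)
    return regressions


if __name__ == "__main__":
    # Hypothetical numbers for a current release and an n+1 build.
    baseline = {"transaction_ms": 100.0, "tps": 500.0, "heap_mb": 800.0}
    candidate = {"transaction_ms": 112.0, "tps": 495.0, "heap_mb": 820.0}
    thresholds = {"transaction_ms": 0.05, "tps": 0.05, "heap_mb": 0.10}
    print(find_regressions(baseline, candidate, thresholds))
```

With these sample numbers only the transaction time exceeds its tolerance (12% slower against a 5% limit), so it is the only metric reported.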
Developing a unified, scalable certification experience is a huge advancement toward our goal of fully automating software quality certification and test result comparison. We have built different functional modules (framework upgrade, code deployment, result analysis, etc.) using reusable Jenkins pipeline libraries, which are used for instantiating certification units. A certification unit can be easily included in any certification chart just by adjusting parameters and failure thresholds.
The self-service certification portal enables team members to trigger certification jobs without worrying about complex configurations. Notification mechanisms with configurable failure thresholds are in place for prompt troubleshooting. Our certification solution has also helped us reduce the complete certification time from days to hours. By leveraging the Kubernetes controller reconciliation mechanism, we have added resiliency to network issues and unreliable dependencies, reducing the need for manual intervention.
Learnings and Suggestions
Design your technology with scalability in mind. Sometimes capacity and underlying infrastructure must be upgraded to meet your project’s needs, for example with an improved provisioning mechanism, node selectors, lightweight Jenkins builders or failover clusters.
Be bold and try new things. Sometimes open source software is already solving a subset of problems and can serve as a helpful foundation, rather than starting from scratch. Finally, it is important to innovate constantly — with long-term benefits in mind — to create a flywheel effect that scaffolds all of your enhancements into a cohesive, streamlined software system.
Improving developer velocity and empowering our engineering team with world-class tools is essential to delivering great experiences to our customers. We innovate constantly by evaluating and incorporating industry standards and best practices in our SDLC, and we are working on the following enhancements for software quality certification:
Cloud Native CD Pipeline
We plan to use Tekton, a powerful and flexible open source framework for creating CI/CD systems, to run our pipelines, and to integrate it with our CRD controllers.
AI-Powered Failure Analysis Service
We are working on the classification of failures, root-cause analysis and predicting future failures using machine learning.
This article was initially published on The New Stack on March 22, 2021.