Advancements in big data technologies have enabled the processing and storage of massive amounts of data, with data lakes becoming an increasingly popular way to expose this data to users quickly. These provide increased agility and flexibility over traditional data warehouses.
The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is centrally collected and loaded onto Hadoop, and then business analytics and data mining tools are applied to the data where it resides.
But the question remains: Is a “data lake” approach truly enabling faster data-driven decisions? While it may be a more flexible way to consume data, does this hold true for insights and answers?
Is data lake criticism deserved?
Like big data, the term data lake is often criticized as a marketing label for Hadoop. It has come to describe any large data pool where the schema and data requirements are not defined until the data is queried.
In reality, data lakes generate more questions for data users than answers. Data users cannot use data correctly without knowing what exists, how it can be used, what can be trusted, what it means, and how it was generated. Working in a data lake can be a daunting experience, absent any clear way to ask for help or even to find those who may have the answers.
Moreover, dirty data is not really dirty; it is simply incorrect, and data cleansing consists of correcting those mistakes. In an unmanaged data lake, however, garbage data keeps flowing in uncleaned and is kept forever with no retention process. Worse still, data quality issues are simply ignored at this stage.
Approach and results
eBay has a culture of independence and innovation, which calls for an open approach that puts control in the hands of data users and supports exploration and innovation. Toward this goal, we turned the concept of data governance on its head. Rather than focusing on control and limiting access, our data governance initiative focuses on gathering exhaustive information about each data element and making it available to users within their normal workflows in a programmatic way. Data users are no longer working in a silo; they can make informed decisions about whether a given data element or object is the right fit to answer their business question.
One key way to understand your customers is to understand their behavior. At eBay, we track the full usage of our data assets, including how many times they are accessed, by whom, and how the asset was accessed (for example, via report vs. manual querying), all the way down to the actual queries executed. This information is then used for things like data retention (retiring unused assets), setting operational goals for levels of support, and even optimizing our big data fabric.
Data governance is managed as a process and a product, not a project: it provides value without limiting users, and that is how we keep the data lake clean. We are able to adapt to changes in data and data needs and keep data up to date, which may mean deprecating or removing data as well as adding it.
Policies and processes
We implement policies and processes to manage the data lifecycle: Release -> Monitoring -> Re-Certification -> Rationalization/Optimization.
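Treated as a process, the lifecycle above can be sketched as a small state machine. The stage names follow the text; the transition rules and the `advance` helper are illustrative assumptions, not the actual implementation:

```python
from enum import Enum

class Stage(Enum):
    RELEASE = "Release"
    MONITORING = "Monitoring"
    RECERTIFICATION = "Re-Certification"
    RATIONALIZATION = "Rationalization/Optimization"

# Allowed forward transitions; a re-certified asset loops back into
# monitoring, and rationalization can retire or optimize the asset.
TRANSITIONS = {
    Stage.RELEASE: {Stage.MONITORING},
    Stage.MONITORING: {Stage.RECERTIFICATION},
    Stage.RECERTIFICATION: {Stage.MONITORING, Stage.RATIONALIZATION},
    Stage.RATIONALIZATION: set(),  # terminal stage for this sketch
}

def advance(current: Stage, target: Stage) -> Stage:
    """Move an asset to the next lifecycle stage, rejecting invalid jumps."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target
```

Encoding the lifecycle this way makes it auditable: every stage change is an explicit, validated event rather than an ad hoc status flag.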
These process assets are critical to ensuring the right behavior during the data development phase across eBay's holistic big data environment. Meanwhile, internal and external (SOX) audit practices serve as the benchmark for process success.
We also ensure that golden data assets are provided in a timely, curated manner to support eBay business users and analysts in data analysis and business decision making.
As the big data fabric governance team, we proactively apply data rationalization policies and related exercises to maintain the well-being of the data warehouse environment. Data rationalization is a key consideration whenever new data assets are brought into the data lake.
As a result, we gain agility and change management at scale.
We create tools to help us discover and understand data, manage inventory, analyze usage, and optimize the usage and capacity.
As a result, we gain agility, change management at scale, and the ability to answer questions such as what, how, where, and when.
We create a quality platform to automatically validate and verify the data.
Cross-platform data reconciliation
Monitors data consistency across different data platforms
Sends notifications automatically when data is out of sync
An open web service allows downstream consumers to plug quality checks into their downstream processing
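A reconciliation check of this kind can be sketched in a few lines. The platform names, row-count inputs, and the notification strings below are illustrative assumptions, not the actual service:

```python
def reconcile(counts: dict, tolerance: float = 0.0) -> list:
    """Compare row counts for one data asset across platforms and
    return human-readable mismatch notifications."""
    baseline_platform, baseline = next(iter(counts.items()))
    notifications = []
    for platform, count in counts.items():
        # Flag any platform whose count deviates from the baseline
        # by more than the allowed tolerance fraction.
        if baseline and abs(count - baseline) / baseline > tolerance:
            notifications.append(
                f"{platform} out of sync with {baseline_platform}: "
                f"{count} vs {baseline}"
            )
    return notifications

# Example: the same table loaded on two platforms.
alerts = reconcile({"teradata": 1_000_000, "hadoop": 998_500}, tolerance=0.001)
```

In production the mismatch list would feed the automatic notification described above; here it is simply returned to the caller.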
Rule-based data quality check
Monitors data accuracy through the data flow
Supports both default and free-format quality check rules
Sends notifications automatically when a rule is triggered
An open web service allows downstream consumers to plug quality checks into their downstream processing
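The distinction between default and free-format rules can be sketched as follows; the rule names, record shape, and `run_checks` helper are assumptions for illustration:

```python
# Default rules are predefined, parameterized checks; free-format rules
# are arbitrary predicates supplied by the data owner (the names follow
# the text, the implementations are illustrative).
DEFAULT_RULES = {
    "not_null": lambda record, field: record.get(field) is not None,
    "positive": lambda record, field: (record.get(field) or 0) > 0,
}

def run_checks(record: dict, rules: list) -> list:
    """Evaluate each (rule_name, field_or_predicate) pair against a
    record and return the names of triggered (failed) rules."""
    triggered = []
    for name, arg in rules:
        if name in DEFAULT_RULES:
            ok = DEFAULT_RULES[name](record, arg)
        else:
            ok = arg(record)  # free-format rule: a custom predicate
        if not ok:
            triggered.append(name)
    return triggered

order = {"item_id": 42, "price": -1.0}
failed = run_checks(order, [
    ("not_null", "item_id"),
    ("positive", "price"),
    ("has_buyer", lambda r: "buyer_id" in r),
])
```

Each triggered rule name would then drive the automatic notification step described above.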
As a result, users are able to answer the question for themselves “Can I trust the data?”
Usage analysis provides a 360-degree view of how data assets are used by customers.
A self-service tool to analyze data usage via drag and drop, sourced from the Teradata QueryLog
Understand who used what data and how frequently
Understand how data products are used among different groups of users
Understand data usage distribution across different platforms, locations, etc.
Understand who the active users of a data asset are, so we can contact them in case of data changes or incidents
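In spirit, the aggregation behind these views looks something like this; the log fields and sample records are assumptions (the real source is the Teradata QueryLog):

```python
from collections import Counter, defaultdict

# Hypothetical query-log records: (user, data_asset, access_method)
query_log = [
    ("alice", "dw.orders", "report"),
    ("alice", "dw.orders", "manual"),
    ("bob", "dw.orders", "report"),
    ("carol", "dw.listings", "manual"),
]

def usage_summary(log):
    """Aggregate who used which asset, how often, and by what method."""
    freq = Counter((user, asset) for user, asset, _ in log)
    by_method = Counter(method for _, _, method in log)
    users_per_asset = defaultdict(set)
    for user, asset, _ in log:
        users_per_asset[asset].add(user)
    return freq, by_method, users_per_asset

freq, by_method, users_per_asset = usage_summary(query_log)
# Active users of an asset, e.g. to notify about changes or incidents:
active = sorted(users_per_asset["dw.orders"])
```

A production version would of course read from the query log itself rather than an in-memory list, but the questions answered are the same: who, what, how often, and how.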
As a result, eBay is enabled to use data to power data products.
We created shared scorecards across the organization to create alignment and ensure the adoption of our processes internally and of our data products externally (to leadership).
As a result, adoption brings subject-matter and user expertise; that is where knowledge and collaboration come together to enhance user productivity.
At the end of the day, it is all about metadata.
Thanks to our talented engineering team, we have turned these concepts into an innovative product (DOE—Data Operational Excellence) that manages and integrates all metadata together, which can then be used to answer questions about any specific production data asset.
We were able to successfully deploy processes and policies with full coverage of EDW in both Teradata and Hadoop, with close to 100% metadata and lineage coverage. All critical data assets have a 24/7 monitoring process.
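Conceptually, lineage coverage means every production asset carries a metadata record linking it to its upstream sources. A minimal sketch, with field names that are illustrative rather than DOE's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AssetMetadata:
    """One catalog entry: descriptive metadata plus upstream lineage."""
    name: str
    platform: str            # e.g. "teradata" or "hadoop"
    owner: str
    upstream: list = field(default_factory=list)  # names of source assets

catalog = {
    "dw.orders": AssetMetadata("dw.orders", "teradata", "team-a"),
    "dm.sales": AssetMetadata("dm.sales", "hadoop", "team-b",
                              upstream=["dw.orders"]),
}

def lineage(asset: str, depth: int = 0) -> list:
    """Walk upstream recursively: 'where does this data come from?'"""
    lines = [("  " * depth) + asset]
    for src in catalog[asset].upstream:
        lines.extend(lineage(src, depth + 1))
    return lines
```

With metadata and lineage stored this way, "close to 100% coverage" becomes a measurable property: every production asset either has a catalog entry with resolvable upstream links, or it shows up as a gap.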
This foundational work supports keeping the data lake "beautiful," and the data users are able to get the full value of using the data.
Our self-service data discovery platform reached a tipping point of adoption and continues to grow, with plans to expand the data catalog to include real-time data, services, and data APIs, as well as enhanced discovery enabled by AI-powered automatic data tagging.