Databricks Delta Lake is the latest evolution in data lakehouse modernization. Delta Live Tables (DLT) is a framework for building reliable, maintainable, and testable data processing pipelines: it accelerates data engineering by managing task orchestration, cluster management, monitoring, data quality, and error handling. As a private preview user before DLT became generally available, Valorem Reply’s Data & AI team developed deep expertise in data modernization with Delta Live Tables features and technologies.
A key concern for our clients is the reliability and quality of their data. After all, if end users don’t trust the data, all your modernization investments will result in lackluster adoption and returns. For engineers ingesting large volumes of batch or streaming data, determining whether each run succeeded or failed is a time-consuming process. Without trend-over-time visuals, observing how data moves through the system can be painstaking.
To solve this, our data experts developed the DLT Data Quality Dashboard. This unique IP lets users with little to no DLT expertise harvest metadata from DLT Expectations into reporting objects. A simple color-coded visualization in a Databricks Dashboard or Power BI dashboard shows at a glance whether data quality or ingestion issues exist and how they trend over time, with links to the underlying jobs for deeper inspection.
Our near real-time Data Quality Power BI dashboard gives users the operational support, pipeline insight, and actionable detail they need to monitor, triage, and uncover data quality or ingestion issues. Our customers can see source data quality issues moving through their pipelines, often before those issues reach the descriptive or predictive layers. Data scientists can self-serve to confirm that source data quality is appropriate for training models. Data engineering operations support tasks can be delegated to more junior staff, freeing senior data engineers to focus on more complex optimization opportunities.
Installing Valorem Reply’s DLT Data Quality Dashboard is simple and takes two steps:
1. Create the data structures used by the reporting solution. These structures hold the metadata from the pipeline runs. The demo below shows the script that generates them (a minimal sketch also follows this list).
2. Hydrate metadata into the data structures created in step 1. The demo below shows the script, which can be scheduled according to business requirements; it pulls data from the JSON logs generated whenever a DLT pipeline run happens (also sketched after this list).
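To make step 1 concrete, here is a minimal sketch of the kind of script involved, assuming a Databricks notebook where `spark` is predefined. The `dlt_monitoring` database and the table and column names are illustrative, not the dashboard’s actual objects:

```python
# Minimal sketch of step 1: create the Delta tables that will hold
# harvested DLT run metadata. All object names here are illustrative.

spark.sql("CREATE DATABASE IF NOT EXISTS dlt_monitoring")

# One row per pipeline event, for run-level monitoring.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dlt_monitoring.pipeline_runs (
        pipeline_id     STRING,
        update_id       STRING,
        event_timestamp TIMESTAMP,
        event_type      STRING,
        message         STRING
    ) USING DELTA
""")

# One row per expectation result, for data quality trending.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dlt_monitoring.expectation_results (
        pipeline_id     STRING,
        update_id       STRING,
        event_timestamp TIMESTAMP,
        dataset         STRING,
        expectation     STRING,
        passed_records  BIGINT,
        failed_records  BIGINT
    ) USING DELTA
""")
```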
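Step 2’s harvesting logic might look like the following sketch. DLT persists an event log under each pipeline’s storage location, with the per-run quality metadata carried as JSON in its `details` column; the event-log path, the expectations schema, and the target table below are assumptions to adapt for your environment:

```python
# Minimal sketch of step 2: harvest expectation metrics from a DLT event
# log into the illustrative reporting table above. The path is a
# placeholder; substitute your pipeline's storage location.
from pyspark.sql import functions as F

events = spark.read.format("delta").load(
    "dbfs:/pipelines/<pipeline-id>/system/events"  # illustrative path
)

# flow_progress events carry per-expectation pass/fail counts as JSON
# under details -> flow_progress.data_quality.expectations.
expectations = (
    events
    .where(F.col("event_type") == "flow_progress")
    .select(
        F.col("origin.pipeline_id").alias("pipeline_id"),
        F.col("origin.update_id").alias("update_id"),
        F.col("timestamp").alias("event_timestamp"),
        F.explode(
            F.from_json(
                F.get_json_object(
                    "details", "$.flow_progress.data_quality.expectations"
                ),
                "array<struct<name:string,dataset:string,"
                "passed_records:bigint,failed_records:bigint>>",
            )
        ).alias("e"),
    )
    .select(
        "pipeline_id", "update_id", "event_timestamp",
        F.col("e.dataset").alias("dataset"),
        F.col("e.name").alias("expectation"),
        F.col("e.passed_records").alias("passed_records"),
        F.col("e.failed_records").alias("failed_records"),
    )
)

expectations.write.mode("append").saveAsTable(
    "dlt_monitoring.expectation_results"
)
```

Scheduling this as a recurring job keeps the reporting tables in sync with each new pipeline run.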
Once the data is populated in these tables, it can be used in any reporting tool. Here is a sample of a Power BI data model that we created using this data structure.
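For example, a simple aggregation over the illustrative tables sketched above produces the kind of trend-over-time series a report or dashboard can chart:

```python
# Hypothetical reporting query over the illustrative expectation_results
# table: daily pass rate per dataset and expectation.
daily_quality = spark.sql("""
    SELECT
        dataset,
        expectation,
        DATE(event_timestamp) AS run_date,
        SUM(passed_records)   AS passed,
        SUM(failed_records)   AS failed,
        SUM(passed_records)
            / NULLIF(SUM(passed_records) + SUM(failed_records), 0) AS pass_rate
    FROM dlt_monitoring.expectation_results
    GROUP BY dataset, expectation, DATE(event_timestamp)
    ORDER BY run_date
""")
daily_quality.show()
```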
Users can create customized reporting dashboards on top of the curated log analytics data structures we generate from the raw log files. Below is a sample high-level dashboard in Power BI, followed by a more granular view of the pipeline statistics. If desired, push alert notifications can be set on thresholds for a wide range of data quality situations.
If you’d like to learn more about our automated data quality monitoring dashboard or how we can help you accelerate and optimize your data platform modernization through Databricks Delta Lake, reach out to us at marketing.valorem@reply.com.