Ensuring data is ingested reliably and securely is one of the biggest challenges in governance. We often see clients struggle with data governance: the old ways of doing it no longer work and rarely have a chance to succeed. Some big names, such as Gartner, have even declared traditional data governance dead.
Data governance is frequently handled through manual, static, and uniquely designed processes, often tied to ITIL-based change management systems, with database administrators driving most of the hands-on work.
Data sets keep growing, and the technology world has been moving to the cloud for several years. Nowadays, it is nearly impossible to maintain accurate control without automation.
With big data, data needs to be ingested far faster than it used to be, and data privacy and security requirements add to the complexity. That complexity has grown beyond what manual or ad-hoc techniques in GUI-based tools can handle.
At Infostrux, we believe the path to improving data governance is through automation and tools that support dynamic data governance.
Data governance is a set of principles and practices that keep data and data usage reliable and trustworthy.
It is also a way to ensure the data is not misused. Data governance usually involves a steering committee that provides executive sponsorship, holds everyone accountable, and champions data across the enterprise; a governance team that sets high-level priorities and general standards of operation; and a data team that champions best practices and frameworks so teams across the board can adopt them quickly while maintaining governance. This data team typically creates the frameworks and empowers other teams to use the data.
There are many reasons why we need to govern data. At its core, data governance controls key aspects of your data, such as its availability, usability, integrity, and security.
Organizations face many challenges in building reliable data sets, and as we noted above, data programs have a high rate of failure.
Cloud-based data architecture allows organizations to focus on data intelligence, not system management.
Let us look at existing automation and how it can help when applying DataOps principles. In particular, an ‘everything as code’ approach allows the data pipelines and models to be tested.
In general, governance becomes stronger with everything as code because you can design and enforce policy-based automation at the code and CI/CD level. Much like DevSecOps, DataOps will evolve into a policy-based mechanism.
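As a minimal sketch of what such a policy check could look like in CI, assume models are described by YAML files carrying hypothetical `owner` and `classification` keys; a script like the one below could then block any pull request that violates the policy:

```python
# Hypothetical policy-as-code check that runs in CI on every pull request.
# It assumes models are described by YAML files under models/ that carry
# "owner" and "classification" keys -- adjust to your own repository layout.
import sys
from pathlib import Path

import yaml  # pip install pyyaml

REQUIRED_KEYS = {"owner", "classification"}


def find_violations(models_dir: str = "models") -> list[str]:
    """Return a description of every model file missing required metadata."""
    violations = []
    for path in Path(models_dir).rglob("*.yml"):
        config = yaml.safe_load(path.read_text()) or {}
        missing = REQUIRED_KEYS - set(config)
        if missing:
            violations.append(f"{path}: missing {sorted(missing)}")
    return violations


if __name__ == "__main__":
    problems = find_violations()
    for problem in problems:
        print(problem)
    # A non-zero exit code fails the CI job and blocks the pull request.
    sys.exit(1 if problems else 0)
```

Because the check runs on every pull request, the policy is enforced in the code base itself rather than through manual review.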
A data catalog helps users find and manage data sets across many systems. With everything as code, you can update the data catalog automatically with every new model, dimension, and fact, so the catalog never misses any information.
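For illustration only, a deployment step could push metadata to the catalog's API. The endpoint, token handling, and payload below are placeholders, since each catalog (Atlan, Alation, DataHub, and so on) exposes its own API:

```python
# Hypothetical deployment hook that registers a new model with a data catalog.
# The endpoint, token handling, and payload shape are placeholders; real
# catalogs each expose their own API.
import os

import requests

CATALOG_URL = "https://catalog.example.com/api/assets"  # placeholder endpoint
API_TOKEN = os.environ["CATALOG_API_TOKEN"]  # injected by CI from a secrets store


def register_model(name: str, columns: list[str], owner: str) -> None:
    """Create or update the catalog entry for a deployed model."""
    payload = {"name": name, "columns": columns, "owner": owner, "source": "ci-pipeline"}
    response = requests.post(
        CATALOG_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()


# Called whenever a new model, dimension, or fact table ships.
register_model("dim_customer", ["customer_id", "country_code"], "data-team")
```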
Automated documentation as part of the data pipelines profiles and verifies the schemas, generates documentation, and saves the results into whatever system you want. The generated documentation can be tracked through pull requests in your Git system, pushed into a documentation system you maintain so there is change-management tracking, or handled through one of the many SaaS tools available on the market.
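Here is a minimal sketch of that idea, assuming the data lives in Snowflake and that a Markdown page per table is enough; the connection details and table name are placeholders:

```python
# Illustrative documentation step: profile a table's schema from Snowflake's
# information schema and write a Markdown page that can be committed with the
# pull request. Connection details and the table name are placeholders.
import os

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="DOCS_WH",
    database="ANALYTICS",
    schema="CORE",
)


def document_table(table_name: str) -> str:
    """Return a Markdown description of the table's columns."""
    cur = conn.cursor()
    try:
        cur.execute(
            """
            SELECT column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_name = %s
            ORDER BY ordinal_position
            """,
            (table_name,),
        )
        rows = cur.fetchall()
    finally:
        cur.close()
    lines = [f"# {table_name}", "", "| Column | Type | Nullable |", "| --- | --- | --- |"]
    lines += [f"| {name} | {dtype} | {nullable} |" for name, dtype, nullable in rows]
    return "\n".join(lines)


with open("DIM_CUSTOMER.md", "w") as doc:
    doc.write(document_table("DIM_CUSTOMER"))
```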
Like the catalog, data lineage and a glossary help users better find and understand what is in their data. Automation removes human mistakes and traces data across the entire data pipeline through logging and a tagging system that flows into your data catalog.
For example, suppose we are using Fivetran to feed data from Salesforce into Snowflake. On every run, a data glossary and a data lineage table are updated as the pipelines execute, identifying where each dimension comes from and which system put it there. You can then keep track of that lineage by knowing when an insert from Salesforce was made, which dimensions were updated, and in which tables.
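A minimal sketch of recording such lineage events, assuming a `LINEAGE_EVENTS` table you maintain yourself in Snowflake; none of the names here are Fivetran or Snowflake built-ins:

```python
# Illustrative lineage recording step, run after each load. It assumes a
# LINEAGE_EVENTS table that you maintain yourself in Snowflake; none of the
# names here are Fivetran or Snowflake built-ins.
import os
from datetime import datetime, timezone

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)


def record_lineage(source_system: str, source_object: str, target_table: str, columns: list[str]) -> None:
    """Insert one lineage event: which system wrote which columns to which table, and when."""
    cur = conn.cursor()
    try:
        cur.execute(
            """
            INSERT INTO analytics.governance.lineage_events
                (loaded_at, source_system, source_object, target_table, columns_updated)
            VALUES (%s, %s, %s, %s, %s)
            """,
            (
                datetime.now(timezone.utc),
                source_system,
                source_object,
                target_table,
                ",".join(columns),
            ),
        )
    finally:
        cur.close()


# Example: Fivetran just loaded Salesforce accounts into the raw layer.
record_lineage("salesforce", "Account", "raw.salesforce.account", ["name", "billing_country"])
```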
You can find an implementation pattern for this here. There are many other options and tools for building it; I like SaaS tools such as Atlan or Alation.
Data quality can be automated in different ways. Automated quality means fixing issues in the data through automation.
For example, suppose you have multiple data sources that each encode the country dimension differently: some use CA, Canada, CAN, or CANADA. Depending on your data storage system, the ETL or ELT process could include a function that fixes all of these, making the country code consistent across sources.
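A minimal sketch of that normalization step; the mapping is illustrative, and in practice it would likely come from a reference table rather than live in code:

```python
# Illustrative normalization helper for the country example above. In practice
# the mapping would likely live in a reference table rather than in code.
COUNTRY_ALIASES = {
    "CA": "CA",
    "CAN": "CA",
    "CANADA": "CA",
}


def normalize_country(raw_value: str) -> str:
    """Return a consistent country code for the various raw spellings."""
    key = raw_value.strip().upper()
    return COUNTRY_ALIASES.get(key, key)


assert normalize_country(" canada ") == "CA"
assert normalize_country("CAN") == "CA"
assert normalize_country("CA") == "CA"
```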
Here is an example of checking data quality.
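As a minimal sketch, assuming the data already sits in Snowflake and that a failed check should stop the pipeline (table and column names are illustrative):

```python
# A minimal sketch of automated quality checks, assuming the data already sits
# in Snowflake and that a failed check should stop the pipeline. Table and
# column names are illustrative.
import os
import sys

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

CHECKS = {
    "country_code_not_null": "SELECT COUNT(*) FROM analytics.core.dim_customer WHERE country_code IS NULL",
    "country_code_two_chars": "SELECT COUNT(*) FROM analytics.core.dim_customer WHERE LENGTH(country_code) <> 2",
}

failures = []
cur = conn.cursor()
try:
    for name, query in CHECKS.items():
        cur.execute(query)
        bad_rows = cur.fetchone()[0]
        if bad_rows:
            failures.append(f"{name}: {bad_rows} offending rows")
finally:
    cur.close()

for failure in failures:
    print(failure)
# A non-zero exit code stops downstream steps until the issue is investigated.
sys.exit(1 if failures else 0)
```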
Security is probably the easiest area to explain in terms of automation, as many of us are already used to automated security in the DevOps world. The benefit of automating your security is that change controls can be verified through established practices, such as pull requests with peer reviews and approvals, so fewer mistakes are made.
Other benefits of automation include policy-based de-identification of the data linked to RBAC roles. Systems like Snowflake allow row- and column-level security tied to RBAC roles.
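A minimal sketch of applying column-level masking as code in Snowflake, so the policy goes through the same pull-request review as everything else; the role, database, table, and policy names are placeholders, and the connection needs a role with the CREATE MASKING POLICY privilege:

```python
# Illustrative column masking applied as code so it is reviewed like any other
# change. The role, database, table, and policy names are placeholders, and the
# connection needs a role with the CREATE MASKING POLICY privilege.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

CREATE_POLICY_SQL = """
CREATE MASKING POLICY analytics.governance.pii_email_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE '***MASKED***'
  END
"""

APPLY_POLICY_SQL = """
ALTER TABLE analytics.core.dim_customer
  MODIFY COLUMN email SET MASKING POLICY analytics.governance.pii_email_mask
"""

cur = conn.cursor()
try:
    cur.execute(CREATE_POLICY_SQL)   # define the policy: only PII_READER sees raw values
    cur.execute(APPLY_POLICY_SQL)    # attach the policy to the email column
finally:
    cur.close()
```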
Data engineering (aka DataOps), like the DevOps movement, is an excellent answer to data governance in the era of large data sets as it helps manage complexity.
Many of the pains of data governance revolve around things that can be fixed through automation and by fostering collaborative environments, shared responsibility, and continuous improvement.
With those in place, governance becomes a shared-responsibility model, with everyone involved and working towards it.