Building an Effective Data Architecture:
Part 2 – The Importance of Data Governance
Icon Analytics Co-Founder
The purpose of Data Governance is to ensure your data is consistent, not only within a business unit but also across the entire organization. Well-implemented Data Governance guarantees that inputs to your data are reliable and accurate, that metrics are agreed upon and used across business units, and that transformations are applied repeatably. The key to achieving this consistency is proper identification of data stewards within each business unit that uses the data architecture. These should be people who not only understand the data on a technical level but also have a thorough awareness of the business purpose of the organization and the role the data plays in it. In other words, the data steward is the “go-to person” and the source of truth for your organization’s data. At its core, Data Governance revolves around the data stewards and the standardized procedures used across the data architecture.
When building out any data architecture, the first piece that we like to establish is a strong Data Governance layer. The earlier this layer can be established, the better off the organization is. This is because Data Governance is involved in every facet of a data architecture. By creating this layer earlier in the architecture build, it will be easier to manage and govern your data architecture as it grows. In Part 1 of this series, I mentioned how a Data Governance team will help resolve the issues of Data Silos and Uncontrollable Business Logic, and how to keep these issues at bay while scaling and growing. This article will take a deeper dive into a data architecture’s Data Governance layer, the foundational pieces required to maintain this layer, and how it can keep Data Silos and Uncontrollable Business Logic from forming.
Most of this article will be from the point of view of a single business unit within your organization. Please refer to Diagram 1 to help visualize an ideal data architecture and how a single business unit fits into the larger picture.
The first step to governing your organization’s data is to select a group of data stewards. Their job is to vet any existing data and either deem it reliable or mark it for exclusion. It is important that this group understands your organization’s business operations and how the data represents those operations. This step is often easiest to perform as business units roll onto a clean data architecture. The stewards will then act as gatekeepers for any new data entering the architecture.
Once the data residing within an architecture is deemed reliable and accurate, the duties of the data stewards expand to help regulate and secure the reliable data held within different deployment environments; in most cases these environments are called development, test, and production. The best way to perform this regulation is with a combination of Role-based access and Service Accounts.
Within each deployment environment and across each business unit, the following Roles will be required at a minimum:
User Access Role – Every employee requiring access to data will be added to this Role. This Role grants read/select access to all the tables in the Centralized Data Warehouse. It also grants the user read/select access to the data objects owned by the business unit that employee belongs to.
Team-Based Admin Role – This Role will have more responsibility and capabilities compared to the User Access Role. This administration Role will have write and execute privileges on the business unit’s data elements only, not to the Centralized Data Warehouse. This Role is generally granted to the business unit’s power users, or people who directly develop code against the data elements owned by that business unit only.
Super Admin Role – This Role is only granted to the people who will be responsible for maintaining the Centralized Data Warehouse. These employees will have full control of all objects in the architecture including any objects owned by any business unit. These users are generally your organization’s data engineers and data architects.
These are the minimum requirements for effective Role-based access to a data architecture. Of course, this is not a one-size-fits-all solution, and your organization may need to split write access from execute access. The primary goal with these Roles is to NEVER grant any user direct access to any of the architecture’s data elements. A user must be added to one or more of the above-mentioned Roles to access the architecture’s data elements. Additionally, these Roles should never point across deployment environments (e.g., a test environment User Access Role should never have read access to a production table; it will only have read access to test environment objects).
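The three Roles and the environment rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real database's grant model: the object names `warehouse` and `<unit>_mart`, the `*` wildcard, and the `<env>_super_adm` Role name are all hypothetical, and an actual platform would express the same idea with its own GRANT statements.

```python
from enum import Enum


class Privilege(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"


READ_ONLY = frozenset({Privilege.READ})
WRITE_EXECUTE = frozenset({Privilege.WRITE, Privilege.EXECUTE})
FULL_CONTROL = frozenset(Privilege)

WAREHOUSE = "warehouse"  # hypothetical name for the Centralized Data Warehouse
ANY_OBJECT = "*"         # hypothetical wildcard, used only by the Super Admin Role


def business_unit_roles(unit, env):
    """Minimum Roles for one business unit in one deployment environment."""
    mart = f"{unit}_mart"  # hypothetical name for the unit's own data objects
    return {
        # User Access Role: read-only on the warehouse and the unit's own objects.
        f"{unit}_{env}_user": {(env, WAREHOUSE): READ_ONLY, (env, mart): READ_ONLY},
        # Team-Based Admin Role: write/execute on the unit's own objects only.
        f"{unit}_{env}_adm": {(env, mart): WRITE_EXECUTE},
    }


def super_admin_role(env):
    """Super Admin Role: full control of every object in one environment."""
    return {f"{env}_super_adm": {(env, ANY_OBJECT): FULL_CONTROL}}


def is_allowed(grants, env, obj, privilege):
    """True if a Role's grants cover this privilege on this object in this environment."""
    for (grant_env, grant_obj), privileges in grants.items():
        if grant_env == env and grant_obj in (obj, ANY_OBJECT) and privilege in privileges:
            return True
    return False


# Roles for a Finance unit in the test environment, plus the test Super Admin Role.
roles = {**business_unit_roles("fin", "tst"), **super_admin_role("tst")}
```

Because every grant is keyed by environment, a call such as `is_allowed(roles["fin_tst_user"], "prd", "warehouse", Privilege.READ)` returns `False`: a test Role simply has nothing to say about production objects.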
Below is an example for Role-based access specific to a Finance business unit. Let’s assume this business unit belongs to an organization with a development, test, and production environment:
fin_[dev/tst/prd]_user: A User Access Role for each environment. All Finance team members will need to be added to all three Roles. This allows users to read/select from each environment’s version of the Centralized Data Warehouse and of the Finance team’s Data Mart, a mini data warehouse downstream from the centralized repository that is used exclusively by the business unit that owns it.
fin_[dev/tst/prd]_adm: A Team-Based Admin Role for each environment, meant for the Finance team’s power users: engineers or developers building code tailored specifically to the Finance team. These individuals will need write and execute access exclusively to the objects within the Finance team’s Data Mart, in each of the three deployment environments.
Notice how there is not a Finance Super Admin Role. From the point of view of a business unit, there aren’t any users in that team that require write or execute access to the central repository. This Super Admin Role is restricted to the data architects, DBAs, data engineers, and/or data governors depending on your organization’s needs.
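As a rough sketch of the Finance example, the snippet below expands the `fin_[dev/tst/prd]_user` and `fin_[dev/tst/prd]_adm` patterns into concrete Role names and records who is added to each. The function name and the team members are hypothetical and exist only for illustration.

```python
ENVIRONMENTS = ("dev", "tst", "prd")


def finance_role_memberships(team_members, power_users):
    """Map each Finance Role name to the set of users added to it."""
    memberships = {}
    for env in ENVIRONMENTS:
        # Every Finance team member is added to the User Access Role of each environment.
        memberships[f"fin_{env}_user"] = set(team_members)
        # Only power users are added to the Team-Based Admin Role.
        memberships[f"fin_{env}_adm"] = set(power_users)
    return memberships


# Hypothetical team in which "dana" is the only power user.
memberships = finance_role_memberships(
    team_members=["ali", "bea", "dana"],
    power_users=["dana"],
)
```

The result holds six Roles, two per environment, and none of them is a Super Admin Role, which matches the point made above: that Role lives with the people who maintain the Centralized Data Warehouse, not with the business unit.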
Service Accounts are the next piece of the puzzle, as they help separate the general day-to-day work of your employees from regularly scheduled automation. The number of Service Accounts needed will vary based on the number of tools your organization uses and the number of business units that will be accessing the data architecture. There will need to be one Service Account per business unit and tool in each deployment environment. For example, if your Finance team uses an ETL tool to pull data from the Centralized Data Warehouse and a data visualization tool to analyze that data, the Finance team will need two Service Accounts per deployment environment. In our sample organization, which has development, test, and production environments, this Finance team requires 6 Service Accounts in total: one per tool in each of the three environments.
Service Accounts should be thought of as users that access the architecture just like your employees would. Therefore, they also need to be added to the Roles established above. Referring to the prior example, each of the Finance team’s ETL Service Accounts will need to be added to both the User Access Role and the Team-Based Admin Role of its environment. This ensures the ETL tool can read from the Centralized Data Warehouse and write any transformed data to the Finance team’s Data Mart. The visualization Service Account, however, will likely only need to be added to the User Access Role, allowing it to read from the Finance team’s Data Mart and the Centralized Data Warehouse.
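The Service Account arithmetic and Role assignments for the Finance example can be sketched as follows. The account naming scheme `svc_<unit>_<env>_<tool>` and the tool keys `etl` and `viz` are assumptions made for this illustration; only the Role names follow the `fin_[dev/tst/prd]_user` and `fin_[dev/tst/prd]_adm` pattern used earlier.

```python
from itertools import product

ENVIRONMENTS = ("dev", "tst", "prd")

# Which Roles each tool's Service Account joins, by Role suffix.
# The ETL tool reads the warehouse and writes to the mart; the
# visualization tool only reads.
TOOL_ROLE_SUFFIXES = {
    "etl": ("user", "adm"),
    "viz": ("user",),
}


def service_accounts(unit, tools=TOOL_ROLE_SUFFIXES, environments=ENVIRONMENTS):
    """One Service Account per tool per environment, mapped to its Roles."""
    accounts = {}
    for env, (tool, suffixes) in product(environments, tools.items()):
        name = f"svc_{unit}_{env}_{tool}"  # hypothetical naming scheme
        # A Service Account only ever joins Roles of its own environment.
        accounts[name] = [f"{unit}_{env}_{suffix}" for suffix in suffixes]
    return accounts


finance_accounts = service_accounts("fin")
```

With two tools and three environments, `finance_accounts` contains six entries. The production ETL account is mapped to `fin_prd_user` and `fin_prd_adm`, while the production visualization account is mapped to `fin_prd_user` alone.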
With this combination of Roles and Service Accounts you guarantee a few things that help the Data Governance team perform their jobs of maintaining reliable data, and regulating front-end Business Logic:
Neither User Accounts nor Service Accounts can manipulate and ruin the Centralized Data Warehouse. This is the single source of truth across your organization. It holds all valid business logic, metadata, master data, and organization-standard data transformations and models. This is your money maker, so don’t give people direct access to it. Only Role-based access is allowed, since it can easily be maintained by the governance team: rules established for a Role trickle down to all users assigned to that Role. Only a small, qualified group of individuals can be added to the Super Admin Role, and only they will have the ability to write to and execute on the Centralized Data Warehouse.
Development and test data will never cross environments and tarnish production-quality data. This is true because the Service Accounts that run the automated processes in production do not have access to the lower-environment tables. This helps maintain data quality and accuracy, including the business logic of the Centralized Data Warehouse. For example, when a production job fails because a production Service Account cannot find a table that was being used in the lower environments, the failure prompts the Data Governance team to act. They can then validate the newly developed table in the lower environments to ensure that it meets the organization’s standards for data reliability and accuracy before it is deployed to production.
Role-based access prevents uncontrollable business logic from forming. By restricting users to Roles that only have access to their business unit’s Data Mart and the Centralized Data Warehouse, we prevent them from building additional transformations on top of another business unit’s non-governed Data Mart. If one business unit requires transformations performed by another business unit, then the business logic and transformations must be vetted by the governance team and added to the Centralized Data Warehouse, where all business units have read access. This also prevents downstream silos of data from being created. By restricting a business unit to its own Data Mart, we eliminate compounding business logic and the formation of data silos that are ultra-specific to one analytical set. As a bonus, it helps keep some of the Spaghetti Code discussed in Part 1 of this series from arising between two or more unregulated sets of data elements.
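The environment isolation described above can be illustrated with a small sketch in which a job resolves table names only within its own environment. The catalog contents, table names, and the `TableNotDeployed` exception are hypothetical; in practice the failure would surface as a permission or missing-object error from the database itself.

```python
class TableNotDeployed(LookupError):
    """Raised when a job references a table that is absent from its own environment."""


# Hypothetical catalog: revenue_v2 has been built in dev and tst but not yet
# vetted by the Data Governance team, so it does not exist in prd.
CATALOG = {
    "dev": {"fin_mart.revenue", "fin_mart.revenue_v2"},
    "tst": {"fin_mart.revenue", "fin_mart.revenue_v2"},
    "prd": {"fin_mart.revenue"},
}


def resolve_table(account_env, table, catalog=CATALOG):
    """Resolve a table for a Service Account, looking only in its own environment."""
    if table in catalog[account_env]:
        return f"{account_env}.{table}"
    # The job never falls back to another environment. The error only names
    # where the table exists so the governance team knows what to review.
    found_in = sorted(
        env for env, tables in catalog.items()
        if env != account_env and table in tables
    )
    raise TableNotDeployed(
        f"{table} is not deployed to {account_env}; present in: {found_in}"
    )
```

A production job asking for `fin_mart.revenue_v2` raises `TableNotDeployed` instead of silently reading test data, and that failure is the signal for the Data Governance team to validate the table before it is promoted.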
With these practices in place, you allow your business units to continuously develop and explore your organization’s data without compromising the quality and accuracy of the company’s Centralized Data Warehouse. A business unit can build, scale, and report on its own, without hindrance, within its own Data Mart. However, as soon as a report or data set is needed outside of a business unit, the Data Governance team is there to ensure those business-unit-specific transformations tell a story that matches the rest of the organization. These benefits start to shed some light on the themes of a successful Self-Service Layer, which a future article in this series will analyze in more detail.
If you enjoyed the contents of this article, or want to talk with Icon Analytics leadership about how to improve your organization’s Data Governance, you may get in touch with us at www.iconanalytics.io/get-started. There is still much more to cover when it comes to optimizing a modern data architecture so be on the lookout for Part 3 of the series - Centralizing your data and self-service enablement.