Building an Effective Data Architecture:
Part 3 – Centralizing Your Data and Self-Service Enablement
Icon Analytics Co-Founder
A Centralized Repository is a collection of an organization’s data that has been thoroughly examined and deemed accurate for use across the enterprise. In Part 2 of this series, I discussed the importance of Data Governance and how to maintain accurate, high-quality data consistently across your organization. This article focuses on how to build and manage the storage area for that governed, high-quality data and, most importantly, how to access the data held within the Central Repository.
When thinking about data storage in an architecture’s Central Repository, Icon Analytics generally recommends a two-pronged approach:
A Data Lake located at the forefront of the architecture
A Data Warehouse to store and organize governed data
A Data Lake consolidates the raw data from source systems into a single repository. It is the first step in getting an enterprise’s data into the architecture. The Data Lake is not meant to be well organized in terms of data objects or structures; the idea of the Lake is to establish a point in the architecture, immediately after a data acquisition tool, where your raw data lands. The purpose of a Data Lake is to make raw data easily accessible to your Data Engineers in an environment they already have access to. End-users and Business Units, however, will not have access to the Data Lake; at this point in the architecture, the data is too unorganized to support any proper data analysis.
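To make the landing step concrete, here is a minimal sketch of raw data landing in a Lake: the payload is written exactly as received into a partitioned path, with no cleaning at all. The path layout, source name, and field names are illustrative assumptions, not a prescribed standard.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def land_raw(payload: dict, source: str, lake_root: str) -> Path:
    """Write a raw payload into the Data Lake exactly as received.

    No cleaning or restructuring happens here; the Lake is simply a
    landing zone immediately after the data acquisition tool.
    """
    partition = Path(lake_root) / source / date.today().isoformat()
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"batch_{len(list(partition.iterdir()))}.json"
    target.write_text(json.dumps(payload))
    return target

# Example: land one record from a hypothetical CRM source system,
# using a temporary directory as a stand-in for the Lake's storage.
lake_root = tempfile.mkdtemp()
path = land_raw({"customer_id": 42, "status": " active "}, "crm", lake_root)
```

Note that the messy `" active "` value is stored untouched; standardizing it is deliberately deferred to the Warehouse load.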
The Data Warehouse is where transformed and organized data is stored. Since the entire organization’s reporting and analysis will come out of the Data Warehouse, all governed data sets, transformations, and data linkages must live in this location. It is the single source of data truth across the entire organization. The Data Warehouse is the repository that stores all the data feeding any endpoint tool, whether that is a data visualization tool for building reports or a source for data science and machine learning tools. It is also the first point of contact for data analysts and data consumers sourcing data.
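A sketch of the Lake-to-Warehouse load might look like the following, with SQLite standing in for the organization’s warehouse platform and a trivial governed transformation (trimming and normalizing a status field). Table and column names are hypothetical.

```python
import sqlite3

# SQLite as a stand-in warehouse; a real build would target the
# organization's warehouse platform. Names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dw_customer (customer_id INTEGER PRIMARY KEY, status TEXT)"
)

def load_to_warehouse(raw_records: list[dict]) -> None:
    """Apply the governed transformations (here: trim and normalize
    case) before the data becomes part of the single source of truth."""
    cleaned = [
        (r["customer_id"], r["status"].strip().upper())
        for r in raw_records
    ]
    conn.executemany("INSERT INTO dw_customer VALUES (?, ?)", cleaned)
    conn.commit()

# The raw Lake record " active " is standardized to "ACTIVE" on load.
load_to_warehouse([{"customer_id": 42, "status": " active "}])
rows = conn.execute("SELECT * FROM dw_customer").fetchall()
```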
Upon initial build of a new Centralized Repository, the Data Engineers will be busy ingesting available data sets into the Data Lake and Data Warehouse. However, do not fall into the bad habit of expecting Data Engineers to handle the creation and maintenance of Data Governance and Data Quality rules. Data Governors and Data Stewards must remain highly involved in this process to ensure that enterprise standards and data quality are maintained throughout the build. Please visit Part 2 of this series to learn more about the role of Data Governance in a data architecture.
As data becomes available in the Data Warehouse, organizations will begin to onboard critical business units that want to utilize the data. This is where the User Access Roles and Service Accounts discussed in Part 2 of this series start to take shape, and where Self-Service reporting and analytics begin. Data-Marts are subsections of the Warehouse catered specifically to an individual group. These highly specialized sections of the data are presented to specific business units, who can alter data as needed without negatively impacting the Centralized Repository. Once the Data Warehouse is built, an architecture’s admins and engineers establish a Data-Mart for each business unit to use for specific transformations or joins not yet approved for enterprise-wide use.
Regarding access to the data, Business End Users will have read access to all governed objects in the Data Warehouse that are approved for enterprise use. Additionally, they will have read and write access to their own specialized Data-Mart. By giving front-end Business Users the ability to build in their own environments, sourced from the Centralized Repository, we enable several things:
A set point in the data flow, the Data Warehouse, where the organization knows the data is accurate.
A location within the safety and security of the architecture that gives developers and analysts the ability to explore data sets through Business Unit specific Data-Marts.
Data Engineers can focus on architecture improvements and support rather than building business focused ETL flows for reporting. This is because the architecture now enables the Business Units to develop their own ETL flows within their specialized Data-Marts.
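This access pattern can be expressed as grant statements generated from a role configuration: every Business Unit role gets read-only access to the governed warehouse schema, and read/write access only to its own Data-Mart. The unit, role, and schema names below are illustrative, and the exact GRANT syntax will depend on the warehouse platform.

```python
# Hypothetical role configuration: each Business Unit role reads the
# governed warehouse schema and owns read/write on its own Data-Mart.
BUSINESS_UNITS = ["finance", "marketing"]

def grants_for(unit: str) -> list[str]:
    """Generate the access grants for one Business Unit's role."""
    role = f"role_{unit}"
    return [
        # Read-only on the enterprise-approved governed objects.
        f"GRANT SELECT ON SCHEMA warehouse TO {role};",
        # Full read/write, but only inside the unit's own Data-Mart.
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA mart_{unit} TO {role};",
    ]

statements = [s for unit in BUSINESS_UNITS for s in grants_for(unit)]
```

Because write privileges are only ever granted on `mart_*` schemas, a Business Unit can experiment freely without any risk of altering the governed Warehouse itself.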
By using the combination of User Roles and Service Accounts that are discussed in Part 2 of this series, in conjunction with the Data Warehouse and Data-Marts, we eliminate data-silos where each Business Unit may have their own specific transformations and data flows that don’t align with other Business Units within the same organization.
When a Centralized Repository is not in place, Data Engineers are often bogged down with requests for specialized workflows and ETL across all Business Units. Once Business Units become self-sufficient at accessing and analyzing the data stored within the Data Warehouse, the Data Engineers can shift focus from being data ingestion developers to being data architecture admins. This can include implementing more robust auxiliary functions to improve the existing architecture, such as an Audit Balance and Control process to better control data assets, or a CI/CD process to reduce the roadblocks of development and data object integration. Developing and adapting coding standards is another area where Data Engineers can be utilized now that they are free from building ETL flows for business-specific use cases. These standards allow new feeds into the Central Repository to be built quickly and reliably when new data acquisition tools are implemented, help Business Users build complex queries they would normally not be able to create themselves, and reduce the barriers to accessing the data.
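As one example of such an auxiliary function, the balance step of an Audit Balance and Control process can be as simple as reconciling record counts between what a source extract reported and what the Warehouse actually loaded. This is a minimal sketch, assuming counts are already collected elsewhere; the function name and tolerance parameter are illustrative.

```python
def audit_balance(source_count: int, loaded_count: int, tolerance: int = 0) -> dict:
    """Minimal balance check: confirm the Warehouse received the same
    number of records the source extract reported, within a tolerance."""
    diff = abs(source_count - loaded_count)
    return {
        "source_count": source_count,
        "loaded_count": loaded_count,
        "balanced": diff <= tolerance,
    }

# A two-record shortfall fails the check and would trigger investigation
# before the load is released for downstream reporting.
result = audit_balance(source_count=10_000, loaded_count=9_998)
```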
A Central Repository is also great at reducing and eliminating technology redundancy across an organization. By using a single database tool and development language for one large Data Warehouse, the need for individual Business Units to maintain their own data-silos and data tools is eliminated. Factor in the possibility that each Business Unit may be using its own language or ETL tools to process data in a siloed environment, and you can see how quickly your organization can save money on tool licensing and support costs.
It is easy to recognize that the number of hours required to stand up an environment and ingest new data sets is much greater than the number required to maintain an existing environment. Because of this, many of Icon Analytics’ clients ask us to help set up the architecture and initial pipeline builds in conjunction with their engineers, then scale us back once the architecture is running smoothly and can be maintained by their employees alone. We are extremely effective in this regard, and we aim to include client engineers throughout the entire process so that the handoff at project end is as seamless as possible.
If your organization is plagued by data-silos, data access issues, “spaghetti code,” or over-worked Data Engineers, get in contact with Icon Analytics at www.iconanalytics.io/get-started and we can see if a Centralized Repository will help fix these problems. Stay tuned for Part 4 of this series – Enabling Self-Service and Power Users, where I will take a deeper look at the advantages of giving your most knowledgeable Business Users the ability to explore your company’s data on their own.