Data Management Best Practices
Data is the new gold, and every organization is trying to make the most of this gold rush. While many recognize the potential of their data, most struggle to establish best practices and extract the appropriate value from it. The typical data management pain points organizations deal with include:
- Inability to extract consumable transactional data from legacy source systems
- Fragmented read-only data stores that don't serve all consumers of the transactional data
- A partially implemented on-prem data lake that feels more like a data swamp than a data lake
- Partial migration of data to the cloud, preventing the organization from capitalizing on modern cloud data management capabilities
- Inability to extract useful insights from raw data that could be used to treat data as a product
To overcome these pain points, here is a framework that can serve as a playbook for establishing and enforcing data management best practices:
- Data as a Product: Many organizations are slowly realizing the importance of this concept. It needs to be deliberate and visible: assign a product owner per data domain who is responsible for identifying, establishing, governing, maintaining, and enhancing that domain's data. Unless you are starting a new business, you are almost certainly establishing your data product strategy reactively. It is crucial not to get stuck trying to centralize the people, process, and technology aspects of this work. Instead, establish a Center of Excellence or Community of Practice that keeps the focus on standing up the product strategy quickly and generating results and benefits from it. The most attractive part of this concept is the potential to generate revenue from your data if it is curated and marketed appropriately.
- Transactional vs. Analytics Data: Many organizations fail to separate their data management strategies for transactional and analytics data. The two need entirely different approaches to organization, governance, and support, including different skill sets and technologies. While there can be synergies between them, they should be explicitly separated and managed accordingly.
- Data in the Cloud: It is strongly recommended to keep all your data in the cloud, public or private. The extraordinary speed to market gained by operating in the cloud instead of on-prem is well established. A fundamental way to enable this is to have on-prem source systems stream data to the cloud in real time using change data capture (CDC) tools such as Qlik Replicate or Debezium. This remains necessary until you operate 100% in the cloud, including the source systems that support your business.
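To make the CDC idea concrete, here is a minimal sketch of consuming a Debezium-style change event and applying it to a downstream replica. The envelope fields (`payload`, `before`, `after`, `op`, `ts_ms`) follow Debezium's change-event format, but the sample record, table, and `apply_change` helper are illustrative assumptions, not a real connector:

```python
import json

# A simplified Debezium-style change event (illustrative sample data; real
# events also carry a "schema" section and richer "source" metadata).
event_json = """
{
  "payload": {
    "before": null,
    "after": {"id": 42, "status": "SHIPPED"},
    "source": {"table": "orders"},
    "op": "c",
    "ts_ms": 1700000000000
  }
}
"""

def apply_change(state, event):
    """Apply one change event to an in-memory replica keyed by primary key."""
    payload = event["payload"]
    op = payload["op"]  # c=create, u=update, d=delete, r=snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        state[row["id"]] = row
    elif op == "d":
        state.pop(payload["before"]["id"], None)
    return state

replica = apply_change({}, json.loads(event_json))
print(replica)  # {42: {'id': 42, 'status': 'SHIPPED'}}
```

In production the events would arrive via a streaming platform such as Kafka rather than a string literal, but the apply logic is the same.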
- Robust Transactional Data Platform: Create and enable a robust transactional data platform, ideally in the cloud, that separates raw data, integrated data, and for-purpose data. These are important concepts in dealing with transactional data. Here is a simple definition of each:
- Raw Data – Data dumped by source systems into a common landing area. It is not yet ready to be consumed by other applications, but it creates a separation of concerns between source systems and consumers, so that data consumption and queries do not impact source system performance.
- Integrated Data – Data produced by applying the commonly needed transformations and business rules required to properly consume the transactional data. This relieves consuming applications of the burden of being data domain subject matter experts (SMEs).
- For-Purpose Data – Data with consumer-specific rules applied on top of the integrated data. This gives consuming applications the flexibility to transform data as appropriate without depending on others, and fosters a citizen data management culture that speeds delivery of capabilities to customers.
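The three layers above can be sketched as a small pipeline. The order records, field names, and the 100.0 threshold are hypothetical; the point is only how each layer's responsibility differs:

```python
# Raw: landed as-is from the source system (stringly typed, messy values).
raw = [
    {"order_id": "1001", "amount": "250.00", "cust": " ACME "},
    {"order_id": "1002", "amount": "99.50", "cust": "globex"},
]

def to_integrated(rows):
    """Integrated: apply shared cleansing/business rules once, for everyone."""
    return [
        {"order_id": int(r["order_id"]),
         "amount": float(r["amount"]),
         "customer": r["cust"].strip().upper()}
        for r in rows
    ]

def to_for_purpose(rows, min_amount):
    """For-Purpose: a consumer-specific view, e.g. only large orders."""
    return [r for r in rows if r["amount"] >= min_amount]

integrated = to_integrated(raw)
large_orders = to_for_purpose(integrated, min_amount=100.0)
print(large_orders)  # [{'order_id': 1001, 'amount': 250.0, 'customer': 'ACME'}]
```

Shared rules (typing, trimming, casing) live in the integrated layer exactly once; each consumer's filtering and shaping stays in its own for-purpose view.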
- Robust Analytics Data Platform: Create a comprehensive analytics platform with the following components:
- Data Lakehouse – An enhanced, matured combination of the traditional data warehouse and the data lake, which became very popular in the past decade with the explosion of unstructured data. Popular technologies that enable this concept include Snowflake, Databricks, and Palantir.
- Analytics Modeling Platform – The secret sauce of any organization that treats data as a product and plans to curate and sell the insights generated from it. This is where the appropriate AI/ML algorithms are applied to produce insights used across the organization, informing strategies for all the stakeholders the company serves.
- Reporting – The collection of self-service tools that consumers across the organization use. Ideally these tools enable self-service consumption and interpretation of analytical data without dependency on data specialists.
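A hedged sketch of the kind of aggregate query a self-service reporting tool ultimately issues against the analytics store. SQLite stands in here for a lakehouse SQL engine such as Snowflake or Databricks SQL, and the `sales` table and its columns are invented for illustration:

```python
import sqlite3

# In-memory SQLite as a stand-in for a lakehouse SQL endpoint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EAST", 120.0), ("EAST", 80.0), ("WEST", 200.0)])

# The self-service tool translates a drag-and-drop view into SQL like this.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EAST', 200.0), ('WEST', 200.0)]
```

The value of self-service is that business users get this aggregation through the tool's interface, without filing a ticket with a data specialist.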