Authored By: Kenneth Chisholm
As we have all seen in the last few months, there have been tremendous breakthroughs and awe-inspiring demos in the field of Generative AI. ChatGPT, Google Gemini, Perplexity and GitHub Copilot have demonstrated Generative AI solutions for answering questions, content creation, holding conversations, sentiment analysis, text classification, entity recognition, code generation and more. The results, while not always perfect, are good enough for enterprises to start building solutions and get a head start in the AI race. The question is: how can Data Governance connect and align with Generative AI? If we look past the hype and properly account for Generative AI’s strengths and current weaknesses, we can see there are strategic points of alignment.
Point 1 – Data Consumption: Generative AI models are fed information in two phases. The first phase is unsupervised learning on a vast amount of textual information during the creation of the foundation model. This requires an intense amount of computing resources (weeks or months of training), so most enterprises are not expected to do it themselves. It is more likely that enterprises will procure open-source models (such as Llama, Falcon or Mixtral) that fit their business requirements and then carry out the second phase, supervised learning (also referred to as fine tuning a model). Fine tuning a model requires fewer resources and will be performed using proprietary enterprise data. The fine tuning can take place on private clouds (H2O.ai, Amazon, Azure, etc.) to maintain privacy and security. This fine tuning is typically handled by Machine Learning Ops (ML Ops) and LLM Ops groups (we will look further at these in the next point).
Fine tuning requires extremely high-quality data to be successful and to reduce the chances of hallucination when users are conversing with your enterprise’s GPT. As professionals in Data Governance and Data Quality know, this is easier said than done. We have all heard the analogy that data is the new oil; now the oil needs to be refined and used in an engine. A very well-maintained data catalog is essential to locating the right high-quality data. Data quality has to be measured along many different dimensions, including whether the data is correct, accurate, fresh and free of bias; whether it is legally usable in a given jurisdiction; whether it contains PII or SPI; whether it is owned or rented; what privacy constraints apply; and more. This type of metadata has to be housed, tracked and maintained properly in a Data Catalog. Furthermore, the data used for fine tuning models should be tracked as part of a Model Governance initiative, which again should be done in a well-maintained Data Catalog.
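To make the dimensions above more concrete, here is a minimal sketch in Python of how a catalog record might carry them and how a simple eligibility check could use them. Everything here is an assumption for illustration: the class name CatalogEntry, its field names, the thresholds and the sample datasets are hypothetical and do not come from any particular Data Catalog product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; field names are illustrative only."""
    name: str
    accuracy_score: float      # 0.0 to 1.0, produced by a data quality check
    last_refreshed: date       # used to judge freshness
    bias_reviewed: bool        # True once a bias review has been signed off
    allowed_jurisdictions: set = field(default_factory=set)
    contains_pii: bool = False
    owned: bool = True         # False means the data is licensed or rented


def eligible_for_fine_tuning(entry, jurisdiction, today,
                             min_accuracy=0.95, max_age_days=90):
    """Return True only if the entry passes every tracked dimension."""
    fresh = (today - entry.last_refreshed).days <= max_age_days
    return (
        entry.accuracy_score >= min_accuracy
        and fresh
        and entry.bias_reviewed
        and jurisdiction in entry.allowed_jurisdictions
        and not entry.contains_pii
        and entry.owned
    )


# Made-up sample data, for illustration only.
today = date(2024, 6, 1)
catalog = [
    CatalogEntry("support_tickets", 0.98, date(2024, 5, 20), True, {"CA", "US"}),
    CatalogEntry("legacy_crm_export", 0.80, date(2022, 1, 15), False, {"CA"},
                 contains_pii=True),
]
usable = [e.name for e in catalog if eligible_for_fine_tuning(e, "CA", today)]
```

In a real enterprise the thresholds and the rule on owned versus rented data would be set by policy, not hard-coded; the point is only that the check is straightforward once the metadata exists in the catalog.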
Point 2 – Process Integration: Aligning Data Governance with Machine Learning Ops (ML Ops) and LLM Ops. ML Ops groups are typically the professionals in an enterprise who handle fine tuning and model deployment. Fine tuning is an iterative process that requires a constant flow of new high-quality data, so ML Ops depends heavily on data. Data Governance can align with ML Ops to meet those data needs. Additionally, ML Ops and Data Governance can work together to find and choose the right data for fine tuning. With LLMs, feeding in too much data can confuse the model and lead to more hallucination. Typically, you want multiple leaner models that each handle a specific business use case, instead of one giant model that handles everything. As such, you want to be very particular about the data used in fine tuning, and all these details are best maintained and tracked in a well-run Data Catalog.
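One way to picture this tracking is a small lineage log that records which datasets went into each fine-tuned model and which business use case the model serves. The sketch below is in Python and is purely illustrative: FineTuningRun, LineageLog and the model and dataset names are hypothetical, and a real Data Catalog would store this alongside its other metadata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuningRun:
    """Hypothetical record of one fine-tuning run."""
    model_name: str
    use_case: str
    dataset_names: tuple


class LineageLog:
    """Keeps track of which datasets were used to fine tune which models."""

    def __init__(self):
        self._runs = []
        self._by_dataset = defaultdict(set)

    def record(self, run):
        """Store the run and index it by every dataset it consumed."""
        self._runs.append(run)
        for dataset_name in run.dataset_names:
            self._by_dataset[dataset_name].add(run.model_name)

    def models_using(self, dataset_name):
        """Return the sorted names of models fine tuned on this dataset."""
        return sorted(self._by_dataset.get(dataset_name, set()))


# Made-up sample runs: two lean models, each tied to one use case.
log = LineageLog()
log.record(FineTuningRun("support-assistant-v2", "customer support",
                         ("support_tickets", "product_faq")))
log.record(FineTuningRun("contract-summarizer-v1", "legal summaries",
                         ("contract_archive", "product_faq")))
affected = log.models_using("product_faq")
```

If a dataset is later found to be stale or unusable, a lookup like models_using tells ML Ops which models need to be revisited.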
Point 3 – Compliance and Regulations: Both privacy and AI regulations already exist in many jurisdictions and are getting more complex as complicated legal situations begin to unfold. Data Governance can help fulfill compliance obligations by working closely with legal and privacy professionals and meticulously tracking compliance dimensions in a Data Catalog. A well-maintained Data Catalog will save many enterprises from legal headaches down the road. An enterprise must establish policies, standards, frameworks and processes as soon as possible, before major AI initiatives take off. Unfortunately, with the haste and enthusiasm around AI, it is likely that a lot of data will be used incorrectly or, even worse, illegally. Legal trouble is likely to follow if data is used clumsily and just dumped into a fine-tuning process. Data Governance is the natural intersection point between legal/compliance and ML Ops, and should take advantage of this position to help fulfill compliance requirements.
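As a rough illustration of tracking compliance dimensions, the Python sketch below checks a dataset against a per-jurisdiction policy and returns the reasons it cannot be used, which gives legal and ML Ops something concrete to review. The policy table, the rule names and the dataset fields are all hypothetical; actual rules have to come from legal and privacy professionals.

```python
# Hypothetical per-jurisdiction policy table, for illustration only.
POLICIES = {
    "EU": {"pii_allowed": False, "legal_review_required": True},
    "US": {"pii_allowed": False, "legal_review_required": False},
}


def compliance_violations(dataset, jurisdiction, policies=POLICIES):
    """Return a list of human-readable reasons the dataset cannot be used."""
    policy = policies.get(jurisdiction)
    if policy is None:
        # No policy means no approval: block by default.
        return [f"no policy defined for jurisdiction {jurisdiction!r}"]
    problems = []
    if dataset.get("contains_pii") and not policy["pii_allowed"]:
        problems.append("contains PII, which this policy does not allow")
    if policy["legal_review_required"] and not dataset.get("legal_review_done"):
        problems.append("legal review has not been completed")
    return problems


# Made-up catalog metadata for one dataset.
tickets = {"name": "support_tickets", "contains_pii": True,
           "legal_review_done": False}
eu_problems = compliance_violations(tickets, "EU")
```

An empty list means the dataset passed every rule in the table. Blocking datasets from jurisdictions with no defined policy is a deliberate choice in this sketch, so that nothing gets dumped into a fine-tuning process before someone has looked at it.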
In conclusion, these are three points where Data Governance can align with Generative AI initiatives to produce a successful outcome. There are more points than these. If you have any ideas, please comment below; we would love to hear your thoughts, especially if you have real-world feedback.