Positioning Data Catalogs for Success in AI efforts

Article By: Kenneth Chisholm

Data Catalogs are a crucial component of any enterprise’s AI efforts. Current generative AI products such as ChatGPT or Gemini are general in nature: they can read one-off documents (within the limits of the available token budget) and answer general questions about what they have just read. These products run on very large models in the backend, which required enormous resources to train. That is great for general knowledge, but less so for enterprise-specific, proprietary knowledge. One of the value plays is to fine-tune a foundation model to suit the needs of your enterprise and its business use cases. This of course involves a lot of data, which is best curated in a Data Catalog. A Data Catalog with properly stewarded, correct, up-to-date metadata will be worth its weight in gold for a successful AI project. Here are three ways that Data Governance can position a Data Catalog for such an outcome.

#1: Where’s the Data?

The world runs on data; the problem is finding the right data for the right use. We have all heard that Data Scientists and Data Wranglers spend a significant amount of time looking for the right data. That data should of course be properly curated in a Data Catalog. As finding the right data becomes more important, it becomes more important that an enterprise properly curate its data assets. External data assets should be curated in the Data Catalog too. The goal here is to provide a platform for data-driven efforts (especially AI). Ideally, your MLOps team should be very familiar with your enterprise’s Data Catalog.

#2: Context is King

Often, finding the data is not enough: for the data to be used correctly, it must be used in the right context. A mature Data Catalog is the best place to find that context, which can be captured in descriptions, definitions, metadata flags, comments, and so on. Additionally, a well-managed Data Catalog will assign roles to data assets, such as Subject Matter Experts, Data Owners and Data Stewards, who can help users gain a deeper understanding of the context.

#3: Fine-Tuning Models

One of the more interesting things a Data Catalog can provide for AI projects is a source of question-and-answer pairs. To fine-tune a foundation model, question-and-answer pairs can be supplied to it as training examples.

For example:

Question: What is the primary tool that the supply chain management group uses?

Answer: The supply chain management group primarily uses FreightPOP.

Well-managed Data Catalogs are a great resource for generating question-and-answer pairs, because they contain metadata, comments, and relations between business concepts and technical assets (among other relations). All of this can be turned into question-and-answer pairs.
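
As a rough illustration of the idea, here is a minimal Python sketch that turns catalog metadata into question-and-answer pairs and writes them out one JSON object per line, a layout many fine-tuning tools accept. Everything in it is an assumption made for illustration: the CatalogEntry record, its field names, the question templates and the sample values do not come from any particular Data Catalog product, and a real catalog would be read through its own export or API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; the field names are illustrative only."""
    business_term: str
    definition: str
    owner: str
    related_assets: list[str] = field(default_factory=list)


def generate_qa_pairs(entry: CatalogEntry) -> list[dict[str, str]]:
    """Build question-and-answer pairs from one entry using simple templates."""
    pairs = [
        {
            "question": f"What does the term '{entry.business_term}' mean?",
            "answer": entry.definition,
        },
        {
            "question": f"Who owns the term '{entry.business_term}'?",
            "answer": f"'{entry.business_term}' is owned by {entry.owner}.",
        },
    ]
    # One extra pair per relation between the business term and a technical asset.
    for asset in entry.related_assets:
        pairs.append(
            {
                "question": f"Where is '{entry.business_term}' stored?",
                "answer": f"'{entry.business_term}' is stored in {asset}.",
            }
        )
    return pairs


def to_jsonl(pairs: list[dict[str, str]]) -> str:
    """Serialize the pairs as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(pair) for pair in pairs)


if __name__ == "__main__":
    # Sample values are made up for the example.
    entry = CatalogEntry(
        business_term="Freight Rate",
        definition="The price charged to move goods between two locations.",
        owner="the supply chain management group",
        related_assets=["logistics.freight_rates"],
    )
    print(to_jsonl(generate_qa_pairs(entry)))
```

Template-generated pairs like these are only as good as the metadata behind them, which is why the stewardship described above matters; in practice a Data Steward or Subject Matter Expert would likely review the pairs before they are used for training.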

There are of course many other ways a Data Catalog can be positioned for a successful AI effort. What are your thoughts and questions? Stories and experiences from the front lines are especially welcome.
