The Complete Guide

What is Retrieval Augmented Generation?

Updated May 5, 2024

Get RAG Demo
Retrieval-Augmented Generation (RAG) and AI

Retrieval-augmented generation (RAG) is a generative AI framework for improving the accuracy and reliability of large language models (LLMs), using relevant data from company sources.

Auto-generating reliable responses to user queries – based on an organization’s private information and data – remains an elusive goal for enterprises looking to generate value from their generative AI apps. Sure, technologies like machine translation and abstractive summarization can break down language barriers and lead to some satisfying interactions, but, overall, generating an accurate and reliable response is still a significant challenge. 

 

Get RAG Demo

01

Providing reliable responses is not so easy

Traditional text generation models, typically based on encoder/decoder architectures, can translate languages, respond in different styles, and answer simple questions. But because these models rely on the statistical patterns found in their training data, they sometimes provide incorrect or irrelevant information, called hallucinations.  

Generative AI leverages Large Language Models (LLMs) that are trained on massive amounts of publicly available (Internet) information. The few LLM vendors (such as Microsoft, Google, AWS, and Meta) can't frequently retrain their models due to the high cost and time involved. Since LLMs ingest public data as of a certain time and date, they’re never current – and have no access to the highly valuable, private data stored in an organization.

Retrieval-Augmented Generation (RAG) is an emerging generative AI technology that addresses these limitations.

A complete RAG implementation – one that retrieves structured data from enterprise systems, as well as unstructured data from knowledge bases – can transform real-time, multi-source business data into intelligent, context-aware, and compliant prompts to reduce GenAI hallucinations and elevate the effectiveness and trust of generative AI apps.

 

02

What is retrieval-augmented generation?

Retrieval-Augmented Generation (RAG) is a Generative AI (GenAI) architecture that augments a Large Language Model (LLM) with fresh, trusted data retrieved from authoritative internal knowledge bases and enterprise systems, to generate more informed and reliable responses.

The acronym “RAG” is attributed to the 2020 publication, “Retrieval-Augmented Generation for Knowledge-Intensive Tasks”, submitted by Facebook AI Research (now Meta AI). The paper describes RAG as “a general-purpose fine-tuning recipe” because it’s meant to connect any LLM with any internal data source.

As its name suggests, retrieval-augmented generation inserts a data retrieval component into the response generation process to enhance the relevance and reliability of the answers.

The retrieval model accesses, selects, and prioritizes the most relevant information and data from the appropriate sources based on the user’s query, transforms it into an enriched, contextual prompt, and invokes the LLM via its API. The LLM responds with an accurate and coherent response to the user.

A helpful RAG analogy is a stock trader using the combination of publicly available historical financial information and live market feeds.

Stock traders in a firm make investment decisions based on analyzing financial markets, industry, and company performance. However, to generate the most profitable trades, they access live market feeds and internal stock recommendations provided by their firm.

In this example, the publicly available financial data and analysis is the LLM, and the live data feeds and firm's internal stock recommendations represent the internal sources accessed by the RAG.

RAG architecture

The RAG architecture diagram below illustrates the retrieval-augmented generation framework and data flow, from the user prompt to response:

Retrieval-Augmented Generation architecture diagram enlarge
  1. The user enters a prompt, which triggers the data retrieval model to access the company's internal sources, as required.

  2. The retrieval model queries the company’s internal sources for structured data from enterprise systems and for unstructured data like docs from knowledge bases.

  3. The retrieval model crafts an enriched prompt – which augments the user’s original prompt with additional contextual information – and passes it on as input to the generation model (LLM API).

  4. The LLM uses the augmented prompt to generate a more accurate and relevant response, which is then provided to the user.

The end-to-end round trip – from user prompt to response –
should take 1-2 seconds to support conversational interfaces.

03

Retrieval-augmented generation challenges

An organization usually stores its documentation of internal processes, procedures, and operations in offline and online docs – with its business data fragmented across dozens of enterprise systems, like billing, CRM, and ERP.

Active retrieval-augmented generation integrates fresh, trusted data retrieved from a company’s internal sources – docs stored in document databases and data stored in enterprise systems – directly into the generation process.

So, instead of relying solely on its public, static, and dated knowledge base, the Large Language Model (LLM) actively ingests relevant data from a company’s own sources to generate better-informed and relevant outputs.

The RAG model essentially “grounds” the LLM with an organization’s most current information and data, resulting in more accurate, reliable, and relevant responses.

However, RAG also has its challenges:

  • RAG relies on enterprise-wide data retrieval. Not only must you maintain up-to-date and accurate information and data, you must also have accurate metadata in place that describes your data for the RAG framework.
  • The information and data stored in internal knowledge bases and enterprise systems must be RAG-ready: accessible, searchable, and of high quality. For example, if you have 100 million customer records – can you access and integrate all of David Smith’s details in less than a second?
  • To generate smart contextual prompts, you’ll need sophisticated prompt engineering capabilities, including chain-of-thought prompting, to inject the relevant data into the RAG LLM in a way that generates the most accurate responses.
  • To ensure your data remains private and secure, you must limit your LLM’s access to authorized data only – per the example above, that means only David Smith’s data and nobody else’s.

Despite these challenges, RAG still represents a great leap forward in generative AI. Its ability to leverage up-to-date internal data addresses the limitations of traditional generative models by improving the user experience with more personalized and reliable exchanges of information.

RAG AI is already delivering value in several domains, including customer service, IT service management, sales and marketing, and legal and compliance.

04

Retrieval-augmented generation use cases

RAG AI use cases span multiple enterprise domains. Here are a few examples: 

Department

Objective

Types of RAG data

Customer service

Personalize the chatbot response to the customer’s precise needs, behaviors, status and preferences, to respond more effectively.

  • Order/payment history

  • Customer status

  • Contract status

  • Network alerts

  • Next or related questions

  • Workflow for call routing

  • Agent script

  • Call history and first contact resolution

  • Customer feedback history

Sales and marketing

Engage with customers via a chatbot or a sales consultant to describe products and offer personalized recommendations.

  • Product documentation

  • Product specs

  • Customer profile and demographics

  • Customer preferences and purchase history

  • Targeted personas and behaviors

  • Campaigns

Compliance

Respond to Data Subject Access Requests (DSARs) from customers.

  • Industry standards and state regulations

  • Personal customer data stored in enterprise systems

  • Internal approval procedures

Risk

Identify fraudulent customer activity.

  • Fraudulent activities and related data, previously detected by the company

  • Real-time customer transaction and activity data 

05

Benefits of retrieval-augmented generation

For generative AI, your data is your differentiator. By deploying RAG, your organization can benefit from:

1. Quicker time to value, at lower cost

Training an LLM takes a long time and is very costly. By offering a more rapid and affordable way to introduce new data to the LLM, RAG makes generative AI accessible and reliable for customer-facing and back-office operations.

2. Personalization of user interactions

By integrating specific customer 360 data with the extensive general knowledge of the LLM, RAG personalizes user interactions via chatbots and customer service agents, with next-best-action and cross-sell and up-sell recommendations tailored for the customer in real time. Retrieval-augmented generation enables AI personalization at scale.

3. Improved user trust

RAG LLMs deliver reliable information through a combination of data accuracy, freshness, and relevance – personalized for a specific user. User trust protects and even elevates the reputation of your brand.

4. Elevate user experience and reduce customer care costs

Chatbots based on customer data elevate the user experience and reduce customer care costs, by increasing first contact resolution rates and decreasing the overall number of services calls.

 

06

Recommendations for implementing RAG

In its 12/2023 report, “Emerging Tech Impact Radar: Conversational Artificial Intelligence”, Gartner estimates that widespread enterprise adoption of RAG will take a few years because of the complexities involved in:

  • Applying generative AI (GenAI) to self-service customer support mechanisms, like chatbots

  • Keeping sensitive data hidden from people who aren’t authorized to see it
  • Combining insight engines with knowledge bases, to run the search retrieval function
  • Indexing, embedding, pre-processing, and/or graphing enterprise data and documents

  • Building and integrating retrieval pipelines into applications.

All of the above are challenging for enterprises due to skill set gaps, data sprawl, ownership issues, and technical limitations.

Also, as vendors start offering tools and workflows for data onboarding, knowledge base activation, and components for RAG application design (including conversational AI chatbots), enterprises will more actively support the grounding of generative AI apps for content consumption.

In the Gartner RAG report of 1/2024, “Quick Answer: How to Supplement Large Language Models with Internal Data”, Gartner generative AI analysts advise enterprises preparing for RAG to:

1. Select a pilot use case, in which business value can be clearly measured.

2. Classify your use case data, as structured, semi-structured, or unstructured, to decide on the best ways of handling the data and mitigating risk.

3. Get all the metadata you can, because it provides the context for your RAG deployment and the basis for selecting your enabling technologies.

In its 1/2024 article, “Architects: Jump into Generative AI”, Forrester claims that for RAG-focused architectures, gates, pipelines, and service layers work best. So, if you’re thinking about implementing generative AI apps, make sure they:

  • Feature intent and governance gates on both ends
  • Flow through pipelines that can engineer and govern prompts
  • Are grounded through RAG

07

The data retrieval process

The RAG lifecycle – from data sourcing to the final output – is based on a Large Language Model or LLM.

An LLM is a foundational Machine Learning (ML) model that employs deep learning algorithms to process and generate natural language. It’s trained on massive amounts of text data to learn complex language patterns and relationships, and perform related tasks, such as text generation, summarization, translation, and, of course, the answering of questions.

These models are pre-trained on large and diverse datasets to learn the intricacies of language, and can be fine-tuned for specific applications or tasks. The term “large” is even a bit of an understatement because these models can contain billions of data points. For example, ChatGPT4 is reported to have over a trillion.

RAGs execute retrieval and generative models in 5 key steps:

1. Sourcing

The starting point of any RAG system is data sourcing, from internal text documents and enterprise systems. The source data is queried (if structured) and searched (if unstructured) by the retrieval model to identify and collect relevant information. To ensure accurate, diverse, and trusted data sourcing, you must maintain accurate and up-to-date metadata, as well as manage and minimize data redundancy.

2. Preparing enterprise data for retrieval

Enterprises must organize their multi-source data and metadata so that RAG can access it in real time. For example, your customer data, which includes master data (the customer's unique identifying attributes), transactional data (e.g., service requests, purchases, payments, and invoices), and interaction data (emails, chats, phone call transcripts), must be integrated, unified, and organized for real-time retrieval.

Depending on the use case, you may need to arrange your data by other business entities, such as employees, products, suppliers, or anything else that’s relevant for your use case.

3. Preparing documents for retrieval

Unstructured data, such as textual documents, must be divided into smaller, manageable pieces of related information, a process called "chunking". Effective chunking can improve retrieval performance and accuracy. For example, a document may be a chunk on its own, but it could also be chunked down further into sections, paragraphs, sentences, or even words. Chunking also makes the retriever less costly, by only including the relevant part of a document in the LLM prompt, instead of the whole document.

Next, the textual chunks must be transformed into vectors so they can be stored in a vector database, to enable efficient semantic search. This process is called “embedding” and is achieved by applying an LLM-based embedding model. The embeddings are linked back to the source, enabling the creation of more accurate and meaningful responses.

4. Protecting the data

The RAG framework must employ role-based access controls to prevent users from gaining answers to questions beyond their assigned role. For example, a customer service rep serving a specific customer segment might be prevented from accessing data about customers not in their segment; similarly, a marketing analyst might not be given access to certain confidential financial data.

Further, any sensitive data retrieved by RAG, such as Personal Identifiable Information (PII), must never be accessible to unauthorized users – such as credit card information by salespeople, or Social Security Numbers by service agents. The RAG solution must employ dynamic data masking to protect the data in compliance with data privacy regulations.  

5. Engineering the prompt from enterprise data

After enterprise application data is retrieved, the retrieval-augmented generation process generates an enriched prompt by building a “story” out of the retrieved 360-degree data. There needs to be an ongoing tuning process for prompt engineering, ideally aided by Machine Learning (ML) models, and leveraging advanced chain-of-thought prompting techniques.

For example, we can ask the LLM to reflect on the data it already has and let us know what additional data is required, whether the user finished the conversation, etc. 

COMPLIMENTARY DOWNLOAD

Gartner RAG tips: FREE in this condensed version.

Get Gartner Tips
Gartner on Retrieval Augmented Generation (RAG)

09

Real-time application data retrieval

The need for RAG to access enterprise application data is especially critical in customer-facing use cases, where personalized, accurate responses are required in real time.

Sample use cases include a customer care agent answering a service call, or a conversational AI bot providing self-service technical support to authenticated users on a company portal. The data needed for these scenarios is typically found in enterprise systems, like CRM, billing, and ticketing systems.

A real-time 360° view of the relevant business entity is required, be they customers, employees, suppliers, products, loans, etc. To achieve this, you need to organize your enterprise data by whichever business entities are relevant for your use case.

But the process of locating, accessing, integrating and unifying enterprise data in real time – and then using it to ground your LLM – is extremely complex, because:

  1. Enterprise data is typically fragmented across many different applications, with different databases and structures, and formats.

  2. Querying and unifying the relevant data for 1 entity (e.g., a single customer), in real time, typically requires high-complexity and high-cost data management.

  3. Supporting high scale while maintaining low latency adds further complexity.

  4. The retrieved data has to be fresh, every time.

  5. To ensure privacy and security, your RAG LLM must only be able to access the data the user is authorized to see.

10

RAG chatbot: A natural starting point

When customers need a quick answer to a question, a RAG chatbot can be extremely helpful. The problem is that most bots are trained on a limited number of intents (or question/answer combinations) and they lack context, in the sense that they give the same answer to different users. Therefore, their responses are often ineffective – making their usefulness questionable.

A RAG tool can make conventional bots a lot smarter by empowering the LLM to provide answers to questions that aren’t on the intent list – and that are contextual to the user.

For example, an airline chatbot responding to a platinum frequent flyer asking about an upgrade on a particular flight based on accumulated miles won’t be very helpful if it winds up answering, “Please contact frequent flyer support." After all, the whole point of generative AI is to avoid a human in the loop.

But once RAG augments the airline’s LLM with that particular user’s dataset, a much more contextual response could be generated, such as, “Francis, you have 929,100 miles at your disposal. For Flight EG17 from New York to Madrid, departing at 7 pm on November 5, 2024, you could upgrade to business class for 30,000 miles, or to first class for 90,000 miles. How would you like to proceed?”

What a difference!

The Q&A nature of chat interactions make the RAG chatbot an ideal pilot use case because understanding the context of a question – and of the user – leads to a more accurate, relevant, and satisfying response.

In fact, chatbots that are used by service and sales agents are the natural entry point for RAG-based generative AI apps.

11

Current state of retrieval-augmented generation

Many companies are piloting RAG chatbots on internal users like customer service agents because they’re hesitant to use them in production, primarily due to issues surrounding hallucinations, privacy, and security. As they become more reliable, this trend will change.

In fact, new RAG generative AI use cases are emerging all over the place, for example:

A stock investor wishing to see the commissions she was charged over the last quarter

A hospital patient wanting to compare the drugs he received to the payments he made

A telco subscriber choosing a new Internet-TV-phone plan since his is about to terminate

A car owner requesting a 3-year insurance claim history to reduce her annual premium


Today, RAG is mainly used to provide accurate, contextual, and timely answers to questions – via chatbots, email, texting, and other RAG conversational AI applications. In the future, RAG AI might be used to suggest appropriate actions to contextual information and user prompts.

For example, if today RAG GenAI is used to inform army veterans about reimbursement policies for higher education, in the future it might list nearby colleges, and even recommend programs based on the applicant’s previous experience and military training. It may even be able to generate the reimbursement request itself.

12

Grounding GenAI apps with enterprise data

When all’s said and done, it’s difficult to deliver strategic value from generative AI today because your LLMs lack business context, cost too much to retrain, and hallucinate too frequently.

K2view GenAI Data Fusion is a patented solution that provides a complete retrieval-augmented generation solution, grounding your generative AI apps with real-time, multi-source enterprise data.

With GenAI Data Fusion, your data is:

  1. Ready for generative AI
  2. Deliverable in real-time and at massive scale
  3. Complete and up-to-date
  4. Governed, secured, and trusted
RAG architecture diagram for enterprise application data enlarge

K2view GenAI Data Fusion grounds LLMs with enterprise data

The diagram illustrates how K2view Data Product Platform and Micro-Database™ technology enable GenAI Data Fusion to access structured data:

  1. The user enters a prompt, which triggers GenAI Data Fusion to access the company's internal data sources.

  2. Micro-Database technology kicks in to perform queries on the relevant business entity only (e.g., a specific customer).  

  3. The user's original prompt is augmented with the new data, metadata, and context to create smart contextual prompts that are injected into the LLM.  

  4. The RAG LLM uses the augmented prompt to generate a more accurate and relevant response, which is then provided to the user.

The Micro-Database: Fueling RAG AI with enterprise data

K2view GenAI Data Fusion is built on a Micro-Database foundation that organizes your fragmented, multi-source enterprise application data into 360° views of your business entities (customers, products, loans, work orders, etc.), making them accessible to your RAG AI apps in real time.

Each 360° view is stored in its own Micro-Database, secured by a unique encryption key, compressed by up to 90%, and optimized for low footprint and low-latency RAG access.

K2view manages the data for each business entity in its own, high-performance, always-fresh Micro-Database – ever-ready for real-time access by RAG workflows.

Designed to continually sync with underlying sources, the Micro-Database can be modeled to capture any business entity – and comes with built-in data access controls and dynamic data masking capabilities.  

Micro-Database technology features:

  • Real-time data access, for any entity

  • 360° views of any business entity

  • Fresh data all the time, in sync with underlying systems

  • Dynamic data masking, to protect sensitive AI data

  • Role-based access controls, to safeguard data privacy

  • Reduced TCO, via a small footprint and the use of commodity hardware

An entity’s data can be queried by, and injected into, the RAG LLM as a contextual prompt in milliseconds. It essentially makes multi-source enterprise data “AI-ready”.

GenAI Data Fusion can:

  1. Feed real-time data about a particular customer or any other business entity.

  2. Dynamically mask PII (Personally Identifiable Information) or other sensitive data.

  3. Be reused for handling data service access requests, or for suggesting cross-sell recommendations.

  4. Access enterprise systems via API, CDC, messaging, streaming – in any combination – to unify data from multiple source systems.

It powers your RAG AI apps to respond with accurate recommendations, information, and content – for many use cases, such as:

Accelerating
issue resolution

Creating
hyper-personalized marketing campaigns

Generating
personalized cross-/up-sell recommendations for call center agents

Detecting
fraud by identifying suspicious activity
in a user account

 

Learn more about the K2view GenAI Data Fusion RAG tool.

Retrieval-Augmented Generation FAQs

1. What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a Generative AI (GenAI) framework that augments a Large Language Model (LLM) with fresh, trusted data retrieved from authoritative internal knowledge bases and enterprise systems. RAG generates more informed and reliable responses to LLM prompts, minimizing hallucinations and increasing user trust in GenAI apps.

2. What’s the relationship between Generative AI, LLMs, and RAG?

Generative AI is any type of artificial intelligence that creates new content. An excellent example of GenAI are LLMs, which can generate text, answer questions, translate languages, and even write creative content. RAG is a technique that enhances LLM capabilities by augmenting its external, publicly-available knowledge with fresh, trusted internal data and information.

3. What types of information and data does RAG make use of?

  • Unstructured data found in knowledge bases such as articles, books, conversations, documents, and web pages
  • Structured data found in databases, knowledge graphs, and ontologies
  • Semi-structured data, stored in spreadsheets, XML and JSON files

4. Can RAG provide sources for the information it retrieves?

Yes. If the knowledge bases accessed by RAG have references to the retrieved information (in the form of metadata), sources can be cited. Citing sources allows errors to be identified and easily corrected, so that future questions won’t be answered with incorrect information.

5. How does RAG differ from traditional generative models?

Compared to traditional generative models that base their responses on input context only, retrieval-augmented generation retrieves relevant information from private company sources before generating an output. This process leads to more accurate and contextually rich responses, and, therefore, more positive and satisfying user experiences.

6. What is RAG's main component?

As its name suggests, RAG inserts a data retrieval component into the generation process, aimed at enhancing the relevance and reliability of the generated responses.

7. How does the retrieval model work?

The retrieval model collects and prioritizes the most relevant information and data from underlying enterprise systems and knowledge bases, based on the user query, transforms it into an enriched, contextual prompt, and invokes the LLM via its API. The LLM responds with an accurate and coherent response to the user.

8. What are the advantages of using RAG?

  • Quicker time to value at lower cost, versus LLM retraining or fine-tuning

  • Personalization of user interactions

  • Improved user trust because of reduced hallucinations

9. What challenges are associated with RAG?

  • Accessing all the information and data stored in internal knowledge bases and enterprise systems in real time

  • Generating the most effective and accurate prompts for the RAG framework

  • Keeping sensitive data hidden from people who aren’t authorized to see it

  • Building and integrating retrieval pipelines into applications

10. When is RAG most helpful?

Retrieval-augmented generation has various applications such as conversational agents, customer support, content creation, and question answering systems. It proves particularly useful in scenarios where access to internal information and data enhances the accuracy and relevance of the generated responses.