Case Study: Developing a Q&A System Using a Vector DB and an LLM
Introduction
While Large Language Models (LLMs) show strong performance in natural language processing, they suffer from issues such as hallucination in specific domains and specialized fields. In particular, they cannot give accurate answers about material absent from their training data. For instance, an LLM cannot respond appropriately about proprietary data such as legal precedents stored by a law firm or company communications saved in Slack. LLMs alone are therefore insufficient for meaningfully searching a company's vast accumulated proprietary data.
To solve these issues, using a Vector Database (VectorDB) as long-term memory storage for LLMs is gaining attention. All types of data (documents, images, text, audio) can be converted into vector data, known as vector embeddings, and a vector database is a database designed to store and search these vector embeddings.
Vector Embedding: Representing All Data
Unlike traditional relational databases (RDBs) or NoSQL databases, which are constrained by data formats, vector databases can process all data in a unified format called vector embeddings, enabling efficient data management and search. So, how can we use a vector database as a long-term memory storage for LLMs?
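To make the idea concrete, here is a toy sketch (plain NumPy with random vectors standing in for real embeddings, not Aeca) of what searching vector embeddings means: finding the stored vectors most similar to a query vector.

```python
# Toy vector search: rank stored vectors by cosine similarity to a query.
import numpy as np

store = np.random.rand(1000, 768)  # 1,000 stored embeddings (stand-ins)
query = np.random.rand(768)        # one query embedding

scores = store @ query / (
    np.linalg.norm(store, axis=1) * np.linalg.norm(query)
)
top5 = np.argsort(scores)[-5:][::-1]  # indices of the 5 most similar vectors
print(top5, scores[top5])
```

A production vector database does the same ranking, but over billions of vectors and with approximate indexes instead of a brute-force scan.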
Let's explore a case study where we built a Q&A system using Aeca, targeting Wikipedia. The process can be divided into the following four steps:
- Convert the original data into vector embeddings
- Store the vector embeddings
- Convert the query into a vector embedding and search for similar embeddings
- Deliver the search results as context to the LLM for generating answers
Case Study
Converting Original Data into Vector Embeddings
We downloaded the English and Korean Wikipedia data from the Wikipedia website.
Next, we convert the original document data into vector embeddings. For this, we used the SentenceTransformer model "paraphrase-multilingual-mpnet-base-v2", which produces 768-dimensional vectors. When converting documents into vector embeddings, we must decide on the unit of vectorization: we can vectorize individual documents, document pages, or paragraphs. In this case, we generated vector embeddings at the paragraph level, as sketched below.
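As an illustration of this step, here is a minimal sketch using the sentence-transformers library directly; the sample paragraphs are our own stand-ins, and the actual pipeline additionally tracks a paragraph ID for each text.

```python
# Sketch of paragraph-level embedding with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Stand-in paragraphs; the real pipeline iterates over Wikipedia dumps
# and keeps a paragraph ID alongside each text.
paragraphs = [
    "Seoul is the capital and largest city of South Korea.",
    "The city has been the capital of Korea since 1394.",
]

embeddings = model.encode(paragraphs)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per paragraph
```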
The number of vector embeddings converted is approximately 340 million for English and 17 million for Korean.
| | English Wikipedia | Korean Wikipedia |
| --- | --- | --- |
| Vector Dimension | 768 | 768 |
| Number of Vector Embeddings | Approximately 340 million | Approximately 17 million |
| Size of Vector Embeddings | 890 GB | 170 GB |
Storing Vector Embeddings in Aeca
Now, we need to store the generated vector embeddings in the vector database. Each stored record consists of a paragraph ID, the original paragraph content, and the vector embedding of that paragraph. We imported the Aeca SDK in Python on a local laptop; Aeca's Python SDK provides various libraries that make data storage and search straightforward. First, we open a channel to the Aeca server:
```python
channel = aeca.Channel("localhost", 10080)
```
Next, we call DocumentDB, one of the data models provided by Aeca, to store the vector embeddings.
```python
doc_db = aeca.DocumentDB(channel)
doc_db.insert("Example.wiki", docs)
```
The code used in this process consists of only 25 lines.
```python
import pickle

import aeca


def main():
    with open("data/wiki.pk", "rb") as fp:
        content = pickle.load(fp)

    docs = []
    for sentence_id, line, embed in content:
        if len(line) < 10:
            continue
        doc = {
            "sentence_id": sentence_id,
            "sentence": line,
            "sentence_embed": embed.tolist(),
        }
        docs.append(doc)

    channel = aeca.Channel("localhost", 10080)
    doc_db = aeca.DocumentDB(channel)
    doc_db.insert("Example.wiki", docs)


if __name__ == "__main__":
    main()
```
Converting Queries into Vector Embeddings and Searching for Similar Embeddings
Next, we need to search the Wikipedia data stored in Aeca to find relevant documents. To obtain highly relevant results, we used hybrid search, which combines vector search and full-text search.
To perform vector search, the input question or search query must itself be converted into a vector embedding, using the same Transformer model that was used to embed the stored data. After finding the stored embeddings most similar to the query embedding, we combine them with the full-text search results and deliver the original text of the returned results to the LLM. Approximate nearest neighbor (ANN) algorithms are used to search for similar vector embeddings; in this case, we used the HNSW algorithm, illustrated below.
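Aeca builds and queries its index internally, but to illustrate the algorithm family, the toy sketch below builds an HNSW index over random vectors using the hnswlib library (our stand-in for illustration only, unrelated to Aeca's implementation).

```python
# Toy HNSW index with hnswlib, standing in for the ANN index a
# vector database maintains internally.
import hnswlib
import numpy as np

dim = 768
data = np.random.rand(10_000, dim).astype(np.float32)  # stand-in embeddings

# Build an HNSW index over the vectors using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(data), ef_construction=200, M=16)
index.add_items(data, np.arange(len(data)))

index.set_ef(50)  # higher ef = more accurate (and slower) queries

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```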
The code used in this process is as follows.
```python
import aeca


def main():
    channel = aeca.Channel("localhost", 10080)
    doc_db = aeca.DocumentDB(channel)

    # The query must be encoded with the same model used for the stored data.
    model = aeca.SentenceTransformerEncoder(
        channel, "paraphrase-multilingual-mpnet-base-v2"
    )

    query = input("> ")
    query_embed = model.encode(query)

    dtypes = {"sentence_embed": "json", "_meta": "json"}
    query_embed = ", ".join([str(v) for v in query_embed[0].tolist()])

    # Hybrid search: the full-text match and the vector-similarity match
    # are weighted 0.2 and 20, respectively, when scores are combined.
    search_query = {
        "$search": {
            "query": (
                f"(sentence:({query}))^0.2 or"
                f" (sentence_embed:[{query_embed}])^20"
            ),
            "limit": 1000,
            "min_score": 0.7,
        },
        "$hint": "sk_fts",
    }
    df = doc_db.find("Example.wiki", search_query, dtypes=dtypes)
    print(df)

    # Fetch the paragraphs around the top hit to hand to the LLM as context.
    sentence_id = int(df["sentence_id"][0])
    context_query = {
        "sentence_id": {"$gte": sentence_id - 2, "$lte": sentence_id + 18},
        "$project": ["sentence_id", "sentence"],
        "$sort": ["sentence_id"],
    }
    df = doc_db.find("Example.wiki", context_query, limit=1000)
    print(len(df))
    print(df)


if __name__ == "__main__":
    main()
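Two details of the search query are worth noting. The `^0.2` and `^20` suffixes weight the full-text score and the vector-similarity score when the two result sets are merged, so vector similarity dominates the ranking while keyword matches still contribute; `$hint: "sk_fts"` appears to select the full-text index (our reading, based on the code above). The second query then retrieves the paragraphs from `sentence_id - 2` to `sentence_id + 18`, so the LLM receives the passage surrounding the best hit rather than a single isolated sentence.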
Delivering Search Results to LLM as Context for Answers
We deliver the original data of the search results to the LLM. The LLM generates answers appropriate to the questions based on the received text. Below is an illustration of this process.
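As a concrete illustration of this step, here is a minimal sketch that assumes OpenAI's chat completion API as the LLM and assumes the retrieved paragraphs have already been joined into a single `context` string; the prompt wording and model choice are our own.

```python
# Minimal sketch: pass retrieved context to an LLM. Using OpenAI's chat
# API is an assumption here; the pipeline does not mandate a specific LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # hypothetical choice; any chat model works
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using only the context below, and"
                    " say so if the context does not contain the answer.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```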
Reviewing the Results
Now, let's take a look at the results. We ask the following two questions and compare the answers with ChatGPT's.
- How many countries are there in the world?
- What is the first search engine in South Korea?
How many countries are there in the world?
| LLM with Aeca | ChatGPT |
| --- | --- |
| There are over 220 countries in the world. A country is a political entity that occupies a certain territory and has an organized political form, i.e., a government, exercising internal and external sovereignty. The stages of national development typically proceed in the order of security state, developmental state, democratic state, and welfare state. The elements of a country include territory, people, and sovereignty. Additionally, a country can be defined as a group of humans who have successfully established a situation where they can legally use physical force exclusively within a certain territory. However, opinions on the concept and role of a country vary, and advanced countries claim to be welfare states, taking efforts for the happiness and welfare of their people as an indicator. Between the state and the individual, civil society acts as a third force that monitors and criticizes the state, exerting strong influence. Reference: Country - Wikipedia, the free encyclopedia | As of September 2021, there are 195 independent countries in the world. However, the number of countries can change due to international politics and geographical changes, so the current number may not always be accurate. The recognition of a country's identity and status can vary depending on the international community and international organizations, and the number of countries can change slightly whenever new countries declare independence or when a country's identity changes. |
When using the vector database, the system retrieves the "Country" article from Wikipedia and the LLM generates an answer based on that content. Because the original document is provided alongside the answer, we can verify from the source that there are over 220 countries in the world.
ChatGPT answers that, as of September 2021, there are 195 independent countries, but provides no specific evidence for the answer.
What is the first search engine in South Korea?
| LLM with Aeca | ChatGPT |
| --- | --- |
| The first internet search engine in South Korea is '코시크', developed in 1995 by graduate student Kim Young-ryeol at Chungnam National University. 코시크 is a Korean search engine that finds Korean information on the internet according to the user's request. Subsequently, in 1996, 'Kachi-net' and 'Simmani' started services, and in 1998, 'Naver' began its service. This information is based on an article from the Maeil Business Newspaper dated June 20, 1996. | The first search engine in South Korea is 'WebMate'. WebMate was first launched in 1996 and was widely used as a leading Korean web search engine at that time. Since then, various search engines have emerged, and the web search field has continued to develop. Currently, various search engines such as Google, Naver, and Daum are used in South Korea. |
When searching Wikipedia, the result is '코시크', developed in 1995, whereas ChatGPT answers 'WebMate', launched in 1996. We checked the answer against the original document provided by Aeca: in the page titled 'Korean Search Engine', we can confirm that '코시크' was developed in 1995.
When using only an LLM like ChatGPT, incorrect information may be provided and the basis for the answer cannot be verified. With a vector database, however, we can provide accurate answers even for recent information the LLM has never learned, and we can verify the basis of each answer, increasing reliability.
Exploring Application Cases
In this case study, we targeted Wikipedia, but the same approach applies to any document data. In particular, the following kinds of internal data cannot be learned by an LLM; if we store them in Aeca and use them together with an LLM, we can quickly find answers appropriate to our questions.
| Data | Application |
| --- | --- |
| Legal precedents and the latest laws | You can quickly find similar cases to use when drafting documents. In particular, applying hybrid search to judgment data, which are highly structured documents, can yield dramatic quality improvements. |
| Company communication materials | Collaboration tools like Slack store vast amounts of work-related conversation. With a vector database, you can quickly review how tasks were handled in the past and what decisions were made. |
| Contract information | By vectorizing and storing the many contracts a company holds, you can find similar contracts and past contract details. |
| Customer inquiries | You can quickly search the vast customer-consultation records stored in ChannelTalk to identify customer pain points and improvements, and automate the consultation process. |