Applying Natural Language Search to Product Search

Posted on June 12, 2024

Introduction

Most services need search functionality in addition to data storage. Storage methods and search system designs vary by service, but they generally follow a similar process: raw data is collected and refined into a fixed form, and then an index is built for search. Searches that handle natural language or synonyms robustly also need vector search. Building such a system typically requires more than one service, such as a DBMS and a vector DB, which adds complexity because the services must be synchronized and operated separately.

This post explores how to provide a web service that integrates Full-Text Search and Vector Search using a single Aeca database. The search demo implements product (cat food) search over collected data that mixes structured and unstructured fields. In this setting, a question such as "Which cat food without soy has a high protein content?" requires searching the fields that store ingredient information with appropriate range values.

In the Aeca search demo created with cat food data, you can search based on product, ingredient amounts, ingredient names, prices, and more.

Summary of Aeca Database and Search Features for Natural Language Search

Aeca provides multiple data models (Document, Vector, etc.) and search functionality within the database itself, so you can build natural language search with Aeca alone, without additional services.

Supports nested fields, allowing storage and index specification without additional database normalization processes.
Multiple embedding models can be used within a single collection, and you can adjust the weights of embedding models in real-time.
Built-in ML Model Serving functionality eliminates the need for separate model serving preparations.
A single query using Full-Text Search or Vector Search can return results ranked by integrated scores within Aeca.
Vector Search supports quantization and can continue service without interruption even before meeting the minimum samples required for quantization training.

Implementing Natural Language Search Development with Aeca Product

We have broadly divided the process into three steps: (1) Data Collection and Preprocessing → (2) Indexing → (3) Demo Search Development.

Step 1 covers data preprocessing; steps 2 and 3 show the actual implementation with Aeca, from data storage to search.

1. Data Collection and Preprocessing

We collected product data (cat food) ourselves to build the demo. Generally, such data acquisition is done through processes like web crawling or manual collection.

The data includes information such as company, brand, product name, ingredients, ingredient amounts, and price. You can directly check the data here.

1.1. Data Extraction

The ingredient amounts were collected in the following unprocessed form during the collection process.

조단백질
37.00% 이상
조지방
14.00% 이상
칼슘
1.00% 이상
인
0.80% 이상
수분
12.00% 이하
조회분
9.00% 이하
조섬유
6.00% 이하 칼로리 3,750 kcal/kg

In this form, a query such as 'cat food with more than 30% protein' cannot be answered, so we used an LLM to extract structured data, as shown in the code below. We used Llama3-8B, which is not specifically tuned for Korean; a larger model can be chosen for better extraction quality. We found no significant errors in the extracted output, so for convenience we skipped additional validation.

from llama_cpp import Llama
llm = Llama(
      model_path="./data/models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf",
      n_gpu_layers=-1,
      seed=1337,
      n_ctx=4096,
      chat_format="llama-3",
      verbose=True
)

_INGREDIENT_COMPOSITION_PROMPT = """Convert the following data to JSON format. Ensures that numbers and units are converted separately. Doesn't generate any code, just returns the converted result. Outputs the JSON block directly without any commentary.

example:
{
  "조단백질": {"value": 31.00, "unit": "%", "condition": "이상"},
  "칼로리": {"value": 2680, "unit": "kcal/kg"}
}"""

def extract_ingredient_composition(llm, text):
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": _INGREDIENT_COMPOSITION_PROMPT},
            {"role": "user", "content": text},
        ],
    )

    result = response["choices"][0]["message"]["content"]
    return result

# `df` holds the collected raw data as a pandas DataFrame (loading code not shown).
df["ingredient_composition"] = df["등록성분량"].apply(lambda x: extract_ingredient_composition(llm, x))
df.to_csv("data/food.csv", index=False)

Through this process, the extracted data has the following JSON format. As we'll explain in the indexing process, Aeca supports nested fields, so data stored in this format can be saved and indexed without additional normalization.

{
    "조단백질": {
        "value": 33,
        "unit": "%",
        "condition": "이상"
    },
    "조지방": {
        "value": 20,
        "unit": "%",
        "condition": "이상"
    },
    "조회분": {
        "value": 9.5,
        "unit": "%",
        "condition": "이하"
    },
    "조섬유": {
        "value": 2,
        "unit": "%",
        "condition": "이하"
    },
    "인": {
        "value": 1.3,
        "unit": "%",
        "condition": "이상"
    },
    "칼슘": {
        "value": 1.6,
        "unit": "%",
        "condition": "이상"
    },
    "수분": {
        "value": 7,
        "unit": "%",
        "condition": "이하"
    },
    "칼로리": {
        "value": 3700,
        "unit": "kcal/kg"
    }
}
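The demo skipped validation, but LLM output is not guaranteed to be well-formed JSON, so a lightweight check can catch bad rows early. The following is a minimal sketch and not part of the original pipeline; the function name `validate_composition` and its rules (every entry needs a numeric `value` and a string `unit`) are assumptions based on the example format above.

```python
import json


def validate_composition(text: str) -> list[str]:
    """Return a list of problems found in one extracted JSON string."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, entry in data.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        value = entry.get("value")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: 'value' is not numeric")
        if not isinstance(entry.get("unit"), str):
            problems.append(f"{name}: 'unit' is missing")
    return problems


good = '{"조단백질": {"value": 33, "unit": "%", "condition": "이상"}}'
bad = '{"조단백질": {"value": "33%", "unit": "%"}}'
print(validate_composition(good))
print(validate_composition(bad))
```

Rows that return a non-empty list could be re-run through the LLM or corrected by hand before indexing.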

We process ingredients similarly.

∙원재료 국문: 가수분해 참치, 해바라기씨박, 현미, 어분, 아마씨, 사탕무박(무 섬유소), 어유,
베타글루칸, 비타민제합제, 미네랄제합제, 유카추출물, DL-메치오닌,
녹두, 천일염, 어골칼슘, 타우린, 프락토올리고당, 염화칼륨, 가수분해
초록입홍합복합물, 알긴산나트륨, 씨벅턴열매, 달맞이꽃종자, 고수,
당근, 시금치, 글루코사민, SPM OMEGA-3, 유익균합제, 아스코르빈산 ∙원재료 영문: 데이터 없음

An example of the extracted data is as follows.

{
    "ko": [
        "연어",
        "건조 연어",
        "고구마",
        "완두",
        "이집트콩",
        "사과",
        "미량광물질-[아미노산킬레이트(아연, 철분, 망간, 구리), 요오드산칼슘]",
        "셀룰로오스",
        "타피오카",
        "알팔파",
        "크랜베리",
        "배",
        ...
        "마리골드",
        "아니스",
        "호로파",
        "계피",
        "비타민합제[Vitamin A/D3/ E]"
    ],
    "en": [
        "Salmon",
        "Dried Salmon",
        "Sweet Potato",
        "Peas",
        "Chickpeas",
        "Apple",
        ...
        "Copper",
        "Iodine"
    ]
}

1.2. Data Processing

Data preprocessing generally includes the following tasks. We will now proceed with similar operations.

Data type normalization
- Modify data fields to have the same type as much as possible
  - Aeca can store data even if types are different, but it may cause inconvenience in retrieval, index creation, and search.
- Ensure that None, NaN, NaT, [], etc., have Null values appropriate to their types
Normalization of categorical types
- Process so that values with the same meaning like 오메가-3 and 오메가-3 지방산 do not have different values
Fix errors in data transformation and extraction processes
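As an illustration of the first two tasks, the sketch below maps variant ingredient names to one canonical name and turns NaN and empty lists into None. It is only an example of the idea: the alias table `_ALIASES` and the helper names are hypothetical, and the alias entries are assumptions about which spellings mean the same thing.

```python
import math

# Hypothetical alias table: variant spelling -> canonical field name.
_ALIASES = {
    "오메가-3 지방산": "오메가-3",
    "오메가 3 지방산": "오메가-3",
    "오메가-6 지방산": "오메가-6",
    "오메가 6지방산": "오메가-6",
}


def normalize_name(name: str) -> str:
    """Map a variant ingredient name to its canonical form."""
    name = name.strip()
    return _ALIASES.get(name, name)


def normalize_null(value):
    """Turn NaN and empty lists into None; keep everything else."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, list) and not value:
        return None
    return value


def normalize_amounts(amounts: dict) -> dict:
    """Apply both normalizations to one ingredient_amounts record."""
    return {normalize_name(k): normalize_null(v) for k, v in amounts.items()}


record = {"오메가-3 지방산": {"value": 1.0, "unit": "%"}, "타우린": float("nan")}
print(normalize_amounts(record))
```

If two variants of the same name occur in one record, the later one overwrites the earlier one, so a real pipeline would need a rule for such conflicts.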

Next, we find the fields that appear most often among the ingredient amounts. We plan to search on subfield values such as ingredient_amounts.조지방.value, and because indexing every field is inefficient, we limit search to the key fields. Across the 101 products, crude fat (조지방) and crude fiber (조섬유) are always present, and including the fields down to Omega-3 (오메가-3) covers most ingredients.

For data preprocessing, we did not normalize data with similar but different names like 오메가-3 and 오메가-3 지방산 for convenience.

{
    '조지방': 101,
    '조섬유': 101,
    '수분': 100,
    '칼슘': 100,
    '인': 100,
    '조단백질': 97,
    '조회분': 97,
    '칼로리': 81,
    '타우린': 20,
    '마그네슘': 16,
    'DHA': 16,
    'EPA': 11,
    '오메가-6': 11,
    '오메가-3': 10,
    'Chondroitin Sulphate': 5,
    '비타민E': 5,
    '콘드로이틴 황산': 4,
    '오메가 3 지방산': 4,
    '조단백': 4,
    'EHA': 3,
    '오메가 6지방산': 3,
    '오메가-6 지방산': 3,
    '오메가-3 지방산': 3,
    '클루코사민': 2,
    '글루코사민': 2,
    'DHA/EPA': 2,
    '비타민 E': 2,
    '오메가-3 Fatty Acids': 2,
    '오메가-6 Fatty Acids': 2,
    ...
}
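A frequency table like the one above can be produced with a simple counter over the keys of each extracted record. This is a minimal sketch with made-up sample records; `count_fields` is a hypothetical helper, not the code used for the demo.

```python
from collections import Counter


def count_fields(records: list[dict]) -> Counter:
    """Count how many records contain each ingredient name."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Two illustrative records in the extracted JSON format.
records = [
    {"조지방": {"value": 14.0}, "조섬유": {"value": 6.0}, "타우린": {"value": 0.1}},
    {"조지방": {"value": 20.0}, "조섬유": {"value": 2.0}},
]
counts = count_fields(records)
print(counts.most_common())
```

Sorting by count makes it easy to pick a cutoff below which fields are left unindexed.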

Therefore, the selected fields are as follows.

_INGREDIENT_FIELDS = [
    "조지방",
    "조섬유",
    "수분",
    "칼슘",
    "인",
    "조단백질",
    "조회분",
    "칼로리",
    "타우린",
    "마그네슘",
    "DHA",
    "EPA",
    "오메가-6",
    "오메가-3",
]

For convenience, we change the Korean columns to English field names as follows.

_NAME_MAP = {
    "회사": "company",
    "브랜드": "brand",
    "제품명": "product_name",
    "가격정보": "price_info",
    "가격": "price",
    "가격 기준": "price_basis",
    "1kg당 가격": "price_per_1kg",
    "가격정보_text": "price_info_text",
    "기타정보": "other_info",
    "기타정보_text": "other_info_text",
    "등록성분량": "ingredient_amounts",
    "등록성분량_text": "ingredient_amounts_text",
    "성분": "ingredients",
    "성분_text": "ingredients_text",
    "첨가제_text": "additives_text",
}

We gather these definitions and save the refined data as follows.

Convert the absence of additive data to None value
Calculate statistical values of ingredient amounts
- To handle abstract expressions like average/much/little during query transformation using LLM

import json
from json import JSONDecodeError

import numpy as np
import pandas as pd

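# Omitted: _NAME_MAP and _INGREDIENT_FIELDS (shown above) and _COMPANY_ALIAS (a company name to alias mapping)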

_JSON_FIELDS = [
    "price_info",
    "other_info",
    "ingredients",
    "ingredient_amounts",
]

def _parse_json(text):
    if not isinstance(text, str):
        return {}

    try:
        data = json.loads(text)
        return data
    except JSONDecodeError:
        return {}

def main():
    df = pd.read_csv("data/food.csv")
    df = df.rename(columns=_NAME_MAP)
    df[_JSON_FIELDS] = df[_JSON_FIELDS].map(_parse_json)
    df.price_info = df.price_info.apply(
        lambda x: {_NAME_MAP.get(k, k): v for k, v in x.items()}
    )
    df["ingredient_amounts"] = df["ingredient_amounts"].apply(
        lambda x: {
            ingredient: {k: v for k, v in values.items() if v}
            for ingredient, values in x.items()
        },
    )
    df["additives_text"] = df.additives_text.apply(
        lambda x: None if x == "데이터 없음" else x
    )
    df_ingredient = df.ingredient_amounts.apply(pd.Series).map(
        lambda x: x.get("value") if isinstance(x, dict) else np.nan
    )
    df["company_alias"] = df.company.apply(lambda x: _COMPANY_ALIAS.get(x))

    df_ingredient[_INGREDIENT_FIELDS].describe().loc[
        ["mean", "25%", "75%"]
    ].to_csv("data/food_ingredient.csv", float_format="%.1f")

    df_price_info = (
        df.price_info.apply(pd.Series)
        .price.apply(lambda x: pd.to_numeric(x.get("value")))
        .describe()
    )
    print(df_price_info)

    df.to_pickle("data/food.pk", compression="gzip")

After this step, the processed data has the following characteristics.

Both the JSON-extracted field (e.g., ingredient_amounts) and its original text field (e.g., ingredient_amounts_text) are kept
- The extracted numeric data is used for range searches
- The _text data is used for keyword and natural language search

The final stored data format is as follows.

1.3. Data Encoding

Now we convert the text data into vectors. Earlier, the original text fields were given a _text suffix; here, each _text field is encoded into a corresponding _embed field. We used sentence_transformers with the intfloat/multilingual-e5-large model. Note the output dimension of the model you choose, because it is needed as a parameter in the indexing step later.

Most multilingual models do not perform well in Korean, so it's better to select an appropriate model. For example, the recommended multilingual models in sentence_transformers often have low performance in Korean or are weak in colloquial sentences. Since embedding is often done by encoding large volumes of documents at once, it is convenient to use an embedding API, but costs may increase. Therefore, it is advisable to decide whether to use it, considering your current resources (GPU, etc.) and budget.

To use more than one embedding model, whether to choose between them at search time or to combine their results with adjustable weights, store and index each model's output under a separate field name (e.g., _embed_e5, _embed_minilm). Indexes stored this way can be selected or combined in the search query.
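To make the idea of weighting two models concrete, the sketch below combines per-model similarity scores on the client side and ranks documents by the result. It only illustrates the arithmetic and is not how Aeca computes its integrated score; the function name, the document ids, and the scores are all made up for the example.

```python
def rank_by_weighted_score(
    doc_scores: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> list[tuple[str, float]]:
    """Rank documents by the weighted average of per-model similarities.

    doc_scores maps a document id to {embedding field: similarity}.
    weights maps each embedding field to its (non-negative) weight.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    combined = {
        doc_id: sum(w * scores.get(field, 0.0) for field, w in weights.items()) / total
        for doc_id, scores in doc_scores.items()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


scores = {
    "doc_a": {"_embed_e5": 0.9, "_embed_minilm": 0.2},
    "doc_b": {"_embed_e5": 0.5, "_embed_minilm": 0.9},
}
print(rank_by_weighted_score(scores, {"_embed_e5": 0.8, "_embed_minilm": 0.2}))
print(rank_by_weighted_score(scores, {"_embed_e5": 0.2, "_embed_minilm": 0.8}))
```

Shifting weight from one model to the other flips the ranking of the two documents, which is the effect that adjustable weights are meant to give.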

The encoding code simply encodes the text and saves the result, as follows.

from pathlib import Path
from typing import Any

import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

_MODEL_PATH = "intfloat/multilingual-e5-large"  # 1024 dims, 560M parameters
_CHUNK_SIZE = 32
_BATCH_SIZE = 16
_POOL_SIZE = 2

def _chunkify(df: pd.DataFrame, chunk_size: int):
    return [df.iloc[i : i + chunk_size] for i in range(0, len(df), chunk_size)]

def _encode(
    model: SentenceTransformer,
    pool: dict[str, Any],
    df: pd.DataFrame,
    output_file: Path,
):
    dfs = []
    target_columns = [
        ("product_name", "product_name_embed"),
    ] + [(x, x.replace("_text", "_embed")) for x in df.columns if "_text" in x]
    for name, embed_name in target_columns:
        column = df[name]
        column = column[column.notnull()]
        if column.empty:
            item = pd.Series([], name=embed_name)
            dfs.append(item)
        else:
            embed = model.encode_multi_process(
                column.tolist(), pool, batch_size=min(len(column), _BATCH_SIZE)
            )
            item = pd.Series(list(embed), index=column.index, name=embed_name)
            dfs.append(item)

    df = pd.concat([df, *dfs], axis=1)
    df.to_pickle(output_file, compression="gzip")

def main():
    model = SentenceTransformer(_MODEL_PATH, device="mps")
    print(f"Dims: {model.get_sentence_embedding_dimension()}")

    model_name = _MODEL_PATH.rsplit("/", maxsplit=1)[-1]
    output_path = Path(f"data/food/{model_name}")
    if output_path.exists():
        print("Output path already exists: ", output_path)
        return
    print(f"Output path: {output_path}")

    output_path.mkdir(exist_ok=True, parents=True)
    pool = model.start_multi_process_pool(
        [f"mps:{id}" for id in range(_POOL_SIZE)]
    )

    df = pd.read_pickle("data/food.pk", compression="gzip")
    chunks = _chunkify(df, _CHUNK_SIZE)
    for i, chunk in enumerate(tqdm(chunks)):
        output_file = output_path / f"food_{i:03d}.pk"
        _encode(model, pool, chunk, output_file)

Here, you can additionally consider the following:

Selection of fields to embed
- Choose fields that are lengthy and where semantic matching is important, since encoding and storage incur costs
Unit of encoding
- Sentence, chunk
Splitting criteria
- Sentence, fixed length, semantic unit, whether to overlap, whether to include hierarchy
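As one example of the splitting criteria above, the sketch below splits text into fixed-length chunks with an optional overlap. The demo itself encodes whole fields without splitting, so this is only an illustration; `split_fixed` and its parameters are hypothetical, and lengths are counted in characters rather than model tokens.

```python
def split_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears intact in one of the chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_fixed("abcdefghij", size=4, overlap=2))
print(split_fixed("abcdefghij", size=4))
```

A larger overlap improves the chance that a phrase is kept whole, at the cost of more vectors to encode and store.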

2. Indexing

Now we load the data into Aeca and create the indexes. The collection uses company, brand, and product_name together as its primary key.

Overall, the process is divided into data input and index creation parts. Aeca recommends creating the index after data insertion, which allows for faster data input and index creation.

from pathlib import Path
from pprint import pprint
import numpy as np
import pandas as pd
from aeca import Channel, DocumentDB
from tqdm import tqdm


_COLLECTION_NAME = "demo.cat_food"
_DATA_PATH = "data/food/multilingual-e5-large"
_AECA_HOST = "localhost"
_AECA_PORT = 10080

def main():
    collection_name = _COLLECTION_NAME
    channel = Channel(_AECA_HOST, _AECA_PORT)
    docdb = DocumentDB(channel)
    data_path = Path(_DATA_PATH)

    _insert(docdb, collection_name, data_path)
    _create_index(docdb, collection_name)

In the data input code, we define the primary key and load the preprocessed data. After some type conversion and transforming it into a dictionary, we input it into DocumentDB.

def _convert_dtype(item: np.ndarray | float):
    result = item
    if isinstance(item, np.ndarray):
        result = item.tolist()
    if isinstance(item, float):
        if np.isnan(item):
            result = None

    return result
    
def _insert(docdb: DocumentDB, collection_name: str, data_path: Path):
    if collection_name in docdb.list_collections():
        docdb.drop_collection(collection_name)

    indexes = [
        {
            "fields": ["company", "brand", "product_name"],
            "unique": True,
            "index_type": "kPrimaryKey",
        },
    ]
    docdb.create_collection(collection=collection_name, indexes=indexes)

    files = sorted(data_path.glob("*.pk"))
    for file in tqdm(files):
        df = pd.read_pickle(file, compression="gzip")
        columns = [x for x in df.columns if "_embed" in x]
        df[columns] = df[columns].map(_convert_dtype)

        docdb.insert(collection_name, df.to_dict(orient="records"))

After that, we create the index.

_HNSW_OPTIONS = {
    "index_type": "HNSW",
    "dims": 1024,
    "m": 64,
    "ef_construction": 100,
    "ef_search": 32,
    "metric": "inner_product",
    "normalize": True,
    "shards": 1,
}

def _create_index(docdb: DocumentDB, collection_name: str):
    default_analyzer = {
        "analyzer": {
            "type": "standard_cjk",
            "options": {"ngram_filter": {"min_size": 1, "max_size": 4}},
        },
        "index_options": "offsets",
    }
    int_analyzer = {"analyzer": {"type": "int64"}, "index_options": "doc_freqs"}
    float_analyzer = {
        "analyzer": {"type": "float64"},
        "index_options": "doc_freqs",
    }
    dense_vector_analyzer = {
        "analyzer": {
            "type": "DenseVectorAnalyzer",
            "options": _HNSW_OPTIONS,
        },
        "index_options": "doc_freqs",
    }

    ingredient_fields = (
        pd.read_csv("data/food_ingredient.csv").columns[1:].to_list()
    )
    ingredient_options = {
        f"ingredient_amounts.{x}.value": float_analyzer
        for x in ingredient_fields
    }
    common_options = {
        "company": default_analyzer,
        "company_alias": default_analyzer,
        "brand": default_analyzer,
        "product_name": default_analyzer,
        "product_name_embed": dense_vector_analyzer,
        "price_info.price.value": int_analyzer,
        "price_info.price_per_1kg.value": float_analyzer,
        "additives_text": default_analyzer,
        "additives_embed": dense_vector_analyzer,
        "ingredients.ko": default_analyzer,
        "ingredients_embed": dense_vector_analyzer,
        "ingredient_amounts_text": default_analyzer,
        "ingredient_amounts_embed": dense_vector_analyzer,
        "other_info_text": default_analyzer,
        "other_info_embed": dense_vector_analyzer,
    }
    index = {
        "index_name": "sk_fts",
        "index_type": "kFullTextSearchIndex",
        "unique": False,
        "options": {**common_options, **ingredient_options},
    }
    index["fields"] = list(index["options"].keys())
    pprint(index)

    docdb.create_index(collection_name, **index)

For general text fields (*_text), we choose the standard_cjk analyzer, which splits tokens into n-grams for storage. To handle single-character components like 콩 (bean), ngram_filter.min_size is set to 1. For Korean, the MecabTokenizer can also be a good choice. N-gram divides tokens into consistent units, which is advantageous for colloquial language and neologisms, but it can generate many duplicate tokens, potentially increasing storage size. On the other hand, MecabTokenizer splits at the morpheme level, which can result in semantic units and reduce the number of tokens, but it may include errors from the morphological analyzer.
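The effect of the minimum n-gram size can be seen with a small character n-gram function. This sketch only illustrates the concept; it is not Aeca's standard_cjk analyzer, whose actual tokenization may differ, and `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(token: str, min_size: int = 1, max_size: int = 4) -> list[str]:
    """Return all character n-grams of the token for min_size <= n <= max_size."""
    grams = []
    for n in range(min_size, max_size + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start : start + n])
    return grams


# With min_size=1 the single character 콩 is emitted as its own term.
print(char_ngrams("검은콩"))
# With min_size=2 it only appears inside longer n-grams.
print(char_ngrams("검은콩", min_size=2))
```

The first call emits six terms for a three-character token, which shows why n-gram indexing tends to increase storage size.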

Next, for fields that store embedding vectors (*_embed), we use the DenseVectorAnalyzer with an HNSW index. The dimension (dims) must match the output dimension of the embedding model. Aeca also supports quantization, which can speed up search and reduce index size if you can tolerate a slight decrease in recall. Quantization needs a minimum number of samples for training. Until that minimum is met, a non-quantized HNSW index is used, and the index switches over once enough samples are available, so it can be used without service interruption.
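The index options pair the inner_product metric with normalize set to True. Assuming that normalize means L2-normalizing vectors before comparison (the exact behavior inside Aeca is not described here), the inner product of two normalized vectors equals their cosine similarity, as this small sketch shows.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


query = [3.0, 4.0]
doc = [4.0, 3.0]

# Inner product after normalization ...
score = inner_product(l2_normalize(query), l2_normalize(doc))
# ... matches cosine similarity computed on the raw vectors.
cosine = inner_product(query, doc) / (math.hypot(*query) * math.hypot(*doc))
print(round(score, 4), round(cosine, 4))
```

This makes scores independent of vector length, so only the direction of the embeddings affects the ranking.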

Looking at the nested field indexing, the price.value stored in the price_info field is specified as price_info.price.value with the int64 analyzer, while price_info.price_per_1kg.value uses the float64 analyzer.

The FTS index options created in this way are as follows.

{
    "company": {
        "analyzer": {
            "type": "standard_cjk",
            "options": {
                "ngram_filter": {
                    "min_size": 1,
                    "max_size": 4
                }
            }
        },
        "index_options": "offsets"
    },
    "company_alias": {
        "analyzer": {
            "type": "standard_cjk",
            "options": {
                "ngram_filter": {
                    "min_size": 1,
                    "max_size": 4
                }
            }
        },
        "index_options": "offsets"
    },
    "brand": {
        "analyzer": {
            "type": "standard_cjk",
            "options": {
                "ngram_filter": {
                    "min_size": 1,
                    "max_size": 4
                }
            }
        },
        "index_options": "offsets"
    },
    "product_name": {
        "analyzer": {
            "type": "standard_cjk",
            "options": {
                "ngram_filter": {
                    "min_size": 1,
                    "max_size": 4
                }
            }
        },
        "index_options": "offsets"
    },
    "product_name_embed": {
        "analyzer": {
            "type": "DenseVectorAnalyzer",
            "options": {
                "index_type": "HNSW",
                "dims": 1024,
                "m": 64,
                "ef_construction": 100,
                "ef_search": 32,
                "metric": "inner_product",
                "normalize": true,
                "shards": 1
            }
        },
        "index_options": "doc_freqs"
    },
    "price_info.price.value": {
        "analyzer": {
            "type": "int64"
        },
        "index_options": "doc_freqs"
    },
    "price_info.price_per_1kg.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "additives_text": {
        "analyzer": {
            "type": "standard_cjk",
            "options": {
                "ngram_filter": {
                    "min_size": 1,
                    "max_size": 4
                }
            }
        },
        "index_options": "offsets"
    },
    "additives_embed": {
        "analyzer": {
            "type": "DenseVectorAnalyzer",
            "options": {
                "index_type": "HNSW",
                "dims": 1024,
                "m": 64,
                "ef_construction": 100,
                "ef_search": 32,
                "metric": "inner_product",
                "normalize": true,
                "shards": 1
            }
        },
        "index_options": "doc_freqs"
    },
    "ingredients.ko": {
        "analyzer": {
            "type": "standard_cjk",
            "options": {
                "ngram_filter": {
                    "min_size": 1,
                    "max_size": 4
                }
            }
        },
        "index_options": "offsets"
    },
    "ingredients_embed": {
        "analyzer": {
            "type": "DenseVectorAnalyzer",
            "options": {
                "index_type": "HNSW",
                "dims": 1024,
                "m": 64,
                "ef_construction": 100,
                "ef_search": 32,
                "metric": "inner_product",
                "normalize": true,
                "shards": 1
            }
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts_text": {
        "analyzer": {
            "type": "standard_cjk",
            "options": {
                "ngram_filter": {
                    "min_size": 1,
                    "max_size": 4
                }
            }
        },
        "index_options": "offsets"
    },
    "ingredient_amounts_embed": {
        "analyzer": {
            "type": "DenseVectorAnalyzer",
            "options": {
                "index_type": "HNSW",
                "dims": 1024,
                "m": 64,
                "ef_construction": 100,
                "ef_search": 32,
                "metric": "inner_product",
                "normalize": true,
                "shards": 1
            }
        },
        "index_options": "doc_freqs"
    },

	// (omitted)

    "ingredient_amounts.조지방.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts.조섬유.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts.수분.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts.칼슘.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts.인.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts.조단백질.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts.조회분.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    "ingredient_amounts.칼로리.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    },
    
    // (omitted)
    
    "ingredient_amounts.오메가-3.value": {
        "analyzer": {
            "type": "float64"
        },
        "index_options": "doc_freqs"
    }
}

Now, the data is stored, and we're ready to retrieve data through search.

As needed, you can use $group to calculate statistics as follows. The following example shows the calculation of average prices by brand.

3. Demo Development

The demo is implemented with Next.js, which combines a Node.js backend with a React frontend. The demo communicates with Aeca through the Aeca JavaScript SDK on Node.js. This means that most services, including search and RAG, can be implemented with a minimal setup of just Aeca and a web server.

The code implemented for the backend consists of a single API, and the content is as follows.

import { StatusCode, statusSuccess } from "@_api/_lib/status"
import { convertToSearchQuery } from "@_api/catfood/_lib/chat"
import {
  SearchResult,
} from "@_app/api/catfood/_lib/document"
import config from "@_app/config"
import { parseJSONStringsInObject } from "@_lib/transform"
import { Channel, DocumentDB, SentenceTransformer } from "@aeca/client"
import { NextRequest, NextResponse } from "next/server"

const _COLLECTION_GOODS = "demo.cat_food"
const _RESULT_FILTER_PATTERN = /^.+_embed/
const _FOOD_EMBED_COLUMNS = [
  "product_name_embed",
  "ingredients_embed",
  "ingredient_amounts_embed",
  "additives_embed",
  "other_info_embed",
]

export async function GET(
  request: NextRequest,
): Promise<NextResponse<SearchResult>> {
  const channel = new Channel(config.host, config.port)
  const docdb = new DocumentDB(channel)
  const model = new SentenceTransformer(channel, config.model)
  const searchParams = request.nextUrl.searchParams
  let query = searchParams.get("query")

  if (!query) {
    return NextResponse.json(
      { code: StatusCode.INVALID_ARGUMENT, message: "query is empty" },
      { status: 400 }, // a missing query is a client error, not a server error
    )
  }

  let convertedQuery: string | null = query
  let convertElapsedTime
  if (config.queryConversion) {
    if (query && query.startsWith(">")) {
      query = query.substring(1).trim()
      convertedQuery = query
    } else {
      const convertStartTime = performance.now()
      convertedQuery = await convertToSearchQuery(query)
      convertElapsedTime = performance.now() - convertStartTime
    }
  }

  const queryEmbed = await model.encode([query])
  let productQueryString
  if (queryEmbed.length > 0) {
    const embedQueryString = queryEmbed[0].data.join(",")
    const embedFieldsQuery = _FOOD_EMBED_COLUMNS
      .map((x) => `${x}:[${embedQueryString}]`)
      .join(" OR ")

    productQueryString = `(${convertedQuery}) AND (${embedFieldsQuery})^0.5`
  } else {
    productQueryString = convertedQuery
  }

  const stopwords = ["사료"]
  const productQuery = {
    $search: {
      query: productQueryString,
      custom_stop_words: stopwords,
      highlight: true,
      limit: 12,
    },
  }

  const findStartTime = performance.now()
  const productTable = await docdb.find(_COLLECTION_GOODS, productQuery)
  const findElapsedTime = performance.now() - findStartTime

  const productDataRaw = productTable?.data.toArray()
  const productData = parseJSONStringsInObject(
    productDataRaw,
    _RESULT_FILTER_PATTERN,
  )

  return NextResponse.json({
    ...statusSuccess,
    product: productData,
    info: {
      query: query,
      convertedQuery: convertedQuery,
      convertElapsedTime: convertElapsedTime,
      findElapsedTime: findElapsedTime,
    },
  })
}

Let’s look at the main code step by step:

convertedQuery = await convertToSearchQuery(query)
- Converts the user’s natural language query (query) into an Aeca search query using an LLM
const queryEmbed = await model.encode([query])
- Converts the natural language query into an embedding vector
- Aeca includes an ML model serving feature
productQueryString = `(${convertedQuery}) AND (${embedFieldsQuery})^0.5`
- Constructs an Aeca search query by combining the converted query and the embedding query
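The string assembly in the last step can be reproduced in Python, for example to try queries from a notebook. This is a sketch that mirrors the template literal in the backend code; `build_search_query` is a hypothetical helper, and the field names in the example are shortened for readability.

```python
def build_search_query(
    converted_query: str,
    embedding: list[float],
    embed_fields: list[str],
    boost: float = 0.5,
) -> str:
    """Combine the keyword/range query with one vector clause per embed field."""
    if not embedding:
        # No embedding available: fall back to the converted query alone.
        return converted_query
    vector = ",".join(str(x) for x in embedding)
    embed_clause = " OR ".join(f"{field}:[{vector}]" for field in embed_fields)
    return f"({converted_query}) AND ({embed_clause})^{boost}"


query = build_search_query(
    "사료 -콩",
    [0.1, -0.2],
    ["product_name_embed", "ingredients_embed"],
)
print(query)
```

A real embedding has 1024 values per field, so the resulting query string is long; the two-value vector here is only for illustration.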

Looking at the query conversion (convertToSearchQuery) step, we send the following prompt to the GPT-4o model and use its response. To handle abstract expressions such as “high”, “low”, and “average”, the prompt includes the ingredient amount statistics (food_ingredient.csv) calculated during preprocessing.

Convert the input natural language query into a Lucene search query.

The search target fields are as follows:

company
brand
price_info.price.value
price_info.price_per_1kg.value
ingredient_amounts.조지방.value
ingredient_amounts.조섬유.value
ingredient_amounts.수분.value
ingredient_amounts.칼슘.value
ingredient_amounts.인.value
ingredient_amounts.조단백질.value
ingredient_amounts.조회분.value
ingredient_amounts.칼로리.value
ingredient_amounts.타우린.value
ingredient_amounts.마그네슘.value
ingredient_amounts.DHA.value
ingredient_amounts.EPA.value
ingredient_amounts.오메가-6.value
ingredient_amounts.오메가-3.value

Among these, price_info (amount in KRW) and ingredient_amounts (percentage value: %) are numeric types, and users can search ranges based on these fields. When searching fields, you must group them with the keyword list using the AND operator.

Below are the statistics for each field. For example, when a user uses an abstract expression like “high content,” you can use the 75th percentile. However, for expressions like “contains taurine,” do not convert to a full range like ingredient_amounts.타우린.value:[* TO *]; convert to the keyword “타우린” instead.

,조지방,조섬유,수분,칼슘,인,조단백질,조회분,칼로리,타우린,마그네슘,DHA,EPA,오메가-6,오메가-3
mean,15.5,5.1,11.9,1.1,0.8,34.1,10.6,3563.8,0.7,0.1,1.1,0.4,2.6,1.1
25%,12.0,4.0,8.0,0.9,0.7,31.0,8.0,3460.0,0.1,0.1,0.3,0.2,2.2,0.8
75%,19.0,6.0,12.0,1.2,0.9,36.5,9.5,4050.0,0.2,0.1,0.8,0.5,3.1,1.0

Statistics for price_info.price.value:
mean,40512
25%,28900
75%,47000

Data stored in brand:

Royal Canin, Iskhan, Natural Core, Welz, N&D, Acana, Ziwi Peak, Sam’s Field, Natural Balance, Nutro, Natural Lab, Halo, Orijen, ANF, My Pet Doctor, AATU, CheongKwanJang JINIPET

Data stored in company:

Royal Canin, Champion Petfoods, Natural Core, Farmina Pet Foods, Good Day, ZIWI Limited, VAFO Production.sro, Natural Balance, Mars, Halo Pets, AATU Pet Food, KGC Live&Gin

Notes:

- Do not perform field searches outside of company, brand, price_info, ingredient_amounts; list them as keywords.
- Provide a short answer for the converted content without additional output.

Examples:

로얄캐닌 추천사료는?
로얄캐닌 추천사료

돌보는 길냥이 사료로 로얄캐닌 어떤거 주면 좋을까요?
길냥이 사료 로얄캐닌

콩이 들어 있지 않은 사료는?
사료 -콩

단백질 함량이 높으면서 대두가 빠진 사료는?
단백질 -대두 사료

조단백질이 30%이상인 사료는?
(ingredient_amounts.조단백질.value:[30 TO *]) AND (사료)

Using this prompt, the examples converted are as follows.

Input	Conversion
콩이 없는 ANF 사료	(brand:ANF) AND (-콩)
칼로리가 낮은 사료중에 가격이 저렴한 사료는?	(ingredient_amounts.칼로리.value:[* TO 3460]) AND (price_info.price.value:[* TO 28900]) AND (사료)
단백질이 많이 들어 있는 사료중에 콩이 없는 사료	(ingredient_amounts.조단백질.value:[36.5 TO *]) AND (사료 -콩)

In summary, the queries generated through the above code take the following format. The query below is an example for “Farmina pet food without potatoes,” and you can check the search results at this link.

{
    "$search": {
        "query": "(Farmina 사료 -감자) AND (product_name_embed:[0.0271,-0.0008, ...] OR ingredients_embed:[0.0271,-0.0008, ...] OR ingredient_amounts_embed:[0.0271,-0.0008, ...] OR additives_embed:[0.0271,-0.0008, ...] OR other_info_embed:[0.0271,-0.0008, ...",
        "highlight": true,
        "limit": 12
    }
}

Examining the query input into $search, it is largely divided into two parts.

Farmina 사료 -감자 (Farmina pet food, excluding potato)
- Keyword-based FTS
- Since no field names are specified, all defined fields except for vector search fields in sk_fts become the target
product_name_embed:[0.0271,-0.0008, ...] OR ingredients_embed:[0.0271,-0.0008, ...] ...
- Vector search
- Searches based on the similarity between embedding vectors from natural language

In this way, natural language input is converted into a query that targets the appropriate fields of the structured data with numeric and range values, while text data is searched with FTS or vector search. A single query retrieves documents from Aeca, ranked by integrated scores.

Getting Started with Aeca

If you want to explore Aeca further, you can easily install it with Docker and start using it right away. For more detailed explanations on adopting Aeca or to request a product brochure, please contact us through customer support.
