Making Case Law Data Quickly Searchable
Introduction
Building a successful service takes careful planning, good design, and a stable technical foundation. In practice, though, most businesses face a window in which the service has to be up and running, and having something visible to show stakeholders is often what keeps a project moving. In that sense, being able to develop quickly and iterate on experiments with low implementation complexity matters just as much as building things solidly.
This article shows how quickly approximately 1.1GB of text data can be loaded and indexed to make it searchable, and how a service can be built on top of it. Here we focus only on Full-Text Search; for natural language search or hybrid search, see Applying Natural Language Search to Product Search.
The text data used in the demo is case law data provided by the Ministry of Legislation, which can be viewed at the Case Law Search Demo.
Implementation of Case Law Search Service
The implementation consists of the following steps: 1. Data Collection and Preprocessing, 2. Data Insertion and Indexing, 3. Demo Development.
1. Data Collection and Preprocessing
Case law data can be downloaded by applying for the Open API at the Ministry of Legislation's National Law Information Sharing Site. The APIs fall broadly into list queries and main-text queries, and the main text is provided in both XML and HTML formats. HTML contains slightly more information than XML, but since we index the full content for search rather than analyzing the structure of each case, we used the XML data, which does not require stripping HTML tags. If we later need to link legal provisions or split out individual judgments, the HTML data may be worth revisiting.
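As a rough illustration, the collection step is just a loop over the list API followed by per-case requests for the XML body. The sketch below uses `requests`; the endpoint paths and parameter names (`OC`, `target`, `type`, `ID`) are assumptions to verify against the Open API guide issued when you apply, and the sample ID comes from the document shown further below.

```python
# Minimal collection sketch. Endpoint paths and parameters are assumptions
# to check against the National Law Information Sharing Site Open API guide.
from pathlib import Path

import requests

_BODY_URL = "https://www.law.go.kr/DRF/lawService.do"  # assumed main-text endpoint
_OC = "your_api_key"  # issued when applying for the Open API


def download_case_xml(case_id: str, output_path: Path) -> None:
    """Fetch the XML body of a single case and write it to disk."""
    params = {"OC": _OC, "target": "prec", "type": "XML", "ID": case_id}
    res = requests.get(_BODY_URL, params=params, timeout=30)
    res.raise_for_status()
    (output_path / f"{case_id}.xml").write_text(res.text, encoding="utf-8")


if __name__ == "__main__":
    output = Path("data/raw/docs")
    output.mkdir(exist_ok=True, parents=True)
    # In practice, case_ids come from iterating the list API in the same way.
    for case_id in ["64441"]:
        download_case_xml(case_id, output)
```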
Download the entire list of cases, then download the XML data for every item in the list. The scale of the downloaded documents is as follows:
- 87,491 items
- 1.1GB of XML text data
An example of the downloaded XML data looks like this:

```xml
<PrecService>
  <판례정보일련번호>64441</판례정보일련번호>
  <사건명>
    <![CDATA[ 임대차보증금등·지료등 ]]>
  </사건명>
  <사건번호>2006다62492,62508</사건번호>
  <선고일자>20080925</선고일자>
  <선고>선고</선고>
  <법원명>대법원</법원명>
  <법원종류코드>400201</법원종류코드>
  <사건종류명>민사</사건종류명>
  <사건종류코드>400101</사건종류코드>
  <판결유형>판결</판결유형>
  <판시사항>
    <![CDATA[ [1] 계약의 법정 또는 약정 해지사유 발생시, 당사자가 경매신청 등 계약해지를 전제로 하는 행위 또는 기존 계약관계를 유지할 의사가 없음을 파악할 수 있는 행위를 하고 상대방도 그로 인하여 계약이 종료됨을 객관적으로 인식할 수 있었던 경우, 계약해지의 효과가 발생하는지 여부(적극)<br/>[2] 건물의 소유를 목적으로 하는 토지임대차에서 차임을 담보할 목적으로 그 건물에 대한 근저당권을 설정받은 임대인이 차임 연체를 이유로 근저당권을 실행하여 임의경매를 신청하였다면 이는 묵시적 임대차계약 해지의 의사표시라 볼 수 있으므로, 법원의 경매개시결정이 임차인에게 송달된 때에 위 임대차가 종료되었다고 본 사례<br/> ]]>
  </판시사항>
  <판결요지>
    <![CDATA[ ]]>
  </판결요지>
  <참조조문>
    <![CDATA[ [1] 민법 제105조, 제543조 / [2] 민법 제105조, 제543조, 제640조, 제641조<br/> ]]>
  </참조조문>
  <참조판례>
    <![CDATA[ ]]>
  </참조판례>
  <판례내용>
    <![CDATA[ 【원고(반소피고), 상고인 겸 피상고인】 <br/>【피고(반소원고), 피상고인 겸 상고인】 <br/>【원심판결】 청주지법 2006. 8. 14. 선고 2005나3466, 나3473(반소) 판결<br/>【주 문】<br/> 원심판결 중 임차보증금반환청구의 본소에 관한 부분과 철거 및 원상복구비용 청구의 반소에 관한 부분을 각 파기하고, 이 부분 사건을 청주지방법원 본원 합의부에 환송한다. 원고(반소피고) 및 피고(반소원고)의 나머지 상고를 모두 기각한다.<br/><br/>【이 유】 각 상고이유를 판단한다. ... 중략 ...<br/><br/>대법관 김영란(재판장) 이홍훈 안대희(주심) 양창수 ]]>
  </판례내용>
</PrecService>
```
Since we are not extracting any additional information, the data is relatively clean, and the preprocessing is limited to the following steps:
- Change field names to English variable names
- Convert the case information serial number to numeric type
- Change `<br>` tags in some fields to line breaks
```python
import logging
import re
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

import click
import pandas as pd
from lxml import etree

logger = logging.getLogger(__file__)

_COLUMN_NAME_MAPPING = {
    "판례정보일련번호": "doc_id",
    "사건명": "name",
    "사건번호": "number",
    "선고일자": "judgment_date",
    "선고": "judgment",
    "법원명": "court_name",
    "법원종류코드": "court_type_code",
    "사건종류명": "type_name",
    "사건종류코드": "type_code",
    "판결유형": "judgment_type",
    "판시사항": "holding_statement",
    "판결요지": "judgment_summary",
    "참조조문": "reference_provisions",
    "참조판례": "reference_cases",
    "판례내용": "content",
}
_TAG_PATTERN = re.compile(r"<br ?/>")
_MAX_WORKERS = 1
_CHUNK_SIZE = 10000


def _rename_fields(row: dict) -> dict:
    return {_COLUMN_NAME_MAPPING[k]: v for k, v in row.items()}


def _convert_tag(text: str) -> str:
    return _TAG_PATTERN.sub("\n", text)


def _to_int(text: str) -> int | None:
    if isinstance(text, str) and text.isnumeric():
        return int(text)
    return None


_PROCESSING_MAP = {
    "doc_id": _to_int,
    "holding_statement": _convert_tag,
    "judgment_summary": _convert_tag,
    "reference_provisions": _convert_tag,
    "reference_cases": _convert_tag,
    "content": _convert_tag,
}


def _convert_fields(row: dict[str, any]) -> dict[str, any]:
    return {
        k: _PROCESSING_MAP[k](v) if k in _PROCESSING_MAP else v
        for k, v in row.items()
    }


def _convert(file: Path) -> dict:
    logger.debug("file: %s", file.name)
    with open(file, encoding="utf-8") as f:
        text = f.read()

    root = etree.fromstring(text.encode())
    row = {x.tag: x.text.strip() if x.text else x.text for x in root}
    row = _rename_fields(row)
    row = _convert_fields(row)

    return row


def _save(data: list[dict], index: int, output_path: Path) -> None:
    file = output_path / f"docs_{index}.pk"
    df = pd.DataFrame(data)
    df.to_pickle(file, compression="gzip")
    logger.info("saved: %s", file.name)


@click.command()
@click.option("--data_path", type=Path, help="Data path", default="data/raw/docs")
def main(data_path):
    output_path = Path("data/docs")
    output_path.mkdir(exist_ok=True, parents=True)

    files = sorted(data_path.glob("*.xml"))
    with ThreadPoolExecutor(_MAX_WORKERS) as pool:
        futures = []
        index = 0
        for file in files:
            futures.append(pool.submit(_convert, file))

        chunk = []
        for future in as_completed(futures):
            row = future.result()
            chunk.append(row)
            if len(chunk) >= _CHUNK_SIZE:
                _save(chunk, index, output_path)
                index += 1
                chunk = []

        if chunk:
            _save(chunk, index, output_path)


if __name__ == "__main__":
    main()
```
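Once the script has run, the preprocessed chunks can be spot-checked before insertion. A quick sanity check, assuming the first chunk was written to `data/docs/docs_0.pk` as in the script above:

```python
import pandas as pd

# Load the first preprocessed chunk written by the script above.
df = pd.read_pickle("data/docs/docs_0.pk", compression="gzip")

print(len(df))                    # rows in this chunk (up to _CHUNK_SIZE)
print(df.columns.tolist())        # English field names, e.g. doc_id, name, content
print(df["doc_id"].isna().sum())  # doc_id values that failed numeric conversion
```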
2. Data Insertion and Indexing
Check the distribution of the transformed data to find a field to use as the primary key. In a sample of 10,000 items, the case information serial number (`doc_id`) is unique, but the case number (`number`) is not, as the quick check below illustrates.
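A minimal sketch of that check with pandas, assuming the preprocessed chunks from the previous step:

```python
import pandas as pd

# Load one preprocessed chunk and compare candidate key fields on a sample.
df = pd.read_pickle("data/docs/docs_0.pk", compression="gzip")
sample = df.head(10000)

for field in ["doc_id", "number"]:
    unique_ratio = sample[field].nunique() / len(sample)
    print(field, unique_ratio)  # doc_id should be 1.0; number is expected to be lower
```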
Set `doc_id` as the primary key and insert the data into Aeca.
```python
from pathlib import Path

import pandas as pd
from aeca import Channel, DocumentDB
from tqdm import tqdm

_COLLECTION_NAME = "demo.law"
_DATA_PATH = "data/docs"
_AECA_HOST = "localhost"
_AECA_PORT = 10080


def _insert(docdb: DocumentDB, collection_name: str, data_path: Path) -> None:
    if collection_name in docdb.list_collections():
        docdb.drop_collection(collection_name)

    indexes = [
        {
            "fields": ["doc_id"],
            "unique": True,
            "index_type": "kPrimaryKey",
        },
    ]
    docdb.create_collection(collection=collection_name, indexes=indexes)

    files = sorted(data_path.glob("*.pk"))
    for file in tqdm(files):
        df = pd.read_pickle(file, compression="gzip")
        data = df.to_dict(orient="records")
        docdb.insert(collection_name, data)


def main():
    collection_name = _COLLECTION_NAME
    channel = Channel(_AECA_HOST, _AECA_PORT)
    docdb = DocumentDB(channel)

    data_path = Path(_DATA_PATH)
    _insert(docdb, collection_name, data_path)
    _create_index(docdb, collection_name)


if __name__ == "__main__":
    main()
```
Then, create the index. For fields used in Full-Text Search (FTS), use the standard_cjk analyzer with the mecab tokenizer, since the documents are written-language text and are unlikely to contain many spelling or grammatical errors. For fields that will be used as filters, such as the case type name (`type_name`) and judgment (`judgment`), use the keyword analyzer. Lastly, assign the datetime analyzer to the judgment date (`judgment_date`).
```python
def _create_index(docdb: DocumentDB, collection_name: str) -> None:
    default_analyzer = {
        "analyzer": {"type": "standard_cjk", "options": {"tokenizer": "mecab"}},
        "index_options": "offsets",
    }
    int_analyzer = {"analyzer": {"type": "int64"}, "index_options": "doc_freqs"}
    keyword_analyzer = {
        "analyzer": {"type": "keyword"},
        "index_options": "doc_freqs",
    }
    datetime_analyzer = {
        "analyzer": {"type": "datetime"},
        "index_options": "doc_freqs",
    }

    index = {
        "index_name": "sk_fts",
        "fields": [
            "doc_id",
            "name",
            "number",
            "judgment_date",
            "judgment",
            "court_name",
            "court_type_code",
            "type_name",
            "type_code",
            "judgment_type",
            "holding_statement",
            "judgment_summary",
            "reference_provisions",
            "reference_cases",
            "content",
        ],
        "index_type": "kFullTextSearchIndex",
        "unique": False,
        "options": {
            "doc_id": int_analyzer,
            "name": default_analyzer,
            "number": keyword_analyzer,
            "judgment_date": datetime_analyzer,
            "judgment": keyword_analyzer,
            "court_name": keyword_analyzer,
            "court_type_code": keyword_analyzer,
            "type_name": keyword_analyzer,
            "type_code": keyword_analyzer,
            "judgment_type": keyword_analyzer,
            "holding_statement": default_analyzer,
            "judgment_summary": default_analyzer,
            "reference_provisions": default_analyzer,
            "reference_cases": default_analyzer,
            "content": default_analyzer,
        },
    }

    docdb.create_index(collection_name, **index)
```
The time to insert and index the 87,000 documents (1.1GB in total), and the resulting storage sizes, are shown below. Measurements were taken on a MacBook Pro M3 Max with 36GB of memory. Although the overall footprint grows once the FTS index is created, the 1.1GB of raw text is compressed to about 244MB in the primary collection, roughly 20% of the original size.
|             | Time Taken | Storage Size |
| ----------- | ---------- | ------------ |
| Primary Key | 41 seconds | 244MB        |
| FTS         | 3 minutes  | 2.9GB        |
The exact time depends on the index configuration, but this means roughly 1GB of text can be inserted and indexed for search in about 4 minutes. That fast turnaround helps during iteration: for example, we initially specified the int64 analyzer for `court_type_code` without realizing the field contained null values, noticed the indexing error right away, switched the field to the keyword analyzer, and re-indexed.
3. Demo Development
The demo is a lightly modified version of the one in Applying Natural Language Search to Product Search. As before, we used Next.js and the Aeca JavaScript SDK, so no separate API server was needed.
```typescript
import { StatusCode, statusSuccess } from "@_api/_lib/status"
import { SearchResult } from "@_app/api/law/_lib/document"
import config from "@_app/config"
import { parseJSONStringsInObject } from "@_lib/transform"
import { Channel, DocumentDB } from "@aeca/client"
import { NextResponse } from "next/server"
import { NextRequest } from "next/server"

const _COLLECTION = "demo.law"

export async function GET(
  request: NextRequest,
): Promise<NextResponse<SearchResult>> {
  const channel = new Channel(config.host, config.port)
  const docdb = new DocumentDB(channel)

  const searchParams = request.nextUrl.searchParams
  let query = searchParams.get("query")
  if (!query) {
    return NextResponse.json(
      { code: StatusCode.INVALID_ARGUMENT, message: "query is empty" },
      { status: 500 },
    )
  }

  const aql = {
    $search: {
      query: query,
      highlight: true,
      limit: 30,
    },
  }

  const findStartTime = performance.now()
  const df = await docdb.find(_COLLECTION, aql)
  const findElapsedTime = performance.now() - findStartTime

  return NextResponse.json({
    ...statusSuccess,
    docs: df?.data,
    info: {
      findElapsedTime: findElapsedTime,
    },
  })
}
```
As the code above shows, most of the work is converting the variables passed in `query` into the Aeca query language. Most queries return results in under 200ms. Note that the currently deployed demo has a slight additional delay due to the physical distance between the Aeca server and the web server.
For example, a plain keyword search is converted into a query like the following:

```json
{
  "$search": {
    "query": "공직자 부정청탁",
    "highlight": true,
    "limit": 12
  }
}
```

Although omitted here, you can also exclude unused fields with $project or cache search results in KeyValueDB if needed. With KeyValueDB, Aeca can usually do without a separate in-memory cache server such as Redis.
Additionally, the case law search demo offers filters such as case type and judgment. These are converted into queries like the following. The syntax of $search.query is similar to Lucene, so the filters can be expressed directly in the query string, which keeps the API simple without adding parameters or endpoints.
{ "$search": { "query": "(공직자 부정청탁) AND (type_name:(민사 OR 형사) AND judgment:(선고))", "highlight": true, "limit": 30 } }
Conclusion
The demo implementation reused some of the material from Applying Natural Language Search to Product Search, yet building a service in a completely different domain took about a day, from data collection through to the running demo. That may not translate one-to-one to every real-world project, but it does show that the experimental cost of infrastructure setup, data processing, and storage is small, and that the functionality needed to build a service is fully covered.
With Aeca, a single engine can serve as the DBMS for data storage, the indexing service for full-text search, the cache for fast responses, and the model server for vector embeddings, while keeping memory usage well below that of in-memory stores. This reduces the cost and complexity of the service and keeps the local development environment aligned with the production deployment environment.
This demo was later enhanced with vector search for semantic queries and other features; the improved version is described in Searching Case Law Data with Natural Language.
Getting to Know Aeca
If you'd like to use Aeca further, you can install it easily with Docker and start using it right away. For more detailed explanations about adopting Aeca or to request a product brochure, please contact us via Customer Support.
Read more
Searching Case Law Data with Natural Language
Explains how to build a natural language search service by applying vector search to a case law search demo using FTS.
By Aeca Team|2024-07-04
Applying Natural Language Search to Product Search
We explain the process of data collection and processing, search, and service development for product search using Aeca. Learn how to index when structured and unstructured data are mixed, and how to transform queries for search using LLM.
By Aeca Team|2024-06-12