LangChain Python - Semantic Chunking

anpigon (71) in #blog • 8 months ago

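Semantic Chunking splits text based on semantic similarity. At a high level, the text is first split into sentences, the sentences are grouped into groups of three, and groups that are similar in embedding space are merged.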
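
To make that flow concrete, here is a rough sketch in plain Python. It is not the library's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `sketch_semantic_chunks`, `threshold`, and `buffer_size` are names I made up for illustration. The shape of the idea is the same, though: split into sentences, embed each sentence together with its neighbours, and cut wherever two consecutive windows are far apart.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in embedding: a bag-of-words count vector, not a real model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def sketch_semantic_chunks(text, threshold, buffer_size=1):
    # 1. Split into sentences after ., ? or ! followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    # 2. Build a window around each sentence (3 sentences when buffer_size=1).
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [toy_embed(w) for w in windows]
    # 3. Distance between each pair of consecutive windows.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # 4. Start a new chunk wherever the distance exceeds the threshold.
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


print(sketch_semantic_chunks("Cats purr. Cats nap. Dogs bark. Dogs run.", threshold=0.5))
```

With a real embedding model the distances reflect meaning rather than shared words, which is the whole point of using SemanticChunker instead of a toy like this.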

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/semantic-chunker/

pip install -U -q langchain_experimental langchain_community sentence-transformers

Load a long document file

# Read the whole file; UTF-8 is specified explicitly because the text is Korean
with open("./긴문서.txt", encoding="utf-8") as f:
    text = f.read()

Create a SemanticChunker

from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Korean sentence-embedding model from the Hugging Face Hub
embeddings = HuggingFaceEmbeddings(
    model_name="jhgan/ko-sroberta-multitask",
    encode_kwargs={'normalize_embeddings': False},
    model_kwargs={'device': 'mps'},  # Apple Silicon GPU; use 'cpu' or 'cuda' on other machines
)

text_splitter = SemanticChunker(embeddings)

Split the text

chunks = text_splitter.split_text(text)

# or
docs = text_splitter.create_documents([text])

percentile

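The default splitting method is percentile-based. The differences between all consecutive sentences are computed, and a split is made wherever the difference exceeds the X-th percentile.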

text_splitter = SemanticChunker(
    embeddings, 
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)
text_splitter.split_text(text)
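
As a rough illustration of the percentile rule, the sketch below takes a list of already computed distances and returns the positions where a split would happen. The helper names `percentile` and `percentile_breakpoints` are mine, not part of LangChain, and I am assuming linear interpolation between ranks (the default behaviour of `numpy.percentile`).

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def percentile_breakpoints(distances, amount=70):
    """Indices i where distances[i] is above the amount-th percentile.

    Index i refers to the gap between sentence i and sentence i + 1.
    """
    threshold = percentile(distances, amount)
    return [i for i, d in enumerate(distances) if d > threshold]


distances = [0.10, 0.12, 0.50, 0.11, 0.60, 0.09]
print(percentile_breakpoints(distances, amount=70))  # → [2, 4]
```

A lower percentile means more distances fall above the threshold, so you get more, smaller chunks.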

standard_deviation

A split is made wherever the difference is more than X standard deviations above the mean.

text_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,
)
text_splitter.split_text(text)
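
Here is a small sketch of how I understand this threshold. It assumes the cut-off is the mean distance plus X times the population standard deviation, and `std_breakpoints` is a made-up helper name, so treat it as an illustration rather than the library's code.

```python
import statistics


def std_breakpoints(distances, amount=1.25):
    """Indices whose distance exceeds mean + amount * population std (assumed rule)."""
    threshold = statistics.mean(distances) + amount * statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


distances = [0.1, 0.1, 0.1, 0.1, 0.9]
print(std_breakpoints(distances, amount=1.25))  # → [4]
```

Because the threshold depends on the spread of the distances, a document with one sharp topic change yields a single clear breakpoint, while a document with uniformly similar sentences may not be split at all.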

interquartile

Chunks are split using the interquartile range of the distances.

text_splitter = SemanticChunker(
    embeddings, 
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5,
)
text_splitter.split_text(text)
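
For comparison, a sketch of the interquartile variant. I am assuming the threshold is the mean distance plus X times the interquartile range (Q3 - Q1), with quartiles computed by linear interpolation; `percentile` and `iqr_breakpoints` are illustrative names, not LangChain functions.

```python
import statistics


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def iqr_breakpoints(distances, amount=0.5):
    """Indices whose distance exceeds mean + amount * IQR (assumed rule)."""
    iqr = percentile(distances, 75) - percentile(distances, 25)
    threshold = statistics.mean(distances) + amount * iqr
    return [i for i, d in enumerate(distances) if d > threshold]


distances = [1, 2, 3, 4, 10]
print(iqr_breakpoints(distances, amount=0.5))  # → [4]
```

The interquartile range ignores the extreme values when measuring spread, so a few very large distances inflate this threshold less than they would inflate a standard-deviation threshold.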

Token-based splitting

LangChain ships a Korean-aware splitter, so I gave it a try.

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/split_by_token/#konlpy

KoNLPy's Kkma

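KoNLPy includes Kkma (Korean Knowledge Morpheme Analyzer), a morphological analyzer. Kkma provides detailed morphological analysis of Korean text: it breaks sentences into words and words into their morphemes, and identifies the part of speech of each token. It can also split a block of text into individual sentences, which is particularly useful when processing long texts.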

Usage considerations

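Kkma is known for its detailed analysis, but keep in mind that this precision can come at the cost of processing speed. Kkma is therefore best suited to applications where depth of analysis matters more than fast text processing.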

pip install -U -q konlpy

from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter(
    separator='\n\n',
)
text_splitter.split_text(text)
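
Conceptually, this splitter first breaks the text into sentences and then packs the sentences back together into chunks joined by the separator. The sketch below imitates that with two hypothetical helpers: `regex_sentences` is a crude regex stand-in for Kkma's sentence splitting (so it runs without KoNLPy or Java), and `merge_sentences` is a simplified greedy merge with a made-up `chunk_size` default. The real splitter's merging logic is more involved, so treat this only as a picture of the idea.

```python
import re


def regex_sentences(text):
    """Stand-in for Kkma sentence splitting; a real run would use KoNLPy."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def merge_sentences(sentences, separator="\n\n", chunk_size=40):
    """Greedily pack sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "첫 번째 문장입니다. 두 번째 문장입니다. 세 번째 문장입니다."
for chunk in merge_sentences(regex_sentences(sample), separator="\n\n", chunk_size=30):
    print(repr(chunk))
```

Unlike SemanticChunker, the cut points here depend only on sentence boundaries and chunk length, not on what the sentences mean.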