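Medical Text Preprocessing: Python + LangChain Techniques

Introduction

In healthcare, text data such as electronic medical records, medical literature, and patient feedback holds valuable information. This data, however, often suffers from inconsistent formatting, dense domain terminology, and interleaved private information, and using it as-is degrades downstream analysis. This article shows how to preprocess medical text efficiently with Python and LangChain so that later NLP tasks start from a solid foundation.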

Preparation

Environment Requirements

Python 3.8+
pip package manager
Jupyter Notebook (optional, recommended)

Installing Dependencies

Code snippet

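# The imports in this article follow the legacy langchain 0.0.x module layout;
# newer releases moved several of them into separate packages.
# The openai package is required by LangChain's OpenAI wrapper used in step 2.
pip install langchain openai python-dotenv pandas spacy scikit-learn
python -m spacy download en_core_web_sm  # English model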

End-to-End Medical Text Preprocessing Workflow

1. Data Loading and Initial Cleaning

Code snippet

import re

import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter  # used in step 4

# Sample medical record data
data = {
    "patient_id": [101, 102, 103],
    "record_text": [
        "Patient presents with fever (38.5°C) and persistent cough for 3 days. No known allergies.",
        "MRI shows mild osteoarthritis in both knees. Rx: ibuprofen 200mg bid.",
        "DOB: 01/01/1980. BP:120/80, HR:72. History of hypertension."
    ]
}

df = pd.DataFrame(data)

def clean_medical_text(text):
    """Basic cleaning: mask dates, normalize temperatures, collapse whitespace."""
    # Replace dates with a placeholder (privacy protection)
    text = re.sub(r'\d{1,2}/\d{1,2}/\d{2,4}', '[DATE]', text)

    # Normalize temperature notation, keeping the unit (38.5°C -> 38.5 degrees C)
    text = re.sub(r'(\d+(?:\.\d+)?)\s*°\s*([CF])', r'\1 degrees \2', text)

    # Collapse extra spaces and line breaks
    text = ' '.join(text.split())

    return text

df['cleaned_text'] = df['record_text'].apply(clean_medical_text)
print(df[['patient_id', 'cleaned_text']].head())

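Key points:
– RecursiveCharacterTextSplitter is LangChain's text splitter for long documents; it is imported here and used in step 4
– Dates in medical text need special handling to protect patient privacy; here they are replaced with a [DATE] placeholder
– Normalizing measurements such as temperature makes later analysis easier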
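
Real records rarely stick to a single date format. As a minimal sketch of how the date-masking idea might be extended, the snippet below masks three common notations. The pattern list and the name mask_dates are assumptions made for illustration, not part of the original code, and they do not cover every format found in clinical notes.

```python
import re

# Hypothetical pattern list for illustration; extend it for the formats in your data.
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{1,2}-\d{1,2}\b"),              # 1980-01-01
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),            # 01/01/1980
    re.compile(rf"\b{MONTHS}\s+\d{{1,2}},?\s+\d{{4}}\b"),   # Jan 1, 1980
]

def mask_dates(text: str, placeholder: str = "[DATE]") -> str:
    """Replace every match of the known date patterns with a placeholder."""
    for pattern in DATE_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(mask_dates("DOB: 01/01/1980. Seen on 2021-03-15. BP:120/80"))
```

Because the slash pattern requires two slashes, a blood pressure reading such as 120/80 is left untouched.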

2. Identifying and Extracting Medical Terms with LangChain

Code snippet

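import os

from dotenv import load_dotenv
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

# Load the OpenAI API key from the environment (or a local .env file)
# instead of hardcoding it in source code
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY in the environment or in a .env file")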

prompt_template = """
Identify and extract medical terms from the following text:

{text}

Return the terms as a comma-separated list grouped by categories:
- Symptoms: 
- Diagnoses: 
- Medications: 
- Procedures:
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=prompt_template,
)

llm = OpenAI(temperature=0)  
term_chain = LLMChain(llm=llm, prompt=prompt)

# Test on the first record
sample_text = df.iloc[0]['cleaned_text']
result = term_chain.run(sample_text)
print("Extracted medical terms:\n", result)

Example output:

Code snippet

Symptoms: fever, persistent cough  
Diagnoses:  
Medications:  
Procedures:

Notes:
1. API calls cost money, so run a small-scale test before processing in bulk
2. Setting temperature to 0 reduces randomness, which suits term extraction
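
The chain returns free text, so downstream code usually needs it in a structured form. Below is a minimal sketch of a parser for the "Category: a, b" layout requested in the prompt. The function name parse_term_list is hypothetical, and the sketch assumes the model actually follows the requested layout, which is likely at temperature 0 but not guaranteed.

```python
def parse_term_list(raw: str) -> dict:
    """Parse lines of the form 'Category: term, term' into {category: [terms]}."""
    result = {}
    for line in raw.splitlines():
        # Tolerate leading list dashes and surrounding whitespace
        line = line.strip().lstrip("-").strip()
        if ":" not in line:
            continue
        category, _, terms = line.partition(":")
        result[category.strip()] = [t.strip() for t in terms.split(",") if t.strip()]
    return result

sample_output = "Symptoms: fever, persistent cough\nDiagnoses:\nMedications:\nProcedures:"
print(parse_term_list(sample_output))
```

Running this parser on a handful of records before a bulk job is a cheap way to check that the output format holds up.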

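3. Enhanced Medical Entity Recognition with spaCy

spaCy's standard models recognize general-purpose entities, but medical text calls for an extension:

Code snippet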

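import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Add a custom medical vocabulary (in a real project, import it from a medical dictionary)
medical_terms = ["osteoarthritis", "hypertension", "ibuprofen"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
patterns = [nlp.make_doc(term) for term in medical_terms]  # tokenize only, skip the full pipeline
matcher.add("MEDICAL_TERMS", patterns)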

def enhanced_ner(text):
    doc = nlp(text)

    # Entities found by the built-in NER
    print("Standard NER:")
    for ent in doc.ents:
        print(ent.text, ent.label_)

    # Additional matches from the custom matcher
    print("\nCustom medical terms:")
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

# Test on the second record
enhanced_ner(df.iloc[1]['cleaned_text'])

Example output (illustrative; the labels produced by en_core_web_sm vary with the model version):

Code snippet

Standard NER:
mild osteoarthritis ORG
both knees ORG
ibuprofen PRODUCT

Custom medical terms:
osteoarthritis
ibuprofen

4. Smart Chunking with LangChain

Medical records are often long and need to be split into sensible chunks:

Code snippet

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
)

long_document = """
Patient history: 
55yo male with Type II DM diagnosed in 2010. 
Current medications:
- Metformin 1000mg bid 
- Lisinopril 10mg qd 
Recent HbA1c:7.2%. 

Last visit notes:
Complains of occasional dizziness when standing up quickly.
No changes to medication recommended at this time.
Follow up in three months.
"""

chunks = text_splitter.create_documents([long_document])
print(f"Generated {len(chunks)} text chunks:")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1}:")
    print(chunk.page_content)

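Chunking strategy notes:
– chunk_size=300: the maximum chunk length, measured in characters because length_function=len; for token-limited models, pick a value that keeps each chunk within the model's input limit
– chunk_overlap=50: adjacent chunks may share up to 50 characters, so context near a boundary is not lost
– RecursiveCharacterTextSplitter splits on the coarsest separator first: by default paragraph breaks, then line breaks, then spaces, and only then individual characters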
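
To make the "coarsest separator first" behaviour concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: the function name recursive_split is hypothetical, overlap is left out, and separators that fall on a chunk boundary are dropped.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small parts
            continue
        if current.strip():
            chunks.append(current)       # flush what has been merged so far
        if len(part) <= chunk_size:
            current = part
        else:
            # The part itself is too long: retry with the next, finer separator
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks

note = "Patient history:\n55yo male with Type II DM.\n\nLast visit notes:\nComplains of occasional dizziness."
for chunk in recursive_split(note, chunk_size=45):
    print(repr(chunk))
```

The first paragraph fits within the limit and stays whole, while the second is too long and is split again at its line break.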

Advanced Technique: Building a Preprocessing Pipeline for a Medical Knowledge Graph

Code snippet

from langchain.chains import TransformChain, SequentialChain

# Step 1: De-identification
def deidentify(inputs):
    text = inputs["text"]

    # (A real project should use more thorough regular expressions or an NER model)
    text = text.replace("55yo", "[AGE]")

    return {"deidentified_text": text}

deid_chain = TransformChain(
    input_variables=["text"],
    output_variables=["deidentified_text"],
    transform=deidentify,
)

# Step 2: Key information extraction
prompt_template2 = """Extract key clinical facts from:

{deidentified_text}

Format as JSON with keys: symptoms, medications, lab_results"""

prompt2 = PromptTemplate(
    input_variables=["deidentified_text"],
    template=prompt_template2,
)

extract_chain = LLMChain(
    llm=OpenAI(temperature=0),
    prompt=prompt2,
    output_key="json",  # must match output_variables below; the default key is "text"
)

# Combine the chains
full_pipeline = SequentialChain(
    chains=[deid_chain, extract_chain],
    input_variables=["text"],
    output_variables=["json"],
    verbose=True,
)

result_json_str = full_pipeline.run(long_document)
print("\nFinal extraction result:", result_json_str)
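
The pipeline returns the model's answer as a plain string, and nothing guarantees that the string is valid JSON or contains every requested key. The sketch below shows one way to parse it defensively. The helper name parse_clinical_json and the choice to default missing keys to empty lists are assumptions made for illustration.

```python
import json
import re

EXPECTED_KEYS = ("symptoms", "medications", "lab_results")

def parse_clinical_json(raw: str) -> dict:
    """Parse the outermost {...} block in raw and fill in any missing expected keys."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))  # raises json.JSONDecodeError on malformed JSON
    return {key: data.get(key, []) for key in EXPECTED_KEYS}

raw_answer = 'Here is the result:\n{"symptoms": ["dizziness"], "medications": ["Metformin"]}'
print(parse_clinical_json(raw_answer))
```

json.JSONDecodeError is a subclass of ValueError, so a caller can catch both failure modes with a single except ValueError clause.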

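Common Problems and Solutions

Unstructured data is hard to process
- Solution: extract the structured parts with regular expressions first (e.g. "BP:120/80"), then process the rest with NLP
Medical terms have many variants
- Recommendation: build a synonym dictionary, e.g. {"MI": "myocardial infarction"}
Privacy requirements are strict
- Required: remove all PHI (protected health information), including names, addresses, and Social Security numbers
Records mix multiple languages
- Strategy: detect the language first (e.g. with langdetect), then process each language's portion separately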
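
The first two suggestions above can be sketched with the standard library alone. The regular expressions, the field names, and every dictionary entry other than "MI" are assumptions chosen for illustration; a real project would load a curated abbreviation list.

```python
import re

# Hypothetical patterns and dictionary, for illustration only
BP_PATTERN = re.compile(r"\bBP\s*:\s*(\d{2,3})\s*/\s*(\d{2,3})")
HR_PATTERN = re.compile(r"\bHR\s*:\s*(\d{2,3})")
ABBREVIATIONS = {
    "MI": "myocardial infarction",
    "DM": "diabetes mellitus",
    "HTN": "hypertension",
}

def extract_vitals(text: str) -> dict:
    """Pull blood pressure and heart rate out of 'BP:120/80, HR:72' style text."""
    vitals = {}
    bp = BP_PATTERN.search(text)
    if bp:
        vitals["systolic"] = int(bp.group(1))
        vitals["diastolic"] = int(bp.group(2))
    hr = HR_PATTERN.search(text)
    if hr:
        vitals["heart_rate"] = int(hr.group(1))
    return vitals

def expand_abbreviations(text: str, mapping: dict = ABBREVIATIONS) -> str:
    """Replace whole-word, case-sensitive abbreviations with their long forms."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

print(extract_vitals("DOB: [DATE]. BP:120/80, HR:72. History of hypertension."))
print(expand_abbreviations("History of MI and Type II DM."))
```

Matching is case-sensitive and restricted to whole words so that ordinary words containing "mi" or "dm" are left alone.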

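Summary

This article walked through a complete medical text preprocessing workflow:

Basic cleaning: standardize formats and remove noise
Term extraction: use LangChain's LLM capabilities to identify domain terms
Entity enhancement: combine spaCy with custom rules to improve NER
Smart chunking: fit the input requirements of downstream NLP tasks

Text processed this way is better suited for:
- Clinical decision support systems
- Electronic medical record analysis
- Medical research data preparation

The complete code is available in the GitHub repository: [example repository link]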

Best-practice tip: in production, wrap the preprocessing steps into reusable Pipeline components and add unit tests to keep data processing consistent.
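
As a rough sketch of what such a component might look like, the snippet below chains plain string-to-string functions and pins their behaviour with a unit test. The class name PreprocessingPipeline and the two step functions are hypothetical; the steps mirror the cleaning logic from step 1.

```python
import re
import unittest
from typing import Callable, List

Step = Callable[[str], str]

class PreprocessingPipeline:
    """Apply a fixed sequence of text -> text steps in order."""

    def __init__(self, steps: List[Step]):
        self.steps = list(steps)

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

def mask_dates(text: str) -> str:
    return re.sub(r"\d{1,2}/\d{1,2}/\d{2,4}", "[DATE]", text)

def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())

pipeline = PreprocessingPipeline([mask_dates, normalize_whitespace])

class PipelineTest(unittest.TestCase):
    def test_masks_dates_and_collapses_whitespace(self):
        self.assertEqual(pipeline("DOB:  01/01/1980.\nBP:120/80"), "DOB: [DATE]. BP:120/80")

    def test_is_idempotent(self):
        once = pipeline("DOB: 01/01/1980.")
        self.assertEqual(pipeline(once), once)
```

Saved as a module, the tests can be run with python -m unittest. The idempotence test guards against steps that corrupt text that has already been processed.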