RAG on CentOS 9: An Illustrated End-to-End Guide from Installation to Running

Introduction

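RAG (Retrieval-Augmented Generation) is an AI technique that combines retrieval with generation: the model answers with the help of passages fetched from a knowledge base, which improves the factual accuracy and quality of a large language model's answers. This article walks you step by step through setting up and running a RAG environment on CentOS 9, with an explanation of each command and solutions to common problems.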

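Preparation

Environment requirements

CentOS 9 operating system (tested version: 9.2)
Python 3.8+ (3.9 recommended)
At least 16 GB of RAM (needed when working with large models)
50 GB or more of disk space
NVIDIA GPU (optional, for acceleration)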
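
The memory and disk requirements can be checked before installing anything. The following is a minimal sketch that uses only the Python standard library and assumes a Linux host; the helper names (`meets_minimum`, `total_memory_bytes`, `free_disk_bytes`) are made up for illustration, and sizes are compared in decimal gigabytes so that a machine sold as "16 GB" is not rejected for the few hundred megabytes the kernel reserves.

```python
import os
import shutil

GB = 1000 ** 3  # decimal gigabyte, an assumption made for this check


def meets_minimum(available_bytes, required_gb):
    """Return True when available_bytes covers required_gb gigabytes."""
    return available_bytes >= required_gb * GB


def total_memory_bytes():
    """Total physical RAM in bytes (relies on os.sysconf, so Linux/Unix only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def free_disk_bytes(path="."):
    """Free bytes on the filesystem that holds path."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    print("RAM  >= 16 GB:", meets_minimum(total_memory_bytes(), 16))
    print("Disk >= 50 GB:", meets_minimum(free_disk_bytes("."), 50))
```

Run it from the directory where the project will live, so the disk check looks at the right filesystem.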

Prerequisite knowledge

Basic Linux command-line usage
Basics of Python environment management
Basics of Git version control

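Installation steps

1. System update and dependency installation

Code snippet

# Update the system packages
sudo dnf update -y

# Install the basic development tools
sudo dnf groupinstall "Development Tools" -y

# Install Python and related dependencies
sudo dnf install python3 python3-devel python3-pip git wget -y

# Verify the Python version
python3 --version

Notes:
– CentOS 9 ships Python 3.9 as its default python3, which meets the requirement. If python3 --version reports anything lower than 3.8, install a newer Python by compiling from source or from a third-party repository.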
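
The version requirement can also be enforced from inside Python, which is handy at the top of a setup script. A small sketch, assuming the 3.8 minimum stated above; `python_is_supported` is a hypothetical helper name.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version required by this guide


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Compare the (major, minor) pair against the required minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_supported():
        print(f"Python {found} is new enough")
    else:
        print(f"Python {found} is older than 3.8; install a newer interpreter first")
```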

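2. Creating the Python virtual environment

Code snippet

# Create the project directory
mkdir rag_project && cd rag_project

# Create the virtual environment
python3 -m venv rag_env

# Activate the virtual environment
source rag_env/bin/activate

# Upgrade pip
pip install --upgrade pip setuptools wheel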

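3. Installing the core RAG components

Code snippet

# Install PyTorch (run only ONE of the two commands, depending on whether you have a GPU)
# CPU build:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# GPU build (CUDA 11.7):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

# Install transformers, sentence-transformers and the other libraries used in this article
# (langchain-community provides the langchain_community imports used later)
pip install transformers sentence-transformers faiss-cpu datasets langchain langchain-community pypdf tiktoken openai gradio flask flask-cors flask-restful requests beautifulsoup4 html2text nltk scikit-learn sentencepiece protobuf==3.20.*

# GPU users: replace faiss-cpu with faiss-gpu (both provide the same faiss module, so do not keep both)
pip uninstall -y faiss-cpu && pip install faiss-gpu==1.7.2

Common problems:
– protobuf version conflicts: pinning protobuf==3.20.* resolves most compatibility problems
– CUDA version mismatch: the CUDA version of the PyTorch build must be supported by the NVIDIA driver installed on the system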
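
When one of these conflicts appears, the first step is usually to see which versions actually ended up in the virtual environment. A minimal sketch using `importlib.metadata` from the standard library; `installed_versions` is a hypothetical helper, and the package list is only an example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(
        ["torch", "transformers", "protobuf", "faiss-cpu", "faiss-gpu"]
    )
    for name, version in report.items():
        print(f"{name:<14} {version or 'not installed'}")
```

With PyTorch installed, `torch.version.cuda` and `torch.cuda.is_available()` report the CUDA side of the picture.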

4. RAG example code

Create the file rag_demo.py:

Code snippet

import logging

import torch
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration, RagTokenForGeneration

def setup_rag_model(model_name="facebook/rag-sequence-nq"):
    """
    Initialize the RAG model.

    Args:
        model_name: name of the RAG checkpoint, one of:
            - facebook/rag-sequence-nq (question answering)
            - facebook/rag-token-nq (token-level generation)
    """
    # Initialize the tokenizer and the retriever.
    # use_dummy_dataset=True loads a small sample index instead of the full Wikipedia index.
    tokenizer = RagTokenizer.from_pretrained(model_name)
    retriever = RagRetriever.from_pretrained(
        model_name,
        index_name="exact",
        use_dummy_dataset=True
    )

    # Pick the generator class that matches the checkpoint type
    if "sequence" in model_name:
        model = RagSequenceForGeneration.from_pretrained(model_name, retriever=retriever)
    else:
        model = RagTokenForGeneration.from_pretrained(model_name, retriever=retriever)

    return tokenizer, retriever, model

def generate_answer(question, tokenizer, model, n_docs=5):
    """
    Generate an answer with RAG.

    Args:
        question: str - the question to answer
        tokenizer: RagTokenizer instance
        model: RAG model instance (its attached retriever is used for the lookup)
        n_docs: number of retrieved passages to return

    Returns:
        The generated answer text and a list of retrieved context passages
    """
    # Tokenize the question and generate the answer
    inputs = tokenizer(question, return_tensors="pt")
    input_ids = inputs["input_ids"]

    with torch.no_grad():
        outputs = model.generate(input_ids=input_ids)
        # Encode the question so the retriever can be queried directly
        question_hidden_states = model.question_encoder(input_ids)[0]

    # Decode the generated answer
    answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # (Optional) fetch the retrieved passages for debugging and analysis.
    # The retriever expects NumPy arrays: the question ids and the question embedding.
    docs_dict = model.retriever(
        input_ids.cpu().numpy(),
        question_hidden_states.cpu().numpy(),
        n_docs=n_docs,
        return_tensors="pt"
    )

    # context_input_ids holds one row per passage, tokenized for the generator
    retrieved_docs = tokenizer.generator.batch_decode(
        docs_dict["context_input_ids"], skip_special_tokens=True
    )

    return answer, retrieved_docs

if __name__ == "__main__":
    print("Loading the RAG model...")

    # (Optional) set the log level to INFO to see the detailed progress
    logging.basicConfig(level=logging.INFO)

    # Step 1: initialize the RAG components
    tokenizer, retriever, model = setup_rag_model()

    print("\nModel loaded! Type your question, or 'quit' to exit\n")

    while True:
        question = input("Your question: ")

        if question.strip().lower() in ["quit", "exit"]:
            break

        if not question.strip():
            continue

        print("\nThinking...\n")

        try:
            answer, contexts = generate_answer(question, tokenizer, model)

            print(f"\nAnswer: {answer}\n")

            if contexts:
                print("References:")
                for i, ctx in enumerate(contexts):
                    print(f"[{i+1}] {ctx[:200]}...")
                print()

        except Exception as e:
            print(f"Error while processing the question: {e}")

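5. Web interface integration (optional)

Use Gradio to quickly create a web interface (save it as rag_web.py next to rag_demo.py):

Code snippet

import gradio as gr

from rag_demo import setup_rag_model, generate_answer

# Load the model once at startup instead of on every request
tokenizer, retriever, model = setup_rag_model()

def rag_web_interface(question):
    try:
        answer, _ = generate_answer(question, tokenizer, model)
        return answer
    except Exception as e:
        return f"Error while processing the question: {e}"

iface = gr.Interface(
    fn=rag_web_interface,
    inputs="text",
    outputs="text",
    title="RAG Question Answering System",
    description="Enter your question and the Wikipedia-backed RAG system will answer it"
)

if __name__ == "__main__":
    iface.launch(server_name="0.0.0.0", server_port=7860)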

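How the RAG workflow works

Retrieval stage:
- RAG first retrieves the document passages relevant to the question from the knowledge base (Wikipedia is the default knowledge source)
- A FAISS index speeds up the vector similarity search
Generation stage:
- A sequence-to-sequence model such as BART or T5 takes the question together with the retrieved context as input and generates the final answer
Advantages:
- Dynamic knowledge updates: only the retrieval corpus needs to be updated, without retraining the whole model
- Interpretability: you can inspect which document passages influenced the final answer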
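
The two stages can be illustrated without any model at all. The sketch below is a deliberately simplified stand-in: it replaces the dense question encoder and the FAISS index with bag-of-words vectors and a brute-force cosine ranking, and it stops at assembling the text a generator would receive. All function names and sample passages are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (a real system uses dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Retrieval stage: rank every passage against the question, keep the top k."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_generator_input(question, contexts):
    """Generation stage input: the question together with the retrieved context."""
    joined = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Context:\n{joined}\n\nQuestion: {question}"


passages = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "BART is a sequence-to-sequence model used as the RAG generator.",
    "CentOS is a Linux distribution.",
]
question = "Which library does similarity search over vectors?"
top = retrieve(question, passages, k=2)
print(build_generator_input(question, top))
```

Because the retrieved passages are ordinary data, swapping in a new `passages` list changes the answers without touching the generator, which is the "dynamic knowledge update" property listed above.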

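FAQ: solutions to common problems

Q1: RuntimeError: CUDA out of memory

Solution:

Code snippet

import torch

torch.cuda.empty_cache()         # release cached GPU memory that is no longer in use
model = model.half().to("cuda")  # FP16 half precision reduces GPU memory use
inputs = inputs.to("cuda")       # make sure the input data is on the GPU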

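Q2: ValueError: Index name not found

Cause: the index data has not been downloaded locally
Solution: add the use_dummy_dataset=True parameter or download the full index

Q3: HTTP connection errors

Cause: problems connecting to the HuggingFace servers
Solution: configure a mirror or retry

Code snippet

export HF_ENDPOINT=https://hf-mirror.com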
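
The "retry" half of that advice can be automated. The helper below is a hypothetical sketch, not part of any library: it retries a callable with exponential backoff and, by assumption, treats `OSError` as the retryable error, since connection and timeout failures derive from it.

```python
import time


def retry(operation, attempts=4, base_delay=1.0, sleep=time.sleep,
          retry_on=(OSError,)):
    """Call operation(); on a retryable error wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Example usage (requires transformers and network access):
# from transformers import RagTokenizer
# tokenizer = retry(lambda: RagTokenizer.from_pretrained("facebook/rag-sequence-nq"))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` is fine.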

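RAG optimization suggestions

Build a local knowledge base: replace the default Wikipedia index with a custom knowledge base

Code snippet

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load every PDF in ./my_data/ with the pypdf-based loader installed in step 3
loader = DirectoryLoader('./my_data/', glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the documents into chunks of at most 1000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs_chunks = text_splitter.split_documents(documents)

from langchain_community.embeddings import HuggingFaceEmbeddings

# Embed each chunk with a small sentence-transformers model
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

from langchain_community.vectorstores import FAISS

# Build the FAISS index and save it to disk
db = FAISS.from_documents(docs_chunks, embedding_model)
db.save_local("my_index")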
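
To make the role of chunk_size concrete, here is a simplified fixed-window splitter. It is an illustration only and not LangChain's algorithm, which prefers to break on paragraph and line separators; `split_text` and its `overlap` parameter are assumptions made for this sketch.

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Cut text into windows of chunk_size characters that share overlap characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Smaller chunks give the retriever more precise matches but less context per passage; the overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks.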

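Performance optimization tips:

Code snippet

# ONNX Runtime acceleration (requires: pip install optimum[onnxruntime])
# Note: this class targets sequence-classification models; "model_path" is a placeholder
from optimum.onnxruntime import ORTModelForSequenceClassification

ort_model = ORTModelForSequenceClassification.from_pretrained("model_path", export=True)

# TensorRT optimization: transformers has no TensorRT model class, so run the
# ONNX model through ONNX Runtime's TensorRT execution provider instead
# (requires onnxruntime-gpu and a working TensorRT installation)
trt_model = ORTModelForSequenceClassification.from_pretrained(
    "model_path", export=True, provider="TensorrtExecutionProvider"
)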

Quick deployment with Docker

Create the Dockerfile (it expects a requirements.txt listing the packages from step 3 in the same directory):

Code snippet

FROM quay.io/centos/centos:stream9

RUN dnf update -y && \
    dnf install -y python3 python3-pip git wget && \
    dnf clean all && \
    python3 -m pip install --upgrade pip

WORKDIR /app

COPY requirements.txt .

RUN python3 -m pip install -r requirements.txt

COPY . .

EXPOSE 7860

CMD ["python3", "rag_web.py"]

Build and run the container:

Code snippet

docker build -t rag-app . && docker run -p 7860:7860 rag-app

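Deploying as a web API service

Example Flask-based API service (api.py):

Code snippet

from flask import Flask, request, jsonify

from rag_demo import setup_rag_model, generate_answer

app = Flask(__name__)

# Load the model once at startup instead of on every request
tokenizer, retriever, model = setup_rag_model()

@app.route('/ask', methods=['POST'])
def ask():
    data = request.get_json(silent=True) or {}
    question = data.get('question', '').strip()
    if not question:
        return jsonify({"error": "question is required"}), 400
    answer, _ = generate_answer(question, tokenizer, model)
    return jsonify({"answer": answer})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)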

After starting the service, test it with curl:

Code snippet

curl -X POST http://localhost:5000/ask \
-H "Content-Type: application/json" \
-d '{"question":"What is the main advantage of quantum computing?"}'
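
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the service is reachable at http://localhost:5000 as in the curl example; `build_ask_request` and `ask` are hypothetical helper names, not part of the service code.

```python
import json
import urllib.request


def build_ask_request(question, base_url="http://localhost:5000"):
    """Build the POST request for the /ask endpoint without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:5000", timeout=60):
    """Send the question and return the 'answer' field of the JSON response."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]


# Example usage (requires the Flask service to be running):
# print(ask("What is the main advantage of quantum computing?"))
```

Keeping request construction separate from sending makes the payload easy to inspect when the service returns an unexpected error.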

Example Nginx reverse proxy configuration

Code snippet

server {
    listen 80;
    server_name your.domain.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
    }
}

Prometheus monitoring integration

Add performance metrics monitoring:

Code snippet

from prometheus_flask_exporter import PrometheusMetrics

metrics = PrometheusMetrics(app)
metrics.info('app_info','Application info',version='1.