Troubleshooting Common Problems When Installing Text Generation Inference on Ubuntu 22.04

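Introduction

Text Generation Inference (TGI) is Hugging Face's high-performance server for text generation models. Installing it on Ubuntu 22.04 can run into dependency conflicts and environment configuration problems. This article walks through those problems step by step so that you end up with a working TGI deployment.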

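Prerequisites

Before you start, make sure your system meets the following requirements:

Ubuntu 22.04 LTS
At least 16 GB of RAM (large models need more)
An NVIDIA GPU (recommended); CPU-only operation works but is much slower
Python 3.9 or later
The latest version of pip
The CUDA Toolkit (GPU users only)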
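
The RAM and Python requirements above can be checked before any installation work starts. The following is a minimal sketch, not an official TGI tool: the helper names `ram_ok` and `version_ge` are made up for this article, and `MIN_PYTHON` is an assumption you should set to the Python floor you are targeting. MemTotal is rounded to the nearest GiB because the kernel reserves some memory, so a 16 GB machine reports slightly less than 16 GiB.

```shell
# Preflight sketch (hypothetical helper names, not part of TGI).

# ram_ok MEM_KB MIN_GIB: succeed if MEM_KB, rounded to the nearest GiB,
# is at least MIN_GIB.
ram_ok() {
    mem_gib=$(( ($1 + 524288) / 1048576 ))
    [ "$mem_gib" -ge "$2" ]
}

# version_ge HAVE WANT: succeed if version HAVE >= WANT (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a live system (commented out so the sketch has no side effects):
#   MIN_PYTHON=3.9   # assumption: adjust to the Python floor you target
#   ram_ok "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" 16 || echo "less than 16 GiB RAM"
#   version_ge "$(python3 -c 'import platform; print(platform.python_version())')" "$MIN_PYTHON" \
#       || echo "Python is too old"
```

Both helpers only return an exit status, so they can be chained with `&&` and `||` in a larger install script.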

Step 1: Prepare the system

1.1 Update system packages

Code snippet

sudo apt update && sudo apt upgrade -y

1.2 Install base dependencies

Code snippet

sudo apt install -y git curl build-essential cmake pkg-config libssl-dev libclang-dev clang llvm

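Note: this is the basic toolchain needed to compile the Rust code and build TGI.

Step 2: Install the Rust toolchain

TGI's router and launcher are written in Rust, so a working Rust environment is required:

Code snippet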

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
rustup default stable
rustup update stable

Verify the installation:

Code snippet

rustc --version
cargo --version
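
A missing tool tends to surface much later as a confusing build error, so it can help to check the whole toolchain in one go. This is a small sketch under the assumption that the commands from steps 1 and 2 are the ones you care about; `require_cmds` is a hypothetical helper name, not something TGI or rustup provides.

```shell
# require_cmds CMD...: report every command that is missing from PATH and
# return non-zero if at least one is missing. (Hypothetical helper name.)
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: the tools installed in steps 1 and 2.
#   require_cmds git curl cmake clang rustc cargo && echo "toolchain looks complete"
```

If `rustc` or `cargo` is reported missing right after installation, the current shell has probably not loaded `$HOME/.cargo/env` yet.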

Common issue: if rustup fails with an SSL certificate error, reinstall the CA certificates:

Code snippet

sudo apt install --reinstall ca-certificates

Step 3: Install CUDA (GPU users)

3.1 Check the NVIDIA driver

Code snippet

nvidia-smi

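If the command is not found or reports that it cannot communicate with the driver, install the NVIDIA driver first and reboot:

Code snippet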

sudo ubuntu-drivers autoinstall && sudo reboot

3.2 Install the CUDA Toolkit

This guide uses CUDA 11.7 on Ubuntu 22.04. Newer TGI releases target CUDA 12.x, so check the version required by the release you intend to build and adjust the package names accordingly:

Code snippet

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
sudo apt -y install cuda-toolkit-11-7 libcudnn8 libcudnn8-dev libnccl2 libnccl-dev
sudo apt -y install protobuf-compiler libprotobuf-dev python3-pip python3-dev python3-venv patchelf jq numactl htop nvtop

Note: the CUDA Toolkit is a multi-gigabyte download. The first install line pulls the toolkit with the cuDNN and NCCL libraries; the second adds build helpers and optional monitoring tools. apt aborts the whole transaction if a single package name cannot be found, so keep the list limited to packages that exist in the Ubuntu and NVIDIA repositories.

3.3 Configure environment variables

Add the following to ~/.bashrc:

Code snippet

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda

# For NCCL and other CUDA libraries (optional)
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH

# For cuDNN (optional)
export CUDNN_INCLUDE_DIR=/usr/include/
export CUDNN_LIBRARY=/usr/lib/x86_64-linux-gnu/

Then apply the changes:

Code snippet

source ~/.bashrc
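
If the ~/.bashrc edit above is scripted rather than done by hand, re-running the script would append the same export lines again. A minimal sketch of an idempotent append follows; `append_once` is a hypothetical helper name introduced here for illustration.

```shell
# append_once LINE FILE: append LINE to FILE only if no identical line
# exists yet. The file is created if it is missing.
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Usage (writes to your real ~/.bashrc, so it is left commented out):
#   append_once 'export CUDA_HOME=/usr/local/cuda' "$HOME/.bashrc"
#   append_once 'export PATH=/usr/local/cuda/bin:$PATH' "$HOME/.bashrc"
```

The single quotes in the usage lines matter: they keep `$PATH` unexpanded so that the literal text lands in the file and is expanded each time the shell starts.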

Verify the CUDA installation:

Code snippet

nvcc --version
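
Note that `nvidia-smi` reports the highest CUDA version the driver supports, while `nvcc --version` reports the toolkit that is actually installed, and the build uses the latter. As a rough sketch, the toolkit release can be extracted and compared in a script; `cuda_release` is a hypothetical helper name, and the expected value 11.7 is simply the version this guide installs.

```shell
# cuda_release: read `nvcc --version` output on stdin and print the
# toolkit release as MAJOR.MINOR.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Usage:
#   [ "$(nvcc --version | cuda_release)" = "11.7" ] || echo "unexpected CUDA toolkit version"
```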

Step 4: Install Text Generation Inference

4.1 Clone the repository and build

Code snippet

git clone https://github.com/huggingface/text-generation-inference.git && cd text-generation-inference

# Install with specific CUDA version (adjust as needed)
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels

# Alternative for CPU-only:
# BUILD_EXTENSIONS=False make install

Common issues:
1. Out of memory: if the build is killed for lack of memory, add swap space:

Code snippet

sudo fallocate -l 16G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile && free -h

2. CMake errors: make sure a recent CMake is installed:
Code snippet
```
sudo snap install cmake --classic
```
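
For the swap file from item 1: swap enabled with `swapon` is gone after a reboot unless /etc/fstab lists it. The sketch below checks for an existing entry before adding one; `has_swap_entry` is a hypothetical helper name, and the fstab line in the usage comment is the conventional form for a swap file.

```shell
# has_swap_entry PATH [FSTAB]: succeed if FSTAB (default /etc/fstab)
# already has a line whose first field is PATH.
has_swap_entry() {
    awk -v p="$1" '$1 == p { found = 1 } END { exit !found }' "${2:-/etc/fstab}"
}

# Usage (needs root, so it is left commented out):
#   has_swap_entry /swapfile || echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```

Commented-out fstab lines are not counted, because their first field starts with `#`.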

4.2 Set up a Python environment (optional but recommended)

Use conda or venv to create an isolated environment. Create and activate it before running make install in 4.1 so that the server is installed into the same environment:

Code snippet

python3 -m venv tgi-env
source tgi-env/bin/activate

pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install transformers sentencepiece protobuf accelerate bitsandbytes scipy safetensors

# Install the TGI Python client (optional)
pip install text-generation

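Step 5: Run the TGI server

5.1 GPU launch example (Llama 2)

Llama 2 is a gated model, so log in with a Hugging Face account and access token first; the launcher then downloads the weights on first start:

Code snippet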

huggingface-cli login # Enter your token when prompted 

text-generation-launcher --model-id meta-llama/Llama-2-7b-chat-hf --num-shard=1 --quantize bitsandbytes-nf4 --max-input-length=2048 --max-total-tokens=4096 --port=8080

5.2 CPU launch example (slower)

Code snippet

text-generation-launcher --model-id meta-llama/Llama-2-7b-chat-hf --num-shard=1 --max-input-length=512 --max-total-tokens=1024 --port=8080 --disable-custom-kernels
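
Whichever mode you launch, the server only accepts requests after the weights have been downloaded and loaded, which can take several minutes. One way to wait for that in a script is to poll the server's `/health` route. The sketch below is a generic retry loop; `retry` is a hypothetical helper name, and the 60 tries at 5 seconds each in the usage line are arbitrary values chosen for illustration.

```shell
# retry TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts.
retry() {
    tries=$1 delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$tries" ] && return 1
        n=$(( n + 1 ))
        sleep "$delay"
    done
}

# Usage: poll the health route for up to about five minutes.
#   retry 60 5 curl -sf -o /dev/null http://localhost:8080/health && echo "TGI is ready"
```

Keeping the probe command as an argument means the same loop works for the Docker and Kubernetes deployments by changing only the URL.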

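5.3 Run with Docker (recommended for production)

If you would rather not deal with the dependency chain, use the official Docker image. GPU access from a container requires the NVIDIA Container Toolkit on the host, and a gated model such as Llama 2 also needs your access token passed in, for example with -e HUGGING_FACE_HUB_TOKEN=<your token> (HF_TOKEN in newer releases):

Code snippet

docker run -d \
    --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v "$PWD/data:/data" \
    -e MODEL_ID=meta-llama/Llama-2-7b-chat-hf \
    -e NUM_SHARD=1 \
    -e QUANTIZE=bitsandbytes-nf4 \
    ghcr.io/huggingface/text-generation-inference:latest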

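Solutions to common problems

Q1: `libtensorflow.so not found` error

TGI itself runs on PyTorch, so this error usually comes from another package in the environment that links against TensorFlow. If you do need the library, set TF_VERSION to the libtensorflow release that matches your setup, then download it and register it with the dynamic linker:

Code snippet

wget "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-gpu-linux-x86_64-${TF_VERSION}.tar.gz"
tar xzf "libtensorflow-gpu-linux-x86_64-${TF_VERSION}.tar.gz"
sudo cp lib/* /usr/local/lib/
sudo ldconfig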

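Q2: `CUDA out of memory` error

Try the following:
1. Shrink the batch budget: lower --max-batch-total-tokens (and --max-batch-prefill-tokens) so that fewer tokens are held on the GPU at once.
2. Enable quantization: use --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4.
3. Reduce the maximum token counts: lower --max-input-length and --max-total-tokens.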
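
To see why the token limits matter, it helps to estimate the key/value cache that each sequence occupies. The sketch below uses the textbook estimate of 2 (keys and values) × layers × hidden size × bytes per value × tokens, with the published Llama-2-7B shape of 32 layers and hidden size 4096 in fp16. It is a back-of-the-envelope figure only: it ignores the model weights, activations, and allocator overhead in TGI, and `kv_cache_mib` is a hypothetical helper name.

```shell
# kv_cache_mib LAYERS HIDDEN BYTES_PER_VALUE TOKENS
# Estimate = 2 (K and V) * LAYERS * HIDDEN * BYTES_PER_VALUE * TOKENS, in MiB.
kv_cache_mib() {
    echo $(( 2 * $1 * $2 * $3 * $4 / 1048576 ))
}

# Llama-2-7B: 32 layers, hidden size 4096, fp16 (2 bytes per value).
kv_cache_mib 32 4096 2 4096   # one sequence at --max-total-tokens=4096 -> 2048
kv_cache_mib 32 4096 2 1024   # one sequence at --max-total-tokens=1024 -> 512
```

Under these assumptions each token costs about 0.5 MiB, so cutting the total token limit from 4096 to 1024 frees roughly 1.5 GiB per concurrent sequence. Weight quantization reduces the memory taken by the weights, not this cache.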

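Q3: `Error loading model weights`

Possible causes and fixes:
1. Incomplete model download: delete the cached copy and download again. Note that the command below removes every cached model, not only the broken one.

Code snippet

rm -rf ~/.cache/huggingface/hub/models--*

2. Not enough disk space: free up space or point the cache at a larger drive.
Code snippet
```
export HF_HOME=/path/to/larger/drive
```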
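
Before downloading again, it is worth confirming that the cache location really has room. A minimal sketch follows; `free_gib` is a hypothetical helper name, and the `NEED_GIB` value in the usage comment is an assumption (roughly the size of fp16 weights for a 7B model plus some headroom), not a figure from TGI.

```shell
# free_gib DIR: print the free space on the filesystem that holds DIR,
# in whole GiB. DIR must exist.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Usage: NEED_GIB is an assumed size, adjust it to the model you download.
#   NEED_GIB=15
#   cache_dir=${HF_HOME:-$HOME/.cache/huggingface}
#   [ "$(free_gib "$cache_dir")" -ge "$NEED_GIB" ] || echo "not enough space in $cache_dir"
```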

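TGI performance tuning tips

Use Flash Attention
TGI enables Flash Attention automatically when the GPU supports it and the flash-attention kernels are installed, so there is no launcher flag to turn it on. Check the start-up log to confirm that it is active. To rule it out while debugging, it can be switched off with an environment variable:
Code snippet
```
export USE_FLASH_ATTENTION=False
```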

Adjust parallelism
Set the shard count to match the number of GPUs:

Code snippet

# For multi-GPU systems  
text-generation-launcher ... --num-shard=$NUM_GPUS
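
The `$NUM_GPUS` variable above has to come from somewhere. One option, sketched here, is to count the devices listed by `nvidia-smi -L`, which prints one line per GPU starting with `GPU <index>:`. The helper name `count_gpus` is hypothetical, and the launcher flags in the usage line are the ones used earlier in this article.

```shell
# count_gpus: read `nvidia-smi -L` output on stdin and print the number
# of physical GPUs listed.
count_gpus() {
    grep -c '^GPU [0-9]'
}

# Usage:
#   NUM_GPUS=$(nvidia-smi -L | count_gpus)
#   text-generation-launcher --model-id meta-llama/Llama-2-7b-chat-hf --num-shard="$NUM_GPUS" --port=8080
```

Indented sub-device lines, such as MIG instances, are not counted because they do not start at the beginning of the line.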

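Monitor resource usage
A useful combination of tools:

Code snippet

# GPU monitoring
nvtop

# CPU and RAM monitoring
htop

# Check that the server is listening (netstat is in the net-tools package)
watch "netstat -tulnp | grep text-gen"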

Testing the API

Once the server is running, test it with curl:

Code snippet

curl http://localhost:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

# Expected response format:
{
 "generated_text": "Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers to model complex patterns in data..."
}

For a streaming response:

Code snippet

curl http://localhost:8080/generate_stream \
    -X POST \
    -d '{"inputs":"Explain AI in simple terms","parameters":{"max_new_tokens":50}}' \
    -H 'Content-Type: application/json' \
    -H 'Accept: text/event-stream' \
    --no-buffer
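
The streaming endpoint answers in server-sent events format, where each event is a line of the form `data:{...}` followed by a blank line. To post-process the stream, the JSON payloads have to be separated from that framing first. The sketch below does only that; `sse_payloads` is a hypothetical helper name, and the `.token.text` field in the usage comment is an assumption about the event schema that you should confirm against the output of your TGI version.

```shell
# sse_payloads: read a text/event-stream on stdin and print only the
# JSON payloads, one per line, by stripping the leading "data:" field name.
sse_payloads() {
    sed -n 's/^data: *//p'
}

# Usage (requires jq): print the generated text without the JSON wrapping.
#   curl -sN http://localhost:8080/generate_stream -X POST \
#       -H 'Content-Type: application/json' \
#       -d '{"inputs":"Explain AI in simple terms","parameters":{"max_new_tokens":50}}' \
#     | sse_payloads | jq -rj '.token.text'
```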

Docker Compose deployment example (recommended for production)

Create a docker-compose.yml file. NUM_SHARD must be an integer; set it to the number of GPUs to shard across:

Code snippet

version: '3.8'

services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    container_name: tgi-server
    ports:
      - "8080:80"
    environment:
      MODEL_ID: meta-llama/Llama-2-7b-chat-hf
      NUM_SHARD: "1"
      QUANTIZE: bitsandbytes-nf4
      MAX_INPUT_LENGTH: "2048"
      MAX_TOTAL_TOKENS: "4096"
    volumes:
      # Persistent model cache; remove to re-download on every start
      - model-cache:/data
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 16G
        reservations:
          devices:
            # Pass through all GPUs
            # Remove this block if running on CPU
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  # Optional reverse proxy example (uncomment if needed)
  # nginx-proxy:
  #   image: nginxproxy/nginx-proxy
  #   ports:
  #     - "80:80"
  #   volumes:
  #     - /var/run/docker.sock:/tmp/docker.sock:ro
  #   depends_on:
  #     - tgi

networks:
  default:
    name: tgi-network

volumes:
  model-cache:

Then run:

docker compose up -d

Kubernetes deployment example (advanced users)

Create tgi-deployment.yaml. Replace each <...> placeholder with a value that suits your cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-deployment
spec:
  replicas: <replicas>
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          ports:
            - containerPort: 80
          env:
            - name: MODEL_ID
              value: meta-llama/Llama-2-7b-chat-hf
            - name: NUM_SHARD
              value: "1"
            - name: QUANTIZE
              value: bitsandbytes-nf4
          resources:
            limits:
              cpu: <cpu-limit>m
              memory: <memory-limit>Gi
              nvidia.com/gpu: <gpu-count>
            requests:
              cpu: <cpu-request>m
              memory: <memory-request>Gi
          volumeMounts:
            - mountPath: /data
              name: model-cache
      volumes:
        - name: model-cache
          emptyDir:
            sizeLimit: <cache-size>Gi
      nodeSelector:
        kubernetes.io/hostname: gpu-node
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

Apply the configuration:

kubectl apply -f tgi-deployment.yaml
kubectl expose deployment tgi-deployment --type=LoadBalancer --port=<service-port> --target-port=80

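Summary

This article walked through installing Text Generation Inference on Ubuntu 22.04 LTS and fixing the problems that commonly come up along the way. Key points:

Make sure the system meets the minimum requirements, including enough RAM and a working GPU driver

Install the Rust toolchain and the CUDA environment correctly

Deploy the TGI server from source or with Docker

Tune performance for your hardware

Prefer a containerized deployment for easier maintenance and scaling

With these steps you should be able to deploy a high-performance text generation inference server on Ubuntu 22.04. If problems remain, consult the official documentation or ask the community for help.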
