Essential for Developers in 2024: A Complete Step-by-Step Guide to Configuring Ollama and Large Language Models on Ubuntu 20.04

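Introduction

Ollama is an open-source project that lets developers run and manage large language models (LLMs) locally with little setup. As AI tooling moved quickly through 2024, being able to deploy a model on your own machine became a practical skill for developers. This article walks through setting up Ollama on Ubuntu 20.04 so that you can run popular models such as Llama and Mistral.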

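Prerequisites

Before you begin, make sure your system meets the following requirements:

Ubuntu 20.04 LTS (other Linux distributions may need some commands adjusted)
At least 16GB RAM (Ollama's README asks for at least 8GB to run 7B-parameter models; 16GB leaves headroom for the OS and larger contexts)
50GB of free disk space
An NVIDIA GPU (recommended), or CPU-only mode
A stable network connection (for downloading models)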

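Checking System Information

First, confirm what your system has:

Code snippet

# Show the Ubuntu version
lsb_release -a

# Show installed and available memory
free -h

# Show disk space
df -h

# Check the NVIDIA GPU and driver (if you have one)
nvidia-smi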
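
The same checks can be scripted. The sketch below is a minimal Python example, assuming a Linux host that exposes /proc/meminfo. The thresholds mirror the requirements listed earlier; the function names and the 10% RAM tolerance (a 16GB machine reports slightly less than 16 GiB as MemTotal) are illustrative choices, not something Ollama prescribes.

```python
import os
import shutil

MIN_RAM_GB = 16      # threshold from the requirements list
MIN_DISK_GB = 50     # threshold from the requirements list
RAM_TOLERANCE = 0.9  # assumption: allow for memory reserved by firmware/kernel


def read_total_ram_gb(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / (1024 ** 2)
    raise ValueError("MemTotal line not found")


def meets_requirements(ram_gb, free_disk_gb):
    """Compare measured RAM and free disk space against the thresholds."""
    return ram_gb >= MIN_RAM_GB * RAM_TOLERANCE and free_disk_gb >= MIN_DISK_GB


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        ram = read_total_ram_gb(f.read())
    disk = shutil.disk_usage("/").free / (1024 ** 3)
    print(f"RAM: {ram:.1f} GiB, free disk: {disk:.1f} GiB")
    print("OK" if meets_requirements(ram, disk) else "Below the recommended minimum")
```

The disk check looks at the root filesystem; if your models will live on another mount, pass that path to shutil.disk_usage instead.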

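Step 1: Install Base Dependencies

The steps below rely on a few base packages:

Code snippet

# Update the package lists and upgrade installed packages
sudo apt update && sudo apt upgrade -y

# Install base dependencies
sudo apt install -y curl wget git build-essential libssl-dev zlib1g-dev \
     libbz2-dev libreadline-dev libsqlite3-dev llvm \
     libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev \
     libffi-dev liblzma-dev python3-pip python3-venv

How it works: Ollama itself ships as a prebuilt binary, so its install script only needs curl. The remaining packages provide a compiler toolchain, a Python environment, and development libraries; they serve the Python tooling in the later optional steps rather than Ollama directly.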

Step 2: Install Docker (Recommended)

Ollama can be installed directly, but running it in a Docker container isolates the environment and simplifies management:

Code snippet

# Install Docker's prerequisites
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common

# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Add the Docker repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io

# Add the current user to the docker group (avoids typing sudo every time)
sudo usermod -aG docker $USER && newgrp docker

# Verify the installation
docker run hello-world

Notes:
1. Membership in the docker group is equivalent to root access, so grant it with care
2. newgrp only affects the current shell; log out and back in for the group change to apply everywhere

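Step 3: Install the NVIDIA Container Toolkit (GPU Users)

If you have an NVIDIA GPU and want to accelerate inference:

Code snippet

# Add the NVIDIA Container Toolkit repository
# (apt-key is deprecated on newer releases but still works on Ubuntu 20.04)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the NVIDIA container runtime
sudo apt update && sudo apt install -y nvidia-container-toolkit nvidia-docker2

# Restart Docker so the configuration takes effect
sudo systemctl restart docker

# Verify that containers can see the GPU
# (short tags such as 11.0-base are no longer published; use a full tag)
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Practical notes:
1. Pick a CUDA image tag that your installed driver supports (11.0.3 is used here as a widely compatible choice)
2. GPU acceleration speeds up inference considerably, especially for larger models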

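Step 4: Install Ollama

Now install Ollama itself:

Code snippet

# Official one-line install script (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Docker users can pull the container image instead (optional)
docker pull ollama/ollama:latest

How it works:
1. The official script detects your system architecture and downloads the matching build
2. The Docker route gives better isolation, which suits production environments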

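Step 5: Start the Ollama Service

Choose the start method that matches how you installed Ollama:

Systemd (Direct Install)

Code snippet

# The install script registers Ollama as a systemd service and starts it
systemctl status ollama.service

# To control the service manually:
sudo systemctl start ollama    # start the service
sudo systemctl stop ollama     # stop the service
sudo systemctl enable ollama   # start at boot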

Starting a Container with Docker

Code snippet

# CPU mode (for machines without a GPU); the named volume keeps downloaded models
docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama:latest

# GPU mode (complete the NVIDIA setup first); run this instead of the CPU container, not alongside it
docker run -d --gpus all --name ollama-gpu -v ollama:/root/.ollama -p 11434:11434 ollama/ollama:latest

# Docker Compose is the better fit for production:
mkdir ~/ollama && cd ~/ollama
cat > docker-compose.yml <<EOF
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "11434:11434"
    volumes:
      - ./models:/root/.ollama/models
EOF

docker-compose up -d
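
Whichever start method you chose, one way to confirm that the server is answering on port 11434 is to ask it which models it holds. The sketch below is a minimal Python example, assuming the default address used in this guide and Ollama's /api/tags endpoint, which returns a JSON object with a "models" list. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def model_names(tags_response):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in (tags_response.get("models") or [])]


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Query the running server; raises URLError if nothing is listening."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling list_local_models() against a running server should return the names of the models pulled so far; a connection error usually means the service or container is not up, or the port is not published.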

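Step 6: Download and Run Your First Model

You can now download and use large language models:

Code snippet

# Llama 2 is one of the most popular open models; start with the 7B version:
ollama pull llama2:7b

# CPU users can pick an explicitly quantized tag to reduce resource use
# (tag names vary by model; check the model's tag list in the Ollama library):
ollama pull llama2:7b-chat-q4_0

# Mistral is another strong open model:
ollama pull mistral

Once the download finishes, you can talk to the model:

Code snippet

ollama run llama2:7b "Explain the basic principles of quantum computing"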

Or access it through the REST API:

Code snippet

curl http://localhost:11434/api/generate \
    -d '{
        "model": "llama2",
        "prompt": "Why is the sky blue?",
        "stream": false,
        "options": {
            "temperature": 0.7,
            "num_ctx": 2048,
            "top_p": 0.9,
            "top_k": 40,
            "repeat_penalty": 1.18,
            "stop": ["</s>"]
        }
    }'
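
The same request can be issued from a script. The following is a minimal Python sketch using only the standard library, assuming the server is reachable at the default address and that the non-streaming reply carries the generated text in a "response" field, as Ollama's generate endpoint does. The function names are illustrative.

```python
import json
import urllib.request


def build_generate_payload(model, prompt, **options):
    """Build the JSON body for POST /api/generate with streaming disabled."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(model, prompt, base_url="http://localhost:11434", **options):
    """Send the request and return the generated text from the 'response' field."""
    body = json.dumps(build_generate_payload(model, prompt, **options)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]
```

For example, generate("llama2", "Why is the sky blue?", temperature=0.7, num_ctx=2048) sends a subset of the options shown in the curl call. The generous timeout is a guess to accommodate slow CPU-only inference.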

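Practical notes:
1. Suffixes such as q4_0 indicate the quantization level: the lower the number, the fewer bits per weight, which lowers memory use at some cost in accuracy
2. In API calls, parameters such as temperature let you trade off determinism against diversity in the generated text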
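
To make the effect of quantization concrete, here is a back-of-the-envelope estimate in Python. It is only a sketch: it counts weight storage as parameters times bits per weight, ignoring the KV cache and runtime overhead, and real quantized files are somewhat larger because formats such as q4_0 also store per-block scale factors.

```python
def approx_weight_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes): params x bits / 8."""
    return num_params * bits_per_weight / 8 / 1e9


for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B model, {label}: ~{approx_weight_gb(7e9, bits):.1f} GB")
```

By this estimate a 7B model drops from roughly 14 GB at 16 bits to roughly 3.5 GB at 4 bits, which is why quantized variants fit on machines with modest memory.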

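(Optional) Step 7: Set Up a Web UI

The command line is already capable, but a web interface makes for a nicer interactive experience:

Deploying Open WebUI

Code snippet

git clone https://github.com/open-webui/open-webui.git
cd open-webui

cat > .env <<EOF
# Recent Open WebUI releases read OLLAMA_BASE_URL (older ones used OLLAMA_API_BASE_URL)
OLLAMA_BASE_URL=http://localhost:11434
WEBUI_SECRET_KEY=your-secret-key-here
EOF

docker-compose up -d

Then open the web UI in your browser. The container listens on port 8080; the repository's compose file decides which host port that maps to (http://localhost:3000 by default in recent versions), so check its ports entry if http://localhost:8080 does not respond. You will see a ChatGPT-like interface.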

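(Optional) Step 8: Performance Tuning

Adjust the configuration to your hardware to get the best performance:

CPU Tuning

Add the following to a systemd drop-in for the service (run sudo systemctl edit ollama.service, which creates /etc/systemd/system/ollama.service.d/override.conf). Comments must sit on their own lines, because systemd does not support trailing comments:

Code snippet

[Service]
# Number of requests each loaded model serves in parallel
Environment="OLLAMA_NUM_PARALLEL=4"
# Maximum number of models kept loaded at the same time
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Raise the file descriptor limit
LimitNOFILE=65536

Then run:

Code snippet

sudo systemctl daemon-reload
sudo systemctl restart ollama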

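GPU Tuning

For NVIDIA GPU users, create a second drop-in file, then reload and restart the service as in the CPU section:

Code snippet

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo nano /etc/systemd/system/ollama.service.d/cuda.conf

[Service]
# Index of the GPU(s) Ollama may use
Environment="CUDA_VISIBLE_DEVICES=0"
# TF_FORCE_UNIFIED_MEMORY and XLA_FLAGS are TensorFlow/XLA settings;
# Ollama does not read them, so they are left out here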

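When VRAM Runs Short

If you hit an out-of-memory error (CUDA out of memory), try one of the following:

Use a smaller quantization: for example, llama2:13b-chat-q3_K_S needs less VRAM than the q4_K_M variant
Use memory-mapped loading: Ollama exposes this as the use_mmap model option in request options; OLLAMA_MMAP is not a documented environment variable
Reduce the context length: lower num_ctx from 2048 to 1024 or less, via PARAMETER num_ctx in a Modelfile or options.num_ctx in an API request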
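
To see why a shorter context helps, the following Python sketch estimates the size of the attention key/value cache, which grows linearly with num_ctx. The model shape is an assumption (32 layers and a hidden size of 4096, as in Llama 2 7B, with a 16-bit cache); models that use grouped-query attention need less, and the actual allocation depends on the runtime.

```python
def kv_cache_bytes(n_layers, hidden_size, n_ctx, bytes_per_value=2):
    """Keys and values: 2 tensors x layers x context tokens x hidden size x bytes."""
    return 2 * n_layers * n_ctx * hidden_size * bytes_per_value


# Assumed Llama-2-7B shape: 32 layers, hidden size 4096, 16-bit cache
for ctx in (2048, 1024, 512):
    mib = kv_cache_bytes(32, 4096, ctx) / (1024 ** 2)
    print(f"num_ctx={ctx}: ~{mib:.0f} MiB")
```

Under these assumptions, halving num_ctx from 2048 to 1024 frees about half a gibibyte, on top of whatever a smaller quantization saves in weights.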

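(Optional) Step 9: Managing Multiple Models

As you go deeper, you will probably download several models and versions. These commands help keep them in order:

Code snippet

ollama list                                 # list every downloaded model and its variants
ollama cp llama2:7b my-finetuned-model      # copy a base model as a starting point for customization
ollama rm mistral:instruct                  # delete a model you no longer need to free space
ollama show llama2:13b --modelfile          # print the model's Modelfile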

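Advanced users can also write a custom Modelfile to combine capabilities or add a system prompt. For example, to create a variant dedicated to code generation:

Code snippet

FROM llama2:7b
SYSTEM """You are a professional programming assistant. Prioritize correctness and best practices in your answers."""
PARAMETER temperature 0.3
PARAMETER num_predict 512
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

The TEMPLATE above uses ChatML-style markers; keep it only if the base model was trained on that format, otherwise omit TEMPLATE so the variant inherits the base model's template.

Save the file as codellamamodelfile, then run:

Code snippet

# code-assistant is an arbitrary name for the new model
ollama create code-assistant -f codellamamodelfile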
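
If you maintain several variants, generating the Modelfile from a script keeps them consistent. The sketch below is a minimal Python example that emits only the FROM, SYSTEM, and PARAMETER instructions discussed above. The function name and argument layout are illustrative, and TEMPLATE is deliberately left out so the base model's template is inherited.

```python
def render_modelfile(base, system=None, parameters=None):
    """Return Modelfile text with FROM, optional SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}"]
    if system:
        lines.append(f'SYSTEM """{system}"""')
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama2:7b",
    system="You are a professional programming assistant.",
    parameters={"temperature": 0.3, "num_predict": 512},
)
print(text)
```

Write the returned text to a file and pass that file to ollama create with the -f flag.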

(Optional) Step 10: Production Deployment Recommendations

If you plan to run Ollama as a long-lived API service, consider the following hardening measures:

1. Reverse Proxy and Security Hardening

Code snippet

location /api {
    proxy_pass http://localhost:11434;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
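
Once basic authentication is enabled on the proxy, clients have to send credentials with every request. The following Python sketch shows one way to build such a request with the standard library. The host name and credentials passed in are placeholders you would supply, and the helper names are illustrative.

```python
import base64
import json
import urllib.request


def basic_auth_header(username, password):
    """Encode credentials as an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def authed_generate_request(base_url, username, password, model, prompt):
    """Build (but do not send) an authenticated POST to the proxied /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(username, password),
        },
    )
```

Pass the returned object to urllib.request.urlopen to send it. Basic authentication only encodes the credentials, so it should be used over HTTPS.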

2. Monitoring and Log Collection

Code snippet

# Export the service's journal entries for 2024 to a file
journalctl -u ollama.service --since "2024-01-01" --until "2024-12-31" > ollama-log.txt
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/data
grafana-server --config=/etc/grafana/grafana.ini

3. Automated Backup Strategy

Code snippet

crontab -e

# Sync the model directory at minute 30 of every hour
# (~/.ollama/models for a user install; the systemd service keeps models under /usr/share/ollama/.ollama/models)
30 * * * * rsync -avz ~/.ollama/models user@backupserver:/mnt/nas/modelsbackup$(date +\%Y\%m\%d)
# Dump the metrics database every 6 hours
0 */6 * * * pg_dump -U postgres -d ollamametrics > backup.sql

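(Optional) Step 11: Fine-Tuning on Custom Data

This article focuses on deploying pretrained models, but a brief look at fine-tuning on your own data is worthwhile. Take a model specialized for legal Q&A as an example.

Prepare the dataset (legalqa.jsonl) in this format:

Code snippet

{"text":"<s>[INST]What does the Civil Code of China say about lease contracts?[/INST]According to Article 703 of the Civil Code of the People's Republic of China..."}
{"text":"<s>[INST]How many days of notice are required to terminate a labor contract?[/INST]Under Article 37 of the Labor Contract Law..."}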
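
Writing these records by hand is error-prone, since each line must be a complete JSON object. The Python sketch below shows one way to produce the file from question/answer pairs. The function names are illustrative; the output layout follows the format shown above.

```python
import json


def format_example(question, answer):
    """Wrap one Q&A pair in the <s>[INST]...[/INST] layout as a JSON line."""
    return json.dumps({"text": f"<s>[INST]{question}[/INST]{answer}"}, ensure_ascii=False)


def write_dataset(pairs, path="legalqa.jsonl"):
    """Write (question, answer) pairs to a JSONL file, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            f.write(format_example(question, answer) + "\n")
```

Setting ensure_ascii=False keeps non-ASCII text (for example Chinese legal terms) readable in the file instead of escaping it.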

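Then run the following. Here finetune.py stands for your own training script (for example, one built on the transformers Trainer); it is not shipped with any of the packages installed below. The base checkpoint shown is an assumed Llama 2 variant, so substitute the one you have access to, and treat the hyperparameters as starting points:

Code snippet

pip install transformers datasets accelerate peft bitsandbytes wandb tensorboard pyarrow

python finetune.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --train_file legalqa.jsonl \
    --output_dir legal-llama2 \
    --per_device_train_batch_size 12 \
    --gradient_accumulation_steps 24 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 3 \
    --fp16 True \
    --logging_steps 100 \
    --save_total_limit 3 \
    --push_to_hub True \
    --hub_model_id yourusername/law-llama2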

When training finishes, you can load the custom model:

Code snippet

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and move it to the GPU so it matches the inputs
model = AutoModelForCausalLM.from_pretrained("yourusername/law-llama2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("yourusername/law-llama2")
input_text = "<s>[INST]What are the legal risks of transferring company equity?[/INST]"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Or wrap it in an Ollama Modelfile to integrate it into your existing deployment:

Code snippet

# Path to the local fine-tuned weights (a safetensors directory or a converted GGUF file)
FROM ./legal-llama2
SYSTEM """You are a professional legal assistant and only give advice consistent with Chinese law."""
# Prompt layout matching the [INST] ... [/INST] format of the training data
TEMPLATE """[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]"""
PARAMETER temperature 0.5
PARAMETER repeat_penalty 1.1
PARAMETER top_k 40
PARAMETER stop "</s>"

Save it as lawyer.modelfile, then create the new model:

Code snippet

ollama create lawyer-v0.1 -f lawyer.modelfile
ollama run lawyer-v0.1

(Optional) Step 12: Clustered Deployment

When a single machine is not enough, consider a distributed setup. Two common architectures follow.

Option A: NFS Shared Storage + Load Balancing Across Nodes

Main node (192.168.1.10):

Code snippet

apt install nfs-kernel-server
mkdir -p /mnt/models
chown nobody:nogroup /mnt/models
chmod 777 /mnt/models
echo "/mnt/models 192.168.1.0/24(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -a
systemctl restart nfs-kernel-server
# Run from the machine that already holds the downloaded models
scp -r ~/.ollama/models/* root@192.168.1.10:/mnt/models/

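Worker nodes (192.168.1.2X):

Code snippet

apt install nfs-common
mkdir -p /mnt/models
mount -t nfs4 192.168.1.10:/mnt/models /mnt/models
# Each worker runs a standalone Ollama container that reads models from the NFS share;
# Ollama has no built-in clustering flags, so the Nginx load balancer spreads the requests
docker run -d --name ollama-nodeX -p 30080:11434 -v /mnt/models:/mnt/models -e OLLAMA_MODELS=/mnt/models ollama/ollama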

Load balancer (Nginx configuration):

Code snippet

upstream ollama_cluster_nodes {
    server 192.168.1.21:30080;
    server 192.168.1.22:30080;
    server 192.168.1.23:30080;
}

server {
    listen 443 ssl http2;
    # ssl_certificate and ssl_certificate_key directives are required here

    location /api {
        proxy_pass http://ollama_cluster_nodes;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
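
With no balancing method specified, an Nginx upstream group distributes requests round-robin. The short Python sketch below illustrates what that means for three worker nodes; the addresses are assumed to be 192.168.1.21 through 192.168.1.23 on port 30080, and the code is a simulation of the policy, not of Nginx itself.

```python
from collections import Counter
from itertools import cycle

NODES = ["192.168.1.21:30080", "192.168.1.22:30080", "192.168.1.23:30080"]


def assign_round_robin(num_requests, nodes=NODES):
    """Return the node chosen for each request under plain round-robin."""
    picker = cycle(nodes)
    return [next(picker) for _ in range(num_requests)]


counts = Counter(assign_round_robin(9))
print(counts)
```

Round-robin balances request counts, not load. Because generation requests vary widely in duration, adding the least_conn directive to the upstream block may spread work more evenly.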

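Option B: Kubernetes Operator (Recommended for Production)

First deploy OLM (Operator Lifecycle Manager); its CRDs must exist before the OLM components are applied:

Code snippet

kubectl create -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.22.0/crds.yaml
kubectl apply -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.22.0/olm.yaml

Then install the Ollama Operator (the chart repository below is illustrative; use the repository of the operator you actually deploy):

Code snippet

helm repo add ollama-operator https://charts.ollama.ai
helm install my-ollama-operator ollama-operator/ollama-operator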

Create a custom resource instance (the API group and field names depend on the operator's CRD):

Code snippet

cat > ollama-cluster.yaml <<EOF
apiVersion: ai.ollama.com/v1alpha1
kind: Cluster
metadata:
  name: production-cluster
spec:
  nodeSize: 3
  modelStorageClass: nfs-client
  resources:
    limits: {cpu: 8, memory: 32Gi, nvidia.com/gpu: 2}
  tolerations:
    - {key: dedicated, operator: Equal, value: "true", effect: NoSchedule}
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - {key: node-role.kubernetes.io/worker, operator: In, values: [ai-inferencing]}
EOF