Building a Local Deployment with TypeScript and LiteLLM: A Complete Hands-On Guide

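Introduction

In AI application development today, being able to stand up a local large language model (LLM) service quickly matters more and more. This guide walks through building an LLM service with TypeScript and LiteLLM that runs on your own machine, so that you can:
– Keep full control over data privacy
– Avoid depending on cloud service APIs
– Customize model behavior freely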

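Prerequisites

Environment Requirements

Node.js (v16 or later)
npm or yarn
TypeScript (installed below)
Python environment (needed for some local models)

Background Knowledge

Basic understanding of JavaScript/TypeScript
REST API concepts
Basic command-line usage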

Project Initialization

First, create a new project directory and initialize it:

Code snippet

mkdir ts-litemodel && cd ts-litemodel
npm init -y

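Install TypeScript and the related development dependencies (ts-node-dev is used later by the dev script):

Code snippet

npm install typescript ts-node-dev @types/node --save-dev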
npx tsc --init

Install the LiteLLM library along with Express and dotenv:

Code snippet

npm install litellm express @types/express dotenv

Configuring TypeScript

Edit tsconfig.json so that it contains the following key settings:

Code snippet

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "moduleResolution": "node"
  }
}

Building the Basic Server

Create the file src/server.ts:

Code snippet

import express from 'express';
import dotenv from 'dotenv';
// The litellm npm package exposes a `completion` function rather than a client class
import { completion } from 'litellm';

dotenv.config();

const app = express();
const port = process.env.PORT || 3000;
// Falls back to the small GPT-2 model when MODEL_NAME is not set
const model = process.env.MODEL_NAME || 'gpt2';

// Middleware: parse JSON request bodies
app.use(express.json());

// API route definition
app.post('/api/chat', async (req, res) => {
  try {
    const { messages } = req.body;

    if (!messages || !Array.isArray(messages)) {
      res.status(400).json({ error: 'Invalid messages format' });
      return;
    }

    const response = await completion({
      model,
      messages,
      temperature: 0.7,
      max_tokens: 1000,
    });

    res.json(response);
  } catch (error) {
    console.error('Error processing chat request:', error);
    res.status(500).json({ error: 'Internal server error' });
  }
});

// Start the server
app.listen(port, () => {
  console.log(`Server running on http://localhost:${port}`);
});
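
The route above only checks that messages is an array. As a minimal sketch of a stricter check, the type guard below also validates every item. The ChatMessage type, the list of allowed roles, and the isChatMessageArray name are illustrative assumptions, not part of LiteLLM or Express.

```typescript
// Hypothetical message shape; mirrors the role/content objects used in this guide.
type ChatRole = 'system' | 'user' | 'assistant';

interface ChatMessage {
  role: ChatRole;
  content: string;
}

const ROLES: readonly string[] = ['system', 'user', 'assistant'];

// Returns true only for a non-empty array of well-formed messages.
function isChatMessageArray(value: unknown): value is ChatMessage[] {
  if (!Array.isArray(value) || value.length === 0) return false;
  return value.every((item) => {
    if (typeof item !== 'object' || item === null) return false;
    const { role, content } = item as Record<string, unknown>;
    return typeof role === 'string' && ROLES.includes(role) && typeof content === 'string';
  });
}
```

Inside the handler, the existing array check could be replaced with `if (!isChatMessageArray(messages))`. Note that this sketch also rejects an empty array, which the original check accepts.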

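The .env Configuration File

Create a .env file:

Code snippet

PORT=3000
MODEL_NAME=gpt2 # gpt2 is the default lightweight model; replace it with another supported model such as llama2

# LiteLLM-specific settings (only needed for models that require an API key)
# OPENAI_API_KEY=your_api_key_here
# ANTHROPIC_API_KEY=your_api_key_here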
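
Because environment variables always arrive as strings, it can help to validate them once at startup. The following is a minimal sketch that reads the two settings above and applies the same defaults as this guide; the AppConfig interface and the loadConfig name are hypothetical.

```typescript
// Hypothetical config shape for this guide's two settings.
interface AppConfig {
  port: number;
  modelName: string;
}

// Reads PORT and MODEL_NAME from an env-like object, applying the guide's defaults.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const rawPort = env['PORT'] ?? '3000';
  const port = Number.parseInt(rawPort, 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }
  const modelName = (env['MODEL_NAME'] ?? 'gpt2').trim();
  if (modelName === '') {
    throw new Error('MODEL_NAME must not be empty');
  }
  return { port, modelName };
}
```

In server.ts this would be called as `loadConfig(process.env)` after `dotenv.config()`.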

package.json Scripts

Add the following scripts to package.json:

Code snippet

"scripts": {
  "build": "tsc",
  "start": "node dist/server.js",
  "dev": "ts-node-dev src/server.ts"
}

Running the Project

Run in development mode (using ts-node-dev):

Code snippet

npm run dev

Or compile first and then run:

Code snippet

npm run build && npm start

Testing the API

You can test the API with curl:

Code snippet

curl -X POST http://localhost:3000/api/chat \
-H "Content-Type: application/json" \
-d '{
   "messages": [
     {"role": "user", "content": "Hello, please introduce yourself"}
   ]
}'

Alternatively, send the POST request with a tool such as Postman.
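
The same request can also be sent from TypeScript. This is a minimal sketch that assumes Node 18 or later (for the global fetch) and the server from this guide listening on localhost:3000; buildChatRequest and sendChat are hypothetical helper names.

```typescript
// Message shape used by the /api/chat route in this guide.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Builds the same request the curl command sends.
function buildChatRequest(messages: ChatMessage[]) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  };
}

// Sends the request and returns the parsed JSON body, whatever its shape.
async function sendChat(
  messages: ChatMessage[],
  baseUrl = 'http://localhost:3000',
): Promise<unknown> {
  const response = await fetch(`${baseUrl}/api/chat`, buildChatRequest(messages));
  if (!response.ok) {
    throw new Error(`Chat request failed with status ${response.status}`);
  }
  return response.json();
}
```

For example, `sendChat([{ role: 'user', content: 'Hello' }]).then(console.log)` prints whatever JSON the server returns.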

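Advanced LiteLLM Configuration

1. Switching models

Change MODEL_NAME in .env to switch models. LiteLLM supports a range of open-source and commercial models:

Code snippet

MODEL_NAME=llama2 # open-source model on HuggingFace; also requires HF_TOKEN=
# MODEL_NAME=gpt-3.5-turbo # commercial OpenAI model; requires OPENAI_API_KEY=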
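2. Local model deployment (using Llama 2 as an example)

To run entirely on your own machine without relying on any external API, follow these steps:

1. First install the Python dependencies (make sure Python 3 is available):

Code snippet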

pip install torch transformers sentencepiece protobuf==3.20.* huggingface-hub accelerate bitsandbytes xformers scipy peft datasets trl auto-gptq optimum einops autoawq vllm ninja --upgrade --quiet --no-cache-dir --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu118

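2. Then change your code to call the locally served model:

Code snippet

// Assumes `completion` is imported from 'litellm'; the `baseUrl` option name is an assumption to verify in the package docs
const response = await completion({
  model: 'huggingface/meta-llama/Llama-2-7b-chat-hf',
  baseUrl: 'http://localhost:8000', // assumes the vLLM server is already running locally
  messages,
});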

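3. Start the vLLM server (requires a GPU). The OpenAI-compatible entry point is used here so that chat-style requests are accepted, and no quantization flag is passed because this checkpoint is not AWQ-quantized:

Code snippet

python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf \
--tokenizer hf-internal-testing/llama-tokenizer \
--tensor-parallel-size <number of GPUs> \
--trust-remote-code \
--dtype half \
--max-model-len <maximum context length>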

3. Streaming responses

Modify server.ts to add support for streaming responses:

Code snippet

// Requires `import { completion } from 'litellm';` at the top of server.ts
app.post('/api/chat-stream', async (req, res) => {
  try {
    const { messages } = req.body;

    if (!messages || !Array.isArray(messages)) {
      res.status(400).json({ error: 'Invalid messages format' });
      return;
    }

    // Set the Server-Sent Events response headers
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');

    const stream = await completion({
      model: process.env.MODEL_NAME || 'gpt2',
      messages,
      temperature: 0.7,
      stream: true, // enable streaming; max_tokens is omitted so output length is not capped
    });

    for await (const chunk of stream) {
      const content = chunk?.choices?.[0]?.delta?.content;
      if (content) {
        // Each SSE event is a `data:` line followed by a blank line
        res.write(`data: ${JSON.stringify({ content })}\n\n`);
      }
    }

    res.end();
  } catch (error) {
    console.error('Error processing stream request:', error);
    if (!res.headersSent) {
      res.status(500).json({ error: 'Internal server error' });
    } else {
      // Headers are already sent once streaming starts, so just close the stream
      res.end();
    }
  }
});

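On the front end, read the stream with fetch and a stream reader. The browser EventSource API only issues GET requests, so it cannot call this POST route directly.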
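
EventSource only issues GET requests, so a client of the POST route has to read the response body itself and split it into events. The sketch below shows that parsing step for the `data: {...}` lines written by the route above; parseSseBuffer and contentOf are hypothetical helper names, and the parser handles only the `data:` field, not the full Server-Sent Events format.

```typescript
// Splits buffered SSE text into complete `data:` payloads plus the unfinished tail.
function parseSseBuffer(buffer: string): { events: string[]; rest: string } {
  const parts = buffer.split('\n\n');
  const rest = parts.pop() ?? ''; // the last piece may be an incomplete event
  const events = parts
    .map((block) =>
      block
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        .map((line) => line.slice(5).replace(/^ /, ''))
        .join('\n'),
    )
    .filter((data) => data !== '');
  return { events, rest };
}

// Extracts the `content` field written by the /api/chat-stream route.
function contentOf(event: string): string {
  const parsed = JSON.parse(event) as { content?: string };
  return parsed.content ?? '';
}
```

A client would append each decoded chunk from the response body to a buffer, call parseSseBuffer on it, render the returned events, and keep `rest` as the buffer for the next chunk.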

Docker Deployment (Optional)

To containerize the application, create a Dockerfile:

Code snippet

FROM node:18-alpine 

WORKDIR /app 

COPY package*.json ./
RUN npm install 

COPY . .
RUN npm run build 

ENV PORT=3000
EXPOSE ${PORT}

CMD ["node", "dist/server.js"]

Build and run the container:

Code snippet

docker build -t ts-litemodel .
docker run -p 3000:3000 ts-litemodel

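Troubleshooting

1. Out-of-memory errors
– FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
Solution: raise the Node.js memory limit:

Code snippet

NODE_OPTIONS="--max-old-space-size=4096" npm start

Or raise the container's resource limits when running in Docker.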
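
To confirm that the new limit actually took effect, the process can report its own heap ceiling. This is a small sketch using Node's built-in v8 module; the heapLimitMb helper name is hypothetical, and the reported value roughly tracks --max-old-space-size rather than matching it exactly.

```typescript
import { getHeapStatistics } from 'node:v8';

// Reports the V8 heap size limit in megabytes for the current process.
function heapLimitMb(): number {
  return Math.round(getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`Heap limit: ${heapLimitMb()} MB`);
```

Logging this once at server startup makes it easy to see whether NODE_OPTIONS was picked up.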

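2. CUDA errors
If you hit a CUDA error while running a large local model on a GPU:

Code snippet

RuntimeError: CUDA out of memory.

Solutions:
- Reduce the batch size (the --batch-size parameter)
- Lower the precision (--dtype float16 or bfloat16)
- Use quantization (--quantization awq or gptq)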

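3. Timeouts when downloading large HuggingFace models
Solution:

Code snippet

export HF_HUB_ENABLE_HF_TRANSFER=1 # enable accelerated downloads (requires the hf_transfer package)
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./models --local-dir-use-symlinks False --resume-download

Then pass the local path as the model parameter.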
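
One way to switch between the downloaded copy and the hub ID is to check for the local files first. The sketch below treats the presence of config.json as the sign of a complete download, which is an assumption about the model's file layout; resolveModelRef is a hypothetical helper name.

```typescript
import { existsSync } from 'node:fs';
import { join, resolve } from 'node:path';

// Prefers a downloaded copy when one exists; otherwise falls back to the hub ID.
function resolveModelRef(localDir: string, repoId: string): string {
  const dir = resolve(localDir); // absolute path, using the platform's separators
  return existsSync(join(dir, 'config.json')) ? dir : repoId;
}
```

For example, `resolveModelRef('./models', 'meta-llama/Llama-2-7b-chat-hf')` returns the absolute path of ./models once the download has finished, and the repository ID before that.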

4. Windows-specific issues
Path problems may occur on Windows. Recommendations: