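Stable Diffusion in Practice: Building Efficient Semantic Search with Shell

Introduction

In AI image generation, Stable Diffusion has become one of the most popular open-source models. Once you have generated a large number of images, though, quickly finding the one you want becomes a challenge. This article shows how to combine Shell scripts with the CLIP model to build an efficient semantic search system for images generated by Stable Diffusion.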

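Preparation

Environment requirements

Linux or macOS (Windows users can use WSL)
Python 3.7+
A working Stable Diffusion installation
Basic knowledge of Shell scripting

Python packages to install

Code snippet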

pip install torch torchvision openai-clip pillow

Implementation steps

1. Image feature extraction

First, we use the CLIP model to extract a semantic feature vector from each image:

Code snippet

#!/bin/bash

# clip_encoder.sh - extract image feature vectors with the CLIP model

# Input and output directories
INPUT_DIR="./generated_images"
OUTPUT_DIR="./image_embeddings"

# Create the output directory
mkdir -p "$OUTPUT_DIR"

# Python part (the unquoted heredoc lets the shell substitute $INPUT_DIR and $OUTPUT_DIR)
python3 - <<EOF
import os
import torch
import clip
from PIL import Image

# Load the CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Process every image
for img_file in os.listdir("$INPUT_DIR"):
    if img_file.lower().endswith(('.png', '.jpg', '.jpeg')):
        try:
            # Load and preprocess the image
            image = preprocess(Image.open(os.path.join("$INPUT_DIR", img_file))).unsqueeze(0).to(device)

            # Extract the feature vector
            with torch.no_grad():
                image_features = model.encode_image(image)
                features = image_features.cpu().numpy().tolist()[0]

            # Save the feature vector to a file
            output_file = os.path.join("$OUTPUT_DIR", f"{os.path.splitext(img_file)[0]}.txt")
            with open(output_file, 'w') as f:
                f.write(','.join(map(str, features)))

            print(f"Processed: {img_file}")
        except Exception as e:
            print(f"Error processing {img_file}: {str(e)}")
EOF

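How it works:
- CLIP, developed by OpenAI, maps images and text into the same semantic space
- clip.load() loads the pretrained ViT-B/32 variant of the model
- model.encode_image() produces a 512-dimensional feature vector for each image
- The vectors are saved as text files for the search step that follows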
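
The comma-separated text format is easy to read back. Below is a minimal sketch of the round trip, assuming the one-vector-per-file layout written by clip_encoder.sh. The helper names save_embedding and load_embedding are hypothetical and not part of CLIP.

```python
import os
import tempfile


def save_embedding(path, features):
    """Write one feature vector as a single comma-separated line."""
    with open(path, "w") as f:
        f.write(",".join(map(str, features)))


def load_embedding(path):
    """Read a vector written by save_embedding back into a list of floats."""
    with open(path) as f:
        return [float(x) for x in f.read().split(",")]


if __name__ == "__main__":
    # Toy 4-dimensional vector; a real ViT-B/32 embedding has 512 values.
    vector = [0.25, -1.5, 3.0, 0.125]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        save_embedding(path, vector)
        print(load_embedding(path) == vector)
```

Plain text keeps the files easy to inspect with ordinary shell tools, at the cost of larger files and slower parsing than a binary format.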

2. Building the search index

Next, we create a simple search system:

Code snippet

#!/bin/bash

# semantic_search.sh - semantic search based on CLIP features

SEARCH_QUERY="$1"
EMBEDDINGS_DIR="./image_embeddings"
IMAGES_DIR="./generated_images"

if [ -z "$SEARCH_QUERY" ]; then
    echo "Usage: $0 <search_query>"
    exit 1
fi

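# Python part. Values are passed through the environment and the heredoc is quoted
# ('EOF'), so the query text is never interpreted as shell or Python code.
export SEARCH_QUERY EMBEDDINGS_DIR IMAGES_DIR
python3 - <<'EOF'
import os
import subprocess
import sys

import clip
import numpy as np
import torch

query = os.environ["SEARCH_QUERY"]
embeddings_dir = os.environ["EMBEDDINGS_DIR"]
images_dir = os.environ["IMAGES_DIR"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Encode the search text
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_input)
text_features = text_features.cpu().numpy()[0]

scores = []

# Score every stored image embedding against the query
for emb_file in os.listdir(embeddings_dir):
    if emb_file.endswith('.txt'):
        # Load the image feature vector
        with open(os.path.join(embeddings_dir, emb_file), 'r') as f:
            img_features = np.array([float(x) for x in f.read().split(',')])

        # Cosine similarity: dot product divided by the product of the norms
        similarity = np.dot(text_features, img_features) / (
            np.linalg.norm(text_features) * np.linalg.norm(img_features)
        )

        scores.append((os.path.splitext(emb_file)[0], similarity))

# Sort by similarity and show the top 5 results
scores.sort(key=lambda x: x[1], reverse=True)

# stdin is occupied by the heredoc, so read answers from the terminal instead
try:
    tty = open("/dev/tty")
except OSError:
    tty = None

print("\nTop results:")
for i, (img_id, score) in enumerate(scores[:5]):
    print(f"{i+1}. {img_id} (score: {score:.3f})")
    if tty is None:
        continue

    print("Show this image? (y/n): ", end="", flush=True)
    if tty.readline().strip().lower() == 'y':
        # The encoder accepts .png, .jpg and .jpeg, so match on the file stem
        matches = [n for n in os.listdir(images_dir) if os.path.splitext(n)[0] == img_id]
        if matches:
            # 'open' previews the image on macOS; Linux may need a different viewer
            opener = 'open' if sys.platform == 'darwin' else 'xdg-open'
            subprocess.run([opener, os.path.join(images_dir, matches[0])])
EOF

Usage example:

Code snippet

./semantic_search.sh "a sunset over mountains"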

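How it works:
- CLIP encodes the text query as a vector in the same semantic space as the images
- We compute the cosine similarity between the query vector and every image vector
- clip.tokenize() converts the text into the token format the model expects
- model.encode_text() produces the feature vector for the text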
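3. (Optional) Batch processing optimization

For a large number of images, feature extraction can be sped up by running it in parallel:

Code snippet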

#!/bin/bash

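# batch_process.sh - speed up feature extraction with parallel processes

NUM_PROCESSES=4   # adjust to the number of CPU cores

mkdir -p ./image_embeddings

# Only pass image files to the workers
find ./generated_images -type f \( -iname "*.png" -o -iname "*.jpg" -o -iname "*.jpeg" \) |
parallel -j "$NUM_PROCESSES" --bar '
    python3 -c "
import os, sys, torch, clip
from PIL import Image
device = \"cuda\" if torch.cuda.is_available() else \"cpu\"
model, preprocess = clip.load(\"ViT-B/32\", device=device)
try:
    image = preprocess(Image.open(sys.argv[1])).unsqueeze(0).to(device)
    # no_grad is required: .numpy() fails on a tensor that still tracks gradients
    with torch.no_grad():
        features = model.encode_image(image).cpu().numpy().tolist()[0]
    # Name the output file after the image, whatever its extension
    stem = os.path.splitext(os.path.basename(sys.argv[1]))[0]
    with open(os.path.join(\"./image_embeddings\", stem + \".txt\"), \"w\") as f:
        f.write(\",\".join(map(str, features)))
except Exception as e:
    print(\"Error processing\", sys.argv[1], e, file=sys.stderr)
" {}
'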

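Notes:
- GNU Parallel can speed up processing considerably
- In a GPU environment, make sure CUDA is available
- Each parallel process loads its own copy of the model, so reduce the number of processes if RAM runs short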
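
Since every process pays the cost of loading the model, another option is to load CLIP once and encode several images per forward pass. The sketch below is an alternative to the script above, not a replacement taken from it. The names chunked, encode_in_batches and make_clip_encoder are hypothetical, and the batch size of 32 is an arbitrary assumption to tune against available memory.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def encode_in_batches(paths, encode_batch, batch_size=32):
    """Call encode_batch once per chunk and collect {path: vector}."""
    results = {}
    for batch in chunked(paths, batch_size):
        for path, vector in zip(batch, encode_batch(batch)):
            results[path] = vector
    return results


def make_clip_encoder():
    """Load CLIP once and return a function that encodes a list of image paths."""
    # Imported here so the batching helpers above work without PyTorch installed.
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def encode_batch(paths):
        # Stack the preprocessed images into one tensor of shape (N, 3, H, W).
        images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
        with torch.no_grad():
            return model.encode_image(images).cpu().numpy().tolist()

    return encode_batch
```

A call such as encode_in_batches(paths, make_clip_encoder()) would then return one vector per path. Passing the encoder in as a function keeps the batching logic separate from the model, which also makes it easy to try with a stand-in encoder.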

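Practical tips

Performance tips:
- GPU acceleration: make sure PyTorch is installed with CUDA support
- LRU cache: cache the results of repeated queries
- FAISS index: for very large datasets, consider Facebook's FAISS library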
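
The LRU cache tip can be sketched with functools.lru_cache from the standard library. This assumes a long-running process, such as a search service, since an in-memory cache is lost when a script exits. Here encode_text is a hypothetical stand-in for the expensive CLIP text encoder, and maxsize=128 is an arbitrary choice.

```python
from functools import lru_cache


def encode_text(query):
    """Stand-in for the expensive CLIP text encoder (hypothetical placeholder)."""
    encode_text.calls += 1
    return tuple(float(ord(c)) for c in query)


encode_text.calls = 0  # counts how often the expensive function really runs


@lru_cache(maxsize=128)
def cached_encode_text(query):
    """Return the embedding for query, reusing the result for repeated queries."""
    return encode_text(query)


if __name__ == "__main__":
    cached_encode_text("a sunset over mountains")
    cached_encode_text("a sunset over mountains")  # served from the cache
    print(encode_text.calls)                       # encoder ran once
    print(cached_encode_text.cache_info().hits)    # one cache hit
```

The cached function returns a tuple rather than a list so that callers cannot accidentally modify the stored result.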

Troubleshooting common problems:

Code snippet

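# If the automatic CLIP download fails, download the model manually (Linux/macOS)
mkdir -p ~/.cache/clip && cd ~/.cache/clip && \
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt -O ViT-B-32.pt

# Windows: clip.load() also accepts the path to a downloaded checkpoint file in place
# of the model name, so point it at the file directly, e.g.
#   clip.load(r"C:\path\to\downloaded\model", device=device)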
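
After a manual download it is worth checking that the file is intact. The sketch below assumes that the long hexadecimal component of the download URL is the SHA-256 checksum of the checkpoint, which is how the CLIP loader appears to validate its own downloads. The function names are hypothetical.

```python
import hashlib

MODEL_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)


def expected_sha256(url):
    """Assumption: the second-to-last path component of the URL is the SHA-256."""
    return url.split("/")[-2]


def file_sha256(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large checkpoint is not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_download(path, url=MODEL_URL):
    """True when the file on disk matches the checksum embedded in the URL."""
    return file_sha256(path) == expected_sha256(url)
```

Calling is_valid_download on the downloaded ViT-B-32.pt should return True for a complete file. A False result usually points to a truncated or interrupted download.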

Suggested extensions:

Code snippet

# Excerpt from search_with_feedback.sh (refine search results with interactive feedback)

SEARCH_QUERY="$1"

echo "Initial results for: $SEARCH_QUERY"
./semantic_search.sh "$SEARCH_QUERY"

read -r -p "Enter additional keywords to refine search: " REFINE_QUERY

IMPROVED_QUERY="$SEARCH_QUERY, $REFINE_QUERY"
./semantic_search.sh "$IMPROVED_QUERY"

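Summary and key takeaways

In this tutorial we covered:

✅ A complete pipeline: from image feature extraction to a working semantic search system
✅ Key technique: CLIP's cross-modal understanding of images and text
✅ Practical skills: combining Shell with Python effectively, and speeding things up with parallel processing

Quick reference for key commands: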

Command	Description
`./clip_encoder.sh`	Extract embeddings for all images
`./semantic_search.sh "query"`	Search images by semantic meaning
`./batch_process.sh`	Parallel processing optimization

Next steps: consider integrating the search into the Stable Diffusion WebUI, or building a web service that exposes it through an API.