A Complete Guide to Processing Unstructured Data with C# and LangChain (2024): A Web Development Example

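Introduction

Data-driven web applications often have to work with unstructured data such as PDFs, Word documents, and web page content. This guide shows how to process and analyze that data from C# with LangChain-style LLM tooling; the code samples call the model through Microsoft Semantic Kernel, a .NET library that fills the same role. By the end you will have built a web application that can read unstructured documents, analyze them, and answer questions about them.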

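Prerequisites

Environment requirements

.NET 6 or later
Visual Studio 2022 or VS Code
Python 3.8+ (for running a LangChain service; the C# samples in this guide do not call it)
An OpenAI API key (or a key for another LLM provider)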

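Installing the required NuGet packages

Code snippet

dotnet add package Microsoft.SemanticKernel --version 1.0.0-beta8
dotnet add package Newtonsoft.Json
dotnet add package PuppeteerSharp # web page scraping
dotnet add package itext7 # PDF text extraction (used by PdfTextExtractor below)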

Project setup

Create a new ASP.NET Core web application:

Code snippet

dotnet new webapp -n LangChainDemo
cd LangChainDemo

Add a Services folder to hold the LangChain-related services.

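LangChain service integration

1. Wrapping the LangChain service

Create LangChainService.cs in the Services folder:

Code snippet

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.AI.OpenAI;

// Thin wrapper around a Semantic Kernel instance.
// Written against the 1.0.0-beta8 package installed above; the builder API differs in other releases.
public class LangChainService
{
    private readonly IKernel _kernel;

    public LangChainService(string apiKey)
    {
        // gpt-3.5-turbo is a chat model, so it is registered as a chat completion service
        _kernel = new KernelBuilder()
            .WithOpenAIChatCompletionService("gpt-3.5-turbo", apiKey)
            .Build();
    }

    // `prompt` is a Semantic Kernel prompt template; `data` is supplied as its {{$input}} variable
    public async Task<string> ProcessUnstructuredDataAsync(string data, string prompt)
    {
        var function = _kernel.CreateSemanticFunction(prompt);
        var result = await _kernel.RunAsync(data, function);
        return result.GetValue<string>() ?? string.Empty;
    }
}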

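2. Registering the services

Add the following to Program.cs:

Code snippet

// Read the API key from configuration
var openAiApiKey = builder.Configuration["OpenAI:ApiKey"]
    ?? throw new InvalidOperationException("OpenAI:ApiKey is not configured.");

// Register the services injected into the controller and page model
builder.Services.AddSingleton(new LangChainService(openAiApiKey));
builder.Services.AddSingleton<PdfTextExtractor>();
builder.Services.AddControllers(); // the webapp template registers Razor Pages only

// After `var app = builder.Build();`
app.MapControllers();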

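Web application example

Scenario: extracting and analyzing information from a PDF or web page

1. PDF text extraction

First add a PDF processing service:

Code snippet

using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
// Alias iText's static helper, which shares its name with the class below
using ITextExtractor = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor;

public class PdfTextExtractor
{
    public string ExtractTextFromPdf(byte[] pdfBytes)
    {
        using var reader = new PdfReader(new MemoryStream(pdfBytes));
        using var pdfDoc = new PdfDocument(reader);

        var text = new StringBuilder();
        for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
        {
            // A fresh strategy per page keeps text from earlier pages out of the result
            var strategy = new SimpleTextExtractionStrategy();
            text.AppendLine(ITextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
        }

        return text.ToString();
    }
}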
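Text extracted from a PDF often carries layout artifacts, such as words hyphenated across line ends and long runs of blank lines, that waste prompt space. The following is a minimal sketch of a cleanup step that could run on the extractor's output. `PdfTextCleaner` is a hypothetical helper introduced here for illustration; it is not part of iText or of the service above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of iText): tidies raw extracted text before it is sent to the model.
public static class PdfTextCleaner
{
    public static string Normalize(string raw)
    {
        // Unify line endings
        var text = raw.Replace("\r\n", "\n").Replace('\r', '\n');
        // Re-join words hyphenated across a line break: "docu-\nment" -> "document"
        text = Regex.Replace(text, @"(\p{L})-\n(\p{L})", "$1$2");
        // Collapse runs of spaces and tabs into a single space
        text = Regex.Replace(text, @"[ \t]+", " ");
        // Drop spaces that sit directly next to a newline
        text = Regex.Replace(text, @" ?\n ?", "\n");
        // Collapse three or more newlines into one blank line
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }
}
```

Note that the hyphen rule also joins genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so treat it as a heuristic and adjust it to your documents.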

2. Web controller

Create DocumentAnalysisController.cs:

Code snippet

using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/[controller]")]
public class DocumentAnalysisController : ControllerBase
{
    private readonly LangChainService _langChain;
    private readonly PdfTextExtractor _pdfExtractor;

    public DocumentAnalysisController(LangChainService langChain, PdfTextExtractor pdfExtractor)
    {
        _langChain = langChain;
        _pdfExtractor = pdfExtractor;
    }

    [HttpPost("analyze-pdf")]
    public async Task<IActionResult> AnalyzePdf(IFormFile file, [FromQuery] string question)
    {
        if (file == null || file.Length == 0)
            return BadRequest("No file uploaded");

        using var ms = new MemoryStream();
        await file.CopyToAsync(ms);
        var pdfBytes = ms.ToArray();

        // Step 1: Extract text from the PDF
        var extractedText = _pdfExtractor.ExtractTextFromPdf(pdfBytes);

        // Step 2: Build the prompt; the document text is embedded directly in it
        var prompt = $@"Answer the question using the document content below.
Question: {question}

Document content:
{extractedText}";

        // Step 3: Send the prompt to the model
        var answer = await _langChain.ProcessUnstructuredDataAsync(extractedText, prompt);

        return Ok(new { answer });
    }
}

3. Web page integration

Add a file upload form to a Razor page:

Code snippet

@page "/document-analysis"
@model DocumentAnalysisModel

<h2>Document analysis tool</h2>

<form method="post" enctype="multipart/form-data">
    <div class="form-group">
        <label for="file">Upload a PDF document</label>
        <input type="file" class="form-control-file" id="file" name="file" accept=".pdf" required>
    </div>

    <div class="form-group mt-3">
        <label for="question">Your question</label>
        <input type="text" class="form-control" id="question" name="question" 
               placeholder="For example: What are the main terms of this contract?" required>
    </div>

    <button type="submit" class="btn btn-primary mt-3">Analyze document</button>
</form>

@if (!string.IsNullOrEmpty(Model.Answer))
{
    <div class="mt-4 p-3 bg-light rounded">
        <h4>Analysis result:</h4>
        <p>@Model.Answer</p>
    </div>
}

The corresponding page model:

Code snippet

using System.Net.Http.Headers;
using System.Text.Json;
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Mvc.RazorPages;

public class DocumentAnalysisModel : PageModel
{
    // Shared instance; creating an HttpClient per request can exhaust sockets
    private static readonly HttpClient Client = new();

    [BindProperty]
    public IFormFile File { get; set; } = default!;

    [BindProperty]
    public string Question { get; set; } = string.Empty;

    public string? Answer { get; set; }

    public async Task<IActionResult> OnPostAsync()
    {
        if (!ModelState.IsValid)
            return Page();

        // Call our API endpoint (the page could also call the services directly)
        using var formContent = new MultipartFormDataContent();

        // Add the uploaded file; OpenReadStream starts at position 0
        var fileContent = new StreamContent(File.OpenReadStream());
        fileContent.Headers.ContentType = MediaTypeHeaderValue.Parse("application/pdf");
        formContent.Add(fileContent, "file", File.FileName);

        // The question travels as a query parameter; adjust the base address to your launch profile
        var url = $"http://localhost:5000/api/DocumentAnalysis/analyze-pdf?question={Uri.EscapeDataString(Question)}";
        var response = await Client.PostAsync(url, formContent);

        if (response.IsSuccessStatusCode)
        {
            // Read the "answer" property from the JSON payload
            var result = await response.Content.ReadFromJsonAsync<JsonElement>();
            Answer = result.GetProperty("answer").GetString();
        }

        return Page();
    }
}
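The analyze-pdf endpoint returns a JSON object with a single `answer` property. As an alternative to reading it loosely, the payload can be bound to a small typed record. The sketch below uses only System.Text.Json; `AnalysisResponse` and `AnalysisResponseParser` are hypothetical names introduced for this example.

```csharp
using System.Text.Json;

// Hypothetical DTO matching the { "answer": "..." } payload of the analyze-pdf endpoint.
public record AnalysisResponse(string? Answer);

public static class AnalysisResponseParser
{
    // ASP.NET Core writes property names in camelCase by default,
    // so property matching is made case-insensitive here.
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNameCaseInsensitive = true
    };

    // Returns the answer text, or null when the payload has no "answer" property.
    public static string? ParseAnswer(string json)
    {
        var response = JsonSerializer.Deserialize<AnalysisResponse>(json, Options);
        return response?.Answer;
    }
}
```

A call such as `AnalysisResponseParser.ParseAnswer(await response.Content.ReadAsStringAsync())` would then replace the untyped property access in the page model.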

Advanced extensions

1. Fetching and analyzing URL content

Add a service that fetches web page content:

Code snippet

using PuppeteerSharp;

public class WebContentFetcher
{
    public async Task<string> FetchContent(string url)
    {
        // Download a compatible browser build on first use (requires PuppeteerSharp);
        // the parameterless overload picks the library's default revision
        await new BrowserFetcher().DownloadAsync();
        await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        await using var page = await browser.NewPageAsync();

        // Wait until the network is idle so client-rendered content is present
        await page.GoToAsync(url, WaitUntilNavigation.Networkidle0);

        // Extract the visible text (simplified; production code would isolate the main content)
        return await page.EvaluateExpressionAsync<string>("document.body.innerText");
    }
}

Then add a new endpoint to the controller:

Code snippet

[HttpPost("analyze-url")]
public async Task<IActionResult> AnalyzeUrl([FromBody] UrlAnalysisRequest request)
{
    var contentFetcher = new WebContentFetcher();
    var webContent = await contentFetcher.FetchContent(request.Url);

    // Limit the context size so the prompt stays within the model's window
    var truncatedContent = webContent.Substring(0, Math.Min(webContent.Length, 10000));

    var promptTemplate =
@"Answer the question using the web page content below.
Question: {0}

Web page content:
{1}";

    var promptWithContext = string.Format(promptTemplate, request.Question, truncatedContent);

    // Reuses the method defined on LangChainService
    var answer = await _langChain.ProcessUnstructuredDataAsync(truncatedContent, promptWithContext);

    return Ok(new { answer });
}

public record UrlAnalysisRequest(string Url, string Question);
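Because the analyze-url endpoint fetches whatever URL the client sends, it is worth rejecting obviously unsuitable input before launching the browser. The sketch below is one possible pre-check using only `System.Uri`; `UrlGuard` is a hypothetical helper, and the check is deliberately narrow: it does not cover private network ranges or DNS-based tricks.

```csharp
using System;

// Hypothetical helper: a basic sanity check for user-supplied URLs.
public static class UrlGuard
{
    // Returns true (and the parsed Uri) only for absolute http/https URLs
    // that do not point at the local machine.
    public static bool TryGetFetchableUri(string? input, out Uri? uri)
    {
        uri = null;
        if (!Uri.TryCreate(input, UriKind.Absolute, out var parsed))
            return false;
        // Reject file://, ftp:// and other non-web schemes
        if (parsed.Scheme != Uri.UriSchemeHttp && parsed.Scheme != Uri.UriSchemeHttps)
            return false;
        // Loopback hosts (localhost, 127.0.0.1, ::1) point back at the server itself
        if (parsed.IsLoopback)
            return false;
        uri = parsed;
        return true;
    }
}
```

In the endpoint, a failed check could return `BadRequest("Invalid URL")` before `FetchContent` is called.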

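Performance and other considerations

Chunking large documents:

Language models accept only a limited number of tokens per request, so large documents have to be processed in chunks:

Code snippet

public IEnumerable<string> ChunkText(string text, int maxChunkSize)
{
    if (maxChunkSize <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxChunkSize));

    // Splits by character count, which only approximates the model's token count
    for (int i = 0; i < text.Length; i += maxChunkSize)
    {
        yield return text.Substring(i, Math.Min(maxChunkSize, text.Length - i));
    }
}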
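Fixed-size splitting can cut a chunk in the middle of a word or sentence, which makes each piece harder for the model to interpret. The following sketch shows one common refinement: prefer to cut at a newline or space, and optionally repeat a few characters between neighbouring chunks. `TextChunker` and its `overlap` parameter are assumptions made for this example, and sizes are still measured in characters rather than tokens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: boundary-aware chunking with optional overlap.
public static class TextChunker
{
    // Splits text into chunks of at most maxChunkSize characters, preferring to cut
    // after a newline or space, and repeating `overlap` characters between neighbours.
    public static List<string> Chunk(string text, int maxChunkSize, int overlap = 0)
    {
        if (maxChunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlap < 0 || overlap >= maxChunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            int end = Math.Min(start + maxChunkSize, text.Length);
            if (end < text.Length)
            {
                // Search backwards inside the window for a natural break
                int cut = text.LastIndexOf('\n', end - 1, end - start);
                if (cut <= start) cut = text.LastIndexOf(' ', end - 1, end - start);
                // Accept the cut only if the next chunk still starts further along
                if (cut > start + overlap) end = cut + 1;
            }
            chunks.Add(text.Substring(start, end - start));
            if (end == text.Length) break;
            start = end - overlap;
        }
        return chunks;
    }
}
```

With `overlap` left at 0 the chunks concatenate back to the original text; a small positive overlap trades some extra tokens for context that carries across chunk boundaries.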

Caching:

API calls can be expensive, so consider caching responses:

Code snippet

// Simple in-memory cache example (consider a distributed cache for production)
// Requires: using System.Collections.Concurrent;
private static readonly ConcurrentDictionary<string, string> _cache =
    new ConcurrentDictionary<string, string>();

public async Task<string> GetCachedCompletion(string key, Func<Task<string>> valueFactory)
{
    if (_cache.TryGetValue(key, out var cachedValue))
        return cachedValue;

    var value = await valueFactory();
    _cache.TryAdd(key, value);
    return value;
}
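Two practical questions come up with a cache like this: what to use as the key, and what happens when several requests ask for the same uncached value at once. The sketch below hashes the prompt into a fixed-length key and lets concurrent callers share a single in-flight call. `CompletionCache` is a hypothetical class written for this example and uses only the .NET base class library (.NET 6 or later).

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper: in-memory completion cache keyed by a hash of the prompt.
public class CompletionCache
{
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _entries = new();

    // Derives a fixed-length key so long prompts are not stored as dictionary keys.
    public static string KeyFor(string prompt)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(prompt));
        return Convert.ToHexString(hash);
    }

    // Concurrent callers with the same prompt await the same underlying task.
    public async Task<string> GetOrAddAsync(string prompt, Func<Task<string>> valueFactory)
    {
        string key = KeyFor(prompt);
        var entry = _entries.GetOrAdd(key, _ => new Lazy<Task<string>>(valueFactory));
        try
        {
            return await entry.Value;
        }
        catch
        {
            // Do not keep failed calls in the cache
            _entries.TryRemove(key, out _);
            throw;
        }
    }
}
```

The cache never evicts entries, so a long-running service would also need a size limit or expiry policy.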

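Error handling:

Calls to an LLM API can fail, so add retry logic:

Code snippet

public async Task<string> GetCompletionWithRetry(string data,
                                                 string prompt,
                                                 int maxRetries = 3,
                                                 int delayMs = 1000)
{
    for (int i = 0; i < maxRetries; i++)
    {
        try
        {
            return await ProcessUnstructuredDataAsync(data, prompt);
        }
        // The filter lets the exception from the final attempt propagate to the caller
        catch (Exception ex) when (i < maxRetries - 1)
        {
            Console.WriteLine($"Attempt {i + 1} failed: {ex.Message}");
            await Task.Delay(delayMs * (i + 1)); // linear backoff
        }
    }

    // Only reached when maxRetries <= 0
    throw new InvalidOperationException($"Failed after {maxRetries} attempts");
}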
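A common variation on this retry loop is exponential backoff with jitter, which spaces attempts further apart each time and avoids many clients retrying in lockstep. The sketch below is a generic version with cancellation support. `Retry`, its parameter names, and the 30-second cap are assumptions chosen for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: generic retry with exponential backoff and jitter.
public static class Retry
{
    // Delay before retry number `attempt` (0-based): baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts = 3,
        TimeSpan? baseDelay = null,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        var delay = baseDelay ?? TimeSpan.FromSeconds(1);
        var maxDelay = TimeSpan.FromSeconds(30); // assumed cap for this sketch

        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken);
            }
            // Retry on any failure except cancellation; the last attempt's exception propagates
            catch (Exception ex) when (ex is not OperationCanceledException && attempt < maxAttempts - 1)
            {
                var wait = BackoffDelay(attempt, delay, maxDelay);
                // Up to 20% random jitter spreads out retries from concurrent callers
                wait += TimeSpan.FromMilliseconds(Random.Shared.NextDouble() * 0.2 * wait.TotalMilliseconds);
                await Task.Delay(wait, cancellationToken);
            }
        }
    }
}
```

A caller could wrap the service method as `await Retry.ExecuteAsync(_ => service.ProcessUnstructuredDataAsync(data, prompt))`. Retrying every exception type is a simplification; in practice you would usually retry only transient failures such as timeouts and rate-limit responses.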

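Summary and best practices

In this tutorial we built a complete web application that processes and analyzes unstructured data. The key points:

Architecture:
- The C# backend handles data processing and API integration
- LangChain/Semantic Kernel acts as the AI middleware layer
Performance:
- PDF and web page parsing are CPU-intensive; consider background tasks or a queue for large files
Security:
- Store API keys securely (Azure Key Vault or a similar service)
- Validate PDF uploads (file type, size limit)
Extensions:
- Add vector database support (Pinecone, Weaviate, and so on) for semantic search
- Support more file types (Word, Excel, and so on)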
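
The upload validation listed under the security practices can be sketched as follows. `PdfUploadValidator` and the 10 MB limit are assumptions made for this example. The check combines a size limit, the file extension, and the `%PDF-` signature at the start of the file, since the extension alone is controlled by the client.

```csharp
using System;
using System.IO;

// Hypothetical helper: basic checks on an uploaded PDF before it is parsed.
public static class PdfUploadValidator
{
    // Assumed limit for illustration; pick a value that suits your deployment.
    public const long MaxBytes = 10 * 1024 * 1024;

    private static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46, 0x2D }; // "%PDF-"

    // Returns null when the upload looks acceptable, otherwise a short reason.
    public static string? Validate(string fileName, long length, Stream content)
    {
        if (length <= 0) return "File is empty.";
        if (length > MaxBytes) return "File is too large.";
        if (!fileName.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            return "Only .pdf files are accepted.";

        // Read the first bytes and compare them with the PDF signature.
        // A single Read call is enough for in-memory and file streams.
        long startPosition = content.CanSeek ? content.Position : 0;
        var header = new byte[PdfMagic.Length];
        int read = content.Read(header, 0, header.Length);
        if (content.CanSeek) content.Position = startPosition; // rewind for the next reader

        if (read < PdfMagic.Length || !header.AsSpan().SequenceEqual(PdfMagic))
            return "File does not start with a PDF signature.";
        return null;
    }
}
```

In the controller this could run on the `MemoryStream` copy of the upload (after setting its position back to 0), returning `BadRequest` with the reason when the result is not null. A signature check does not prove the file is a well-formed or harmless PDF; it only filters out obvious mismatches.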

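This example shows how modern AI capabilities can be integrated into a conventional web application. You can keep building on it to create a more capable intelligent document processing system.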