The Complete 2024 Guide to Processing Unstructured Data with LangChain in C#: A Web Development Example

云信安装大师
May 3, 2025

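Introduction

Modern data-driven web development often involves unstructured data such as PDFs, Word documents, and web page content. This guide shows how to process and analyze that data from C# with a LangChain-style AI orchestration layer. The C# code itself is built on Microsoft Semantic Kernel, a .NET library that fills the role LangChain plays in Python. By the end, you will have built a web application that can read unstructured documents and answer questions about them.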

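Preparation

Environment requirements

  • .NET 6 or later
  • Visual Studio 2022 or VS Code
  • Python 3.8+ (only needed if you run a separate LangChain service; the C# code in this guide does not call Python)
  • An OpenAI API key (or an API key from another LLM provider)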

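Install the required NuGet packages

Code snippet
dotnet add package Microsoft.SemanticKernel --version 1.0.0-beta8
dotnet add package Newtonsoft.Json
dotnet add package PuppeteerSharp # used for web page scraping
dotnet add package itext7 # used for PDF text extraction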

Project setup

  1. Create a new ASP.NET Core web application:
Code snippet
dotnet new webapp -n LangChainDemo
cd LangChainDemo
  2. Add a Services folder to hold the LangChain-related services.

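LangChain service integration

1. Wrapping the LangChain service

Create LangChainService.cs in the Services folder:

Code snippet
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.AI.OpenAI;

// Despite its name, this wrapper is built on Microsoft Semantic Kernel.
// The API names below follow the 1.0.0-beta8 package and were renamed in the 1.0 release.
public class LangChainService 
{
    private readonly IKernel _kernel;

    public LangChainService(string apiKey)
    {
        // gpt-3.5-turbo is a chat model, so register it as a chat completion service
        _kernel = new KernelBuilder()
            .WithOpenAIChatCompletionService("gpt-3.5-turbo", apiKey)
            .Build();
    }

    // "prompt" is a prompt template; any {{$input}} placeholder in it is replaced by "data"
    public async Task<string> ProcessUnstructuredDataAsync(string data, string prompt)
    {
        var function = _kernel.CreateSemanticFunction(prompt);
        var result = await _kernel.RunAsync(data, function);
        return result.GetValue<string>() ?? string.Empty;
    }
}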

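2. Registering the service

Add the following to Program.cs, before the call to builder.Build():

Code snippet
// Read the API key from configuration; fail fast if it is missing
var openAiApiKey = builder.Configuration["OpenAI:ApiKey"]
    ?? throw new InvalidOperationException("OpenAI:ApiKey is not configured.");

// Register the services the controller depends on, plus controller support
builder.Services.AddSingleton(new LangChainService(openAiApiKey));
builder.Services.AddSingleton<PdfTextExtractor>();
builder.Services.AddControllers();
// After builder.Build(), also call app.MapControllers();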

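Web application example

Scenario: extracting and analyzing information from PDFs and web pages

1. PDF text extraction

First, add a PDF processing service:

Code snippet
using System.IO;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
// Alias iText's own PdfTextExtractor so it does not clash with this class's name
using ITextExtractor = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor;

public class PdfTextExtractor 
{
    public string ExtractTextFromPdf(byte[] pdfBytes)
    {
        using var reader = new PdfReader(new MemoryStream(pdfBytes));
        using var pdfDoc = new PdfDocument(reader);

        var text = new StringBuilder();
        for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
        {
            // Use a fresh strategy per page so text does not accumulate across pages
            var strategy = new SimpleTextExtractionStrategy();
            // AppendLine keeps a line break between consecutive pages
            text.AppendLine(ITextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
        }

        return text.ToString();
    }
}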
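
Text extracted from PDFs often contains irregular spacing and long runs of blank lines, which waste tokens once the text is sent to the model. The following is a minimal sketch of a cleanup step that could run on the output of ExtractTextFromPdf. The class name ExtractedTextCleaner and its rules are assumptions made for illustration; they are not part of iText or Semantic Kernel.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of any library): tidies whitespace in extracted text.
public static class ExtractedTextCleaner
{
    public static string Normalize(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        // Use "\n" as the only line ending
        var unified = text.Replace("\r\n", "\n").Replace('\r', '\n');
        // Collapse runs of spaces and tabs into a single space
        var collapsedSpaces = Regex.Replace(unified, @"[ \t]+", " ");
        // Drop a single space left directly before or after a line break
        var trimmedLines = Regex.Replace(collapsedSpaces, @" ?\n ?", "\n");
        // Keep at most one blank line between paragraphs
        var collapsedBlankLines = Regex.Replace(trimmedLines, @"\n{3,}", "\n\n");

        return collapsedBlankLines.Trim();
    }
}
```

A call such as ExtractedTextCleaner.Normalize(extractedText) would fit between the extraction step and the prompt construction.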

2. Implementing the web controller

Create DocumentAnalysisController.cs:

Code snippet
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/[controller]")]
public class DocumentAnalysisController : ControllerBase 
{
    private readonly LangChainService _langChain;
    private readonly PdfTextExtractor _pdfExtractor;

    public DocumentAnalysisController(LangChainService langChain, PdfTextExtractor pdfExtractor)
    {
        _langChain = langChain;
        _pdfExtractor = pdfExtractor;
    }

    [HttpPost("analyze-pdf")]
    public async Task<IActionResult> AnalyzePdf(IFormFile file, [FromQuery] string question)
    {
        if (file == null || file.Length == 0)
            return BadRequest("No file uploaded");

        using var ms = new MemoryStream();
        await file.CopyToAsync(ms);
        var pdfBytes = ms.ToArray();

        // Step 1: Extract text from PDF
        var extractedText = _pdfExtractor.ExtractTextFromPdf(pdfBytes);

        // Step 2: Build the prompt template; {{$input}} is replaced with the extracted text,
        // so the document content is passed once as input rather than pasted into the template
        var prompt = "Answer the question based on the document content below.\n" +
                     $"Question: {question}\n\n" +
                     "Document content:\n{{$input}}";

        var answer = await _langChain.ProcessUnstructuredDataAsync(extractedText, prompt);

        return Ok(new { answer });
    }
}
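
The analyze-pdf endpoint sends the full extracted text to the model, so a long PDF can exceed the model's context window. Below is a minimal sketch of a length cap that cuts at a word boundary where possible. ContextLimiter and the choice of a character budget (rather than a token count) are assumptions made for illustration.

```csharp
using System;

// Hypothetical helper: caps the text sent to the model at a character budget.
public static class ContextLimiter
{
    public static string Truncate(string text, int maxChars)
    {
        if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (string.IsNullOrEmpty(text)) return string.Empty;
        if (text.Length <= maxChars) return text;

        // Look backwards from the limit for the last space so a word is not cut in half
        var cut = text.LastIndexOf(' ', maxChars - 1);
        // No space inside the window: fall back to a hard cut at the limit
        if (cut <= 0) cut = maxChars;

        return text.Substring(0, cut).TrimEnd();
    }
}
```

In the controller this could be applied as ContextLimiter.Truncate(extractedText, 10000) before calling the service. Characters are only a rough proxy for tokens, so the budget should leave headroom for the question and the answer.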

3. Web page integration

Add a file upload form to a Razor page:

Code snippet
@page "/document-analysis"
@model DocumentAnalysisModel

<h2>Document Analysis Tool</h2>

<form method="post" enctype="multipart/form-data">
    <div class="form-group">
        <label for="file">Upload a PDF document</label>
        <input type="file" class="form-control-file" id="file" name="file" accept=".pdf" required>
    </div>

    <div class="form-group mt-3">
        <label for="question">Your question</label>
        <input type="text" class="form-control" id="question" name="question" 
               placeholder="For example: What are the main terms of this contract?" required>
    </div>

    <button type="submit" class="btn btn-primary mt-3">Analyze document</button>
</form>

@if (!string.IsNullOrEmpty(Model.Answer))
{
    <div class="mt-4 p-3 bg-light rounded">
        <h4>Analysis result:</h4>
        <p>@Model.Answer</p>
    </div>
}

The corresponding Page Model:

Code snippet
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text.Json;
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Mvc.RazorPages;

public class DocumentAnalysisModel : PageModel 
{
    private readonly LangChainService _langChain;

    [BindProperty]
    public IFormFile File { get; set; }

    [BindProperty]
    public string Question { get; set; }

    public string? Answer { get; set; }

    public DocumentAnalysisModel(LangChainService langChain)
    {
        _langChain = langChain;
    }

    public async Task<IActionResult> OnPostAsync()
    {
        if (!ModelState.IsValid)
            return Page();

        using var ms = new MemoryStream();
        await File.CopyToAsync(ms);
        // Rewind the stream; otherwise StreamContent would start at the end and send no data
        ms.Position = 0;

        // Call our API endpoint (could also call service directly).
        // For production code, prefer IHttpClientFactory over creating HttpClient per request.
        using var client = new HttpClient();

        using var formContent = new MultipartFormDataContent();

        // Add file content
        var fileContent = new StreamContent(ms);
        fileContent.Headers.ContentType = MediaTypeHeaderValue.Parse("application/pdf");
        formContent.Add(fileContent, "file", File.FileName);

        // Add question as query parameter in the URL (adjust host and port to your launch settings)
        var response = await client.PostAsync($"http://localhost:5000/api/DocumentAnalysis/analyze-pdf?question={Uri.EscapeDataString(Question)}", formContent);

        if (response.IsSuccessStatusCode)
        {
            // The endpoint returns { "answer": "..." }; read it as a JsonElement
            var result = await response.Content.ReadFromJsonAsync<JsonElement>();
            Answer = result.GetProperty("answer").GetString();
        }

        return Page();
     }
}

Advanced extensions

1. Fetching and analyzing URL content

Add a web content fetcher:

Code snippet
using PuppeteerSharp;

public class WebContentFetcher 
{
    public async Task<string> FetchContent(string url) 
    {
        // Download a browser if needed, then launch it (requires PuppeteerSharp).
        // Recent PuppeteerSharp versions pick the browser revision themselves;
        // older versions need an explicit revision argument here.
        var browserFetcher = new BrowserFetcher();
        await browserFetcher.DownloadAsync();
        await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        await using var page = await browser.NewPageAsync();

        // Wait until the network is idle so script-rendered content is present
        await page.GoToAsync(url, WaitUntilNavigation.Networkidle0);

        // Extract main content (simplified - in production you'd want more sophisticated extraction)
        return await page.EvaluateExpressionAsync<string>("document.body.innerText");
    }
}

Then add a new endpoint to the controller:

Code snippet
[HttpPost("analyze-url")]
public async Task<IActionResult> AnalyzeUrl([FromBody] UrlAnalysisRequest request)
{
    var contentFetcher = new WebContentFetcher();
    var webContent = await contentFetcher.FetchContent(request.Url);

    // Limit context size before sending the page text to the model
    var context = webContent.Substring(0, Math.Min(webContent.Length, 10000));

    // {{$input}} is replaced with the page text by the service
    var promptTemplate =
        "Answer the question based on the web page content below.\n" +
        $"Question: {request.Question}\n\n" +
        "Web page content:\n{{$input}}";

    // Use the controller's existing _langChain field and its two-argument method
    var answer =
         await _langChain.ProcessUnstructuredDataAsync(context, promptTemplate);

    return Ok(new { answer });
}

public record UrlAnalysisRequest(string Url, string Question);
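
Because the analyze-url endpoint makes the server fetch whatever URL the caller supplies, it is worth rejecting obviously unsafe targets before launching the browser. The sketch below is a minimal first check, assuming a hypothetical UrlGuard helper. It does not cover private IP ranges or DNS-based tricks, so it should not be treated as complete protection against server-side request forgery.

```csharp
using System;

// Hypothetical helper: a first-line check on user-supplied URLs.
public static class UrlGuard
{
    public static bool IsAllowedUrl(string? url)
    {
        // Must parse as an absolute URI
        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri)) return false;

        // Only plain web schemes; this rejects file://, ftp:// and similar
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return false;

        // Refuse loopback hosts such as localhost or 127.0.0.1
        if (uri.IsLoopback) return false;

        return true;
    }
}
```

In AnalyzeUrl, a guard such as `if (!UrlGuard.IsAllowedUrl(request.Url)) return BadRequest("Invalid URL");` would go before the call to FetchContent.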

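Performance optimization and caveats

  1. Process large documents in chunks

    • The underlying LLM has a token limit on its context window, so large documents need to be split into chunks:
      Code snippet
      // Splits text into fixed-size chunks measured in characters (a rough proxy for tokens)
      public IEnumerable<string> ChunkText(string text, int maxChunkSize) 
      {
          // Guard against a non-positive size, which would otherwise loop forever
          if (maxChunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
          for (int i=0; i<text.Length; i+=maxChunkSize) 
          {
              yield return text.Substring(i, Math.Min(maxChunkSize, text.Length-i));
          }
      }
      
  2. Caching strategy

    • API calls can be expensive, so consider adding a cache:
      Code snippet
      // Simple in-memory cache example (consider distributed cache for production)
      private static readonly ConcurrentDictionary<string, string> _cache =
          new ConcurrentDictionary<string, string>();
      
      // Requires: using System.Collections.Concurrent;
      public async Task<string> GetCachedCompletion(string key, Func<Task<string>> valueFactory) 
      {
          if (_cache.TryGetValue(key, out var cachedValue))
              return cachedValue;
      
          // Two concurrent callers with the same key may both call the factory; the first result wins
          var value = await valueFactory();
          _cache.TryAdd(key, value);
          return value;
      }
      
  3. Error handling

    • LLM API calls can fail, so implement retry logic:
      Code snippet
      public async Task<string> GetCompletionWithRetry(string data, string prompt,
                                                       int maxRetries=3,
                                                       int delayMs=1000) 
      {
          for (int i=0; i<maxRetries; i++) 
          {
              try 
              {
                  return await ProcessUnstructuredDataAsync(data, prompt);
              } 
              // On the last attempt the filter is false, so the original exception propagates
              catch (Exception ex) when (i < maxRetries -1) 
              {
                  Console.WriteLine($"Attempt {i+1} failed: {ex.Message}");
                  // Linear backoff: wait longer after each failed attempt
                  await Task.Delay(delayMs * (i+1));
              }
          }
      
          // Only reached when maxRetries is zero or negative
          throw new ArgumentOutOfRangeException(nameof(maxRetries), "maxRetries must be at least 1");
      }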
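
The three points above can each be taken one step further. The sketch below shows overlapping chunks (so a sentence split at a chunk boundary still appears whole in one chunk), a fixed-length cache key derived from the prompt and data, and a capped exponential backoff delay. LlmCallHelpers and all parameter defaults are assumptions chosen for illustration, and the code requires .NET 5 or later for SHA256.HashData and Convert.ToHexString.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helpers illustrating chunk overlap, cache keys, and backoff.
public static class LlmCallHelpers
{
    // Yields chunks of at most maxChunkSize characters; each chunk starts
    // `overlap` characters before the end of the previous one.
    public static IEnumerable<string> ChunkWithOverlap(string text, int maxChunkSize, int overlap)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (maxChunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlap < 0 || overlap >= maxChunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));
        return Iterate();

        IEnumerable<string> Iterate()
        {
            int step = maxChunkSize - overlap;
            for (int start = 0; start < text.Length; start += step)
            {
                yield return text.Substring(start, Math.Min(maxChunkSize, text.Length - start));
                // Stop once a chunk has reached the end of the text
                if (start + maxChunkSize >= text.Length) yield break;
            }
        }
    }

    // Hashes prompt and data into a 64-character hex key, so large documents
    // do not become large dictionary keys.
    public static string CacheKey(string prompt, string data)
    {
        var bytes = Encoding.UTF8.GetBytes(prompt + "\n" + data);
        return Convert.ToHexString(SHA256.HashData(bytes));
    }

    // Doubles the delay on each zero-based attempt, up to maxDelayMs.
    public static TimeSpan BackoffDelay(int attempt, int baseDelayMs = 1000, int maxDelayMs = 30000)
    {
        var delayMs = baseDelayMs * Math.Pow(2, attempt);
        return TimeSpan.FromMilliseconds(Math.Min(delayMs, maxDelayMs));
    }
}
```

For example, LlmCallHelpers.CacheKey(prompt, data) could serve as the key passed to GetCachedCompletion, and BackoffDelay(i) could replace the linear delay in the retry loop. Adding a small random jitter to the delay is a common refinement when many clients retry at once.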
      

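Summary and best practices

In this guide we built a complete web application that processes and analyzes unstructured data. The key takeaways are:

  1. Architecture

    • The C# backend handles data processing and API integration
    • LangChain/Semantic Kernel serves as the AI middleware layer
  2. Performance

    • PDF and web page parsing are CPU-intensive, so consider background tasks or a queue for large files
  3. Security

    • Store API keys securely (Azure Key Vault or a similar solution)
    • Validate PDF uploads (file type and size limits)
  4. Directions for extension

    • Add vector database support (Pinecone, Weaviate, and so on) to enable semantic search
    • Support more file types (Word, Excel, and so on)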
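
The upload validation mentioned under security can be sketched as a check on size, file extension, and the file's leading bytes. PdfUploadValidator and the 10 MB limit are assumptions made for illustration; a matching header does not prove the file is a well-formed or safe PDF.

```csharp
using System;
using System.Text;

// Hypothetical helper: basic checks before an upload is handed to the PDF parser.
public static class PdfUploadValidator
{
    // Assumed upper bound; tune to what your server can handle
    public const long MaxBytes = 10 * 1024 * 1024;

    public static bool LooksLikePdf(byte[] content, string fileName)
    {
        if (content is null || fileName is null) return false;
        if (content.Length < 5 || content.Length > MaxBytes) return false;
        if (!fileName.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)) return false;

        // PDF files begin with the ASCII marker "%PDF-"
        var header = Encoding.ASCII.GetString(content, 0, 5);
        return header == "%PDF-";
    }
}
```

In AnalyzePdf, the check could run on pdfBytes and file.FileName right after the upload is copied into memory, returning BadRequest when it fails.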

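This example shows how to bring modern AI capabilities into a traditional web application. You can build on it to create a more capable intelligent document processing system.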
