@CodFrm
Member

@CodFrm CodFrm commented Jan 8, 2026

Descriptions

close #1115

Changes

Screenshots

Contributor

Copilot AI left a comment

Pull request overview

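This PR fixes the character encoding detection problem during script installation (issue #1115). The main improvement replaces encoding detection based on the HTTP Content-Type header with automatic detection using the chardet library, making encoding identification more reliable and accurate.

Main changes:

  • Introduces the chardet library to automatically detect the character encoding of script files
  • Removes the previous charset parsing logic based on the Content-Type header
  • Improves error handling: when encoding detection or decoding fails, it falls back to UTF-8 instead of throwing an exception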
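
The fallback behaviour in the last point can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `decodeWithFallback` is a hypothetical helper, and the `detected` argument stands in for whatever label a detector such as chardet returns. Only the standard `TextDecoder` API is used.

```typescript
// Hypothetical helper (not taken from the PR): decode bytes with a detected
// encoding label, falling back to UTF-8 instead of throwing.
const decodeWithFallback = (data: Uint8Array, detected: string | null): string => {
  if (detected) {
    try {
      // The constructor throws for labels the runtime does not know;
      // with fatal: true, decode() throws on bytes invalid in that encoding.
      return new TextDecoder(detected, { fatal: true }).decode(data);
    } catch {
      // Unknown label or undecodable bytes: fall through to UTF-8.
    }
  }
  // Non-fatal UTF-8 replaces invalid sequences with U+FFFD rather than throwing.
  return new TextDecoder("utf-8").decode(data);
};

// An unknown label still yields text instead of an exception.
const greeting = decodeWithFallback(new Uint8Array([0x68, 0x69]), "not-a-real-encoding");
```

The trade-off is that a wrong guess produces replacement characters rather than a visible error, so the caller never sees the failure unless it is logged separately.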

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

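File | Description
src/pages/install/App.tsx | Changes encoding detection in the script fetch function from Content-Type header parsing to automatic detection with chardet, and improves the error handling logic
package.json | Adds the chardet@^2.1.1 dependency
pnpm-lock.yaml | Adds the chardet dependency and updates the lock file format (libc fields are added automatically to better support platform-specific dependencies)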
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

@CodFrm
Member Author

CodFrm commented Jan 8, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Contributor

Copilot AI commented Jan 8, 2026

@CodFrm I've opened a new pull request, #1139, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 3 commits January 8, 2026 17:53
* Initial plan

* Optimize encoding detection performance and add full test coverage

Co-authored-by: CodFrm <22783163+CodFrm@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: CodFrm <22783163+CodFrm@users.noreply.github.com>
@cyfung1031
Collaborator

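Your code looks great! A few small suggestions for improvement:

Suggested improvements:

  1. Add more fallback encoding options
  2. Add more detailed logging (optional)
  3. Handle common encoding aliases

Improved version: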

import chardet from "chardet";

/**
 * Parse the charset from the Content-Type header
 */
export const parseCharsetFromContentType = (contentType: string | null): string | null => {
  if (!contentType) return null;

  const match = contentType.match(/charset=([^;]+)/i);
  if (match && match[1]) {
    return match[1].trim().toLowerCase().replace(/['"]/g, "");
  }
  return null;
};

/**
 * Alias map for common encodings (works around chardet possibly returning an alias)
 */
const ENCODING_ALIASES: Record<string, string> = {
  'ascii': 'utf-8',
  'us-ascii': 'utf-8',
  'iso-8859-1': 'windows-1252', // commonly confused
  'gb2312': 'gb18030', // GB2312 is a subset of GB18030
  'cp1252': 'windows-1252',
  'cp1251': 'windows-1251',
  'shift-jis': 'shift_jis',
  'ms932': 'shift_jis',
};

/**
 * Normalize the encoding name
 */
const normalizeEncoding = (encoding: string): string => {
  const normalized = encoding.toLowerCase().trim();
  return ENCODING_ALIASES[normalized] || normalized;
};

/**
 * Check whether the encoding is supported by TextDecoder
 */
const isValidEncoding = (encoding: string): boolean => {
  try {
    new TextDecoder(encoding);
    return true;
  } catch {
    return false;
  }
};

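/**
 * Try decoding a leading sample to verify that the encoding fits the data
 */
const testDecode = (data: Uint8Array, encoding: string, sampleSize: number = 1024): boolean => {
  try {
    const sample = data.subarray(0, Math.min(data.length, sampleSize));
    const decoder = new TextDecoder(encoding, { fatal: true });
    // stream: true keeps a multi-byte character cut off at the sample boundary
    // from being reported as a decoding error
    decoder.decode(sample, { stream: true });
    return true;
  } catch {
    return false;
  }
};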

/**
 * Detect the encoding of a byte array.
 * Prefers the Content-Type header; if that fails, uses chardet (only on the first 16 KB, for performance)
 */
export const detectEncoding = (
  data: Uint8Array, 
  contentType: string | null,
  options: {
    verbose?: boolean;
    fallbackEncodings?: string[];
  } = {}
): string => {
  const {
    verbose = false,
    fallbackEncodings = ['utf-8', 'windows-1252', 'iso-8859-1']
  } = options;

  // 1. First try the charset from the Content-Type header
  const headerCharset = parseCharsetFromContentType(contentType);
  if (headerCharset) {
    const normalizedHeaderCharset = normalizeEncoding(headerCharset);
    
    if (isValidEncoding(normalizedHeaderCharset)) {
      if (testDecode(data, normalizedHeaderCharset)) {
        if (verbose) console.log(`Using charset from Content-Type header: ${normalizedHeaderCharset}`);
        return normalizedHeaderCharset;
      } else if (verbose) {
        console.warn(`Charset from header failed to decode: ${normalizedHeaderCharset}`);
      }
    } else if (verbose) {
      console.warn(`Invalid charset from Content-Type header: ${headerCharset} (normalized: ${normalizedHeaderCharset})`);
    }
  }

  // 2. Detect the encoding with chardet
  const sampleSize = Math.min(data.length, 16 * 1024);
  const sample = data.subarray(0, sampleSize);
  
  try {
    const detected = chardet.detect(sample);
    
    if (detected) {
      const detectedEncoding = normalizeEncoding(detected);
      
      if (isValidEncoding(detectedEncoding)) {
        if (testDecode(data, detectedEncoding)) {
          if (verbose) console.log(`Using charset detected by chardet: ${detectedEncoding}`);
          return detectedEncoding;
        } else if (verbose) {
          console.warn(`Charset detected by chardet failed to decode: ${detectedEncoding}`);
        }
      } else if (verbose) {
        console.warn(`Invalid charset detected by chardet: ${detected} (normalized: ${detectedEncoding})`);
      }
    }
  } catch (error: any) {
    if (verbose) console.warn(`chardet detection failed: ${error.message}`);
  }

  // 3. Try the fallback encodings
  for (const fallback of fallbackEncodings) {
    if (isValidEncoding(fallback) && testDecode(data, fallback)) {
      if (verbose) console.log(`Using fallback encoding: ${fallback}`);
      return fallback;
    }
  }

  // 4. Finally fall back to UTF-8 (with error recovery)
  if (verbose) console.log(`Falling back to utf-8`);
  return "utf-8";
};

/**
 * Safely decode a byte array to a string
 */
export const decodeBuffer = (
  data: Uint8Array,
  contentType: string | null,
  options?: Parameters<typeof detectEncoding>[2]
): string => {
  const encoding = detectEncoding(data, contentType, options);
  const decoder = new TextDecoder(encoding, { fatal: false });
  return decoder.decode(data);
};
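
One caveat applies to validating on a fixed-size sample, as `testDecode` does: the cut can land in the middle of a multi-byte character. The sketch below uses a hypothetical helper `sampleDecodes` (not part of the suggestion above) to show how `fatal: true` treats such a truncated sample with and without the standard `stream: true` decode option.

```typescript
// Hypothetical helper for illustration: check whether the first `sampleSize`
// bytes of `data` decode cleanly in `encoding`.
const sampleDecodes = (
  data: Uint8Array,
  encoding: string,
  sampleSize: number,
  stream: boolean
): boolean => {
  try {
    const sample = data.subarray(0, Math.min(data.length, sampleSize));
    // With stream: true, an incomplete trailing sequence is buffered
    // instead of being reported as an error.
    new TextDecoder(encoding, { fatal: true }).decode(sample, { stream });
    return true;
  } catch {
    return false;
  }
};

// "é" is 0xC3 0xA9 in UTF-8, so a 2-byte sample of "aé" cuts it in half.
const sampleBytes = new Uint8Array([0x61, 0xc3, 0xa9]);
const withoutStream = sampleDecodes(sampleBytes, "utf-8", 2, false); // false: truncated character rejected
const withStream = sampleDecodes(sampleBytes, "utf-8", 2, true); // true: partial trailing character tolerated
```

Without the `stream` option, valid UTF-8 input longer than the sample could be rejected purely because of where the sample ends.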

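Main improvements:

  1. Encoding alias handling: resolves common encoding aliases
  2. Actual decode test: checks not only that a TextDecoder can be created, but also that the data actually decodes
  3. Verbose logging option: adds a verbose option for debugging
  4. Multi-level fallback: multiple fallback encodings can be specified
  5. Utility function: adds a decodeBuffer function for direct use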
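
On alias handling, note that `TextDecoder` already resolves the labels defined by the WHATWG Encoding Standard (which treats `ascii` and `iso-8859-1` as labels for windows-1252, for example) and reports the canonical name through its `encoding` property. A minimal sketch, with a hypothetical helper `canonicalEncoding`, for checking what the runtime already understands before adding an entry to a hand-written alias table:

```typescript
// Hypothetical helper: ask the runtime for the canonical name of a label,
// or null when TextDecoder does not recognize it.
const canonicalEncoding = (label: string): string | null => {
  try {
    return new TextDecoder(label).encoding;
  } catch {
    // The constructor throws a RangeError for unknown labels.
    return null;
  }
};

const fromUtf8Alias = canonicalEncoding("utf8"); // "utf-8"
const fromUnknown = canonicalEncoding("not-a-real-encoding"); // null
// Per the Encoding Standard this is expected to resolve to "windows-1252";
// which labels are available depends on the runtime.
const fromLatin1 = canonicalEncoding("iso-8859-1");
```

A hand-written table is then only needed for names that return `null` here, or where a different target than the standard mapping is wanted.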

Usage examples:

// Basic usage
const encoding = detectEncoding(data, contentType);

// With verbose logging
const encoding = detectEncoding(data, contentType, { verbose: true });

// Custom fallback encodings
const encoding = detectEncoding(data, contentType, {
  fallbackEncodings: ['utf-8', 'gb18030', 'big5']
});

// Decode directly
const text = decodeBuffer(data, contentType);

Your original code is already good. These improvements are optional, so adopt whichever ones fit your needs.

@CodFrm CodFrm merged commit a3abaf0 into release/v1.3 Jan 9, 2026
3 checks passed
3 participants