🐛 Handle script encoding issues #1115 #1138
Pull request overview
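This PR fixes a character encoding detection problem that occurs when installing scripts (issue #1115). The main improvement replaces encoding detection based on the HTTP Content-Type header with automatic detection using the chardet library, which makes encoding identification more reliable and accurate.
Main changes:
- Introduces the chardet library to automatically detect the character encoding of script files
- Removes the previous charset parsing logic based on the Content-Type header
- Improves error handling: when encoding detection or decoding fails, the code falls back to UTF-8 instead of throwing an exception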
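The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `Detector` and `decodeWithFallback` are hypothetical names, and the detector is passed in as a parameter (in the PR that role is played by chardet) so the sketch runs without extra dependencies.

```typescript
// Hypothetical detector signature; in the PR this role is played by chardet.
type Detector = (bytes: Uint8Array) => string | null;

/**
 * Decode bytes using a detected encoding, falling back to UTF-8 when
 * detection fails, returns nothing, or yields an unsupported label.
 */
function decodeWithFallback(bytes: Uint8Array, detect: Detector): string {
  let label = "utf-8";
  try {
    label = detect(bytes) ?? "utf-8";
  } catch {
    // Detection itself failed: keep the UTF-8 default.
  }
  try {
    return new TextDecoder(label).decode(bytes);
  } catch {
    // TextDecoder rejects unknown labels with a RangeError;
    // decode as UTF-8 instead of propagating the exception.
    return new TextDecoder("utf-8").decode(bytes);
  }
}

const bytes = new TextEncoder().encode("// @name 脚本");
const viaDetected = decodeWithFallback(bytes, () => "utf-8");
const viaUnknown = decodeWithFallback(bytes, () => "not-a-real-encoding");
const viaNull = decodeWithFallback(bytes, () => null);
```

All three calls return the original string here, because the input is valid UTF-8 and every failure path ends at the UTF-8 decoder.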
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
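| src/pages/install/App.tsx | Switches encoding detection in the script fetch function from Content-Type header parsing to automatic detection with chardet, and streamlines the error handling |
| package.json | Adds the chardet@^2.1.1 dependency |
| pnpm-lock.yaml | Adds the chardet dependency and updates the lock file format (libc fields are added automatically to better support platform-specific dependencies) |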
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
@copilot open a new pull request to apply changes based on the comments in this thread
* Initial plan
* Optimize encoding detection performance and add full test coverage
Co-authored-by: CodFrm <22783163+CodFrm@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: CodFrm <22783163+CodFrm@users.noreply.github.com>
Your code looks very good! Here are a few small suggestions for improvement.
Suggested improvements:
Improved version:
import chardet from "chardet";

/**
 * Parse the charset from a Content-Type header.
 */
export const parseCharsetFromContentType = (contentType: string | null): string | null => {
  if (!contentType) return null;
  const match = contentType.match(/charset=([^;]+)/i);
  if (match && match[1]) {
    // Strip whitespace and surrounding quotes, e.g. charset="UTF-8" becomes utf-8
    return match[1].trim().toLowerCase().replace(/['"]/g, "");
  }
  return null;
};

/**
 * Alias map for common encodings (handles the case where chardet returns an alias).
 */
const ENCODING_ALIASES: Record<string, string> = {
  'ascii': 'utf-8',
  'us-ascii': 'utf-8',
  'iso-8859-1': 'windows-1252', // commonly confused
  'gb2312': 'gb18030', // GB2312 is a subset of GB18030
  'cp1252': 'windows-1252',
  'cp1251': 'windows-1251',
  'shift-jis': 'shift_jis',
  'ms932': 'shift_jis',
};

/**
 * Normalize an encoding name.
 */
const normalizeEncoding = (encoding: string): string => {
  const normalized = encoding.toLowerCase().trim();
  return ENCODING_ALIASES[normalized] || normalized;
};

/**
 * Check whether an encoding label is supported by TextDecoder.
 */
const isValidEncoding = (encoding: string): boolean => {
  try {
    new TextDecoder(encoding);
    return true;
  } catch {
    return false;
  }
};
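
/**
 * Try decoding a leading sample to check that the encoding fits the data.
 */
const testDecode = (data: Uint8Array, encoding: string, sampleSize: number = 1024): boolean => {
  try {
    const sample = data.subarray(0, Math.min(data.length, sampleSize));
    const decoder = new TextDecoder(encoding, { fatal: true });
    // Use stream mode when the sample is truncated, so that a multi-byte
    // character split at the sample boundary is not reported as an error.
    decoder.decode(sample, { stream: sample.length < data.length });
    return true;
  } catch {
    return false;
  }
};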

/**
 * Detect the encoding of a byte array.
 * Prefers the Content-Type header; if that fails, uses chardet
 * (only on the first 16 KB, to improve performance).
 */
export const detectEncoding = (
  data: Uint8Array,
  contentType: string | null,
  options: {
    verbose?: boolean;
    fallbackEncodings?: string[];
  } = {}
): string => {
  const {
    verbose = false,
    fallbackEncodings = ['utf-8', 'windows-1252', 'iso-8859-1']
  } = options;

  // 1. Try the charset from the Content-Type header first
  const headerCharset = parseCharsetFromContentType(contentType);
  if (headerCharset) {
    const normalizedHeaderCharset = normalizeEncoding(headerCharset);
    if (isValidEncoding(normalizedHeaderCharset)) {
      if (testDecode(data, normalizedHeaderCharset)) {
        if (verbose) console.log(`Using charset from Content-Type header: ${normalizedHeaderCharset}`);
        return normalizedHeaderCharset;
      } else if (verbose) {
        console.warn(`Charset from header failed to decode: ${normalizedHeaderCharset}`);
      }
    } else if (verbose) {
      console.warn(`Invalid charset from Content-Type header: ${headerCharset} (normalized: ${normalizedHeaderCharset})`);
    }
  }

  // 2. Detect the encoding with chardet
  const sampleSize = Math.min(data.length, 16 * 1024);
  const sample = data.subarray(0, sampleSize);
  try {
    const detected = chardet.detect(sample);
    if (detected) {
      const detectedEncoding = normalizeEncoding(detected);
      if (isValidEncoding(detectedEncoding)) {
        if (testDecode(data, detectedEncoding)) {
          if (verbose) console.log(`Using charset detected by chardet: ${detectedEncoding}`);
          return detectedEncoding;
        } else if (verbose) {
          console.warn(`Charset detected by chardet failed to decode: ${detectedEncoding}`);
        }
      } else if (verbose) {
        console.warn(`Invalid charset detected by chardet: ${detected} (normalized: ${detectedEncoding})`);
      }
    }
  } catch (error: unknown) {
    const message = error instanceof Error ? error.message : String(error);
    if (verbose) console.warn(`chardet detection failed: ${message}`);
  }

  // 3. Try the fallback encodings in order
  for (const fallback of fallbackEncodings) {
    if (isValidEncoding(fallback) && testDecode(data, fallback)) {
      if (verbose) console.log(`Using fallback encoding: ${fallback}`);
      return fallback;
    }
  }

  // 4. Final fallback to UTF-8 (with error recovery)
  if (verbose) console.log(`Falling back to utf-8`);
  return "utf-8";
};

/**
 * Safely decode a byte array into a string.
 */
export const decodeBuffer = (
  data: Uint8Array,
  contentType: string | null,
  options?: Parameters<typeof detectEncoding>[2]
): string => {
  const encoding = detectEncoding(data, contentType, options);
  // fatal: false replaces undecodable bytes with U+FFFD instead of throwing
  const decoder = new TextDecoder(encoding, { fatal: false });
  return decoder.decode(data);
};
Main improvements:
Usage examples:
// Basic usage
const encoding = detectEncoding(data, contentType);
// With verbose logging
const verboseEncoding = detectEncoding(data, contentType, { verbose: true });
// Custom fallback encodings
const customEncoding = detectEncoding(data, contentType, {
  fallbackEncodings: ['utf-8', 'gb18030', 'big5']
});
// Decode directly
const text = decodeBuffer(data, contentType);
Your original code is already good. These improvements are optional, so adopt whichever ones fit your needs.
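The `testDecode` helper in this suggestion validates only a leading sample with a strict (`fatal: true`) decoder, so it is worth checking what happens when a multi-byte character is cut at the sample boundary. The sketch below is an illustration under that assumption; `sampleDecodes` is a hypothetical name, and the `stream` flag is the standard `TextDecoder.decode` option that buffers an incomplete trailing sequence instead of treating it as an error.

```typescript
/**
 * Strictly decode only the first `sampleSize` bytes of `data`.
 * Returns false if the decoder rejects the sample for `encoding`.
 */
function sampleDecodes(
  data: Uint8Array,
  encoding: string,
  sampleSize: number,
  stream: boolean
): boolean {
  const sample = data.subarray(0, Math.min(data.length, sampleSize));
  try {
    new TextDecoder(encoding, { fatal: true }).decode(sample, { stream });
    return true;
  } catch {
    return false;
  }
}

// "aé" is 0x61 0xC3 0xA9 in UTF-8, so a 2-byte sample ends in the middle of "é".
const utf8Bytes = new TextEncoder().encode("aé");

// Without stream mode the truncated sequence counts as invalid input.
const strictWhole = sampleDecodes(utf8Bytes, "utf-8", 2, false);

// With stream mode the incomplete tail is buffered rather than rejected.
const strictStream = sampleDecodes(utf8Bytes, "utf-8", 2, true);
```

In other words, a valid UTF-8 file can fail a non-streaming sample check purely because of where the sample ends, which would push detection on to the next candidate encoding.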
Descriptions
close #1115
Changes
Screenshots