C# tutorial: how to detect or identify a file or byte stream (no… (2)

Translated into plain words, the flow (and intent) of the original code is as follows:

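1. Check for a BOM header, which is easy.

2. Check for UTF-8 encoding (a rather creative step): if the bytes fully conform to the UTF-8 rules, treat them as UTF-8.

3. Check whether the bytes contain newline characters, and use the position of the 0 byte within the newline to tell UTF-16 BE (big-endian) from LE (little-endian).

How likely this is to succeed depends on the length of the byte sample and on whether it contains a newline at all.

4. Check how often 0 bytes occur at odd and even positions and compare against a preset expected value to make a guess. (This is basically useless for Chinese; the code was probably written by a non-Chinese developer who analyzed the probabilities for English text only.)

5. Check whether the bytes contain any 0 at all; if not, return the system default encoding (which differs from one system environment to another).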
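
Step 2 above relies on the fact that UTF-8 multi-byte sequences have a strict shape: a lead byte announces how many continuation bytes (0x80 to 0xBF) must follow. The article does not show its `CheckUtf8` method, so the following is only a minimal sketch of that idea under stated assumptions. The class and method names are hypothetical, and the check is structural only (it does not reject every overlong form or surrogate code point).

```csharp
using System;

// Hypothetical helper, not the article's CheckUtf8.
public static class Utf8Check
{
    // Returns true when the first `size` bytes are structurally valid UTF-8
    // and contain at least one multi-byte sequence. Pure ASCII returns false,
    // because ASCII alone does not prove the text is UTF-8.
    public static bool LooksLikeUtf8(byte[] buffer, int size)
    {
        bool sawMultiByte = false;
        int pos = 0;
        while (pos < size)
        {
            byte b = buffer[pos++];
            if (b < 0x80) continue;                      // single-byte ASCII

            int extra;                                   // continuation bytes expected
            if (b >= 0xC2 && b <= 0xDF) extra = 1;       // 2-byte sequence
            else if (b >= 0xE0 && b <= 0xEF) extra = 2;  // 3-byte sequence
            else if (b >= 0xF0 && b <= 0xF4) extra = 3;  // 4-byte sequence
            else return false;                           // continuation byte, 0xC0/0xC1, or above 0xF4

            if (pos + extra > size) break;               // sequence cut off by the sample window

            for (int i = 0; i < extra; i++)
            {
                byte c = buffer[pos++];
                if (c < 0x80 || c > 0xBF) return false;  // not a continuation byte
            }
            sawMultiByte = true;
        }
        return sawMultiByte;
    }
}
```

For example, the UTF-8 bytes of "中文" (E4 B8 AD E6 96 87) pass this check, while the GBK bytes of the same text (D6 D0 CE C4) fail, because D0 is not a valid continuation byte after the lead byte D6.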

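First of all, it has to be said that the original author did have some good ideas.

Apart from the UTF-8 analysis, which follows the encoding rules, none of the other checks in the code hold up in a Chinese-language environment.

Even so, the approach itself offers plenty of inspiration.

So I spent the better part of a day rewriting and adapting it, and testing it against Chinese text.

Reworking the detection for data without a BOM:

The flow of the reworked code is as follows: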

public Encoding DetectWithoutBom(byte[] buffer, int size)
{
    // First check for valid UTF-8
    Encoding encoding = CheckUtf8(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }

    // ANSI or None (binary): the case where there is not a single zero byte.
    if (!ContainsZero(buffer, size))
    {
        CheckChinese(buffer, size);
        return Encoding.Ansi;
    }

    // Now try UTF-16: look for newline characters first
    encoding = CheckByNewLineChar(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }

    // Last resort: make a rough guess from the proportion of zero bytes
    encoding = CheckByZeroNumPercent(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }

    // Zero bytes were found but nothing matched: unknown or binary
    return Encoding.None;
}
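
The method above calls `ContainsZero` and `CheckByZeroNumPercent`, neither of which is shown in this excerpt. The sketch below illustrates the underlying idea only: in UTF-16 text with many ASCII characters, the zero bytes cluster at odd offsets for little-endian and at even offsets for big-endian. The class name, the `GuessUtf16ByteOrder` method, and the 0.4 and 0.1 thresholds are all assumptions made for illustration, not values taken from the article.

```csharp
using System;

// Hypothetical helpers, not the article's implementation.
public static class ZeroByteHeuristics
{
    // True when any of the first `size` bytes is 0x00.
    public static bool ContainsZero(byte[] buffer, int size)
    {
        for (int i = 0; i < size; i++)
        {
            if (buffer[i] == 0) return true;
        }
        return false;
    }

    // Guesses the UTF-16 byte order from where the zero bytes sit.
    // Returns "LE", "BE", or null when the sample is inconclusive.
    public static string GuessUtf16ByteOrder(byte[] buffer, int size)
    {
        int pairs = size / 2;
        if (pairs == 0) return null;

        int evenZeros = 0;
        int oddZeros = 0;
        for (int i = 0; i + 1 < size; i += 2)
        {
            if (buffer[i] == 0) evenZeros++;      // zero high byte comes first: big-endian
            if (buffer[i + 1] == 0) oddZeros++;   // zero high byte comes second: little-endian
        }

        double even = (double)evenZeros / pairs;
        double odd = (double)oddZeros / pairs;

        // Assumed thresholds: one side clearly dominant, the other nearly absent.
        if (odd > 0.4 && even < 0.1) return "LE";
        if (even > 0.4 && odd < 0.1) return "BE";
        return null;
    }
}
```

A heuristic of this kind only works when the sample contains enough characters below U+0100. Text made up entirely of Chinese characters has few or no zero bytes in UTF-16, which is why the reworked flow handles the no-zero case separately.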

Explained in plain language, the flow is:

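1. The UTF-8 detection rules are generic and effective, so they are kept.

2. Reordered: first check whether the bytes contain any 0 byte; if not, add a check for Chinese encodings (GB2312, GBK, Big5).

The result of this check is used later.

3. Newline check: added detection of UTF-32 (the original approach covered UTF-16 only).

4. Probability guess: reworked so that it also suits a Chinese-language environment.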
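
Step 2 mentions a check for Chinese encodings, but `CheckChinese` itself is not shown in this excerpt. A minimal sketch of one way such a check could work is given below, assuming the standard GBK double-byte layout: a lead byte in 0x81 to 0xFE followed by a trail byte in 0x40 to 0xFE, excluding 0x7F. The class and method names are hypothetical, and this is not presented as the author's implementation.

```csharp
using System;

// Hypothetical helper, not the article's CheckChinese.
public static class ChineseCheck
{
    // Returns true when every byte >= 0x80 in the first `size` bytes starts a
    // well-formed GBK double-byte pair, and at least one such pair was seen.
    public static bool LooksLikeGbk(byte[] buffer, int size)
    {
        int pairs = 0;
        int pos = 0;
        while (pos < size)
        {
            byte lead = buffer[pos++];
            if (lead < 0x80) continue;                       // ASCII
            if (lead == 0x80 || lead == 0xFF) return false;  // outside the lead range 0x81-0xFE
            if (pos >= size) break;                          // pair cut off by the sample window

            byte trail = buffer[pos++];
            if (trail < 0x40 || trail == 0x7F || trail == 0xFF)
            {
                return false;                                // outside the trail range
            }
            pairs++;
        }
        return pairs > 0;
    }
}
```

The ranges are loose enough that other byte-oriented encodings can pass as well, so a check like this is only meaningful after the UTF-8 check has already failed, which matches the order used in the reworked flow.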

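The test results:

A. Pure Chinese text:

In this test, BigEndianUnicode input comes out garbled.

B. Text that is not purely Chinese:

All encodings are detected correctly.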
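
One plausible explanation for case A, offered here as an assumption rather than something the article states: UTF-16 text that consists only of Chinese characters usually contains no zero bytes and no newline, so neither the newline check nor the zero-ratio check has anything to work with. The small helper below (a hypothetical name, for illustration only) counts the zero bytes a string produces in a given encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration.
public static class PureChineseDemo
{
    // Encodes `text` with `encoding` and counts the 0x00 bytes in the result.
    public static int ZeroBytesIn(string text, Encoding encoding)
    {
        int count = 0;
        foreach (byte b in encoding.GetBytes(text))
        {
            if (b == 0) count++;
        }
        return count;
    }
}
```

`ZeroBytesIn("中文", Encoding.BigEndianUnicode)` returns 0, because the bytes are 4E 2D 65 87. `ZeroBytesIn("中a\n", Encoding.BigEndianUnicode)` returns 2, because the ASCII letter and the newline each contribute a zero high byte, which gives the detector something to work with in mixed text.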

The complete improved source code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace CYQ.Data.Tool
{
    internal static class IOHelper
    {
        internal static Encoding DefaultEncoding = Encoding.Default;

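        // Ten shared lock objects, created once by the static initializer
        // so that concurrent first use cannot see a partially filled pool.
        private static readonly object[] LockObjs = CreateLockObjs(10);
        private static object[] CreateLockObjs(int count)
        {
            object[] objs = new object[count];
            for (int i = 0; i < count; i++)
            {
                objs[i] = new object();
            }
            return objs;
        }
        private static object GetLockObj(int length)
        {
            // Pick a lock by file-name length, using every slot in the pool.
            return LockObjs[length % LockObjs.Length];
        }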
        /// <summary>
        /// Auto-detects UTF-8 first; otherwise falls back to reading with the Default encoding.
        /// </summary>
        /// <returns></returns>
        public static string ReadAllText(string fileName)
        {
            return ReadAllText(fileName, DefaultEncoding);
        }
        public static string ReadAllText(string fileName, Encoding encoding)
        {
            try
            {
                if (!File.Exists(fileName))
                {
                    return string.Empty;
                }
                Byte[] buff = null;
                lock (GetLockObj(fileName.Length))
                {
                    if (!File.Exists(fileName)) // re-check inside the lock for the multithreaded case
                    {
                        return string.Empty;
                    }
                    buff = File.ReadAllBytes(fileName);
                    return BytesToText(buff, encoding);
                }

            }
            catch (Exception err)
            {
                Log.WriteLogToTxt(err);
            }
            return string.Empty;
        }
        public static bool Write(string fileName, string text)
        {
            return Save(fileName, text, false, DefaultEncoding, true);
        }
        public static bool Write(string fileName, string text, Encoding encode)
        {
            return Save(fileName, text, false, encode, true);
        }
        public static bool Append(string fileName, string text)
        {
            return Save(fileName, text, true, true);
        }

        internal static bool Save(string fileName, string text, bool isAppend, bool writeLogOnError)
        {
            // Forward isAppend instead of always appending.
            return Save(fileName, text, isAppend, DefaultEncoding, writeLogOnError);
        }
        internal static bool Save(string fileName, string text, bool isAppend, Encoding encode, bool writeLogOnError)
        {
            try
            {
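                string folder = Path.GetDirectoryName(fileName);
                // GetDirectoryName returns an empty string for a bare file name; nothing to create then.
                if (!string.IsNullOrEmpty(folder) && !Directory.Exists(folder))
                {
                    Directory.CreateDirectory(folder);
                }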

                lock (GetLockObj(fileName.Length))
                {
                    using (StreamWriter writer = new StreamWriter(fileName, isAppend, encode))
                    {
                        writer.Write(text);
                    }
                }
                return true;
            }
            catch (Exception err)
            {
                if (writeLogOnError)
                {
                    Log.WriteLogToTxt(err);
                }
                else
                {
                    Error.Throw("IOHelper.Save() : " + err.Message);
                }
            }
            return false;
        }

        internal static bool Delete(string fileName)
        {
            try
            {
                if (File.Exists(fileName))
                {
                    lock (GetLockObj(fileName.Length))
                    {
                        if (File.Exists(fileName))
                        {
                            File.Delete(fileName);
                            return true;
                        }
                    }
                }
            }
            catch
            {

            }
            return false;
        }

        public static bool IsLastFileWriteTimeChanged(string fileName, ref DateTime compareTimeUtc)
        {
            bool isChanged = false;
            IOInfo info = new IOInfo(fileName);
            if (info.Exists && info.LastWriteTimeUtc != compareTimeUtc)
            {
                isChanged = true;
                compareTimeUtc = info.LastWriteTimeUtc;
            }
            return isChanged;
        }
        public static string BytesToText(byte[] buff, Encoding encoding)
        {
            if (buff == null || buff.Length == 0) { return ""; }
            //if (buff[0] == 239 && buff[1] == 187 && buff[2] == 191)
            //{
            //    return Encoding.UTF8.GetString(buff, 3, buff.Length - 3);
            //}
            //else if (buff[0] == 255 && buff[1] == 254)
            //{
            //    return Encoding.Unicode.GetString(buff, 2, buff.Length - 2);
            //}
            //else if (buff[0] == 254 && buff[1] == 255)
            //{
            //    if (buff.Length > 3 && buff[2] == 0 && buff[3] == 0)
            //    {
            //        return Encoding.UTF32.GetString(buff, 4, buff.Length - 4);
            //    }
            //    return Encoding.BigEndianUnicode.GetString(buff, 2, buff.Length - 2);
            //}
            //else
            //{
            TextEncodingDetect detect = new TextEncodingDetect();

            // Check for a BOM
            switch (detect.DetectWithBom(buff))
            {
                case TextEncodingDetect.Encoding.Utf8Bom:
                    return Encoding.UTF8.GetString(buff, 3, buff.Length - 3);
                case TextEncodingDetect.Encoding.UnicodeBom:
                    return Encoding.Unicode.GetString(buff, 2, buff.Length - 2);
                case TextEncodingDetect.Encoding.BigEndianUnicodeBom:
                    return Encoding.BigEndianUnicode.GetString(buff, 2, buff.Length - 2);
                case TextEncodingDetect.Encoding.Utf32Bom:
                    return Encoding.UTF32.GetString(buff, 4, buff.Length - 4);
            }
            if (encoding != DefaultEncoding && encoding != Encoding.ASCII) // a caller-specified encoding takes priority
            {
                return encoding.GetString(buff);
            }
            switch (detect.DetectWithoutBom(buff, buff.Length > 1000 ? 1000 : buff.Length)) // auto-detect from at most the first 1000 bytes
            {

                case TextEncodingDetect.Encoding.Utf8Nobom:
                    return Encoding.UTF8.GetString(buff);

                case TextEncodingDetect.Encoding.UnicodeNoBom:
                    return Encoding.Unicode.GetString(buff);

                case TextEncodingDetect.Encoding.BigEndianUnicodeNoBom:
                    return Encoding.BigEndianUnicode.GetString(buff);

                case TextEncodingDetect.Encoding.Utf32NoBom:
                    return Encoding.UTF32.GetString(buff);

                case TextEncodingDetect.Encoding.Ansi:
                    if (IsChineseEncoding(DefaultEncoding) && !IsChineseEncoding(encoding))
                    {
                        if (detect.IsChinese)
                        {
                            return Encoding.GetEncoding("gbk").GetString(buff);
                        }
                        else // not Chinese: pick a default
                        {
                            return Encoding.Unicode.GetString(buff);
                        }
                    }
                    else
                    {
                        return encoding.GetString(buff);
                    }

                case TextEncodingDetect.Encoding.Ascii:
                    return Encoding.ASCII.GetString(buff);

                default:
                    return encoding.GetString(buff);
            }
            // }
        }
        private static bool IsChineseEncoding(Encoding encoding)
        {
            // Compare code pages rather than object references: 936 is GB2312/GBK, 950 is Big5.
            if (encoding == null) { return false; }
            return encoding.CodePage == 936 || encoding.CodePage == 950;
        }
    }
    internal class IOInfo : FileSystemInfo
    {
        public IOInfo(string fileName)
        {
            base.FullPath = fileName;
        }
        public override void Delete()
        {
        }

        public override bool Exists
        {
            get
            {
                return File.Exists(base.FullPath);
            }
        }

        public override string Name
        {
            get
            {
                return null;
            }
        }
    }
    /// <summary>
    /// Detects the text encoding of a byte buffer.
    /// </summary>
    internal class TextEncodingDetect
    {
        private readonly byte[] _UTF8Bom =
        {
            0xEF,
            0xBB,
            0xBF
        };
        //utf16le _UnicodeBom
        private readonly byte[] _UTF16LeBom =
        {
            0xFF,
            0xFE
        };

        //utf16be _BigUnicodeBom
        private readonly byte[] _UTF16BeBom =
        {
            0xFE,
            0xFF
        };

        //utf-32le
        private readonly byte[] _UTF32LeBom =
        {
            0xFF,
            0xFE,
            0x00,
            0x00
        };
        //utf-32Be
        //private readonly byte[] _UTF32BeBom =
        //{
        //    0x00,
        //    0x00,
        //    0xFE,
        //    0xFF
        //};
        /// <summary>
        /// Whether the content was detected as Chinese.
        /// </summary>
        public bool IsChinese = false;

        public enum Encoding
        {
            None, // Unknown or binary
            Ansi, // 0-255
            Ascii, // 0-127
            Utf8Bom, // UTF8 with BOM
            Utf8Nobom, // UTF8 without BOM
            UnicodeBom, // UTF16 LE with BOM
            UnicodeNoBom, // UTF16 LE without BOM
            BigEndianUnicodeBom, // UTF16-BE with BOM
            BigEndianUnicodeNoBom, // UTF16-BE without BOM

            Utf32Bom,//UTF-32LE with BOM
            Utf32NoBom //UTF-32 without BOM

        }

        public Encoding DetectWithBom(byte[] buffer)
        {
            if (buffer != null)
            {
                int size = buffer.Length;
                // Check for BOM
                if (size >= 2 && buffer[0] == _UTF16LeBom[0] && buffer[1] == _UTF16LeBom[1])
                {
                    // FF FE 00 00 is the UTF-32 LE BOM; FF FE on its own is UTF-16 LE.
                    if (size >= 4 && buffer[2] == _UTF32LeBom[2] && buffer[3] == _UTF32LeBom[3])
                    {
                        return Encoding.Utf32Bom;
                    }
                    return Encoding.UnicodeBom;
                }

                if (size >= 2 && buffer[0] == _UTF16BeBom[0] && buffer[1] == _UTF16BeBom[1])
                {
                    return Encoding.BigEndianUnicodeBom;
                }

                if (size >= 3 && buffer[0] == _UTF8Bom[0] && buffer[1] == _UTF8Bom[1] && buffer[2] == _UTF8Bom[2])
                {
                    return Encoding.Utf8Bom;
                }
            }
            return Encoding.None;
        }

        /// <summary>
        ///     Automatically detects the Encoding type of a given byte buffer.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>The Encoding type or Encoding.None if unknown.</returns>
        public Encoding DetectWithoutBom(byte[] buffer, int size)
        {
            // Now check for valid UTF8
            Encoding encoding = CheckUtf8(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }

            // ANSI or None (binary): the case where there is not a single zero byte.
            if (!ContainsZero(buffer, size))
            {
                CheckChinese(buffer, size);
                return Encoding.Ansi;
            }

            // Now try UTF-16: look for newline characters first
            encoding = CheckByNewLineChar(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }

            // Last resort: make a rough guess from the proportion of zero bytes
            encoding = CheckByZeroNumPercent(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }

            // Zero bytes were found but nothing matched: unknown or binary
            return Encoding.None;
        }

        /// <summary>
        ///     Checks whether a buffer contains text that looks like UTF-16 (or UTF-32 LE)
        ///     by scanning for newline characters, which are present even in non-English text.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>Encoding.None, Encoding.UnicodeNoBom, Encoding.BigEndianUnicodeNoBom or Encoding.Utf32NoBom.</returns>
        private static Encoding CheckByNewLineChar(byte[] buffer, int size)
        {
            if (size < 2)
            {
                return Encoding.None;
            }

            // Reduce size by 1 so we don't need to worry about bounds checking for pairs of bytes
            size--;

            int le16 = 0;
            int be16 = 0;
            int le32 = 0; // counts newlines that look like UTF-32 LE
            int zeroCount = 0; // in UTF-32 LE the trailing bytes of each 4-byte unit are mostly 0
            uint pos = 0;
            while (pos < size)
            {
                byte ch1 = buffer[pos++];
                byte ch2 = buffer[pos++];

                if (ch1 == 0)
                {
                    if (ch2 == 0x0a || ch2 == 0x0d) // \n or \r: newline check
                    {
                        ++be16;
                    }
                }
                if (ch2 == 0)
                {
                    zeroCount++;
                    if (ch1 == 0x0a || ch1 == 0x0d)
                    {
                        ++le16;
                        if (pos + 1 <= size && buffer[pos] == 0 && buffer[pos + 1] == 0)
                        {
                            ++le32;
                        }

                    }
                }

                // If we are getting both LE and BE control chars then this file is not utf16
      



  
