A strange problem with Java generating UTF-8 encoded files

    Blog category:
  • java
★★★ This post is original. If you quote or repost it, please credit: 《 http://stephen830.iteye.com/blog/259350 》 Thank you for your support! ★★★

Generating a UTF-8 file with Java:

If the file content contains no Chinese text, the generated file comes out in ANSI format;
if the file content contains Chinese text, the generated file comes out in UTF-8 format.

In other words, if your file content has no Chinese in it, the file you generate is ANSI-encoded.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

/**
 * Generates a UTF-8 file.
 * If the file content contains no Chinese text, the resulting file is in ANSI format;
 * if the file content contains Chinese text, the resulting file is in UTF-8 format.
 * @param fileName name of the file to generate (full path included)
 * @param fileBody file content
 * @return true on success, false on failure
 */
public static boolean writeUTFFile(String fileName, String fileBody) {
    FileOutputStream fos = null;
    OutputStreamWriter osw = null;
    try {
        fos = new FileOutputStream(fileName);
        osw = new OutputStreamWriter(fos, "UTF-8");
        osw.write(fileBody);
        return true;
    } catch (Exception e) {
        e.printStackTrace();
        return false;
    } finally {
        if (osw != null) {
            try {
                osw.close(); // closing the writer also flushes it
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }
        if (fos != null) {
            try {
                fos.close();
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }
    }
}

// main()
public static void main(String[] args) {
    writeUTFFile("C:\\test1.txt", "aaa");      // test1.txt comes out as an ANSI-format file
    writeUTFFile("C:\\test2.txt", "中文aaa");  // test2.txt comes out as a UTF-8-format file
}
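To see why test1.txt passes for an ANSI file, it helps to look at the raw bytes actually written. A minimal sketch of my own (assuming the two files from main() above already exist) that dumps each file in hex:

import java.io.FileInputStream;
import java.io.IOException;

public class DumpBytes {
    public static void main(String[] args) throws IOException {
        for (String name : new String[] { "C:\\test1.txt", "C:\\test2.txt" }) {
            FileInputStream in = new FileInputStream(name);
            try {
                System.out.print(name + ":");
                int b;
                while ((b = in.read()) != -1) {
                    System.out.printf(" %02X", b); // print each byte in hex
                }
                System.out.println();
            } finally {
                in.close();
            }
        }
        // Expected output:
        // C:\test1.txt: 61 61 61                    (identical to ANSI "aaa", no BOM)
        // C:\test2.txt: E4 B8 AD E6 96 87 61 61 61  ("中文aaa" in UTF-8, no BOM)
    }
}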


Thanks to [Liteos] for the suggestion about the UTF-8 BOM. Hex dumps of the two files are shown below for reference.

Viewed in UltraEdit's hex mode (UltraEdit version 10.0, see the figures below):

(test1.txt, ANSI format; a: 61)


(test2.txt, UTF-8 format; 2D 4E: 中, 87 65: 文, 61 00: a)

If your UltraEdit also shows the two dumps above, upgrade it right away: old versions of UltraEdit get the hex-mode view of UTF-8 files wrong.

Thanks to Liteos for pointing this out. I uninstalled my old UltraEdit 10, installed the latest 14.20, and hex mode now shows test2.txt as follows:

(UltraEdit 14.20 view of test2.txt: E4 B8 AD is "中", E6 96 87 is "文", 61 is "a")

This shows how extravagant UTF-8's handling of Chinese is: it takes 3 bytes to represent one Chinese character! It reminds me of a criticism of UTF-8 I read in another article (that UTF-8 in a sense discriminates against Asian languages); see "Why use UTF-8?" at http://stephen830.iteye.com/blog/258929
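A quick way to check the three-bytes-per-character point is to compare encoded lengths directly (a small sketch of my own, using the standard String.getBytes overload):

import java.io.UnsupportedEncodingException;

public class CjkWidth {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println("中".getBytes("UTF-8").length); // 3 (E4 B8 AD)
        System.out.println("中".getBytes("GBK").length);   // 2 under GBK/GB2312
        System.out.println("a".getBytes("UTF-8").length);  // 1 (the same single byte as ASCII)
    }
}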



Back to the question of this post: why was no UTF-8 file generated? My guess was that when Java's internal I/O meets nothing but single-byte characters it produces an ANSI-format file (even though the program explicitly asked for UTF-8 -- why not give me UTF-8? a bug?), and only produces a file in the requested encoding (e.g. UTF-8) once it meets multi-byte characters.
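As the discussion in the comments below makes clear, the bytes Java writes for pure-ASCII content are in fact valid UTF-8; what editors miss is a signature. If you want tools to recognize the file as UTF-8 regardless of content, you can write the UTF-8 BOM (EF BB BF) yourself. A sketch of that idea (writeUTFFileWithBOM is my own illustrative name, not from the original code):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class Utf8BomWriter {
    public static boolean writeUTFFileWithBOM(String fileName, String fileBody) {
        OutputStreamWriter osw = null;
        try {
            osw = new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8");
            osw.write('\uFEFF'); // the UTF-8 encoder emits this char as EF BB BF
            osw.write(fileBody);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        } finally {
            if (osw != null) {
                try {
                    osw.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
    }
}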

Below is the W3C's FAQ on the UTF-8 BOM (original: http://www.w3.org/International/questions/qa-utf8-bom):

Quote:

FAQ: Display problems caused by the UTF-8 BOM


Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, XSLT developers, Web project managers, and anyone who is trying to diagnose why blank lines or other strange items are displayed on their UTF-8 page.
Question

When using UTF-8 encoded pages in some user agents, I get an extra line or unwanted characters at the top of my web page or included file. How do I remove them?
Answer

If you are dealing with a file encoded in UTF-8, your display problems may be caused by the presence of a UTF-8 signature (BOM) that the user agent doesn't recognize.

The BOM is always at the beginning of the file, and so you would normally expect to see the display issues at the top of a page. However, you may also find blank lines appearing within the page if you include text from a separate file that begins with a UTF-8 signature.

We have a set of test pages and a summary of results for various recent browser versions that explore this behaviour.

This article will help you determine whether the UTF-8 signature is causing the problem. If there is no evidence of a UTF-8 signature at the beginning of the file, then you will have to look elsewhere for a solution.
What is a UTF-8 signature (BOM)?

Some applications insert a particular combination of bytes at the beginning of a file to indicate that the text contained in the file is Unicode. This combination of bytes is known as a signature or Byte Order Mark (BOM). Some applications - such as a text editor or a browser - will display the BOM as an extra line in the file, others will display unexpected characters, such as .

See the side panel for more detailed information about the BOM.

The BOM is the Unicode codepoint U+FEFF, corresponding to the Unicode character 'ZERO WIDTH NON-BREAKING SPACE' (ZWNBSP).

In UTF-16 and UTF-32 encodings, unless there is some alternative indicator, the BOM is essential to ensure correct interpretation of the file's contents. Each character in the file is represented by 2 or 4 bytes of data and the order in which these bytes are stored in the file is significant; the BOM indicates this order.

In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 or UTF-32 encodings, there is no alternative sequence of bytes in a character. The BOM may still occur in UTF-8 encoding text, however, either as a by-product of an encoding conversion or because it was added by an editor.
Detecting the BOM

First, we need to check whether there is indeed a BOM at the beginning of the file.

You can try looking for a BOM in your content, but if your editor handles the UTF-8 signature correctly you probably won't be able to see it. An editor which does not handle the UTF-8 signature correctly displays the bytes that compose that signature according to its own character encoding setting. (With the Latin 1 (ISO 8859-1) character encoding, the signature displays as the characters .) With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.

Alternatively, your editor may tell you in a status bar or a menu what encoding your file is in, including information about the presence or not of the UTF-8 signature.

If not, some kind of script-based test (see below) may help. Alternatively, you could try this small web-based utility. (Note, if it’s a file included by PHP or some other mechanism that you think is causing the problem, type in the URI of the included file.)
Removing the BOM

If you have an editor which shows the characters that make up the UTF-8 signature you may be able to delete them by hand. Chances are, however, that the BOM is there in the first place because you didn't see it.

Check whether your editor allows you to specify whether a UTF-8 signature is added or kept during a save. Such an editor provides a way of removing the signature by simply reading the file in then saving it out again. For example, if Dreamweaver detects a BOM the Save As dialogue box will have a check mark alongside the text "Include Unicode Signature (BOM)". Just uncheck the box and save.

One of the benefits of using a script is that you can remove the signature quickly, and from multiple files. In fact the script could be run automatically as part of your process. If you use Perl, you could use a simple script created by Martin Dürst.

Note: You should check the process impact of removing the signature. It may be that some part of your content development process relies on the use of the signature to indicate that a file is in UTF-8. Bear in mind also that pages with a high proportion of Latin characters may look correct superficially but that occasional characters outside the ASCII range (U+0000 to U+007F) may be incorrectly encoded.
By the way

You will find that some text editors such as Windows Notepad will automatically add a UTF-8 signature to any file you save as UTF-8.

A UTF-8 signature at the beginning of a CSS file can sometimes cause the initial rules in the file to fail on certain user agents.

In some browsers, the presence of a UTF-8 signature will cause the browser to interpret the text as UTF-8 regardless of any character encoding declarations to the contrary.
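Following the FAQ above, here is a small check for the UTF-8 signature in Java (my own sketch; the "script-based test" the FAQ mentions refers to its own Perl/PHP examples, which are not reproduced here):

import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {
    // Returns true if the file starts with the UTF-8 signature EF BB BF.
    public static boolean hasUtf8Bom(String fileName) throws IOException {
        FileInputStream in = new FileInputStream(fileName);
        try {
            byte[] head = new byte[3];
            int n = in.read(head);
            return n == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hasUtf8Bom("C:\\test2.txt")); // false: Java wrote no BOM
    }
}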





The method above uses the class sun.nio.cs.StreamEncoder internally. Its source is posted below for reference:
/*
 * Copyright 2001-2005 Sun Microsystems, Inc.  All Rights Reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Sun designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Sun in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
 * CA 95054 USA or visit www.sun.com if you need additional information or
 * have any questions.
 */

/*
 */

package sun.nio.cs;

import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;

public class StreamEncoder extends Writer
{

    private static final int DEFAULT_BYTE_BUFFER_SIZE = 8192;

    private volatile boolean isOpen = true;

    private void ensureOpen() throws IOException {
        if (!isOpen)
            throw new IOException("Stream closed");
    }

    // Factories for java.io.OutputStreamWriter
    public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                      Object lock,
                                                      String charsetName)
        throws UnsupportedEncodingException
    {
        String csn = charsetName;
        if (csn == null)
            csn = Charset.defaultCharset().name();
        try {
            if (Charset.isSupported(csn))
                return new StreamEncoder(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
        throw new UnsupportedEncodingException (csn);
    }

    public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                      Object lock,
                                                      Charset cs)
    {
        return new StreamEncoder(out, lock, cs);
    }

    public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                      Object lock,
                                                      CharsetEncoder enc)
    {
        return new StreamEncoder(out, lock, enc);
    }


    // Factory for java.nio.channels.Channels.newWriter

    public static StreamEncoder forEncoder(WritableByteChannel ch,
                                           CharsetEncoder enc,
                                           int minBufferCap)
    {
        return new StreamEncoder(ch, enc, minBufferCap);
    }


    // -- Public methods corresponding to those in OutputStreamWriter --

    // All synchronization and state/argument checking is done in these public
    // methods; the concrete stream-encoder subclasses defined below need not
    // do any such checking.

    public String getEncoding() {
        if (isOpen())
            return encodingName();
        return null;
    }

    public void flushBuffer() throws IOException {
        synchronized (lock) {
            if (isOpen())
                implFlushBuffer();
            else
                throw new IOException("Stream closed");
        }
    }

    public void write(int c) throws IOException {
        char cbuf[] = new char[1];
        cbuf[0] = (char) c;
        write(cbuf, 0, 1);
    }

    public void write(char cbuf[], int off, int len) throws IOException {
        synchronized (lock) {
            ensureOpen();
            if ((off < 0) || (off > cbuf.length) || (len < 0) ||
                ((off + len) > cbuf.length) || ((off + len) < 0)) {
                throw new IndexOutOfBoundsException();
            } else if (len == 0) {
                return;
            }
            implWrite(cbuf, off, len);
        }
    }

    public void write(String str, int off, int len) throws IOException {
        /* Check the len before creating a char buffer */
        if (len < 0)
            throw new IndexOutOfBoundsException();
        char cbuf[] = new char[len];
        str.getChars(off, off + len, cbuf, 0);
        write(cbuf, 0, len);
    }

    public void flush() throws IOException {
        synchronized (lock) {
            ensureOpen();
            implFlush();
        }
    }

    public void close() throws IOException {
        synchronized (lock) {
            if (!isOpen)
                return;
            implClose();
            isOpen = false;
        }
    }

    private boolean isOpen() {
        return isOpen;
    }


    // -- Charset-based stream encoder impl --

    private Charset cs;
    private CharsetEncoder encoder;
    private ByteBuffer bb;

    // Exactly one of these is non-null
    private final OutputStream out;
    private WritableByteChannel ch;

    // Leftover first char in a surrogate pair
    private boolean haveLeftoverChar = false;
    private char leftoverChar;
    private CharBuffer lcb = null;

    private StreamEncoder(OutputStream out, Object lock, Charset cs) {
        this(out, lock,
         cs.newEncoder()
         .onMalformedInput(CodingErrorAction.REPLACE)
         .onUnmappableCharacter(CodingErrorAction.REPLACE));
    }

    private StreamEncoder(OutputStream out, Object lock, CharsetEncoder enc) {
        super(lock);
        this.out = out;
        this.ch = null;
        this.cs = enc.charset();
        this.encoder = enc;

        // This path disabled until direct buffers are faster
        if (false && out instanceof FileOutputStream) {
            ch = ((FileOutputStream) out).getChannel();
            if (ch != null)
                bb = ByteBuffer.allocateDirect(DEFAULT_BYTE_BUFFER_SIZE);
        }
        if (ch == null) {
            bb = ByteBuffer.allocate(DEFAULT_BYTE_BUFFER_SIZE);
        }
    }

    private StreamEncoder(WritableByteChannel ch, CharsetEncoder enc, int mbc) {
        this.out = null;
        this.ch = ch;
        this.cs = enc.charset();
        this.encoder = enc;
        this.bb = ByteBuffer.allocate(mbc < 0
                                  ? DEFAULT_BYTE_BUFFER_SIZE
                                  : mbc);
    }

    private void writeBytes() throws IOException {
        bb.flip();
        int lim = bb.limit();
        int pos = bb.position();
        assert (pos <= lim);
        int rem = (pos <= lim ? lim - pos : 0);

        if (rem > 0) {
            if (ch != null) {
                if (ch.write(bb) != rem)
                    assert false : rem;
            } else {
                out.write(bb.array(), bb.arrayOffset() + pos, rem);
            }
        }
        bb.clear();
    }

    private void flushLeftoverChar(CharBuffer cb, boolean endOfInput)
        throws IOException
    {
        if (!haveLeftoverChar && !endOfInput)
            return;
        if (lcb == null)
            lcb = CharBuffer.allocate(2);
        else
            lcb.clear();
        if (haveLeftoverChar)
            lcb.put(leftoverChar);
        if ((cb != null) && cb.hasRemaining())
            lcb.put(cb.get());
        lcb.flip();
        while (lcb.hasRemaining() || endOfInput) {
            CoderResult cr = encoder.encode(lcb, bb, endOfInput);
            if (cr.isUnderflow()) {
                if (lcb.hasRemaining()) {
                    leftoverChar = lcb.get();
                    if (cb != null && cb.hasRemaining())
                        flushLeftoverChar(cb, endOfInput);
                    return;
                }
                break;
            }
            if (cr.isOverflow()) {
                assert bb.position() > 0;
                writeBytes();
                continue;
            }
            cr.throwException();
        }
        haveLeftoverChar = false;
    }

    void implWrite(char cbuf[], int off, int len)
        throws IOException
    {
        CharBuffer cb = CharBuffer.wrap(cbuf, off, len);

        if (haveLeftoverChar)
            flushLeftoverChar(cb, false);

        while (cb.hasRemaining()) {
            CoderResult cr = encoder.encode(cb, bb, false);
            if (cr.isUnderflow()) {
                assert (cb.remaining() <= 1) : cb.remaining();
                if (cb.remaining() == 1) {
                    haveLeftoverChar = true;
                    leftoverChar = cb.get();
                }
                break;
            }
            if (cr.isOverflow()) {
                assert bb.position() > 0;
                writeBytes();
                continue;
            }
            cr.throwException();
        }
    }

    void implFlushBuffer() throws IOException {
        if (bb.position() > 0)
            writeBytes();
    }

    void implFlush() throws IOException {
        implFlushBuffer();
        if (out != null)
            out.flush();
    }

    void implClose() throws IOException {
        flushLeftoverChar(null, true);
        try {
            for (;;) {
                CoderResult cr = encoder.flush(bb);
                if (cr.isUnderflow())
                    break;
                if (cr.isOverflow()) {
                    assert bb.position() > 0;
                    writeBytes();
                    continue;
                }
                cr.throwException();
            }

            if (bb.position() > 0)
                writeBytes();
            if (ch != null)
                ch.close();
            else
                out.close();
        } catch (IOException x) {
            encoder.reset();
            throw x;
        }
    }

    String encodingName() {
        return ((cs instanceof HistoricallyNamedCharset)
            ? ((HistoricallyNamedCharset)cs).historicalName()
            : cs.name());
    }
}
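Note that nothing in StreamEncoder ever writes a signature: the constructor simply wraps cs.newEncoder() with REPLACE error actions and encodes the chars it is given. A quick way to confirm that the JDK's UTF-8 encoder itself emits no BOM (my own sketch):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class NoBomDemo {
    public static void main(String[] args) {
        ByteBuffer bb = Charset.forName("UTF-8").encode("aaa");
        while (bb.hasRemaining()) {
            System.out.printf("%02X ", bb.get()); // prints: 61 61 61
        }
        // Exactly the ASCII bytes, with no EF BB BF prefix -- which is why an
        // all-ASCII "UTF-8" file is indistinguishable from an ANSI one.
        System.out.println();
    }
}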



If any of you can find the cause, please leave your answer!

-------------------------------------------------------------
Share knowledge, share joy. I hope this article brings a little help to those who need it.
Comments
#22 forai 2014-06-28
I've hit this very problem and can't solve it... how depressing...
#21 mylazygirl 2010-03-22
Liteos wrote:
stephen830 wrote:

Thanks for the reply, Liteos. But why does Java take it upon itself to generate an ANSI file when it was supposed to generate a UTF-8 file?


That's just a fixed mindset on your part. Since UTF-8 and ANSI encode ASCII identically, you could just as well see it as "Java generated a UTF-8 file containing only ASCII characters, and UltraEdit identified it as an ANSI file."

As for the BOM, it is not required for UTF-8. The BOM's job is to provide byte-order identification for Unicode; UTF-8 has no byte-order issue, which is one reason UTF-8 is recommended on the network. Another reason is that by turning two-byte codes into multi-byte sequences, UTF-8 avoids the information loss caused by a dropped byte. Take the text "我们中国大地": in Unicode (UTF-16), if one byte of the two-byte "中" is lost, "中" and everything after it turns to garbage.
Wikipedia has an analysis of UTF-8's strengths and weaknesses: http://zh.wikipedia.org/wiki/UTF-8

The document you posted yourself also says that many programs mishandle the BOM (versions before JDK 1.6, for example, do not support reading it), so use it with care. For adding a BOM by hand in Java, see http://hibernate.blogdriver.com/hibernate/1138141.html


What does this have to do with UE? Try opening it as plain text. Don't spout off here without even running an experiment; I've just hit exactly this problem myself. Thanks to the original poster.
#20 rmfish 2008-11-02
UTF-8 byte sequences (binary):
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Within UTF-8:

1. If a byte's highest bit (bit 8) is 0, it is an ASCII character (00 - 7F). So all ASCII text is already valid UTF-8.

2. If a byte starts with 11, the number of consecutive leading 1s gives the character's byte count; for example, 110xxxxx is the first byte of a two-byte UTF-8 character.

3. If a byte starts with 10, it is not a first byte, and you must scan backwards to find the first byte of the current character.

So UTF-8 effectively protects data integrity and avoids misaligned decoding: even if a "bad character" slips in, the text that follows is unaffected.
For Chinese characters in the BMP, 3 bytes are needed per character, 50% more than UTF-16 or the plain two-byte GB2312 encoding.
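To see this byte layout in action, one can decode the three bytes E4 B8 AD from the dump earlier by hand (an illustrative sketch, not part of the original comment):

public class Utf8Decode {
    public static void main(String[] args) {
        // E4 = 1110 0100 -> lead byte of a 3-byte character, payload 0100
        // B8 = 10 111000 -> continuation byte, payload 111000
        // AD = 10 101101 -> continuation byte, payload 101101
        int b1 = 0xE4, b2 = 0xB8, b3 = 0xAD;
        int cp = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
        System.out.printf("U+%04X -> %c%n", cp, (char) cp); // U+4E2D -> 中
    }
}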



#19 stephen830 2008-11-02
sdh5724 wrote:

Asian languages are encoded in a unified space that also has to cover the whole CJK area. Without 3 bytes, do the math yourself; don't just wave some English text around because a foreigner wrote it. With such an enormous number of characters, how would you fit them all in? With no range reserved for each language? The people behind UTF-8 are not fools.

hhhhkkkk wrote:

Without a BOM, an all-English file is byte-for-byte the same under ANSI and UTF-8; whether an editor treats the file as ANSI or UTF-8 is merely its encoding auto-detection order. My vim auto-detects it as utf8, because my detection order is set to ucs-bom,utf-8,cp936. If you can produce two all-English files, without BOMs, encoded as ANSI and UTF-8 respectively, whose bytes differ, then you may call it a JDK problem.


I posted this problem hoping we could look into it together; I did not expect things to turn out this way. I suppose only those who have actually hit this problem can understand how I feel.

Still, my thanks to everyone who commented, especially Liteos, whose replies taught me a great deal. Thank you!

#18 hhhhkkkk 2008-11-02
Without a BOM, an all-English file is byte-for-byte the same under ANSI and UTF-8; whether an editor treats the file as ANSI or UTF-8 is merely its encoding auto-detection order. My vim auto-detects it as utf8, because my detection order is set to ucs-bom,utf-8,cp936. If you can produce two all-English files, without BOMs, encoded as ANSI and UTF-8 respectively, whose bytes differ, then you may call it a JDK problem.
#17 sdh5724 2008-11-02
Asian languages are encoded in a unified space that also has to cover the whole CJK area. Without 3 bytes, do the math yourself; don't just wave some English text around because a foreigner wrote it. With such an enormous number of characters, how would you fit them all in? With no range reserved for each language? The people behind UTF-8 are not fools.
#16 stephen830 2008-11-01
hax wrote:

OP, if you don't understand this, don't make a fuss. "Discrimination against Asian languages": nonsense. Your own unhealthy attitude, blamed on others.


This isn't making trouble out of nothing; looked at factually, it really is so. Below is a quote from the earlier article "Why use UTF-8?" (http://stephen830.iteye.com/blog/258929); take a good look.

UTF-8 is a discriminatory encoding: under gb2312 a Chinese character needs only two bytes, while UTF-8 needs three, one extra byte for no good reason. Think how much waste that adds to the storage and network transfer of Chinese documents. The foreigners admit it themselves:
Let's address the problem first: UTF-8 is kind of racist. It allows us round-eye paleface anglophone types to tuck our characters neatly into one byte, lets most people whose languages are headquartered west of the Indus river get away with two bytes per, and penalizes India and points east by requiring them to use three bytes per character.
#15 hax 2008-11-01
UTF-8 is a great invention: it allows a smooth migration from ASCII support to Unicode support. In that respect it has been enormously beneficial to people who use languages other than English, since many English-only programs can support Unicode with only minor changes.

As for how a program tells pure ANSI from UTF-8: only by the BOM, and without a BOM, by guessing. Nor is this limited to Chinese; European scripts outside the ASCII range and other alphabetic scripts face exactly the same issue.

If the OP knows what it means to learn from embarrassment, he will hurry up and fix those laughable code comments.
#14 hax 2008-11-01
OP, if you don't understand this, don't make a fuss.

"Discrimination against Asian languages": nonsense. Your own unhealthy attitude, blamed on others.
#13 stephen830 2008-11-01
Liteos wrote:

stephen830 wrote:
Thanks for the reply, Liteos. But why does Java take it upon itself to generate an ANSI file when it was supposed to generate a UTF-8 file?


That's just a fixed mindset on your part. Since UTF-8 and ANSI encode ASCII identically, you could just as well see it as "Java generated a UTF-8 file containing only ASCII characters, and UltraEdit identified it as an ANSI file."

As for the BOM, it is not required for UTF-8. The BOM's job is to provide byte-order identification for Unicode; UTF-8 has no byte-order issue, which is one reason UTF-8 is recommended on the network. Another reason is that by turning two-byte codes into multi-byte sequences, UTF-8 avoids the information loss caused by a dropped byte. Take the text "我们中国大地": in Unicode (UTF-16), if one byte of the two-byte "中" is lost, "中" and everything after it turns to garbage.
Wikipedia has an analysis of UTF-8's strengths and weaknesses: http://zh.wikipedia.org/wiki/UTF-8

The document you posted yourself also says that many programs mishandle the BOM (versions before JDK 1.6, for example, do not support reading it), so use it with care. For adding a BOM by hand in Java, see http://hibernate.blogdriver.com/hibernate/1138141.html


Actually, I don't think this is about mindset; I simply raised the question based on the facts. From the standpoint of rigor, what Java does here really can lead to a series of follow-up problems. For example: if I later add Chinese content to the test1.txt above, reading it as UTF-8 will produce garbage.
#12 Liteos 2008-11-01
stephen830 wrote:

Thanks for the reply, Liteos. But why does Java take it upon itself to generate an ANSI file when it was supposed to generate a UTF-8 file?


That's just a fixed mindset on your part. Since UTF-8 and ANSI encode ASCII identically, you could just as well see it as "Java generated a UTF-8 file containing only ASCII characters, and UltraEdit identified it as an ANSI file."

As for the BOM, it is not required for UTF-8. The BOM's job is to provide byte-order identification for Unicode; UTF-8 has no byte-order issue, which is one reason UTF-8 is recommended on the network. Another reason is that by turning two-byte codes into multi-byte sequences, UTF-8 avoids the information loss caused by a dropped byte. Take the text "我们中国大地": in Unicode (UTF-16), if one byte of the two-byte "中" is lost, "中" and everything after it turns to garbage.
Wikipedia has an analysis of UTF-8's strengths and weaknesses: http://zh.wikipedia.org/wiki/UTF-8

The document you posted yourself also says that many programs mishandle the BOM (versions before JDK 1.6, for example, do not support reading it), so use it with care. For adding a BOM by hand in Java, see http://hibernate.blogdriver.com/hibernate/1138141.html
#11 stephen830 2008-10-31
Liteos wrote:

E4 B8 AD is the UTF-8 encoding of "中", E6 96 87 that of "文", and 61 that of "a" (same as ASCII). This file has no BOM; choose "Save As..." and set the format to "UTF-8" to add one. UTF-8 encodes Chinese in three bytes and leaves ASCII untouched (single byte). Spend a little time with the UTF-8 spec and these doubts will disappear.

Thanks for the reply, Liteos. But why does Java take it upon itself to generate an ANSI file when it was supposed to generate a UTF-8 file?
#10 Liteos 2008-10-31
E4 B8 AD is the UTF-8 encoding of "中", E6 96 87 that of "文", and 61 that of "a" (same as ASCII). This file has no BOM; choose "Save As..." and set the format to "UTF-8" to add one. UTF-8 encodes Chinese in three bytes and leaves ASCII untouched (single byte). Spend a little time with the UTF-8 spec and these doubts will disappear.
#9 stephen830 2008-10-31
Liteos wrote:

stephen830 wrote:
What I used was exactly the ASCII-to-UTF-8 (Unicode) conversion.


http://tieba.baidu.com/f?kz=185433774
Only old versions of UltraEdit have this problem; my 13.20+2 does not. Update your UltraEdit.

After updating, it is indeed different. The information displayed is even stranger now, though.
#8 Liteos 2008-10-31
stephen830 wrote:

What I used was exactly the ASCII-to-UTF-8 (Unicode) conversion.


http://tieba.baidu.com/f?kz=185433774
Only old versions of UltraEdit have this problem; my 13.20+2 does not. Update your UltraEdit.
#7 stephen830 2008-10-31
Liteos wrote:

stephen830 wrote:
Check for yourself on your own machine: what is this so-called BOM header? Here I get FF FE for both UTF-8 and Unicode; there is no EF BB BF.


UltraEdit cannot create a new UTF-8 file directly. Create a new ASCII or Unicode file, then choose "File --> Conversions --> **** to UTF-8 (Unicode editing)" from the menu. The status bar then shows U8-DOS, as opposed to U-DOS for Unicode.


What I used was exactly the ASCII-to-UTF-8 (Unicode) conversion.
#6 Liteos 2008-10-31
stephen830 wrote:

Check for yourself on your own machine: what is this so-called BOM header? Here I get FF FE for both UTF-8 and Unicode; there is no EF BB BF.


UltraEdit cannot create a new UTF-8 file directly. Create a new ASCII or Unicode file, then choose "File --> Conversions --> **** to UTF-8 (Unicode editing)" from the menu. The status bar then shows U8-DOS, as opposed to U-DOS for Unicode.
#5 stephen830 2008-10-30
Liteos wrote:

stephen830 wrote:
In UTF-8 the ASCII codes are all represented with two bytes.


FF FE is the Unicode (UTF-16) BOM; Unicode is of course double-byte. The UTF-8 BOM is EF BB BF, and it is permitted but usually absent. English text is identical under UTF-8 and ASCII; Java is simply following the spec, it is not a bug.


Check for yourself on your own machine: what is this so-called BOM header? Here I get FF FE for both UTF-8 and Unicode; there is no EF BB BF.
#4 Liteos 2008-10-30
stephen830 wrote:

In UTF-8 the ASCII codes are all represented with two bytes.


FF FE is the Unicode (UTF-16) BOM; Unicode is of course double-byte. The UTF-8 BOM is EF BB BF, and it is permitted but usually absent. English text is identical under UTF-8 and ASCII; Java is simply following the spec, it is not a bug.
#3 stephen830 2008-10-30
Liteos wrote:

This is a matter of the blogger's understanding of UTF-8: UTF-8 does not transform ASCII. Add one Chinese character and the generated file differs from the original by only three bytes.

In UTF-8 the ASCII codes are all represented with two bytes.
