Thursday, December 23, 2010

Accents becoming garbage on form submits and Character Encoding

Non-ASCII data such as accented characters and typographic apostrophes passes through several places: the HTML page that displays it, the form fields that submit it, the server code that reads it, and the database that stores it. The trick is that all of these places must use the same encoding, which is typically UTF-8.
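To see why a mismatch produces garbage, here is a minimal sketch using only the JDK. The class and method names (MojibakeDemo, transcode) are made up for illustration. It writes a string as UTF-8 bytes and then reads those bytes back as ISO-8859-1, which is roughly what happens when two of the places above disagree.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode text with one charset, then decode the same bytes with another.
    public static String transcode(String text, Charset writer, Charset reader) {
        byte[] bytes = text.getBytes(writer);
        return new String(bytes, reader);
    }

    public static void main(String[] args) {
        // "café", written with an escape so the source file encoding does not matter
        String original = "caf\u00e9";

        // Same charset on both sides: the text survives.
        String ok = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Written as UTF-8 but read as ISO-8859-1: the two bytes of the accented
        // letter (0xC3 0xA9) turn into two separate characters.
        String garbled = transcode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(original));              // true
        System.out.println(garbled.equals("caf\u00c3\u00a9")); // true
        System.out.println(garbled.length());                 // 5
    }
}
```

The garbled result is one character longer than the original because UTF-8 uses two bytes for the accented letter and ISO-8859-1 treats each byte as its own character.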

The following are the places where the encoding is defined, and how it is done in each.

Page content format:
Set the HTML page's character encoding through the HTTP Content-Type header or a meta tag.
Eg: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If this is not set, browsers typically assume ISO-8859-1 as the default character encoding.

Form submits:
GET and POST request parameters are also encoded according to the page encoding.
This can be overridden with the accept-charset="UTF-8" attribute on the form tag.
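As a rough illustration of what ends up on the wire, the sketch below percent-encodes the same form value under two page encodings with the JDK's URLEncoder, which approximates what a browser does on submit. The class and helper names (FormEncodingDemo, encode, decode) are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormEncodingDemo {

    // Percent-encode a form value using the given charset name.
    public static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decode a percent-encoded value using the given charset name.
    public static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String name = "Jos\u00e9"; // "José"

        // The same character is sent as different bytes depending on the charset.
        System.out.println(encode(name, "UTF-8"));      // Jos%C3%A9
        System.out.println(encode(name, "ISO-8859-1")); // Jos%E9

        // A receiver that decodes with the wrong charset gets garbage.
        String wrong = decode("Jos%C3%A9", "ISO-8859-1");
        System.out.println(wrong.equals(name));         // false
    }
}
```

The receiver cannot tell from the percent-encoded bytes alone which charset was used, which is why the page, the form and the server all have to agree.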


Server Request Parameters:
In Servlets, JSPs and Portlets, the request parameter encoding can be set with the following statement.
request.setCharacterEncoding("UTF-8");
This must be called before any request parameter is read; otherwise it has no effect. If it is not set, the web server assumes ISO-8859-1 as the default.
To make this more generic, the encoding can be set
in the doFilter method of a Servlet Filter.

public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain)
        throws IOException, ServletException {
    String encoding = "UTF-8";

    // Only set the encoding when the client did not declare one.
    // This has to happen before chain.doFilter, because
    // setCharacterEncoding has no effect once parameters have been read.
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding(encoding);
    }

    chain.doFilter(request, response);
}
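If the filter above cannot be installed and a parameter has already been decoded as ISO-8859-1, a commonly used workaround is to recover the original bytes and decode them again as UTF-8. The sketch below shows the idea with plain strings; the class and method names (ParameterRepairDemo, repair) are hypothetical, and the input stands in for a value returned by request.getParameter.

```java
import java.nio.charset.StandardCharsets;

public class ParameterRepairDemo {

    // Re-interpret a string that was decoded as ISO-8859-1
    // although the underlying bytes were UTF-8.
    public static String repair(String misdecoded) {
        // ISO-8859-1 maps each character 0-255 back to exactly one byte,
        // so this recovers the bytes the client originally sent.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the server sees when UTF-8 "José" is decoded as ISO-8859-1.
        String fromContainer = "Jos\u00c3\u00a9";

        String fixed = repair(fromContainer);
        System.out.println(fixed.equals("Jos\u00e9")); // true
    }
}
```

This only helps when the value really was mis-decoded. Applying it to text that is already correct damages any accented characters, so setting the encoding up front in the filter remains the cleaner option.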

One tricky point: if the form's encoding type (enctype) is "multipart/form-data", each value must be
read by specifying the encoding explicitly, as below.

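// iter is an Iterator over the parsed FileItem list (Apache Commons FileUpload)
FileItem item = (FileItem) iter.next();
if (item.isFormField()) {
    // Decode the field as UTF-8 rather than relying on the default
    value = item.getString("UTF-8").trim();
}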

The final place where the encoding must be handled is the database itself. It should be
set when the database is created. If it is changed after the database has been created, existing
data can be corrupted.


Reference : http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
