Ticket #331 (new defect)
Detecting text encoding of tmp-rsrc on upload
| Reported by: | svr | Owned by: | mpo |
|---|---|---|---|
| Priority: | minor | Milestone: | 0.5 |
| Component: | Upload Control | Version: | trunk |
| Keywords: | utf-8 upload check text encoding | Cc: | kauri-discuss@… |
Description
Hello,
I've encountered some issues while working with the upload control, more specifically with the tmp-rsrc. Users can upload CSV files for processing, but I have no control over which charset is used (the default is UTF-8, but MS Excel uses ISO-8859-1).
So I thought I could use the tmp-rsrc to find out which charset the file actually uses, but apparently it always reports UTF-8. So when I have to process an ISO-8859-1 file, I get strange characters...
When I look at the source code (UploadNewDataResource.java), I see a TODO about detecting encodings. The default encoding is always used.
    //TODO how can we be sure of the character-set of uploaded text files?
    // needs further investigation and test-cases. (there seems no way to really know)
    // might eventually need some service that actually does detection of real charsets
    uploadRepresentation.setCharacterSet(CharacterSet.DEFAULT);
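As a rough illustration of what such a detection step could look like, here is a minimal sketch that tries to decode the uploaded bytes strictly as UTF-8 and falls back to another charset when that fails. The class `CharsetGuesser` and its method `guess` are hypothetical names, not part of Kauri or Restlet, and the sketch assumes the whole upload fits in a byte array. It is a heuristic only: pure ASCII content decodes as UTF-8 as well, and it cannot tell different single-byte charsets apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not an existing Kauri class.
final class CharsetGuesser {
    private CharsetGuesser() {}

    /**
     * Returns UTF-8 if the bytes form valid UTF-8, otherwise the given fallback.
     * The decoder is configured to report errors instead of silently replacing
     * malformed sequences, so invalid input raises an exception.
     */
    static Charset guess(byte[] data, Charset fallback) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; assume the caller-supplied fallback charset.
            return fallback;
        }
    }
}
```

In the situation described here, the fallback would be `StandardCharsets.ISO_8859_1`, since that is the only other charset expected. The result would still have to be mapped to whatever character-set type the upload representation expects.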
At the moment I've implemented a simple check at application level for whether a file is UTF-8 (by checking for the BOM), as I'm quite sure UTF-8 and ISO-8859-1 are the only charsets I can encounter in my case.
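For reference, an application-level BOM check of this kind could look roughly like the sketch below. The class `BomCheck` and its methods are hypothetical names used for illustration, not the actual application code, and the sketch assumes that any file without a UTF-8 BOM (bytes EF BB BF) is ISO-8859-1. UTF-8 files written without a BOM would therefore be misclassified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; names are not from the ticket.
final class BomCheck {
    private BomCheck() {}

    /** True if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Assumption: only UTF-8 (with BOM) and ISO-8859-1 can occur. */
    static Charset charsetFor(byte[] data) {
        return hasUtf8Bom(data) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }
}
```

When the BOM is present, its three bytes should be skipped before the content is parsed as CSV, otherwise they end up in the first field.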
More advanced checks are needed at the Kauri level, though.
Steven
Attachments
Change History
Changed 13 months ago by anonymous
- Attachment 4° TRIMESTRE 2011 PRI12T16_Riass_vita.TXT added
riass
might be interesting when investigating charset detection: