Debugging corrupted character encoding issues
You may occasionally see corrupted text: scrambled, garbled, or displayed as "garbage" characters. Suppose your application (a server) receives a JSON request containing some corrupted characters. One possible cause is that the application uses a different character encoding (e.g. UTF-8 vs. ISO-8859-1) than the client it is communicating with.
It can be useful to generate a hexadecimal view of such a request, usually referred to as a hex dump (using commands like hexdump, od, or xxd), in order to debug such issues by looking at the lower-level representation of the text:
:~$ xxd -p <<<{"badv":[],"bcat":[],"device":{"dnt":0,"geo":{"city":"WICHITA","country":"USA","lat":37.68978,"lon":-97.34148,"metro":"678","region":"KS","type":2,"zip":"67212"},"ip":"68.107.183.251","language":"es","os":"WINDOWS","osv":"WINDOWS7","ua":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0"},"id":"WpkAxTdf5ThP","imp":[{"banner":{"battr":[],"btype":[],"ext":{},"h":250,"pos":3,"topframe":0,"w":300},"bidfloorcur":"USD","id":"WpkAxTdf5ThP","instl":0,"secure":0,"tagid":"302868"}],"site":{"cat":[],"domain":"diario.mx","id":"100017","keywords":"diari,hrs,york,local,tim,new,nacional,chihuahu,paso,vide,the,reform,duart,ciudad,servic,puent,news,mexican,medi,espectacul,york_tim,new_york,the_new,tim_news,servic_21,news_servic,septiembr_2015,ju�_rez,diari_18,associated_press,victor_orozc","page":"http://diario.mx/Nvo_Casas_Grandes/2015-10-01_f2c710fc/realizan-el-segundo-computo-de-las-candidatas/","pagecat":["IAB20"],"publisher":{"domain":"diario.mx","id":"558393"},"ref":"http://diario.mx/Nvo_Casas_Grandes/"},"tmax":125,"user":{"ext":{},"id":"xutRfr-GoxxU-RAOQ_ehzA"}}| tr -d '\n'
7b626164763a5b5d2c626361743a5b5d2c6465766963653a7b646e743a302c67656f3a7b636974793a574943484954412c636f756e7472793a5553412c6c61743a33372e36383937382c6c6f6e3a2d39372e33343134382c6d6574726f3a3637382c726567696f6e3a4b532c747970653a322c7a69703a36373231327d2c69703a36382e3130372e3138332e3235312c6c616e67756167653a65732c6f733a57494e444f57532c6f73763a57494e444f5753372c75613a4d6f7a696c6c612f352e30202857696e646f7773204e5420362e313b20574f5736343b2072763a33392e3029204765636b6f2f32303130303130312046697265666f782f33392e307d2c69643a57706b4178546466355468502c696d703a5b7b62616e6e65723a7b62617474723a5b5d2c62747970653a5b5d2c6578743a7b7d2c683a3235302c706f733a332c746f706672616d653a302c773a3330307d2c626964666c6f6f726375723a5553442c69643a57706b4178546466355468502c696e73746c3a302c7365637572653a302c74616769643a3330323836387d5d2c736974653a7b6361743a5b5d2c646f6d61696e3a64696172696f2e6d782c69643a3130303031372c6b6579776f7264733a64696172692c6872732c796f726b2c6c6f63616c2c74696d2c6e65772c6e6163696f6e616c2c63686968756168752c7061736f2c766964652c7468652c7265666f726d2c64756172742c6369756461642c7365727669632c7075656e742c6e6577732c6d65786963616e2c6d6564692c6573706563746163756c2c796f726b5f74696d2c6e65775f796f726b2c7468655f6e65772c74696d5f6e6577732c7365727669635f32312c6e6577735f7365727669632c7365707469656d62725f323031352c6a75efbfbd5f72657a2c64696172695f31382c6173736f6369617465645f70726573732c766963746f725f6f726f7a632c706167653a687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f323031352d31302d30315f66326337313066632f7265616c697a616e2d656c2d736567756e646f2d636f6d7075746f2d64652d6c61732d63616e646964617461732f2c706167656361743a5b49414232305d2c7075626c69736865723a7b646f6d61696e3a64696172696f2e6d782c69643a3535383339337d2c7265663a687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f7d2c746d61783a3132352c757365723a7b6578743a7b7d2c69643a7875745266722d476f7878552d52414f515f65687a417d7d0a
A hex dump also makes it easier to share or manipulate a piece of text without any further loss of information.
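One detail worth checking when reproducing the dump: the JSON above is passed to the here-string unquoted, so the shell removes the double quotes before xxd sees them (the dump starts with 7b6261, i.e. {ba, rather than 7b2262). A minimal sketch of a quote-preserving dump follows, assuming a short hypothetical sample instead of the full request and using printf rather than a here-string to avoid the trailing newline:

```shell
# Hypothetical short sample; single quotes keep the JSON double quotes intact.
sample='{"city":"WICHITA"}'

# Plain hex dump on a single line.
hex=$(printf '%s' "$sample" | xxd -p | tr -d '\n')
echo "$hex"

# Convert the hex back to text to confirm that nothing was lost.
roundtrip=$(printf '%s' "$hex" | xxd -r -p)
echo "$roundtrip"
```

The first echo should print 7b2263697479223a2257494348495441227d, where each 22 is a preserved double quote.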
You can convert it back to plain text by using the -r flag:
:~$ xxd -p -r <<< 7b2262616476223a5b5d2c2262636174223a5b5d2c22646576696365223a7b22646e74223a302c2267656f223a7b2263697479223a2257494348495441222c22636f756e747279223a22555341222c226c6174223a33372e36383937382c226c6f6e223a2d39372e33343134382c226d6574726f223a22363738222c22726567696f6e223a224b53222c2274797065223a322c227a6970223a223637323132227d2c226970223a2236382e3130372e3138332e323531222c226c616e6775616765223a226573222c226f73223a2257494e444f5753222c226f7376223a2257494e444f575337222c227561223a224d6f7a696c6c612f352e30202857696e646f7773204e5420362e313b20574f5736343b2072763a33392e3029204765636b6f2f32303130303130312046697265666f782f33392e30227d2c226964223a2257706b417854646635546850222c22696d70223a5b7b2262616e6e6572223a7b226261747472223a5b5d2c226274797065223a5b5d2c22657874223a7b7d2c2268223a3235302c22706f73223a332c22746f706672616d65223a302c2277223a3330307d2c22626964666c6f6f72637572223a22555344222c226964223a2257706b417854646635546850222c22696e73746c223a302c22736563757265223a302c227461676964223a22333032383638227d5d2c2273697465223a7b22636174223a5b5d2c22646f6d61696e223a2264696172696f2e6d78222c226964223a22313030303137222c226b6579776f726473223a2264696172692c6872732c796f726b2c6c6f63616c2c74696d2c6e65772c6e6163696f6e616c2c63686968756168752c7061736f2c766964652c7468652c7265666f726d2c64756172742c6369756461642c7365727669632c7075656e742c6e6577732c6d65786963616e2c6d6564692c6573706563746163756c2c796f726b5f74696d2c6e65775f796f726b2c7468655f6e65772c74696d5f6e6577732c7365727669635f32312c6e6577735f7365727669632c7365707469656d62725f323031352c6a75e35f72657a2c64696172695f31382c6173736f6369617465645f70726573732c766963746f725f6f726f7a63222c2270616765223a22687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f323031352d31302d30315f66326337313066632f7265616c697a616e2d656c2d736567756e646f2d636f6d7075746f2d64652d6c61732d63616e646964617461732f222c2270616765636174223a5b224941423230225d2c227075626c6973686572223a7b22646f6d61696e223a2264696172696f2e6d78222c226964223a22353538333933227d2c22726566
223a22687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f227d2c22746d6178223a3132352c2275736572223a7b22657874223a7b7d2c226964223a227875745266722d476f7878552d52414f515f65687a41227d7d
{"badv":[],"bcat":[],"device":{"dnt":0,"geo":{"city":"WICHITA","country":"USA","lat":37.68978,"lon":-97.34148,"metro":"678","region":"KS","type":2,"zip":"67212"},"ip":"68.107.183.251","language":"es","os":"WINDOWS","osv":"WINDOWS7","ua":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0"},"id":"WpkAxTdf5ThP","imp":[{"banner":{"battr":[],"btype":[],"ext":{},"h":250,"pos":3,"topframe":0,"w":300},"bidfloorcur":"USD","id":"WpkAxTdf5ThP","instl":0,"secure":0,"tagid":"302868"}],"site":{"cat":[],"domain":"diario.mx","id":"100017","keywords":"diari,hrs,york,local,tim,new,nacional,chihuahu,paso,vide,the,reform,duart,ciudad,servic,puent,news,mexican,medi,espectacul,york_tim,new_york,the_new,tim_news,servic_21,news_servic,septiembr_2015,ju�_rez,diari_18,associated_press,victor_orozc","page":"http://diario.mx/Nvo_Casas_Grandes/2015-10-01_f2c710fc/realizan-el-segundo-computo-de-las-candidatas/","pagecat":["IAB20"],"publisher":{"domain":"diario.mx","id":"558393"},"ref":"http://diario.mx/Nvo_Casas_Grandes/"},"tmax":125,"user":{"ext":{},"id":"xutRfr-GoxxU-RAOQ_ehzA"}}
Note the corrupted character in ju�_rez.
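In the hex passed to xxd -r above, that position holds the single byte e3 (6a75e35f72657a), which is not a valid UTF-8 sequence on its own, so a UTF-8 terminal renders it as �. To find such bytes in a longer payload, one option is to list every byte with the high bit set. The helper below is a sketch; the name non_ascii_bytes and the shortened excerpt are assumptions made for illustration:

```shell
# Print the zero-based offset and hex value of every byte >= 0x80 read from stdin.
non_ascii_bytes() {
  od -An -v -tx1 | tr -s ' \n' '\n' |
    awk 'NF { if ($1 >= "80") printf "offset %d: 0x%s\n", n, $1; n++ }'
}

# Excerpt of the keywords field with the raw byte 0xE3 (octal 343).
printf 'septiembr_2015,ju\343_rez' | non_ascii_bytes
```

For this excerpt the helper reports offset 17: 0xe3. The comparison against "80" is a string comparison, which orders two-digit lowercase hex values the same way as a numeric one.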
You can check that this JSON is most likely encoded in ISO-8859-1 by using file, which guesses the encoding from the bytes it reads:
:~$ xxd -p -r <<< 7b2262616476223a5b5d2c2262636174223a5b5d2c22646576696365223a7b22646e74223a302c2267656f223a7b2263697479223a2257494348495441222c22636f756e747279223a22555341222c226c6174223a33372e36383937382c226c6f6e223a2d39372e33343134382c226d6574726f223a22363738222c22726567696f6e223a224b53222c2274797065223a322c227a6970223a223637323132227d2c226970223a2236382e3130372e3138332e323531222c226c616e6775616765223a226573222c226f73223a2257494e444f5753222c226f7376223a2257494e444f575337222c227561223a224d6f7a696c6c612f352e30202857696e646f7773204e5420362e313b20574f5736343b2072763a33392e3029204765636b6f2f32303130303130312046697265666f782f33392e30227d2c226964223a2257706b417854646635546850222c22696d70223a5b7b2262616e6e6572223a7b226261747472223a5b5d2c226274797065223a5b5d2c22657874223a7b7d2c2268223a3235302c22706f73223a332c22746f706672616d65223a302c2277223a3330307d2c22626964666c6f6f72637572223a22555344222c226964223a2257706b417854646635546850222c22696e73746c223a302c22736563757265223a302c227461676964223a22333032383638227d5d2c2273697465223a7b22636174223a5b5d2c22646f6d61696e223a2264696172696f2e6d78222c226964223a22313030303137222c226b6579776f726473223a2264696172692c6872732c796f726b2c6c6f63616c2c74696d2c6e65772c6e6163696f6e616c2c63686968756168752c7061736f2c766964652c7468652c7265666f726d2c64756172742c6369756461642c7365727669632c7075656e742c6e6577732c6d65786963616e2c6d6564692c6573706563746163756c2c796f726b5f74696d2c6e65775f796f726b2c7468655f6e65772c74696d5f6e6577732c7365727669635f32312c6e6577735f7365727669632c7365707469656d62725f323031352c6a75e35f72657a2c64696172695f31382c6173736f6369617465645f70726573732c766963746f725f6f726f7a63222c2270616765223a22687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f323031352d31302d30315f66326337313066632f7265616c697a616e2d656c2d736567756e646f2d636f6d7075746f2d64652d6c61732d63616e646964617461732f222c2270616765636174223a5b224941423230225d2c227075626c6973686572223a7b22646f6d61696e223a2264696172696f2e6d78222c226964223a22353538333933227d2c22726566
223a22687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f227d2c22746d6178223a3132352c2275736572223a7b22657874223a7b7d2c226964223a227875745266722d476f7878552d52414f515f65687a41227d7d | file -i -
/dev/stdin: text/plain; charset=iso-8859-1
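Since file only makes an educated guess, a stricter complementary check is to ask whether the bytes are well-formed UTF-8 at all. The sketch below assumes an iconv implementation that rejects invalid input (as GNU iconv does); the function name is_valid_utf8 is hypothetical:

```shell
# Succeeds only if stdin is well-formed UTF-8.
# Converting to a different encoding (UTF-16) forces iconv to decode every byte.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-16 >/dev/null 2>&1
}

# Raw ISO-8859-1 byte 0xE3 (octal 343): not a valid UTF-8 sequence.
printf 'ju\343_rez' | is_valid_utf8 || echo "not valid UTF-8"

# The same character encoded as UTF-8 (bytes c3 a3, octal 303 243).
printf 'ju\303\243_rez' | is_valid_utf8 && echo "valid UTF-8"
```

A payload that fails this check but is reported as iso-8859-1 by file is a good candidate for the kind of mismatch described here.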
You can read the corrupted character correctly by converting the text from ISO-8859-1 to UTF-8 with iconv:
:~$ xxd -p -r <<< 7b2262616476223a5b5d2c2262636174223a5b5d2c22646576696365223a7b22646e74223a302c2267656f223a7b2263697479223a2257494348495441222c22636f756e747279223a22555341222c226c6174223a33372e36383937382c226c6f6e223a2d39372e33343134382c226d6574726f223a22363738222c22726567696f6e223a224b53222c2274797065223a322c227a6970223a223637323132227d2c226970223a2236382e3130372e3138332e323531222c226c616e6775616765223a226573222c226f73223a2257494e444f5753222c226f7376223a2257494e444f575337222c227561223a224d6f7a696c6c612f352e30202857696e646f7773204e5420362e313b20574f5736343b2072763a33392e3029204765636b6f2f32303130303130312046697265666f782f33392e30227d2c226964223a2257706b417854646635546850222c22696d70223a5b7b2262616e6e6572223a7b226261747472223a5b5d2c226274797065223a5b5d2c22657874223a7b7d2c2268223a3235302c22706f73223a332c22746f706672616d65223a302c2277223a3330307d2c22626964666c6f6f72637572223a22555344222c226964223a2257706b417854646635546850222c22696e73746c223a302c22736563757265223a302c227461676964223a22333032383638227d5d2c2273697465223a7b22636174223a5b5d2c22646f6d61696e223a2264696172696f2e6d78222c226964223a22313030303137222c226b6579776f726473223a2264696172692c6872732c796f726b2c6c6f63616c2c74696d2c6e65772c6e6163696f6e616c2c63686968756168752c7061736f2c766964652c7468652c7265666f726d2c64756172742c6369756461642c7365727669632c7075656e742c6e6577732c6d65786963616e2c6d6564692c6573706563746163756c2c796f726b5f74696d2c6e65775f796f726b2c7468655f6e65772c74696d5f6e6577732c7365727669635f32312c6e6577735f7365727669632c7365707469656d62725f323031352c6a75e35f72657a2c64696172695f31382c6173736f6369617465645f70726573732c766963746f725f6f726f7a63222c2270616765223a22687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f323031352d31302d30315f66326337313066632f7265616c697a616e2d656c2d736567756e646f2d636f6d7075746f2d64652d6c61732d63616e646964617461732f222c2270616765636174223a5b224941423230225d2c227075626c6973686572223a7b22646f6d61696e223a2264696172696f2e6d78222c226964223a22353538333933227d2c22726566
223a22687474703a2f2f64696172696f2e6d782f4e766f5f43617361735f4772616e6465732f227d2c22746d6178223a3132352c2275736572223a7b22657874223a7b7d2c226964223a227875745266722d476f7878552d52414f515f65687a41227d7d | iconv -f iso-8859-1 -t utf-8 -
{"badv":[],"bcat":[],"device":{"dnt":0,"geo":{"city":"WICHITA","country":"USA","lat":37.68978,"lon":-97.34148,"metro":"678","region":"KS","type":2,"zip":"67212"},"ip":"68.107.183.251","language":"es","os":"WINDOWS","osv":"WINDOWS7","ua":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0"},"id":"WpkAxTdf5ThP","imp":[{"banner":{"battr":[],"btype":[],"ext":{},"h":250,"pos":3,"topframe":0,"w":300},"bidfloorcur":"USD","id":"WpkAxTdf5ThP","instl":0,"secure":0,"tagid":"302868"}],"site":{"cat":[],"domain":"diario.mx","id":"100017","keywords":"diari,hrs,york,local,tim,new,nacional,chihuahu,paso,vide,the,reform,duart,ciudad,servic,puent,news,mexican,medi,espectacul,york_tim,new_york,the_new,tim_news,servic_21,news_servic,septiembr_2015,juã_rez,diari_18,associated_press,victor_orozc","page":"http://diario.mx/Nvo_Casas_Grandes/2015-10-01_f2c710fc/realizan-el-segundo-computo-de-las-candidatas/","pagecat":["IAB20"],"publisher":{"domain":"diario.mx","id":"558393"},"ref":"http://diario.mx/Nvo_Casas_Grandes/"},"tmax":125,"user":{"ext":{},"id":"xutRfr-GoxxU-RAOQ_ehzA"}}
Note that the a with tilde in juã_rez is now correctly encoded.
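At the byte level, the conversion replaces the single ISO-8859-1 byte e3 with the two-byte UTF-8 sequence c3 a3. This can be seen on a short excerpt (shortened here for illustration) by dumping the output of iconv:

```shell
# Convert the excerpt from ISO-8859-1 to UTF-8 and dump the resulting bytes.
converted_hex=$(printf 'ju\343_rez' | iconv -f ISO-8859-1 -t UTF-8 | xxd -p)
echo "$converted_hex"
```

The result is 6a75c3a35f72657a: the ASCII bytes are unchanged and only e3 has become c3a3.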
This conversion is possible because UTF-8 (a multi-byte encoding) can encode any Unicode code point, while ISO-8859-1 (a single-byte encoding) covers only the first 256 of them. Transcoding from ISO-8859-1 to UTF-8 is therefore always lossless. Going the other way, from UTF-8 to ISO-8859-1, is lossy: characters outside ISO-8859-1 cannot be represented, so the conversion either fails or has to substitute them. The replacement character (�) seen earlier has a different origin: it appears when ISO-8859-1 bytes such as e3 are decoded as if they were UTF-8.
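The asymmetry between the two directions can be checked with a small sketch. It assumes an iconv that exits with a non-zero status on unconvertible characters (as GNU iconv does); the function name fits_iso_8859_1 is hypothetical:

```shell
# Succeeds only if the UTF-8 text on stdin can be represented in ISO-8859-1.
fits_iso_8859_1() {
  iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1
}

# ã (U+00E3, UTF-8 bytes c3 a3) exists in ISO-8859-1.
printf 'ju\303\243_rez' | fits_iso_8859_1 && echo "ã: representable"

# € (U+20AC, UTF-8 bytes e2 82 ac) does not.
printf '\342\202\254' | fits_iso_8859_1 || echo "€: not representable"
```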
Debugging this kind of issue can be tricky: make sure you know at least the absolute minimum about character encoding.