UTF-8 - Japanese SRT files garbled, can't determine encoding to fix with iconv

I have an SRT file. Here is an excerpt:

2
00:00:36,208 --> 00:00:39,667
Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!

3
00:00:57,917 --> 00:01:00,917
Ãéáôß ôñÝ÷åéò, ÃïõÜéíôæåëóôéí; Óïõ ðÞñá äþñï ãåíåèëßùí.

4
00:01:00,958 --> 00:01:03,208
Äåí ðåéñÜæåé, äåí ÷ñåéáæüôáí íá ìïõ ðÜñåéò êÜôé.

5
00:01:03,250 --> 00:01:06,375
Óïõ ðÞñá ëßãï êïñìü äÝíôñïõ. Êáé èá ôï öáò.

6
00:01:06,417 --> 00:01:08,875
Ùñáßá. ¸ôóé êé áëëéþò èá Ýôñùãá êïñìü.

7
00:01:08,917 --> 00:01:10,208
Äåí èá Ýôñùãåò.

8
00:01:10,208 --> 00:01:11,000
Íáé. ÂëÝðåéò...

9
00:01:11,000 --> 00:01:12,417
...üëá ôá ðñÜãìáôá ðïõ Þèåëåò íá ìïõ êÜíåéò...

10
00:01:12,417 --> 00:01:13,958
...ó÷åäßáæá íá ôá êÜíù ìüíïò ìïõ.

These are supposedly Japanese subtitles, garbled by an encoding issue. I'm trying to figure out how to put them right and ultimately convert them to UTF-8. Does anyone have any ideas?

The output of file is: UTF-8 Unicode (with BOM) text, with CRLF line terminators

The file can be obtained here for testing: http://www.opensubtitles.org/en/subtitles/5040215/the-incredible-burt-wonderstone-ja

What you have is a document that has been transcoded from the ISO-8859-1 character set to the UTF-8 encoding scheme, although the document source was actually coded in the ISO-8859-7 character set. After the transcoding to UTF-8, a U+FEFF byte order mark (BOM) and a few quotation marks (U+201C, U+201D) have been added.
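To make that diagnosis concrete, here is a small sketch that reproduces the damage on purpose. It is written in Python purely for illustration (my choice, not something from the question), and the variable names are hypothetical. It takes the Greek text of the corrected subtitle, stores it as ISO-8859-7 bytes, misreads those bytes as ISO-8859-1, and saves the result as UTF-8.

```python
# Greek text as it was presumably authored (from the corrected subtitle).
greek = "Θα σε σκοτώσω"

# Step 1: the source file stores the text as ISO-8859-7 bytes.
raw = greek.encode("iso8859_7")

# Step 2: some tool wrongly assumes those bytes are ISO-8859-1.
garbled = raw.decode("iso8859_1")

# Step 3: the misread text is saved as UTF-8, which locks in the damage.
saved = garbled.encode("utf-8")

print(garbled)  # prints: Èá óå óêïôþóù
```

The printed string matches the start of the second subtitle in the excerpt, which is what suggests the ISO-8859-7 / ISO-8859-1 mix-up.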

The language is Greek, and the 2nd subtitle sequence, when corrected, is:

2
00:00:36,208 --> 00:00:39,667
Θα σε σκοτώσω, Γουάιντζελστιν!

The English translation is "I'll kill you, Gouaintzelstin!".

To reverse/correct it:

1. Decode the document using the UTF-8 encoding scheme.
2. Remove code points greater than U+00FF.
3. Encode the document using the ISO-8859-1 encoding.
4. Transcode the document from the ISO-8859-7 encoding to the UTF-8 encoding scheme.

An implementation of the above in Perl:

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw[];

(@ARGV == 1 && -f $ARGV[0])
  or die qq[Usage: $0 <file>];

my $file = shift @ARGV;

my ($octets, $string);

# Read octets from file
$octets = do {
    open my $fh, '<:raw', $file
      or die qq[Could not open '$file' for reading: '$!'];
    local $/; <$fh>
};

# Decode octets using the UTF-8 encoding scheme
$string = Encode::decode('UTF-8', $octets, Encode::FB_CROAK);

# Remove code points greater than U+00FF
$string =~ s/[^\x00-\xFF]//g;

# Encode string using the ISO-8859-1 encoding
$octets = Encode::encode('ISO-8859-1', $string);

# Decode octets using the ISO-8859-7 encoding
$string = Encode::decode('ISO-8859-7', $octets);

# Encode string using the UTF-8 encoding
$octets = Encode::encode('UTF-8', $string);

# Output octets on standard output
print $octets;
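If Perl is not at hand, the same four steps can be sketched in Python. This is only an illustrative equivalent under the same assumptions (ISO-8859-7 source misread as ISO-8859-1). The function name `repair_greek_mojibake` is hypothetical, and `errors="replace"` is my choice so that byte values undefined in ISO-8859-7 become U+FFFD instead of raising an error.

```python
def repair_greek_mojibake(octets: bytes) -> bytes:
    """Undo an ISO-8859-7 text that was misread as ISO-8859-1 and saved as UTF-8."""
    # 1. Decode the damaged document as UTF-8 (raises on invalid input).
    text = octets.decode("utf-8")
    # 2. Drop code points above U+00FF (the BOM and the curly quotes).
    text = "".join(ch for ch in text if ord(ch) <= 0xFF)
    # 3. Re-encode as ISO-8859-1 to get the original byte values back.
    raw = text.encode("iso8859_1")
    # 4. Read those bytes as ISO-8859-7 and emit UTF-8.
    return raw.decode("iso8859_7", errors="replace").encode("utf-8")


# Usage on the second subtitle from the excerpt, with a leading BOM.
damaged = "\ufeffÈá óå óêïôþóù, ÃïõÜéíôæåëóôéí!".encode("utf-8")
print(repair_greek_mojibake(damaged).decode("utf-8"))
# prints: Θα σε σκοτώσω, Γουάιντζελστιν!
```

For a whole file, read it in binary mode, pass the bytes through the function, and write the result back in binary mode so that the CRLF line terminators are left untouched.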

