Japanese SRT files garbled, can't determine encoding to fix with iconv




I have an SRT file. Here is an excerpt:

2
00:00:36,208 --> 00:00:39,667
Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!

3
00:00:57,917 --> 00:01:00,917
Ãéáôß ôñÝ÷åéò, ÃïõÜéíôæåëóôéí; Óïõ ðÞñá äþñï ãåíåèëßùí.

4
00:01:00,958 --> 00:01:03,208
Äåí ðåéñÜæåé, äåí ÷ñåéáæüôáí íá ìïõ ðÜñåéò êÜôé.

5
00:01:03,250 --> 00:01:06,375
Óïõ ðÞñá ëßãï êïñìü äÝíôñïõ. Êáé èá ôï öáò.

6
00:01:06,417 --> 00:01:08,875
Ùñáßá. ¸ôóé êé áëëéþò èá Ýôñùãá êïñìü.

7
00:01:08,917 --> 00:01:10,208
Äåí èá Ýôñùãåò.

8
00:01:10,208 --> 00:01:11,000
Íáé. ÂëÝðåéò...

9
00:01:11,000 --> 00:01:12,417
...üëá ôá ðñÜãìáôá ðïõ Þèåëåò íá ìïõ êÜíåéò...

10
00:01:12,417 --> 00:01:13,958
...ó÷åäßáæá íá ôá êÜíù ìüíïò ìïõ.

These are supposedly Japanese subtitles, garbled by an encoding issue. I'm trying to figure out how to set them right and ultimately convert them to UTF-8. Does anyone have any ideas?

Output of the file command: UTF-8 Unicode (with BOM) text, with CRLF line terminators

The file can be obtained here for testing: http://www.opensubtitles.org/en/subtitles/5040215/the-incredible-burt-wonderstone-ja

What you have is a document that was transcoded from the ISO-8859-1 character set to the UTF-8 encoding scheme, although the source document was actually encoded in the ISO-8859-7 character set. After the transcoding to UTF-8, a U+FEFF byte order mark (BOM) was added, along with a few quotation marks (U+201C, U+201D).

The language is Greek, and the 2nd subtitle sequence, when corrected, is:

2
00:00:36,208 --> 00:00:39,667
Θα σε σκοτώσω, Γουάιντζελστιν!

The English translation is "I'll kill you, Gouaintzelstin!".
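To see how this kind of damage arises, the mistake can be replayed forwards on the corrected line. The following is only a minimal sketch: it assumes an iconv that knows both ISO-8859 charsets, and garble_greek is a made-up helper name for the illustration.

```shell
#!/bin/sh
# garble_greek is a hypothetical helper name used only for this illustration.
# It reads UTF-8 text on stdin and reproduces the damage described above.
garble_greek() {
    # 1. Store the text as ISO-8859-7 bytes, as the source file was.
    # 2. Misread those bytes as ISO-8859-1 and write them out as UTF-8.
    iconv -f UTF-8 -t ISO-8859-7 | iconv -f ISO-8859-1 -t UTF-8
}

printf '%s\n' 'Θα σε σκοτώσω' | garble_greek
# prints: Èá óå óêïôþóù
```

The result matches the start of the 2nd subtitle in the garbled excerpt, which supports the ISO-8859-7 read as ISO-8859-1 diagnosis.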

To reverse/correct it:

1. Decode the document using the UTF-8 encoding scheme.
2. Remove code points greater than U+00FF.
3. Encode the document using the ISO-8859-1 encoding.
4. Transcode the document from the ISO-8859-7 encoding to the UTF-8 encoding scheme.

An implementation of the above in Perl:

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw[];

(@ARGV == 1 && -f $ARGV[0])
  or die qq[Usage: $0 <file>];

my $file = shift @ARGV;

my ($octets, $string);

# Read octets from file
$octets = do {
    open my $fh, '<:raw', $file
      or die qq[Could not open '$file' for reading: '$!'];
    local $/;
    <$fh>
};

# Decode octets using the UTF-8 encoding scheme
$string = Encode::decode('UTF-8', $octets, Encode::FB_CROAK);

# Remove code points greater than U+00FF
$string =~ s/[^\x00-\xFF]//g;

# Encode string using the ISO-8859-1 encoding
$octets = Encode::encode('ISO-8859-1', $string);

# Decode octets using the ISO-8859-7 encoding
$string = Encode::decode('ISO-8859-7', $octets);

# Encode string using the UTF-8 encoding
$octets = Encode::encode('UTF-8', $string);

# Output octets on standard output
print $octets;
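Since the question asks about iconv, roughly the same steps can also be sketched as a shell pipeline. This is an untested-on-the-real-file sketch, not a drop-in replacement for the Perl script: it assumes an iconv that supports the -c option, and fix_greek_srt, garbled.srt and fixed.srt are placeholder names.

```shell
#!/bin/sh
# fix_greek_srt is a hypothetical helper name, not part of the original answer.
# It reads the garbled UTF-8 text on stdin and writes Greek UTF-8 to stdout.
fix_greek_srt() {
    # First iconv: -c discards characters that ISO-8859-1 cannot represent,
    # which here means the U+FEFF BOM and the U+201C/U+201D quotation marks.
    # Second iconv: reinterpret the recovered bytes as ISO-8859-7, emit UTF-8.
    iconv -c -f UTF-8 -t ISO-8859-1 | iconv -f ISO-8859-7 -t UTF-8
}

# Usage (placeholder file names):
#   fix_greek_srt < garbled.srt > fixed.srt
```

Unlike the Perl version, the second iconv is likely to stop with an error if it meets a byte that ISO-8859-7 leaves undefined, so the Perl script is probably the more forgiving option for a messy file.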

utf-8 character-encoding iconv mojibake srt
