
subtitle coding


sumquodsum


Hi,
I am newly registered because I have a question :D

I sometimes download subtitles from addic7ed, and while watching movies with them in MPC I wondered: why are there weird characters around the lyrics of the song that is playing? (I still wonder which characters they are supposed to be.)

Last time I accidentally downloaded a German subtitle for V Season 1 Episode 99 (I know... they would use the infinity character if it were available) and characters like ü and ö were garbled. After opening it with Notepad I understood the problem: I have to save the file as Unicode in order to display the right characters.

Anyway, this is not a tutorial on how to get the right encoding for your subtitle. The question is: why do the players not support UTF-8 encoding, why are the subtitles not in a Unicode encoding, and why is there no standard?

I am not in the subtitle scene; hopefully someone can tell me the reason, or take action if there is none.

Thank you


This is a bit of debate here. Some people think we should pander to the lowest common denominator, others of us have upgraded our copies of Windows 95 and are embracing the future. :)

As I understand it, MPC uses an ancient, unsupported SRT decoder called VobSub, which doesn't support UTF-8 subs.
You can either upgrade your player to something that uses a sub decoder which is still maintained, like VLC, XBMC or MPlayer (I'm sure there are other suggestions).

Or you can get a tool like Subtitle Edit which understands encoding well.
http://www.nikse.dk/se/

Open the SRT. Change the encoding on the left side to something besides UTF-8, like ANSI (or is it ASCII, I forget), and see if that doesn't fix it.




I posted a small tutorial for this very same problem some time ago:

http://www.sub-talk.net/problems-displaying-downloaded-subs-change-charset-encoding-t-164.html

The musical-note issue occurs because the ♪ character is not part of ANSI,
so replace ♪ with ¶ using Notepad (or other SW) and you're ready to go.

Hope this helps :)
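
For anyone who would rather script that substitution than do it by hand, here is a minimal Python sketch. It assumes that "ANSI" means the Windows-1252 code page (Python's "cp1252" codec); the helper name to_ansi_bytes is made up for this example and does not come from any subtitle tool.

```python
# Hypothetical helper: prepare UTF-8 subtitle text for a player that only
# reads Windows-1252 ("ANSI"). The name and the mapping are illustrative.
NOTE_SUBSTITUTES = {"\u266a": "\u00b6"}  # musical note -> pilcrow, as suggested above


def to_ansi_bytes(text):
    """Swap characters that cp1252 cannot represent, then encode as cp1252."""
    for source, replacement in NOTE_SUBSTITUTES.items():
        text = text.replace(source, replacement)
    # errors="replace" turns any other unencodable character into "?"
    return text.encode("cp1252", errors="replace")


line = "\u266a F\u00fcr immer \u266a"  # a lyric line with two notes and a u-umlaut
ansi_bytes = to_ansi_bytes(line)
print(ansi_bytes)
```

The umlaut survives because ü exists in cp1252; only the note has to be swapped for something else.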

I get this effect with Subtitle Workshop, WMP, and also when burned to a DVD - a strange mixture of a-circumflex, followed by superscripted "TM", followed, in turn, by a super-superscripted "a" (â™ª)

(Interesting - in the above cut 'n paste example, the "a" and the "TM" have reversed formatting, in re the super-superscripting.)
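
The a-circumflex, "TM" and superscript "a" sequence described above is exactly what the three UTF-8 bytes of a musical note look like when they are read as Windows-1252. A short Python sketch reproduces it, assuming the misreading program uses code page 1252 (the variable names are only for illustration):

```python
note = "\u266a"  # the musical note character
utf8_bytes = note.encode("utf-8")       # how a UTF-8 subtitle file stores it
garbled = utf8_bytes.decode("cp1252")   # how an ANSI-only program reads it

print(utf8_bytes.hex())  # e299aa
print(garbled)           # a-circumflex, trade mark sign, feminine ordinal

# The same mechanism garbles German umlauts into two characters each.
umlaut_garbled = "\u00fc".encode("utf-8").decode("cp1252")
print(umlaut_garbled)
```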

It doesn't bother me particularly, because as part of my checking the sync. and spelling, etc., I routinely strip out all extraneous nonsense (such as the words to second-rate pop songs) from subs that I'm going to use; they are irrelevant and themselves a distraction.

However, it is an annoyance for those who actually want, for some strange reason, to see the meaningless drivel, but do not wish to watch TV programmes on a computer - or do not wish to replace the software we are used to with some gimmicky modern doo-dah that looks like a traffic cone! (Even if we got rid of Win95 12 years ago...)

s.

PS: on a point of trivial interest - and certainly pedantry - although universally used, the phrase "lowest common denominator" is erroneous as the statistical/mathematical characteristic usually being referred to is actually the highest common factor.

Important links: Forum Help.



Never look before you leap, it'll spoil the surprise.


I'm not a SW user. But I've worked heavily with the author of SE to make sure his app handles encoding properly and clearly.

If you're finding applications which don't support UTF-8, I suggest you download SE, open the subs, change the encoding to what you want (ANSI or ASCII) and see if it doesn't pick the proper charset to make sure things look great.

http://www.nikse.dk/se/

Get his beta versions, as they're stable and have the latest additions.
Rob


Hi, Rob.

Thank you for your suggestion - but as I said, the strange characters in place of the quaver don't bother me, personally, not least because I tend to strip out all the song references almost by habit, and even those that I miss don't really bother me as much as they seem to offend others.

As for SE - I have v.2.9 on the PC, and I am impressed with it, in spite of the inclusion of Hunspell, which like all spell checkers I have tried, is a poor substitute for the Microsoft spell checker. This negative comparison is largely evened out by the very desirable facility of custom dictionaries - especially the exclude dictionary.

SE is unusable, for me, however, because the blink of the spell panel as it refreshes at each step quickly brings on a migraine. Clear that up, and SE could possibly oust SW with me.

Cheers,
s.


I'll have to see if I can reproduce what you're talking about. It's not my application, but an email explaining what you don't like to Nikolav might get it cleared up quickly.

Is it that the window closes and reopens each time it finds a new mistake?
Rob


"Is it that the window closes and reopens each time it finds a new mistake?"

That's it, exactly!

When a word is queried, pressing "Skip", or "Change", etc, briefly closes the spell check panel, exposing a blank, white portion of the main text window. Effectively, a white flash fills the centre of the monitor screen.

s.


"an email explaining what you don't like to Nikolav might get it cleared up quickly"

It did indeed - within minutes, in fact!
I am now exploring the facilities of the program...

s.


Hi,

Sorry for not answering for a while; I was quite busy.
I think most of you missed the point... or my question was not clear, sorry for that.
The thing is, I am a lazy student like many other people, so I wish there were a standard for subtitle encoding, including how to handle it on different platforms. Then I would only need to download the subtitle and happily watch a movie without changing anything. Maybe that will change in the near future.

@dny238: I use MPC to display the subtitles, not VobSub. I think subtitle rendering is integrated and uses a hardware implementation. It is very nice because the subtitles can be displayed in the unused bottom area of a widescreen movie. It displays Unicode but not UTF-8.
@others who can live with that "bug": the music note was just an example. English is not my native language, so Unicode support is a big topic for me. For someone who doesn't know how to change the encoding, or how to google the solution, it is a big problem too. Believe me, I know such people personally :D

Thank you again for your attention


Yeah, it's a pain. Technically, SRT is the file format; what we're talking about is the encoding.

There are multiple encodings out there in the world for dealing with different languages. Until everyone starts using UTF-8, we'll all suffer.


Actually, Unicode is a lot more "global" than UTF-8 (because it's more similar to UTF-32... or 16, I don't remember exactly).

Maybe it wasn't such a bad idea to store the subs in that language...

And yeah, DirectVobSub CAN (if I remember correctly) display subtitles encoded in Unicode :P


I'm no expert on the subject yet.
The U in UTF is Unicode... My understanding is that UTF-8 is one of the encoding standards for Unicode. Its small size makes it desirable, and its backwards compatibility with ASCII makes it mostly compatible with the majority of apps.

Wikipedia says: "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard), and UTF-16 (which extends UCS-2 to handle code points beyond the scope of UCS-2)."
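
To make the size and ASCII-compatibility points concrete, here is a small Python sketch. The sample strings and the helper name encoded_sizes are chosen for this illustration only; the encodings are the little-endian forms without a BOM so that the counts are not inflated by a byte-order mark.

```python
def encoded_sizes(text):
    """Return how many bytes the text occupies in each Unicode encoding form."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_line = "Hello, world"
# For pure ASCII text the UTF-8 bytes and the ASCII bytes are identical.
same_bytes = ascii_line.encode("utf-8") == ascii_line.encode("ascii")

german_line = "Gr\u00fc\u00dfe aus K\u00f6ln"  # 14 characters, 3 of them non-ASCII
print(same_bytes)
print(encoded_sizes(german_line))
```

For mostly-Latin text like this, UTF-8 comes out smallest because only the three non-ASCII letters need a second byte.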

According to http://unicode.org/faq/utf_bom.html
"is considered one of the three equivalent Unicode Encoding Forms and therefore standard."
UTF-16 is a Java and Windows standard, and UTF-32 is apparently used by Unix.

Considering the majority of foreign language websites are encoded in UTF-8, it seems to do the job in the most compact format possible.

I wonder what Unicode encoding format DirectOldSub supports?
Ideas to allow users to select their preferred format are in the works.
Dny


DirectOldSub? I just tested with DirectVobSub; it supports both Unicode and UTF-8 (saved with Notepad).
Selecting between different formats may cause confusion, I think. It would be better if MPC implemented this support.
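
One way a player can tell such files apart is the byte-order mark (BOM) at the start of the file. The Python sketch below is only an illustration of that idea, not how DirectVobSub or MPC actually work; it assumes the file was saved with a BOM, and the function name sniff_bom is invented for this example.

```python
import codecs

# BOM bytes mapped to a Python codec that consumes the BOM while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, codec_name in BOMS:
        if raw.startswith(bom):
            return codec_name
    return None


utf16_file = "\u00fc".encode("utf-16")    # BOM plus two bytes
utf8_file = "\u00fc".encode("utf-8-sig")  # BOM plus two bytes
print(sniff_bom(utf16_file), sniff_bom(utf8_file), sniff_bom(b"plain ASCII"))
```

A file without a BOM gives None, and then the player has to guess or fall back to a default, which is where the garbling usually starts.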


I don't use those apps; VLC and XBMC do the job for me. As a result, I don't know exactly which part is failing. My understanding is that because of the overlap with ANSI encoding, UTF-8 works until it differs from ANSI, and then you get junk on the screen. Musical notes are a typical shortfall of the DirectVobSub/MPC combo: they require UTF-8 and aren't supported by ANSI.
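
Because the overlap is the ASCII range, the lines that will turn into junk are simply the ones containing non-ASCII characters. A rough Python sketch for listing them follows; the function name lines_at_risk and the sample cue are made up for illustration, and str.isascii() needs Python 3.7 or later.

```python
def lines_at_risk(text):
    """Return (line number, line) pairs for lines containing non-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()
    ]


sample = (
    "1\n"
    "00:00:01,000 --> 00:00:03,000\n"
    "\u266a La la la \u266a\n"
    "\n"
    "2\n"
    "00:00:04,000 --> 00:00:06,000\n"
    "Hello there\n"
)

risky = lines_at_risk(sample)
print([number for number, _ in risky])
```

Only the lyric line is flagged; the cue numbers, timestamps and plain English text are byte-for-byte the same in UTF-8 and ANSI.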


VLC failed for me because it doesn't support hardware acceleration yet.
You have to see UTF-x and Unicode as different coding. It has nothing to do with ANSI. Unicode has variable bit length for characters while UTF-x have a fixed bit length of x for characters.
Are we still talking about the same thing? :D


  • 3 months later...

Guys… I use Linux and work with UTF-8 encoded subtitles. (But even when I used Windows XP, I also used UTF-8 subtitles with MPC and VobSub, I just converted them first to UTF-8). There is an issue when uploading UTF-8 encoded subtitles to the site; for example, when uploading subtitles using Language English, the site thinks that they are "cp1252" encoded, so "…" becomes "â€¦", "♪" becomes "â™ª" etc.
When downloading subtitles from addic7ed, I have a Python script that tries to bring everything back to normal using various tricks; sometimes the file is UTF-8 encoded twice (the UTF-8 bytes were themselves re-encoded as UTF-8, so there are two levels to undo). But I believe the site code has to be updated to allow uploading UTF-8 subtitles without further decoding/encoding, since by default it sends UTF-8 encoded files (at least for me).
See some example code here: http://bpaste.net/show/9480/
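
The linked paste is not reproduced here, so the following is not the poster's script, only a minimal Python sketch of the general repair idea under the assumption that the damage came from reading UTF-8 bytes as cp1252. The function name undo_cp1252_mojibake and the two-round limit are choices made for this example.

```python
def undo_cp1252_mojibake(text, max_rounds=2):
    """Reverse 'UTF-8 bytes read as cp1252' damage, up to max_rounds levels deep."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text is not (or is no longer) this kind of mojibake.
            break
        if repaired == text:
            break  # pure ASCII: nothing to repair
        text = repaired
    return text


once = "\u266a".encode("utf-8").decode("cp1252")   # one level of damage
twice = once.encode("utf-8").decode("cp1252")      # two levels of damage
print(undo_cp1252_mojibake(once) == "\u266a", undo_cp1252_mojibake(twice) == "\u266a")
```

Correctly decoded text such as "Für" is usually left alone, because its cp1252 bytes do not form valid UTF-8 and the attempt stops at the first round. That is a heuristic rather than a guarantee.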


What program created the file? Is the BOM correct?
Can you put the SRT somewhere I can download it? (dropbox.com public folders are easy)
D

The file was created by my Python script; its name is "check_timing.py", but I don't know why you thought that specifying the program would help you. Yes, there is a BOM at the beginning of the file (for Windows users), although the line endings are *nix style (just LF, not CR+LF), because I never met a media player that fails to read the file because it has *nix-style line endings, so why store them with CR+LF on my Linux computer?
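
For reference, writing a file in that shape (UTF-8 with a BOM, LF-only line endings) takes very little Python. This is a generic sketch, not an excerpt from check_timing.py; write_srt and the sample cue are names invented for the example.

```python
import codecs
import os
import tempfile


def write_srt(path, text):
    """Write text as UTF-8 with a BOM, keeping LF line endings on every platform."""
    # "utf-8-sig" prepends the bytes EF BB BF; newline="\n" disables the
    # translation of "\n" to the platform default line separator.
    with open(path, "w", encoding="utf-8-sig", newline="\n") as handle:
        handle.write(text)


cue = "1\n00:00:01,000 --> 00:00:03,000\n\u266a La la la \u266a\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    srt_path = os.path.join(tmp_dir, "example.srt")
    write_srt(srt_path, cue)
    with open(srt_path, "rb") as handle:
        raw = handle.read()

print(raw.startswith(codecs.BOM_UTF8), b"\r" in raw)
```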

As for the dropbox.com folders: there are tons of subtitle sites, why not use one? Get the subtitle from here. (In case there is an issue with links to other subtitle sites, the previous link points to opensubtitles dot org, and the path to the page is en/subtitles/3828375/true-blood-en ).

"VLC failed for me because it doesn't support hardware acceleration yet. You have to see UTF-x and Unicode as different coding. It has nothing to do with ANSI. Unicode has variable bit length for characters while UTF-x have a fixed bit length of x for characters. Are we still talking about the same thing? :D"

Unicode is not a “coding”; UTF-8 is a Unicode encoding, and UTF-16LE (the one Microsoft Windows calls “Unicode”) is another Unicode encoding.

Note that both UTF-8 and UTF-16 (LE stands for Little Endian like all x86 processors) are variable-length encodings; UTF-8 needs 1-4 bytes for each Unicode character (it can go up to 6, but ATM the maximum Unicode codepoint is 1114111 and 4 bytes are enough), while UTF-16 needs either 2 (for all Unicode characters in the BMP, i.e. almost any character we would meet in a subtitle file) or 4 bytes (using surrogate pairs). Perhaps the MS Windows UTF-16LE implementation does not understand surrogate pairs; if that is the case, then yes, Windows UTF-16LE needs a fixed count of bytes for every Unicode character it understands.

UTF-32 is the only definitely fixed-length encoding; each character takes 4 bytes. UTF-32 is used as the internal Unicode representation in the libc of many *nix (Linux/Unix/OS-X) operating systems.
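
These byte counts are easy to check in Python. The sketch below uses four sample characters picked for this illustration (the G clef, U+1D11E, stands in for a character outside the BMP) and the BOM-less little-endian codecs, so each number is the size of the character alone.

```python
def bytes_per_encoding(character):
    """Return the encoded length of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(character.encode("utf-8")),
        len(character.encode("utf-16-le")),
        len(character.encode("utf-32-le")),
    )


samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER U WITH DIAERESIS": "\u00fc",
    "EIGHTH NOTE": "\u266a",
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",  # outside the BMP: surrogate pair in UTF-16
}

for name, character in samples.items():
    print(name, bytes_per_encoding(character))

highest = chr(0x10FFFF)  # code point 1114111, the current maximum
print(len(highest.encode("utf-8")))
```

Only the UTF-32 column stays at 4 for every character; the UTF-16 column jumps from 2 to 4 as soon as a character needs a surrogate pair.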