Strange characters in sub

May 19, 2010

The sub for CSI Miami - 08x23 - Time Bomb.LOL.English.org contains some strange characters at the following time refs.

00:00:42,632 --> 00:00:44,900
00:01:14,597 --> 00:01:16,865

May 19, 2010

¶ ¶?

These are some characters to suggest music :)

May 19, 2010

¶ ¶?

These are some characters to suggest music

No, these are more like a capital T with an extended top bar and 2 lines right next to it.

May 19, 2010

Lol they don't look like that to me )

May 19, 2010

Lol they don't look like that to me )

Ok, I'll bite: what do they look like to you?

BTW, thanks for recommending the VLC program, it's saving me from a lot of headaches caused by the DivX player.

May 20, 2010

¶ ¶ like this. I posted them above I wonder how you see this.

May 20, 2010

¶ ¶ like this. I posted them above I wonder how you see this.

I see 'em as they're displayed here, and as they display in VLC. What I was looking at (the attachment above) was the "raw" (so to speak) character in the sub, so I was thinking it was an "unprintable" (again, so to speak) character, as I hadn't gotten around to watching that episode yet.

Now, having seen the episode, I saw that it was indeed (as you said) what some of you folks use to denote music.

Question: why not use the actual music note symbol?

May 20, 2010

You'd have to ask the guys at the captioning companies. We just sync the scripts.

June 3, 2010

lol

September 13, 2010

You'd have to ask the guys at the captioning companies. We just sync the scripts.

This is an old thread but I was wondering about this myself. Watching the True Blood finale in VLC, I notice the songs all have odd characters. I know they're supposed to be the characters signifying music but they don't appear that way in VLC. I'm guessing if I made a DVD they might be fine. In VLC they look just like what's in the subtitle itself.

<i>â™ª And every shadow â™ª</i>

September 14, 2010

You can replace them with * before watching. The encoding does not always work.

September 15, 2010

You can replace them with * before watching. The encoding does not always work.

(See also here: )

Guys, a little Unicode primer here.

ASCII characters (bytes 0-127) are stored the same way in a file with nearly all commonly used encodings (UTF-16 being the notable exception). The issue begins when you want to store non-ASCII characters: everything put into a file MUST be converted to bytes, and those bytes can mean many things, depending on the encoding used.
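To make that concrete, here is a minimal sketch. The helper name encoded_bytes and the choice of four codecs are mine, picked only for illustration; it prints the byte values for an ASCII-only string and for one non-ASCII character.

```python
def encoded_bytes(text: str, codec: str) -> list:
    """Return the byte values `text` occupies in a file written with `codec`."""
    return list(text.encode(codec))


# Codecs chosen for illustration only; they are not a complete list.
CODECS = ("ascii", "cp1252", "utf-8", "gbk")

# ASCII-only text: every codec above produces the same bytes.
for codec in CODECS:
    print(codec, encoded_bytes("Time Bomb", codec))

# One non-ASCII character: the bytes now depend on the encoding.
print(encoded_bytes("é", "cp1252"))  # [233]
print(encoded_bytes("é", "utf-8"))   # [195, 169]
```

The same character is 1 byte in CP1252 and 2 bytes in UTF8, so a reader that guesses the wrong encoding sees different characters.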

Now, I understand the site stores the files internally as UTF8. That means that:
“é” (Unicode character 233) as UTF8 becomes 2 bytes with values [195, 169], which look like “Ã©” when read as CP1252. When uploading through the web interface, I've seen that there is no way to tell the form that the subtitle is already encoded as UTF8, so if you choose e.g. English as the language, the site interprets the incoming data as CP1252 (Windows Western). In CP1252, every character takes exactly one byte, so the site treats each byte as a separate character: the two bytes “Ã©” are interpreted as two Unicode characters, which are then encoded as UTF8 into “ÃƒÂ©” (4 bytes: [195, 131, 194, 169]) and stored like this. When you download the subtitle and use it, the player understands that this is a UTF8 encoded file, so it decodes UTF8 once and the 4 bytes of my example become 2 Unicode characters: “Ã©”, which is what you see on your screen.
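The round trip described above can be reproduced in a few lines of Python. This is only a model of the behaviour as I described it, not the site's actual code; the function name upload_as_cp1252 is hypothetical.

```python
def upload_as_cp1252(raw: bytes) -> bytes:
    """Hypothetical model of the upload: read bytes as CP1252, store as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


original = "é".encode("utf-8")       # what the uploader's file contains
stored = upload_as_cp1252(original)  # what ends up stored after the upload
shown = stored.decode("utf-8")       # what a player shows after decoding once

print(list(original))  # [195, 169]
print(list(stored))    # [195, 131, 194, 169]
print(len(shown))      # 2
print(shown == "Ã©")   # True
```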

In the example of the character “♪”:
this is the Unicode character 9834, named “EIGHTH NOTE”. Stored as UTF8, it becomes “â™ª” (3 bytes: [226, 153, 170]), and quite possibly the uploader sees the subtitle correctly in their player. If the file were stored exactly like that on the site, everything would be fine; however, the process that I described occurs during the upload, so the site interprets the 3 bytes as CP1252, decodes them into 3 separate Unicode characters (“â™ª”) instead of 1 Unicode character (“♪”) and encodes them as UTF8 into 7 bytes: “Ã¢â„¢Âª” (bytes: [195, 162, 226, 132, 162, 194, 170]). You download the file, the media player understands it's a UTF8 encoded file, so it decodes UTF8 (and it does this decoding ONCE) and the 7 bytes become 3 characters, “â™ª”, which are the ones shown on your screen.
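The same steps for the music note, as a small Python sketch that prints the byte values at each stage so they can be checked against the lists quoted above. The variable names are mine, and the CP1252 step models the upload as I described it rather than the site's real code.

```python
NOTE = "\u266a"  # EIGHTH NOTE, Unicode character 9834

utf8_once = NOTE.encode("utf-8")        # bytes in the uploader's file
misread = utf8_once.decode("cp1252")    # the 3 bytes read as 3 CP1252 characters
utf8_twice = misread.encode("utf-8")    # bytes stored after the upload
displayed = utf8_twice.decode("utf-8")  # the player decodes UTF-8 only once

print(ord(NOTE))            # 9834
print(list(utf8_once))      # [226, 153, 170]
print(len(misread))         # 3
print(list(utf8_twice))     # [195, 162, 226, 132, 162, 194, 170]
print(displayed == "â™ª")   # True
```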

Now, sometimes the input files are not UTF8 encoded, but GBK encoded (a different encoding, used for Chinese text), and this happens often with subtitles acquired from yyets.net (something like that). There things become more troublesome, since during the upload the GBK-encoded bytes are decoded as CP1252 and then encoded and stored as UTF8. All hell breaks loose.
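A rough illustration of the GBK case, using the two-character sample “汉字”. The helper names upload_as_cp1252 and undo_upload are hypothetical, and the sample is deliberately one whose GBK bytes all exist in CP1252. The point is that the damage can only be undone if you know (or guess) that the original bytes were GBK and not UTF8.

```python
def upload_as_cp1252(raw: bytes) -> bytes:
    """Hypothetical model of the upload: read bytes as CP1252, store as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


def undo_upload(stored: bytes, real_codec: str) -> str:
    """Reverse the model above, then decode with the file's real encoding."""
    return stored.decode("utf-8").encode("cp1252").decode(real_codec)


sample = "汉字"
stored = upload_as_cp1252(sample.encode("gbk"))

print(stored.decode("utf-8") == sample)      # False: the player shows mojibake
print(undo_upload(stored, "gbk") == sample)  # True: correct guess recovers the text

try:
    undo_upload(stored, "utf-8")             # wrong guess about the original encoding
except UnicodeDecodeError:
    print("these bytes are not valid UTF-8")
```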

Confusing? Sorry

I have a Python script that fixes these things automatically (95% of the time) and produces a correct UTF8 file; I can make that script available to uploaders and editors, who hopefully do not upload through the web interface.
However, a way to upload raw bytes to the site (without any encoding/decoding process) MUST be created for us lame uploaders, so that these issues can be solved much more easily, or never arise in the first place.
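For anyone who wants to experiment in the meantime, here is a minimal sketch of the general idea. It is not the script mentioned above, just an illustration; the function name and the candidates parameter are made up. It assumes the damage is exactly one CP1252 round trip and that the whole file is damaged the same way.

```python
def undo_cp1252_roundtrip(stored: bytes, candidates=("utf-8",)) -> str:
    """Try to reverse one 'decoded as CP1252, stored as UTF-8' round trip.

    `candidates` lists guesses for the original encoding, tried in order.
    Returns the text unchanged when it does not look damaged.
    """
    text = stored.decode("utf-8")
    try:
        # Map each character back to the single byte it was read from.
        raw = text.encode("cp1252")
    except UnicodeEncodeError:
        # Contains characters CP1252 cannot produce: assume it is not damaged.
        return text
    for codec in candidates:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return text


line = "♪ And every shadow ♪"
damaged = line.encode("utf-8").decode("cp1252").encode("utf-8")
print(undo_cp1252_roundtrip(damaged) == line)  # True
```

Adding "gbk" to candidates covers the Chinese case, but it can also misfire on text that was fine to begin with, which is one reason no automatic fix works every time. Note also that Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so files containing those bytes need extra handling that this sketch does not attempt.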

Posted, in thread order, by: Dawg, honeybunny, Dawg, honeybunny, Dawg, honeybunny, Dawg, honeybunny, Adriano_CSI, egk, honeybunny, tzot