
Subtitle Edit (open source subtitle editor)


Nikse



Is there any way to improve the word/OCR detection? I want to use "OCR via image compare", because with the other
two options there are way too many OCR errors afterwards in German subtitles...


Hi extreme!

You really should use Tesseract. If you can make it work and do a bit of "training", I really think it will be the best OCR tool available today!
1. First download the latest German Hunspell dictionary (Spell check -> Get dictionaries).
2. Download the German Tesseract 3 language file: http://tesseract-ocr.googlecode.com/files/deu.traineddata.gz and unpack it to Tesseract\tessdata.
3. Download the attachment to this post, "deu_OCRFixReplaceList.xml", and save it in the "Dictionaries" folder.
4. Run Tesseract on some sub/idx files. When you click "Change all" in the OCR spell check dialog, the correction is saved in the "deu_OCRFixReplaceList.xml" file.
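
To illustrate the idea behind step 4, here is a minimal Python sketch of applying a whole-word replace list to OCR output. It is only an illustration: the entries in `REPLACEMENTS` and the function name `fix_ocr_words` are hypothetical, and it does not read or reproduce the actual format of "deu_OCRFixReplaceList.xml".

```python
import re

# Hypothetical whole-word corrections (typical l/I and ii/ü confusions).
# These entries are illustrative and are not taken from the attached file.
REPLACEMENTS = {
    "lch": "Ich",
    "lhr": "Ihr",
    "fiir": "für",
}


def fix_ocr_words(line, replacements):
    """Replace every whole word that has an entry in the replacement list."""

    def swap(match):
        word = match.group(0)
        # Words without an entry are returned unchanged.
        return replacements.get(word, word)

    # \w+ matches complete words only, so "lch" inside "Milch" is not touched.
    return re.sub(r"\w+", swap, line)


print(fix_ocr_words("lch bin fiir dich da.", REPLACEMENTS))
```

Each "Change all" click would then correspond to adding one more entry to such a list, so the same misreading is fixed automatically the next time.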

If you could email me one or two German sub/idx files I can perhaps fix the most common errors!

deu_OCRFixReplaceList.xml


Hi. Thanks a lot for the tip. I think it's actually working better now,
but I still need to test and train it some more I guess to get even better results.
I've sent you an e-mail with two German sub/idx files. Hope you can use it. :)


I have to tell you something (OH, NO, not again, nikse just said :) )

I noticed something while trying to edit some lines.

2m7ferq.jpg

and

34dkkjq.jpg


If you take a close look, the right-click menu seems to differ between the two, and I need the italic, bold and similar options in the field where I edit the line.
(I really hope you understand what I want to say :D ).

doc

103iruh.gif

 

The greatest pleasure in life is doing what people say you cannot do!

 

 

IMPORTANT LINK:

 



Hi arnozet!

I'm sorry, but I don't know what causes this error :(
NHunspell is a third party component which is used for spell checking.


Also, Subtitle Edit 3.0 final is out: http://code.google.com/p/subtitleedit/downloads/list


Great!! I'll download it.
Thanks!!


If you take a close look, the right-click menu seems to differ between the two, and I need the italic, bold and similar options in the field where I edit the line.
(I really hope you understand what I want to say :D ).


Hi doc!

I rarely understand anything at all, so you have high expectations :)

I've added a custom context menu to the text field (the other one was auto-generated by Windows). It's exe only: http://www.nikse.dk/se/SubtitleEdit.rar (it also has a new "Auto-backup" feature in settings).


And where did all that snow come from!?

Hey. I don't know if it's a bug or if I'm doing something wrong. When I change case and review the words that will be capitalized, I can uncheck a word so that it will not be capitalized. But in the lower section, where the text is, unchecking one line with a word unchecks all lines with that word, not only that line. Yet "agent", for example, may need to be capitalized in one line and not in another.





"Change casing - names" contains two list views. The upper list view contains names found, and the lower list view contains lines where these names are used.
If you un-check a name in the upper list view, then lines with this name will no longer appear in the lower list view (unless they contain another name) and will not have their casing changed.
So to un-check some lines only, use the lower list view check-boxes.

Perhaps it would be better to run "Change casing" separately for troublesome names, so lines with multiple names will still be correct.
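
The relationship between the two list views can be sketched roughly as follows in Python. This is only an illustration of the described behaviour, not Subtitle Edit's actual code; the function `change_name_casing` and its parameters are hypothetical.

```python
import re


def change_name_casing(lines, names, unchecked_names=(), unchecked_lines=()):
    """Apply name casing while honouring both levels of un-checking.

    names           -- correctly cased names, e.g. ["Agent", "Smith"]
    unchecked_names -- names un-checked in the upper list (skipped everywhere)
    unchecked_lines -- line indexes un-checked in the lower list (left untouched)
    """
    active = [name for name in names if name not in unchecked_names]
    result = []
    for index, line in enumerate(lines):
        if index not in unchecked_lines:
            for name in active:
                # Whole-word, case-insensitive match replaced by the cased name.
                pattern = rf"\b{re.escape(name)}\b"
                line = re.sub(pattern, lambda m, n=name: n, line, flags=re.IGNORECASE)
        result.append(line)
    return result


subtitle_lines = ["agent smith is here.", "my agent called."]
# Un-check only the second line: "agent" stays lowercase there.
print(change_name_casing(subtitle_lines, ["Agent", "Smith"], unchecked_lines={1}))
```

Un-checking a name affects every line, while un-checking a line affects only that line, which is why the lower check-boxes are the ones to use for a word like "agent".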


I hope I understood what you meant, and vice versa ;)




Never mind, it seems to be solved. It used to be different. [If I unchecked just one of the lines in the lower section, all the lines containing the word found in the line I unchecked were automatically unchecked.]

Thanks :)



[...]
I rarely understand anything at all, so you have high expectations :)
[...]
And where did all that snow come from!?



You rarely understand, I can't make myself understood, what a team :D

Which snow?
:)
doc.


  • 4 weeks later...

You rarely understand, I can't make myself understood, what a team :D

Unbeatable ;)


Which snow?
:)

I don't remember the last time I saw so much snow here in Denmark! But at least it's not so dark here...



I was working on a file when suddenly got an error...

Might work better in the 3.1 beta 1 update (with a newer version of NHunspell)


SE 3.1 beta 1 update: http://www.nikse.dk/se/SE31Beta1Update.rar
Change log:
* NEW:
* Collaboration via the internet ("Networking", also has chat)
* Auto-backup (never, every minute, every 5th minute, or every 15th minute)
* Ability to remember the last selected line when re-opening subtitles
* Support for the subtitle format "Quicktime text" (two variations)
* Added "Chars/sec" info to textbox in main window
* Options to choose font color and background color (for list view/text-boxes)
* Can now import VobSub subtitles embedded in Matroska (.mkv) files.
* IMPROVED:
* Context menu for subtitle textbox now has italic, bold, underline, font name, and color
* Updated NHunspell (spell check component) to latest version (0.9.6)
* Synchronization Show earlier/later changed a bit, also added short cut (Ctrl+Shift+A)
* Main window: Video player will now automatically move up beside subtitle if waveform is displayed + some re-sizing of controls allowed
* FIXED:
* OCR Fix Engine: Lines after "..." will no longer be changed to start with uppercase letter
* Fixed missing line break in Sony DVD Architect (w line numbers) - thx Rosa
* Fixed a minor bug in initialization of waveform - thx Frederic!
* Fixed a minor bug in Visual Sync, if end scene was after video length


Oh, and merry Christmas!

  • 4 weeks later...

I think someone missed me. :D

First things first.
1.
SE has some very useful features, e.g. "Insert new subtitle at the video position".

After marveling at how many things SE knows how to do, I started to think about how it does them, especially that feature.
I thought it might allocate the display time for the line by counting the characters in it.
I could be wrong, but that would be the logical way to do it.

Good ....

Now, as the lazy subtitle synchronizer that I am, I encountered a problem.

In many transcripts I come across lines that need to be split to meet certain requirements, such as maximum length (maximum number of characters per line).
The program does that, and does it very well: it takes the line and splits it, giving equal time to both resulting lines, while respecting punctuation marks and other program settings.

BUT ...

Special Case:
imagine a line like:

-How to split? - Splitting long line is made considering punctuation,
line is split in two, and displaying time is properly allocated.

Using the split option in the program, the result will be something like:

-How to split? Splitting long line is made considering punctuation,

line is split in two, and displayed time is properly allocated.

or (this is the most common case)

-How to split?

-Splitting long line is made considering punctuation,
line is split in two, and displaying time is properly allocated.

and the time is split equally between the two resulting lines.

Here's the problem!
The first line needs less time to display, right?

So, when I use the split option, is it possible for the output lines to get display times proportional to the number of characters they contain?
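
The request can be sketched as a small calculation. The Python snippet below is only an illustration of splitting a time range in proportion to character counts; the function `split_duration` is hypothetical, it counts every character including spaces (an assumption), and it says nothing about how Subtitle Edit actually implements splitting.

```python
def split_duration(start_ms, end_ms, first_text, second_text):
    """Split [start_ms, end_ms] in proportion to each part's character count.

    Returns two (start, end) tuples in milliseconds.
    """
    first_len = len(first_text)
    second_len = len(second_text)
    total = first_len + second_len
    duration = end_ms - start_ms
    if total == 0:
        # Nothing to weigh by, so fall back to an equal split.
        middle = start_ms + duration // 2
    else:
        middle = start_ms + round(duration * first_len / total)
    return (start_ms, middle), (middle, end_ms)


first = "-How to split?"
second = ("-Splitting long line is made considering punctuation, "
          "line is split in two, and displaying time is properly allocated.")
print(split_duration(0, 6000, first, second))
```

With a short first part and a long second part, the first subtitle gets only a small share of the original duration instead of half of it. A real implementation would probably also enforce a minimum display time and a small gap between the two subtitles.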

2.
In so many cases, transcripts are full of "Awe", "Oh", "Whew", "Pfui" and so many other very representative words.

Is it possible to add those words to the "Remove text for HI" option?
(I guess we need to talk more about this.)

doc.



I think someone missed me. :D

Always :)
Especially now that you are the only one from Stargate not cancelled...


1) Now, as the lazy subtitle synchronizer that I am, I encountered a problem.

Yes, lazy ppl use good software ;)
In SE 3.1 beta 9 the splitting should work better: http://www.nikse.dk/se/se31beta9.zip

Other SE news:
- Use MS Word for spell check (EDIT: It's an option)
- Edit original sub at the same time as new translation (EDIT: It's an option)

2) I will look at it later.


Yes, lazy ppl use good software ;)


That's for sure, without any doubt!

Thank you Nik, and keep up the good work.

doc.


In so many cases, transcripts are full of "Awe", "Oh", "Whew", "Pfui" and so many other very representative words.
Is it possible to add those words to the "Remove text for HI" option?
(I guess we need to talk more about this.)

These kinds of words are not considered HI annotations, IMHO.

It's not the years in the life, but the life in the years dQVXh.gif



These kinds of words are not considered HI annotations, IMHO.


He, what would you call it then?

Just found a couple of similar examples:
W-wait!
l-impossible!
Aah! Here it comes!


These kinds of words are not considered HI annotations, IMHO.



He, what would you call it then?
[...]


Seems Nikse got my point.
In a non-HI subtitle version, those types of interjections are useless.

Especially since, for a translator, these interjections are nothing more than things that give him big headaches, like:

What to do?
Keep a line with Ahh!, Ohh, Uhh!, Ehh! and translate it, :D or remove it?

Which is easier?
To remove them line by line or word by word, or to use a small interjection dictionary and remove them all with one click?


Ohh! :D We can discuss where to put that option in SE, that's true!
But TOOLS-Remove text for HI seems to be the most appropriate place.
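
As a rough illustration of the "small interjection dictionary" idea, here is a minimal Python sketch. The word list and the function `remove_interjections` are hypothetical and only show the principle; this is not how Subtitle Edit implements (or would implement) the feature.

```python
import re

# Illustrative word list; a real dictionary would be user-editable.
INTERJECTIONS = ["Ahh", "Ohh", "Uhh", "Ehh", "Oh", "Whew"]


def remove_interjections(text, interjections=INTERJECTIONS):
    """Remove listed interjections (whole words) plus trailing punctuation."""
    # Longest words first, so "Ohh" is tried before "Oh".
    words = sorted(interjections, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in words)
    # A whole-word interjection, any punctuation after it, then spaces.
    pattern = re.compile(rf"\b(?:{alternatives})\b[!?.,]*\s*", re.IGNORECASE)
    return pattern.sub("", text).strip()


print(remove_interjections("Ohh! What to do?"))
print(repr(remove_interjections("Whew!")))
```

A line that becomes empty after removal (such as the second example) would then be a candidate for deleting the whole subtitle, ideally after the user has confirmed it.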


doc.

P.S. That split option, right now, it is just...PERFECT!



In a non HI subtitle version, those types of interjections are useless.

OK, so they are called "Interjections" - English is obviously not my native language ;)

Should Subtitle Edit have this feature in "Remove text for Hearing Impaired" or in "Fix common errors" ?


P.S. That split option, right now, it is just...PERFECT!

Super :)


Should Subtitle Edit have this feature in "Remove text for Hearing Impaired" or in "Fix common errors" ?


I don't think those interjections are errors. :)
What's more, Remove... has a feature that allows the user to choose what to do, right?
doc.


I was referring to the sounds/words that a character uses to verbally express various feelings (surprise, disgust, happiness, pain, etc).
Since the actor is actually saying them out loud, they shouldn't be considered part of the HI text.

Anyway, doc is right, they could be added to the list for "Remove text for hearing impaired" and, if necessary, the corrector can choose to keep them in the text, since there's already that option ;)

