I want to extract a certain type of text from images of ID cards:
As you can see, they have varying lighting and sharpness conditions. The ultimate goal is to recognize the black text. Where it is well separated, I've managed to do that reliably with Tesseract OCR (this is the VIE language, by the way, in case you'd like to try it yourself with Tesseract). However, in the examples above the black text overlaps the blue text, which confuses Tesseract. So my current goal is to cleanly remove the blue text without heavily distorting the blurry black pixels, so that Tesseract still works.
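
For reference, this is roughly the baseline call that already works on the well-separated text (a minimal sketch; the file name and `--psm` setting are just placeholders, assuming OpenCV and pytesseract are installed along with the `vie` traineddata):

```python
import cv2
import pytesseract

# Baseline: read a card crop and run Tesseract with the Vietnamese model.
img = cv2.imread("id_card.png")
text = pytesseract.image_to_string(img, lang="vie", config="--psm 6")
print(text)
```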
What are the most robust ways to do this? (Code examples in Python would be appreciated if possible.)
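
To make the question concrete, here is a rough sketch of the kind of preprocessing I have in mind (not working code from my project): mask bluish pixels in HSV space, inpaint the masked area so overlapping black strokes aren't simply blanked out, then binarize for Tesseract. The hue/saturation bounds and threshold parameters below are guesses that would need tuning:

```python
import cv2
import numpy as np
import pytesseract

img = cv2.imread("id_card.png")

# Mask bluish pixels in HSV space; these hue/saturation bounds are rough
# guesses and would need tuning per camera and lighting condition.
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
blue_mask = cv2.inRange(hsv, np.array([90, 60, 60]), np.array([140, 255, 255]))

# Dilate slightly so the faint halo around the blue strokes is covered too.
blue_mask = cv2.dilate(blue_mask, np.ones((3, 3), np.uint8), iterations=1)

# Inpaint the masked pixels instead of blanking them, so black strokes that
# pass through the blue text are interpolated rather than erased.
cleaned = cv2.inpaint(img, blue_mask, 3, cv2.INPAINT_TELEA)

# Binarize and hand the result to Tesseract with the Vietnamese model.
gray = cv2.cvtColor(cleaned, cv2.COLOR_BGR2GRAY)
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 15)
print(pytesseract.image_to_string(binary, lang="vie", config="--psm 6"))
```

My concern is that inpainting over the overlap regions may smear the already-blurry black strokes, which is exactly what I'm trying to avoid, so more robust alternatives would be very welcome.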




