
I am interested in detecting lines (which I managed to figure out using hough transform) and the text above it.

My test image is below: Test Image

The code I have written is below. (I have edited it so that I can loop through the coordinates of each line.)

import cv2
import numpy as np

img = cv2.imread('test3.jpg')
#img = cv2.resize(img, (500, 500))
imgGray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
imgEdges = cv2.Canny(imgGray, 100, 250)
imgLines = cv2.HoughLinesP(imgEdges, 1, np.pi/180, 230, minLineLength=700, maxLineGap=100)
imgLinesList = list(imgLines)

a, b, c = imgLines.shape
line_coords_list = []
for i in range(a):
    line_coords_list.append([(int(imgLines[i][0][0]), int(imgLines[i][0][1])),
                             (int(imgLines[i][0][2]), int(imgLines[i][0][3]))])

print(line_coords_list)
# [[(85, 523), (964, 523)], [(85, 115), (964, 115)], [(85, 360), (964, 360)],
#  [(85, 441), (964, 441)], [(85, 278), (964, 278)], [(85, 197), (964, 197)]]

roi = img[int(line_coords_list[0][0][1]):int(line_coords_list[0][1][1]),
          int(line_coords_list[0][0][0]):int(line_coords_list[0][1][0])]
print(roi)  # why does this print an empty list?
cv2.imshow('Roi NEW', roi)
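A side note on the empty ROI above: for a horizontal line, both endpoints returned by HoughLinesP share the same y coordinate, so the row slice collapses to zero height. A minimal, self-contained demonstration with a stand-in array (the coordinates mirror the printed output above; the real crop would use the actual page image):

```python
import numpy as np

# stand-in for the loaded page image
img = np.zeros((600, 1000, 3), dtype=np.uint8)

# horizontal line: both endpoints have y = 523, so rows 523:523 is empty
roi_empty = img[523:523, 85:964]
print(roi_empty.shape)  # (0, 879, 3) -- zero rows, hence the "empty" print

# to grab the strip *above* the line, slice from the previous line's y down to this one
roi_above = img[441:523, 85:964]
print(roi_above.shape)  # (82, 879, 3)
```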




Now I just don't know how to detect the region of interest above those lines. Is it possible to, say, crop out each line and produce images roi_1, roi_2, ..., roi_n, where each roi is the text above the first line, the text above the second line, and so on?

I would like the output to be something like this.

4 Comments
  • Apply morphology to a thresholded image and get the contours. Use the contours to extract each line of text. If long lines remain from the dotted lines on the page, then filter the contours by width or by height. See for example, stackoverflow.com/questions/61198983/… Commented Apr 14, 2020 at 3:43
  • @fmw42 - Thanks for that, however it detects all text. How do I go about detecting only the text above the dotted lines? Commented Apr 14, 2020 at 4:08
  • Discard the top line of text. Commented Apr 14, 2020 at 5:23
  • Yes, I just need the text above the lines. Also, how do I go about filtering the contours by width or height? I know how to find contours and filter by length. Commented Apr 14, 2020 at 6:11

2 Answers


Here is one way to do that in Python/OpenCV.

  • Read the input
  • Convert to gray
  • Threshold (OTSU) so that text is white on black background
  • Apply morphology dilate with horizontal kernel to blur text in a line together
  • Apply morphology open with a vertical kernel to remove the thin lines from the dotted lines
  • Get the contours
  • Find the contour that has the lowest Y bounding box value (top-most box)
  • Draw all the bounding boxes on the input except for the topmost one
  • Save results

Input:


import cv2
import numpy as np

# load image
img = cv2.imread("text_above_lines.jpg")

# convert to gray
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# threshold the grayscale image
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# use morphology dilate with a horizontal kernel to blur the text in each line together
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (151, 3))
morph = cv2.morphologyEx(thresh, cv2.MORPH_DILATE, kernel)

# use morphology open to remove thin lines from dotted lines
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 17))
morph = cv2.morphologyEx(morph, cv2.MORPH_OPEN, kernel)

# find contours
cntrs = cv2.findContours(morph, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cntrs = cntrs[0] if len(cntrs) == 2 else cntrs[1]

# find the topmost box
ythresh = 1000000
for c in cntrs:
    box = cv2.boundingRect(c)
    x,y,w,h = box
    if y < ythresh:
        topbox = box
        ythresh = y

# draw bounding boxes on the input, excluding the topmost box
result = img.copy()
for c in cntrs:
    box = cv2.boundingRect(c)
    if box != topbox:
        x,y,w,h = box
        cv2.rectangle(result, (x, y), (x+w, y+h), (0, 0, 255), 2)

# write result to disk
cv2.imwrite("text_above_lines_threshold.png", thresh)
cv2.imwrite("text_above_lines_morph.png", morph)
cv2.imwrite("text_above_lines_lines.jpg", result)

#cv2.imshow("GRAY", gray)
cv2.imshow("THRESH", thresh)
cv2.imshow("MORPH", morph)
cv2.imshow("RESULT", result)
cv2.waitKey(0)
cv2.destroyAllWindows()


Thresholded image:


Morphology image:


Result:



3 Comments

Thanks a lot fmw42. Quick question: I am having trouble understanding which code or parameters control the text above the line versus the text without a line. How does it know that "What planet do we live in?" does not have a dotted line?
I simply discarded the line closest to the top. I assumed each page had a question at the top, so the first line would always be the question.
Ahh, that explains why it fails when I added two questions on top of each other. Nevertheless, I've learnt a lot from your help. Now I just need to figure out how to create bounding boxes for text above the line without discarding the line closest to the top.

You have detected the lines. Now you have to split your image into regions between the lines using the y coordinates, and then search for the black pixels (words) on the white background (paper).

Building a histogram along the x and y axes will likely give you the area of interest you're looking for.
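One way to read that histogram suggestion is as a projection profile: sum the ink pixels along each axis and look for the nonzero runs. A small sketch on a toy binary page (dark ink on light paper is assumed):

```python
import numpy as np

# toy binary page: 0 = ink, 255 = paper; a "word" occupies rows 10-19, cols 30-79
page = np.full((100, 200), 255, dtype=np.uint8)
page[10:20, 30:80] = 0

ink = page < 128                  # boolean ink mask
row_profile = ink.sum(axis=1)     # histogram along y: ink pixels per row
col_profile = ink.sum(axis=0)     # histogram along x: ink pixels per column

rows = np.flatnonzero(row_profile)
cols = np.flatnonzero(col_profile)
top, bottom = rows[0], rows[-1]
left, right = cols[0], cols[-1]
print(top, bottom, left, right)   # 10 19 30 79
```

Peaks in `row_profile` correspond to text lines, valleys to the gaps between them.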


Just to answer your questions in the comments: for example, if you have an image img and an area of interest with y coordinates (100, 200) spanning the whole width of the image, you may crop that area down and search for anything there like this:

cropped = img[100:200,5:-5]  # crop a few pixels off in x-direction just in case

Now the search:

top, left = 10000, 10000
bottom, right = 0, 0
for i in range(cropped.shape[0]):
    for j in range(cropped.shape[1]):
        if cropped[i][j] < 200:    # black?
            top = min(i, top)
            bottom = max(i, bottom)
            left = min(j, left)
            right = max(j, right)

Or something along those lines...
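The same search can be done without explicit loops; a NumPy sketch, assuming `cropped` is a single-channel (grayscale) array:

```python
import numpy as np

cropped = np.full((100, 430), 255, dtype=np.uint8)  # stand-in for the cropped strip
cropped[40:55, 60:300] = 0                          # pretend handwriting

dark = np.argwhere(cropped < 200)   # (row, col) pairs of dark pixels
if dark.size:
    top, left = dark.min(axis=0)
    bottom, right = dark.max(axis=0)
    print(top, bottom, left, right)  # 40 54 60 299
```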

6 Comments

@lenik - Can I have a little more guidance? I am new to using OpenCV and I've never used the histogram.
print(imgLines) gives [[[ 38 255 437 255]] [[ 38 253 437 253]] [[ 38 330 437 330]] [[ 38 328 437 328]] [[ 38 404 437 404]] [[ 38 402 437 402]] [[ 38 181 437 181]] [[ 38 477 437 477]] [[ 38 479 437 479]] [[ 38 179 437 179]] [[ 38 104 437 104]]]. These describe the lines, from which I can extract specific regions, but shouldn't there be only 6?
@AlanJones there are actually only 6, but some of them are repeated for the upper and lower side, like 253 and 255, 402 and 404 etc.
@AlanJones in imgLines there are clearly 6 lines with the y coordinates: 255, 330, 404, 477, 104, 179 -- once you sort them you have an average width of the space allocated for the writing and 6 potential regions to crop and analyze.
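A small sketch of that de-duplication step: merge y values that are only a few pixels apart (the upper and lower edge of the same dotted line), using the coordinates printed above. The 5-pixel gap is an assumed tolerance:

```python
# y coordinates reported by HoughLinesP, including near-duplicates
ys = sorted([255, 253, 330, 328, 404, 402, 181, 477, 479, 179, 104])

merged = []
for y in ys:
    if not merged or y - merged[-1] > 5:   # treat near-duplicates as one line
        merged.append(y)
print(merged)  # [104, 179, 253, 328, 402, 477]
```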
@AlanJones added a few lines of the code to the answer
