Finding text between two lines using Python OpenCV

Question

I want to identify and highlight / crop the text between two lines using Python (cv2).

One line is a wavy line at the top of the page; the second line is somewhere lower down. The second line can appear at any height on the page, from just below the first line of text to just above the last.

An example,

I believe I need to use HoughLinesP() somehow with proper parameters for this. I've tried some examples involving a combination of erode + dilate + HoughLinesP.

e.g.


    import cv2
    import numpy as np

    img = cv2.imread(image)  # `image` is the path to a page scan
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    kernel_size = 5
    blur_gray = cv2.GaussianBlur(gray, (kernel_size, kernel_size), 0)

    # erode / dilate
    erode_kernel_param = (5, 200)   # (5, 50)
    dilate_kernel_param = (5, 5)  # (5, 75)

    img_erode = cv2.erode(blur_gray, np.ones(erode_kernel_param))
    img_dilate = cv2.dilate(img_erode, np.ones(dilate_kernel_param))

    # %% Second, run Canny edge detection.

    low_threshold = 50
    high_threshold = 150
    edges = cv2.Canny(img_dilate, low_threshold, high_threshold)

    # %% Then, use HoughLinesP to get the lines.
    # Adjust the parameters for better performance.

    rho = 1  # distance resolution in pixels of the Hough grid
    theta = np.pi / 180  # angular resolution in radians of the Hough grid
    threshold = 15  # min number of votes (intersections in Hough grid cell)
    min_line_length = 600  # min number of pixels making up a line
    max_line_gap = 20  # max gap in pixels between connectable line segments
    line_image = np.copy(img) * 0  # creating a blank to draw lines on

    # %%  Run Hough on edge detected image
    # Output "lines" is an array containing endpoints of detected line segments

    lines = cv2.HoughLinesP(edges, rho, theta, threshold,
                            minLineLength=min_line_length,
                            maxLineGap=max_line_gap)

    if lines is not None:
        for line in lines:
            for x1, y1, x2, y2 in line:
                cv2.line(line_image, (x1, y1), (x2, y2), (255, 0, 0), 5)

    # %% Draw the lines on the  image

    lines_edges = cv2.addWeighted(img, 0.8, line_image, 1, 0)

However, in many cases the lines don't get identified properly. Some examples of the errors:

- Too many lines being identified (ones in the text as well)
- Lines not being identified completely
- Lines not being identified at all

Am I on the right track? Do I just need to find the right combination of parameters, or is there a simpler way or trick that will let me reliably crop the text between these two lines?

In case it's relevant, I need to do this for ~450 pages. Here's the link to the book, in case someone wants to examine more pages: https://archive.org/details/in.ernet.dli.2015.553713/page/n13/mode/2up

Thank you.

Solution

I've made minor modifications to the answer by Ari (thank you) and made the code a bit more comprehensible for my own sake. Here's my code.

The core idea is:

1. Find contours and their bounding rectangles.
2. The two "widest" contours represent the two lines.
3. Take the lower side of the top rectangle and the upper side of the bottom rectangle to bound the area (text) we are interested in.


import cv2

# `images` is a list of paths to the page scans (.jpg files)
for image in images:
    base_img = cv2.imread(image)
    height, width, channels = base_img.shape

    img = cv2.cvtColor(base_img, cv2.COLOR_BGR2GRAY)
    ret, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    img = cv2.bitwise_not(img)

    contours, hierarchy = cv2.findContours(
        img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE
    )

    # Get rectangle bounding contour
    rects = [cv2.boundingRect(contour) for contour in contours]

    # Rectangle is (x, y, w, h)
    # Top-Left point of the image is (0, 0), rightwards X, downwards Y

    # Sort the contours bigger width first
    rects.sort(key=lambda r: r[2], reverse=True)

    # Get the 2 "widest" rectangles
    line_rects = rects[:2]
    line_rects.sort(key=lambda r: r[1])

    # If at least two rectangles (contours) were found
    if len(line_rects) >= 2:
        top_x, top_y, top_w, top_h = line_rects[0]
        bot_x, bot_y, bot_w, bot_h = line_rects[1]

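        # Cropping the img
        # Crop between bottom y of the upper rectangle (i.e. top_y + top_h)
        # and the top y of lower rectangle (i.e. bot_y).
        # .copy() keeps the crop independent of base_img, so the rectangle
        # drawn below does not show up in the cropped image.
        crop_img = base_img[top_y+top_h:bot_y].copy()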

        # Highlight the area by drawing the rectangle
        # For full width, 0 and width can be used, while
        # For exact width (erroneous) top_x and bot_x + bot_w can be used
        rect_img = cv2.rectangle(
            base_img,
            pt1=(0, top_y + top_h),
            pt2=(width, bot_y),
            color=(0, 255, 0),
            thickness=2
        )
        cv2.imwrite(image.replace('.jpg', '.rect.jpg'), rect_img)
        cv2.imwrite(image.replace('.jpg', '.crop.jpg'), crop_img)
    else:
        print(f"Insufficient contours in {image}")

The top line is always wavy, the bottom line is always straight. (There are two images one after the other in the post example, which might be the source of your confusion.)
– Hrishikesh, Commented Jul 27, 2021 at 18:25

Ari · Accepted Answer · 2021-07-27 19:45:48Z

You can find the contours and then take the two with the largest width.

import cv2

base_img = cv2.imread('a.png')

img = cv2.cvtColor(base_img, cv2.COLOR_BGR2GRAY)
ret, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
img = cv2.bitwise_not(img)

cnts, hierarchy = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

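# sort the contours, widest first (findContours returns a tuple in
# recent OpenCV versions, so use sorted() rather than list.sort())
cnts = sorted(cnts, key=lambda c: cv2.boundingRect(c)[2], reverse=True)

# get the bounding rects (x, y, w, h) of the two widest contours
lines = [cv2.boundingRect(cnts[0]), cv2.boundingRect(cnts[1])]
# higher line first
lines.sort(key=lambda c: c[1])
# crop the image between the top edges of the two lines
crop_img = base_img[lines[0][1]:lines[1][1]]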

edited Jul 27, 2021 at 19:45

answered Jul 26, 2021 at 18:47


5 Comments

Hrishikesh Over a year ago

Could you please explain how the cropping works? If I understand correctly, you're using the y part of the "rectangle" (which you are treating as a line)?

Hrishikesh Over a year ago

Also, if we want to get the "higher" line first, we should check c[1], right?

Hrishikesh Over a year ago

Thank you, your method seems to work for most of the images. I've examined more than 200 images and found only 1 error so far. (There were a few more errors, but those were caused by additional horizontal lines on the page, so I can't fault the algorithm for that.)
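
One way to catch such pages automatically is to require that both line candidates span a large share of the page width, and to flag pages where more than two candidates qualify. The following is a minimal sketch under that assumption and is not part of the original answer; the helper name pick_line_rects and the min_width_ratio threshold of 0.5 are hypothetical and would need tuning for this book. It operates on the (x, y, w, h) tuples returned by cv2.boundingRect.

```python
def pick_line_rects(rects, page_width, min_width_ratio=0.5):
    """Pick the top and bottom separator lines from bounding rectangles.

    rects: iterable of (x, y, w, h) tuples, as returned by cv2.boundingRect.
    min_width_ratio: hypothetical threshold; rectangles narrower than this
        fraction of the page width are not considered line candidates.

    Returns (top_rect, bottom_rect, ambiguous), or None if fewer than two
    candidates are wide enough. `ambiguous` is True when more than two
    candidates qualified, i.e. the page has extra horizontal lines.
    """
    wide = [r for r in rects if r[2] >= min_width_ratio * page_width]
    if len(wide) < 2:
        return None

    # Keep the two widest candidates, then order them top to bottom by y.
    widest_two = sorted(wide, key=lambda r: r[2], reverse=True)[:2]
    top, bottom = sorted(widest_two, key=lambda r: r[1])
    return top, bottom, len(wide) > 2


# Made-up rectangles for a 1000-pixel-wide page: two lines plus two text blobs.
sample_rects = [(10, 5, 900, 12), (40, 300, 60, 20),
                (12, 700, 880, 3), (100, 400, 200, 25)]
result = pick_line_rects(sample_rects, page_width=1000)
```

Pages for which the function returns None, or for which the ambiguous flag is set, could be written to a list for manual review instead of being cropped.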

Ari Over a year ago

Cropping works by taking the y coordinate of both the first and the second line and keeping the data between those two lines. If you don't want the line to be in the image, you can crop like this: crop_img = base_img[lines[0][1] + lines[0][3] : lines[1][1]], which adds the first line's height to its y.
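
The difference between the two slices can be seen on a toy example. This is only an illustrative sketch: each string stands in for one pixel row of the page, the rectangles are made-up values, and crop_rows is a hypothetical helper. Slicing a NumPy image along its first axis selects rows in the same way as slicing this list.

```python
# Each string stands in for one pixel row of the page image.
page_rows = ["header", "wavy", "wavy", "wavy",
             "text", "text", "text", "straight", "footer"]

# Made-up bounding rectangles (x, y, w, h) for the two lines.
top_line = (0, 1, 100, 3)     # wavy line occupies rows 1, 2 and 3
bottom_line = (0, 7, 100, 1)  # straight line occupies row 7


def crop_rows(rows, top, bottom, include_top_line=True):
    """Return the rows between the two lines.

    With include_top_line=True the crop starts at the top line's y, so the
    wavy line itself is kept. Otherwise the top line's height is added to
    its y, so the crop starts just below the line.
    """
    start = top[1] if include_top_line else top[1] + top[3]
    return rows[start:bottom[1]]


with_line = crop_rows(page_rows, top_line, bottom_line)
without_line = crop_rows(page_rows, top_line, bottom_line,
                         include_top_line=False)
```

In both cases the slice stops at the bottom line's y, so the straight line is never part of the crop.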

Hrishikesh Over a year ago

Thanks, I actually figured out the cropping. I updated my question with the solution code that I am using (which is essentially your code with some more modifications and documentation for my own sake when I return to this code after a couple of months)
