
I want to identify and highlight / crop the text between two lines using Python (cv2).

One line is a wavy line at the top, and the second is a straight line somewhere on the page. The second line can appear at any height, from just below the first line of text to just above the last line.

An example,

Page 1

I believe I need to use HoughLinesP() somehow with proper parameters for this. I've tried some examples involving a combination of erode + dilate + HoughLinesP.

e.g.


    import cv2
    import numpy as np

    img = cv2.imread(image)  # image: path to a page scan
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    kernel_size = 5
    blur_gray = cv2.GaussianBlur(gray, (kernel_size, kernel_size), 0)

    # erode / dilate
    erode_kernel_param = (5, 200)   # (5, 50)
    dilate_kernel_param = (5, 5)  # (5, 75)

    img_erode = cv2.erode(blur_gray, np.ones(erode_kernel_param))
    img_dilate = cv2.dilate(img_erode, np.ones(dilate_kernel_param))

    # %% Second, run Canny edge detection.

    low_threshold = 50
    high_threshold = 150
    edges = cv2.Canny(img_dilate, low_threshold, high_threshold)

    # %% Then, use HoughLinesP to get the lines.
    # Adjust the parameters for better performance.

    rho = 1  # distance resolution in pixels of the Hough grid
    theta = np.pi / 180  # angular resolution in radians of the Hough grid
    threshold = 15  # min number of votes (intersections in Hough grid cell)
    min_line_length = 600  # min number of pixels making up a line
    max_line_gap = 20  # max gap in pixels between connectable line segments
    line_image = np.copy(img) * 0  # creating a blank to draw lines on

    # %%  Run Hough on edge detected image
    # Output "lines" is an array containing endpoints of detected line segments

    lines = cv2.HoughLinesP(edges, rho, theta, threshold, np.array([]),
                            min_line_length, max_line_gap)

    if lines is not None:
        for line in lines:
            for x1, y1, x2, y2 in line:
                cv2.line(line_image, (x1, y1), (x2, y2), (255, 0, 0), 5)

    # %% Draw the lines on the  image

    lines_edges = cv2.addWeighted(img, 0.8, line_image, 1, 0)

However, in many cases the lines don't get identified properly. Some examples of the errors:

  1. Too many lines being identified (ones in the text as well)
  2. Lines not being identified completely
  3. Lines not being identified at all

Am I on the right track? Do I just need to find the right combination of parameters, or is there a simpler way / trick which will let me reliably crop the text between these two lines?
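If staying with HoughLinesP, one way to reduce the first kind of error (segments picked up inside the text) might be to keep only near-horizontal segments. The following is a minimal sketch, not a tested fix for these pages: `keep_horizontal` and `max_angle_deg` are hypothetical names, and the input is assumed to have the shape `HoughLinesP` returns, i.e. `(N, 1, 4)` with `x1, y1, x2, y2` per segment.

```python
import numpy as np


def keep_horizontal(lines, max_angle_deg=2.0):
    """Keep segments whose slope is within max_angle_deg of horizontal.

    `lines` is the array returned by cv2.HoughLinesP (or None).
    Returns an (M, 4) array of x1, y1, x2, y2 rows.
    """
    if lines is None:
        return np.empty((0, 4), dtype=int)
    segs = np.asarray(lines).reshape(-1, 4)
    dx = np.abs(segs[:, 2] - segs[:, 0])
    dy = np.abs(segs[:, 3] - segs[:, 1])
    # Angle between each segment and the x axis, in degrees (0 = horizontal)
    angles = np.degrees(np.arctan2(dy, dx))
    return segs[angles <= max_angle_deg]
```

This only addresses spurious segments; it does not help when a line is missed or only partly detected.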

In case it's relevant, I need to do this for ~450 pages. Here's the link to the book, in case someone wants to examine more examples of pages. https://archive.org/details/in.ernet.dli.2015.553713/page/n13/mode/2up

Thank you.


Solution

I've made minor modifications to the answer by Ari (thank you) and made the code a bit more comprehensible for my own sake. Here's my code.

The core idea is:

  • Find contours and their bounding rectangles.
  • The two widest contours represent the two lines.
  • Then take the lower side of the top rectangle and the upper side of the bottom rectangle to bound the area (text) we are interested in.

import cv2

# images: list of paths to the page scans (.jpg files)
for image in images:
    base_img = cv2.imread(image)
    height, width, channels = base_img.shape

    img = cv2.cvtColor(base_img, cv2.COLOR_BGR2GRAY)
    ret, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    img = cv2.bitwise_not(img)

    contours, hierarchy = cv2.findContours(
        img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE
    )

    # Get rectangle bounding contour
    rects = [cv2.boundingRect(contour) for contour in contours]

    # Rectangle is (x, y, w, h)
    # Top-Left point of the image is (0, 0), rightwards X, downwards Y

    # Sort the contours bigger width first
    rects.sort(key=lambda r: r[2], reverse=True)

    # Get the 2 "widest" rectangles
    line_rects = rects[:2]
    line_rects.sort(key=lambda r: r[1])

    # If at least two rectangles (contours) were found
    if len(line_rects) >= 2:
        top_x, top_y, top_w, top_h = line_rects[0]
        bot_x, bot_y, bot_w, bot_h = line_rects[1]

        # Cropping the img
        # Crop between bottom y of the upper rectangle (i.e. top_y + top_h)
        # and the top y of lower rectangle (i.e. bot_y)
        # .copy() keeps the rectangle drawn below out of the cropped image
        crop_img = base_img[top_y + top_h:bot_y].copy()

        # Highlight the area by drawing the rectangle
        # For full width, 0 and width can be used, while
        # For exact width (erroneous) top_x and bot_x + bot_w can be used
        rect_img = cv2.rectangle(
            base_img,
            pt1=(0, top_y + top_h),
            pt2=(width, bot_y),
            color=(0, 255, 0),
            thickness=2
        )
        cv2.imwrite(image.replace('.jpg', '.rect.jpg'), rect_img)
        cv2.imwrite(image.replace('.jpg', '.crop.jpg'), crop_img)
    else:
        print(f"Insufficient contours in {image}")
  • are the lines always identical? Commented Jul 26, 2021 at 18:20
  • The top line is always wavy, the bottom line is always straight. (There are two images one after the other in the post example, which might be the source of your confusion) Commented Jul 27, 2021 at 18:25

1 Answer

You can find the contours and then take the two with the largest width.

import cv2

base_img = cv2.imread('a.png')

img = cv2.cvtColor(base_img, cv2.COLOR_BGR2GRAY)
ret, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
img = cv2.bitwise_not(img)

cnts, hierarchy = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

# sort the contours, widest first (sorted() also works when cnts is a tuple)
cnts = sorted(cnts, key=lambda c: cv2.boundingRect(c)[2], reverse=True)

# get the bounding rectangles (x, y, w, h) of the two widest contours
lines = [cv2.boundingRect(cnts[0]), cv2.boundingRect(cnts[1])]
# higher line first
lines.sort(key=lambda c: c[1])
# crop the image between the y coordinates of the two lines
crop_img = base_img[lines[0][1]:lines[1][1]]

5 Comments

Could you please explain how the cropping works? If I understand correctly, you're using y part of the "rectangle" (which you are treating as a line)?
Also, if we want to get "higher" line first, we should check c[1] right?
Thank you, your method seems to work for most of the images, I've examined more than 200 images, and only 1 error so far. (There were a few more errors, but causes for those were presence of more horizontal lines, can't fault the algorithm for that)
Cropping works by taking the y coordinates of the first and second lines and keeping the data between them. If you don't want the top line to be in the image, you can crop like this: crop_img = base_img[lines[0][1] + lines[0][3] : lines[1][1]], which adds the height of the first line.
Thanks, I actually figured out the cropping. I updated my question with the solution code that I am using (which is essentially your code with some more modifications and documentation for my own sake when I return to this code after a couple of months)
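The two cropping variants discussed in these comments can be compared on a small stand-in array. This is only an illustration: the rectangles below are made-up values in the `(x, y, w, h)` layout that `cv2.boundingRect` returns, and the array stands in for a page image.

```python
import numpy as np

# Stand-in for a page: 100 rows, 40 columns, each row filled with its own index
page = np.repeat(np.arange(100)[:, None], 40, axis=1)

# Hypothetical bounding rectangles (x, y, w, h) for the two lines
top_line = (2, 10, 36, 4)     # occupies rows 10..13
bottom_line = (2, 80, 36, 2)  # occupies rows 80..81

# Variant from the answer: starts at the top line's y, so that line is included
with_line = page[top_line[1]:bottom_line[1]]

# Variant from the comment: skip the top line by adding its height
without_line = page[top_line[1] + top_line[3]:bottom_line[1]]

print(with_line.shape, without_line.shape)
```

In both variants the slice stops at the bottom line's y, so the bottom line itself is never part of the crop.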
