BeautifulSoup - Extracting content blocks after specific subheadings within a larger section, ignoring document introduction

Question

I am scraping the Dead by Daylight Fandom wiki (specifically TOME pages, e.g., https://deadbydaylight.fandom.com/wiki/Tome_1_-_Awakening) to extract memory logs.

The goal is to extract the Memory Title (mw-headline <span>) and its corresponding Memory Body (text contained in subsequent elements like <td>, <p>, etc.) as separate records, while strictly ignoring the main TOME introductory text at the top of the page.

The Problem

My current script successfully identifies all memory titles, but the function that extracts the body content often pulls the general TOME introductory text (the large overview paragraph at the very top of the article) into the body of the first memory log. The result is duplicate, incorrect body text in many of the subsequent memory records.

The core issue is scoping the content extraction correctly: I need to ensure that when searching for Memory Body content after a Memory Title, it only looks at the elements until the next Memory Title.

My Current Approach (Simplified)

I have two main functions:

crawl_and_extract_tags: Finds the main "Memories and Logs" section and iterates over individual memory titles (mw-headline <span>).
extract_content_after_headline: Takes a memory title tag and traverses its next siblings to find the body content until the next major heading.

The main logic of extract_content_after_headline (where the issue likely lies):

def extract_content_after_headline(headline_tag):
    body_content = []
    
    # Finds the immediate parent heading (e.g., <h2>) of the specific memory title (<span>)
    parent_heading = headline_tag.find_parent(['h2', 'h3', 'h4']) 
    if not parent_heading:
        return "Parent tag not found", ""

    # Start searching from the next sibling of the parent heading
    current_element = parent_heading.next_sibling
    
    # Loop until the next major heading (h2, h3, h4) is found
    while current_element and current_element.name not in ['h2', 'h3', 'h4']:
        if current_element.name in ['td', 'p', 'div', 'blockquote', 'li']:
            element_text = current_element.get_text(separator=' ', strip=True)
            if element_text:
                body_content.append(element_text)
        
        current_element = current_element.next_sibling
    
    return "\n\n".join(body_content), "" # Omitted italics content for brevity

The Request

How can I modify extract_content_after_headline to reliably capture only the content belonging to that specific memory log, without pulling in the general page introduction?
Is there a better way to structure the extraction flow (e.g., finding the boundaries of the main "Memories and Logs" section more strictly) to prevent the introductory text from being seen as the first memory's body?

Any suggestions on using BeautifulSoup's selectors or traversal methods (find_next_sibling, etc.) more effectively in this Fandom wiki structure would be greatly appreciated.
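
One way to make the scoping stricter, sketched under assumptions rather than verified against the live page: walk the siblings of the "Memories and Logs" heading in a single pass, open a new record at each deeper heading, and stop at the next heading of the same or higher level. The name `extract_memories` and the record layout are hypothetical, and the sketch assumes the memory headings are direct siblings of the section heading (if they sit inside a wrapper element, the loop would need to descend into it). There is deliberately no fallback to a page-wide search, so text above the section cannot be attached to any record.

```python
from bs4 import BeautifulSoup, Tag

HEADING_TAGS = ("h2", "h3", "h4")


def extract_memories(html, section_titles=("Memories", "Memories and Logs")):
    """Return [{'title': ..., 'body': ...}] for headings inside the target section only.

    Hypothetical helper: assumes memory headings are siblings of the section heading.
    """
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", class_="mw-headline", string=list(section_titles))
    if span is None:
        return []  # no page-wide fallback, so intro text cannot leak in
    section_heading = span.find_parent(HEADING_TAGS)
    if section_heading is None:
        return []
    section_level = int(section_heading.name[1])  # 'h2' -> 2

    records = []
    current = None
    for node in section_heading.next_siblings:
        if not isinstance(node, Tag):
            continue  # skip whitespace strings and comments between tags
        if node.name in HEADING_TAGS:
            if int(node.name[1]) <= section_level:
                break  # a heading of the same or higher level ends the section
            title = node.find("span", class_="mw-headline")
            title_tag = title if title is not None else node
            current = {"title": title_tag.get_text(strip=True), "body": []}
            records.append(current)
        elif current is not None:
            # Text before the first memory heading is ignored on purpose.
            text = node.get_text(separator=" ", strip=True)
            if text:
                current["body"].append(text)

    for record in records:
        record["body"] = "\n\n".join(record["body"])
    return records
```

Each returned dictionary holds one title and its own body, so a list of them could be handed to `pd.DataFrame` in the same way the full script below builds its rows. Note that a deeper sub-heading inside a memory would start a new record in this sketch; whether that is desirable depends on the page.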

Full code:

import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
# pandas uses openpyxl to read and write Excel files.

# --- Constants ---
# File extension changed to .xlsx
DEFAULT_TEXT_FILENAME = "DbD_TOME_Extracted_Data.xlsx"

def extract_content_after_headline(headline_tag):
    """
    Extract the memory body (all text elements) and the italic content (<i>)
    that follow the given mw-headline <span> tag, up to the next headline.

    Key improvement: besides text nodes, <figcaption> elements that follow
    <img> tags are also included in the body.
    """
    body_content = []
    italics_content = []

    # 1. Find the headline's parent heading tag (<h2>, <h3>, or <h4>).
    parent_heading = headline_tag.find_parent(['h2', 'h3', 'h4'])
    if not parent_heading:
        return "Failed to find body tag", ""

    # 2. Walk the following sibling elements (next siblings).
    # Repeat until the next <h2>, <h3>, or <h4> tag appears
    current_element = parent_heading.next_sibling

    while current_element and current_element.name not in ['h2', 'h3', 'h4', 'script', 'style']:
        if current_element.name:
            # 2.1. Extract <i> tag content (italic text)
            i_tags = current_element.find_all('i')
            for i_tag in i_tags:
                i_text = i_tag.get_text(separator=' ', strip=True)
                if i_text:
                    italics_content.append(i_text)

            # 2.2. Extract regular text content:
            # add major block elements such as <td>, <p>, <div>, <blockquote>, and <li> to the memory body.
            if current_element.name in ['td', 'p', 'div', 'blockquote', 'li', 'dd', 'dt']:
                element_text = current_element.get_text(separator=' ', strip=True)
                if element_text:
                    # Append only the text, with surplus whitespace stripped
                    body_content.append(element_text)

            # 2.3. Also add image captions (figure/figcaption) to the body (to account for the DbD wiki structure)
            figcaption_tags = current_element.find_all('figcaption')
            for figcaption in figcaption_tags:
                caption_text = figcaption.get_text(separator=' ', strip=True)
                if caption_text:
                    body_content.append(caption_text)

        current_element = current_element.next_sibling

    # Combine and return the results
    # Join the block elements into one body, separated by two newlines
    return "\n\n".join(body_content), "\n---\n".join(italics_content)


def crawl_and_extract_tags(url):
    """
    Fetch the web page and, using the 'mw-headline <span>' elements inside the
    'Memories and Logs' section as anchors, extract data per memory block and
    return a list of individual rows.
    """

    tome_title = "N/A"
    list_of_data_rows = []

    try:
        print(f"\n[Start] Crawling URL: {url} ...")

        # 1. Send the HTTP request and parse the response
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        # 2. Extract the TOME title (H1)
        h1_tag = soup.find('h1', {'id': 'firstHeading'})
        if h1_tag:
            tome_title = h1_tag.get_text(strip=True)

        # 3. **Set the search scope: find the 'Memories and Logs' section**
        # The Fandom wiki's table of contents has a 'Memories and Logs' or 'Memories' section.

        # 3.1. Find the <span> tag of the 'Memories' or 'Memories and Logs' headline.
        memory_section_span = soup.find('span', class_='mw-headline', string=['Memories', 'Memories and Logs'])

        # 3.2. Use that section's parent heading tag (e.g., <h2>) as the starting point of the search.
        search_start_element = None
        if memory_section_span:
            search_start_element = memory_section_span.find_parent(['h2', 'h3', 'h4'])

        # 3.3. If there is no start element (no Memories section), search the whole document.
        if not search_start_element:
            print("⚠️ Could not find the 'Memories and Logs' section. Searching the whole document for memory headlines.")

        # 4. **Collect the list of memory headlines (mw-headline <span>)**
        # Restrict the search scope to the inside of the 'Memories and Logs' section.

        memory_headline_tags = []
        if search_start_element:
            # Walk every element after the 'Memories' section heading and collect the mw-headline spans inside them
            current_element = search_start_element.next_sibling
            while current_element and current_element.name not in ['h2', 'h3']: # search until the next major section
                if current_element.name:
                    # Look for mw-headline in the current element or its children.
                    mw_headlines = current_element.find_all('span', class_='mw-headline')
                    memory_headline_tags.extend(mw_headlines)

                current_element = current_element.next_sibling

            # If no memory headlines were found directly inside the section, iterate over the headline tags themselves.
            if not memory_headline_tags:
                 memory_headline_tags = soup.find_all('span', class_='mw-headline')
        else:
            # If the Memories section was not found, search the whole document.
            memory_headline_tags = soup.find_all('span', class_='mw-headline')


        # 5. Iterate over each memory headline, extract data, and build rows
        if memory_headline_tags:
            for span_tag in memory_headline_tags:
                span_content = span_tag.get_text(strip=True)

                # 5.1. Extract the body and italic content after this <span>
                memo_body, memo_italics = extract_content_after_headline(span_tag)

                # 5.2. Build the row data
                row = {
                    'TOME Title': tome_title,
                    'mw-headline <span> Title': span_content,
                    'Memory Body (Main Text)': memo_body,
                    'Memory Italic Content (<i>)': memo_italics
                }
                list_of_data_rows.append(row)
        else:
            # No memory headlines found (previous logic kept)
            td_content = ""
            td_tags = soup.find_all('td')
            if td_tags:
                td_content = td_tags[0].get_text(separator=' ', strip=True)

            i_content = "\n---\n".join([i.get_text(separator=' ', strip=True) for i in soup.find_all('i') if i.get_text(separator=' ', strip=True)])

            row = {
                'TOME Title': tome_title,
                'mw-headline <span> Title': "No mw-headline <span>",
                'Memory Body (Main Text)': td_content,
                'Memory Italic Content (<i>)': i_content
            }
            list_of_data_rows.append(row)


        print(f"✅ TOME: {tome_title} data extraction complete. Rows created: {len(list_of_data_rows)}")
        return list_of_data_rows # Return a list of dictionaries

    except requests.exceptions.RequestException as e:
        print(f"❌ Failed to reach the website or request data: {e}")
        return None
    except Exception as e:
        print(f"❌ Unknown error during crawling: {e}")
        return None


def append_to_excel_file(new_data, output_filename):
    """
    Append new data to an existing Excel file, or create a new file.
    (new_data may be a list of dictionaries.)
    """
    if not new_data:
        print("⚠️ No data was extracted, so nothing will be saved.")
        return

    if not output_filename.lower().endswith('.xlsx'):
        output_filename += '.xlsx'

    # Convert to a pandas DataFrame for saving to Excel
    new_df = pd.DataFrame(new_data)

    try:
        # Check whether the file already exists
        if os.path.exists(output_filename):
            # Load the existing data and append the new data as rows.
            existing_df = pd.read_excel(output_filename, engine='openpyxl')
            # pandas aligns the data by column name when appending rows.
            combined_df = pd.concat([existing_df, new_df], ignore_index=True)
        else:
            # If the file does not exist, use the new DataFrame.
            combined_df = new_df

        # Save the combined DataFrame to the Excel file. (index=False drops the unneeded index column)
        combined_df.to_excel(output_filename, index=False, engine='openpyxl')

        print(f"⭐ Data was successfully added to the Excel file '{output_filename}'.")

    except FileNotFoundError:
        print(f"❌ Error: Excel file '{output_filename}' not found. (Check the path)")
    except Exception as e:
        print(f"❌ Error: a problem occurred while saving the Excel file. ({e})")
        
        
def main():
    print("\n=======================================================")
    print(" 🔗 DbD TOME (one-to-one matching by mw-headline) Excel crawler")
    print("=======================================================")

    # 1. Library installation note (pandas and openpyxl added)
    print("💡 This code uses 'requests', 'bs4', 'pandas', and 'openpyxl'.")
    print("   Install: pip install requests beautifulsoup4 pandas openpyxl")

    # 2. Filename input
    filename_input = input(f"\n[Required] Excel filename to save (default: {DEFAULT_TEXT_FILENAME}): ").strip()
    text_filename = filename_input if filename_input else DEFAULT_TEXT_FILENAME

    # 3. Manual URL input and repeat loop
    while True:
        # Show an example URL in the prompt for convenience
        # The example URL stays a DbD TOME page.
        url_input = input("\nEnter the TOME URL to crawl (e.g., https://deadbydaylight.fandom.com/wiki/Tome_1_-_Awakening | type 'quit' to exit): ")

        if url_input.strip().lower() == 'quit':
            print("\nExiting the program. Thank you. 👋")
            break

        if not url_input.strip():
            print("❌ Please enter a valid URL.")
            continue

        # 4. Crawl and append to the Excel file
        # extracted_data may now be a list of dictionaries.
        extracted_data = crawl_and_extract_tags(url_input)

        if extracted_data:
            append_to_excel_file(extracted_data, text_filename)


if __name__ == "__main__":
    main()


With this approach, the body text is not properly aligned in the Excel output.

ZortaZert · Accepted Answer · 2025-11-29 17:12:10Z

0

I would recommend using the MediaWiki API, which Fandom supports, for this task. Wikitext is probably easier to parse than HTML for the use case you describe. I can't really tell what operation you're trying to perform, but whatever it is, it is probably easier to do on the wikitext markup than on plain HTML.

https://deadbydaylight.fandom.com/api.php?action=parse&page=Tome_1_-_Awakening&prop=wikitext&formatversion=2

I got the API endpoint above from method 3 on this page of the MediaWiki API documentation: https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page
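
A minimal sketch of that approach, using only the standard library. It assumes the memories appear under ordinary `== ... ==` style headings in the page's wikitext, which has not been checked for this page; templates or tables in the wikitext would need extra handling. The helper names `fetch_wikitext` and `split_sections` are hypothetical, and `format=json` is added to the endpoint's parameters so the response can be decoded as JSON.

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://deadbydaylight.fandom.com/api.php"

# A wikitext heading line: the same run of 2-6 '=' on both sides of the title.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def fetch_wikitext(page, api_url=API_URL):
    """Fetch the raw wikitext of a page through the MediaWiki parse API."""
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": page,
        "prop": "wikitext",
        "formatversion": "2",
        "format": "json",  # assumption: request JSON output explicitly
    })
    request = urllib.request.Request(
        f"{api_url}?{params}",
        headers={"User-Agent": "tome-wikitext-sketch/0.1"},  # placeholder agent string
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        data = json.load(response)
    # With formatversion=2 the wikitext is a plain string under parse.wikitext.
    return data["parse"]["wikitext"]


def split_sections(wikitext):
    """Split wikitext into (level, title, body) tuples.

    Text before the first heading (the page introduction) is dropped.
    """
    matches = list(HEADING_RE.finditer(wikitext))
    sections = []
    for index, match in enumerate(matches):
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(wikitext)
        body = wikitext[match.end():end].strip()
        sections.append((len(match.group(1)), match.group(2), body))
    return sections
```

Something like `split_sections(fetch_wikitext("Tome_1_-_Awakening"))` would then give one tuple per heading, and the memory entries could be selected by heading level or by their position after the "Memories and Logs" heading.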

