2

I have a movie script, "The Departed", and I want to parse out the dialogue by character name. The text file has no delimiters, but each character's name (for example, BILLY) appears in all caps. The all-caps names are the only identifier I have. I have read up on regex and looked through other threads, but I am unsure where to start.

import re

with open("Departed.txt", "r") as file:
    data = file.read()

pattern = re.compile(r'BILLY')
matches = pattern.finditer(data)
for match in matches:
    print(match)

This returns the whole script still... https://pastebin.com/226VzLWu

5
  • Can you post a sample of the text file, and any attempts you've made yourself to accomplish this task? Commented Nov 10, 2017 at 0:46
  • figshare.com/articles/scripts_D-E/4630171 When I load it into a dataframe, it comes back as 0x1128 with no rows. Commented Nov 10, 2017 at 0:54
  • Please edit your question to include the file itself. Commented Nov 10, 2017 at 0:56
  • Not sure if this helps, but this is some sample text. YOUNG COLIN Yeah. COSTELLO tells the Proprietor to takes three loaves of bread and some soup off the shelves and puts them in Colin's bag. COSTELLO Get him three loaves of bread. And a couple of half gallons of milk. And some soup. He goes over to the fridge and puts two half gallons of milk in the bag. Some soup. Costello turns to Colin. Commented Nov 10, 2017 at 0:57
  • pastebin.com/226VzLWu Commented Nov 10, 2017 at 1:22

3 Answers

2

Python's re module already has a built-in split, so try:

import re
re.split(r"\W(?=\b[A-Z ]+\b)", str(data), maxsplit=0, flags=re.X)

My output, based on the sample text in your comment (I am using Python 3):

Result list : ['Not sure if this helps, but this is some sample text.', 'YOUNG', 'COLIN Yeah.', "COSTELLO tells the Proprietor to takes three loaves of bread and some soup off the shelves and puts them in Colin's bag.", 'COSTELLO Get him three loaves of bread. And a couple of half gallons of milk. And some soup. He goes over to the fridge and puts two half gallons of milk in the bag. Some soup. Costello turns to Colin.']


2 Comments

That added a ', ' before every capitalized word. I was thinking about using key-value pairs to create two dictionaries that will log the specific lines for each character. Is that something you would advise?
Lol, it didn't add a comma before every word; it's just a list with each sentence in a separate list element.
1

Here's one way to do it quickly (you'll still need to do cleanup on this, but I think your answer is here):

import re
with open('Departed.txt', 'r') as f:
    data = f.read()

# match all words or sequences of words that are all caps
scene_or_character_re = re.compile(r'\b([A-Z][A-Z\W]+)\W')

groupings = scene_or_character_re.split(data)
# groupings is a list of strings: any text before the first match, then alternating caps, normal, caps, normal

def cleanup_spaces(s):
    '''helper function to replace whitespace with single spaces'''
    return re.sub(r'\s\s+', ' ', s).strip()

# split list into tuples of length two (caps, normal) 
pointer = iter(groupings)
groups = []
for p in pointer:
    if cleanup_spaces(p) == '':
        continue  # skip blank lines
    actor = cleanup_spaces(p)
    line = cleanup_spaces(next(pointer, ''))  # also advances the iterator used by the `for` loop; '' avoids StopIteration at the end
    groups.append((actor, line))

This gives you:

groups[100:110]

[('THE GOLD DOME OF BEACON HILL.',
  'The terraces of fine townhouses. Aqueous golden light behind. Misty golden beauty. ('),
 ('CONTINUED)', '7.'),
 ('CONTINUED: BARRIGAN',
  "What are you looking at? Forget it. Your father was a janitor, and his son's only a cop."),
 ('COLIN',
  'not vainglorious, but innocently stretching for the idea) You\'re in trouble if you\'re "only" anything.'),
 ('BARRIGAN',
  "Don't tell me I'm looking at the first dickhead-American president of the United States."),
 ('COLIN',
  "doesn't have a great sense of humor but he knows how to pretend that he does. He smiles."),
 ('EXT. STATE POLICE GRADUATION CEREMONY. DAY',
  'Bagpipes and bullshit. Flags cracking. Line after line of paramilitary-looking graduates, among them'),
 ('COLIN. SPEAKER (V.O.)',
  'The Massachusetts State Police has a long tradition of excellence. Your graduation today solidifies your acceptance into one of the finest law enforcement agencies in our nation. As the Governor of the Commonwealth of Massachusetts, I am confident each and every one of you will serve with distinction, honor and integrity.'),
 ('CAMERA', 'swirls around'),
 ('COLIN',
  'as he moves, a lone person, through the breaking up crowd. Other graduates are hugged by family.')]
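If the goal is to look up lines by character, the `(actor, line)` tuples can be folded into a dictionary. This is only a minimal sketch: `sample_groups` is a hand-copied, abridged excerpt of the output above, and `group_by_name` and `lines_by_name` are names made up for illustration. Scene headings, camera directions, and keys like 'CONTINUED: BARRIGAN' still end up as their own entries, so the cleanup mentioned earlier is still needed.

```python
from collections import defaultdict

# Abridged, hand-copied excerpt of the (actor, line) tuples shown above.
sample_groups = [
    ('CONTINUED: BARRIGAN', 'What are you looking at? Forget it.'),
    ('COLIN', "doesn't have a great sense of humor but he knows how to pretend that he does. He smiles."),
    ('CAMERA', 'swirls around'),
    ('COLIN', 'as he moves, a lone person, through the breaking up crowd.'),
]

def group_by_name(groups):
    """Collect every text chunk under the all-caps key that preceded it."""
    lines_by_name = defaultdict(list)
    for name, text in groups:
        lines_by_name[name].append(text)
    return dict(lines_by_name)

lines_by_name = group_by_name(sample_groups)
for text in lines_by_name['COLIN']:
    print(text)
```

Because the keys are whatever the regex captured, a follow-up pass (for example, stripping a leading 'CONTINUED: ') would be needed before the BARRIGAN entries line up under one name.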

Comments

0

The formatting is a bit of a mess, and relying on uppercase alone causes problems, because names also appear in uppercase when they are mentioned elsewhere in the script. The best result I've found comes from treating every run of 8 spaces as a tab, then breaking the text into a new line at each tab group. That results in this:

     THE DEPARTED
 Written by
     William Monahan
  Based on Infernal Affairs
      SCRIPT AS SHOT COMPILED SEPTEMBER 2006
FADE UP ON
    THE SOUTH BOSTON HOUSING PROJECTS. A MAZE OF BUILDINGS
  AGAINST THE HARBOR.
   COSTELLO (V.O.)
       I don't want to be a product of my
       environment. I want my environment
       to be a product...of me.
    YELLOW RIPPLES PAST THE CAMERA AND WHEN IT CLEARS WE SEE
  THROUGH DIESEL SMOKE: A BUSING PROTEST IN PROGRESS. THE
  SCHOOL-BUS, FULL OF BLACK KIDS, IS HIT WITH BRICKS, ROCKS.
  N.B.: (THIS IS NOT SETTING THE LIVE ACTION IN 1974; IT IS A
  HISTORICAL MONTAGE, THE BACKGROUND FOR COSTELLO'S V.O.).
    INT. THE AUTOBODY SHOP. DAY.
    COSTELLO's profile passes in a dark room.
   COSTELLO (V.O.)
       Years ago, we had the Church. That
       was only a way of saying we had
       each other. The Knights of Columbus
       were head-breakers. They took over
       their piece of the city.
    EXT. SOUTHIE. VARIOUS
    The neighborhood. 1980's. We won't be here long. This isn't
  where Costello ends up. It's where he began. Liquor stores
  with shamrocked signs. MEN FISHING near Castle Island.
  Catholic SCHOOLKIDS playing in an asphalted schoolyard.
   COSTELLO (V.O.)
       Twenty years after an Irishman
       couldn't get a job, we had the
       presidency. That's what the niggers
       don't realize. If I got one thing
       against the black chaps it's this.
       No one gives it to you. You have to
       take it.
    INT. LUNCH COUNTER. DAY
    COSTELLO comes in. The shop is one that sells papers,
  sundries, fountain drinks...and fronts a bookie operation.
   YOUNG COSTELLO
   (leaning over cluttered
    counter)
       Don't make me have to come down
       here again.
 (CONTINUED)
  2.
  CONTINUED:
     PROPRIETOR
       Won't happen again, Mr. C.
    The frightened proprietor hands over money. Fifty bucks, a
  hundred, doesn't matter. COSTELLO is never the threatener.
  His demeanor is gentle, philosophical. Almost a shrink's
  probing bedside manner. He has great interest in the world
  as he moves through it. As if he originally came from a
  different world and his survival in this one depends on close
  continual observation and analysis.
    YOUNG COLIN looks up. CLOSE ON his eyes. He is fourteen or
  fifteen, but small for his age. Bookish.
    COSTELLO eyes the proprietor's TEENAGE DAUGHTER, working
  behind the counter. He takes a propane lighter, and,
  strangely, pays for it (the proprietor startled) and waits
  for change. He lights a MORE cigarette with the lighter.
   YOUNG COSTELLO
       Carmen's developing into a fine
       young lady. You should be proud.
       You get your period yet, Carmen?
    The PROPRIETOR is uneasy. COSTELLO turns to YOUNG COLIN
  (about 14) staring at the local hero. Costello reaches up
  above and behind the counter and takes down some cigarettes.
   YOUNG COSTELLO (CONT'D)
       You Johnny Sullivan's kid?
    COLIN nods.
   YOUNG COSTELLO (CONT'D)
       You live with your grandmother?
    COLIN nods.
   YOUNG COLIN
       Yeah.
    COSTELLO tells the Proprietor to takes three loaves of bread
  and some soup off the shelves and puts them in Colin's bag.
   COSTELLO
       Get him three loaves of bread. And
       a couple of half gallons of milk.
       And some soup.
    He goes over to the fridge and puts two half gallons of milk
  in the bag. Some soup. Costello turns to Colin.
       (CONTINUED)

Here's a one-liner for this transformation:

output = re.sub('(        )+', '\n', data)
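To see why the indentation survives the substitution, here is a small sketch on a made-up one-line input. The space counts (19 and 23) are assumptions chosen for illustration, not measurements from the real file.

```python
import re

# Hypothetical input: what used to be line breaks are now long runs of spaces.
sample = ' ' * 19 + 'COSTELLO (V.O.)' + ' ' * 23 + "I don't want to be a product of my"

# The greedy pattern consumes as many whole 8-space groups as it can
# (16 of the 19 spaces, then 16 of the 23), so the leftover 3 and 7
# spaces remain as the indent of the new lines.
output = re.sub('(        )+', '\n', sample)

for line in output.split('\n'):
    indent = len(line) - len(line.lstrip(' '))
    print(indent, repr(line))
```

The leftover indent is what makes it possible to tell a character name from a line of dialogue afterwards.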

You can then see some of the formatting and could potentially distinguish between spoken lines and stage directions. This script attempts to guess the content type from the indent width (though it does struggle a bit later in the file):

from __future__ import print_function
import re

filename = 'Script_DepartedThe.txt'
with open(filename) as f:
    data = f.read()
script = re.sub('(        )+', '\n', data)

leading_space_re = re.compile('^ +')

block_types = {
        7: 'LINE:      ',
        2: 'DIRECTION: ',
        4: 'DIRECTION: ',
        3: 'CHARACTER: ',
        }

for line in script.split('\n'):
    # number of leading spaces designates block type
    match = leading_space_re.match(line)
    count = 0 if match is None else len(match.group())
    current_block = block_types.get(count, 'UNKNOWN:   ')

    print(current_block, count, line)

Output:

UNKNOWN:    0 
UNKNOWN:    5      THE DEPARTED
UNKNOWN:    1  Written by
UNKNOWN:    5      William Monahan
DIRECTION:  2   Based on Infernal Affairs
UNKNOWN:    6       SCRIPT AS SHOT COMPILED SEPTEMBER 2006
UNKNOWN:    0 FADE UP ON
DIRECTION:  4     THE SOUTH BOSTON HOUSING PROJECTS. A MAZE OF BUILDINGS
DIRECTION:  2   AGAINST THE HARBOR.
CHARACTER:  3    COSTELLO (V.O.)
LINE:       7        I don't want to be a product of my
LINE:       7        environment. I want my environment
LINE:       7        to be a product...of me.
DIRECTION:  4     YELLOW RIPPLES PAST THE CAMERA AND WHEN IT CLEARS WE SEE
DIRECTION:  2   THROUGH DIESEL SMOKE: A BUSING PROTEST IN PROGRESS. THE
DIRECTION:  2   SCHOOL-BUS, FULL OF BLACK KIDS, IS HIT WITH BRICKS, ROCKS.
DIRECTION:  2   N.B.: (THIS IS NOT SETTING THE LIVE ACTION IN 1974; IT IS A
DIRECTION:  2   HISTORICAL MONTAGE, THE BACKGROUND FOR COSTELLO'S V.O.).
DIRECTION:  4     INT. THE AUTOBODY SHOP. DAY.
DIRECTION:  4     COSTELLO's profile passes in a dark room.
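Building on the indent counts above, one possible next step is to collect each run of 7-space lines under the 3-space character name that precedes it. The following is a sketch under assumptions, not a tested parser for the whole file: the excerpt is hand-built in the already-split form, and `dialogue_by_character`, `CHARACTER_INDENT`, and `LINE_INDENT` are names introduced here for illustration.

```python
import re
from collections import defaultdict

CHARACTER_INDENT = 3  # indent widths taken from block_types above
LINE_INDENT = 7

# Hand-built excerpt in the form produced by the 8-space substitution.
script = "\n".join([
    "    INT. THE AUTOBODY SHOP. DAY.",
    "   COSTELLO (V.O.)",
    "       Years ago, we had the Church. That",
    "       was only a way of saying we had",
    "       each other.",
    "    COLIN nods.",
    "   YOUNG COLIN",
    "       Yeah.",
])

def dialogue_by_character(script):
    """Map each character name to a list of speeches (joined dialogue lines)."""
    speeches = defaultdict(list)
    speaker, buffer = None, []
    # The trailing empty line forces the last buffered speech to be stored.
    for line in script.split('\n') + ['']:
        indent = len(line) - len(line.lstrip(' '))
        if indent == LINE_INDENT and speaker is not None:
            buffer.append(line.strip())
            continue
        # The indent changed: store whatever speech was being collected.
        if speaker is not None and buffer:
            speeches[speaker].append(' '.join(buffer))
        buffer = []
        if indent == CHARACTER_INDENT:
            # Drop parentheticals such as (V.O.) or (CONT'D) from the name.
            speaker = re.sub(r'\s*\(.*?\)', '', line).strip() or None
        else:
            speaker = None
    return dict(speeches)

result = dialogue_by_character(script)
for name, lines in result.items():
    print(name, lines)
```

This does not handle multi-line parentheticals that sit between a name and its speech (such as the "(leaning over cluttered counter)" direction in the sample), and it inherits every misclassification of the indent heuristic, so expect it to need adjustment on the full script.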

3 Comments

Peter Gibson: I used your advice and tried to parse out where the character occurs (4)... I want to grab all of the lines below it until the indent changes from (7). I have been playing with some code. Any ideas?
You could .append() them to a list, then do something with the list when the indent changes. How do you want it to end up?
stackoverflow.com/questions/47233906/python-parsing-with-regex is the follow-up question I posted with my attempt, which is by no means dynamically correct.
