I am trying to extract tagged entities from a CSV file using Python. The file contains tagged entities in several of its columns, but I only want Python to process one specific column. Can anybody show me how to do this?

This is my code:

from bs4 import BeautifulSoup
import csv

input_name =  "file.csv"      # File names for input and output
output_name = "entities.csv"

def incrementEntity(entity_string, dictionary):

    try:
        dictionary[entity_string] += 1
    except KeyError:
        dictionary[entity_string] = 1

def outputResults(dictionary, entity_type, f):

    for i in sorted(dictionary, key=dictionary.get, reverse=True):
        print i, '\t', entity_type, '\t', dictionary[i]
        f.writerow([i, entity_type, dictionary[i]])

try:
    f = open(input_name, 'r')
    soup = BeautifulSoup(f)
    f.close()
except IOError, message:
    print message
    raise ValueError("Input file could not be opened")

locations = {}  
people    = {}  
orgs      = {}

for i in soup.find_all():
    entity_name = i.get_text()
    entity_type = i.name

    if (entity_type == 'i-loc' or entity_type == 'b-loc'):
        incrementEntity(entity_name, locations)
    elif (entity_type == 'b-org' or entity_type == 'i-org'):
        incrementEntity(entity_name, orgs)
    elif (entity_type == 'b-per' or entity_type == 'i-per'):
        incrementEntity(entity_name, people)
    else:
        continue

output_file = open(output_name, 'w')
f = csv.writer(output_file)
print "Entity\t\tType\t\tCount"
print "------\t\t----\t\t-----"
f.writerow(["Entity", "Type", "Count"])

outputResults(locations, 'location', f)
outputResults(people, 'person', f)
outputResults(orgs, 'organization', f)

output_file.close()
  • If you provide a (brief) sample of the input data and expected output, it would help. Commented Jul 28, 2014 at 12:16
  • This was already answered in [this Stack Overflow question][1]. [1]: stackoverflow.com/questions/15035660/… Commented Jul 28, 2014 at 12:35

1 Answer

By definition, a CSV is a file in which fields are separated by commas. So, as long as no field itself contains a comma, all you have to do is call the .split() method on each line and pick the field you want by index. Example:

csvline = 'Joe,25,M'
age = csvline.split(',')[1]
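
If a field is quoted and contains a comma of its own, which seems plausible for a column of tagged text, plain splitting will cut it in two. A minimal sketch of the difference, using the standard csv module (Python 3 syntax; the sample line is made up for illustration and is not your data):

```python
import csv

# Hypothetical line: the second field is quoted and contains a comma.
csvline = 'Joe,"<b-loc>Paris</b-loc>, <b-per>Ann</b-per>",M'

# Naive splitting cuts the quoted field in two.
naive_fields = csvline.split(',')

# csv.reader accepts any iterable of lines and honours the quotes.
parsed_fields = next(csv.reader([csvline]))

print(len(naive_fields))   # 4
print(parsed_fields[1])    # <b-loc>Paris</b-loc>, <b-per>Ann</b-per>
```

If your file has no quoted fields, .split(',') and csv.reader should give the same result.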

I don't know exactly what kind of data you are trying to process, but since you are using BeautifulSoup I will assume that your CSV file contains HTML-like markup in some of its columns and that you want to join the contents of one of those columns across all rows and process the result with BeautifulSoup. In that case you could try something like:

# Keep only the second column (index 1) of every line, then parse the joined text
f = open(input_name, 'r')
htmlstring = '\n'.join([line.split(',')[1] for line in f])
soup = BeautifulSoup(htmlstring)
f.close()
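
The same idea can be sketched with the csv module, which copes with quoted fields. This is only a sketch under assumptions: it uses Python 3 syntax, column_text is a helper name I made up, index 1 is a guess at the column you want, and it treats every row as data (no header row is skipped).

```python
import csv

def column_text(path, column_index):
    """Join one column of a CSV file into a single newline-separated string."""
    pieces = []
    # newline='' is what the csv documentation recommends for files read by csv.reader.
    with open(path, newline='') as csv_file:
        for row in csv.reader(csv_file):
            # Skip empty or short rows that lack the requested column.
            if len(row) > column_index:
                pieces.append(row[column_index])
    return '\n'.join(pieces)
```

You could then build the soup from that string, for example soup = BeautifulSoup(column_text(input_name, 1), 'html.parser'), and leave the rest of your counting code unchanged.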