Extract tags from one column in CSV using Python [duplicate]

Question

I am trying to extract tagged entities from a CSV file using Python. The file contains tagged entities in multiple columns, but I only want Python to process one specific column. Can anybody show me how to do this?

This is my code:

from bs4 import BeautifulSoup
import csv

input_name =  "file.csv"      # File names for input and output
output_name = "entities.csv"

def incrementEntity(entity_string, dictionary):

    try:
        dictionary[entity_string] += 1
    except KeyError:
        dictionary[entity_string] = 1

def outputResults(dictionary, entity_type, f):

    for i in sorted(dictionary, key=dictionary.get, reverse=True):
        print i, '\t', entity_type, '\t', dictionary[i]
        f.writerow([i, entity_type, dictionary[i]])

try:
    f = open(input_name, 'r')
    soup = BeautifulSoup(f)
    f.close()
except IOError, message:
    print message
    raise ValueError("Input file could not be opened")

locations = {}  
people    = {}  
orgs      = {}

for i in soup.find_all():
    entity_name = i.get_text()
    entity_type = i.name

    if (entity_type == 'i-loc' or entity_type == 'b-loc'):
        incrementEntity(entity_name, locations)
    elif (entity_type == 'b-org' or entity_type == 'i-org'):
        incrementEntity(entity_name, orgs)
    elif (entity_type == 'b-per' or entity_type == 'i-per'):
        incrementEntity(entity_name, people)
    else:
        continue

output_file = open(output_name, 'w')
f = csv.writer(output_file)
print "Entity\t\tType\t\tCount"
print "------\t\t----\t\t-----"
f.writerow(["Entity", "Type", "Count"])

outputResults(locations, 'location', f)
outputResults(people, 'person', f)
outputResults(orgs, 'organization', f)

output_file.close()

If you provide a (brief) sample of the input data and expected output, it would help.
– wwii, Commented Jul 28, 2014 at 12:16
This was already answered in this Stack Overflow question: stackoverflow.com/questions/15035660/…
– Matheus Portela, Commented Jul 28, 2014 at 12:35

Gustavo Bezerra · Accepted Answer · 2014-07-28 12:44:11Z

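In its simplest form, a CSV file stores one record per line with the fields separated by commas, so you can pick out a field with the string's .split() method. Note that this breaks as soon as a field contains a quoted comma; Python's csv module handles that case. Example: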

csvline = 'Joe,25,M'
age = csvline.split(',')[1]
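
If some fields may contain quoted commas, the same lookup can be done with the standard csv module instead of .split(). This is only a minimal sketch, written for Python 3; the helper name column_values and the sample data are made up for illustration and are not part of the question's file:

```python
import csv
import io

def column_values(csv_text, index):
    """Return the value at position `index` from every row of csv_text."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip rows that are too short to have the requested column
    return [row[index] for row in reader if len(row) > index]

# Hypothetical sample: the second row has a comma inside a quoted field
sample = 'Joe,25,M\n"Smith, Ann",31,F\n'

names = column_values(sample, 0)
ages = column_values(sample, 1)
print(names)
print(ages)
```

A plain split(',') would cut "Smith, Ann" into two fields and shift every column after it, which is why csv.reader is the safer choice when the column holds free text or markup.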

I don't know exactly what kind of data you are trying to process, but since you are using BeautifulSoup, I will assume that your CSV file contains HTML-like data in some of its columns and that you want to join the values of one of those columns, across all rows, and process the result with BeautifulSoup. In that case, you could try something like:

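with open(input_name, 'r') as f:
    # Keep only the second column (index 1) of every line and join the pieces
    htmlstring = '\n'.join([line.split(',')[1] for line in f])
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(htmlstring, 'html.parser')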
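
For reference, the whole idea (pick one column, join it, count the tagged entities) can be sketched end to end. This is an illustration under several assumptions rather than a drop-in replacement: it targets Python 3, it swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages, and the names EntityCounter and count_entities as well as the sample rows are hypothetical:

```python
import csv
import io
from collections import Counter
from html.parser import HTMLParser

class EntityCounter(HTMLParser):
    """Count the text found inside each tag, grouped by tag name."""

    def __init__(self):
        super().__init__()
        self.counts = {}        # tag name -> Counter of entity strings
        self._open_tag = None   # tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        self._open_tag = tag

    def handle_data(self, data):
        text = data.strip()
        # Ignore text that sits outside any tag
        if self._open_tag is not None and text:
            self.counts.setdefault(self._open_tag, Counter())[text] += 1

    def handle_endtag(self, tag):
        self._open_tag = None

def count_entities(csv_file, column_index):
    """Parse only the markup stored in one column of an open CSV file."""
    reader = csv.reader(csv_file)
    markup = '\n'.join(
        row[column_index] for row in reader if len(row) > column_index
    )
    parser = EntityCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts

# Hypothetical sample: tags appear in columns 1 and 2, only column 1 is read
sample = io.StringIO(
    '1,"<b-per>Ann</b-per> met <b-org>Acme, Inc</b-org>",<b-loc>Rome</b-loc>\n'
    '2,<b-loc>Paris</b-loc> and <b-per>Ann</b-per>,ignored\n'
)
counts = count_entities(sample, 1)
for tag, counter in sorted(counts.items()):
    print(tag, dict(counter))
```

Unlike the question's code, this sketch keeps one counter per tag name and does not merge the b- and i- variants into locations, people and organizations; that grouping could be added with a small mapping from tag name to category. It also assumes the tags are not nested.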

