I am working on a script that parses a text file and normalizes it enough to insert it into a DB. The data represents articles written by one or more authors. Because the number of authors is not fixed, I get a variable number of columns in my output text file, e.g.

author1, author2, author3, this is the title of the article
author1, author2, this is the title of the article
author1, author2, author3, author4, this is the title of the article

These results give me a maximum of 5 columns. So, for the first 2 articles I will need to add blank columns so that every row has the same number of columns. What would be the best way to do this? My input text is tab delimited, and I can iterate through the fields fairly easily by splitting on the tab.

3
  • Is it safe to assume that the article title is always the last item of the list? Also, what approach have you tried? Commented May 19, 2012 at 2:54
  • I have it working with the variable column count but this won't do. I need to have a set number of columns. I've built lists and tried adding to them but I get stuck with adding the blank items in the list. Commented May 19, 2012 at 2:54
  • This is where I stand...pastebin.com/A2CT97s9 Commented May 19, 2012 at 2:56

2 Answers

2

Assuming you already have the max number of columns and have already split each line into a list (which I'm going to assume you put into a list of their own), you should be able to just use list.insert(-1, item), which inserts before the last element (the title), to add empty columns:

def columnize(mylists, maxcolumns):
    # Pad each row in place until it has maxcolumns items.
    for row in mylists:
        while len(row) < maxcolumns:
            # insert(-1, ...) puts the filler just before the last item (the title)
            row.insert(-1, None)

mylists = [["author1", "author2", "author3", "this is the title of the article"],
           ["author1", "author2", "this is the title of the article"],
           ["author1", "author2", "author3", "author4", "this is the title of the article"]]

columnize(mylists, 5)
print(mylists)

[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]

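Alternative version that leaves your original lists untouched and returns new ones, using a list comprehension:

def columnize(mylists, maxcolumns):
    # authors + padding + title; the slice j[-1:] keeps the title as a one-item list
    return [j[:-1] + [None] * (maxcolumns - len(j)) + j[-1:] for j in mylists]

print(columnize(mylists, 5))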

[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]
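
If you don't have the max column count yet, it can be derived from the split rows themselves. The following is only a sketch in Python 3, assuming the input is tab delimited and the title is always the last field; `pad_rows`, `filler`, and the sample lines are names and data made up for illustration:

```python
def pad_rows(lines, filler=""):
    """Split tab-delimited lines and pad each row to the widest row's length.

    Assumption: the title is always the last field on a line.
    """
    # Skip blank lines, drop the line ending, then split on tabs.
    rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
    # The widest row determines the target column count.
    width = max((len(row) for row in rows), default=0)
    # Build new rows: authors, then filler values, then the title.
    return [row[:-1] + [filler] * (width - len(row)) + row[-1:] for row in rows]


sample = [
    "author1\tauthor2\tauthor3\ttitle A\n",
    "author1\tauthor2\ttitle B\n",
    "author1\tauthor2\tauthor3\tauthor4\ttitle C\n",
]
padded = pad_rows(sample)
for row in padded:
    print(row)
```

An empty string is used as the filler here because it writes cleanly to a text file; pass `filler=None` if you would rather map the gaps to NULL when inserting into the DB.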

1

Forgive me if I've misunderstood, but it sounds like you're approaching the problem in a difficult way. It's quite easy to convert your text file into a dictionary that maps each title to a list of its authors:

>>> lines = ["auth1, auth2, auth3, article1", "auth1, auth2, article2", "auth1, article3"]
>>> d = dict((x[-1], x[:-1]) for x in [line.split(', ') for line in lines])
>>> d
{'article1': ['auth1', 'auth2', 'auth3'], 'article2': ['auth1', 'auth2'], 'article3': ['auth1']}
>>> total_articles = len(d)
>>> total_articles
3
>>> max_authors = max(len(val) for val in d.values())
>>> max_authors
3
>>> for k, v in d.items():
...     print(k)
...     print(v + [None] * (max_authors - len(v)))
... 
article1
['auth1', 'auth2', 'auth3']
article2
['auth1', 'auth2', None]
article3
['auth1', None, None]

Then, if you really want to, you can output this data using the csv module that's built into Python. Or you could directly output the SQL that you're going to need.
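
As a rough sketch of the csv route (Python 3): the `articles` dictionary below is sample data in the same shape as `d`, and the `io.StringIO` buffer is only a stand-in so the example runs on its own. With a real file you would pass a handle opened with `newline=""` to `csv.writer` instead.

```python
import csv
import io

# Sample data in the same shape as the dictionary built above.
articles = {
    "article1": ["auth1", "auth2", "auth3"],
    "article2": ["auth1", "auth2"],
    "article3": ["auth1"],
}
max_authors = max(len(authors) for authors in articles.values())

buffer = io.StringIO()  # stand-in for a file opened with newline=""
writer = csv.writer(buffer)
for title, authors in articles.items():
    # Pad with empty strings so every row has max_authors + 1 columns.
    padding = [""] * (max_authors - len(authors))
    writer.writerow(authors + padding + [title])

csv_text = buffer.getvalue()
print(csv_text)
```

The csv module also takes care of quoting, which matters if a title itself contains a comma.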

You are opening and reading the same file many times just to get counts that you can derive from the data already in memory. Please read the file once and compute those counts from the parsed rows.
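
Something along these lines would read the file a single time and take the counts from memory. It is a sketch under assumptions (Python 3, tab-delimited input, title in the last field); `load_articles` is a hypothetical helper, and the temporary file only exists so the example is runnable:

```python
import os
import tempfile


def load_articles(path):
    """Read the tab-delimited file once and return (authors, title) pairs."""
    articles = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank lines and lines without any author
            articles.append((fields[:-1], fields[-1]))
    return articles


# Write a small demo file; in practice this would be your real input file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("auth1\tauth2\tauth3\tarticle1\n")
    tmp.write("auth1\tauth2\tarticle2\n")
    tmp.write("\n")
    tmp.write("auth1\tarticle3\n")
    demo_path = tmp.name

articles = load_articles(demo_path)
os.remove(demo_path)

# Both counts come from the parsed rows, not from re-reading the file.
total_articles = len(articles)
max_authors = max(len(authors) for authors, _ in articles)
print(total_articles, max_authors)
```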
