Python Sorting by date internally based on 1st attribute

Question

I have a tab-separated data set with 8 attributes, already sorted by the first attribute. It has the following format (just an example):

AX  0123  December 20, 2010  1  2  8.0  hello this
AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0214  September 10, 2009  2  3  9.0 this king
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
MG  980   April 20, 2007  2  4  7.1  not available
MG  246   May 8, 2005  5  1  2.1  make goat

The file is already sorted by the first attribute. Now I need to sort the rows by date within each group that shares the same first attribute, so the output should look like the sample below. I don't want to use a database. The file is huge (2 GB), so I think special Python code might be required (I am not sure whether simple code can do this).

AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  0123  December 20, 2010  1  2  8.0  hello this
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
BX  0214  September 10, 2009  2  3  9.0 this king
MG  246   May 8, 2005  5  1  2.1  make goat
MG  980   April 20, 2007  2  4  7.1  not available

Any replies are greatly appreciated. Let me know if you have any other questions.

It seems the data I wrote got messed up here. Assume a line break in front of the attributes AX, BX, MG.
– jaysun, Commented Apr 21, 2012 at 4:34
Please use the code tag. Select the part where the format is supposed to begin and end, then click the code button above your editor.
– CppLearner, Commented Apr 21, 2012 at 4:35
No problem. I've just edited it for you. It will appear soon.
– CppLearner, Commented Apr 21, 2012 at 4:36
Now, what have you tried so far? You need to turn this into a huge set. In my honest opinion, with 2 GB worth of data, you are still better off with a local database.
– CppLearner, Commented Apr 21, 2012 at 4:38
I am a beginner in Python. I tried an approach based on lists but didn't succeed, so I am looking for an alternative and wondering whether dictionaries or some more advanced technique would work. You're right, a database makes life easier. I wanted to see if there is any other way to do it without one.
– jaysun, Commented Apr 21, 2012 at 4:46

joaquin · Accepted Answer · 2012-04-21 05:50:57Z

OK, quick and dirty. You can improve it:

from datetime import datetime as dt
from collections import defaultdict

# group the rows by their first field
dd = defaultdict(list)

with open('test.txt') as f:
    for line in f:
        fields = line.split('\t')
        dd[fields[0]].append(fields)

def mydate(fields):
    # the third field holds the date, e.g. "April 19, 2009"
    return dt.strptime(fields[2], "%B %d, %Y")

keys = sorted(dd.keys())

# sort each group by date and collect the groups in key order
my_list = []
for key in keys:
    dd[key].sort(key=mydate)
    my_list.extend(dd[key])

for item in my_list:
    print(item)

this produces:

['AX', '2313', 'April 19, 2009', '2', '3', '4.0', 'hi there\n']
['AX', '4532', 'December 19, 2010', '6', '2', '8.0', 'nice tie\n']
['AX', '0123', 'December 20, 2010', '1', '2', '8.0', 'hello this\n']
['AX', '1244', 'January 10, 2011', '3', '4', '8.0', 'king tale\n']
['BX', '0114', 'February 9, 2003', '4', '9', '4.0', 'his brought\n']
['BX', '3214', 'September 1, 2006', '1', '3', '3.0 is great\n']
['BX', '0214', 'September 10, 2009', '2', '3', '9.0 this king\n']
['MG', '246', 'May 8, 2005', '5', '1', '2.1', 'make goat']
['MG', '980', 'April 20, 2007', '2', '4', '7.1', 'not available\n']

Then you only need to join each list back into a tab-separated line with str.join():

text_lines = []
for item in my_list:
    # the file's last line may lack a trailing newline, so normalise it
    text_lines.append('\t'.join(item).rstrip('\n'))

full_text = '\n'.join(text_lines) + '\n'
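
The mydate key matters because the dates are stored as text, and sorting them as plain strings would order them by month name rather than by time. A minimal sketch of the difference, using dates from the question (the variable names here are only for illustration):

```python
from datetime import datetime

dates = ["December 19, 2010", "April 19, 2009",
         "January 10, 2011", "September 1, 2006"]

# Plain string sorting is alphabetical, so "April" sorts before "September"
# regardless of the year.
alphabetical = sorted(dates)

# "%B %d, %Y" means: full month name, day of month, four-digit year.
# Note that %B depends on the locale; English month names are assumed here.
chronological = sorted(dates, key=lambda s: datetime.strptime(s, "%B %d, %Y"))

print(alphabetical)
print(chronological)
```

Only the second list is in true date order, which is why the rows are sorted with a parsed datetime as the key.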

answered Apr 21, 2012 at 5:50


jaysun Over a year ago

Thanks Joaquin. The code is correct, but it only works for small sets. In my case there are millions of rows, and appending them all to a list can exhaust memory or crash the program, which is what happened to me. I got a "Killed" error (the OS may have done it) after trying this code. Anyway, thanks again for your help, and keep up the good work.
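
One way to reduce the memory problem described in this comment, sketched here as a possible variation and not as part of the original answer: because the file is already sorted by its first attribute, each group can be read, sorted by date, and written out before the next group is loaded. The sketch below uses itertools.groupby for this. The function names and the in-memory sample are assumptions for illustration, and it only helps if every single group fits in memory.

```python
import io
from datetime import datetime
from itertools import groupby


def first_field(line):
    """Grouping key: the first tab-separated field, e.g. 'AX'."""
    return line.split("\t", 1)[0]


def parse_date(line):
    """Sort key: the third tab-separated field, e.g. 'April 19, 2009'."""
    return datetime.strptime(line.split("\t")[2], "%B %d, %Y")


def sort_groups_by_date(src, dst):
    """Copy lines from src to dst, sorting each run of equal first fields by date.

    Assumes src is already ordered by its first field, so only one group
    is held in memory at a time.
    """
    # Drop blank lines and trailing newlines (the last line may have none).
    stripped = (line.rstrip("\n") for line in src if line.strip())
    for _, group in groupby(stripped, key=first_field):
        for line in sorted(group, key=parse_date):
            dst.write(line + "\n")


# Small in-memory demonstration using rows from the question.
sample = (
    "AX\t0123\tDecember 20, 2010\t1\t2\t8.0\thello this\n"
    "AX\t2313\tApril 19, 2009\t2\t3\t4.0\thi there\n"
    "MG\t980\tApril 20, 2007\t2\t4\t7.1\tnot available\n"
    "MG\t246\tMay 8, 2005\t5\t1\t2.1\tmake goat"
)
out = io.StringIO()
sort_groups_by_date(io.StringIO(sample), out)
result = out.getvalue()
print(result, end="")
```

For a real file, the same function could be called with two open file objects, for example the input opened for reading and a separate output file opened for writing. If one group is itself too large for memory, an external sort or a database would still be needed.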

bmu · Accepted Answer · 2012-04-26 17:11:18Z

pandas is a Python library designed for analysing data sets with different data types.

If your data is in data.txt, you can read it with pandas.read_csv() and then sort the resulting DataFrame.

>>> import datetime
>>> import pandas as pd

>>> def date_converter(date_string):
...     return datetime.datetime.strptime(date_string, '%B %d, %Y').date()
...
>>> df = pd.read_csv('data.txt', sep='\t', header=None,
...                  converters={2:date_converter})
>>> print df
  X.1   X.2         X.3  X.4  X.5  X.6            X.7
0  AX   123  2010-12-20    1    2  8.0     hello this
1  AX  2313  2009-04-19    2    3  4.0       hi there
2  AX  4532  2010-12-19    6    2  8.0       nice tie
3  AX  1244  2011-01-10    3    4  8.0      king tale
4  BX   214  2009-09-10    2    3  9.0      this king
5  BX   114  2003-02-09    4    9  4.0    his brought
6  BX  3214  2006-09-01    1    3  3.0       is great
7  MG   980  2007-04-20    2    4  7.1  not available
8  MG   246  2005-05-08    5    1  2.1      make goat

>>> df = df.set_index(['X.1', 'X.3'])  # using a hierarchical index
>>> df = df.sort_index()
>>> print df
                 X.2  X.4  X.5  X.6            X.7
X.1 X.3                                           
AX  2009-04-19  2313    2    3  4.0       hi there
    2010-12-19  4532    6    2  8.0       nice tie
    2010-12-20   123    1    2  8.0     hello this
    2011-01-10  1244    3    4  8.0      king tale
BX  2003-02-09   114    4    9  4.0    his brought
    2006-09-01  3214    1    3  3.0       is great
    2009-09-10   214    2    3  9.0      this king
MG  2005-05-08   246    5    1  2.1      make goat
    2007-04-20   980    2    4  7.1  not available

As it is NumPy-based, it should be a good choice for large data sets, provided the data fits in memory.
