2

I have a data set with 8 attributes (which is sorted according to the first attribute) and is of the following format (Just an example, it is tab separated)

AX  0123  December 20, 2010  1  2  8.0  hello this
AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0214  September 10, 2009  2  3  9.0 this king
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
MG  980   April 20, 2007  2  4  7.1  not available
MG  246   May 8, 2005  5  1  2.1  make goat  

Now, that the file is sorted according first attribute, now i need to sort internally according to date based on first attribute, the output should be like this (I don't want to use the database, this is a huge file (2 GB) so I think a special python code might be required (Not sure if one can do this with a simple code)

AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  0123  December 20, 2010  1  2  8.0  hello this
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
BX  0214  September 10, 2009  2  3  9.0 this king
MG  246   May 8, 2005  5  1  2.1  make goat
MG  980   April 20, 2007  2  4  7.1  not available

Any replies are greatly appreciated. Let me know if you have any other questions.

7
  • Seems like The data which i wrote got messed up here. Assume a next line in front of attributes AX, BX, MG Commented Apr 21, 2012 at 4:34
  • Please use code tag. Select the part where the format is supposed to begin and end, click on code button above your editor. Commented Apr 21, 2012 at 4:35
  • No problem. I've just edited it for you. It will appear soon. Commented Apr 21, 2012 at 4:36
  • 1
    Now, what have you tried so far? You need to turn this into a huge set. In my honest opinion, with 2GB worth of data, you are still better off with a local database. Commented Apr 21, 2012 at 4:38
  • I am a beginner to python. I tried it based on lists but didn't succeed looking for an alternate solution, wondering if dictionaries or some sort of advanced techniques will work. You're right database makes life easier. I wanted to see if there is any other way that it can be done without using one. Commented Apr 21, 2012 at 4:46

2 Answers 2

1

OK, quick and dirty. You can improve it:

from datetime import datetime as dt
from collections import defaultdict

dd = defaultdict(list)

with open('test.txt') as f:
    for line in f:
        lines = line.split('\t')
        dd[lines[0]].append(lines)

def mydate(line):
    return dt.strptime(line[2], "%B %d, %Y")

keys = sorted(dd.keys())

my_list = []
for key in keys:
    dd[key].sort(key=mydate)
    my_list.extend(dd[key])

for item in my_list:
    print item

this produces:

['AX', '2313', 'April 19, 2009', '2', '3', '4.0', 'hi there\n']
['AX', '4532', 'December 19, 2010', '6', '2', '8.0', 'nice tie\n']
['AX', '0123', 'December 20, 2010', '1', '2', '8.0', 'hello this\n']
['AX', '1244', 'January 10, 2011', '3', '4', '8.0', 'king tale\n']
['BX', '0114', 'February 9, 2003', '4', '9', '4.0', 'his brought\n']
['BX', '3214', 'September 1, 2006', '1', '3', '3.0 is great\n']
['BX', '0214', 'September 10, 2009', '2', '3', '9.0 this king\n']
['MG', '246', 'May 8, 2005', '5', '1', '2.1', 'make goat']
['MG', '980', 'April 20, 2007', '2', '4', '7.1', 'not available\n']

then you only need to string.join() all the lists

text_lines = []
for item in my_list:
    text_lines.append('\t'.join(item))

full_text = ''.join(text_lines)
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks Joaquin, The code is correct but works for small sets. In my case, there are millions of rows (appending them to the list, might crash the memory or the program) which is what happened to me. I got a Killed error (may be OS might have did it) after trying this code. Anyways thanks again for your help, keep up the good work
0

pandas is a python library designed for analysing data sets with different datatypes.

If your data is in data.txt, you can read it with pandas.read_csv() and than sort the resulting DataFrame.

>>> import datetime
>>> import pandas as pd

>>> def date_converter(date_string):
...     return datetime.datetime.strptime(datestring, '%B %d, %Y').date()
>>> df = pd.read_csv('data.txt', sep='\t', header=None,
...                  converters={2:date_converter})
>>> print df
  X.1   X.2         X.3  X.4  X.5  X.6            X.7
0  AX   123  2010-12-20    1    2  8.0     hello this
1  AX  2313  2009-04-19    2    3  4.0       hi there
2  AX  4532  2010-12-19    6    2  8.0       nice tie
3  AX  1244  2011-01-10    3    4  8.0      king tale
4  BX   214  2009-09-10    2    3  9.0      this king
5  BX   114  2003-02-09    4    9  4.0    his brought
6  BX  3214  2006-09-01    1    3  3.0       is great
7  MG   980  2007-04-20    2    4  7.1  not available
8  MG   246  2005-05-08    5    1  2.1      make goat

>>> df = df.set_index(['X.1', 'X.3'])  # using a hierarchical index
>>> df = df.sort_index()
>>> print df
                 X.2  X.4  X.5  X.6            X.7
X.1 X.3                                           
AX  2009-04-19  2313    2    3  4.0       hi there
    2010-12-19  4532    6    2  8.0       nice tie
    2010-12-20   123    1    2  8.0     hello this
    2011-01-10  1244    3    4  8.0      king tale
BX  2003-02-09   114    4    9  4.0    his brought
    2006-09-01  3214    1    3  3.0       is great
    2009-09-10   214    2    3  9.0      this king
MG  2005-05-08   246    5    1  2.1      make goat
    2007-04-20   980    2    4  7.1  not available

As it is numpy based, it should be the right choice for large data sets.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.