Handling empty space in reading table from text file python

Question

I need to parse the file at this link: http://bit.ly/1x6yzoX

I wrote the following method to parse this file, but it cannot read the incomplete data for the latest year (2014), where the table in the text file has empty spaces instead of values. For now I am skipping the lines I am unable to read.

How can I handle this problem?

import collections
import csv
from itertools import islice

LINES_TO_IGNORE = 7

def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        reader = csv.reader(f, delimiter="\t")
        # Skip the preamble lines above the table
        data = islice(reader, LINES_TO_IGNORE, None)
        # Get file headers; stop if the file has no header row
        headers = next(data, None)
        if not headers:
            return result_dict
        headers = headers[0].split()
        keys = headers[1:]

        for row in data:
            values = row[0].split() if row else []
            if len(headers) != len(values):
                # Rows with missing values (e.g. 2014) end up here
                print("Skipping")
                continue
            year = parse_to_int(values[0])
            data_list = [parse_to_float(x) for x in values[1:]]
            # Each line becomes a dict (column_header->value)
            data_dict = collections.OrderedDict(zip(keys, data_list))
            # result_dict is dict of dict (year->data_dict)
            result_dict[year] = data_dict
    return result_dict

Similar questions: stackoverflow.com/questions/848537/… and stackoverflow.com/questions/10686657/…
– user2314737, Commented Nov 5, 2014 at 9:06

John Zwinck · Accepted Answer · 2014-11-05 07:59:32Z

1

You can do it easily with Pandas:

import pandas as pd
data = pd.read_fwf('UK.txt', skiprows=7, delimiter=' ')

Print the last few rows with print(data[-3:]):

    Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT  \
102  2012    1.8    1.2    3.4    2.5    6.0    8.8...
103  2013    1.0   -0.1   -0.7    2.2    5.2    8.6...
104  2014    2.1    2.5    2.9    5.3    7.3    9.9...

     NOV    DEC     WIN    SPR    SUM    AUT   ANN  Unnamed: 3  Unnamed: 4  \
102  2.8    1.1    1.73   4.00  10.19   5.23  5.21         NaN         NaN
103  2.4    2.8    0.68   2.26  10.66   6.56  5.21         NaN         NaN
104                       2.48   5.17  10.46   NaN         NaN         NaN

     Unnamed: 5  Unnamed: 6  Unnamed: 7
102         NaN         NaN         NaN
103         NaN         NaN         NaN
104         NaN         NaN         NaN

I think this is not quite 100% right yet, but it's close, and hopefully you can take it the rest of the way. There is no need to write so much code by hand if you use Pandas.

edited Nov 5, 2014 at 7:59

answered Nov 5, 2014 at 7:53

John Zwinck


1 Comment

Sreedhar Over a year ago

Are you getting this output? Please explain in full how to get it working.

user2314737 · Accepted Answer · 2014-11-05 09:02:40Z

0

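You can use the genfromtxt function from numpy:

import numpy as np
# skip_header=8 skips the preamble and the header line; delimiter gives the width of each column
data = np.genfromtxt('UK.txt', skip_header=8, delimiter=(4,7,7,7,7,7,7,7,7,7,7,7,7,8,7,7,7,8))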

This will automatically fill the missing values, but you still need to find a way of identifying the sizes of the columns and the number of lines to skip.
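
For the number of lines to skip, one option is to scan for the header line instead of hard-coding the count. The following is only a sketch: `count_lines_to_skip` is a hypothetical helper, the sample lines are made up, and it assumes the header is the first line whose first field is `Year`.

```python
def count_lines_to_skip(lines, first_column="Year"):
    """Return the index of the header line, i.e. how many lines precede it."""
    for index, line in enumerate(lines):
        fields = line.split()
        if fields and fields[0] == first_column:
            return index
    raise ValueError("no header line starting with %r found" % first_column)


# Made-up stand-in for the top of a data file (not the real UK.txt contents).
sample_lines = [
    "Descriptive preamble line",
    "Another preamble line",
    "",
    "Year    JAN    FEB",
    "1910   -1.2    0.3",
]

print(count_lines_to_skip(sample_lines))  # 3
```

Since genfromtxt is given no column names here, the header line itself also has to be skipped, so the value to pass would be the returned index plus one.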

Here is how to get the column sizes from the header:

import re
header="Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC     WIN    SPR    SUM    AUT     ANN"
# Each match is a column name plus the spaces before it, so its length is the column width
cols=re.findall(r"\s*\S+",header)
delimiter=tuple(len(c) for c in cols)
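
To show what those widths mean, here is a minimal pure-Python sketch that slices a line by the same widths, which is roughly what a fixed-width reader does. The helper names `column_widths` and `split_fixed_width` are hypothetical, and the rows are made up in the same right-aligned layout, with a shorter header than the real file.

```python
import re


def column_widths(header):
    """Width of each column: the column name plus the spaces before it."""
    return [len(col) for col in re.findall(r"\s*\S+", header)]


def split_fixed_width(line, widths):
    """Slice a line into fields by width; blank fields become None."""
    fields = []
    start = 0
    for width in widths:
        text = line[start:start + width].strip()
        fields.append(text or None)
        start += width
    return fields


# Made-up rows, right-aligned in columns of width 4, 7, 7 and 8.
row_format = "{:>4}{:>7}{:>7}{:>8}"
header = row_format.format("Year", "JAN", "FEB", "WIN")
complete = row_format.format("2013", "1.0", "-0.1", "0.68")
partial = row_format.format("2014", "2.1", "", "2.48")

widths = column_widths(header)
print(widths)                               # [4, 7, 7, 8]
print(split_fixed_width(complete, widths))  # ['2013', '1.0', '-0.1', '0.68']
print(split_fixed_width(partial, widths))   # ['2014', '2.1', None, '2.48']
```

Because the fields are cut by position and not by whitespace, a blank column stays in its own slot instead of shifting the later values to the left.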

edited Nov 5, 2014 at 9:02

answered Nov 5, 2014 at 8:05

user2314737


3 Comments

Sreedhar Over a year ago

Can you please explain how the value of delimiter is decided in your code?

Sreedhar Over a year ago

By the way, it does handle the missing data from the file. I just need to know how to decide the delimiter.

user2314737 Over a year ago

I showed how to get the delimiters tuple from the headers line in the second part of the answer.

cfi · Accepted Answer · 2014-11-05 08:57:10Z

0

import collections
import re

def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        headers = []
        keys = []
        # Assumes the first line of the file is the header row
        for counter, line in enumerate(f, start=1):
            line = line.strip()
            if counter == 1:
                headers = re.findall(r'\w+', line)
                # Drop the leading Year column so keys line up with values[1:]
                keys = headers[1:]
            else:
                # A field is either a number or a run of 3-4 spaces (a missing value)
                values = re.findall(r'([\d\-\.]+|(?:\s){3,4})(?:(?:\s){3,4})?', line)
                year = parse_to_int(values[0])

                # Pad short rows with 'NaN' so every header has a value
                if len(headers) != len(values):
                    diff_list = ['NaN' for i in range(len(headers) - len(values))]
                    values.extend(diff_list)
                data_list = [parse_to_float(x) for x in values[1:]]
                data_dict = collections.OrderedDict(zip(keys, data_list))
                result_dict[year] = data_dict

    return result_dict

edited Nov 5, 2014 at 8:57

cfi


answered Nov 5, 2014 at 8:44

longzhiwen888

1

1 Comment

cfi Over a year ago

Welcome to SO! Please indent all code to highlight/format it accordingly. Answers are more likely to receive upvotes if you provide an explanation.
