
I need to parse the file at this link: http://bit.ly/1x6yzoX

I wrote the following method to parse it, but it cannot read the incomplete data for the latest year (2014), where some cells in the text file's table are blank. For now I am skipping the lines I cannot read.

How should I handle this problem?

import collections
import csv
from itertools import islice

LINES_TO_IGNORE = 7

# parse_to_int and parse_to_float are helper functions not shown in the question.
def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        reader = csv.reader(f, delimiter="\t")
        data = islice(reader, LINES_TO_IGNORE, None)
        # Get file headers; next(data, None) gives None if the file is too short
        headers = next(data, None)
        if headers is None:
            return result_dict
        headers = headers[0].split()
        keys = headers[1:]

        for row in data:
            values = row[0].split() if row else []
            if len(headers) != len(values):
                # Incomplete row (e.g. 2014): skipped for now
                print("Skipping")
                continue
            year = parse_to_int(values[0])
            data_list = [parse_to_float(x) for x in values[1:]]
            # Each line becomes a dict (column_header->value)
            data_dict = collections.OrderedDict(zip(keys, data_list))
            # result_dict is dict of dict (year->data_dict)
            result_dict[year] = data_dict
    return result_dict

3 Answers


You can do it easily with Pandas:

import pandas as pd
data = pd.read_fwf('UK.txt', skiprows=7, delimiter=' ')

Print the last few rows with print(data[-3:]):

    Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT  \
102  2012    1.8    1.2    3.4    2.5    6.0    8.8...
103  2013    1.0   -0.1   -0.7    2.2    5.2    8.6...
104  2014    2.1    2.5    2.9    5.3    7.3    9.9...

     NOV    DEC     WIN    SPR    SUM    AUT   ANN  Unnamed: 3  Unnamed: 4  \
102  2.8    1.1    1.73   4.00  10.19   5.23  5.21         NaN         NaN
103  2.4    2.8    0.68   2.26  10.66   6.56  5.21         NaN         NaN
104                       2.48   5.17  10.46   NaN         NaN         NaN

     Unnamed: 5  Unnamed: 6  Unnamed: 7
102         NaN         NaN         NaN
103         NaN         NaN         NaN
104         NaN         NaN         NaN

I don't think this is 100% right yet, but it's close; hopefully you can take it the rest of the way. There is no need to write so much code by hand if you use Pandas.


1 Comment

Are you actually getting this output? Please explain in full how to get it working.

You can use the genfromtxt function from numpy:

import numpy as np
data = np.genfromtxt('UK.txt', skip_header=8, delimiter=(4,7,7,7,7,7,7,7,7,7,7,7,7,8,7,7,7,8))

This automatically fills the missing values with nan, but you still need a way to determine the column widths and the number of lines to skip.
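For the number of lines to skip, one option is to search for the header line instead of hard-coding a count. The following is a minimal sketch, not part of the original answer: `find_header_index` is a hypothetical helper, the sample lines are made up for illustration, and it assumes the header is the first line whose first token is `Year`.

```python
def find_header_index(lines, first_column="Year"):
    """Return the index of the first line whose first token is first_column."""
    for index, line in enumerate(lines):
        tokens = line.split()
        if tokens and tokens[0] == first_column:
            return index
    raise ValueError("no header line starting with %r" % first_column)


# Made-up sample: two preamble lines, then the header, then one data row.
sample = [
    "Some descriptive preamble",
    "",
    "Year    JAN    FEB",
    "1910    1.0    2.0",
]

header_index = find_header_index(sample)
data_lines = sample[header_index + 1:]
```

With a real file, the list could come from `f.readlines()`, and `header_index + 1` would then be the number of lines to skip before the data starts.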

Here is how to get the column sizes from the header:

import re
header="Year    JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC     WIN    SPR    SUM    AUT     ANN"
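# Each match is a column name plus the whitespace before it,
# so its length is the width of that column
cols = re.findall(r"\s*\S+", header)
delimiter = tuple(len(c) for c in cols)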
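To see why widths taken from the header cope with blank cells, here is a small standard-library sketch that slices a row at those fixed offsets. It is an illustration only: `column_widths` and `split_fixed_width` are hypothetical names, and the shortened header and row are made up, not taken from the real file.

```python
import re


def column_widths(header):
    """Width of each column, counting the whitespace before each name."""
    return [len(col) for col in re.findall(r"\s*\S+", header)]


def split_fixed_width(line, widths):
    """Slice a line at fixed offsets; blank or missing fields become None."""
    fields = []
    start = 0
    for width in widths:
        text = line[start:start + width].strip()
        fields.append(text if text else None)
        start += width
    return fields


# Made-up header and row; the FEB cell is left blank on purpose.
header = "Year    JAN    FEB     WIN"
row = "2014" + "    2.1" + " " * 7 + "    2.48"

widths = column_widths(header)
fields = split_fixed_width(row, widths)
```

Because each field is cut by position instead of by whitespace, an empty cell stays in its own column and does not shift the values that follow it.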

3 Comments

Can you please explain how the value of delimiter is decided in your code?
By the way, it does handle the missing data in the file. I just need to know how to decide the delimiter.
I showed how to get the delimiters tuple from the headers line in the second part of the answer.

import collections
import re

# parse_to_int and parse_to_float are the helpers used in the question (not shown there).
def parse_file(data_file):
    result_dict = collections.OrderedDict()
    if not data_file:
        return result_dict

    with open(data_file) as f:
        headers = []
        for counter, line in enumerate(f, 1):
            line = line.strip()
            if counter == 1:
                # The first line holds the column names; the first one is the year
                headers = re.findall(r'\w+', line)
                keys = headers[1:]
            else:
                # Capture either a number or a run of 3-4 whitespace characters (a blank cell)
                values = re.findall(r'([\d\-\.]+|(?:\s){3,4})(?:(?:\s){3,4})?', line)
                year = parse_to_int(values[0])

                if len(headers) != len(values):
                    # Pad short rows so that every column gets a value
                    diff_list = ['NaN' for i in range(len(headers) - len(values))]
                    values.extend(diff_list)
                data_list = [parse_to_float(x) for x in values[1:]]
                data_dict = collections.OrderedDict(zip(keys, data_list))
                result_dict[year] = data_dict

    return result_dict
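Neither the question nor this answer shows `parse_to_int` and `parse_to_float`. Since this approach passes blank strings and the literal `'NaN'` to them, they need to tolerate both. The following is only a guess at what such helpers might look like, assuming that unparseable floats should become NaN and unparseable years should become `None`:

```python
import math


def parse_to_float(text):
    """Convert text to float; blank or invalid input becomes NaN."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return float("nan")


def parse_to_int(text):
    """Convert text to int; blank or invalid input becomes None."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


blank_value = parse_to_float("   ")
padded_value = parse_to_float("NaN")
year = parse_to_int("2014")
```

Values that come back as NaN can be detected later with `math.isnan`.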

1 Comment

Welcome to SO! Please indent all code to highlight/format it accordingly. Answers are more likely to receive upvotes if you provide an explanation.
