Read txt file for specific fields and store them in a numpy array

Question

I have a txt file (basically a log file) made up of blocks of text. Each block or paragraph holds information about one event. I need to extract only certain fields from each block and save them as an array or list.

Each paragraph has the following format:

id: [id] Name: [name] time: [timestamp] user: [username] ip: [ip_address of the user] processing_time: [processing time in seconds]

A sample paragraph can be:

id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05

What I need to extract from each block is:

 id:[]
 Name:[]
 processing_time: []

So that the resulting array for each block would be:

array = [id, name, processing_time]

One issue is that my text files are fairly large and contain thousands of these records. What is the best way to do this in Python (2.7 to be precise)? Once I have an array for each record, I will save all of them in a single N-D numpy array, and that is it. Any help will be greatly appreciated.

Here is what I am currently using to extract all the lines that start with the Name field:

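log = 'log_1.txt'

name_array = []

# Iterate over the file line by line instead of loading it all with readlines()
with open(log, 'r') as log_file:
    for a in log_file:
        if a.startswith('Name: '):
            # The result of join() is discarded here, so `a` keeps its
            # original whitespace and trailing line break
            ' '.join(a.split())
            name_array.append(a)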

But it simply extracts all the blocks and puts them into a single array, which is not useful, since I need the individual id, Name, etc. fields.

Can any of the values -- I'm looking in particular at Name: -- contain whitespace?
– DSM, Commented Mar 11, 2013 at 15:04
They do! Let me update my question with the snippet I am using to extract all the lines with Name parameters in them (although I am not able to remove the white spaces and line breaks yet).
– saifuddin778, Commented Mar 11, 2013 at 15:05

Bálint Aradi · Accepted Answer · 2013-03-11 15:45:11Z

1

If the Name field can contain whitespace, you could extract the data with a regular expression. However, you will then have to convert the values to the appropriate Python types yourself. The following program:

import re

# re.VERBOSE ignores whitespace in the pattern (except inside character
# classes), so the expression can be split over several lines.
PAT = re.compile(r"""id:\s*(?P<id>\d+)\s*
                     Name:\s*(?P<name>[0-9A-Za-z ]+?)\s+time:.*
                     processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)

values = []
with open("proba.txt", "r") as fp:
    for line in fp:
        match = PAT.match(line)
        if match:
            # Groups come back as strings; convert the numeric fields by hand
            values.append((int(match.group("id")),
                           match.group("name"),
                           int(match.group("ptime"))))
print(values)

would print the following result:

[(23455, 'y malsen', 5), (23455, 'ymalsen', 5)]

for a file "proba.txt" with the content

id: 23455 Name: y malsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05
id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05
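
Since the question asks for a numpy array in the end, the list of tuples can be converted afterwards. The following is only a minimal sketch: the field names (id, name, ptime) and the 100-character limit for names are illustrative assumptions, not part of the original answer.

```python
import numpy as np

# Tuples in the shape produced by the regex loop: (id, name, processing_time)
values = [(23455, "y malsen", 5), (23455, "ymalsen", 5)]

# Assumed layout: 32-bit integers for the numbers, up to 100 characters per name
record_dtype = np.dtype([("id", "i4"), ("name", "U100"), ("ptime", "i4")])

# A list of tuples maps directly onto a structured (record) array
records = np.array(values, dtype=record_dtype)

print(records["name"])         # one column, addressed by field name
print(records["ptime"].sum())  # numeric columns support the usual array operations
```

A structured array keeps the mixed types (integers and text) in one array while still allowing column access by name, which a plain 2-D array of a single dtype would not.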

answered Mar 11, 2013 at 15:45


4 Comments

saifuddin778 Over a year ago

Spot on, Bálint! Wonderful! Now, what if I need to fetch the time: and ip: fields too? What would the regex be? (This is the most confusing part for me: interpreting or guessing a regex.)

Bálint Aradi Over a year ago

Yes, you would have to extend the regular expression accordingly. IP would be something like ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+) and time like time:\s*(?P<time>\d+:\d+:\d+). You can consult the documentation of Python's re module for the fine details of regexps.

saifuddin778 Over a year ago

Should the regex part be like: PAT = re.compile(r"""id:\s*(?P<id>\d+)\s* name:\s*(?P<name>[0-9A-Za-z ]+?)\s+time:.* time:\s*(?P<time>\d:\d:\d) ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+) processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)? I am sorry for this, but I am not good at regular expressions, which is why I am bothering you a little. ;-)

Bálint Aradi Over a year ago

Including time and ip, it would look like this:

re.compile(r"""id:\s*(?P<id>\d+)\s* Name:\s*(?P<name>[0-9A-Za-z ]+?)\s* time:\s*(?P<time>\d+:\d+:\d+)\s* .* ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+)\s* processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)

But still, if you use regular expressions, I strongly suggest investing 1-2 hours in reading some documentation about them. See for example this howto.
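
To see the extended expression at work, here is a small self-contained sketch that applies it to one sample line from the question. The pattern is the one from the comment, only laid out one field per line; the variable names line and record are illustrative.

```python
import re

# re.VERBOSE ignores whitespace outside character classes, so each field
# can sit on its own line without changing what the pattern matches.
PAT = re.compile(r"""id:\s*(?P<id>\d+)\s*
                     Name:\s*(?P<name>[0-9A-Za-z ]+?)\s*
                     time:\s*(?P<time>\d+:\d+:\d+)\s*
                     .*
                     ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+)\s*
                     processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)

line = ("id: 23455 Name: y malsen time: 03:20:20 user: ymanlls "
        "ip: 230.33.45.32 processing_time: 05")

match = PAT.match(line)
# Named groups are strings; only id and ptime are converted to integers here
record = (int(match.group("id")), match.group("name"), match.group("time"),
          match.group("ip"), int(match.group("ptime")))
print(record)
```

The time and ip values are left as strings in this sketch; whether to parse them further depends on what the array is used for.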

Bálint Aradi · Accepted Answer · 2013-03-11 15:24:03Z

1

You could load your data using numpy's great loadtxt routine into a record array, and extract it from there:

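import numpy as np

# Columns are split on whitespace: 1 = id value, 3 = name value,
# 11 = processing_time value. "S100" is a byte string of up to 100
# characters ("a100" is an older alias for the same type).
aa = np.loadtxt("proba.txt", usecols=(1, 3, 11),
                dtype={"names": ("id", "name", "proctime"),
                       "formats": ("i4", "S100", "i4")})
print(aa["name"])
print(aa["id"])
print(aa["proctime"])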

The example loads your data from proba.txt and stores it in aa. The individual fields (aa["name"], aa["id"], aa["proctime"]) give you one array per column if you need them separately; otherwise, you already have everything in a single numpy array. The code above produces:

['ymalsen' 'ymalsen']
[23455 23455]
[5 5]

for the file proba.txt with the following content:

id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05
id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05

However, please note that this assumes that no whitespace appears within the field contents. Whitespace between the fields is fine, though.
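
The reason for this limitation can be illustrated without numpy. The sketch below uses plain str.split() as a stand-in for whitespace-based column splitting; it is not how loadtxt is implemented internally, only an approximation of the effect, and the variable names are illustrative.

```python
clean = ("id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls "
         "ip: 230.33.45.32 processing_time: 05")
spaced = ("id: 23455 Name: y malsen time: 03:20:20 user: ymanlls "
          "ip: 230.33.45.32 processing_time: 05")

# Splitting on whitespace yields one extra column when the name contains a space
clean_cols = clean.split()
spaced_cols = spaced.split()
print("%d vs %d columns" % (len(clean_cols), len(spaced_cols)))

# The fixed indices (1, 3, 11) only line up with the data for the clean line
print([clean_cols[i] for i in (1, 3, 11)])
print([spaced_cols[i] for i in (1, 3, 11)])
```

For the second line, index 3 holds only the first part of the name and index 11 holds the label processing_time: instead of a number, so fixed column positions no longer point at the intended values.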

edited Mar 11, 2013 at 15:24

answered Mar 11, 2013 at 15:04


4 Comments

saifuddin778 Over a year ago

The text file has white spaces, which is why I think it's raising an IndexError: list index out of range exception.

Bálint Aradi Over a year ago

Do you mean whitespace within the fields or between them? The main point is that numpy's routine assumes that the columns are separated by whitespace. If some of the data columns can contain whitespace themselves (for example, if you allow "y malsen" as a name), the above approach won't work, but otherwise it should. (The example you provided had only data columns without whitespace in them.)

saifuddin778 Over a year ago

yes, the thing is that some of the data values have spaces between them.. :-/

Bálint Aradi Over a year ago

OK, I made a separate answer for that case; see the other answer.
