I have a text file (basically a log file) made up of blocks of text. Each block, or paragraph, holds information about one event. I need to extract only certain fields from each block and save them as an array or list.

Each paragraph has the following format:

id: [id] Name: [name] time: [timestamp] user: [username] ip: [ip_address of the user] processing_time: [processing time in seconds]

A sample paragraph can be:

id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05

What I need to extract from each block is:

 id:[]
 Name:[]
 processing_time: []

So that my resulting array for each block's result would be:

array = [id, name, processing_time]

One issue is that my text files are fairly large and contain thousands of these records. What is the best way to do this in Python (2.7, to be precise)? Once I have an array for each record, I will save all of them in a single N-D NumPy array, and that is it. Any help will be greatly appreciated.

Here is what I am using to plainly extract all the lines starting with Name:

import string

log = 'log_1.txt'
file = open(log, 'r')


name_array = []


line = file.readlines()
for a in line:
    if a.startswith('Name: '):
        ' '.join(a.split())
        name_array.append(a)

But it simply extracts all the blocks and puts them into a single array, which is not very useful, since I need the id, name, etc. of each record separately.

  • Can any of the values -- I'm looking in particular at Name: -- contain whitespace? Commented Mar 11, 2013 at 15:04
  • They do! Let me update my question with the snippet I am using to extract all the lines with Name parameters in them (although I am not able to remove the whitespace and line breaks yet). Commented Mar 11, 2013 at 15:05

2 Answers

If the Name field can contain whitespace, you could extract the data with a regular expression. However, you will then have to convert the values to the appropriate Python types yourself. The following program:

import numpy as np
import re

PAT = re.compile(r"""id:\s*(?P<id>\d+)\s*
                     Name:\s*(?P<name>[0-9A-Za-z ]+?)\s+time:.*
                     processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)

values = []
fp = open("proba.txt", "r")
for line in fp:
    match = PAT.match(line)
    if match:
        values.append(( int(match.group("id")),
                        match.group("name"),
                        int(match.group("ptime"))))
fp.close()
print values

would print as result:

[(23455, 'y malsen', 5), (23455, 'ymalsen', 5)]

for a file "proba.txt" with the content

id: 23455 Name: y malsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05
id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05
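
Since the log files in question are large, the same pattern can also be applied lazily, one line at a time, through a generator. This is only a sketch under that assumption: the helper name `iter_records` is hypothetical, and the pattern is copied unchanged from the program above.

```python
import re

# Same pattern as in the program above.
PAT = re.compile(r"""id:\s*(?P<id>\d+)\s*
                     Name:\s*(?P<name>[0-9A-Za-z ]+?)\s+time:.*
                     processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)


def iter_records(lines):
    """Yield (id, name, processing_time) for every line that matches PAT.

    `lines` can be any iterable of strings, including an open file object.
    Lines that do not match are silently skipped.
    """
    for line in lines:
        match = PAT.match(line)
        if match:
            yield (int(match.group("id")),
                   match.group("name"),
                   int(match.group("ptime")))


# Illustrative input; the middle line shows that non-matching lines are skipped.
sample = [
    "id: 23455 Name: y malsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05\n",
    "this line does not match and is skipped\n",
    "id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05\n",
]
records = list(iter_records(sample))
```

An open file can be passed in place of `sample`, since iterating over a file object yields one line at a time without reading the whole file into memory first.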

4 Comments

spot on Balint! Wonderful! Now, what if I need to fetch the time: and ip: fields too? What would be the regex? (this is the most confusing part for me...to interpret/guess regex).
Yes, you would have to extend the regular expression accordingly. IP would be something like ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+) and time like time:\s*(?P<time>\d+:\d+:\d+). You can consult the documentation of Python's re module for the fine details of regexps.
Should the regex part be like: PAT = re.compile(r"""id:\s*(?P<id>\d+)\s* name:\s*(?P<name>[0-9A-Za-z ]+?)\s+time:.* time:\s*(?P<time>\d:\d:\d) ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+) processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)...I am sorry for this, but I am not good at regular expressions that is why bothering you a little. ;-)
Including time and ip, it would look like re.compile(r"""id:\s*(?P<id>\d+)\s* Name:\s*(?P<name>[0-9A-Za-z ]+?)\s* time:\s*(?P<time>\d+:\d+:\d+)\s* .* ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+)\s* processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE). But still, if you use regular expressions, I strongly suggest investing 1-2 hours in reading some documentation about them. See for example this howto.
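
The extended pattern from the last comment can be tried out on the sample line as follows. This is a sketch, not a tested parser for every log variant: the pattern is the one quoted above, only laid out one field per line (which re.VERBOSE permits), and the variable names `line` and `fields` are illustrative.

```python
import re

# Extended pattern from the comment above, one field per line.
# With re.VERBOSE, whitespace in the pattern is ignored except inside [...].
PAT = re.compile(r"""id:\s*(?P<id>\d+)\s*
                     Name:\s*(?P<name>[0-9A-Za-z ]+?)\s*
                     time:\s*(?P<time>\d+:\d+:\d+)\s*
                     .*
                     ip:\s*(?P<ip>\d+\.\d+\.\d+\.\d+)\s*
                     processing_time:\s*(?P<ptime>\d+)""", re.VERBOSE)

line = ("id: 23455 Name: y malsen time: 03:20:20 user: ymanlls "
        "ip: 230.33.45.32 processing_time: 05")

match = PAT.match(line)
# groupdict() maps each named group to the matched text (always strings).
fields = match.groupdict()
```

All captured values are strings, so numeric fields such as `id` and `ptime` still need an explicit `int(...)` conversion.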

You could load your data using numpy's great loadtxt routine into a record array, and extract it from there:

import numpy as np

aa = np.loadtxt("proba.txt", usecols=(1, 3, 11),
                dtype={"names": ("id", "name", "proctime"),
                       "formats": ("i4", "a100", "i4")})
print aa["name"]
print aa["id"]
print aa["proctime"]

The example loads your data from proba.txt and stores it in aa. The individual fields (aa["name"], aa["id"], aa["proctime"]) give you an array for each column if you need them separately; otherwise, you already have everything in one NumPy array. The code above produces:

['ymalsen' 'ymalsen']
[23455 23455]
[5 5]

for the file proba.txt with following content:

id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05
id: 23455 Name: ymalsen time: 03:20:20 user: ymanlls ip: 230.33.45.32 processing_time: 05

However, please note that this assumes that no whitespace appears within the field contents. Whitespace between the fields is fine, though.
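
If field values may contain spaces, one alternative that does not rely on whitespace-separated columns is to split each line at its `label:` markers. The following is a rough sketch, not part of the loadtxt approach: the names `FIELD` and `parse_fields` are hypothetical, and it assumes that no value itself contains a word followed by a colon and a space.

```python
import re

# A label is a word followed by ":" and whitespace. Its value runs lazily
# up to the next " label: " marker or to the end of the line.
FIELD = re.compile(r"(\w+):\s+(.*?)(?=\s+\w+:\s|\s*$)")


def parse_fields(line):
    """Return a dict mapping each label in the line to its raw string value."""
    return dict(FIELD.findall(line))


line = ("id: 23455 Name: y malsen time: 03:20:20 user: ymanlls "
        "ip: 230.33.45.32 processing_time: 05\n")
fields = parse_fields(line)

# Pick out the three requested fields and convert the numeric ones.
record = [int(fields["id"]), fields["Name"], int(fields["processing_time"])]
```

The timestamp is not split at its colons because the label pattern requires whitespace after the colon.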

4 Comments

The text file has whitespace, which is why I think it's raising an IndexError: list index out of range exception.
You mean whitespace within the fields or between them? The main point is that numpy's routine will assume that the columns are separated by whitespace. If some of the data columns can contain whitespace themselves (for example, you allow "y malsen" as a name), the above approach won't work, but otherwise it should. (The example you provided had only data columns without whitespace in them.)
yes, the thing is that some of the data values have spaces between them.. :-/
OK, I made a separate answer for that case, see below.
