
I have an ASCII file with 3 columns and multiple rows, as follows:

col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535

I want to extract an ASCII file with the same 3 columns, but only the rows where col2 = "cb". I am trying to do this in Python using numpy.

import numpy as np

infile = "input.txt"   # renamed from `file`, which shadows a builtin
outfile = "out.txt"
col1 = np.loadtxt(infile, delimiter=',', skiprows=1, usecols=0)
col2 = np.loadtxt(infile, delimiter=',', skiprows=1, usecols=1, dtype=str)  # np.str is deprecated; use plain str
col3 = np.loadtxt(infile, delimiter=',', skiprows=1, usecols=2)

When I use np.core.defchararray.find I get only the indexes (the position of the match, or -1 where there is none):

cb_rows=np.core.defchararray.find(col2.astype(str),'cb')

How could I save a new file with the 3 columns, keeping only the rows where col2 = "cb" (in fact, the rows flagged by cb_rows)?

2 Comments
  • Is the extra column in line 5 intentional? Commented Jun 5, 2020 at 17:21
  • No, it was a mistake. I corrected it. Thank you. Commented Jun 5, 2020 at 17:22

3 Answers


If you want to achieve this by using numpy only, then try this:

import numpy as np

# Load everything as strings so the mixed-type columns stay in one array
a = np.loadtxt('input.txt', dtype=str, delimiter=',')
msk = a[:, 1] == 'cb'   # boolean mask: True where col2 is exactly 'cb'
b = a[msk, :]           # keep only the masked rows (the header row fails the mask too)
np.savetxt('out.txt', b, delimiter=',', fmt='%s', header='col1,col2,col3')
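Note that np.savetxt prefixes the header with the comment string ('# ' by default). If you want a plain CSV header line, pass comments='' as well — a minimal sketch with the filtered rows hard-coded for illustration:

```python
import numpy as np

# Stand-in for the filtered array b from above
b = np.array([['-3', 'cb', '7.9356'],
              ['-2', 'cb', '12.9229'],
              ['-2', 'cb', '13.2063']])

# comments='' stops savetxt from writing the header as '# col1,col2,col3'
np.savetxt('out.txt', b, delimiter=',', fmt='%s',
           header='col1,col2,col3', comments='')
```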




Edit

Since the second column of b is now redundant (it is 'cb' in every kept row), you can avoid writing it by:

np.savetxt('out.txt', b[:, 0:3:2], delimiter=',', fmt='%s', header='col1,col3')

Note that a and b are string arrays. If you need to do further processing of the first and last columns using their native types, convert them:

cb_col1 = b[:, 0].astype(int)
cb_col3 = b[:, 2].astype(float)

2 Comments

Thank you! Is it possible to define msk so that it matches where the string in the second column contains the characters "cb"? That is to say, the column may hold cb_e01, cb_e02, etc.
In that case you will have to write msk = ['cb' in i for i in a[:, 1]].
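The list comprehension works; a vectorized alternative is np.char.find, which returns -1 wherever the substring is absent — a small sketch with hypothetical cb_e01/cb_e02 values standing in for the data:

```python
import numpy as np

# Sample string array shaped like `a` above
a = np.array([['-3', 'cb_e01', '7.9356'],
              ['-3', 'mo',     '13.0402'],
              ['-2', 'cb_e02', '12.9229']])

# np.char.find returns the match position, or -1 where 'cb' does not occur,
# so "contains" becomes a comparison against -1
msk = np.char.find(a[:, 1], 'cb') != -1
b = a[msk, :]
```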
1

Your sample, corrected for the extra column in line 5:

In [4]: txt = """col1, col2, col3 
   ...: -3,cb,7.9356 
   ...: -3,mo,13.0402 
   ...: -2,cb,12.9229 
   ...: -2,cb,13.2063 
   ...: -1,va,12.7535"""

Using the full power of genfromtxt I can load it as a structured array:

In [5]: data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None, names=True, encoding=None)                                                    
In [6]: data                                                                    
Out[6]: 
array([(-3, 'cb',  7.9356), (-3, 'mo', 13.0402), (-2, 'cb', 12.9229),
       (-2, 'cb', 13.2063), (-1, 'va', 12.7535)],
      dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])

The 'col2' field is already string dtype, so can be tested against a string like 'cb':

In [7]: data['col2']                                                            
Out[7]: array(['cb', 'mo', 'cb', 'cb', 'va'], dtype='<U2')
In [8]: idx=data['col2']=='cb'                                                  
In [9]: idx                                                                     
Out[9]: array([ True, False,  True,  True, False])

we can easily select the records with just 'cb' from data.

In [10]: data[idx]                                                              
Out[10]: 
array([(-3, 'cb',  7.9356), (-2, 'cb', 12.9229), (-2, 'cb', 13.2063)],
      dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])

But if all you want to do is select lines containing 'cb' and write them back out, it might be easiest to use plain Python file read/write.

Use readlines to fetch all lines:

In [11]: txt1 = txt.splitlines()                                                
In [12]: txt1                                                                   
Out[12]: 
['col1, col2, col3',
 '-3,cb,7.9356',
 '-3,mo,13.0402',
 '-2,cb,12.9229',
 '-2,cb,13.2063',
 '-1,va,12.7535']

and use a simple string find to identify lines that contain 'cb':

In [13]: txt1[1].find('cb')                                                     
Out[13]: 3
In [14]: txt1[0].find('cb')                                                     
Out[14]: -1
In [15]: [row for row in txt1 if row.find('cb')!=-1]                            
Out[15]: ['-3,cb,7.9356', '-2,cb,12.9229', '-2,cb,13.2063']

Finally, write these lines back out to a file.
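Putting the pieces together — a sketch that writes the header plus the matching lines to out.txt (the sample text is inlined here in place of reading input.txt):

```python
# Sample input, standing in for open('input.txt').read()
txt = """col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535"""
txt1 = txt.splitlines()

# Same find-based test as above; the header line contains no 'cb'
kept = [row for row in txt1 if row.find('cb') != -1]

with open('out.txt', 'w') as f:
    f.write(txt1[0] + '\n')            # keep the header line
    f.write('\n'.join(kept) + '\n')    # then the matching rows
```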


1

Seems like there's something wrong with this row. There are 4 columns instead of 3:

-2,cb,1,13.2063

You can use pandas for this:

import pandas as pd

df = pd.read_csv('input.csv')
df = df[df['col2'] == 'cb']
print(df)

Output:

   col1 col2     col3
0    -3   cb   7.9356
2    -2   cb  12.9229
3    -2   cb  13.2063

To get "contains" behavior instead of an exact match, use str.contains:

import pandas as pd

df = pd.read_csv('input.csv')
df = df[df['col2'].str.contains('cb')]
print(df)

If the input.csv looks like this:

col1,col2,col3
-3,cb_01,7.9356
-3,mocb,13.0402
-2,1cb2,12.9229
-2,cb,13.2063
-1,va,12.7535

Output:

   col1   col2     col3
0    -3  cb_01   7.9356
1    -3   mocb  13.0402
2    -2   1cb2  12.9229
3    -2     cb  13.2063
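To write the filtered frame back out as requested in the question, DataFrame.to_csv finishes the job — a sketch with the sample data inlined via io.StringIO in place of input.csv:

```python
import io
import pandas as pd

# Sample data standing in for input.csv
csv_text = """col1,col2,col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535"""

df = pd.read_csv(io.StringIO(csv_text))
df = df[df['col2'].str.contains('cb')]

# index=False keeps pandas' row labels out of the output file
df.to_csv('out.txt', index=False)
```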

1 Comment

Thank you! Similarly to the answer above, is it possible to get the rows where 'col2' contains the characters "cb"? That is to say, the column may hold cb_e01, cb_e02, etc.
