
I have an ASCII file with 3 columns and multiple rows, as follows:

col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535

I want to extract an ASCII file with the same 3 columns, but only the rows where col2 = "cb". I am trying to do this in Python using numpy.

import numpy as np

infile = "input.txt"   # renamed from `file`, which shadows a builtin
outfile = "out.txt"
col1 = np.loadtxt(infile, delimiter=',', skiprows=1, usecols=0)
col2 = np.loadtxt(infile, delimiter=',', skiprows=1, usecols=1, dtype=str)  # np.str is deprecated; use plain str
col3 = np.loadtxt(infile, delimiter=',', skiprows=1, usecols=2)

When I use np.core.defchararray.find I get only the indexes (the position of the match, or -1 where there is none):

cb_rows=np.core.defchararray.find(col2.astype(str),'cb')

How could I save a new file with the 3 columns, keeping only the rows where col2 = "cb" (in fact, the rows flagged by cb_rows)?

2 Comments
  • Is the extra column in line 5 intentional? Commented Jun 5, 2020 at 17:21
  • No, it was a mistake. I corrected it. Thank you. Commented Jun 5, 2020 at 17:22

3 Answers


If you want to achieve this by using numpy only, then try this:

import numpy as np

# Load everything as strings so the mixed-type columns stay in one array
a = np.loadtxt('input.txt', dtype=str, delimiter=',')
msk = a[:, 1] == 'cb'   # boolean mask: True where col2 is exactly 'cb'
b = a[msk, :]           # keep only the masked rows (the header row fails the mask too)
np.savetxt('out.txt', b, delimiter=',', fmt='%s', header='col1,col2,col3')
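Note that np.savetxt prefixes the header with the comment string ('# ' by default). If you want a plain CSV header line, pass comments='' as well — a minimal sketch with the filtered rows hard-coded for illustration:

```python
import numpy as np

# Stand-in for the filtered array b from above
b = np.array([['-3', 'cb', '7.9356'],
              ['-2', 'cb', '12.9229'],
              ['-2', 'cb', '13.2063']])

# comments='' stops savetxt from writing the header as '# col1,col2,col3'
np.savetxt('out.txt', b, delimiter=',', fmt='%s',
           header='col1,col2,col3', comments='')
```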




Edit

Since the second column of b is now redundant (it is 'cb' in every kept row), you can avoid writing it by:

np.savetxt('out.txt', b[:, 0:3:2], delimiter=',', fmt='%s', header='col1,col3')

Note that a and b are string arrays. If you need to do further processing of the first and last columns using their native types, convert them:

cb_col1 = b[:, 0].astype(int)
cb_col3 = b[:, 2].astype(float)

2 Comments

Thank you! Is it possible to define msk so that it matches where the string in the second column contains the characters "cb"? That is to say, the column may hold cb_e01, cb_e02, etc.
In that case you will have to write msk = ['cb' in i for i in a[:, 1]].
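The list comprehension works; a vectorized alternative is np.char.find, which returns -1 wherever the substring is absent — a small sketch with hypothetical cb_e01/cb_e02 values standing in for the data:

```python
import numpy as np

# Sample string array shaped like `a` above
a = np.array([['-3', 'cb_e01', '7.9356'],
              ['-3', 'mo',     '13.0402'],
              ['-2', 'cb_e02', '12.9229']])

# np.char.find returns the match position, or -1 where 'cb' does not occur,
# so "contains" becomes a comparison against -1
msk = np.char.find(a[:, 1], 'cb') != -1
b = a[msk, :]
```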
1

Your sample, corrected for the extra column in line 5:

In [4]: txt = """col1, col2, col3 
   ...: -3,cb,7.9356 
   ...: -3,mo,13.0402 
   ...: -2,cb,12.9229 
   ...: -2,cb,13.2063 
   ...: -1,va,12.7535"""

Using the full power of genfromtxt I can load it as a structured array:

In [5]: data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None, names=True, encoding=None)                                                    
In [6]: data                                                                    
Out[6]: 
array([(-3, 'cb',  7.9356), (-3, 'mo', 13.0402), (-2, 'cb', 12.9229),
       (-2, 'cb', 13.2063), (-1, 'va', 12.7535)],
      dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])

The 'col2' field is already string dtype, so can be tested against a string like 'cb':

In [7]: data['col2']                                                            
Out[7]: array(['cb', 'mo', 'cb', 'cb', 'va'], dtype='<U2')
In [8]: idx=data['col2']=='cb'                                                  
In [9]: idx                                                                     
Out[9]: array([ True, False,  True,  True, False])

we can easily select the records with just 'cb' from data.

In [10]: data[idx]                                                              
Out[10]: 
array([(-3, 'cb',  7.9356), (-2, 'cb', 12.9229), (-2, 'cb', 13.2063)],
      dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])

But if all you want to do is select lines containing 'cb' and write them back out, it might be easiest to use plain Python file read/write.

Use readlines to fetch all lines:

In [11]: txt1 = txt.splitlines()                                                
In [12]: txt1                                                                   
Out[12]: 
['col1, col2, col3',
 '-3,cb,7.9356',
 '-3,mo,13.0402',
 '-2,cb,12.9229',
 '-2,cb,13.2063',
 '-1,va,12.7535']

and use a simple string find to identify lines that contain 'cb':

In [13]: txt1[1].find('cb')                                                     
Out[13]: 3
In [14]: txt1[0].find('cb')                                                     
Out[14]: -1
In [15]: [row for row in txt1 if row.find('cb')!=-1]                            
Out[15]: ['-3,cb,7.9356', '-2,cb,12.9229', '-2,cb,13.2063']

Finally, write these lines back out to a file.
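Putting the pieces together — a sketch that writes the header plus the matching lines to out.txt (the sample text is inlined here in place of reading input.txt):

```python
# Sample input, standing in for open('input.txt').read()
txt = """col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535"""
txt1 = txt.splitlines()

# Same find-based test as above; the header line contains no 'cb'
kept = [row for row in txt1 if row.find('cb') != -1]

with open('out.txt', 'w') as f:
    f.write(txt1[0] + '\n')            # keep the header line
    f.write('\n'.join(kept) + '\n')    # then the matching rows
```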


1

Seems like there's something wrong with this row. There are 4 columns instead of 3:

-2,cb,1,13.2063

You can use pandas for this:

import pandas as pd

df = pd.read_csv('input.csv')
df = df[df['col2'] == 'cb']
print(df)

Output:

   col1 col2     col3
0    -3   cb   7.9356
2    -2   cb  12.9229
3    -2   cb  13.2063

To get "contains" behavior instead of an exact match, use str.contains:

import pandas as pd

df = pd.read_csv('input.csv')
df = df[df['col2'].str.contains('cb')]
print(df)

If the input.csv looks like this:

col1,col2,col3
-3,cb_01,7.9356
-3,mocb,13.0402
-2,1cb2,12.9229
-2,cb,13.2063
-1,va,12.7535

Output:

   col1   col2     col3
0    -3  cb_01   7.9356
1    -3   mocb  13.0402
2    -2   1cb2  12.9229
3    -2     cb  13.2063
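To write the filtered frame back out as requested in the question, DataFrame.to_csv finishes the job — a sketch with the sample data inlined via io.StringIO in place of input.csv:

```python
import io
import pandas as pd

# Sample data standing in for input.csv
csv_text = """col1,col2,col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535"""

df = pd.read_csv(io.StringIO(csv_text))
df = df[df['col2'].str.contains('cb')]

# index=False keeps pandas' row labels out of the output file
df.to_csv('out.txt', index=False)
```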

1 Comment

Thank you! Similarly to the answer above, is it possible to get the rows where 'col2' contains the characters "cb"? That is to say, the column may hold cb_e01, cb_e02, etc.
