
I have two dataframes with approximately 10 000 rows each. They are similar to a and b below, but with more rows.

 a
 Out[9]: 
     end  start
 0   4.0      3
 1   5.5      5
 2   7.5      7
 3   9.5      9
 4  11.5     11
 5  15.0     14
 6  18.0     17
 7  21.0     20
 8  26.0     25
 9  31.0     30

 b
 Out[10]: 
        status
 moment       
 8.0         o
 10.0        o
 14.5        o
 16.0        o
 19.0        o
 27.0        o
 28.0        o
 30.5        o
 35.0        o
 40.0        o
 50.0        o

I have to find all moments in dataframe b which fall between start and end in dataframe a.

I wrote a for loop for that, and it works well with small dataframes.

 for r in a.index:
     for k in b.index:
         if a.loc[r, 'start'] < k < a.loc[r, 'end']:
             b.loc[k, 'status'] = 'm'  # replace 'o' with 'm' when the moment lies between start and end

Below you can see how the for loop has replaced o -> m where the moment is between start and end.

 In [12]: b
 Out[12]: 
        status
 moment       
 8.0         o
 10.0        o
 14.5        m
 16.0        o
 19.0        o
 27.0        o
 28.0        o
 30.5        m
 35.0        o
 40.0        o
 50.0        o

When I try to use it with huge dataframes (more than 10 000 rows) it no longer produces results within a reasonable time.

Do you have any ideas how to make my for loop faster, or a better approach that scales to longer dataframes?

  • Are both dataframes of equal length? Commented Aug 26, 2016 at 12:54
  • As a minimal improvement, break out of the inner loop as soon as you change the 'o' to 'm'. No need to keep looking in 'b' once you have a match. Commented Aug 26, 2016 at 12:57

2 Answers


Here is an option without an explicit for loop. It still compares every moment against all of a (so it is still O(n·m) work), but the inner comparison is vectorized:

b[b.index.map(lambda m: ((m > a.start) & (m < a.end)).any())] = "m"

b

#   status
# moment    
# 8.0   o
# 10.0  o
# 14.5  m
# 16.0  o
# 19.0  o
# 27.0  o
# 28.0  o
# 30.5  m
# 35.0  o
# 40.0  o
# 50.0  o
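The same idea can be written with a single NumPy broadcast instead of `index.map`, which avoids the per-element Python lambda. A minimal sketch, using small hypothetical frames shaped like the question's a and b:

```python
import numpy as np
import pandas as pd

# Hypothetical small frames for illustration
a = pd.DataFrame({'start': [3, 14, 30], 'end': [4.0, 15.0, 31.0]})
b = pd.DataFrame({'status': ['o', 'o', 'o', 'o']},
                 index=pd.Index([8.0, 14.5, 30.5, 50.0], name='moment'))

m = b.index.to_numpy()[:, None]           # column vector, shape (len(b), 1)
inside = ((m > a['start'].to_numpy()) &   # broadcasts to a (len(b), len(a))
          (m < a['end'].to_numpy())       # matrix of comparisons
          ).any(axis=1)                   # True if any interval contains the moment

b.loc[inside, 'status'] = 'm'
```

This builds a len(b) × len(a) boolean matrix in one shot, so it trades memory for speed; for two frames of ~10 000 rows that is a 100-million-element boolean array, which may or may not fit your memory budget.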



Your solution is O(n·m) in running time (every row of a against every row of b). As far as I can see, both dataframes are sorted; if that holds for the whole data, you can use binary search to locate, for each moment, the candidate interval in O(log n), giving O(m log n) overall instead of writing a recursive divide-and-conquer routine by hand.

If I am wrong and the dataframes are not sorted, but you have to run this kind of search multiple times, I would advise sorting first. Sorting is also O(n log n) (e.g. merge sort), so it does not change the overall complexity.
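A sketch of the binary-search idea using `numpy.searchsorted`, assuming a is sorted by start and its intervals do not overlap (true for the question's sample data); the frames below are hypothetical copies of the question's example:

```python
import numpy as np
import pandas as pd

# Hypothetical frames mirroring the question's a and b
a = pd.DataFrame({'start': [3, 5, 7, 9, 11, 14, 17, 20, 25, 30],
                  'end':   [4.0, 5.5, 7.5, 9.5, 11.5, 15.0, 18.0, 21.0, 26.0, 31.0]})
b = pd.DataFrame({'status': ['o'] * 11},
                 index=pd.Index([8.0, 10.0, 14.5, 16.0, 19.0, 27.0,
                                 28.0, 30.5, 35.0, 40.0, 50.0], name='moment'))

starts = a['start'].to_numpy()
ends = a['end'].to_numpy()
moments = b.index.to_numpy()

# Binary search: index of the last interval whose start lies below each moment
pos = np.searchsorted(starts, moments, side='left') - 1

# A moment matches only if it also falls strictly inside that interval
safe = pos.clip(min=0)  # avoid indexing with -1; masked out by pos >= 0 below
inside = (pos >= 0) & (moments > starts[safe]) & (moments < ends[safe])

b.loc[inside, 'status'] = 'm'
```

Because each moment is checked against only one candidate interval, this is O(m log n) rather than O(n·m). With non-overlapping sorted intervals, checking the nearest preceding start is sufficient; overlapping intervals would need an interval-tree style structure instead.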

