
I have two dataframes with approximately 10 000 rows each. They are similar to a and b below, but with more rows.

 a
 Out[9]: 
     end  start
 0   4.0      3
 1   5.5      5
 2   7.5      7
 3   9.5      9
 4  11.5     11
 5  15.0     14
 6  18.0     17
 7  21.0     20
 8  26.0     25
 9  31.0     30

 b
 Out[10]: 
        status
 moment       
 8.0         o
 10.0        o
 14.5        o
 16.0        o
 19.0        o
 27.0        o
 28.0        o
 30.5        o
 35.0        o
 40.0        o
 50.0        o

I have to find all moments in dataframe b which fall between start and end in dataframe a.

I wrote a for loop for that, and it works well with small dataframes.

 for r in a.index:
     for k in b.index:
         if a.loc[r, 'start'] < k < a.loc[r, 'end']:
             b.loc[k, 'status'] = 'm'  # replace 'o' with 'm' when the moment lies between start and end

Below you can see how the for loop has replaced o -> m where the moment is between start and end.

 In [12]: b
 Out[12]: 
        status
 moment       
 8.0         o
 10.0        o
 14.5        m
 16.0        o
 19.0        o
 27.0        o
 28.0        o
 30.5        m
 35.0        o
 40.0        o
 50.0        o

When I try to use it with huge dataframes (more than 10 000 rows) it no longer produces results within a reasonable time.

Do you have any ideas how to make my for loop faster, or a better approach that scales to longer dataframes?

  • Are both dataframes of equal length? Commented Aug 26, 2016 at 12:54
  • As a minimal improvement, break out of the inner loop as soon as you change the 'o' to 'm'. No need to keep looking in 'b' once you have a match. Commented Aug 26, 2016 at 12:57

2 Answers


Here is an option without an explicit for loop. It still compares every moment against all of a (so it is still O(n·m) work), but the inner comparison is vectorized:

b[b.index.map(lambda m: ((m > a.start) & (m < a.end)).any())] = "m"

b

#   status
# moment    
# 8.0   o
# 10.0  o
# 14.5  m
# 16.0  o
# 19.0  o
# 27.0  o
# 28.0  o
# 30.5  m
# 35.0  o
# 40.0  o
# 50.0  o
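The same idea can be written with a single NumPy broadcast instead of `index.map`, which avoids the per-element Python lambda. A minimal sketch, using small hypothetical frames shaped like the question's a and b:

```python
import numpy as np
import pandas as pd

# Hypothetical small frames for illustration
a = pd.DataFrame({'start': [3, 14, 30], 'end': [4.0, 15.0, 31.0]})
b = pd.DataFrame({'status': ['o', 'o', 'o', 'o']},
                 index=pd.Index([8.0, 14.5, 30.5, 50.0], name='moment'))

m = b.index.to_numpy()[:, None]           # column vector, shape (len(b), 1)
inside = ((m > a['start'].to_numpy()) &   # broadcasts to a (len(b), len(a))
          (m < a['end'].to_numpy())       # matrix of comparisons
          ).any(axis=1)                   # True if any interval contains the moment

b.loc[inside, 'status'] = 'm'
```

This builds a len(b) × len(a) boolean matrix in one shot, so it trades memory for speed; for two frames of ~10 000 rows that is a 100-million-element boolean array, which may or may not fit your memory budget.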



Your solution is O(n·m) in running time (every row of a against every row of b). As far as I can see, both dataframes are sorted; if that holds for the whole data, you can use binary search to locate, for each moment, the candidate interval in O(log n), giving O(m log n) overall instead of writing a recursive divide-and-conquer routine by hand.

If I am wrong and the dataframes are not sorted, but you have to run this kind of search multiple times, I would advise sorting first. Sorting is also O(n log n) (e.g. merge sort), so it does not change the overall complexity.
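A sketch of the binary-search idea using `numpy.searchsorted`, assuming a is sorted by start and its intervals do not overlap (true for the question's sample data); the frames below are hypothetical copies of the question's example:

```python
import numpy as np
import pandas as pd

# Hypothetical frames mirroring the question's a and b
a = pd.DataFrame({'start': [3, 5, 7, 9, 11, 14, 17, 20, 25, 30],
                  'end':   [4.0, 5.5, 7.5, 9.5, 11.5, 15.0, 18.0, 21.0, 26.0, 31.0]})
b = pd.DataFrame({'status': ['o'] * 11},
                 index=pd.Index([8.0, 10.0, 14.5, 16.0, 19.0, 27.0,
                                 28.0, 30.5, 35.0, 40.0, 50.0], name='moment'))

starts = a['start'].to_numpy()
ends = a['end'].to_numpy()
moments = b.index.to_numpy()

# Binary search: index of the last interval whose start lies below each moment
pos = np.searchsorted(starts, moments, side='left') - 1

# A moment matches only if it also falls strictly inside that interval
safe = pos.clip(min=0)  # avoid indexing with -1; masked out by pos >= 0 below
inside = (pos >= 0) & (moments > starts[safe]) & (moments < ends[safe])

b.loc[inside, 'status'] = 'm'
```

Because each moment is checked against only one candidate interval, this is O(m log n) rather than O(n·m). With non-overlapping sorted intervals, checking the nearest preceding start is sufficient; overlapping intervals would need an interval-tree style structure instead.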

