How do I convert column B into a transition matrix in Python?

The matrix should be 19 x 19, since column B has 19 unique values. There are a total of 432 rows in the dataset.


time                A          B
2017-10-26 09:00:00  36       816
2017-10-26 10:45:00  43       816
2017-10-26 12:30:00  50       998
2017-10-26 12:45:00  51       750
2017-10-26 13:00:00  52       998
2017-10-26 13:15:00  53       998
2017-10-26 13:30:00  54       998
2017-10-26 14:00:00  56       998
2017-10-26 14:15:00  57       834
2017-10-26 14:30:00  58      1285
2017-10-26 14:45:00  59      1288
2017-10-26 23:45:00  95      1285
2017-10-27 03:00:00  12      1285
2017-10-27 03:30:00  14      1285
                             ... 
2017-11-02 14:00:00  56       998
2017-11-02 14:15:00  57       998
2017-11-02 14:30:00  58       998
2017-11-02 14:45:00  59       998
2017-11-02 15:00:00  60       816
2017-11-02 15:15:00  61       275
2017-11-02 15:30:00  62       225
2017-11-02 15:45:00  63      1288
2017-11-02 16:00:00  64      1088
2017-11-02 18:15:00  73      1285
2017-11-02 20:30:00  82      1285
2017-11-02 21:00:00  84      1088
2017-11-02 21:15:00  85      1088
2017-11-02 21:30:00  86      1088
2017-11-02 22:00:00  88      1088
2017-11-02 22:30:00  90      1088
2017-11-02 23:00:00  92      1088
2017-11-02 23:30:00  94      1088
2017-11-02 23:45:00  95      1088


The matrix should contain the number of transitions between each pair of values, with the values of B as both row and column labels:

B        ...   1088   1288   ...
.
.
1088     ...      8      2   ...
.
.

  • If you use pandas, add the pandas tag. Commented Apr 3, 2019 at 13:47
  • In pure Python you could use zip(B, B[1:]) to create pairs and Counter() to count them; filling a list/matrix with this data would take more work. In pandas you could use shift() to create a column equal to B[1:] and groupby to count the pairs; again, filling a new DataFrame with the results would take more work. Commented Apr 3, 2019 at 13:52
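
The zip/Counter idea from the comment above can be sketched as follows. This is only a minimal illustration on a shortened, made-up sequence; the names b, pair_counts, states and matrix are assumptions and do not come from the question.

```python
from collections import Counter

# Hypothetical short sequence standing in for column B
b = [816, 816, 998, 750, 998, 998]

# Pair every value with its successor and count each (from, to) pair
pair_counts = Counter(zip(b, b[1:]))

# Fill a square list-of-lists matrix: rows are "from", columns are "to"
states = sorted(set(b))
matrix = [[pair_counts.get((src, dst), 0) for dst in states] for src in states]

for src, row in zip(states, matrix):
    print(src, row)
```

The counts in the matrix add up to len(b) - 1, because the last value has no successor.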

2 Answers

I used your data to create a DataFrame with only column B, but the approach should also work when the other columns are present.

text = '''time                A          B
2017-10-26 09:00:00  36       816
2017-10-26 10:45:00  43       816
2017-10-26 12:30:00  50       998
2017-10-26 12:45:00  51       750
2017-10-26 13:00:00  52       998
2017-10-26 13:15:00  53       998
2017-10-26 13:30:00  54       998
2017-10-26 14:00:00  56       998
2017-10-26 14:15:00  57       834
2017-10-26 14:30:00  58      1285
2017-10-26 14:45:00  59      1288
2017-10-26 23:45:00  95      1285
2017-10-27 03:00:00  12      1285
2017-10-27 03:30:00  14      1285
2017-11-02 14:00:00  56       998
2017-11-02 14:15:00  57       998
2017-11-02 14:30:00  58       998
2017-11-02 14:45:00  59       998
2017-11-02 15:00:00  60       816
2017-11-02 15:15:00  61       275
2017-11-02 15:30:00  62       225
2017-11-02 15:45:00  63      1288
2017-11-02 16:00:00  64      1088
2017-11-02 18:15:00  73      1285
2017-11-02 20:30:00  82      1285
2017-11-02 21:00:00  84      1088
2017-11-02 21:15:00  85      1088
2017-11-02 21:30:00  86      1088
2017-11-02 22:00:00  88      1088
2017-11-02 22:30:00  90      1088
2017-11-02 23:00:00  92      1088
2017-11-02 23:30:00  94      1088
2017-11-02 23:45:00  95      1088'''

import pandas as pd

B = [int(row[29:].strip()) for row in text.split('\n') if 'B' not in row]
df = pd.DataFrame({'B': B})

I get the unique values in the column to use later when creating the matrix.

numbers = sorted(df['B'].unique())
print(numbers)

[225, 275, 750, 816, 834, 998, 1088, 1285, 1288]

I create a shifted column C so that every row holds a value and its successor.

df['C'] = df['B'].shift(-1)  # shift only column B, so this also works when df has more columns
print(df)

       B       C
0    816   816.0
1    816   998.0
2    998   750.0
3    750   998.0

I group by ['B', 'C'] so I can count pairs

groups = df.groupby(['B', 'C'])
counts = {i[0]:(len(i[1]) if i[0][0] != i[0][1] else 0) for i in groups} # don't count (816,816)
# counts = {i[0]:len(i[1]) for i in groups} # count even (816,816)
print(counts)

{(225, 1288.0): 1, (275, 225.0): 1, (750, 998.0): 1, (816, 275.0): 1, (816, 816.0): 0, (816, 998.0): 1, (834, 1285.0): 1, (998, 750.0): 1, (998, 816.0): 1, (998, 834.0): 1, (998, 998.0): 0, (1088, 1088.0): 0, (1088, 1285.0): 1, (1285, 998.0): 1, (1285, 1088.0): 1, (1285, 1285.0): 0, (1285, 1288.0): 1, (1288, 1088.0): 1, (1288, 1285.0): 1}

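Now I can create the matrix. For each value x in numbers I build a Series (indexed by numbers) holding the counts of transitions from x to every y, and add it to the matrix as column x. Columns are therefore the source value and rows the destination value.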

matrix = pd.DataFrame()

for x in numbers:
    matrix[x] = pd.Series([counts.get((x,y), 0) for y in numbers], index=numbers)

print(matrix)

Result

      225  275  750  816  834  998  1088  1285  1288
225     0    1    0    0    0    0     0     0     0
275     0    0    0    1    0    0     0     0     0
750     0    0    0    0    0    1     0     0     0
816     0    0    0    0    0    1     0     0     0
834     0    0    0    0    0    1     0     0     0
998     0    0    1    1    0    0     0     1     0
1088    0    0    0    0    0    0     0     1     1
1285    0    0    0    0    1    0     1     0     1
1288    1    0    0    0    0    0     0     1     0
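
As a possible shortcut, the group-and-fill steps can be collapsed with pd.crosstab. The sketch below is an assumption-laden illustration, not the method used in this answer: it runs on a shortened, made-up sequence, the column names 'from' and 'to' are invented for the example, and it puts the source value on the rows and the destination on the columns, which is the transpose of the layout produced by the loop above.

```python
import pandas as pd

# Illustrative values only; in the answer this would be df['B']
b = pd.Series([816, 816, 998, 750, 998, 998])
states = sorted(b.unique())

# One row per transition: a value and its successor (the last row has none)
pairs = pd.DataFrame({'from': b, 'to': b.shift(-1)}).dropna()
pairs['to'] = pairs['to'].astype(int)

# Optional: drop self-transitions such as (816, 816)
pairs = pairs[pairs['from'] != pairs['to']]

# Cross-tabulate, then reindex so every state appears as both row and column
matrix = pd.crosstab(pairs['from'], pairs['to'])
matrix = matrix.reindex(index=states, columns=states, fill_value=0)
print(matrix)
```

The reindex step matters because crosstab only creates rows and columns for values that actually occur on that side of a transition.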

Full example

text = '''time                A          B
2017-10-26 09:00:00  36       816
2017-10-26 10:45:00  43       816
2017-10-26 12:30:00  50       998
2017-10-26 12:45:00  51       750
2017-10-26 13:00:00  52       998
2017-10-26 13:15:00  53       998
2017-10-26 13:30:00  54       998
2017-10-26 14:00:00  56       998
2017-10-26 14:15:00  57       834
2017-10-26 14:30:00  58      1285
2017-10-26 14:45:00  59      1288
2017-10-26 23:45:00  95      1285
2017-10-27 03:00:00  12      1285
2017-10-27 03:30:00  14      1285
2017-11-02 14:00:00  56       998
2017-11-02 14:15:00  57       998
2017-11-02 14:30:00  58       998
2017-11-02 14:45:00  59       998
2017-11-02 15:00:00  60       816
2017-11-02 15:15:00  61       275
2017-11-02 15:30:00  62       225
2017-11-02 15:45:00  63      1288
2017-11-02 16:00:00  64      1088
2017-11-02 18:15:00  73      1285
2017-11-02 20:30:00  82      1285
2017-11-02 21:00:00  84      1088
2017-11-02 21:15:00  85      1088
2017-11-02 21:30:00  86      1088
2017-11-02 22:00:00  88      1088
2017-11-02 22:30:00  90      1088
2017-11-02 23:00:00  92      1088
2017-11-02 23:30:00  94      1088
2017-11-02 23:45:00  95      1088'''

import pandas as pd

B = [int(row[29:].strip()) for row in text.split('\n') if 'B' not in row]
df = pd.DataFrame({'B': B})

numbers = sorted(df['B'].unique())
print(numbers)

df['C'] = df['B'].shift(-1)  # shift only column B, so this also works when df has more columns
print(df)

groups = df.groupby(['B', 'C'])
counts = {i[0]:(len(i[1]) if i[0][0] != i[0][1] else 0) for i in groups} # don't count (816,816)
# counts = {i[0]:len(i[1]) for i in groups} # count even (816,816)
print(counts)

matrix = pd.DataFrame()

for x in numbers:
    matrix[str(x)] = pd.Series([counts.get((x,y), 0) for y in numbers], index=numbers)

print(matrix)

EDIT:

counts = {i[0]:(len(i[1]) if i[0][0] != i[0][1] else 0) for i in groups} # don't count (816,816)

written as a normal for loop:

counts = {}
for pair, group in groups:
    if pair[0] != pair[1]:  # don't count (816,816)
        counts[pair] = len(group)
    else:  
        counts[pair] = 0

Negate the count when it is greater than 10:

counts = {}
for pair, group in groups:
    if pair[0] != pair[1]:  # don't count (816,816)
        count = len(group)
        if count > 10:
            counts[pair] = -count
        else:
            counts[pair] = count
    else:
        counts[pair] = 0

EDIT:

counts = {}
for pair, group in groups:
    if pair[0] != pair[1]:  # don't count (816,816)

        #counts[(A,B)] = len((A,B)) + len((B,A)) 
        if pair not in counts:
            counts[pair] = len(group) # put first value
        else:
            counts[pair] += len(group) # add second value

        #counts[(B,A)] = len((A,B)) + len((B,A)) 
        if (pair[1],pair[0]) not in counts:
            counts[(pair[1],pair[0])] = len(group) # put first value
        else:
            counts[(pair[1],pair[0])] += len(group) # add second value
    else:  
        counts[pair] = 0 # (816,816) gives 0

#counts[(A,B)] == counts[(B,A)]

counts_2 = {}               
for pair, count in counts.items():
    if count > 10 :
        counts_2[pair] = -count
    else:
        counts_2[pair] = count

matrix = pd.DataFrame()

for x in numbers:
    matrix[str(x)] = pd.Series([counts_2.get((x,y), 0) for y in numbers], index=numbers)

print(matrix)
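
The symmetric counting and the sign flip from the last EDIT can also be expressed directly on a matrix. This is a hedged sketch rather than a replacement for the loops above: the labels and counts in `directed` are made up, and it assumes the directed count matrix already has zeros on the diagonal.

```python
import pandas as pd

# Hypothetical directed counts (row -> column), diagonal already zero
states = [750, 816, 998]
directed = pd.DataFrame(
    [[0, 0, 7],
     [0, 0, 1],
     [5, 2, 0]],
    index=states, columns=states,
)

# counts[(A, B)] + counts[(B, A)] for every pair
symmetric = directed + directed.T

# Keep values up to 10, negate anything larger
result = symmetric.where(symmetric <= 10, -symmetric)
print(result)
```

Because the matrix is added to its own transpose, the result is symmetric by construction, matching the counts[(A,B)] == counts[(B,A)] property noted in the EDIT.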
Comments

Thanks Furas. But the number of transitions in the result is greater than the number of rows. I think it should be equal to the number of rows.
And how can we fill in 0 where the transition is between identical numbers? Example: if the transition is (1088.0, 1088.0): 411, then 411 should be replaced by 0.
It should be rows - 1, because the last row has no transition. Now I see the problem: it has to be len(i[1]) instead of i[1].size, because the group has len(i[1]) rows but every row has two elements, so i[1].size = 2*len(i[1]).
counts = {i[0]:len(i[1]) if i[0][0] != i[0][1] else 0 for i in groups} should give (1088.0, 1088.0): 0
Can we create a distance matrix for the above example? If the transition count is < 10 the distance stays the same; if the transition count is > 10 the distance is inverted. (Here column A holds random points on the map. More transitions mean that the distance between the points is smaller.)
An alternative, pandas-based approach. Note that I've used shift(1), so column C holds the previous value and each row records the transition from C to B:

text = '''time                A          B
2017-10-26 09:00:00  36       816
2017-10-26 10:45:00  43       816
2017-10-26 12:30:00  50       998
2017-10-26 12:45:00  51       750
2017-10-26 13:00:00  52       998
2017-10-26 13:15:00  53       998
2017-10-26 13:30:00  54       998
2017-10-26 14:00:00  56       998
2017-10-26 14:15:00  57       834
2017-10-26 14:30:00  58      1285
2017-10-26 14:45:00  59      1288
2017-10-26 23:45:00  95      1285
2017-10-27 03:00:00  12      1285
2017-10-27 03:30:00  14      1285
2017-11-02 14:00:00  56       998
2017-11-02 14:15:00  57       998
2017-11-02 14:30:00  58       998
2017-11-02 14:45:00  59       998
2017-11-02 15:00:00  60       816
2017-11-02 15:15:00  61       275
2017-11-02 15:30:00  62       225
2017-11-02 15:45:00  63      1288
2017-11-02 16:00:00  64      1088
2017-11-02 18:15:00  73      1285
2017-11-02 20:30:00  82      1285
2017-11-02 21:00:00  84      1088
2017-11-02 21:15:00  85      1088
2017-11-02 21:30:00  86      1088
2017-11-02 22:00:00  88      1088
2017-11-02 22:30:00  90      1088
2017-11-02 23:00:00  92      1088
2017-11-02 23:30:00  94      1088
2017-11-02 23:45:00  95      1088'''

import pandas as pd

B = [int(row[29:].strip()) for row in text.split('\n') if 'B' not in row]
df = pd.DataFrame({'B': B})
# alternative approach
df['C'] = df['B'].shift(1)  # C holds the previous value, so each row is a transition from C to B

df['counts'] = 1  # add an arbitrary counts column for the group by

# group together the combinations then unstack to get matrix
trans_matrix = df.groupby(['B', 'C']).count().unstack()

# make the columns a bit neater
trans_matrix.columns = trans_matrix.columns.droplevel()

The result is:

[Image: the resulting transition count matrix]

I think this is correct: the one time 225 is observed, it then transitions to 1288. To turn the counts into a transition probability matrix, divide the counts for each source value by the total number of transitions out of that value.
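
A minimal sketch of that normalization, assuming a shortened, made-up sequence and using pd.crosstab with normalize='index' instead of the groupby/unstack code above (the column names 'from' and 'to' are invented for the example):

```python
import pandas as pd

# Illustrative values only; in practice this would be df['B']
b = pd.Series([816, 816, 998, 750, 998, 998])

# One row per transition: a value and its successor
pairs = pd.DataFrame({'from': b, 'to': b.shift(-1)}).dropna()
pairs['to'] = pairs['to'].astype(int)

# normalize='index' divides every row by its own total,
# so each row holds the probabilities of leaving that value
probs = pd.crosstab(pairs['from'], pairs['to'], normalize='index')
print(probs)
```

Here rows are the source value, so every row sums to 1. A value that only appears as the very last observation has no outgoing transition and therefore gets no row.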
