Mutual Information Criteria

The Mutual Information Score, often denoted as MIS, expresses the extent to which observed frequency of co-occurrence differs from what we would expect (statistically speaking). In statistically pure terms this is a measure of the strength of association between words x and y.

See below for a few resources.

  • R1: Detailed video tutorial step by step.

  • R2: Detailed tutorial with python code.

  • R3: Possible libraries in python/R and other tools.

  • R5: Efficient pairwise MIS implementation…

  • M1: Idenetification of antibiotics pairs that evade …. manuscript.

Lets import the main libraries

32 # Generic
33 import warnings
34 import numpy as np
35 import pandas as pd
36
37 # Specific
38 from itertools import combinations
39 from timeit import default_timer as timer
40 from scipy.stats.contingency import crosstab
41 from sklearn.metrics import mutual_info_score
42 from sklearn.metrics import adjusted_mutual_info_score
43 from sklearn.metrics import normalized_mutual_info_score
44
45 # Own
46 from mic import mutual_info_matrix_v3
47 from mic import mutual_info_matrix_v2
48 from mic import mutual_info_matrix_v1
49
50 warnings.filterwarnings("ignore")
51
52 def print_example_heading(n, t=''):
53     print("\n" + "=" * 80 + "\nExample %s\n"%n + "=" * 80)
54
55 try:
56     __file__
57     TERMINAL = True
58 except:
59     TERMINAL = False

a) Manual example (youtube)

Lets start with a hard coded example extracted from a very detailed youtube tutorial (R1). This video tutorial shows step by step how to compute the mutual information score using the contingency matrix defined below. Pay special attention to the following consideration when implementing the MIS:

  • only possible to compute where more than one class present

  • log(0) raises a zero division.

  • lim x->0 log(x) = 0

  • this np.nan can be filled with 0.

 76 # See: https://www.youtube.com/watch?v=eJIp_mgVLwE
 77
 78 # Contingency
 79 ct = np.array([[3/5, 1/5], [0/5, 1/5]])
 80
 81 # Compute MIS manually
 82 mi1 = (3/5)*np.log((3/5) / ((3/5)*(4/5)))
 83 #mi2 = (0/5)*np.log((0/5) / ((3/5)*(1/5))) # zero division
 84 mi3 = (1/5)*np.log((1/5) / ((2/5)*(4/5)))
 85 mi4 = (1/5)*np.log((1/5) / ((2/5)*(1/5)))
 86 m1 = np.array([[mi1, mi3], [0, mi4]])
 87 score1 = mi1 + mi3 + mi4 # 0.22
 88
 89 # Compute component information matrix
 90 m2 = mutual_info_matrix_v1(ct=ct)
 91 m3 = mutual_info_matrix_v2(ct=ct)
 92 m4 = mutual_info_matrix_v3(ct=ct)
 93
 94 # .. note: Raises a math domain error.
 95 # Compute MIS scikits
 96 #score4 = mutual_info_score(labels_true=None,
 97 #                           labels_pred=None,
 98 #                           contingency=ct)
 99
100 # Cumu
101 cumu = pd.DataFrame([
102     ['manual'] + m1.flatten().tolist(),
103     ['mutual_info_matrix_v1'] + m2.flatten().tolist(),
104     ['mutual_info_matrix_v2'] + m3.flatten().tolist(),
105     ['mutual_info_matrix_v3'] + m4.flatten().tolist()
106 ], columns=['method', 'c11', 'c12', 'c21', 'c22'])
107
108 # Compute MIS score
109 cumu['mis'] = cumu.sum(axis=1)

Lets see the contingency matrix

113 if TERMINAL:
114     print_example_heading(n=1)
115     print('\nContingency:')
116     print(ct)
117 pd.DataFrame(ct)
0 1
0 0.6 0.2
1 0.0 0.2


Lets see the results

121 if TERMINAL:
122     print("\nResults:")
123     print(cumu)
124 cumu
method c11 c12 c21 c22 mis
0 manual 0.133886 -0.094001 0.0 0.183258 0.223144
1 mutual_info_matrix_v1 NaN NaN NaN NaN 0.000000
2 mutual_info_matrix_v2 0.133886 -0.094001 0.0 0.183258 0.223144
3 mutual_info_matrix_v3 0.133886 -0.094001 0.0 0.183258 0.223144


Note

The method mutual_info_matrix_v1 does not work in this example!

b) Another two class example

In the previous example we started with the definition of the contingency matrix. However, that is not often the case. In this example we will go one step back and show how to compute the contingency matrix from the raw vectors using either scipy o pandas. Note that the contingency matrix is just a way to display the frequency distribution of the variables.

140 # Generate the dataset
141 x = np.array([
142     ['S1', 'S2'],
143     ['S1', 'R2'],
144     ['R1', 'S2'],
145     ['R1', 'R2']])
146 d = np.repeat(x, [63, 22, 15, 25], axis=0)
147 d = pd.DataFrame(data=d)
148
149 # Create variables
150 x = d[0]
151 y = d[1]
152
153 # Compute contingency
154 #ct = crosstab(d[0], d[1]).count
155 ct = pd.crosstab(x, y)
156
157 # Compute MIS
158 score0 = mutual_info_score(labels_true=x, labels_pred=y)
159
160 # Compute MIS
161 m1 = mutual_info_matrix_v1(x=x, y=y)
162 m2 = mutual_info_matrix_v2(x=x, y=y)
163 m3 = mutual_info_matrix_v3(x=x, y=y)
164
165 # Compute MIS
166 m4 = mutual_info_matrix_v1(ct=ct)
167 m5 = mutual_info_matrix_v2(ct=ct)
168 m6 = mutual_info_matrix_v3(ct=ct)
169
170 # Cumu
171 cumu = pd.DataFrame([
172     #['mutual_info_score'] + m1.flatten().tolist(),
173     ['mutual_info_matrix_v1 (x,y)'] + m1.flatten().tolist(),
174     ['mutual_info_matrix_v2 (x,y)'] + m2.flatten().tolist(),
175     ['mutual_info_matrix_v3 (x,y)'] + m3.flatten().tolist(),
176     ['mutual_info_matrix_v1 (ct)'] + m4.flatten().tolist(),
177     ['mutual_info_matrix_v2 (ct)'] + m5.flatten().tolist(),
178     ['mutual_info_matrix_v3 (ct)'] + m6.flatten().tolist(),
179 ], columns=['method', 'c11', 'c12', 'c21', 'c22'])
180
181 # Compute MIS score
182 cumu['mis'] = cumu.sum(axis=1)

Lets see the contingency matrix

186 if TERMINAL:
187     print_example_heading(n=2)
188     print('\nContingency:')
189     print(ct)
190 ct
1 R2 S2
0
R1 25 15
S1 22 63


Lets see the results

194 if TERMINAL:
195     print("\nResults:")
196     print(cumu)
197 cumu
method c11 c12 c21 c22 mis
0 mutual_info_matrix_v1 (x,y) 0.101633 -0.061107 -0.065726 0.086733 0.061532
1 mutual_info_matrix_v2 (x,y) 0.101633 -0.061107 -0.065726 0.086733 0.061532
2 mutual_info_matrix_v3 (x,y) 0.101633 -0.061107 -0.065726 0.086733 0.061532
3 mutual_info_matrix_v1 (ct) 0.101633 -0.065726 -0.061107 0.086733 0.061532
4 mutual_info_matrix_v2 (ct) 0.101633 -0.061107 -0.065726 0.086733 0.061532
5 mutual_info_matrix_v3 (ct) 0.101633 -0.061107 -0.065726 0.086733 0.061532


c) Collateral Resistance Index

Now, lets compute the MIS score as defined in the manuscript (M1). Note that the manuscript provided the cumulative data as appendix material and therefore we can use it to compare that our implementation produces the same result.

Note

The results provided by our own MIS implementation differs from the results provided in the manuscript. This discrepancy occurs for those rows in which the contingency matrix contains one of more zeros.

213 def collateral_resistance_index(m):
214     """Collateral Resistance Index
215
216     The collateral resistance index is based on the mutual
217     information matrix. This implementation assumes there
218     are two classes resistant (R) and sensitive (S).
219
220     Parameters
221     ----------
222     m: np.array
223         A numpy array with the mutual information matrix.
224
225     Returns
226     -------
227     """
228     return (m[0, 0] + m[1, 1]) - (m[0, 1] + m[1, 0])
229
230 def CRI(x, func):
231     ct = np.array([[x.S1S2, x.S1R2], [x.R1S2, x.R1R2]])
232     m = func(ct=ct)
233     return collateral_resistance_index(m)
234
235 def compare(data, x, y):
236     return data[x].round(5).compare(data[y].round(5)).index.values
237
238 # Load data
239 data = pd.read_excel('./data/mmc2.xlsx')
240
241 # .. note: MIS_v1 is inspired by the implementation in sklearn. For some
242 #          reason, when one of the values of the contigency matrix is 0
243 #          it returns an array with three values and thus raises an error.
244
245 # Compute MIC score ourselves
246 #data['MIS_v1'] = data.apply(CRI, args=(mutual_info_matrix_v1,), axis=1)
247 data['MIS_v2'] = data.apply(CRI, args=(mutual_info_matrix_v2,), axis=1)
248 data['MIS_v3'] = data.apply(CRI, args=(mutual_info_matrix_v3,), axis=1)
249
250 # Compute indexes of those that do not give same result.
251 idxs1 = compare(data, 'MIS', 'MIS_v3')

Lets see the data

255 if TERMINAL:
256     print_example_heading(n=3)
257     print("\nData:")
258     print(data)
259 data.iloc[:, 3:]
Antibiotic_2 S1S2 S1R2 R1S2 R1R2 MIS P-value MIS_v2 MIS_v3
0 AMPICILLIN 13849 5657 10 3110 0.295015 0.000000e+00 0.295015 0.295015
1 AMPICILLIN_SULBACTAM 15264 4234 43 3076 0.331826 0.000000e+00 0.331826 0.331826
2 AZTREONAM 63 22 15 25 0.315198 1.370511e-04 0.315198 0.315198
3 CEFAZOLIN 8992 264 741 712 0.236398 0.000000e+00 0.236398 0.236398
4 CEFEPIME 14272 112 2211 123 0.024380 1.073382e-44 0.024380 0.024380
... ... ... ... ... ... ... ... ... ...
875 TETRACYCLINE 703 2047 86 923 0.125936 5.609167e-34 0.125936 0.125936
876 VANCOMYCIN 3632 32 1167 57 0.029895 5.091803e-15 0.029895 0.029895
877 TETRACYCLINE 156 696 222 689 -0.060360 2.076596e-03 -0.060360 -0.060360
878 VANCOMYCIN 1064 31 1134 22 -0.009254 1.650263e-01 -0.009254 -0.009254
879 VANCOMYCIN 904 10 3307 81 0.008185 1.354200e-02 0.008185 0.008185

880 rows × 9 columns



Lets see where the results are different

263 if TERMINAL:
264     print("\nAre they equal? Show differences below:")
265     print(data.iloc[idxs1, :])
266 data.iloc[idxs1, 3:]
Antibiotic_2 S1S2 S1R2 R1S2 R1R2 MIS P-value MIS_v2 MIS_v3
54 CEFEPIME 8850 0 1032 245 0.091413 3.878054e-229 0.090710 0.090710
241 AZTREONAM 5 0 1 10 NaN NaN 0.629259 0.629259
246 CEFUROXIME 32 2 2 20 NaN NaN 0.685774 0.685774
251 MEROPENEM 25 0 3 0 NaN NaN 0.000000 0.000000
286 CEFTAZIDIME 2646 0 72 54 0.095338 3.297768e-78 0.093018 0.093018
387 MEROPENEM 602 0 14 5 0.051227 4.602569e-09 0.042614 0.042614
492 MEROPENEM 922 1 0 31 0.152671 3.238929e-56 0.145842 0.145842
502 NITROFURANTOIN 1784 1979 0 34 0.014705 4.577273e-09 0.014099 0.014099
624 AZTREONAM 4 0 0 0 NaN NaN 0.000000 0.000000
626 CEFTAZIDIME 1459 7 0 1 0.009526 1.244097e-04 0.004865 0.004865
627 CEFTRIAXONE 1459 8 1 0 0.004484 2.025597e-02 -0.000011 -0.000011
628 CEFUROXIME 48 0 3 0 NaN NaN 0.000000 0.000000
631 LEVOFLOXACIN 1272 145 1 0 0.002209 2.793473e-01 -0.000220 -0.000220
632 PIPERACILLIN_TAZOBACTAM 1456 1 1 0 0.005266 6.147526e-03 -0.000001 -0.000001
634 TOBRAMYCIN 1427 60 2 0 0.002809 1.561921e-01 -0.000164 -0.000164
713 AMPICILLIN 1850 276 0 17 0.031664 3.211012e-15 0.029332 0.029332
715 AZTREONAM 538 6 14 0 0.006651 2.074790e-01 -0.000814 -0.000814
723 PIPERACILLIN_TAZOBACTAM 1907 2 15 0 0.003155 3.483272e-02 -0.000024 -0.000024
756 PIPERACILLIN_TAZOBACTAM 1764 4 40 0 0.002473 1.314297e-01 -0.000147 -0.000147
786 PIPERACILLIN_TAZOBACTAM 1343 3 38 0 0.003181 1.362082e-01 -0.000179 -0.000179


d) Exploring the efficiency

This code is used to compare whether the implementations are more or less efficient between them. Note that the methods have itself some limitations.

276 # Generate data
277 N = 10000000
278 choices = np.arange(2)
279 vector1 = np.random.choice(choices, size=N)
280 vector2 = np.random.choice(choices, size=N)
281
282 # Compute times
283 t1 = timer()
284 m1 = mutual_info_matrix_v1(x=vector1, y=vector2)
285 t2 = timer()
286 m2 = mutual_info_matrix_v2(x=vector1, y=vector2)
287 t3 = timer()
288 m3 = mutual_info_matrix_v3(x=vector1, y=vector2)
289 t4 = timer()
290
291 # Display
292 print_example_heading(n=4)
293 print("Are the results equal (m1, m2)? %s" % np.allclose(m1, m2))
294 print("Are the results equal (m1, m3)? %s" % np.allclose(m1, m3))
295 print("time v1: %.5f" % (t2-t1))
296 print("time v2: %.5f" % (t3-t2))
297 print("time v3: %.5f" % (t4-t3))

Out:

================================================================================
Example 4
================================================================================
Are the results equal (m1, m2)? True
Are the results equal (m1, m3)? True
time v1: 1.60590
time v2: 1.37441
time v3: 1.35685

e) Edge scenarios

There are some edge scenarios which we might have or might have not considered yet. We are including some of them here for future reference and some interesting questions below.

  • What is the CRI range? (-0.7, 0.7)

  • Should we normalize this value? [-1, 1]? [0, 1]?

  • How to compute CRI if we have three outcomes R, S and I?

311 # Heading
312 print_example_heading(n=5)
313
314 # Create cases
315 data = [
316     (['R', 'R', 'R', 'R'], ['R', 'R', 'R', 'R']),
317     (['R', 'R', 'R', 'R'], ['S', 'S', 'S', 'S']),
318     (['R', 'R', 'S', 'S'], ['R', 'R', 'S', 'S']),
319     (['R', 'R', 'S', 'S'], ['S', 'S', 'R', 'R']),
320     (['R', 'I', 'S', 'S'], ['R', 'I', 'S', 'S'])
321 ]
322
323 # Results
324 cumu = []
325
326 # Loop
327 for i, (x, y) in enumerate(data):
328
329     # Compute mutual information scores
330     mis = mutual_info_score(x, y)
331     misa = adjusted_mutual_info_score(x, y)
332     misn = normalized_mutual_info_score(x, y)
333
334     # Compute mutual information matrix
335     m = mutual_info_matrix_v1(x=x, y=y)
336
337     # Compute collateral resistance index
338     try:
339         cri = collateral_resistance_index(m)
340     except Exception as e:
341         print(e)
342         cri = None
343
344     # Append
345     cumu.append([x, y, mis, misa, misn, cri])
346
347     # Show
348     print("\n%s. Contingency matrix:" % i)
349     print(m)
350
351
352 # Create the dataframe
353 df = pd.DataFrame(cumu,
354     columns=['x', 'y', 'mis', 'mis_adjusted', 'mis_normalized', 'cri'])

Out:

================================================================================
Example 5
================================================================================
'float' object is not subscriptable

0. Contingency matrix:
0.0
'float' object is not subscriptable

1. Contingency matrix:
0.0
too many indices for array: array is 1-dimensional, but 2 were indexed

2. Contingency matrix:
[0.35 0.35]
too many indices for array: array is 1-dimensional, but 2 were indexed

3. Contingency matrix:
[0.35 0.35]
too many indices for array: array is 1-dimensional, but 2 were indexed

4. Contingency matrix:
[0.35 0.35 0.35]

Lets see the summary of edge cases

358 if TERMINAL:
359     print("\nSummary of edge scenarios:")
360     print(df)
361 df
x y mis mis_adjusted mis_normalized cri
0 [R, R, R, R] [R, R, R, R] 0.000000 1.0 1.0 None
1 [R, R, R, R] [S, S, S, S] 0.000000 1.0 1.0 None
2 [R, R, S, S] [R, R, S, S] 0.693147 1.0 1.0 None
3 [R, R, S, S] [S, S, R, R] 0.693147 1.0 1.0 None
4 [R, I, S, S] [R, I, S, S] 1.039721 1.0 1.0 None


f) For continuous variables

There are several approaches, one of them is just binning. For more information just check online, there are many good resources and or implementations that might be found out there.

371 # Heading
372 print_example_heading(n=6)
373
374 bins = 5 #?
375
376 def f(X, Y, bins):
377     c_XY = np.histogram2d(X, Y, bins)[0]
378     c_X = np.histogram(X, bins)[0]
379     c_Y = np.histogram(Y, bins)[0]
380     return 1

Out:

================================================================================
Example 6
================================================================================

g) Computing pairwise score

Let’s see how we can compute the mutual information square in a pairwise fashion.

389 def f1(x, y):
390     # Compute mutual information matrix
391     m = mutual_info_matrix_v3(x=x, y=y)
392     # Compute collateral resistance index
393     cri = collateral_resistance_index(m)
394     # Return
395     return cri
396
397 # Generate data
398 data = np.random.choice(['S', 'R'], size=(100, 4))
399
400 # Convert into DataFrame
401 df = pd.DataFrame(data,
402     columns=['C%d' % i for i in range(data.shape[1])])
403
404 # Option I
405 # --------
406 # Create empty matrix
407 cols = data.shape[1]
408 matrix = np.empty((cols, cols))
409 matrix[:] = np.nan
410
411 # Compute pairwise (square matrix)
412 for ix in np.arange(cols):
413     for jx in np.arange(ix+1, cols):
414         matrix[ix,jx] = f1(data[:,ix], data[:,jx])
415
416 # Convert to DataFrame for visualisation
417 matrix = pd.DataFrame(matrix,
418     index=df.columns, columns=df.columns)

Lets see the summary of pairwise distances

422 if TERMINAL:
423     # Heading
424     print_example_heading(n=7)
425     print("\nSummary of pairwise computations:")
426     print(matrix)
427 matrix
428
429 # Option II
430 # ----------
431 #for i, j in list(combinations(df.columns, 2)):
C0 C1 C2 C3
C0 NaN -0.069964 -0.009999 -0.053950
C1 NaN NaN -0.030000 -0.224257
C2 NaN NaN NaN -0.093784
C3 NaN NaN NaN NaN


h) Example with more than 2 classes

Let’s see how it works for more than two classes.

Note

The computation using mutual_info_matrix_v2 should not work because it is designed for 2 classes. However, while it does not work for low number of samples (e.g. 5), it works for larger values (e.g. 100).

443 # .. note:: The computation using mutual_info_matrix_v3 which is inspired
444 #           by sklearn returns an array of length 5 when the number of
445 #           samples is low. However, it works when large number of samples
446 #           is used.
447
448 # Generate data
449 data = np.random.choice(['S', 'R', 'I'], size=(100, 2))
450
451 # Convert into DataFrame
452 df = pd.DataFrame(data,
453     columns=['C%d' % i for i in range(data.shape[1])])
454
455 # Compute
456 m1 = mutual_info_matrix_v1(x=df.C0, y=df.C1)
457 m2 = mutual_info_matrix_v2(x=df.C0, y=df.C1)
458 m3 = mutual_info_matrix_v3(x=df.C0, y=df.C1)
459
460 # Show
461 print_example_heading(n=8)
462 print("Result m1:")
463 print(m1)
464 print("\nResult m2:")
465 print(m2)
466 print("\nResult m3:")
467 print(m3)
468
469 print("\n")
470 #print("Are the results equal (m1, m2)? %s" % np.allclose(m1, m2))
471 print("Are the results equal (m1, m3)? %s" % np.allclose(m1, m3))

Out:

================================================================================
Example 8
================================================================================
Result m1:
[[ 0.02  0.01 -0.02]
 [ 0.01 -0.03  0.04]
 [-0.02  0.04 -0.01]]

Result m2:
[[ 0.02  0.01 -0.02]
 [ 0.01 -0.03  0.04]
 [-0.02  0.04 -0.01]]

Result m3:
[[ 0.02  0.01 -0.02]
 [ 0.01 -0.03  0.04]
 [-0.02  0.04 -0.01]]


Are the results equal (m1, m3)? True

Total running time of the script: ( 0 minutes 4.830 seconds)

Gallery generated by Sphinx-Gallery