Mutual Information Criteria
The Mutual Information Score (MIS) expresses the extent to which the observed frequency of co-occurrence differs from what we would expect statistically. In purely statistical terms, it is a measure of the strength of association between two variables x and y.
See below for a few resources.
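For two discrete variables X and Y with joint probabilities p(x, y) and marginals p(x) and p(y), the score computed throughout this example is

$$\mathrm{MIS}(X, Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$$

Each cell of the contingency matrix contributes one term of this sum; the mutual information matrix used below simply stores these per-cell terms.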
Let's import the main libraries.
# Generic
import warnings
import numpy as np
import pandas as pd

# Specific
from itertools import combinations
from timeit import default_timer as timer
from scipy.stats.contingency import crosstab
from sklearn.metrics import mutual_info_score
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.metrics import normalized_mutual_info_score

# Own
from mic import mutual_info_matrix_v3
from mic import mutual_info_matrix_v2
from mic import mutual_info_matrix_v1

warnings.filterwarnings("ignore")

def print_example_heading(n, t=''):
    print("\n" + "=" * 80 + "\nExample %s\n" % n + "=" * 80)

try:
    __file__
    TERMINAL = True
except NameError:
    TERMINAL = False
a) Manual example (YouTube)
Let's start with a hard-coded example extracted from a very detailed YouTube tutorial (R1). The video shows step by step how to compute the mutual information score using the contingency matrix defined below. Pay special attention to the following considerations when implementing the MIS:
the score is only defined when more than one class is present;
log(0) is undefined (in Python it raises a math domain error);
however, lim x->0 of x*log(x) = 0;
so the resulting np.nan can be filled with 0.
# See: https://www.youtube.com/watch?v=eJIp_mgVLwE

# Contingency
ct = np.array([[3/5, 1/5], [0/5, 1/5]])

# Compute MIS manually
mi1 = (3/5)*np.log((3/5) / ((3/5)*(4/5)))
#mi2 = (0/5)*np.log((0/5) / ((3/5)*(1/5)))  # log(0) undefined; term treated as 0
mi3 = (1/5)*np.log((1/5) / ((2/5)*(4/5)))
mi4 = (1/5)*np.log((1/5) / ((2/5)*(1/5)))
m1 = np.array([[mi1, mi3], [0, mi4]])
score1 = mi1 + mi3 + mi4  # ~0.22

# Compute component information matrix
m2 = mutual_info_matrix_v1(ct=ct)
m3 = mutual_info_matrix_v2(ct=ct)
m4 = mutual_info_matrix_v3(ct=ct)

# .. note: Raises a math domain error.
# Compute MIS with scikit-learn
#score4 = mutual_info_score(labels_true=None,
#                           labels_pred=None,
#                           contingency=ct)

# Cumulative results
cumu = pd.DataFrame([
    ['manual'] + m1.flatten().tolist(),
    ['mutual_info_matrix_v1'] + m2.flatten().tolist(),
    ['mutual_info_matrix_v2'] + m3.flatten().tolist(),
    ['mutual_info_matrix_v3'] + m4.flatten().tolist()
], columns=['method', 'c11', 'c12', 'c21', 'c22'])

# Compute MIS score
cumu['mis'] = cumu.sum(axis=1, numeric_only=True)
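For reference, the zero-handling described above can also be written in vectorized form. The following sketch is not part of the original script; it reuses the ct defined above (joint probabilities) and should reproduce the manual score of roughly 0.22.

# Vectorized sketch of the manual computation above
px = ct.sum(axis=1, keepdims=True)       # row marginals p(x)
py = ct.sum(axis=0, keepdims=True)       # column marginals p(y)
with np.errstate(divide='ignore', invalid='ignore'):
    terms = ct * np.log(ct / (px * py))  # per-cell contributions
terms = np.nan_to_num(terms)             # 0 * log(0) -> 0
print(terms.sum())                       # expected to be ~0.22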
Let's see the contingency matrix.
if TERMINAL:
    print_example_heading(n=1)
    print('\nContingency:')
    print(ct)
pd.DataFrame(ct)
Let's see the results.
if TERMINAL:
    print("\nResults:")
    print(cumu)
cumu
Note
The method mutual_info_matrix_v1 does not work in this example!
b) Another two-class example
In the previous example we started from the definition of the contingency matrix. However, that is often not the case. In this example we go one step back and show how to compute the contingency matrix from the raw vectors using either scipy or pandas. Note that the contingency matrix is just a way to display the frequency distribution of the variables.
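As a quick illustration, both routes give the same table of counts. The toy vectors below are hypothetical and only serve to show the two calls; scipy's crosstab returns a result object whose count attribute holds the array.

# Hypothetical toy vectors, for illustration only
a = ['R', 'R', 'S', 'S', 'R']
b = ['S', 'R', 'S', 'S', 'R']
print(pd.crosstab(pd.Series(a, name='x'), pd.Series(b, name='y')))  # pandas (labelled DataFrame)
print(crosstab(a, b).count)                                         # scipy (plain array of counts)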
# Generate the dataset
x = np.array([
    ['S1', 'S2'],
    ['S1', 'R2'],
    ['R1', 'S2'],
    ['R1', 'R2']])
d = np.repeat(x, [63, 22, 15, 25], axis=0)
d = pd.DataFrame(data=d)

# Create variables
x = d[0]
y = d[1]

# Compute contingency
#ct = crosstab(d[0], d[1]).count
ct = pd.crosstab(x, y)

# Compute MIS
score0 = mutual_info_score(labels_true=x, labels_pred=y)

# Compute the mutual information matrix from the raw vectors
m1 = mutual_info_matrix_v1(x=x, y=y)
m2 = mutual_info_matrix_v2(x=x, y=y)
m3 = mutual_info_matrix_v3(x=x, y=y)

# Compute the mutual information matrix from the contingency matrix
m4 = mutual_info_matrix_v1(ct=ct)
m5 = mutual_info_matrix_v2(ct=ct)
m6 = mutual_info_matrix_v3(ct=ct)

# Cumulative results
cumu = pd.DataFrame([
    #['mutual_info_score'] + m1.flatten().tolist(),
    ['mutual_info_matrix_v1 (x,y)'] + m1.flatten().tolist(),
    ['mutual_info_matrix_v2 (x,y)'] + m2.flatten().tolist(),
    ['mutual_info_matrix_v3 (x,y)'] + m3.flatten().tolist(),
    ['mutual_info_matrix_v1 (ct)'] + m4.flatten().tolist(),
    ['mutual_info_matrix_v2 (ct)'] + m5.flatten().tolist(),
    ['mutual_info_matrix_v3 (ct)'] + m6.flatten().tolist(),
], columns=['method', 'c11', 'c12', 'c21', 'c22'])

# Compute MIS score
cumu['mis'] = cumu.sum(axis=1, numeric_only=True)
Let's see the contingency matrix.
if TERMINAL:
    print_example_heading(n=2)
    print('\nContingency:')
    print(ct)
ct
Let's see the results.
if TERMINAL:
    print("\nResults:")
    print(cumu)
cumu
c) Collateral Resistance Index
Now, let's compute the MIS score as defined in the manuscript (M1). Note that the manuscript provides the cumulative data as appendix material, so we can use it to check that our implementation produces the same results.
Note
The results produced by our own MIS implementation differ from those reported in the manuscript. This discrepancy occurs for rows in which the contingency matrix contains one or more zeros.
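In terms of the 2x2 mutual information matrix M computed below (with the classes ordered S, R as in the code), the index is the sum of the concordant terms minus the sum of the discordant ones:

$$\mathrm{CRI} = \big(M_{SS} + M_{RR}\big) - \big(M_{SR} + M_{RS}\big)$$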
def collateral_resistance_index(m):
    """Collateral Resistance Index

    The collateral resistance index is based on the mutual
    information matrix. This implementation assumes there
    are two classes: resistant (R) and sensitive (S).

    Parameters
    ----------
    m: np.array
        A numpy array with the mutual information matrix.

    Returns
    -------
    float
        The collateral resistance index.
    """
    return (m[0, 0] + m[1, 1]) - (m[0, 1] + m[1, 0])

def CRI(x, func):
    ct = np.array([[x.S1S2, x.S1R2], [x.R1S2, x.R1R2]])
    m = func(ct=ct)
    return collateral_resistance_index(m)

def compare(data, x, y):
    return data[x].round(5).compare(data[y].round(5)).index.values

# Load data
data = pd.read_excel('./data/mmc2.xlsx')

# .. note: MIS_v1 is inspired by the implementation in sklearn. For some
#          reason, when one of the values of the contingency matrix is 0
#          it returns an array with three values and thus raises an error.

# Compute MIS score ourselves
#data['MIS_v1'] = data.apply(CRI, args=(mutual_info_matrix_v1,), axis=1)
data['MIS_v2'] = data.apply(CRI, args=(mutual_info_matrix_v2,), axis=1)
data['MIS_v3'] = data.apply(CRI, args=(mutual_info_matrix_v3,), axis=1)

# Compute the indexes of those rows that do not give the same result.
idxs1 = compare(data, 'MIS', 'MIS_v3')
Let's see the data.
if TERMINAL:
    print_example_heading(n=3)
    print("\nData:")
    print(data)
data.iloc[:, 3:]
Let's see where the results are different.
if TERMINAL:
    print("\nAre they equal? Show differences below:")
    print(data.iloc[idxs1, :])
data.iloc[idxs1, 3:]
d) Exploring the efficiency
This code compares how efficient the different implementations are relative to one another. Note that each method has its own limitations.
# Generate data
N = 10000000
choices = np.arange(2)
vector1 = np.random.choice(choices, size=N)
vector2 = np.random.choice(choices, size=N)

# Compute times
t1 = timer()
m1 = mutual_info_matrix_v1(x=vector1, y=vector2)
t2 = timer()
m2 = mutual_info_matrix_v2(x=vector1, y=vector2)
t3 = timer()
m3 = mutual_info_matrix_v3(x=vector1, y=vector2)
t4 = timer()

# Display
print_example_heading(n=4)
print("Are the results equal (m1, m2)? %s" % np.allclose(m1, m2))
print("Are the results equal (m1, m3)? %s" % np.allclose(m1, m3))
print("time v1: %.5f" % (t2-t1))
print("time v2: %.5f" % (t3-t2))
print("time v3: %.5f" % (t4-t3))
Out:
================================================================================
Example 4
================================================================================
Are the results equal (m1, m2)? True
Are the results equal (m1, m3)? True
time v1: 1.60590
time v2: 1.37441
time v3: 1.35685
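Single-run timings like these can be noisy. A possible refinement, not part of the original script, is to repeat each measurement a few times with timeit and keep the best run:

from timeit import repeat

# Repeat each implementation a few times and report the best run
for name, func in [('v1', mutual_info_matrix_v1),
                   ('v2', mutual_info_matrix_v2),
                   ('v3', mutual_info_matrix_v3)]:
    best = min(repeat(lambda: func(x=vector1, y=vector2), number=1, repeat=3))
    print("time %s (best of 3): %.5f" % (name, best))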
e) Edge scenarios
There are some edge scenarios which we might or might not have considered yet. We include some of them here for future reference, together with some interesting questions:
What is the CRI range? (-0.7, 0.7); see the sketch below.
Should we normalize this value? [-1, 1]? [0, 1]?
How to compute CRI if we have three outcomes R, S and I?
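Regarding the first question, here is a quick sketch (not part of the original script) of where the (-0.7, 0.7) range comes from: with two balanced classes, perfect agreement and perfect disagreement should give roughly +ln(2) and -ln(2), i.e. about ±0.69, assuming the zero cells are filled with 0 as discussed earlier.

# Perfect agreement vs perfect disagreement with balanced classes
ct_max = np.array([[0.5, 0.0], [0.0, 0.5]])
ct_min = np.array([[0.0, 0.5], [0.5, 0.0]])
print(collateral_resistance_index(mutual_info_matrix_v3(ct=ct_max)))  # expected ~ +0.693 (ln 2)
print(collateral_resistance_index(mutual_info_matrix_v3(ct=ct_min)))  # expected ~ -0.693 (-ln 2)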
# Heading
print_example_heading(n=5)

# Create cases
data = [
    (['R', 'R', 'R', 'R'], ['R', 'R', 'R', 'R']),
    (['R', 'R', 'R', 'R'], ['S', 'S', 'S', 'S']),
    (['R', 'R', 'S', 'S'], ['R', 'R', 'S', 'S']),
    (['R', 'R', 'S', 'S'], ['S', 'S', 'R', 'R']),
    (['R', 'I', 'S', 'S'], ['R', 'I', 'S', 'S'])
]

# Results
cumu = []

# Loop
for i, (x, y) in enumerate(data):

    # Compute mutual information scores
    mis = mutual_info_score(x, y)
    misa = adjusted_mutual_info_score(x, y)
    misn = normalized_mutual_info_score(x, y)

    # Compute mutual information matrix
    m = mutual_info_matrix_v1(x=x, y=y)

    # Compute collateral resistance index
    try:
        cri = collateral_resistance_index(m)
    except Exception as e:
        print(e)
        cri = None

    # Append
    cumu.append([x, y, mis, misa, misn, cri])

    # Show
    print("\n%s. Mutual information matrix:" % i)
    print(m)


# Create the dataframe
df = pd.DataFrame(cumu,
    columns=['x', 'y', 'mis', 'mis_adjusted', 'mis_normalized', 'cri'])
Out:
================================================================================
Example 5
================================================================================
'float' object is not subscriptable
0. Mutual information matrix:
0.0
'float' object is not subscriptable
1. Mutual information matrix:
0.0
too many indices for array: array is 1-dimensional, but 2 were indexed
2. Mutual information matrix:
[0.35 0.35]
too many indices for array: array is 1-dimensional, but 2 were indexed
3. Mutual information matrix:
[0.35 0.35]
too many indices for array: array is 1-dimensional, but 2 were indexed
4. Mutual information matrix:
[0.35 0.35 0.35]
Let's see the summary of edge cases.
if TERMINAL:
    print("\nSummary of edge scenarios:")
    print(df)
df
f) For continuous variables
There are several approaches; one of them is simply binning. For more information, check online; there are many good resources and implementations out there.
# Heading
print_example_heading(n=6)

bins = 5  # number of bins (arbitrary choice)

def f(X, Y, bins):
    # Bin both variables; the joint histogram is a contingency table,
    # so the MIS can be computed directly from it. The marginal
    # histograms are kept for reference but are not needed here.
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]
    return mutual_info_score(None, None, contingency=c_XY)
Out:
================================================================================
Example 6
================================================================================
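A hypothetical usage of the helper above on two correlated continuous variables; the variable names and parameters are illustrative only.

# Hypothetical usage of f on correlated continuous data
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = X + rng.normal(scale=0.5, size=1000)
print(f(X, Y, bins))   # MIS of the binned variables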
g) Computing pairwise score
Let's see how we can compute the mutual information score in a pairwise fashion.
def f1(x, y):
    # Compute mutual information matrix
    m = mutual_info_matrix_v3(x=x, y=y)
    # Compute collateral resistance index
    cri = collateral_resistance_index(m)
    # Return
    return cri

# Generate data
data = np.random.choice(['S', 'R'], size=(100, 4))

# Convert into DataFrame
df = pd.DataFrame(data,
    columns=['C%d' % i for i in range(data.shape[1])])

# Option I
# --------
# Create empty matrix
cols = data.shape[1]
matrix = np.empty((cols, cols))
matrix[:] = np.nan

# Compute pairwise (square matrix)
for ix in np.arange(cols):
    for jx in np.arange(ix+1, cols):
        matrix[ix, jx] = f1(data[:, ix], data[:, jx])

# Convert to DataFrame for visualisation
matrix = pd.DataFrame(matrix,
    index=df.columns, columns=df.columns)
Let's see the summary of the pairwise computations.
if TERMINAL:
    # Heading
    print_example_heading(n=7)
    print("\nSummary of pairwise computations:")
    print(matrix)
matrix

# Option II
# ----------
#for i, j in list(combinations(df.columns, 2)):
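A sketch of how Option II could be completed, using the combinations import from the top of the script; this completion is an assumption, since the original leaves the loop commented out.

# Option II (sketch): iterate over column pairs with itertools.combinations
matrix2 = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
for i, j in combinations(df.columns, 2):
    matrix2.loc[i, j] = f1(df[i], df[j])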
h) Example with more than 2 classes
Let’s see how it works for more than two classes.
Note
The computation using mutual_info_matrix_v2 should not work because it is designed for two classes. However, while it fails for a low number of samples (e.g. 5), it does work for larger sample sizes (e.g. 100).
# .. note:: The computation using mutual_info_matrix_v3, which is inspired
#           by sklearn, returns an array of length 5 when the number of
#           samples is low. However, it works when a large number of samples
#           is used.

# Generate data
data = np.random.choice(['S', 'R', 'I'], size=(100, 2))

# Convert into DataFrame
df = pd.DataFrame(data,
    columns=['C%d' % i for i in range(data.shape[1])])

# Compute
m1 = mutual_info_matrix_v1(x=df.C0, y=df.C1)
m2 = mutual_info_matrix_v2(x=df.C0, y=df.C1)
m3 = mutual_info_matrix_v3(x=df.C0, y=df.C1)

# Show
print_example_heading(n=8)
print("Result m1:")
print(m1)
print("\nResult m2:")
print(m2)
print("\nResult m3:")
print(m3)

print("\n")
#print("Are the results equal (m1, m2)? %s" % np.allclose(m1, m2))
print("Are the results equal (m1, m3)? %s" % np.allclose(m1, m3))
Out:
================================================================================
Example 8
================================================================================
Result m1:
[[ 0.02 0.01 -0.02]
[ 0.01 -0.03 0.04]
[-0.02 0.04 -0.01]]
Result m2:
[[ 0.02 0.01 -0.02]
[ 0.01 -0.03 0.04]
[-0.02 0.04 -0.01]]
Result m3:
[[ 0.02 0.01 -0.02]
[ 0.01 -0.03 0.04]
[-0.02 0.04 -0.01]]
Are the results equal (m1, m3)? True
Total running time of the script: ( 0 minutes 4.830 seconds)