Mutual Information Criteria
The Mutual Information Score (MIS) expresses the extent to which the observed frequency of co-occurrence differs from what we would expect statistically. In purely statistical terms, it is a measure of the strength of association between two variables x and y.
See below for a few resources.
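For two discrete variables X and Y with joint probabilities p(x, y) and marginals p(x) and p(y), the score computed throughout this example is

$$\mathrm{MIS}(X, Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$$

Each cell of the contingency matrix contributes one term of this sum; the mutual information matrix used below simply stores these per-cell terms.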
Let's import the main libraries.
# Generic
import warnings
import numpy as np
import pandas as pd

# Specific
from itertools import combinations
from timeit import default_timer as timer
from scipy.stats.contingency import crosstab
from sklearn.metrics import mutual_info_score
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.metrics import normalized_mutual_info_score

# Own
from mic import mutual_info_matrix_v3
from mic import mutual_info_matrix_v2
from mic import mutual_info_matrix_v1

warnings.filterwarnings("ignore")

def print_example_heading(n, t=''):
    print("\n" + "=" * 80 + "\nExample %s\n" % n + "=" * 80)

try:
    __file__
    TERMINAL = True
except NameError:
    TERMINAL = False
a) Manual example (YouTube)
Let's start with a hard-coded example extracted from a very detailed YouTube tutorial (R1). The video shows step by step how to compute the mutual information score using the contingency matrix defined below. Pay special attention to the following considerations when implementing the MIS:
the score is only defined when more than one class is present;
log(0) is undefined (in Python it raises a math domain error);
however, lim x->0 of x*log(x) = 0;
so the resulting np.nan can be filled with 0.
# See: https://www.youtube.com/watch?v=eJIp_mgVLwE

# Contingency
ct = np.array([[3/5, 1/5], [0/5, 1/5]])

# Compute MIS manually
mi1 = (3/5)*np.log((3/5) / ((3/5)*(4/5)))
#mi2 = (0/5)*np.log((0/5) / ((3/5)*(1/5)))  # log(0) undefined; term treated as 0
mi3 = (1/5)*np.log((1/5) / ((2/5)*(4/5)))
mi4 = (1/5)*np.log((1/5) / ((2/5)*(1/5)))
m1 = np.array([[mi1, mi3], [0, mi4]])
score1 = mi1 + mi3 + mi4  # ~0.22

# Compute component information matrix
m2 = mutual_info_matrix_v1(ct=ct)
m3 = mutual_info_matrix_v2(ct=ct)
m4 = mutual_info_matrix_v3(ct=ct)

# .. note: Raises a math domain error.
# Compute MIS with scikit-learn
#score4 = mutual_info_score(labels_true=None,
#                           labels_pred=None,
#                           contingency=ct)

# Cumulative results
cumu = pd.DataFrame([
    ['manual'] + m1.flatten().tolist(),
    ['mutual_info_matrix_v1'] + m2.flatten().tolist(),
    ['mutual_info_matrix_v2'] + m3.flatten().tolist(),
    ['mutual_info_matrix_v3'] + m4.flatten().tolist()
], columns=['method', 'c11', 'c12', 'c21', 'c22'])

# Compute MIS score
cumu['mis'] = cumu.sum(axis=1, numeric_only=True)
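For reference, the zero-handling described above can also be written in vectorized form. The following sketch is not part of the original script; it reuses the ct defined above (joint probabilities) and should reproduce the manual score of roughly 0.22.

# Vectorized sketch of the manual computation above
px = ct.sum(axis=1, keepdims=True)       # row marginals p(x)
py = ct.sum(axis=0, keepdims=True)       # column marginals p(y)
with np.errstate(divide='ignore', invalid='ignore'):
    terms = ct * np.log(ct / (px * py))  # per-cell contributions
terms = np.nan_to_num(terms)             # 0 * log(0) -> 0
print(terms.sum())                       # expected to be ~0.22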
Let's see the contingency matrix.
if TERMINAL:
    print_example_heading(n=1)
    print('\nContingency:')
    print(ct)
pd.DataFrame(ct)
Let's see the results.
if TERMINAL:
    print("\nResults:")
    print(cumu)
cumu
Note
The method mutual_info_matrix_v1 does not work in this example!
b) Another two-class example
In the previous example we started from the definition of the contingency matrix. However, that is often not the case. In this example we go one step back and show how to compute the contingency matrix from the raw vectors using either scipy or pandas. Note that the contingency matrix is just a way to display the frequency distribution of the variables.
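As a quick illustration, both routes give the same table of counts. The toy vectors below are hypothetical and only serve to show the two calls; scipy's crosstab returns a result object whose count attribute holds the array.

# Hypothetical toy vectors, for illustration only
a = ['R', 'R', 'S', 'S', 'R']
b = ['S', 'R', 'S', 'S', 'R']
print(pd.crosstab(pd.Series(a, name='x'), pd.Series(b, name='y')))  # pandas (labelled DataFrame)
print(crosstab(a, b).count)                                         # scipy (plain array of counts)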
# Generate the dataset
x = np.array([
    ['S1', 'S2'],
    ['S1', 'R2'],
    ['R1', 'S2'],
    ['R1', 'R2']])
d = np.repeat(x, [63, 22, 15, 25], axis=0)
d = pd.DataFrame(data=d)

# Create variables
x = d[0]
y = d[1]

# Compute contingency
#ct = crosstab(d[0], d[1]).count
ct = pd.crosstab(x, y)

# Compute MIS
score0 = mutual_info_score(labels_true=x, labels_pred=y)

# Compute the mutual information matrix from the raw vectors
m1 = mutual_info_matrix_v1(x=x, y=y)
m2 = mutual_info_matrix_v2(x=x, y=y)
m3 = mutual_info_matrix_v3(x=x, y=y)

# Compute the mutual information matrix from the contingency matrix
m4 = mutual_info_matrix_v1(ct=ct)
m5 = mutual_info_matrix_v2(ct=ct)
m6 = mutual_info_matrix_v3(ct=ct)

# Cumulative results
cumu = pd.DataFrame([
    #['mutual_info_score'] + m1.flatten().tolist(),
    ['mutual_info_matrix_v1 (x,y)'] + m1.flatten().tolist(),
    ['mutual_info_matrix_v2 (x,y)'] + m2.flatten().tolist(),
    ['mutual_info_matrix_v3 (x,y)'] + m3.flatten().tolist(),
    ['mutual_info_matrix_v1 (ct)'] + m4.flatten().tolist(),
    ['mutual_info_matrix_v2 (ct)'] + m5.flatten().tolist(),
    ['mutual_info_matrix_v3 (ct)'] + m6.flatten().tolist(),
], columns=['method', 'c11', 'c12', 'c21', 'c22'])

# Compute MIS score
cumu['mis'] = cumu.sum(axis=1, numeric_only=True)
Let's see the contingency matrix.
if TERMINAL:
    print_example_heading(n=2)
    print('\nContingency:')
    print(ct)
ct
Let's see the results.
if TERMINAL:
    print("\nResults:")
    print(cumu)
cumu
c) Collateral Resistance Index
Now, let's compute the MIS score as defined in the manuscript (M1). Note that the manuscript provides the cumulative data as appendix material, so we can use it to check that our implementation produces the same results.
Note
The results produced by our own MIS implementation differ from those reported in the manuscript. This discrepancy occurs for rows in which the contingency matrix contains one or more zeros.
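In terms of the 2x2 mutual information matrix M computed below (with the classes ordered S, R as in the code), the index is the sum of the concordant terms minus the sum of the discordant ones:

$$\mathrm{CRI} = \big(M_{SS} + M_{RR}\big) - \big(M_{SR} + M_{RS}\big)$$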
def collateral_resistance_index(m):
    """Collateral Resistance Index

    The collateral resistance index is based on the mutual
    information matrix. This implementation assumes there
    are two classes: resistant (R) and sensitive (S).

    Parameters
    ----------
    m: np.array
        A numpy array with the mutual information matrix.

    Returns
    -------
    float
        The collateral resistance index.
    """
    return (m[0, 0] + m[1, 1]) - (m[0, 1] + m[1, 0])

def CRI(x, func):
    ct = np.array([[x.S1S2, x.S1R2], [x.R1S2, x.R1R2]])
    m = func(ct=ct)
    return collateral_resistance_index(m)

def compare(data, x, y):
    return data[x].round(5).compare(data[y].round(5)).index.values

# Load data
data = pd.read_excel('./data/mmc2.xlsx')

# .. note: MIS_v1 is inspired by the implementation in sklearn. For some
#          reason, when one of the values of the contingency matrix is 0
#          it returns an array with three values and thus raises an error.

# Compute MIS score ourselves
#data['MIS_v1'] = data.apply(CRI, args=(mutual_info_matrix_v1,), axis=1)
data['MIS_v2'] = data.apply(CRI, args=(mutual_info_matrix_v2,), axis=1)
data['MIS_v3'] = data.apply(CRI, args=(mutual_info_matrix_v3,), axis=1)

# Compute the indexes of those rows that do not give the same result.
idxs1 = compare(data, 'MIS', 'MIS_v3')
Let's see the data.
if TERMINAL:
    print_example_heading(n=3)
    print("\nData:")
    print(data)
data.iloc[:, 3:]
Let's see where the results are different.
if TERMINAL:
    print("\nAre they equal? Show differences below:")
    print(data.iloc[idxs1, :])
data.iloc[idxs1, 3:]
d) Exploring the efficiency
This code compares how efficient the different implementations are relative to one another. Note that each method has its own limitations.
# Generate data
N = 10000000
choices = np.arange(2)
vector1 = np.random.choice(choices, size=N)
vector2 = np.random.choice(choices, size=N)

# Compute times
t1 = timer()
m1 = mutual_info_matrix_v1(x=vector1, y=vector2)
t2 = timer()
m2 = mutual_info_matrix_v2(x=vector1, y=vector2)
t3 = timer()
m3 = mutual_info_matrix_v3(x=vector1, y=vector2)
t4 = timer()

# Display
print_example_heading(n=4)
print("Are the results equal (m1, m2)? %s" % np.allclose(m1, m2))
print("Are the results equal (m1, m3)? %s" % np.allclose(m1, m3))
print("time v1: %.5f" % (t2-t1))
print("time v2: %.5f" % (t3-t2))
print("time v3: %.5f" % (t4-t3))
Out:
================================================================================
Example 4
================================================================================
Are the results equal (m1, m2)? True
Are the results equal (m1, m3)? True
time v1: 1.60590
time v2: 1.37441
time v3: 1.35685
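Single-run timings like these can be noisy. A possible refinement, not part of the original script, is to repeat each measurement a few times with timeit and keep the best run:

from timeit import repeat

# Repeat each implementation a few times and report the best run
for name, func in [('v1', mutual_info_matrix_v1),
                   ('v2', mutual_info_matrix_v2),
                   ('v3', mutual_info_matrix_v3)]:
    best = min(repeat(lambda: func(x=vector1, y=vector2), number=1, repeat=3))
    print("time %s (best of 3): %.5f" % (name, best))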
e) Edge scenarios
There are some edge scenarios which we might or might not have considered yet. We include some of them here for future reference, together with some interesting questions:
What is the CRI range? (-0.7, 0.7); see the sketch below.
Should we normalize this value? [-1, 1]? [0, 1]?
How to compute CRI if we have three outcomes R, S and I?
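Regarding the first question, here is a quick sketch (not part of the original script) of where the (-0.7, 0.7) range comes from: with two balanced classes, perfect agreement and perfect disagreement should give roughly +ln(2) and -ln(2), i.e. about ±0.69, assuming the zero cells are filled with 0 as discussed earlier.

# Perfect agreement vs perfect disagreement with balanced classes
ct_max = np.array([[0.5, 0.0], [0.0, 0.5]])
ct_min = np.array([[0.0, 0.5], [0.5, 0.0]])
print(collateral_resistance_index(mutual_info_matrix_v3(ct=ct_max)))  # expected ~ +0.693 (ln 2)
print(collateral_resistance_index(mutual_info_matrix_v3(ct=ct_min)))  # expected ~ -0.693 (-ln 2)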
# Heading
print_example_heading(n=5)

# Create cases
data = [
    (['R', 'R', 'R', 'R'], ['R', 'R', 'R', 'R']),
    (['R', 'R', 'R', 'R'], ['S', 'S', 'S', 'S']),
    (['R', 'R', 'S', 'S'], ['R', 'R', 'S', 'S']),
    (['R', 'R', 'S', 'S'], ['S', 'S', 'R', 'R']),
    (['R', 'I', 'S', 'S'], ['R', 'I', 'S', 'S'])
]

# Results
cumu = []

# Loop
for i, (x, y) in enumerate(data):

    # Compute mutual information scores
    mis = mutual_info_score(x, y)
    misa = adjusted_mutual_info_score(x, y)
    misn = normalized_mutual_info_score(x, y)

    # Compute mutual information matrix
    m = mutual_info_matrix_v1(x=x, y=y)

    # Compute collateral resistance index
    try:
        cri = collateral_resistance_index(m)
    except Exception as e:
        print(e)
        cri = None

    # Append
    cumu.append([x, y, mis, misa, misn, cri])

    # Show
    print("\n%s. Mutual information matrix:" % i)
    print(m)


# Create the dataframe
df = pd.DataFrame(cumu,
    columns=['x', 'y', 'mis', 'mis_adjusted', 'mis_normalized', 'cri'])
Out:
================================================================================
Example 5
================================================================================
'float' object is not subscriptable
0. Mutual information matrix:
0.0
'float' object is not subscriptable
1. Mutual information matrix:
0.0
too many indices for array: array is 1-dimensional, but 2 were indexed
2. Mutual information matrix:
[0.35 0.35]
too many indices for array: array is 1-dimensional, but 2 were indexed
3. Mutual information matrix:
[0.35 0.35]
too many indices for array: array is 1-dimensional, but 2 were indexed
4. Mutual information matrix:
[0.35 0.35 0.35]
Let's see the summary of edge cases.
if TERMINAL:
    print("\nSummary of edge scenarios:")
    print(df)
df
f) For continuous variables
There are several approaches; one of them is simply binning. For more information, check online; there are many good resources and implementations out there.
# Heading
print_example_heading(n=6)

bins = 5  # number of bins (arbitrary choice)

def f(X, Y, bins):
    # Bin both variables; the joint histogram is a contingency table,
    # so the MIS can be computed directly from it. The marginal
    # histograms are kept for reference but are not needed here.
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]
    return mutual_info_score(None, None, contingency=c_XY)
Out:
================================================================================
Example 6
================================================================================
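A hypothetical usage of the helper above on two correlated continuous variables; the variable names and parameters are illustrative only.

# Hypothetical usage of f on correlated continuous data
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = X + rng.normal(scale=0.5, size=1000)
print(f(X, Y, bins))   # MIS of the binned variables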
g) Computing pairwise score
Let's see how we can compute the mutual information score in a pairwise fashion.
def f1(x, y):
    # Compute mutual information matrix
    m = mutual_info_matrix_v3(x=x, y=y)
    # Compute collateral resistance index
    cri = collateral_resistance_index(m)
    # Return
    return cri

# Generate data
data = np.random.choice(['S', 'R'], size=(100, 4))

# Convert into DataFrame
df = pd.DataFrame(data,
    columns=['C%d' % i for i in range(data.shape[1])])

# Option I
# --------
# Create empty matrix
cols = data.shape[1]
matrix = np.empty((cols, cols))
matrix[:] = np.nan

# Compute pairwise (square matrix)
for ix in np.arange(cols):
    for jx in np.arange(ix+1, cols):
        matrix[ix, jx] = f1(data[:, ix], data[:, jx])

# Convert to DataFrame for visualisation
matrix = pd.DataFrame(matrix,
    index=df.columns, columns=df.columns)
Let's see the summary of the pairwise computations.
if TERMINAL:
    # Heading
    print_example_heading(n=7)
    print("\nSummary of pairwise computations:")
    print(matrix)
matrix

# Option II
# ----------
#for i, j in list(combinations(df.columns, 2)):
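A sketch of how Option II could be completed, using the combinations import from the top of the script; this completion is an assumption, since the original leaves the loop commented out.

# Option II (sketch): iterate over column pairs with itertools.combinations
matrix2 = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
for i, j in combinations(df.columns, 2):
    matrix2.loc[i, j] = f1(df[i], df[j])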
h) Example with more than 2 classes
Let’s see how it works for more than two classes.
Note
The computation using mutual_info_matrix_v2 should not work because it is designed for two classes. However, while it fails for a low number of samples (e.g. 5), it does work for larger sample sizes (e.g. 100).
# .. note:: The computation using mutual_info_matrix_v3, which is inspired
#           by sklearn, returns an array of length 5 when the number of
#           samples is low. However, it works when a large number of samples
#           is used.

# Generate data
data = np.random.choice(['S', 'R', 'I'], size=(100, 2))

# Convert into DataFrame
df = pd.DataFrame(data,
    columns=['C%d' % i for i in range(data.shape[1])])

# Compute
m1 = mutual_info_matrix_v1(x=df.C0, y=df.C1)
m2 = mutual_info_matrix_v2(x=df.C0, y=df.C1)
m3 = mutual_info_matrix_v3(x=df.C0, y=df.C1)

# Show
print_example_heading(n=8)
print("Result m1:")
print(m1)
print("\nResult m2:")
print(m2)
print("\nResult m3:")
print(m3)

print("\n")
#print("Are the results equal (m1, m2)? %s" % np.allclose(m1, m2))
print("Are the results equal (m1, m3)? %s" % np.allclose(m1, m3))
Out:
================================================================================
Example 8
================================================================================
Result m1:
[[ 0.02 0.01 -0.02]
[ 0.01 -0.03 0.04]
[-0.02 0.04 -0.01]]
Result m2:
[[ 0.02 0.01 -0.02]
[ 0.01 -0.03 0.04]
[-0.02 0.04 -0.01]]
Result m3:
[[ 0.02 0.01 -0.02]
[ 0.01 -0.03 0.04]
[-0.02 0.04 -0.01]]
Are the results equal (m1, m3)? True
Total running time of the script: ( 0 minutes 4.830 seconds)