Step 04 - TSA to estimate trends
Warning
Pending!
In this example, we will learn about important concepts such as time series decomposition, autocorrelation, moving averages, and forecasting methods, from basic least squares models to more complex approaches such as ARIMA (AutoRegressive Integrated Moving Average). We will also explore various statistical tools commonly used for time series analysis, empowering you to harness the power of time-dependent data and extract meaningful information from it. Get ready to embark on a brief yet exciting journey into the world of time series analysis!
The diagram in Figure 4.7 describes the methodology suggested to estimate secular trends in AMR from susceptibility test data. Since data corruption might occur in clinical environments, susceptibility test records that were wrongly reported (human or device errors) or duplicated should be discarded. The remaining records are often divided into combinations (tuples defined by sample type or specimen, pathogen and antimicrobial) for which a resistance time series signal needs to be generated using either independent or overlapping time intervals (see xxx). The time series were linearly interpolated to fill sporadic missing values; no additional filters or preprocessing steps were applied. An analysis of stationarity around a trend was carried out to identify interesting combinations, and regression analysis was used to quantify their tendency.
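As a minimal sketch of the gap-filling step, linear interpolation of sporadic missing values can be reproduced with pandas (the dates and values below are illustrative, not taken from the study data):

```python
import numpy as np
import pandas as pd

# Illustrative monthly resistance index with two sporadic gaps.
idx = pd.date_range('2020-01-01', periods=6, freq='MS')
res = pd.Series([0.20, np.nan, 0.26, 0.29, np.nan, 0.35], index=idx)

# Linear interpolation fills each gap with the value halfway
# between its neighbours (gaps are single, equally spaced points).
filled = res.interpolate(method='linear')
print(filled.round(2).tolist())  # [0.2, 0.23, 0.26, 0.29, 0.32, 0.35]
```

Longer runs of missing values would be filled along the same straight line, which is why this step is only applied to sporadic gaps.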
The linear model (see Equation 4.3) has been selected to quantify resistance tendency for several reasons: (i) the development of resistance in pathogens is an evolutionary response to the selective pressure of antimicrobials, hence large variations in short periods (e.g. consecutive days or months) are not expected; (ii) the slope parameter can be directly translated to change over time, increasing its practicability; and (iii) the offset parameter is highly related to the overall resistance. Hence, the response variable in regression analysis (the resistance index) is described by the explanatory variable (time). The slope (m) ranges within the interval [-1, 1], where its sign and absolute value capture the direction and rate of change respectively. The unit of the slope is ∆y/∆x.
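To illustrate how the slope and offset are read, here is a noise-free sketch using numpy (the values are made up, and pyamr fits the model through statsmodels rather than numpy.polyfit):

```python
import numpy as np

# Illustrative yearly resistance index with a small upward trend.
t = np.arange(10)             # time (years)
r = 0.30 + 0.012 * t          # synthetic, noise-free series

# Fit y = m*t + n. The slope m is the change in resistance per unit
# time (∆y/∆x); the offset n approximates the overall (baseline)
# resistance at t=0.
m, n = np.polyfit(t, r, deg=1)
print(round(m, 3), round(n, 3))  # 0.012 0.3
```

A positive m indicates resistance is increasing; its magnitude gives the rate of change per time step.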
Least Squares Regression
The optimization problem in ordinary least squares or OLS regression minimizes the squared errors to find the best fitting model. OLS assumes identical weights (wi) and independently, normally distributed residuals. It is frequent to observe that some residuals have higher variance than others, meaning that those observations are effectively less certain. To account for such variability, weighted least squares or WLS regression (see Equation 4.4) applies a weighting function to the residuals. The confidence of the computed resistance index (the observed variable) relies on the number of susceptibility test records used. Hence, the sigmoid function has been used to define weights proportional to the population size.

Warning
Include equation and/or external reference.
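Pending the exact equation, a generic logistic (sigmoid) weighting over sample counts could look as follows; the function name, midpoint and rate below are illustrative placeholders, not the parameterization of pyamr's SigmoidA:

```python
import numpy as np

def sigmoid_weights(n_samples, midpoint=100, rate=0.05):
    """Logistic weights in (0, 1): combinations backed by more
    susceptibility tests get weights closer to 1. The midpoint and
    rate are illustrative, not those of pyamr's SigmoidA."""
    n = np.asarray(n_samples, dtype=float)
    return 1.0 / (1.0 + np.exp(-rate * (n - midpoint)))

# Few records -> low weight; many records -> weight near 1.
w = sigmoid_weights([10, 100, 500])
print(w.round(3))
```

Such weights down-weight resistance indexes computed from few susceptibility tests, so uncertain observations contribute less to the fit.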
Autoregressive Integrated Moving Average or ARIMA
An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model which can also be applied in scenarios where the data show evidence of non-stationarity. The autoregressive (AR) part expresses the variable of interest (the resistance index) as a function of its past values. The moving average (MA) part indicates that the regression error is a linear combination of error terms occurring contemporaneously and at various times in the past. An ARIMA(p,d,q) model is described by three parameters: p is the number of autoregressive terms, d is the number of differences needed for stationarity and q is the number of lagged forecast errors.
Warning
Include equation and/or external reference.
The interpretation of the parameter µ depends on the ARIMA model used for the fitting. In order to estimate the linear trend, it was interesting to consider exclusively MA models, so that the expected value of µ was the mean of the once-differenced series; that is, the slope coefficient of the un-differenced series. The Bayesian information criterion (BIC) was used to select the best ARIMA(0,1,q) model, with the model with the lowest BIC preferred.
Let's load the libraries and create the time series.
# Libraries
import sys
import warnings
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.robust.norms as norms

# Import pyamr
from pyamr.datasets.load import make_timeseries

# Filter warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# ----------------------------
# set basic configuration
# ----------------------------
# Matplotlib options
mpl.rc('legend', fontsize=6)
mpl.rc('xtick', labelsize=6)
mpl.rc('ytick', labelsize=6)

# Set pandas configuration.
pd.set_option('display.max_colwidth', 14)
pd.set_option('display.width', 150)
pd.set_option('display.precision', 4)

# ----------------------------
# create data
# ----------------------------
# Create timeseries data
x, y, f = make_timeseries()
Example using WLS
Libraries
# Libraries
from pyamr.core.regression.wls import WLSWrapper
from pyamr.metrics.weights import SigmoidA

# Create method to compute weights from frequencies
W = SigmoidA(r=200, g=0.5, offset=0.0, scale=1.0)

# Note that fit will call W.weights(f) internally and will store the
# weight converter W in the instance. Hence, the call below is
# equivalent to passing <weights=W.weights(f)>, with the only
# difference being that the weight converter would not be saved.
wls = WLSWrapper(estimator=sm.WLS).fit( \
    exog=x, endog=y, trend='c', weights=f,
    W=W, missing='raise')
All information can be seen as a series
# Print series.
print("\nSeries:")
print(wls.as_series())
Series:
wls-rsquared 0.5042
wls-rsquared_adj 0.4992
wls-fvalue 99.6762
wls-fprob 0.0
wls-aic inf
wls-bic inf
wls-llf -inf
wls-mse_model 153748.2628
wls-mse_resid 1542.4777
wls-mse_total 3079.9099
wls-const_coef 269.2284
wls-const_std 18.0069
wls-const_tvalue 14.9514
wls-const_tprob 0.0
wls-const_cil 233.4942
wls-const_ciu 304.9625
wls-x1_coef 2.5201
wls-x1_std 0.2524
wls-x1_tvalue 9.9838
wls-x1_tprob 0.0
wls-x1_cil 2.0192
wls-x1_ciu 3.0211
wls-s_dw 0.548
wls-s_jb_value 16.984
wls-s_jb_prob 0.0002
wls-s_skew 0.355
wls-s_kurtosis 4.89
wls-s_omnibus_value 9.863
wls-s_omnibus_prob 0.007
wls-m_dw 0.1566
wls-m_jb_value 7.0686
wls-m_jb_prob 0.0292
wls-m_skew -0.6131
wls-m_kurtosis 3.4392
wls-m_nm_value 7.5327
wls-m_nm_prob 0.0231
wls-m_ks_value 0.6094
wls-m_ks_prob 0.0
wls-m_shp_value 0.939
wls-m_shp_prob 0.0002
wls-m_ad_value 2.7507
wls-m_ad_nnorm False
wls-missing raise
wls-exog [[1.0, 0.0...
wls-endog [26.961124...
wls-trend c
wls-weights [0.0423348...
wls-W <pyamr.met...
wls-model <statsmode...
wls-id WLS(c,Sig(...
dtype: object
Or a summary
# Print summary.
print("\nSummary:")
print(wls.as_summary())
Summary:
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.504
Model: WLS Adj. R-squared: 0.499
Method: Least Squares F-statistic: 99.68
Date: Thu, 15 Jun 2023 Prob (F-statistic): 1.31e-16
Time: 18:15:44 Log-Likelihood: -inf
No. Observations: 100 AIC: inf
Df Residuals: 98 BIC: inf
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 269.2284 18.007 14.951 0.000 233.494 304.962
x1 2.5201 0.252 9.984 0.000 2.019 3.021
==============================================================================
Omnibus: 9.863 Durbin-Watson: 0.548
Prob(Omnibus): 0.007 Jarque-Bera (JB): 16.984
Skew: 0.355 Prob(JB): 0.000205
Kurtosis: 4.890 Cond. No. 228.
Normal (N): 7.533 Prob(N): 0.023
==============================================================================
The regression line can be seen as
# Print regression line.
print("\nRegression line:")
print(wls.line(np.arange(10)))
Regression line:
[269.22835846 271.74850753 274.26865659 276.78880566 279.30895472
281.82910379 284.34925285 286.86940192 289.38955098 291.90970005]
Let's see how to make predictions using the wrapper and how to plot the results. In addition, it compares the intervals provided by get_prediction (confidence intervals) and the intervals provided by wls_prediction_std (prediction intervals).
# -----------------------------
# Predictions
# -----------------------------
# Variables.
start, end = None, 180

# Compute predictions (exogenous?). It returns a 2D array
# where the rows contain the time (t), the mean, the lower
# and upper confidence (or prediction?) interval.
preds = wls.get_prediction(start=start, end=end)

# Create figure
fig, ax = plt.subplots(1, 1, figsize=(11, 5))

# -----------------------------
# Plotting confidence intervals
# -----------------------------
# Plot truth values.
ax.plot(x, y, color='#A6CEE3', alpha=0.5, marker='o',
        markeredgecolor='k', markeredgewidth=0.5,
        markersize=5, linewidth=0.75, label='Observed')

# Plot forecasted values.
ax.plot(preds[0, :], preds[1, :], color='#FF0000', alpha=1.00,
        linewidth=2.0, label=wls._identifier(short=True))

# Plot the confidence intervals.
ax.fill_between(preds[0, :],
                preds[2, :],
                preds[3, :],
                color='r',
                alpha=0.1)

# Legend
plt.legend()

# Show
plt.show()
Example using ARIMA
Note
Since we are using ARIMA, we should have checked for stationarity!
# Libraries
# Import ARIMA from statsmodels.
from statsmodels.tsa.arima.model import ARIMA
# Import ARIMA wrapper from pyamr.
from pyamr.core.regression.arima import ARIMAWrapper

# ----------------------------
# create data
# ----------------------------
# Create timeseries data
x, y, f = make_timeseries()

# Create exogenous variable
exog = x

# ----------------------------
# fit the model
# ----------------------------
# Create specific arima model.
arima = ARIMAWrapper(estimator=ARIMA).fit( \
    endog=y[:80], order=(1, 0, 0), trend='c', disp=0)
All information can be seen as a series
# Print series
print("\nSeries:")
print(arima.as_series())
Series:
arima-aic 761.509
arima-bic 768.6551
arima-hqic 764.3741
arima-llf -377.7545
arima-const_coef 266.5951
arima-const_std 165.575
arima-const_tvalue 1.6101
arima-const_tprob 0.1074
arima-const_cil -57.9259
arima-const_ciu 591.1161
arima-ar.L1_coef 0.9917
arima-ar.L1_std 0.019
arima-ar.L1_tvalue 52.1312
arima-ar.L1_tprob 0.0
arima-ar.L1_cil 0.9544
arima-ar.L1_ciu 1.029
arima-sigma2_coef 702.5324
arima-sigma2_std 92.619
arima-sigma2_tvalue 7.5852
arima-sigma2_tprob 0.0
arima-sigma2_cil 521.0025
arima-sigma2_ciu 884.0623
arima-m_dw 1.735
arima-m_jb_value 2049.8936
arima-m_jb_prob 0.0
arima-m_skew -3.8991
arima-m_kurtosis 26.5405
arima-m_nm_value 101.5915
arima-m_nm_prob 0.0
arima-m_ks_value 0.588
arima-m_ks_prob 0.0
arima-m_shp_value 0.6822
arima-m_shp_prob 0.0
arima-m_ad_value 4.6133
arima-m_ad_nnorm False
arima-converged True
arima-endog [9.1093314...
arima-order (1, 0, 0)
arima-trend c
arima-disp 0
arima-model <statsmode...
arima-id ARIMA(1, 0...
dtype: object
Or a summary
# Print summary.
print("\nSummary:")
print(arima.as_summary())
Summary:
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 80
Model: ARIMA(1, 0, 0) Log Likelihood -377.754
Date: Thu, 15 Jun 2023 AIC 761.509
Time: 18:15:44 BIC 768.655
Sample: 0 HQIC 764.374
- 80
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 266.5951 165.575 1.610 0.107 -57.926 591.116
ar.L1 0.9917 0.019 52.131 0.000 0.954 1.029
sigma2 702.5324 92.619 7.585 0.000 521.002 884.062
==============================================================================
Manual
------------------------------------------------------------------------------
Omnibus: 0.000 Durbin-Watson: 1.735
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2049.894
Skew: -3.899 Prob(JB): 0.000
Kurtosis_m: 26.541 Cond No:
Normal (N): 101.591 Prob(N): 0.000
==============================================================================
Let's see how to make predictions using the previously fitted wrapper, and how to plot the resulting data for visualization purposes. It shows two different types of predictions: (i) dynamic predictions, in which each prediction is based on the previously predicted values (note that for an ARIMA(0,1,1) model this returns a line) and (ii) non-dynamic predictions, in which each prediction is based on the real values of the time series, no matter what the prediction was for those values.
# ----------------
# Predictions
# ----------------
# Variables.
s, e = 50, 120

# Compute predictions
preds_1 = arima.get_prediction(start=s, end=e, dynamic=False)
preds_2 = arima.get_prediction(start=s, end=e, dynamic=True)

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# ----------------
# Plot non-dynamic
# ----------------
# Plot truth values.
axes[0].plot(y, color='#A6CEE3', alpha=0.5, marker='o',
             markeredgecolor='k', markeredgewidth=0.5,
             markersize=5, linewidth=0.75, label='Observed')

# Plot forecasted values.
axes[0].plot(preds_1[0, :], preds_1[1, :], color='#FF0000', alpha=1.00,
             linewidth=2.0, label=arima._identifier())

# Plot the confidence intervals.
axes[0].fill_between(preds_1[0, :], preds_1[2, :],
                     preds_1[3, :],
                     color='#FF0000',
                     alpha=0.25)

# ------------
# Plot dynamic
# ------------
# Plot truth values.
axes[1].plot(y, color='#A6CEE3', alpha=0.5, marker='o',
             markeredgecolor='k', markeredgewidth=0.5,
             markersize=5, linewidth=0.75, label='Observed')

# Plot forecasted values.
axes[1].plot(preds_2[0, :], preds_2[1, :], color='#FF0000', alpha=1.00,
             linewidth=2.0, label=arima._identifier())

# Plot the confidence intervals.
axes[1].fill_between(preds_2[0, :], preds_2[2, :],
                     preds_2[3, :],
                     color='#FF0000',
                     alpha=0.25)

# Configure axes
axes[0].set_title("ARIMA non-dynamic")
axes[1].set_title("ARIMA dynamic")

# Format axes
axes[0].grid(True, linestyle='--', linewidth=0.25)
axes[1].grid(True, linestyle='--', linewidth=0.25)

# Legend
axes[0].legend()
axes[1].legend()

# Show
plt.show()
Total running time of the script: (0 minutes 0.301 seconds)