Step 04 - TSA to estimate trends
Warning
Pending!
In this example, we will learn about important concepts such as time series decomposition, autocorrelation, moving averages, and forecasting methods, from basic least squares models to more complex approaches such as ARIMA (AutoRegressive Integrated Moving Average). We will also explore various statistical tools commonly used for time series analysis, empowering you to harness the power of time-dependent data and extract meaningful information from it. Get ready to embark on a brief yet exciting journey into the world of time series analysis!
The diagram in Figure 4.7 describes the methodology suggested to estimate secular trends in AMR from susceptibility test data. Since data corruption might occur in clinical environments, susceptibility test records that were wrongly reported (human or device errors) or duplicated should be discarded. The remaining records are often divided into combinations (tuples defined by sample type or specimen, pathogen and antimicrobial) for which a resistance time series signal needs to be generated using either independent or overlapping time intervals (see xxx). The time series were linearly interpolated to fill sporadic missing values; no additional filters or preprocessing steps were applied. An analysis of stationarity around a trend was carried out to identify interesting combinations, and regression analysis was used to quantify their tendency.
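As a minimal sketch of the gap-filling step, linear interpolation of sporadic missing values can be reproduced with pandas (the dates and values below are illustrative, not taken from the study data):

```python
import numpy as np
import pandas as pd

# Illustrative monthly resistance index with two sporadic gaps.
idx = pd.date_range('2020-01-01', periods=6, freq='MS')
res = pd.Series([0.20, np.nan, 0.26, 0.29, np.nan, 0.35], index=idx)

# Linear interpolation fills each gap with the value halfway
# between its neighbours (gaps are single, equally spaced points).
filled = res.interpolate(method='linear')
print(filled.round(2).tolist())  # [0.2, 0.23, 0.26, 0.29, 0.32, 0.35]
```

Longer runs of missing values would be filled along the same straight line, which is why this step is only applied to sporadic gaps.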
The linear model (see Equation 4.3) has been selected to quantify resistance tendency for several reasons: (i) the development of resistance in pathogens is an evolutionary response to the selective pressure of antimicrobials, hence large variations in short periods (e.g. consecutive days or months) are not expected; (ii) the slope parameter can be directly translated to change over time, increasing its practicability; and (iii) the offset parameter is highly related to the overall resistance. Hence, the response variable in regression analysis (the resistance index) is described by the explanatory variable (time). The slope (m) ranges within the interval [-1, 1], where its sign and absolute value capture the direction and rate of change respectively. The unit of the slope is ∆y/∆x.
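To illustrate how the slope and offset are read, here is a noise-free sketch using numpy (the values are made up, and pyamr fits the model through statsmodels rather than numpy.polyfit):

```python
import numpy as np

# Illustrative yearly resistance index with a small upward trend.
t = np.arange(10)             # time (years)
r = 0.30 + 0.012 * t          # synthetic, noise-free series

# Fit y = m*t + n. The slope m is the change in resistance per unit
# time (∆y/∆x); the offset n approximates the overall (baseline)
# resistance at t=0.
m, n = np.polyfit(t, r, deg=1)
print(round(m, 3), round(n, 3))  # 0.012 0.3
```

A positive m indicates resistance is increasing; its magnitude gives the rate of change per time step.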
Least Squares Regression
The optimization problem in ordinary least squares or OLS regression minimizes the squared errors to find the best fitting model. OLS assumes identical weights (wi) and independently, normally distributed residuals. It is frequent to observe that some residuals have higher variance than others, meaning that those observations are effectively less certain. To account for such variability, weighted least squares or WLS regression (see Equation 4.4) applies a weighting function to the residuals. The confidence of the computed resistance index (the observed variable) relies on the number of susceptibility test records used. Hence, the sigmoid function has been used to define weights proportional to the population size.

Warning
Include equation and/or external reference.
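Pending the exact equation, a generic logistic (sigmoid) weighting over sample counts could look as follows; the function name, midpoint and rate below are illustrative placeholders, not the parameterization of pyamr's SigmoidA:

```python
import numpy as np

def sigmoid_weights(n_samples, midpoint=100, rate=0.05):
    """Logistic weights in (0, 1): combinations backed by more
    susceptibility tests get weights closer to 1. The midpoint and
    rate are illustrative, not those of pyamr's SigmoidA."""
    n = np.asarray(n_samples, dtype=float)
    return 1.0 / (1.0 + np.exp(-rate * (n - midpoint)))

# Few records -> low weight; many records -> weight near 1.
w = sigmoid_weights([10, 100, 500])
print(w.round(3))
```

Such weights down-weight resistance indexes computed from few susceptibility tests, so uncertain observations contribute less to the fit.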
Autoregressive Integrated Moving Average or ARIMA
An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model which can also be applied in scenarios where the data show evidence of non-stationarity. The autoregressive (AR) part expresses the variable of interest (the resistance index) as a function of its past values. The moving average (MA) part indicates that the regression error is a linear combination of error terms occurring contemporaneously and at various times in the past. An ARIMA(p,d,q) model is described by three parameters: p is the number of autoregressive terms, d is the number of differences needed for stationarity and q is the number of lagged forecast errors.
Warning
Include equation and/or external reference.
The interpretation of the parameter µ depends on the ARIMA model used for the fitting. In order to estimate the linear trend, it was interesting to consider exclusively MA models, so that the expected value of µ was the mean of the once-differenced series; that is, the slope coefficient of the un-differenced series. The Bayesian information criterion (BIC) was used to select the best ARIMA(0,1,q) model, with the model with the lowest BIC preferred.
Let's load the libraries and create the time series.
# Libraries
import sys
import warnings
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.robust.norms as norms

# Import pyamr
from pyamr.datasets.load import make_timeseries

# Filter warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# ----------------------------
# set basic configuration
# ----------------------------
# Matplotlib options
mpl.rc('legend', fontsize=6)
mpl.rc('xtick', labelsize=6)
mpl.rc('ytick', labelsize=6)

# Set pandas configuration.
pd.set_option('display.max_colwidth', 14)
pd.set_option('display.width', 150)
pd.set_option('display.precision', 4)

# ----------------------------
# create data
# ----------------------------
# Create timeseries data
x, y, f = make_timeseries()
Example using WLS
Libraries
# Libraries
from pyamr.core.regression.wls import WLSWrapper
from pyamr.metrics.weights import SigmoidA

# Create method to compute weights from frequencies
W = SigmoidA(r=200, g=0.5, offset=0.0, scale=1.0)

# Note that fit will call W.weights(f) internally and will store the
# weight converter W in the instance. Hence, the call below is
# equivalent to passing <weights=W.weights(f)>, with the only
# difference being that the weight converter would not be saved.
wls = WLSWrapper(estimator=sm.WLS).fit( \
    exog=x, endog=y, trend='c', weights=f,
    W=W, missing='raise')
All information can be seen as a series
# Print series.
print("\nSeries:")
print(wls.as_series())
Series:
wls-rsquared 0.5042
wls-rsquared_adj 0.4992
wls-fvalue 99.6762
wls-fprob 0.0
wls-aic inf
wls-bic inf
wls-llf -inf
wls-mse_model 153748.2628
wls-mse_resid 1542.4777
wls-mse_total 3079.9099
wls-const_coef 269.2284
wls-const_std 18.0069
wls-const_tvalue 14.9514
wls-const_tprob 0.0
wls-const_cil 233.4942
wls-const_ciu 304.9625
wls-x1_coef 2.5201
wls-x1_std 0.2524
wls-x1_tvalue 9.9838
wls-x1_tprob 0.0
wls-x1_cil 2.0192
wls-x1_ciu 3.0211
wls-s_dw 0.548
wls-s_jb_value 16.984
wls-s_jb_prob 0.0002
wls-s_skew 0.355
wls-s_kurtosis 4.89
wls-s_omnibus_value 9.863
wls-s_omnibus_prob 0.007
wls-m_dw 0.1566
wls-m_jb_value 7.0686
wls-m_jb_prob 0.0292
wls-m_skew -0.6131
wls-m_kurtosis 3.4392
wls-m_nm_value 7.5327
wls-m_nm_prob 0.0231
wls-m_ks_value 0.6094
wls-m_ks_prob 0.0
wls-m_shp_value 0.939
wls-m_shp_prob 0.0002
wls-m_ad_value 2.7507
wls-m_ad_nnorm False
wls-missing raise
wls-exog [[1.0, 0.0...
wls-endog [26.961124...
wls-trend c
wls-weights [0.0423348...
wls-W <pyamr.met...
wls-model <statsmode...
wls-id WLS(c,Sig(...
dtype: object
Or a summary
# Print summary.
print("\nSummary:")
print(wls.as_summary())
Summary:
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.504
Model: WLS Adj. R-squared: 0.499
Method: Least Squares F-statistic: 99.68
Date: Thu, 15 Jun 2023 Prob (F-statistic): 1.31e-16
Time: 18:15:44 Log-Likelihood: -inf
No. Observations: 100 AIC: inf
Df Residuals: 98 BIC: inf
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 269.2284 18.007 14.951 0.000 233.494 304.962
x1 2.5201 0.252 9.984 0.000 2.019 3.021
==============================================================================
Omnibus: 9.863 Durbin-Watson: 0.548
Prob(Omnibus): 0.007 Jarque-Bera (JB): 16.984
Skew: 0.355 Prob(JB): 0.000205
Kurtosis: 4.890 Cond. No. 228.
Normal (N): 7.533 Prob(N): 0.023
==============================================================================
The regression line can be seen as
# Print regression line.
print("\nRegression line:")
print(wls.line(np.arange(10)))
Regression line:
[269.22835846 271.74850753 274.26865659 276.78880566 279.30895472
281.82910379 284.34925285 286.86940192 289.38955098 291.90970005]
Let's see how to make predictions using the wrapper and how to plot the results. In addition, it compares the intervals provided by get_prediction (confidence intervals) and the intervals provided by wls_prediction_std (prediction intervals).
# -----------------------------
# Predictions
# -----------------------------
# Variables.
start, end = None, 180

# Compute predictions (exogenous?). It returns a 2D array
# where the rows contain the time (t), the mean, the lower
# and upper confidence (or prediction?) interval.
preds = wls.get_prediction(start=start, end=end)

# Create figure
fig, ax = plt.subplots(1, 1, figsize=(11, 5))

# -----------------------------
# Plotting confidence intervals
# -----------------------------
# Plot truth values.
ax.plot(x, y, color='#A6CEE3', alpha=0.5, marker='o',
        markeredgecolor='k', markeredgewidth=0.5,
        markersize=5, linewidth=0.75, label='Observed')

# Plot forecasted values.
ax.plot(preds[0, :], preds[1, :], color='#FF0000', alpha=1.00,
        linewidth=2.0, label=wls._identifier(short=True))

# Plot the confidence intervals.
ax.fill_between(preds[0, :],
                preds[2, :],
                preds[3, :],
                color='r',
                alpha=0.1)

# Legend
plt.legend()

# Show
plt.show()
Example using ARIMA
Note
Since we are using ARIMA, we should have checked for stationarity!
# Libraries
# Import ARIMA from statsmodels.
from statsmodels.tsa.arima.model import ARIMA
# Import ARIMA wrapper from pyamr.
from pyamr.core.regression.arima import ARIMAWrapper

# ----------------------------
# create data
# ----------------------------
# Create timeseries data
x, y, f = make_timeseries()

# Create exogenous variable
exog = x

# ----------------------------
# fit the model
# ----------------------------
# Create specific arima model.
arima = ARIMAWrapper(estimator=ARIMA).fit( \
    endog=y[:80], order=(1, 0, 0), trend='c', disp=0)
All information can be seen as a series
# Print series
print("\nSeries:")
print(arima.as_series())
Series:
arima-aic 761.509
arima-bic 768.6551
arima-hqic 764.3741
arima-llf -377.7545
arima-const_coef 266.5951
arima-const_std 165.575
arima-const_tvalue 1.6101
arima-const_tprob 0.1074
arima-const_cil -57.9259
arima-const_ciu 591.1161
arima-ar.L1_coef 0.9917
arima-ar.L1_std 0.019
arima-ar.L1_tvalue 52.1312
arima-ar.L1_tprob 0.0
arima-ar.L1_cil 0.9544
arima-ar.L1_ciu 1.029
arima-sigma2_coef 702.5324
arima-sigma2_std 92.619
arima-sigma2_tvalue 7.5852
arima-sigma2_tprob 0.0
arima-sigma2_cil 521.0025
arima-sigma2_ciu 884.0623
arima-m_dw 1.735
arima-m_jb_value 2049.8936
arima-m_jb_prob 0.0
arima-m_skew -3.8991
arima-m_kurtosis 26.5405
arima-m_nm_value 101.5915
arima-m_nm_prob 0.0
arima-m_ks_value 0.588
arima-m_ks_prob 0.0
arima-m_shp_value 0.6822
arima-m_shp_prob 0.0
arima-m_ad_value 4.6133
arima-m_ad_nnorm False
arima-converged True
arima-endog [9.1093314...
arima-order (1, 0, 0)
arima-trend c
arima-disp 0
arima-model <statsmode...
arima-id ARIMA(1, 0...
dtype: object
Or a summary
# Print summary.
print("\nSummary:")
print(arima.as_summary())
Summary:
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 80
Model: ARIMA(1, 0, 0) Log Likelihood -377.754
Date: Thu, 15 Jun 2023 AIC 761.509
Time: 18:15:44 BIC 768.655
Sample: 0 HQIC 764.374
- 80
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 266.5951 165.575 1.610 0.107 -57.926 591.116
ar.L1 0.9917 0.019 52.131 0.000 0.954 1.029
sigma2 702.5324 92.619 7.585 0.000 521.002 884.062
==============================================================================
Manual
------------------------------------------------------------------------------
Omnibus: 0.000 Durbin-Watson: 1.735
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2049.894
Skew: -3.899 Prob(JB): 0.000
Kurtosis_m: 26.541 Cond No:
Normal (N): 101.591 Prob(N): 0.000
==============================================================================
Let's see how to make predictions using the previously fitted wrapper, and how to plot the resulting data for visualization purposes. It shows two different types of predictions: (i) dynamic predictions, in which each prediction is based on the previously predicted values (note that for an ARIMA(0,1,1) model this returns a line) and (ii) non-dynamic predictions, in which each prediction is based on the real values of the time series, no matter what the prediction was for those values.
# ----------------
# Predictions
# ----------------
# Variables.
s, e = 50, 120

# Compute predictions
preds_1 = arima.get_prediction(start=s, end=e, dynamic=False)
preds_2 = arima.get_prediction(start=s, end=e, dynamic=True)

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# ----------------
# Plot non-dynamic
# ----------------
# Plot truth values.
axes[0].plot(y, color='#A6CEE3', alpha=0.5, marker='o',
             markeredgecolor='k', markeredgewidth=0.5,
             markersize=5, linewidth=0.75, label='Observed')

# Plot forecasted values.
axes[0].plot(preds_1[0, :], preds_1[1, :], color='#FF0000', alpha=1.00,
             linewidth=2.0, label=arima._identifier())

# Plot the confidence intervals.
axes[0].fill_between(preds_1[0, :], preds_1[2, :],
                     preds_1[3, :],
                     color='#FF0000',
                     alpha=0.25)

# ------------
# Plot dynamic
# ------------
# Plot truth values.
axes[1].plot(y, color='#A6CEE3', alpha=0.5, marker='o',
             markeredgecolor='k', markeredgewidth=0.5,
             markersize=5, linewidth=0.75, label='Observed')

# Plot forecasted values.
axes[1].plot(preds_2[0, :], preds_2[1, :], color='#FF0000', alpha=1.00,
             linewidth=2.0, label=arima._identifier())

# Plot the confidence intervals.
axes[1].fill_between(preds_2[0, :], preds_2[2, :],
                     preds_2[3, :],
                     color='#FF0000',
                     alpha=0.25)

# Configure axes
axes[0].set_title("ARIMA non-dynamic")
axes[1].set_title("ARIMA dynamic")

# Format axes
axes[0].grid(True, linestyle='--', linewidth=0.25)
axes[1].grid(True, linestyle='--', linewidth=0.25)

# Legend
axes[0].legend()
axes[1].legend()

# Show
plt.show()
Total running time of the script: (0 minutes 0.301 seconds)