Forecasting CEPCI to 2060: A Hybrid Approach with Prophet and Gaussian Process Regression

Nov 03, 2024

Forecasting the future of the Chemical Engineering Plant Cost Index (CEPCI) is critical for planning and decision-making across industries. With data spanning back to 1957, CEPCI offers a rich dataset but presents unique forecasting challenges. Our approach combines the seasonality and trend detection of Prophet with the fine-grained pattern recognition of Gaussian Process Regression (GPR). Here’s a deep dive into how this hybrid model is built, why it works, and how it provides an accurate, reliable forecast for decades to come.

Repository link.

Why CEPCI Forecasting Matters

The Chemical Engineering Plant Cost Index (CEPCI) is a widely used cost index in the chemical industry, tracking the cost of equipment, construction labour, buildings, and engineering and supervision associated with plant construction. A dependable forecast of the index helps firms budget effectively, analyse feasibility, and navigate cost scenarios in response to economic shifts. However, CEPCI forecasting is inherently challenging due to its intricate patterns:

Long-Term Trends: CEPCI reflects sustained cost changes over decades, influenced by economic factors like inflation, technological advancements, and regulatory policies.
Seasonal and Cyclical Patterns: CEPCI exhibits periodic fluctuations that may follow annual or multi-year cycles, adding a layer of complexity to the forecast.
Sudden Shocks: CEPCI can experience abrupt increases or decreases due to economic shocks, regulatory shifts, or commodity price fluctuations, requiring a model that adapts to these changes.
Residual Noise: Traditional models often struggle with the “noise” in CEPCI data—erratic variations that hold predictive value but are not captured by trends or seasonality.

A successful CEPCI forecasting model needs to:

Capture seasonal and long-term trends to represent CEPCI’s overall trajectory.
Adapt to sudden shifts and anomalies caused by economic or regulatory changes.
Quantify uncertainty by providing reliable confidence intervals for decision-making.

Why a Hybrid Model?

Meeting these demands requires a sophisticated approach. The hybrid model combining Prophet and Gaussian Process Regression (GPR) was developed to capture CEPCI’s macroscopic trends while adapting to complex, minute fluctuations. Prophet’s decomposition of data trends and seasonality handles macro-patterns, while GPR fine-tunes by capturing residual erratic behaviours, enhancing accuracy and confidence in the forecast.

Prophet: This model excels at decomposing time series data into trend, seasonality, and residuals, handling the long-term trends and seasonal patterns in CEPCI effectively. Prophet’s changepoint detection allows it to respond to significant shifts, while its seasonal modelling provides a strong foundation for cyclical behaviour in CEPCI.
Gaussian Process Regression (GPR): While Prophet captures broad patterns, GPR focuses on refining residuals—the erratic elements not explained by trend or seasonality. Trained on Prophet’s residuals, GPR adapts to complex, non-linear fluctuations, enhancing the precision of the final forecast.

This hybrid approach harnesses Prophet’s trend and seasonal modelling capabilities alongside GPR’s residual adjustment, producing a forecast that captures both macro-trends and fine-grained variations. The result is a forecast that not only predicts future values but also provides reliable confidence intervals, quantifying uncertainty for more informed decision-making.
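To make the two-stage idea concrete, here is a minimal sketch on synthetic data. A straight-line fit stands in for Prophet so the example runs with only NumPy and scikit-learn; the series, the alpha value, and the variable names are illustrative assumptions, not the settings used for the actual forecast.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern

# Synthetic yearly series (illustrative only, not CEPCI data).
rng = np.random.default_rng(0)
t = np.arange(40, dtype=float)
y = 100 + 5 * t + 8 * np.sin(t / 3) + rng.normal(0, 1, t.size)

# Stage 1: baseline model. A linear fit stands in for Prophet here.
slope, intercept = np.polyfit(t, y, deg=1)
baseline = intercept + slope * t

# Stage 2: GPR learns whatever structure the baseline left behind.
residuals = y - baseline
gpr = GaussianProcessRegressor(
    kernel=C(1.0, (1e-3, 1e3)) * Matern(length_scale=1.0, nu=1.5),
    alpha=0.1,            # assumed value, chosen only for the sketch
    normalize_y=True,
    random_state=0,
)
gpr.fit(t.reshape(-1, 1), residuals)

# Final prediction = baseline + predicted residual.
hybrid = baseline + gpr.predict(t.reshape(-1, 1))

rmse_baseline = np.sqrt(np.mean((y - baseline) ** 2))
rmse_hybrid = np.sqrt(np.mean((y - hybrid) ** 2))
print(f"in-sample RMSE  baseline: {rmse_baseline:.2f}  hybrid: {rmse_hybrid:.2f}")
```

The in-sample error of the hybrid is lower by construction, so this shows the mechanics of the approach rather than evidence of forecasting skill.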

Step 1: Data Preparation & Feature Engineering

With CEPCI data from 1957 to 2024, the preprocessing begins with essential transformations to capture periodic trends, account for growth patterns, and smooth the data for better predictions.

Log Transformation

Taking the log of CEPCI values stabilises the variance, allowing Prophet to manage phases of exponential growth effectively.

import numpy as np

# df is the historical CEPCI table, indexed by year
df['CEPCI_log'] = np.log(df['CEPCI'])
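As a small illustration of why the log scale helps, the sketch below uses made-up values that double each step. It also shows the back-transform that any model trained on the log column would need before its output can be read as index values. The numbers are illustrative, not CEPCI figures.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not actual CEPCI figures.
df = pd.DataFrame({"CEPCI": [100.0, 200.0, 400.0, 800.0]}, index=[1, 2, 3, 4])

df["CEPCI_log"] = np.log(df["CEPCI"])

# Constant percentage growth becomes a constant step on the log scale.
log_steps = df["CEPCI_log"].diff().dropna()

# Anything predicted on the log scale has to be exponentiated back to index units.
recovered = np.exp(df["CEPCI_log"])
```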

Sine and Cosine Transformations

Sine and cosine transformations of the year index encode cyclic behaviour as smooth periodic features. As written below, the pair completes one full cycle over the length of the dataset, giving the model a slowly varying periodic component alongside Prophet’s own seasonality terms.

df['Year_sin'] = np.sin(2 * np.pi * (df.index - df.index.min()) / len(df))
df['Year_cos'] = np.cos(2 * np.pi * (df.index - df.index.min()) / len(df))
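The cycle length in the snippet above is fixed at the length of the dataset. A slightly more general sketch makes the period an explicit argument. The helper name `add_cyclic_features`, the `period_years` parameter, and the 10-year example are hypothetical, and the frame holds dummy values.

```python
import numpy as np
import pandas as pd

def add_cyclic_features(df: pd.DataFrame, period_years: float) -> pd.DataFrame:
    """Encode position within a cycle of `period_years` as a sine/cosine pair."""
    out = df.copy()
    elapsed = (out.index - out.index.min()).to_numpy(dtype=float)
    phase = 2 * np.pi * elapsed / period_years
    out["Year_sin"] = np.sin(phase)
    out["Year_cos"] = np.cos(phase)
    return out

# Placeholder frame indexed by year; the CEPCI values are dummies.
df = pd.DataFrame({"CEPCI": np.ones(68)}, index=np.arange(1957, 2025))

# period_years=len(df) reproduces the snippet above: one cycle over the whole record.
full_span = add_cyclic_features(df, period_years=len(df))

# A shorter, hypothetical 10-year cycle repeats several times over the same record.
decade = add_cyclic_features(df, period_years=10)
```

Using the sine and cosine together means every point in the cycle has a unique encoding, which a single sine column would not provide.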

Rolling Statistics

We calculated rolling mean, standard deviation, and median over a 5-year window, providing Prophet with smoothed trends and offering GPR stable reference points for further adjustment.

df['CEPCI_rolling_mean_5'] = df['CEPCI'].rolling(window=5, min_periods=1).mean()
df['CEPCI_rolling_std_5'] = df['CEPCI'].rolling(window=5, min_periods=1).std()
df['CEPCI_rolling_median_5'] = df['CEPCI'].rolling(window=5, min_periods=1).median()
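A quick check on toy numbers shows what `min_periods=1` does at the start of the series: early windows use fewer than five observations, and the first rolling standard deviation is NaN because a single point has no sample spread. The values below are illustrative only.

```python
import pandas as pd

s = pd.Series([100.0, 102.0, 105.0, 110.0, 120.0, 118.0], name="CEPCI")

rolling = s.rolling(window=5, min_periods=1)
stats = pd.DataFrame({
    "mean_5": rolling.mean(),      # row 0 averages one value, row 4 averages five
    "std_5": rolling.std(),        # row 0 is NaN: one observation has no sample std
    "median_5": rolling.median(),
})
print(stats)
```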

Lagged Variables and Differencing

Lagged CEPCI values (1-year and 2-year lags) were included to introduce temporal dependencies, helping the model detect recent historical influence.

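df['CEPCI_lag_1'] = df['CEPCI'].shift(1)
df['CEPCI_lag_2'] = df['CEPCI'].shift(2)
df['CEPCI_diff_1'] = df['CEPCI'].diff(1)
df['CEPCI_diff_2'] = df['CEPCI'].diff(1).diff(1)  # second-order difference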

Polynomial Terms

Adding squared and cubed year terms enabled the model to capture non-linear growth patterns, essential for representing CEPCI’s occasional acceleration and deceleration phases.

df['Year_squared'] = (df.index - df.index.min()) ** 2
df['Year_cubed'] = (df.index - df.index.min()) ** 3

Differencing:

We introduced first and second-order differences to remove non-stationarity, reducing trend impact and making the data easier to predict.
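The lagging and differencing steps can be traced on a toy series. The sketch below uses illustrative values and shows the NaN rows these features create at the start of the data, which usually need dropping or filling before a model is fitted.

```python
import pandas as pd

df = pd.DataFrame({"CEPCI": [100.0, 102.0, 105.0, 110.0, 120.0]})

df["CEPCI_lag_1"] = df["CEPCI"].shift(1)         # previous year's value
df["CEPCI_lag_2"] = df["CEPCI"].shift(2)         # value two years back
df["CEPCI_diff_1"] = df["CEPCI"].diff()          # year-on-year change
df["CEPCI_diff_2"] = df["CEPCI"].diff().diff()   # change in the change

# The leading rows have no history, so they hold NaN; drop them before fitting.
model_ready = df.dropna()
```

Note that `diff().diff()` is the second-order difference, whereas `diff(2)` would give the change over a two-year gap.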

Step 2: Hyperparameter Optimization with Optuna

Using Optuna for automatic hyperparameter tuning optimises Prophet and GPR settings for error minimisation. This ensures that the hybrid model adapts precisely to CEPCI’s specific characteristics.

Prophet Hyperparameter Optimisation with Optuna

Prophet’s flexibility in handling trend shifts and seasonality can be enhanced by optimising:

Changepoint Prior Scale: Controls trend flexibility, enabling Prophet to adapt to shifts in long-term trends.
Seasonality Prior Scale: Determines sensitivity to seasonal effects, allowing the model to balance specificity with generality.
Seasonality Mode: Dictates whether seasonality is additive or multiplicative, letting us capture scaling effects in CEPCI.

Optuna fine-tuned these parameters to minimise the Mean Absolute Percentage Error (MAPE), resulting in a Prophet configuration tailored to CEPCI’s characteristics.

import optuna
from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error

def objective(trial):
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    changepoint_prior_scale = trial.suggest_float('changepoint_prior_scale', 0.001, 0.5, log=True)
    seasonality_prior_scale = trial.suggest_float('seasonality_prior_scale', 0.01, 10, log=True)
    seasonality_mode = trial.suggest_categorical('seasonality_mode', ['additive', 'multiplicative'])

    model = Prophet(
        changepoint_prior_scale=changepoint_prior_scale,
        seasonality_prior_scale=seasonality_prior_scale,
        seasonality_mode=seasonality_mode,
        yearly_seasonality=True
    )

    # df needs Prophet's 'ds' (date) and 'y' (target) columns
    model.fit(df)
    forecast = model.predict(df)
    # In-sample MAPE, expressed as a percentage
    mape = mean_absolute_percentage_error(df['y'], forecast['yhat']) * 100

    return mape

# Run 50 trials and keep the configuration with the lowest MAPE
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
best_params = study.best_params
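The objective above scores each trial on the same rows it was fitted on. One possible variant, sketched here as an assumption rather than the method used for this forecast, holds out the final years and scores only those. The helper `holdout_mape` and the `fit_predict` callable are hypothetical names, and a straight-line model stands in for Prophet so the example runs on its own.

```python
import numpy as np
import pandas as pd

def holdout_mape(df, fit_predict, n_holdout=10):
    """Fit on all but the last `n_holdout` rows, then score MAPE (%) on those rows.

    `fit_predict(train, test)` is a hypothetical callable returning predictions
    for `test`; in a tuning loop it would wrap Prophet's fit and predict.
    """
    train, test = df.iloc[:-n_holdout], df.iloc[-n_holdout:]
    predicted = np.asarray(fit_predict(train, test), dtype=float)
    actual = test["y"].to_numpy(dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def linear_fit_predict(train, test):
    # Stand-in model so the sketch runs without Prophet: a straight line in time.
    slope, intercept = np.polyfit(train["t"], train["y"], deg=1)
    return intercept + slope * test["t"]

# Synthetic data with 'ds'/'y' columns as Prophet expects, plus a numeric time column.
t = np.arange(40)
df = pd.DataFrame({
    "ds": pd.date_range("1980-01-01", periods=40, freq="YS"),
    "t": t,
    "y": 100.0 + 5.0 * t,
})
score = holdout_mape(df, linear_fit_predict, n_holdout=10)
```

Because the split respects time order, the score reflects how a configuration extrapolates, which is closer to how the final model is used.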

GPR Hyperparameter Optimisation

For GPR, which models Prophet’s residuals, we focused on parameters like:

Alpha: A regularisation term that helps prevent GPR from overfitting to small fluctuations.
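n_restarts_optimizer: The number of times the kernel hyperparameter optimiser is restarted from new starting points. More restarts reduce the risk of settling in a poor local optimum of the log-marginal likelihood, which matters for accurate residual prediction.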

By optimising these parameters, GPR became finely attuned to the residuals, making precise adjustments to Prophet’s baseline forecast.

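import optuna
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern
from sklearn.metrics import mean_squared_error

def gpr_objective(trial):
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    alpha = trial.suggest_float('alpha', 0.001, 1.0, log=True)
    n_restarts_optimizer = trial.suggest_int('n_restarts_optimizer', 5, 20)

    # Scaled Matern kernel; nu=1.5 gives once-differentiable sample paths
    kernel = C(1.0, (1e-3, 1e3)) * Matern(length_scale=1.0, nu=1.5)
    model = GaussianProcessRegressor(
        kernel=kernel,
        alpha=alpha,
        n_restarts_optimizer=n_restarts_optimizer,
        normalize_y=True
    )

    # X: 2-D array of time indices, y: Prophet residuals
    model.fit(X, y)
    predictions = model.predict(X)
    mse = mean_squared_error(y, predictions)

    return mse

gpr_study = optuna.create_study(direction="minimize")
gpr_study.optimize(gpr_objective, n_trials=50)
best_gpr_params = gpr_study.best_params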
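One point worth keeping in mind with the objective above: it measures error on the same points the GPR was fitted to, and in-sample error generally falls as alpha shrinks. The sketch below illustrates this on synthetic residuals. The kernel is held fixed with `optimizer=None` so that only alpha varies, which is an assumption made for the illustration and not the tuned setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.arange(30, dtype=float).reshape(-1, 1)
y = np.sin(X.ravel() / 4) + rng.normal(0, 0.3, 30)   # synthetic "residuals"

def in_sample_mse(alpha):
    model = GaussianProcessRegressor(
        kernel=C(1.0) * Matern(length_scale=1.0, nu=1.5),
        alpha=alpha,
        optimizer=None,      # keep the kernel fixed so only alpha varies
        normalize_y=True,
    )
    model.fit(X, y)
    return mean_squared_error(y, model.predict(X))

mse_small = in_sample_mse(1e-3)   # weak regularisation: near-interpolation
mse_large = in_sample_mse(1.0)    # strong regularisation: smoother fit
print(f"alpha=0.001: {mse_small:.4f}   alpha=1.0: {mse_large:.4f}")
```

A search driven by in-sample MSE may therefore drift toward the lower end of the alpha range, so scoring on held-out points is one way to let alpha do its regularising job.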

Step 3: Training the Hybrid Model

Once optimised, we trained Prophet and GPR in sequence.

Prophet Training:

Prophet trained on the prepared dataset, capturing the long-term trends and seasonal cycles in CEPCI. Its forecast formed the baseline prediction, leaving the residuals for GPR to fine-tune.

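# yearly_seasonality is not part of best_params, so pass it explicitly to match the tuned model
prophet_model = Prophet(**best_params, yearly_seasonality=True)
prophet_model.fit(train_df)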

Residual Analysis:

Prophet’s residuals, or prediction gaps, were analysed to reveal patterns that the model couldn’t capture. GPR trained on these residuals, modelling the unpredictable shifts in CEPCI.

GPR Training on Residuals:

GPR refined the forecast by learning from Prophet’s residuals, adding precision to capture subtle fluctuations and non-linear dependencies in the data.

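historical_forecast = prophet_model.predict(train_df)
# Subtract as arrays so the result does not depend on train_df's index labels
residuals = train_df['y'].to_numpy() - historical_forecast['yhat'].to_numpy()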

gpr_model = GaussianProcessRegressor(kernel=C(1.0, (1e-3, 1e3)) * Matern(length_scale=1.0, nu=1.5),
                                     n_restarts_optimizer=best_gpr_params['n_restarts_optimizer'],
                                     alpha=best_gpr_params['alpha'],
                                     normalize_y=True)
gpr_model.fit(np.arange(len(residuals)).reshape(-1, 1), residuals)

Step 4: Forecast Generation

Baseline Forecast from Prophet:

Prophet’s long-term trend and seasonal predictions provided the main forecast trajectory for CEPCI from 2024 to 2060.

GPR Adjustment:

GPR’s residual adjustment fine-tuned Prophet’s forecast, addressing complex fluctuations and capturing erratic patterns, enhancing overall forecast accuracy.

Confidence Intervals:

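GPR also provided uncertainty estimates. Its predictive standard deviation (±2σ, roughly a 95% band) was added to Prophet’s own interval bounds to give the confidence band for each year’s prediction, helping to quantify potential variability. Prophet’s yhat_lower and yhat_upper cover 80% by default unless interval_width is set, so the combined band is best read as an approximation rather than an exact 95% interval.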

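# freq='Y' yields year-end dates for 2025-2060 (pandas 2.2+ prefers the alias 'YE')
future_dates = pd.date_range(start='2025', periods=36, freq='Y')
future_df = pd.DataFrame({'ds': future_dates})
prophet_forecast = prophet_model.predict(future_df)

# GPR Adjustment: continue the integer index the GPR was fitted on
# (assumes df and train_df have the same length)
X_future = np.arange(len(df), len(df) + len(future_dates)).reshape(-1, 1)
gpr_predictions, gpr_std = gpr_model.predict(X_future, return_std=True)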

# Combine Forecasts
forecast_df = pd.DataFrame({
    'Year': future_dates.year,
    'CEPCI_Predicted': prophet_forecast['yhat'] + gpr_predictions,
    'CEPCI_Lower': prophet_forecast['yhat_lower'] + (gpr_predictions - 2 * gpr_std),
    'CEPCI_Upper': prophet_forecast['yhat_upper'] + (gpr_predictions + 2 * gpr_std)
})
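When reading the GPR adjustment over a long horizon, it helps to know how a GPR with a stationary kernel such as Matern behaves far from its training points: the predicted mean relaxes toward the training mean and the predicted standard deviation grows toward the prior level. The sketch below shows this on synthetic residuals, with a fixed kernel and an assumed length scale of 5 steps. It is an illustration, not a reproduction of the fitted model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern

rng = np.random.default_rng(1)
n_train = 68
X_train = np.arange(n_train, dtype=float).reshape(-1, 1)
residuals = rng.normal(0, 5, n_train)     # synthetic stand-in for Prophet residuals

gpr = GaussianProcessRegressor(
    kernel=C(1.0) * Matern(length_scale=5.0, nu=1.5),  # assumed, fixed kernel
    alpha=0.1,
    optimizer=None,
    normalize_y=True,
)
gpr.fit(X_train, residuals)

# 1, 11 and 36 steps past the last training point.
X_future = np.array([[n_train], [n_train + 10], [n_train + 35]], dtype=float)
mean, std = gpr.predict(X_future, return_std=True)

for step, m, s in zip((1, 11, 36), mean, std):
    print(f"{step:>2} steps ahead: adjustment {m:+.2f}, std {s:.2f}")
```

Under these assumptions, the adjustment at the far end of the horizon is close to a constant offset, and the GPR's share of the band width levels off at roughly the spread of the training residuals.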

Step 5: Model Validation & Metrics

We evaluated model performance using key metrics, calculated over a validation window:

MAPE (Mean Absolute Percentage Error): Indicates percentage accuracy, a key metric for real-world forecasting.
RMSE (Root Mean Squared Error): Measures the average magnitude of error, highlighting forecast precision.
R² (Coefficient of Determination): Reflects model fit, indicating how well the forecast aligns with actual data.

The metrics confirmed strong predictive performance, showing the hybrid model’s ability to handle CEPCI’s complexities accurately.

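import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score

# actual / predicted: observed and forecast CEPCI values over the validation window
mape = mean_absolute_percentage_error(actual, predicted) * 100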
rmse = np.sqrt(mean_squared_error(actual, predicted))
r2 = r2_score(actual, predicted)

print(f"Validation Metrics:\nMAPE: {mape:.2f}%\nRMSE: {rmse:.2f}\nR²: {r2:.3f}")
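For readers who want to see what each metric computes, here is the same calculation written out with NumPy on three made-up points. The values are illustrative and are not validation results from the model.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 190.0, 400.0])

errors = actual - predicted

# MAPE: mean of |error| / |actual|, as a percentage
mape = np.mean(np.abs(errors / actual)) * 100
# RMSE: square root of the mean squared error, in index units
rmse = np.sqrt(np.mean(errors ** 2))
# R²: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"MAPE: {mape:.2f}%  RMSE: {rmse:.2f}  R²: {r2:.3f}")
# prints: MAPE: 5.00%  RMSE: 8.16  R²: 0.996
```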

Step 6: Visualising the Forecast

We used Seaborn and Matplotlib to visualise the final forecast:

Historical Data Points: Displayed as black dots to ground the forecast in reality.
Forecast Line: Shows the combined Prophet-GPR forecast for CEPCI up to 2060.
Confidence Bands: Shaded regions represent 95% confidence intervals, illustrating potential variability.

This visualisation brings clarity to CEPCI’s future trajectory, helping stakeholders interpret trends and uncertainties at a glance.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
sns.scatterplot(data=historical_df, x='Year', y='CEPCI', color='black', label='Historical Data')
plt.plot(forecast_df['Year'], forecast_df['CEPCI_Predicted'], 'b-', label='Hybrid Forecast')
plt.fill_between(forecast_df['Year'], forecast_df['CEPCI_Lower'], forecast_df['CEPCI_Upper'], alpha=0.2, color='blue', label='95% Confidence Interval')
plt.xlabel('Year')
plt.ylabel('CEPCI Value')
plt.title('CEPCI Forecast with Uncertainty Bounds')
plt.legend()
plt.show()

Key Findings

Overall CEPCI Growth Trend
CEPCI shows a steady upward trend through much of the forecast period, reflecting ongoing inflation, rising material costs, and increasing labour expenses within the chemical engineering industry. From an initial prediction of around 814 in 2024, the index is forecast to approach 2271 by 2059, indicating substantial cost increases that stakeholders must account for in future budgeting and financial planning.
Increasing Volatility in Later Years
The model ensures that CEPCI never falls below zero, even in the lower bound of the confidence intervals. In previous forecasts, certain years displayed potential negative values, which could misrepresent industry trends. Therefore, the lower bound stabilises at zero, which is particularly visible from 2051 onward, where uncertainty widens significantly but remains non-negative.
From around 2040, the model shows wider confidence intervals. By 2060, forecasted values display potential variability, ranging from approximately 0 to 5119. This widening band suggests heightened uncertainty, likely due to compounding effects of market variability and inflationary pressures. The volatility hints at potential industry disruptions or increased sensitivity to economic changes in later years.
Mid- to Long-Term Shifts
In the near-to-mid term, confidence intervals remain more constrained, making forecasts for 2024-2045 relatively reliable. This period offers a clearer view, with CEPCI values predicted to range between 814 and 1664, providing a more stable basis for budgeting and cost estimation.
Decision-makers are advised to use this more reliable forecast range for medium-term projects, as it reflects less variability compared to the long-term forecast.
Confidence Bands and Risk Assessment
The 95% confidence intervals widen significantly as the forecast horizon extends, underscoring increasing uncertainty about long-term cost predictions. For instance, by 2060, the CEPCI range extends from 0 to 4849. This increased variability indicates that while the trend remains upward, there is greater uncertainty due to possible external market or economic shifts, particularly in the long term.
Decision-makers should consider this variability, particularly for projects with extended timelines or those highly sensitive to cost fluctuations, such as large-scale chemical plant constructions.
Forecast Anomalies
Some anomalies in the forecast, such as potential declines seen around 2056 and 2058, may indicate external shocks or unforeseen downturns in the index. Such instances highlight the potential impact of market corrections or structural changes in the industry. These downturns, although speculative, offer valuable scenarios for risk planning.
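The non-negative lower bound mentioned in the findings above can be enforced in a single step. The sketch below assumes it is done by clipping the lower band at zero after the forecasts are combined; the exact mechanism is not shown in this post, and the numbers are illustrative rather than taken from the forecast.

```python
import pandas as pd

# Illustrative numbers only, not values from the forecast.
forecast_df = pd.DataFrame({
    "Year": [2050, 2055, 2060],
    "CEPCI_Predicted": [1800.0, 2000.0, 2200.0],
    "CEPCI_Lower": [300.0, -150.0, -700.0],
    "CEPCI_Upper": [3300.0, 4150.0, 5100.0],
})

# A cost index cannot be negative, so floor the lower bound at zero.
forecast_df["CEPCI_Lower"] = forecast_df["CEPCI_Lower"].clip(lower=0)
```

Clipping changes only how the band is displayed. The underlying uncertainty is as wide as before, which is why the upper bound keeps growing while the lower bound sits at zero.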

Practical Implications for Stakeholders

Budget Planning: This CEPCI forecast aids stakeholders in aligning long-term financial projections with expected cost escalations in the chemical engineering industry, especially for capital-intensive projects.
Contingency Planning: Given the observed volatility in later years, companies should consider contingency budgets to accommodate potential cost overruns or unexpected economic fluctuations. This is especially relevant for projects forecasted to occur in the mid-2040s to 2060.
Policy and Investment Strategy: The projected rise in CEPCI highlights the importance of continued investment in cost-saving technologies and sustainable practices to mitigate rising expenses. Policymakers can also use this data to inform regulations aimed at stabilising industry costs.

An Effective Forecasting Solution for CEPCI

The Prophet-GPR hybrid captures CEPCI’s unique dynamics, combining trend awareness with fine-grained adjustments. This approach not only aligns with the needs of long-term planning but also supports agile adjustments in response to economic changes.

Mohamed’s Substack
