Abstract
Electricity is a crucial, reliable, convenient, powerful, and widely used energy source that has a major impact on society and the economy and can be produced in environmentally friendly ways. Predicting electricity consumption needs is essential for producing energy efficiently and building suitable infrastructure. Inaccurate forecasts can cause significant financial losses and numerous unwanted power cuts.
For these reasons, industry and academic researchers are increasingly seeking efficient ways to predict electricity consumption needs accurately. In this dissertation, a comprehensive dataset covering eight years of actual electricity consumption for a whole province, at hourly resolution and including the amount and time of daily usage peaks, was utilised. This dataset was merged with two other datasets: a date dataset consisting of holidays and events, and a meteorology dataset consisting of each day’s maximum and minimum temperatures and weather conditions. Python in the Google Colaboratory environment was utilised for the data analytics and machine learning aspects of the project. Additionally, Tableau software was employed for all visualisations in this dissertation.
The dataset was split into three subsets, namely the train, test and evaluation datasets, and machine learning algorithms such as Random Forest Regressor, Gradient Boosting Machines (LightGBM), Decision Tree, Long Short-Term Memory Networks (LSTM) and Convolutional Neural Networks (CNNs) for Time Series were employed to uncover hidden consumption patterns in the train dataset. The accuracy of each algorithm was evaluated using error measures, with the best performance achieved by LightGBM, yielding the following results: Mean Absolute Error (MAE) = 175.94, Root Mean Squared Error (RMSE) = 318.64, Mean Absolute Percentage Error (MAPE) = 0.7%, and Coefficient of Determination (R-squared) = 0.99.
The best-performing algorithm can be implemented in the province with real-time data to accurately predict electricity consumption needs.
Contents
Declaration
Acknowledgements
Abstract
Table of Figures and Tables
1 Introduction
1.1 Background
1.2 Aims
1.3 Objectives
1.4 Work Done and Results
1.5 Structure of the Report
2 Literature Review
2.1 Traditional methods
2.2 Modern Machine Learning Methods
2.3 Literature Survey
2.4 Evaluation Metrics and Benchmarking
2.5 Research Gap and Contribution
3 Research Methodology
3.1 Introduction
3.2 Dataset
3.3 Data Pre-processing
3.4 Model Training
3.5 Result Evaluation
3.6 Prediction
4 Implementation and Testing
4.1 Introduction
4.2 Data Pre-processing
4.3 Exploratory Data Analysis
4.4 Data Splitting
4.5 Training the models
5 Discussion and Evaluation
6 Conclusions, Recommendation
References
7 Appendix A – Ethical Approval
8 Appendix B – Meeting Logs
9 Appendix C – python codes
9.1 Data Cleaning
Table of Figures and Tables
Figure 1-1: Insight of production to consuming electrical energy (Shutterstock, 2018)
Figure 2-1: Machine Learning techniques (Abdella et al., 2020)
Figure 3-1: Dissertation flowchart
Figure 3-2: Insight of Final Dataset
Figure 4-1: Total Daily Energy Consumption
Figure 4-2: Baseload Based on 2016 Data
Figure 4-3: Baseload and Peak Load in 2022 and 2023
Figure 4-4: Average Consumption for Each Hour of Day During Eight Years
Figure 4-5: Frequency of Peak Hours
Figure 4-6: Interactive Relation Between Maximum and Minimum Temperature with Daily Energy Consumption
Figure 4-7: Electricity Consumption Behaviours in Same Time in Different Years
Table 2-1: Research on Electricity Consumption Need Prediction
Table 3-1: Features Explanation
Table 5-1: Result for Trained Models
1 Introduction
Electricity is a crucial, reliable, convenient, powerful, and widely used energy source that has a major impact on society and the economy and can be produced in environmentally friendly ways. Predicting electricity consumption needs is essential for producing energy efficiently and building suitable infrastructure. Inaccurate forecasts can cause significant financial losses and numerous unwanted power cuts.
For these reasons, industry and academic researchers are increasingly seeking efficient ways to predict electricity consumption needs accurately. In this dissertation, a comprehensive dataset covering eight years of actual electricity consumption for a whole province, at hourly resolution and including the amount and time of daily usage peaks, was utilised. This dataset was merged with two other datasets: a date dataset consisting of holidays and events, and a meteorology dataset consisting of each day’s maximum and minimum temperatures and weather conditions. Python in the Google Colaboratory environment was utilised for the data analytics and machine learning aspects of the project. Additionally, Tableau software was employed for all visualisations in this dissertation. To predict electricity consumption needs accurately, machine learning algorithms will be employed to learn from the dataset and unveil hidden patterns. These patterns will then pave the way to predicting future consumption needs. The performance of each machine learning algorithm will be assessed by calculating error measures, and the algorithm with the highest accuracy will be selected for predicting future consumption based on the dataset and data model used.
1.1 Background
Electricity is crucial for modern life and industries for numerous reasons. Almost all devices in our modern lives and industries utilise electricity to function. Additionally, electricity is cleaner and more environmentally friendly compared to other energy sources and has the potential to become even cleaner in the future. Furthermore, electricity is an efficient and convenient energy source with a high level of reliability. (Mahfoudh and Amar, 2014)
Electricity has a significant impact on social and economic aspects. Access to electricity enhances the quality of life by enabling access to education, healthcare, and communication technologies. Moreover, it drives economic development by boosting productivity and creating job opportunities. (Groh and Ziegler, 2022)
As shown in Figure 1-1, the production, transmission, and distribution of electricity require extensive infrastructure. The majority of electricity is generated in large power plants located outside urban areas. It is then transmitted to cities through high-voltage networks and distributed within cities using medium and low-voltage networks. (Singh, 2008)
Figure 1-1: Insight of production to consuming electrical energy (Shutterstock, 2018).
Electricity consumption prediction is exceptionally important for the electricity industry. One of the significant weaknesses of electricity is the absence of an efficient method for storing electrical energy at large scale, so electricity must be consumed at the same time as it is produced. Over-forecasting can result in the construction of infrastructure that goes unused, causing large and unnecessary costs, while underestimation can lead to significant costs for distributors and the risk of power cuts. (Hadi and Soltanaghayi, 2022) Electricity consumption prediction is also important for several other reasons. It serves as a fundamental factor in Intelligent Power Management Systems (IPMS) and the preparation of national energy development policies. Additionally, it plays a crucial role in maintaining power market balance and has a decisive role in calculating electricity costs. Furthermore, electricity consumption prediction is used in designing more efficient power networks, reducing costs associated with building reserve networks. Lastly, it plays a vital role in ensuring the provision of an effective and sustainable energy source. (Gonzalez-Briones et al., 2019)
The department currently responsible for predicting future consumption in the province from which the dataset comes employs manual, traditional methods carried out by human agents, which are prone to error and lack accuracy. This project aims to transition the process from a manual approach to an automated, intelligent, and scalable pipeline, leveraging data and machine learning algorithms.
1.2 Aims
Given the significance of accurate electricity consumption prediction discussed in the preceding section, and the methods currently employed for predicting electricity consumption in the province, the aims of this dissertation are:
1. To predict electricity consumption demand for a province with a population of 3.3 million using machine learning algorithms.
The utilisation of machine learning algorithms facilitates more accurate and professional electricity consumption prediction. It can mitigate human errors, and the entire process can be automated through internal networks and APIs, resulting in quicker and more reliable predictions (Al Mamun et al., 2020).
2. To enhance accuracy to the highest extent achievable by utilising the provided dataset.
In pursuing these aims, the following research questions arise:
1. Is the time period of the dataset and the existing features within it adequate to enable accurate predictions of future electricity energy consumption?
2. Are there any key patterns in electricity consumption habits within the province over the last eight years at daily, weekly, and monthly intervals?
3. How effective are machine learning algorithms in unveiling hidden patterns in the comprehensive dataset of electricity consumption, and which algorithms perform best in this context?
4. Can the identified patterns in electricity consumption be used to accurately predict future consumption needs for the entire province at daily, weekly, and monthly resolutions?
5. Which features have a greater impact on electricity consumption habits?
6. What is the comparative accuracy of machine learning algorithms in predicting future electricity consumption needs, and which algorithm demonstrates the highest accuracy in the context of the specified province?
7. How accurate can the predictions be when utilising this dataset and its features?
1.3 Objectives
The path towards fulfilling the main aims of this project is divided into the following objectives:
1. Pre-process the provided dataset to make it suitable for machine learning purposes.
To create a suitable dataset for utilisation in the machine learning project, several pre-processing steps are necessary. These include identifying and handling missing values, removing duplicate entries, and scaling the values for use in machine learning algorithms (Abdella et al., 2020).
2. Utilise machine learning algorithms to identify patterns within the data.
Machine learning algorithms offer the most effective approach to uncovering hidden patterns within large-scale datasets (Zhang et al., 2021b), which may be impossible for humans to recognise. This dissertation will utilise several machine learning algorithms to extract these hidden patterns from the dataset.
3. Evaluate the applicability of various machine learning algorithms to the problem.
The performance of each machine learning algorithm will be evaluated using error measures, and the algorithm with the best performance based on these measures will be identified (Doshi-Velez and Perlis, 2019).
4. Employ the most accurate machine learning algorithm to predict future electricity consumption needs for the entire province.
The most accurate machine learning algorithm will be used to predict electricity consumption needs for the entire province at hourly, daily, and weekly intervals.
5. Visualise the historical trend and future predictions.
In this dissertation, visualisation tools such as Python libraries and Tableau will be used to visualise historical trends and future predictions of electricity consumption needs.
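The pre-processing steps named in the first objective (handling missing values, removing duplicate entries, and scaling) can be sketched as follows. This is a minimal illustration and not the code used in the dissertation: the record layout, the function names, the forward-fill strategy for missing values, and the choice of min-max scaling are all assumptions made for the example.

```python
def drop_duplicates(records):
    """Keep the first (timestamp, value) record seen for each timestamp."""
    seen = set()
    unique = []
    for timestamp, value in records:
        if timestamp not in seen:
            seen.add(timestamp)
            unique.append((timestamp, value))
    return unique


def forward_fill(values):
    """Replace each missing value (None) with the most recent observed value.

    Assumption for this sketch: the first value in the series is present.
    """
    filled = []
    for value in values:
        if value is None:
            if not filled:
                raise ValueError("first value is missing; nothing to carry forward")
            filled.append(filled[-1])
        else:
            filled.append(float(value))
    return filled


def min_max_scale(values):
    """Scale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def preprocess(records):
    """Remove duplicates, fill gaps, then scale the consumption values."""
    unique = drop_duplicates(records)
    filled = forward_fill([value for _, value in unique])
    return min_max_scale(filled)


# Hypothetical hourly records: one duplicated timestamp and one missing reading.
records = [
    ("2016-01-01 00:00", 100.0),
    ("2016-01-01 01:00", None),
    ("2016-01-01 01:00", None),
    ("2016-01-01 02:00", 300.0),
]
scaled = preprocess(records)
```

In practice the scaling parameters would be computed on the training subset only and then reused on the other subsets, so that no information from later data leaks into training.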
1.4 Work Done and Results
In this dissertation, an eight-year dataset of hourly, dated electricity consumption is utilised to predict future electricity consumption needs. Through the use of machine learning, the manual approach to electricity consumption prediction will transition to an automatic and intelligent approach.
1.5 Structure of the Report
This dissertation is organised into six chapters. Chapter 1 provides an overview of the problem, including how electrical energy is produced, transmitted, and distributed to consumers and why accurate electricity consumption prediction is important, and then introduces the research aims and research questions. Chapter 2 provides a comprehensive review of the problem domain, explaining the methodology and results of each reference, and then sets out the research gap and contribution. In Chapter 3, the methodology and tools for performing this dissertation are introduced and explained. In this chapter, the dataset is first introduced and its features are explained. Then, the data pre-processing approach and data splitting method are described, followed by an explanation of the machine learning algorithms used. Finally, the measures of error are introduced and explained. In Chapter 4, the processes outlined in Chapter 3 are executed. This includes all steps of the project such as data pre-processing, exploratory data analysis, feature engineering, model training, evaluation, and the implementation of Python code. Additionally, Tableau visualisations are presented to complement the analyses. In Chapter 5, the results for the trained models are presented, framed within the context of error measures. Interpretations for all results are provided, and the best-performing model is selected. The conclusions, limitations, and recommendations are elaborated in Chapter 6, while all coding for the implementation process is provided in Appendix C.
2 Literature Review
Electrical energy is a crucial driving force for economic development, with the precision of demand prediction being a key factor in the success of productivity planning (Gorman et al., 2020). Consequently, energy analysts require guidelines to select the most suitable prediction techniques for providing accurate predictions of electricity consumption trends. Despite numerous existing techniques tailored towards predicting future electricity demands, finding an accurate method for tracking the consumption pattern is essential, particularly given the growing demand for electricity globally (Le et al., 2019). The increasing demand necessitates the development of intelligent prediction methods and algorithms. Estimating electrical energy demand based on economic and non-economic indicators can be achieved through specific statistical, mathematical, linear, or nonlinear simulation models and machine learning algorithms. Long-term prediction of electricity consumption serves as the foundation for energy investment planning and plays a vital role for governments in developing countries (Mosavi and Bahmani, 2019). The importance of accurate prediction can be understood from the fact that overestimating electricity consumption can lead to wasting financial resources on over-designed infrastructure, whilst underestimating it can result in electricity shortages and impose unpredictable and considerable costs (Wang et al., 2011). Therefore, accurate energy consumption prediction can prevent significant financial losses (Le et al., 2019, Hadi and Soltanaghayi, 2022). The traditional approach to electricity consumption prediction is often hindered by the challenges posed by the nonlinear and dynamic aspects of energy consumption trends. Consequently, many industries and academic researchers increasingly tend to use machine learning and deep learning algorithms for this purpose. These techniques offer automation and more accurate predictions (Liu et al., 2019, Antonanzas et al., 2016). This literature review provides an outline of recent studies on electricity consumption prediction using machine learning algorithms.
2.1 Traditional methods
Traditional methods involve various approaches. For instance, in the province studied here, electricity consumption prediction relies on the manual efforts of individuals, who use historical usage data and weather conditions to anticipate future needs. Other traditional methods include regression, multiple regression, exponential smoothing, and the Iterative Reweighted Least Squares (IRLS) technique (Hammad et al., 2020). Regression is a statistical method that is widely used for consumption prediction due to its simple implementation and acceptable accuracy (Singh et al., 2012). Multiple regression is a statistical technique for finding the relationship between a dependent variable and two or more independent variables; Pankib et al. (2015) use multiple regression to predict electricity consumption in Thailand. Deng et al. (2021) explain that exponential smoothing is a widely used time series forecasting method for predicting future values based on past data; they use this method to predict electricity consumption needs and optimise it with the Particle Swarm Optimization (PSO) algorithm. IRLS is a computational algorithm used to estimate parameters in statistical models; Mbamalu and El-Hawary (1993) apply it to short-term power load forecasting.
However, traditional methods often struggle to capture complex consumption trends and to incorporate multiple influencing factors. Additionally, they may be sensitive to outliers, struggle with non-stationary data, and lack real-time adaptability. These weaknesses hinder their ability to provide accurate and comprehensive predictions for electricity consumption, emphasising the need for more advanced and adaptable forecasting methodologies.
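To make one of the traditional methods discussed in this section concrete, the following is a minimal sketch of simple exponential smoothing. It assumes the standard recurrence s[t] = alpha * y[t] + (1 - alpha) * s[t-1] with the level initialised to the first observation; the function names and the example values are hypothetical, and the PSO-based tuning of the smoothing factor described by Deng et al. (2021) is not reproduced here.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels for a series.

    The first level equals the first observation (an assumption of this
    sketch); each later level blends the new observation with the old level.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in the interval (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    levels = [float(series[0])]
    for value in series[1:]:
        levels.append(alpha * value + (1.0 - alpha) * levels[-1])
    return levels


def forecast_next(series, alpha):
    """One-step-ahead forecast: the most recent smoothed level."""
    return simple_exponential_smoothing(series, alpha)[-1]


# Hypothetical daily consumption values.
next_value = forecast_next([10, 20, 30], alpha=0.5)
```

Because the forecast is a single smoothed level, this method cannot by itself represent daily or weekly cycles or the effect of temperature, which illustrates the limitations described above.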
2.2 Modern Machine Learning Methods
Machine learning methods have recently become widely used for electricity consumption prediction, and they offer several advantages. The prediction process can be automated through the use of machine learning algorithms, eliminating the need for human intervention. Furthermore, machine learning algorithms can continually enhance their performance, so using them for electricity consumption prediction makes improving accuracy an ongoing process. These algorithms also deliver better accuracy at lower cost (Zhang et al., 2021b).
2.2.1 Supervised Learning Techniques
Supervised machine learning is widely used in predicting electricity consumption. It involves training a model on labelled input to identify hidden trends in a dataset. Various supervised machine learning algorithms, including K-Nearest Neighbours (k-NN), Naïve Bayes, Support Vector Machine (SVM), Neural Network (NN), Deep Learning, and Tree-based Algorithms, can be utilised for this purpose (Ahmad et al., 2018). When training a model, factors beyond historical consumption data, such as weather conditions, population density, and network parameters, must also be considered (Abdella et al., 2020).
Qiu et al. (2023) explain that k-NN is a non-parametric, lazy learning algorithm used for classification and regression tasks. In the context of machine learning, k-NN makes predictions based on the similarity of input data points to the training data. The paper uses k-NN to predict electricity consumption needs for a building in a tropical region, utilising data from four commercial buildings in Singapore; the best performance was achieved with a 4% error from the actual values.
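The similarity-based prediction that k-NN performs can be sketched as below for a regression task. This is an illustrative sketch only, not the method of any cited study: the feature layout (a single temperature value), the use of Euclidean distance, and the unweighted mean of the neighbours' targets are assumptions made for the example.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Predict a target as the mean target of the k nearest training points."""
    if k < 1 or k > len(train_features):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training target with its Euclidean distance to the query.
    distances = [
        (math.dist(features, query), target)
        for features, target in zip(train_features, train_targets)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical training data: (temperature,) -> consumption.
features = [(0.0,), (1.0,), (2.0,), (10.0,)]
targets = [10.0, 20.0, 30.0, 100.0]
prediction = knn_predict(features, targets, query=(1.2,), k=2)
```

The algorithm is called lazy because no model is fitted in advance: all the work happens at prediction time, when the distances to the stored training points are computed.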
Azadeh et al. (2008) use a neural network for annual electricity consumption prediction in the high energy consuming industrial sector. The dataset consisted of actual data from high energy consuming (intensive) industries in Iran from 1979 to 2003, and the best performance achieved a MAPE of 0.0099 on the test data.
A neural network is a computational model inspired by the structure and function of the human brain. It consists of a collection of interconnected nodes, known as neurons, which work together to process input data and produce output (Azadeh et al., 2008).
2.2.2 Unsupervised Learning Techniques
Unsupervised machine learning, unlike supervised learning, involves working with unlabelled data. This exploratory approach requires the model to identify hidden trends in the dataset by recognising factors such as abnormal usage, population density, and so on. One effective approach for this purpose is clustering, which helps identify relevant factors in an unsupervised manner (Westermann et al., 2020). Some supervised and unsupervised techniques are illustrated in Figure 2-1.
[Figure content. Supervised learning: regression problems (linear regression, multiple regression, penalized regression) and classification problems (logistic regression, support vector machine). Unsupervised learning: clustering problems (K-means clustering, fuzzy K-means clustering) and dimension reduction problems (principal component analysis, linear discriminant analysis, non-negative matrix factorization).]
Figure 2-1: Machine Learning techniques (Abdella et al., 2020).
According to Zhang et al. (2021a), clustering involves splitting a dataset into distinct groups, known as clusters, and is a common technique in unsupervised machine learning. In clustering, similarity between samples is typically measured using Euclidean distance, where smaller distances indicate higher similarity. K-means stands out among clustering algorithms. It requires the user to specify the number of clusters, denoted 'k', beforehand. Each cluster is centred on its centroid, which represents the average value of all elements within it. The workflow of K-means is straightforward: initial centroids for each cluster are randomly selected, and each data point is then assigned to the cluster whose centroid is closest to it in Euclidean distance. Each centroid is then recomputed as the mean of the points assigned to it, and the assignment and update steps are repeated until the assignments stop changing. Zhang et al. (2021a) utilised an hour-level electricity load dataset from a French family, spanning 1,440 days from December 17, 2006, to November 25, 2010. They employed K-means to predict the electricity consumption needs of the family, achieving a best performance with an RMSE of 0.86.
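The K-means workflow can be sketched in a few lines of Python. This is a minimal illustration of the standard algorithm (often called Lloyd's algorithm) and not the implementation of Zhang et al. (2021a): the function name, the seeded random initialisation from the data points, and the handling of empty clusters are assumptions of this sketch.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points with K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not move
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical two-dimensional samples forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = k_means(points, k=2)
```

Because the initial centroids are chosen at random, different seeds can lead to different final clusters on less clearly separated data; running the algorithm several times and keeping the best result is a common remedy.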
Principal Component Analysis (PCA) is a statistical technique used to simplify complex datasets by reducing the number of variables while retaining the critical information. It achieves this by transforming the original variables into a new set of variables known as principal components (Parhizkar et al., 2021). Parhizkar et al. (2021) use PCA to predict electricity consumption needs for a building with an MSE of 0.15 and 0.39 kW. The dataset came from the Information Technology Centre building at the University of Wyoming, USA, and contains historical energy data, outdoor dry bulb temperature, humidity, and pressure from September to December 2017, totalling 2,500 time steps of data collected at hourly intervals.
2.2.3 Ensemble Learning
clustering. Different techniques are described, including averaging and median ensembles, as well as cluster-based ensemble methods that leverage clustering algorithms for forecasting. It discusses three distinct datasets: the London dataset, the Irish dataset, and the Ausgrid dataset. The London dataset covers a period of three months in 2013, from May 21st to August 18th, comprising 90 days. The Irish dataset includes various customer types such as residential, Small and Medium-sized Enterprises (SMEs), and others. After removing consumers with missing data, it retained 3,639 residential consumers. The Ausgrid dataset spans nearly four months in 2010, from September 2nd to December 22nd, totaling 112 days. The best result for London was an MAPE of 3.733, for Ireland it was an MAPE of 3.780, and for Ausgrid, it was an MAPE of 7.860.
The ensemble method utilised by Li et al. (2021) combines five individual models, namely Teaching-Learning-Based Optimisation-Backpropagation (TLBO-BP), Teaching-Learning-Based Optimisation-Support Vector Machine (TLBO-SVM), Backpropagation-Adaptive Boosting (BP-Adaboost), Extreme Learning Machine (ELM), and Random Forest (RF). The outputs of these models serve as input variables to a secondary predictor, specifically a multiple linear regression model, to form the ensemble model for predicting electricity consumption needs for buildings. Integrating the five models within the ensemble learning scheme aims to leverage their respective fitting characteristics and enhance overall forecasting accuracy. The dataset originated from the first energy forecasting competition organised by ASHRAE in the 1990s and comprised whole building electrical energy (WBE), solar radiation, outdoor dry bulb temperature, and other meteorological data at hourly intervals for the period from September 1989 to February 1990. The best result achieved was RMSE = 14.96.
2.3 Literature Survey
The studies reviewed in this chapter are summarised in Table 2-1.
Table 2-1: Research on Electricity Consumption Need Prediction
Reference | Title | Dataset | Method | Best result
(Qiu et al., 2023) | A k-nearest neighbour attentive deep autoregressive network for electricity consumption prediction | PJM energy company dataset | K-NN | RMSE = 225.28
(Quek et al., 2017) | A naïve Bayes Classification Approach for Short-Term Forecast of Photovoltaic System | Photovoltaic power generation output dataset | Naive Bayes | 69.75%
(Dong et al., 2005) | Applying support vector machines to predict building energy consumption in tropical region | Data from four commercial buildings in Singapore | SVM | 96%
(Azadeh et al., 2008) | Annual electricity consumption forecasting by neural network in high energy consuming (intensive) industries in Iran | High energy consuming (intensive) industries in Iran | Neural Network | MAPE = 0.009
(Zhang et al., 2021a) | Power consumption predicting and anomaly detection based on transformer and K-means | Dataset from a French family, spanning 1,440 days | K-means | RMSE = 0.86
(Parhizkar et al., 2021) | Evaluation and improvement of energy consumption forecasting models using principal component analysis based feature reduction | From the Information Technology Centre building at the University of Wyoming, USA | Principal Component Analysis | MSE = 0.15
(Li et al., 2021) | Short-term electricity consumption forecasting for buildings using data-driven swarm intelligence-based ensemble model | From the first energy forecasting competition organized by ASHRAE in the 1990s | Ensemble learning method | RMSE = 14.96
2.4 Evaluation Metrics and Benchmarking
Evaluating the performance of electricity consumption prediction models requires the application of relevant metrics. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE) are the most commonly used (Chung et al., 2019, Cañete et al., 2023).
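For reference, the three metrics named above, together with the coefficient of determination (R-squared) reported in the abstract, can be computed as in the sketch below. The formulas are the standard textbook definitions; the function names and the sample values are hypothetical, and MAPE is expressed in percent on the assumption that no actual value is zero.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: penalises large errors more than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    ratios = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * ratios / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical hourly consumption values and model predictions.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
    "R2": r_squared(actual, predicted),
}
```

MAE and RMSE are expressed in the units of the consumption data, so they are only comparable between models evaluated on the same dataset, whereas MAPE and R-squared are scale-free.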
2.5 Research Gap and Contribution
All the articles reviewed in this literature review relate to predicting power consumption needs. They utilise machine learning algorithms, which are categorised into three sections: supervised learning, unsupervised learning, and ensemble learning. As evidenced in Table 2-1 and outlined in the references, the accuracy of prediction is highly correlated with the size of the area for which electricity consumption needs to be predicted. Smaller area predictions tend to be more accurate because electricity consumption prediction is influenced by various factors specific to each consumer, such as household, office, or industrial consumption. This complexity makes predictions on a larger scale less accurate (Wang et al., 2021).
As a result, each case of electricity consumption prediction must be considered independently, aiming to achieve the best possible result with the utilised dataset and algorithms. This dissertation focuses on a province in the northern region of an Asian country, situated along the coast, bordered by the sea to the north and mountains to the south. It boasts favourable weather conditions and is a popular tourist destination. The province has a population of 3.3 million permanent residents, which significantly increases during holidays such as New Year's, reaching up to 18 million people, thereby complicating the prediction process. (Karimtabar et al., 2015, Tasnim, 2022, Wikipedia, 2024)
Upon reviewing the references, it is noted that only one study, Karimtabar et al. (2015), focuses on this particular province. That study concentrates solely on total energy consumption throughout the year and predicts electricity consumption needs at a yearly interval. In contrast, this dissertation aims to forecast electricity consumption needs at hourly, daily, and weekly intervals.
3 Research Methodology
3.1 Introduction
In this chapter, the methodology of the dissertation is explained, which includes an overall insight into the utilised dataset, the explanation of features, and approaches for data cleaning. Additionally, the data split approaches are explained and evaluated. Then, the chosen machine learning algorithms are introduced, along with an explanation of how they were implemented. Finally, the evaluation measures are explained. The flowchart for this dissertation project is illustrated in Figure 3- 1. This flowchart outlines all the steps involved in the dissertation. In the first step, historical data is gathered and integrated into a single dataset. Subsequently, data pre-processing is implemented to prepare usable data for machine learning purposes. The next step involves analysing the data and splitting them into training and test data. After training the model, the subsequent stage is evaluating the model’s performance by measuring errors. Following this, the model is retrained for improved accuracy, and the best model, achieving the highest accuracy, is employed for predicting future energy consumption needs.
Figure 3- 1: Dissertation flowchart
3.2 Dataset
One of the most crucial aspects of any data-related project is the provision of a reliable dataset. For this dissertation, a combined dataset covering eight years of electricity consumption for the entire province was utilised. This dataset includes the days of the week, the corresponding dates, and the electrical energy used in the entire province for each hour of the day. Additionally, the dataset includes information on peak usage. To create a comprehensive dataset, it is essential to enrich it with additional information. In particular, incorporating weather conditions and temperature is crucial, as they play a pivotal role in electricity consumption prediction. Air temperature, for instance, exhibits a strong positive correlation with electricity consumption (Al Mamun et al., 2020). For this dissertation, data on weather conditions and on maximum and minimum temperatures were extracted from meteorology sources and integrated into the original dataset. As explained in the Research Gap and Contribution section, this province is a popular destination for leisure trips. Therefore, considering holidays and weekends is important for accurately predicting electricity consumption. During weekends and holidays, a significant increase in population occurs in the province, which notably affects the electricity consumption needs. For this purpose, all holidays and events over the past eight years have been identified and appended to the original dataset. An overview of the final dataset is shown in Figure 3-2.
Figure 3-2: Insight of Final Dataset
All features in the dataset are displayed in Table 3-1, along with their respective explanations, data types and the number of missing values.
Table 3-1: Features Explanation
Feature | Explanation | Data type | Missing values
Day | The day of the week. | object | 0
Date | The specific date. | datetime64 | 0
Holiday | Indicates whether the day is a holiday (1) or not (0). | int64 | 0
1-24 | Electricity consumption for each hour of the day, in MWh¹. | float64 | 0
Weather | Describes the weather conditions (e.g., Rainy, Sunny). | object | 198
Temp_max | The maximum temperature for the day. | float64 | 144
Temp_min | The minimum temperature for the day. | float64 | 134
Daily energy | Total electricity consumption for the day, in MWh. | float64 | 0
peak | The maximum hourly consumption value for the day, in MW². | float64 | 0
Peak Time | The hour at which the peak consumption occurred. | int64 | 0
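The relationship between the hourly columns and the three derived features in Table 3-1 can be illustrated with a minimal sketch. The numbers below are invented for illustration, and the assumption that column 1 holds the first hour of the day (so that the peak hour is the column position plus one) is not confirmed by the dataset description.

```python
import numpy as np

# Hypothetical hourly loads for two days (24 values per day); illustrative only.
hourly = np.array([
    [500.0 + 10.0 * h for h in range(24)],           # load rising through the day
    [700.0 - 5.0 * abs(h - 18) for h in range(24)],  # load peaking at index 18
])

daily_energy = hourly.sum(axis=1)      # "Daily energy": total over the 24 hours
peak = hourly.max(axis=1)              # "peak": highest hourly value of the day
peak_time = hourly.argmax(axis=1) + 1  # "Peak Time": assumes columns are labelled 1-24
```

If several hours share the maximum value, `argmax` returns the earliest one; the original dataset may resolve ties differently.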
3.3 Data Pre-processing
In this dissertation, data pre-processing consists of five steps: handling duplicate and missing values, encoding categorical variables, feature engineering, standardisation or normalisation, and data splitting.
3.3.1 Duplicate and missing values
In the first step, duplicate rows are identified and removed, followed by the detection and imputation of missing values. Examination of the dataset reveals that there are no duplicate values. As indicated in Table 3-1, missing values are concentrated in three features: “Weather”, “Temp_max”, and “Temp_min”. Since these missing values pertain to meteorological information, which significantly influences customer consumption behaviour (Kang and Reiner, 2022), the imputation approach must be chosen carefully to create a more reliable dataset and achieve more accurate results.
1 Megawatt-hour (MWh).
2 Megawatt (MW).
3.3.2 Encode categorical variables
Encoding is the process of transforming data from one form to another for purposes such as storage, transmission, or processing by computer systems. By encoding data, it becomes standardised and machine-readable, ensuring that it can be accurately interpreted and manipulated by various systems. It facilitates interoperability by ensuring that data can be understood and utilised by different devices and software applications (Eriksson and Lindskog, 2017).
In this dissertation, nominal data, such as “Day” and “Weather”, were encoded using one-hot encoding to prepare them for machine learning processes.
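As a minimal sketch of one-hot encoding, assuming the data is held in a pandas DataFrame, the example below encodes “Day” and “Weather” with `pandas.get_dummies`. The three rows are invented for illustration and this is not necessarily the exact code used in the dissertation.

```python
import pandas as pd

# Hypothetical rows using the column names from Table 3-1.
df = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Monday"],
    "Weather": ["Sunny", "Rainy", "Sunny"],
    "Holiday": [0, 0, 1],
})

# Each category becomes its own 0/1 column, e.g. Day_Monday, Weather_Sunny.
encoded = pd.get_dummies(df, columns=["Day", "Weather"], dtype=int)
print(encoded)
```

Columns that are not listed in `columns` (here “Holiday”) are passed through unchanged.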
3.3.3 Feature engineering
Feature engineering is the next step in data pre-processing. In machine learning, feature engineering involves creating and selecting input data features to enhance model performance and accuracy. It focuses on converting raw data into meaningful and informative features that effectively represent the underlying patterns in the dataset. This process often involves creating new features, selecting relevant ones, and optimising their format to improve the model’s ability to learn from the data (Li et al., 2017). In this dissertation, feature engineering is approached in two different ways. Since the aim of this project is to train machine learning algorithms and use them to predict electricity consumption needs at both daily and hourly intervals, feature engineering is performed separately for each.
3.3.4 Standardisation/Normalisation
Standardisation and normalisation are both techniques used to transform and rescale data, but they are used for different purposes. Standardisation involves rescaling a dataset so that it has a mean of 0 and a standard deviation of 1. This process allows for easy comparison of different datasets, making it a useful technique for statistical analyses and machine learning. Normalisation, on the other hand, typically involves rescaling the data to a specific range, often [0, 1] or [-1, 1]. It is useful for bringing diverse datasets into a standard form, making comparisons more meaningful (Sakai, 2016).
In this dissertation, based on the dataset and the project objective, all the data is normalised, as machine learning models generally work more effectively with normalised data.
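The difference between the two rescaling techniques can be sketched in a few lines of NumPy. The function names and the sample temperatures are illustrative assumptions, not the dissertation's own code.

```python
import numpy as np

def standardise(x):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()  # breaks down if all values are identical

def min_max_normalise(x):
    """Rescale linearly to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())  # breaks down if all values are identical

temps = [10.0, 20.0, 30.0]  # hypothetical maximum temperatures
z = standardise(temps)
scaled = min_max_normalise(temps)
```

In practice the mean, standard deviation, minimum and maximum are usually computed on the training data only and then reused for the other subsets, so that no information from held-out data influences training.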
3.3.5 Split the dataset
Finally, the dataset will be split into training, validation, and test data. The data-splitting process should be implemented in a way that provides the model with the most representative data. To achieve this objective, the split is considered separately for daily prediction and hourly prediction in this dissertation. For daily prediction, 70% of the data for each month is used for training, 15% is allocated for validation, and the remaining 15% is reserved for testing. For hourly prediction, a similar approach can be adopted, with 70% of the data allocated for training, 15% for validation, and the remaining 15% for testing. Additionally, the data can be separated by season before the training-validation-testing split is applied.
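A per-month 70/15/15 split can be sketched as follows. The function `split_by_month` is a hypothetical helper, and the choice to shuffle rows randomly within each month (with a fixed seed) is an assumption: the text does not state whether rows were selected randomly or chronologically.

```python
import numpy as np

def split_by_month(months, train=0.70, val=0.15, seed=0):
    """Return sorted index arrays (train, val, test), splitting each month separately."""
    rng = np.random.default_rng(seed)
    months = np.asarray(months)
    parts = {"train": [], "val": [], "test": []}
    for m in np.unique(months):
        # Row positions belonging to this month, in random order (assumption).
        idx = rng.permutation(np.flatnonzero(months == m))
        n_train = int(round(train * len(idx)))
        n_val = int(round(val * len(idx)))
        parts["train"].extend(idx[:n_train])
        parts["val"].extend(idx[n_train:n_train + n_val])
        parts["test"].extend(idx[n_train + n_val:])  # remainder goes to the test set
    return tuple(np.sort(np.array(parts[k], dtype=int)) for k in ("train", "val", "test"))

# Hypothetical month label (1-12) for 240 daily rows, 20 rows per month.
months = np.repeat(np.arange(1, 13), 20)
train_idx, val_idx, test_idx = split_by_month(months)
```

Splitting inside each month keeps every part of the year represented in all three subsets, which is the stated goal of providing the model with representative data.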
3.4 Model Training
The next step in this dissertation involves selecting suitable machine learning algorithms for the project’s purpose and using them with the split data to train a model capable of identifying hidden patterns in the dataset. To achieve better and more accurate results, various machine learning algorithms will be employed, and the performance of each model will be evaluated using error measurements.
A decision tree is a supervised machine learning algorithm applicable to both classification and regression tasks. It functions by segmenting data into subsets based on input features, making a decision at each internal node according to the respective feature values. This process is repeated until the resulting subsets become homogeneous with respect to the target variable, at which point they form leaf nodes. The decision tree algorithm is popular due to its interpretability, ease of comprehension, and ability to handle both numerical and categorical data (Xie et al., 2019).
The decision tree algorithm is well suited to predicting electricity consumption needs owing to its capability to handle both numerical and categorical data efficiently, making it adaptable to the diverse data types inherent in power load prediction. In the work of Xie et al. (2019), decision tree classification is employed to pre-process historical power load data, quantify individual features, and categorise the training data. These procedures enable the algorithm to detect patterns and correlations within the data, which are essential for precise power load forecasting. Moreover, the decision tree algorithm provides interpretable classification rules, facilitating an understanding of the factors that affect power load (Xie et al., 2019).
The random forest algorithm is a powerful tool in machine learning, employing ensemble learning to build numerous decision trees during training and merging their results to improve predictive accuracy. Especially advantageous for classification and regression tasks, random forest excels in feature selection, mitigation of overfitting, resilience against data noise, and model interpretability. By inherently prioritising features according to their influence on model performance, random forest produces efficient and reliable models, establishing itself as a favoured option in data science for its adaptability and proficiency in handling intricate datasets (Veeramsetty et al., 2022).
The random forest algorithm is well suited to forecasting electricity consumption needs, thanks to its ensemble learning technique, which combines multiple decision trees to improve predictive precision. In this field, where precise predictions are pivotal for grid management and energy market trading, the same strengths apply: feature selection, mitigation of overfitting, resilience against data noise, and model interpretability (Veeramsetty et al., 2022).
LightGBM, short for Light Gradient Boosting Machine, is a robust and effective gradient boosting framework based on decision tree algorithms. Developed by Microsoft, it is designed to process very large datasets with high efficiency and scalability. Compared with alternative boosting frameworks, LightGBM offers advantages such as faster training, reduced memory usage, and higher accuracy. These are achieved through its efficient gradient-based decision tree learning mechanism and its use of leaf-wise tree growth and histogram-based algorithms, alongside other optimisation strategies. Known for its ability to handle extensive datasets, LightGBM has been widely recognised for its performance across a spectrum of machine learning tasks including classification, regression, and ranking (Ju et al., 2019).
LightGBM is a strong choice for predicting electricity consumption needs, owing to its efficiency and scalability when dealing with extensive and complex datasets. Its leaf-wise tree growth and gradient-based learning algorithm provide faster training, reduced memory overhead, and higher accuracy, enabling it to capture complex nonlinear associations within power consumption or generation datasets. Moreover, the adoption of an enhanced histogram-based algorithm together with the leaf-wise strategy improves its resilience and mitigates overfitting risks, which is essential for generating dependable power forecasts under dynamically evolving conditions. Furthermore, LightGBM's capability to manage high-dimensional data through the Exclusive Feature Bundling (EFB) algorithm, coupled with its ability to leverage spatial and temporal correlations, makes it an appropriate tool for capturing the subtle patterns inherent in electricity consumption prediction tasks (Ju et al., 2019).
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture designed to model sequential data and to address the difficulty of learning long-term dependencies within such data. Unlike conventional RNNs, LSTMs can preserve and selectively update information across long time intervals, making them appropriate for tasks involving time series prediction, natural language processing, and speech recognition. This is accomplished through a complex internal structure comprising memory cells and gates that regulate information flow, enabling LSTMs to recognise patterns in sequential data across extended time spans while mitigating challenges such as the vanishing and exploding gradients often encountered in standard RNNs (Gao et al., 2019).
LSTMs are a strong choice for predicting electricity consumption needs owing to their ability to capture long-term dependencies and subtle patterns in sequential data, which makes them particularly well suited to modelling time series data such as electricity consumption. Their internal structure, featuring memory cells and gating mechanisms, allows the network to retain and selectively update information across extended time intervals, a crucial capability for achieving accurate predictions in dynamic environments. In power forecasting, LSTMs address issues such as the vanishing or exploding gradients commonly encountered in traditional RNNs, thereby improving the accuracy and stability of electricity consumption prediction (Rafi et al., 2019).
A Convolutional Neural Network (CNN) stands as a prevalent artificial neural network type extensively utilised in tasks involving image recognition and processing. Comprising specialised layers like convolutional, pooling, and fully connected layers, CNNs are tailored to efficiently handle visual data. Convolutional layers employ filters to extract features from input images, while pooling layers down sample the extracted features. Furthermore, CNNs integrate activation functions and optimisation algorithms to augment their learning and predictive capabilities. Leveraging their architecture, CNNs excel in discerning patterns and intricate relationships within image data, making them a potent tool for applications such as object recognition, image classification, and various computer vision tasks (Rafi et al., 2021).
3.5 Result Evaluation
The evaluation section is the most crucial stage for determining the accuracy of a model. For each model, three datasets are provided: one for training the model, one for testing the performance of the model, and a third that is kept as unseen data to enable a better evaluation of the models' performance. To assess accuracy, four error measures will be calculated: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE) and R-squared (R^2).
The Mean Absolute Error (MAE) serves as a metric utilised to quantify the average magnitude of errors between predicted values and observed values within a predictive model. It is determined by computing the mean of the absolute differences between predicted and actual values. A lower MAE signifies better predictive accuracy, indicating that the model’s predictions closely align with the actual values. The MAE outcome is directly interpretable in the same units as the data being measured, making it a straightforward gauge of model performance, where smaller values denote enhanced accuracy and larger values signify greater prediction errors. In summary, MAE provides a dependable measure of model performance by assessing the average discrepancy between predicted and observed values (Willmott and Matsuura, 2005). The Mean Absolute Error (MAE) is computed by this formula:
MAE = (1/n) ∑_{i=1}^{n} | y_i − ŷ_i |
Where:
n is the number of data points.
y_i is the i-th observed value.
ŷ_i is the i-th predicted value.
This measure offers a direct means to evaluate the average magnitude of errors between predicted and observed values (Willmott and Matsuura, 2005).
The Root Mean Squared Error (RMSE) serves as a widely employed metric for assessing the accuracy of predictive models, particularly within regression analysis. It is determined by taking the square root of the average of the squared differences between predicted and observed values. The RMSE outcome is directly interpretable in the same units as the data being measured, facilitating a straightforward evaluation of the model’s predictive performance. A lower RMSE denotes a superior fit between the model’s predictions and the actual values, indicating heightened accuracy. By considering the magnitude of prediction errors, RMSE offers a comprehensive measure of error dispersion, rendering it a valuable tool for appraising the overall quality of a predictive model (Chai and Draxler, 2014).
The RMSE is computed using the following formula:
RMSE = sqrt( (1/n) ∑_{i=1}^{n} ( y_i − ŷ_i )^2 )
Where:
n is the number of data points.
y_i is the i-th observed value.
ŷ_i is the i-th predicted value.
This calculation yields a single measure that quantifies the average magnitude of errors between predicted and observed values in a dataset (Chai and Draxler, 2014).
The Mean Absolute Percentage Error (MAPE) is a metric employed to assess the accuracy of predictions or forecasts. It computes the average percentage difference between predicted and actual values, making it particularly valuable in scenarios where relative errors hold more significance than absolute errors, such as in forecasting and finance. A lower MAPE indicates a more precise prediction, with 0% representing a perfect prediction. For instance, a MAPE of 5% signifies that the average prediction error is 5% of the actual value. MAPE offers an intuitive measure of forecast accuracy and finds widespread application in various practical contexts, particularly when the predicted quantity is expected to remain substantially above zero (De Myttenaere et al., 2016). The Mean Absolute Percentage Error (MAPE) is calculated using the following formula:
MAPE = (100%/n) ∑_{i=1}^{n} | ( y_i − ŷ_i ) / y_i |
Where:
n is the number of data points.
y_i is the i-th observed value.
ŷ_i is the i-th predicted value.
This metric serves as an indicator of forecast accuracy, particularly valuable when relative errors carry more significance than absolute errors (De Myttenaere et al., 2016).
The R-squared (R^2) value is a statistical metric indicating the proportion of the variance in the dependent variable that can be predicted from the independent variable(s) in a regression model. It typically ranges between 0 and 1, where 1 signifies that the independent variable(s) explain all variability in the dependent variable, and 0 implies that they fail to explain any variability; on held-out data it can become negative when the model performs worse than simply predicting the mean. Put simply, R-squared reflects the goodness of fit of a regression model, showing how effectively the independent variable(s) account for the variation in the dependent variable. A higher R-squared value denotes a better fit, indicating that the model's predictions closely match the actual data points, while a lower R-squared value suggests a weaker fit, with the model's predictions deviating further from the actual data points (Chicco et al., 2021).
The R-squared (R^2) value is calculated using the following formula:
R^2 = 1 − [ ∑_{i=1}^{n} ( y_i − ŷ_i )^2 / ∑_{i=1}^{n} ( y_i − ȳ )^2 ]
Where:
n is the number of data points.
y_i is the i-th observed value.
ŷ_i is the i-th predicted value.
ȳ is the mean of the observed values.
This computation provides a measure of how effectively the independent variable(s) explain the variability of the dependent variable, with a value closer to 1 indicating a better fit of the regression model (Chicco et al., 2021).
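The four measures defined in this section can be computed directly from their formulas. The following is a minimal NumPy sketch; the function name `regression_metrics` and the sample values are illustrative assumptions rather than the code used in the dissertation.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, MAPE (in percent) and R^2 from the formulas above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / y_true)) * 100.0  # undefined if any observed value is 0
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Hypothetical observed and predicted loads.
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 330.0, 400.0])
print(metrics)
```

Because RMSE squares the errors before averaging, the single 30-unit error in this example pulls RMSE well above MAE, which illustrates why the two measures are usually reported together.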
3.6 Prediction
The fundamental objective of this dissertation is to predict electricity consumption needs. After evaluating the models and identifying the most accurate one, that model will be employed to forecast electricity consumption needs at three time resolutions: daily, weekly and monthly.
4 Implementation and Testing
4.1 Introduction
In this chapter, all methodologies and tools explained in Chapter 3 were implemented. The process began with Exploratory Data Analysis, where data features were clarified using figures and tables. Subsequently, data pre-processing was conducted, followed by data splitting and standardisation. Next, models were trained and their performance evaluated using metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R-squared (R^2), and Mean Absolute Percentage Error (MAPE).
All steps were implemented using Python in the Google Colaboratory environment, and the code for each step is available in Appendix C of this dissertation. Tableau was utilised for all visualisations.
4.2 Data Pre-processing
To prepare the dataset for machine learning, data pre-processing must be carried out. As mentioned in the Duplicate and missing values section, inspecting the dataset with Python code in the Google Colaboratory environment shows that there are no duplicate values or outliers. As indicated in Table 3-1, missing values are concentrated in three features: ‘Weather’, ‘Temp_max’, and ‘Temp_min’. The ‘Weather’ column was filled using the forward fill method, a technique that assigns the most recent non-missing value to each missing value. Missing ‘Temp_max’ and ‘Temp_min’ values were linearly interpolated to provide reasonable estimates. The code used for finding and imputing missing values is provided in Appendix C. The next step in data cleaning involves encoding categorical data to make it usable for machine learning. In this dissertation, one-hot encoding was utilised to encode categorical columns such as ‘Day’ and ‘Weather’. This process extends the dataset with new columns such as ‘Day_Monday’, ‘Day_Tuesday’, ‘Weather_Sunny’ and ‘Weather_Rainy’, where each row holds a binary value for each column, with 1 indicating inclusion and 0 indicating exclusion. The code for this step is also provided in Appendix C.
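The two imputation methods described above can be sketched with pandas as follows. The four sample rows are invented for illustration; only the column names come from Table 3-1, and the actual code in Appendix C may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in the three affected columns.
df = pd.DataFrame({
    "Weather": ["Sunny", None, "Rainy", None],
    "Temp_max": [20.0, np.nan, 24.0, 26.0],
    "Temp_min": [10.0, 12.0, np.nan, 16.0],
})

# Forward fill: copy the most recent non-missing weather label downwards.
df["Weather"] = df["Weather"].ffill()

# Linear interpolation: estimate each gap from its neighbouring values.
temp_cols = ["Temp_max", "Temp_min"]
df[temp_cols] = df[temp_cols].interpolate(method="linear")
print(df)
```

Forward fill cannot fill a gap in the very first row, because there is no earlier value to copy; such a case would need separate handling.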
4.3 Exploratory Data Analysis
The dataset utilised in this dissertation consists of eight years of electricity consumption history for the entire province with a population of 3.3 million, integrated with meteorological data. In this dissertation, the dataset, after imputing missing values, was used to perform visualisations to enhance understanding of the dataset and its features. For all visualisations, this dissertation utilised Tableau software. Figure 4- 1 provides insight into the energy consumption trend over eight years. Based on the chart in this figure, electricity energy consumption exhibits a repeating trend over the years, with a mild increase overall. However, this increasing trend was interrupted, turning into a decrease in 2019 and 2020 due to the COVID-19 pandemic. Following the pandemic, the increase resumed, returning to the normal trend.
Figure 4-1: Total Daily Energy Consumption
Electricity consumption is divided into baseload and peak load. Baseload refers to the minimum level of demand for electricity over time, while peak load refers to the period of highest demand for electricity within a specific timeframe. This typically occurs during the time when electricity consumption is at its peak due to factors such as high temperatures (air conditioning use) or other reasons (Biserić and Bugarić, 2021). As shown in Figure 4-2, the green area represents the baseload based on 2016 data, while the area above the green line represents the peak load. During the COVID-19 pandemic, the baseload maintained its increasing trend, while the peak load decreased. This decrease can be attributed to the closure of offices and industries, as well as quarantine policies.
Figure 4-2: Baseload Based on 2016 Data
The yearly peak time in the province typically begins in the middle of May and extends until
after the middle of September, resulting in four months of significant peak load. Conversely,
the remaining eight months experience mild peak times, making electricity consumption
prediction even more challenging. Figure 4-3 illustrates the baseload and peak load in
the two most recent years, confirming the mentioned changes in both baseload and peak load
timeline.
Figure 4-3: Baseload and Peak Load in 2022 and 2023 (panels: Baseload Based on 2022 Data; Baseload Based on 2023 Data)
Furthermore, power consumption exhibits a strong correlation with the time of day, with average hourly electricity consumption showing significant variation throughout the day.
As illustrated in Figure 4-4, peaks mostly occur in the evening when electricity is primarily used for lighting.
Additionally, as observed in Figure 4-5, the most frequent peak times are at 7 PM followed by 9 PM, with 3 PM being the third most common. This shift in peak times from evening to afternoon during summer indicates increased electricity usage for cooling purposes.
Figure 4-4: Average Consumption for Each Hour of Day During Eight Years
Figure 4-5: Frequency of Peak Hours (axes: Peak Time; Count of Peak Time)
Other features that correlate with electricity consumption behaviour are the maximum and minimum temperatures for each day. As illustrated in Figure 4-6, there is a strong relationship between these temperatures and electricity consumption. The figure shows that as temperatures increase, electricity consumption increases, and conversely, it decreases as temperatures decrease. However, once temperatures fall to a comfortable range and continue to decrease further, electricity consumption remains steady around the baseload, even though a winter peak might be expected. This can be attributed to the fact that, in the province, electricity is primarily used for cooling, whereas gas is predominantly used for heating. Therefore, on cold days, the major domestic use of electricity is lighting.
The next challenge in predicting electricity consumption is that not only does consumption behaviour vary based on the season of the year, but the usage patterns also differ within the same season across different years. This variability makes accurately predicting hourly electricity consumption very challenging. As illustrated in Figure 4-7, the consumption trend for the same time period can vary significantly between different years.
Figure 4-7: Electricity Consumption Behaviours in Same Time in Different Years
In addition, a significant consideration in this study is the province’s population, which stands at 3.3 million. At this scale, energy consumption exhibits inertia, meaning it does not change abruptly. The typical usage pattern mirrors that depicted in Figure 4-8, which shows a consistent daily trend with only marginal fluctuations in energy consumption. However, during peak periods throughout the year, such as those depicted in Figure 4-9, abrupt and substantial changes occur. Such deviations are unusual for an area of this scale and population.
Figure 4-8: Normal Trend for Some Random Nonpeak Day
Figure 4-9: Consumption Trend for Some Random Peak Day
Further investigation reveals that due to electricity shortages within the province, national dispatching authorities intermittently reduce power supply to certain areas to manage overall consumption and prevent blackouts. These reductions are unannounced and determined on a day-to-day basis, presenting a significant challenge for accurate hourly predictions. The presence of such uncertainty renders precise predictions virtually impossible.
In conclusion, electricity consumption is influenced by a variety of factors that make predicting power consumption challenging. These factors include the season and month of the year, the time of day, the duration of day and night, and the time of sunset. Additionally, the maximum and minimum temperatures have a strong effect on the power consumption trend and can shift the timing of peak consumption from evening to afternoon. Moreover, consumption behaviour varies from year to year, even in the same season and under similar temperatures. Furthermore, the availability of other energy sources such as gas has a significant effect on peak time and can eliminate the winter peak. Together, these factors make the accurate prediction of electricity consumption challenging.
4.4 Data Splitting
In this dissertation, the next step in data pre-processing is data splitting. The approach involves creating three different datasets: one for training the algorithms, containing 70% of the data; another with 15% of the data for testing purposes; and finally, a 15% portion reserved for evaluating the system as unseen data. According to Schelter et al. (2020), unseen data in machine learning refers to data that has not been encountered during the model's training phase. It is crucial because it mirrors real-world scenarios, challenging models to make accurate predictions on unfamiliar information. Evaluating on unseen data assesses a model’s ability to generalise beyond its training data and measures its performance and robustness in practical applications. Stakeholders gain confidence in a model’s predictions when they observe its effectiveness on unseen datasets. Additionally, models evaluated on unseen data demonstrate adaptability, which is crucial for navigating dynamic environments and evolving data distributions in real-world scenarios (Schelter et al., 2020).
Finally, the dissertation employed a monthly separation approach, where the dataset was divided based on each month. Subsequently, 70% of the data for each month were allocated for training, 15% for testing, and 15% for evaluation. This method yielded the most favourable results among those investigated in this study.
The code for implementing the monthly data splitting approach is provided in Appendix C under the splitting section.
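The monthly splitting idea described above can be illustrated with a minimal sketch. It is not the Appendix C code: the function name `split_by_month`, the column names `timestamp` and `load_mw`, the random shuffling within each month, and the synthetic data are all assumptions made for illustration. A chronological cut within each month would be an equally plausible reading of the text.

```python
import numpy as np
import pandas as pd


def split_by_month(df, time_col="timestamp", train=0.70, test=0.15, seed=42):
    """Split every calendar month into train/test/evaluation parts.

    Rows of each (year, month) group are shuffled and cut 70/15/15.
    Shuffling is an assumption, not something stated in the text.
    """
    rng = np.random.default_rng(seed)
    parts = {"train": [], "test": [], "evaluation": []}
    for _, month_df in df.groupby(df[time_col].dt.to_period("M")):
        idx = rng.permutation(len(month_df))
        n_train = int(round(train * len(idx)))
        n_test = int(round(test * len(idx)))
        parts["train"].append(month_df.iloc[idx[:n_train]])
        parts["test"].append(month_df.iloc[idx[n_train:n_train + n_test]])
        # Whatever remains in the month becomes the unseen evaluation part.
        parts["evaluation"].append(month_df.iloc[idx[n_train + n_test:]])
    return {name: pd.concat(chunks).sort_values(time_col)
            for name, chunks in parts.items()}


# Synthetic hourly data (January to March 2020) standing in for the real dataset.
hours = pd.date_range("2020-01-01", periods=24 * 91, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"timestamp": hours,
                     "load_mw": np.arange(len(hours), dtype=float)})
splits = split_by_month(demo)
```

Splitting inside each month, rather than over the whole series at once, keeps every month represented in all three subsets in roughly the same 70/15/15 proportions.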
4.5 Training the models
The next step following the data splitting stage is training the selected algorithms. In this dissertation, five different algorithms were chosen: Decision Tree, Random Forest, Light Gradient Boosting Machine (LightGBM), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN).
Initially, the training, testing, and evaluation datasets were prepared for use in the models. Decision Tree, Random Forest, and LightGBM utilised the same dataset, as these models accept the same two-dimensional dataset shape. The code for implementing the training and testing operations for these three models is provided in Appendix C under the Training Three Initial Models section.
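As a rough illustration of how tree-based regressors can share one tabular dataset, the sketch below fits a Decision Tree and a Random Forest from scikit-learn on the same feature matrix. The synthetic features (hour of day, maximum and minimum temperature), the target formula, and all hyperparameters are assumptions for illustration only and are not the settings used in Appendix C. LightGBM is left out so that the sketch depends only on scikit-learn; its `LGBMRegressor` class exposes the same `fit`/`predict` interface and could be added to the dictionary if the package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in features: hour of day, max temperature, min temperature.
n = 2000
hour = rng.integers(0, 24, n)
t_max = rng.uniform(5, 40, n)
t_min = t_max - rng.uniform(5, 12, n)
X = np.column_stack([hour, t_max, t_min])
# Hypothetical load signal with a daily cycle and a temperature effect.
y = 2000 + 30 * t_max + 200 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 20, n)

# The same 2-D arrays are passed to every tree-based model.
X_train, y_train = X[:1400], y[:1400]
X_test, y_test = X[1400:1700], y[1400:1700]

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
test_mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    test_mae[name] = float(np.mean(np.abs(y_test - predictions)))
```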
Convolutional neural networks and LSTM models require three dimensional datasets, where the time step is a crucial dimension (Ma et al., 2018). In this dissertation, after training the initial three models, the last two models were trained, and their results were compared with the other three.
To accommodate the requirements of Convolutional Neural Network and LSTM models, the dataset needed to be reshaped accordingly. The code used for reshaping the dataset, training the models, and calculating error measures is provided in Appendix C under the Training Two Last Models section.
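The reshaping step can be sketched as a sliding window over the tabular data. The helper name `make_windows`, the window length of 24 time steps, and the choice of the next row's target as the label are assumptions for illustration; the actual reshaping in Appendix C may differ.

```python
import numpy as np


def make_windows(features, target, time_steps=24):
    """Turn a 2-D table into (samples, time_steps, features) windows.

    Each sample holds `time_steps` consecutive rows; its label is the
    target value of the row immediately after the window.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples = len(features) - time_steps
    if n_samples <= 0:
        raise ValueError("need more rows than time_steps")
    X = np.stack([features[i:i + time_steps] for i in range(n_samples)])
    y = target[time_steps:]
    return X, y


# 100 hourly rows with 3 features each (synthetic values).
table = np.arange(100 * 3, dtype=float).reshape(100, 3)
load = table[:, 0]
X_seq, y_seq = make_windows(table, load, time_steps=24)
print(X_seq.shape, y_seq.shape)  # (76, 24, 3) (76,)
```

The resulting three-dimensional array follows the (samples, time steps, features) layout that recurrent and one-dimensional convolutional layers commonly expect.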
5 Discussion and Evaluation
To recognise the best performance and identify the best model, the performance of all trained models needs to be evaluated. In this dissertation, error measures such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and R-squared (R^2) were employed to evaluate the output results for each model.
The code used for calculating these error measures is provided in Appendix C.
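For reference, the four error measures can be computed directly from their standard definitions. The following is a minimal sketch using NumPy, not the Appendix C code; the function name `regression_metrics` and the small example arrays are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent) and R^2 for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # MAPE is undefined when y_true contains zeros.
    mape = 100.0 * np.mean(np.abs(err / y_true))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE_%": mape, "R2": r2}


# Hypothetical actual and predicted values.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3000.0, 4200.0]
scores = regression_metrics(actual, predicted)
```

MAE and RMSE are expressed in the units of the target, whereas MAPE and R^2 are scale-free, which is why the results are easier to compare across datasets when all four are reported together.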
All results for the models are presented in Table 5-1, and interpretations of these results are provided after the table.
Table 5-1: Results for Trained Models
| Model | MAE | RMSE | R^2 | MAPE |
| Random Forest | 222.67 | 377.76 | 0.9972 | 0.89% |
| LightGBM | 175.94 | 318.64 | 0.998 | 0.70% |
| Decision Tree | 382.51 | 609.22 | 0.9928 | 1.57% |
| LSTM | 793.19 | 1327.71 | 0.9659 | 3.50% |
| CNN | 765.15 | 1252.71 | 0.9607 | 3.46% |
According to Table 5-1, LightGBM outperforms all other models on every metric, with the lowest MAE and RMSE and the highest R^2. These results suggest that LightGBM is the most accurate model utilised in this dissertation, and its lowest MAPE confirms its superior relative accuracy. The Random Forest model exhibits the second-lowest MAE and RMSE, indicating good performance on the dataset; its high R^2 supports this, and its low MAPE demonstrates high relative accuracy. The Decision Tree model has higher MAE and RMSE than both ensemble models, and its higher MAPE (1.57%) highlights its lower relative accuracy.
The two deep learning models, LSTM and CNN, perform considerably worse than the tree-based models, with MAE and RMSE values at least twice those of any tree-based model and lower R^2 values. These results indicate that, as configured in this study, these models are less well suited to this dataset.
In conclusion, tree-based models, particularly LightGBM, demonstrate superior performance with this dataset. Despite the notable strengths of CNN and LSTM, their performance here is weaker than that of the tree-based models. LightGBM exhibits the best performance among the five models.
6 Conclusions, Limitation and Recommendation
In this dissertation, a dataset spanning eight years of electricity consumption history, integrated with a meteorological dataset for a province with a population of 3.3 million, is utilised to train five different machine learning models. As explained earlier in this dissertation, predictions of electricity consumption depend on several factors such as the season and month of the year, time of each day, temperature and humidity, consumption habits of the people living in the area, provided energy sources, and the scale of the location under study in terms of population and area. Each project has different characteristics, so its results must be interpreted in light of these factors.
6.1 Conclusions
The five machine learning models used in this dissertation include: Random Forest, Decision Tree, LightGBM, LSTM, and CNN. The achieved results show that LightGBM outperforms the other models for this project and dataset, with the following metrics: Mean Absolute Error (MAE) = 175.94, Root Mean Squared Error (RMSE) = 318.64, Mean Absolute Percentage Error (MAPE) = 0.7%, and Coefficient of Determination (R-squared) = 0.99. These results indicate that LightGBM can accurately predict future electricity consumption needs for the entire province. The second most accurate model is Random Forest, which could serve as an alternative choice for predicting future electricity consumption needs or be used simultaneously with LightGBM to compare and enhance accuracy.
6.2 Limitation
One limitation of this project is the large area and high population of the location under study. Predicting electricity consumption needs for smaller areas, such as cities with lower populations, is comparatively easier because consumption habits depend on each individual consumer and are influenced by factors such as the time of sunset. In cities, the time of sunset is generally consistent across the urban area. However, in provinces and larger areas, this time can vary from one place to another. In addition, as previously explained in section 4.3, the electricity consumption trend in large areas with high populations exhibits inertia, resisting sharp and sudden changes in consumption patterns. However, as evident in Figure 4-7 and Figure 4-9, there are sharp decreases in consumption trends due to power cuts resulting from energy shortages. These power cuts are unplanned and occur daily based on the prevailing situation, introducing significant uncertainty that makes hourly and daily predictions very challenging, especially during peak times.
Another limitation in all electricity consumption needs prediction projects is the multitude of uncertainties involved in determining consumption trends. For instance, power consumption is highly correlated with weather conditions and temperatures. However, when predicting future energy consumption, real-time weather conditions and temperatures are often unavailable, and only predictions of these factors are accessible.
6.3 Recommendation
This dissertation concludes with three recommendations for future work. The first is to address the daily need for electricity consumption prediction in every power distribution company. The dataset of dispatched electrical energy can be connected to each company’s machine learning section through the internal network, facilitating automatic data reception and prediction. Such a project could focus on developing the necessary infrastructure, communication protocols, and algorithms.
The second recommendation addresses unplanned power cuts resulting from energy shortages. Although these power cuts are unplanned, once they have been implemented the dispatching department has access to the duration and quantity of undispatched energy. This dissertation proposes that the dispatching department record this data and that it be incorporated into the historical consumption dataset to improve the accuracy of electricity consumption prediction algorithms.
The third recommendation is to tune the model parameters of the LSTM and CNN models to improve their performance.
ABDELLA, G. M., KUCUKVAR, M., ONAT, N. C., AL-YAFAY, H. M. & BULAK, M. E. 2020. Sustainability assessment and modeling based on supervised machine learning techniques: The case for food consumption. Journal of Cleaner Production, 251, 119661.
AHMAD, T., CHEN, H., HUANG, R., YABIN, G., WANG, J., SHAIR, J., AKRAM, H. M. A., MOHSAN, S. A. H. & KAZIM, M. 2018. Supervised based machine learning models for short, medium and long-term energy prediction in distinct building environment. Energy, 158, 17-32.
AL MAMUN, A., SOHEL, M., MOHAMMAD, N., SUNNY, M. S. H., DIPTA, D. R. & HOSSAIN, E. 2020. A comprehensive review of the load forecasting techniques using single and hybrid predictive models. IEEE Access, 8, 134911-134939.
ANTONANZAS, J., OSORIO, N., ESCOBAR, R., URRACA, R., MARTINEZ-DE-PISON, F. J. & ANTONANZAS-TORRES, F. 2016. Review of photovoltaic power forecasting. Solar energy, 136, 78-111.
AZADEH, A., GHADERI, S. & SOHRABKHANI, S. 2008. Annual electricity consumption forecasting by neural network in high energy consuming industrial sectors. Energy Conversion and management, 49, 2272-2278.
BISERČIĆ, A. Z. & BUGARIĆ, U. S. 2021. Reliability of baseload electricity generation from fossil and renewable energy sources. Energy and Power Engineering, 13, 190-206.
CAÑETE, J., CHAPERON, G., FUENTES, R., HO, J.-H., KANG, H. & PÉREZ, J. 2023. Spanish pre-trained bert model and evaluation data. arXiv preprint arXiv:2308.02976.
CHAI, T. & DRAXLER, R. R. 2014. Root mean square error (RMSE) or mean absolute error (MAE). Geoscientific model development discussions, 7, 1525-1534.
CHICCO, D., WARRENS, M. J. & JURMAN, G. 2021. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 7, e623.
CHUNG, Y.-W., KHAKI, B., LI, T., CHU, C. & GADH, R. 2019. Ensemble machine learning-based algorithm for electric vehicle user behavior prediction. Applied Energy, 254, 113732.
DENG, C., ZHANG, X., HUANG, Y. & BAO, Y. 2021. Equipping seasonal exponential smoothing models with particle swarm optimization algorithm for electricity consumption forecasting. Energies, 14, 4036.
DONG, B., CAO, C. & LEE, S. E. 2005. Applying support vector machines to predict building energy consumption in tropical region. Energy and Buildings, 37, 545-553.
DONG, Z., LIU, J., LIU, B., LI, K. & LI, X. 2021. Hourly energy consumption prediction of an office building based on ensemble learning and energy consumption pattern classification. Energy and Buildings, 241, 110929.
DOSHI-VELEZ, F. & PERLIS, R. H. 2019. Evaluating machine learning articles. Jama, 322, 1777-1779.
ERIKSSON, K. & LINDSKOG, M. 2017. Encoding of numerical information in memory: Magnitude or nominal? Journal of Numerical Cognition, 3.
GAO, M., LI, J., HONG, F. & LONG, D. 2019. Day-ahead power forecasting in a large-scale photovoltaic plant based on weather classification using LSTM. Energy, 187, 115838.
GONZALEZ-BRIONES, A., HERNANDEZ, G., CORCHADO, J. M., OMATU, S. & MOHAMAD, M. S. Machine learning models for electricity consumption forecasting: a review. 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), 2019. IEEE, 1-6.
GORMAN, W., JARVIS, S. & CALLAWAY, D. 2020. Should I Stay Or Should I Go? The importance of electricity rate design for household defection from the power grid. Applied Energy, 262, 114494.
GROH, E. D. & ZIEGLER, A. 2022. On the relevance of values, norms, and economic preferences for electricity consumption. Ecological Economics, 192, 107264.
HADI, A. & SOLTANAGHAYI, M. 2022. Electricity consumption forecasting using neural networks [in Persian] [unreadable].
HAMMAD, M. A., JEREB, B., ROSI, B. & DRAGAN, D. 2020. Methods and models for electric load forecasting: a comprehensive review. Logist. Sustain. Transp, 11, 51-76.
JU, Y., SUN, G., CHEN, Q., ZHANG, M., ZHU, H. & REHMAN, M. U. 2019. A model combining convolutional neural network and LightGBM algorithm for ultrashort-term wind power forecasting. IEEE Access, 7, 28309-28318.
KANG, J. & REINER, D. M. 2022. What is the effect of weather on household electricity consumption? Empirical evidence from Ireland. Energy Economics, 111, 106023.
KARIMTABAR, N., PASBAN, S. & ALIPOUR, S. Analysis and predicting electricity energy consumption using data mining techniques—A case study IR Iran— Mazandaran province. 2015 2nd International Conference on Pattern Recognition and Image Analysis (IPRIA), 2015. IEEE, 1-6.
KLYUEV, R. V., MORGOEV, I. D., MORGOEVA, A. D., GAVRINA, O. A., MARTYUSHEV, N. V., EFREMENKOV, E. A. & MENGXU, Q. 2022. Methods of forecasting electric energy consumption: A literature review. Energies, 15, 8919.
LAURINEC, P., LÓDERER, M., LUCKÁ, M. & ROZINAJOVÁ, V. 2019. Density-based unsupervised ensemble learning methods for time series forecasting of aggregated or clustered electricity consumption. Journal of Intelligent Information Systems, 53, 219-239.
LIU, Z., WU, D., LIU, Y., HAN, Z., LUN, L., GAO, J., JIN, G. & CAO, G. 2019. Accuracy analyses and model comparison of machine learning adopted in building energy consumption prediction. Energy Exploration & Exploitation, 37, 1426-1451.
MA, C., GUO, Y., YANG, J. & AN, W. 2018. Learning multi-view representation with LSTM for 3-D shape recognition and retrieval. IEEE Transactions on Multimedia, 21, 1169-1182.
MAHFOUDH, S. & AMAR, M. B. 2014. The importance of electricity consumption in economic growth: The example of African nations. The Journal of Energy and Development, 40, 99-110.
MBAMALU, G. & EL-HAWARY, M. 1993. Load forecasting via suboptimal seasonal autoregressive models and iteratively reweighted least squares estimation. IEEE Transactions on Power Systems, 8, 343-348.
MOSAVI, A. & BAHMANI, A. 2019. Energy consumption prediction using machine learning; a review.
PANKLIB, K., PRAKASVUDHISARN, C. & KHUMMONGKOL, D. 2015. Electricity consumption forecasting in Thailand using an artificial neural network and multiple linear regression. Energy Sources, Part B: Economics, Planning, and Policy, 10, 427-434.
PARHIZKAR, T., RAFIEPOUR, E. & PARHIZKAR, A. 2021. Evaluation and improvement of energy consumption prediction models using principal component analysis based feature reduction. Journal of Cleaner Production, 279, 123866.
QIU, X., RU, Y., TAN, X., CHEN, J., CHEN, B. & GUO, Y. 2023. A k-nearest neighbor attentive deep autoregressive network for electricity consumption prediction. International Journal of Machine Learning and Cybernetics, 1-12.
QUEK, Y., WOO, W. & LOGENTHIRAN, T. 2017. A naïve Bayes Classification Approach for Short-Term Forecast of Photovoltaic System. Proceedings of the Sustainable Energy and Environmental Sciences, Singapore, 6-7.
RAFI, S. H., DEEBA, S. R. & HOSSAIN, E. 2021. A short-term load forecasting method using integrated CNN and LSTM network. IEEE Access, 9, 32436-32448.
SAKAI, T. A simple and effective approach to score standardisation. Proceedings of the 2016 ACM international conference on the theory of information retrieval, 2016. 95-104.
SCHELTER, S., RUKAT, T. & BIESSMANN, F. Learning to validate the predictions of black box classifiers on unseen data. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020. 1289-1299.
SINGH, A. K., KHATOON, S., MUAZZAM, M. & CHATURVEDI, D. Load forecasting techniques and methodologies: A review. 2012 2nd International Conference on Power, Control and Embedded Systems, 2012. IEEE, 1-10.
SINGH, S. N. 2008. Electric power generation: transmission and distribution, PHI Learning Pvt. Ltd.
TASNIM 2022. [unreadable Persian text].
VEERAMSETTY, V., REDDY, K. R., SANTHOSH, M., MOHNOT, A. & SINGAL, G. 2022. Short-term electric power load forecasting using random forest and gated recurrent unit. IEEE Access, 9, 32436-32448.
WANG, X., GUO, P. & HUANG, X. 2011. A review of wind power forecasting models. Energy Procedia, 12, 770-778.
WANG, Z., HONG, T., LI, H. & PIETTE, M. A. 2021. Predicting city-scale daily electricity consumption using data-driven models. Advances in Applied Energy, 2, 100025.
WESTERMANN, P., DEB, C., SCHLUETER, A. & EVINS, R. 2020. Unsupervised learning of energy signatures to identify the heating system and building type using smart meter data. Applied Energy, 264, 114715.
WIKIPEDIA 2024. Mazandaran province.
WILLMOTT, C. J. & MATSUURA, K. 2005. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate research, 30, 79-82.
XIE, Z., WANG, R., WU, Z. & LIU, T. Short-term power load forecasting model based on fuzzy neural network using improved decision tree. 2019 IEEE Sustainable Power and Energy Conference (iSPEC), 2019. IEEE, 482-486.
ZHANG, J., ZHANG, H., DING, S. & ZHANG, X. 2021a. Power consumption predicting and anomaly detection based on transformer and K-means. Frontiers in Energy Research, 9, 779587.
ZHANG, L., WEN, J., LI, Y., CHEN, J., YE, Y., FU, Y. & LIVINGOOD, W. 2021b. A review of machine learning in building load prediction. Applied Energy, 285, 116452.