Asynchronous federated learning with partial weights aggregation for energy consumption forecasting

Abstract

Accurate energy forecasting is essential for grid stability, demand-side management, and efficient renewable integration. However, energy consumption data collected from smart meters may expose sensitive user information, thus raising privacy concerns. Federated Learning (FL) offers a privacy-preserving mechanism for collaborative model training without sharing raw data. However, conventional synchronous FL suffers from training delays caused by heterogeneous client availability and computational capabilities, while frequent exchange of model parameters can lead to communication overheads. To address these challenges, this paper proposes an asynchronous federated learning framework for energy forecasting that enables continuous global model updating without waiting for all clients to complete local training. We introduce a federated asynchronous adaptive aggregation mechanism, where client-specific learning rates are dynamically adjusted based on both update staleness and model performance contribution. A partial aggregation strategy is defined for a Long Short-Term Memory (LSTM) forecasting model that splits the local models’ layers, allowing clients to exchange only a subset of the weights with the server. The proposed solution is evaluated using real-world energy consumption data from multiple consumers. Experimental results demonstrate that the proposed asynchronous adaptive strategy outperforms the classic FedAvg approach and maintains prediction accuracy relative to personalised FedAvg, while reducing communication costs. Additionally, the proposed method outperforms the classic FedAsync algorithm across all client groups, with statistically significant improvements in most cases.

Keywords

asynchronous federated learning; load forecasting; partial weight aggregation; smart grid; LSTM neural network

1. Introduction

The necessity of accurate energy consumption forecasting has increased significantly with the evolution of modern power systems and the growing integration of distributed energy resources. The high penetration of renewable energy sources introduces variability and uncertainty in power generation, requiring more proactive grid management strategies and advanced operational planning.¹ In this context, demand response programmes aim to better align electricity consumption with available generation resources.² However, their efficiency depends on consumers’ engagement and on accurate energy forecasting at fine granularity. The integration of smart meters and the Internet of Things (IoT) has facilitated this process by generating large volumes of high-resolution energy consumption data that provide detailed insights into consumer usage patterns and create valuable datasets for training advanced forecasting models.³ However, centralised forecasting architectures require the collection and storage of large volumes of fine-grained consumption data in a cloud data centre, which creates privacy and security concerns.⁴ Smart meter data can reveal detailed information about household behaviour, occupancy patterns, and lifestyle habits, making citizens reluctant to accept centralised energy data storage.⁵ In addition, the large volumes of monitored data contribute significantly to communication overhead, especially when data exchanges with the centralised servers are frequent.

In this context, federated learning (FL) offers a promising solution by enabling collaborative model training across decentralised data sources while preserving data privacy. Only the updated model weights are shared with the server for aggregation, and the raw energy consumption data remains stored locally.⁶,⁷ However, FL faces challenges including non-identical data distribution across clients, computational heterogeneity and synchronisation in training schemes. In the case of energy consumption, data on clients is non-IID (non-independent and identically distributed) because it can vary significantly due to personal user habits, external conditions, or differences in device configurations.⁸ The number of available training samples can also vary between clients, which can negatively impact model convergence and performance. In some cases, a synchronous FL process can result in either unfair or slow global model updates. When participation is constrained by time limits, faster clients are favoured, while ensuring full participation requires waiting for slower clients, which delays model updating.⁹ Communication costs also remain challenging, as clients must repeatedly exchange high-dimensional model parameters, such as weights and gradients, during each training round. This issue becomes more pronounced when deep learning models, such as Long Short-Term Memory (LSTM), are used for load forecasting. Moreover, in synchronous FL, the global model aggregation can only proceed after all participating clients have submitted their local updates. This makes the training process dependent on slower or intermittently available clients, introducing significant synchronisation delays.¹⁰

To address these challenges, we propose an asynchronous FL framework that allows the global model to be continuously updated as soon as client updates are received, without requiring all clients to complete their local training. In this way, we address the delays caused by heterogeneous client availability and computational capabilities, as we consider more relaxed synchronisation constraints. To ensure fairness and stability in the aggregation process, we introduce an adaptive aggregation mechanism in which each client’s update is weighted through a dynamically adjusted learning rate. This client-specific learning rate accounts for both the staleness of the update, reflecting the time elapsed since the client last synchronised with the global model, and its contribution to global performance, evaluated on a server-side validation dataset. In this way, outdated or less beneficial updates have a more limited influence on the global model, while more informative contributions are given greater weight. Additionally, to address the communication overhead associated with frequent model parameter exchanges, we define a partial aggregation mechanism. Instead of transmitting the full set of model weights at each communication round, only selected layers of the LSTM forecasting model are transmitted. The main contributions can be summarised as follows:

• Design and development of an asynchronous FL framework for energy load forecasting that accommodates client heterogeneity in terms of computational resources and availability, enabling continuous global model updates.

• Design of a selective layer sharing mechanism for the LSTM forecasting model, where only weights from specific layers are shared and aggregated globally to reduce communication overhead.

• Integration of a federated asynchronous adaptive aggregation mechanism that uses an adaptive learning rate specific to each client that is updated every round based on both staleness and contribution to the global model performance.

The rest of the paper is structured as follows: Section 2 presents the state of the art in FL for energy forecasting and partial aggregation mechanisms, Section 3 presents the proposed asynchronous FL system with partial global model aggregation and an adaptive learning rate for each client, Section 4 presents the prediction results evaluated on a real energy consumption dataset, Section 5 discusses the findings and presents a comparative analysis, while Section 6 concludes the paper.

2. Related work

To contextualise the proposed approach, this section examines advances in energy consumption forecasting, reviews FL frameworks for distributed model training, and discusses synchronisation strategies in FL environments.

Energy consumption forecasting has evolved from statistical methods, such as autoregressive models, to more advanced machine learning approaches capable of capturing nonlinear relationships in data. In Ref. 11 the authors use an Autoregressive Integrated Moving Average (ARIMA) model for time series forecasting of energy consumption across multiple energy sources, leveraging historical data to predict long-term future trends. With the growing availability of high-resolution time-series data, deep learning architectures, particularly recurrent models like LSTM¹²,¹³ and Gated Recurrent Unit (GRU),¹⁴ have demonstrated the ability to model temporal dependencies and complex consumption patterns. In Ref. 15 the authors investigate peak electrical energy consumption forecasting using both traditional time series models and deep learning models, as well as hybrid combinations (ARIMA-LSTM and ARIMA-GRU), demonstrating that hybrid approaches achieve superior predictive performance. A comparative evaluation of Recurrent Neural Network (RNN)-based deep learning models for smart grid energy demand forecasting was conducted in Ref. 16, showing that GRU was the most effective due to higher performance metrics and better handling of long-term dependencies. More recently, transformer-based architectures have been adopted for energy consumption forecasting because their self-attention mechanism allows explicit modelling of long-range dependencies.¹⁷ Models such as the Temporal Fusion Transformer¹⁸,¹⁹ demonstrate improved performance on high-dimensional and long-horizon time-series tasks by effectively integrating multi-modal inputs and capturing complex temporal dependencies, which enhances forecasting accuracy, particularly in settings with heterogeneous data and limited training samples. However, while these approaches have demonstrated strong predictive performance, they mostly rely on centralised training that requires aggregating raw consumption data, which raises privacy and security concerns, especially in the energy domain when dealing with consumption data from individual household consumers.²⁰

To address these limitations, FL has emerged as a distributed training paradigm that enables collaborative model development without sharing raw data.²¹ LSTM models are often used for energy consumption predictions on historical monitored data in FL settings. In Ref. 5 the authors propose an FL-based approach for short-term residential load forecasts. Additionally, researchers propose combining an LSTM network with a multi-head self-attention mechanism to capture temporal dependencies and focus on the most relevant sensor features for energy demand prediction, whilst extending training to a decentralised learning setup.²² This type of solution can address both communication overhead and privacy concerns by training models locally and, possibly, clustering the users based on consumption patterns and similarities.²³ Federated Averaging (FedAvg) allows clients to perform several local updates before sending the updated model weights to the server, which aggregates them through a weighted average to update the global model.²⁴,²⁵ However, FedAvg may perform poorly in non-IID settings due to differences in client data distribution. Multiple algorithms have been proposed to address this challenge, such as Federated Stochastic Variance Reduced Gradient (FedSVRG).²⁶ FedProx is an aggregation method based on FedAvg, which introduces a proximal term to stabilise training by preventing local models from diverging too far from the global one.²⁷,²⁸ Another improved version of FedAvg is Clipped Average Aggregation, which clips model updates to a predefined range before averaging.²⁹ To address privacy concerns regarding sharing weights, differential privacy methods are introduced where each client adds noise to its model updates before sending them to the server.³⁰ Additionally, some approaches use bio-inspired optimisation algorithms in the aggregation process to improve the convergence speed and performance of the federated model.³¹

Despite its privacy-preserving benefits, FL introduces challenges in maintaining high prediction performance and ensuring fair model updates.³² Model performance can degrade due to statistical heterogeneity across clients, leading to imbalanced and potentially unfair contributions, especially in non-IID settings where certain clients disproportionately influence the global model.³³ Additionally, in practical deployments, system heterogeneity and varying client availability introduce challenges related to delayed or stale updates.³⁴ Furthermore, communication constraints, such as limited bandwidth, intermittent connectivity, and transmission delays, can further impact training efficiency.³⁵ These challenges highlight the need for more adaptive training and aggregation mechanisms that can effectively address both data and system-level heterogeneity. In synchronous FL, the central server distributes the global model to a selected subset of clients, waits for all of them to finish training locally, and then aggregates the updates received.³⁶ This scheme can suffer from stability issues, particularly in the presence of heterogeneous client devices and unreliable communication. Slow clients can delay the entire training process, increasing the total convergence time.³⁷

In comparison, asynchronous FL provides several advantages by accommodating heterogeneous clients, reducing idle times, and increasing fault tolerance.³⁸,³⁹ Asynchronous FL allows the server to aggregate model updates from clients as soon as they are received, without waiting for other clients to complete their training. Using this strategy, the server updates the global model whenever it receives a client update. This ensures the gradual integration of new updates while preserving stability. In addition, this mechanism enables devices and networks with heterogeneous capabilities to participate in the training process.⁹ This type of strategy is well suited to energy forecasting environments where edge devices may have intermittent connectivity or differences in the complexity of the data.⁴⁰ However, asynchronous FL can negatively affect convergence and overall model accuracy, as the global model incorporates outdated or inconsistent local information.⁴¹ Semi-asynchronous FL approaches have also been investigated, highlighting the trade-off between the latency of synchronous approaches and the accuracy loss of asynchronous ones.⁴²,⁴³

Partial weight training is a strategy adopted in FL to reduce communication overhead and enable personalisation, particularly in non-IID environments.⁴⁴ In this approach, only a subset of model layers (the feature extractor) is shared with the server for aggregation and used for collaborative training. The rest of the layers remain local and are trained on each client to capture client-specific patterns. FedPA (Partial Aggregation Strategy) introduces an adaptive aggregation number per round, and only a subset of client models is used to update the global model.⁴⁵ This strategy integrates stale updates by adjusting weights based on staleness and data size, balancing convergence and efficiency. Other partial model aggregation mechanisms aggregate only the lower layers of the neural network (the feature extractors), keeping the upper layers (predictors) local for personalisation.⁴⁶ This design has been shown to improve generalisation under non-IID data while significantly reducing transmission costs.

3. Asynchronous federated learning

We propose an asynchronous FL approach with partial weight aggregation for 24-hour energy consumption forecasting. The asynchronous mechanism aims to ensure fair participation amongst heterogeneous clients by updating the global model immediately upon receiving updates from individual clients. In addition, a partial weight aggregation method was implemented starting from the work presented in Ref. 30.

3.1. System architecture

Figure 1 presents the proposed asynchronous FL architecture incorporating the partial weight update strategy. The clients have access to their own private data and a local model (LSTM) split into a shared feature extractor and a local predictor. Each client requests the global model weights corresponding to its selected layers, initialises its local model, trains the entire model and sends only the updates for the shared layers (feature extractor), keeping the predictor layers private. The split of the model is controlled by selected layers, which indicate the layers that the client is willing to share for global aggregation.

Figure 1.

Async FL with adaptive aggregation architecture.

The server receives and aggregates updates asynchronously to the global model, considering the selected layers for each client and using the adaptive learning rate. The learning rate for each client is dynamically updated whenever the client sends a model update. The adjustment depends on two factors: (1) the performance of the client’s new weights on a server-side synthetic validation dataset and (2) the client’s staleness, a variable that quantifies the delay since the client’s last update. The synthetic validation dataset approximates the statistical characteristics of real energy consumption data. This dataset is generated by sampling from aggregated client statistics, ensuring that no individual user data is transmitted to the server. The clients compute summary statistics locally (e.g., mean, variance, hourly consumption distributions) over their private data and share only these aggregated metrics with the server. Using these statistics, the server samples values from distributions defined by client-provided means and variances to create realistic sequences. The synthetic dataset incorporates multiple usage patterns, including peak and off-peak hours, weekday and weekend variations, and seasonal trends, to provide a representative reference for model evaluation.
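The construction of the synthetic validation set described above can be illustrated with a minimal sketch. It assumes that each client shares one (mean, variance) pair per hour of the day and that the server samples from independent Gaussians clipped at zero; the distribution family, the clipping and the function names `hourly_stats` and `sample_synthetic_day` are assumptions made for this example, and weekday, weekend and seasonal patterns are not modelled here.

```python
import math
import random
from statistics import fmean, pvariance


def hourly_stats(readings, steps_per_hour=4):
    """Client side: per-hour-of-day (mean, variance) of a 15-minute series.

    `readings` is a flat list starting at midnight and covering at least one
    full day. Only these 24 pairs leave the client, never the raw series.
    """
    buckets = [[] for _ in range(24)]
    for i, value in enumerate(readings):
        hour = (i // steps_per_hour) % 24
        buckets[hour].append(value)
    return [(fmean(b), pvariance(b)) for b in buckets]


def sample_synthetic_day(stats, rng, steps_per_hour=4):
    """Server side: draw one 96-step day from the shared statistics."""
    day = []
    for mean, var in stats:
        for _ in range(steps_per_hour):
            # Assumption: Gaussian sampling, clipped because consumption
            # cannot be negative.
            day.append(max(0.0, rng.gauss(mean, math.sqrt(var))))
    return day


# Two identical days where the consumption equals the hour of the day.
readings = [float(h) for _ in range(2) for h in range(24) for _ in range(4)]
stats = hourly_stats(readings)
synthetic_day = sample_synthetic_day(stats, random.Random(0))
```

Because the server only ever sees the 24 summary pairs, the sampled sequences can resemble typical daily profiles without reproducing any individual reading.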

The flow of the asynchronous FL training process is presented in Figure 2. First, the server initialises the global model. Before participating in a training round, a client must register with the server so that its learning rate is initialised. When a client wants to start training, it requests the part of the global model corresponding to its selected shared layers. The server initialises the client’s staleness, and the client starts training the model locally on its private data. After local training, the client sends the updated weights of the shared layers back to the server, triggering the adaptive aggregation process. The server immediately updates the client’s learning rate and applies the received update to the global model. Lastly, the server increases the staleness of all the other clients. With such an asynchronous strategy, clients participate independently based on availability, requesting the latest global model at any time and performing local training. The server integrates the updates from the clients immediately, without waiting for other clients.

Figure 2.

Asynchronous FL with partial aggregation flow.

3.2. Asynchronous adaptive aggregation method

The aggregation method used for the proposed asynchronous FL system is described in Algorithm 1. The FedAsyncAdaptive function is executed each time a client c sends updates. The global model weights (line 3), as well as the staleness and learning rate of each client, are stored on the server (lines 4-5). The updated weights for this step are received from the client (line 6). First, the delta loss between the global model and the received update is computed on the validation set of the server (line 9). Then, the learning rate of client c is updated according to the loss delta and the staleness of the client (line 10). The staleness of every other client is increased whenever a client updates the model (lines 11-12). Lastly, the global model weights are updated and stored on the server (line 14).
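The server-side steps described above can be sketched as follows. The exact learning-rate formula is not given in the text, so the rule used here (a base rate, halved when the update does not reduce the validation loss, and divided by 1 + staleness) is purely an illustrative assumption. The class name, the method names and the representation of weights as a dictionary from layer name to a list of floats are also hypothetical.

```python
class AsyncAdaptiveServer:
    """Illustrative server for asynchronous adaptive aggregation."""

    def __init__(self, global_weights, val_loss, base_lr=0.5):
        self.global_weights = global_weights  # {layer name: list of floats}
        self.val_loss = val_loss              # callable(weights) -> float
        self.base_lr = base_lr
        self.staleness = {}
        self.lr = {}

    def register(self, client_id):
        self.staleness[client_id] = 0
        self.lr[client_id] = self.base_lr

    def request_model(self, client_id, selected_layers):
        # A fresh copy of the global model means the client is up to date.
        self.staleness[client_id] = 0
        return {n: list(self.global_weights[n]) for n in selected_layers}

    def fed_async_adaptive(self, client_id, update):
        # Loss delta between the current global model and the model obtained
        # by plugging in the client's shared layers (positive = improvement).
        candidate = {**self.global_weights, **update}
        delta = self.val_loss(self.global_weights) - self.val_loss(candidate)

        # Assumed rule: reward useful updates, discount stale ones.
        quality = 1.0 if delta > 0 else 0.5
        lr = self.base_lr * quality / (1.0 + self.staleness[client_id])
        self.lr[client_id] = lr

        # Every other client is now one global update further behind.
        for other in self.staleness:
            if other != client_id:
                self.staleness[other] += 1

        # Blend only the layers the client actually shared.
        for name, new in update.items():
            old = self.global_weights[name]
            self.global_weights[name] = [
                (1.0 - lr) * o + lr * n for o, n in zip(old, new)
            ]
        return lr


# Toy validation loss: squared distance of lstm_0 from a target of all ones.
loss = lambda w: sum((x - 1.0) ** 2 for x in w["lstm_0"])
server = AsyncAdaptiveServer({"lstm_0": [0.0, 0.0], "dense": [1.0]}, loss)
for cid in ("a", "b"):
    server.register(cid)
    server.request_model(cid, ["lstm_0"])
lr_a = server.fed_async_adaptive("a", {"lstm_0": [1.0, 1.0]})
lr_b = server.fed_async_adaptive("b", {"lstm_0": [1.0, 1.0]})
```

In this toy run, client b submits the same update as client a but one global step later, so under the assumed rule its learning rate is halved by the staleness discount.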

3.3. Partial weight aggregation

The LSTM model layers are divided into a shared feature extractor and a private predictor that remains on the client.⁴⁶ We extend this approach by enabling custom model splitting for each client; thus, each client can have a custom number of selected layers for the feature extractor and for the predictor, respectively. This enhances the personalisation of client training. The steps for local training and partial weight update on the server are described in Algorithm 2. The client c has its model weights from the previous training round (line 3) as well as a list of selected layers (line 4), which indicates how the model of the client is split for this training round. This function returns the feature extractor part of the updated weights to the server (line 5). First, the client splits its previous round weights (line 7) and then requests the corresponding feature extractor weights from the server (line 8). In this step, the server stores the layer configuration of the client for this round. The client initialises its model with the weights composed of the global feature extractor and its own predictor (line 9) and then trains it on the local dataset (line 10). The updated weights are then split (line 11) into a feature extractor that is sent to the server (line 12) and a predictor that is kept private on the client. This supports both communication efficiency and client-level personalisation.
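The client-side round described above can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 2: weights are represented as a flat dictionary from layer name to a list of floats, and `fetch_global` and `train` are hypothetical callables standing in for the server request and the local LSTM training.

```python
def split(weights, selected_layers):
    """Split a flat {layer: weights} dict into (shared, private) parts."""
    shared = {n: w for n, w in weights.items() if n in selected_layers}
    private = {n: w for n, w in weights.items() if n not in selected_layers}
    return shared, private


def client_round(prev_weights, selected_layers, fetch_global, train):
    # Keep only the private predictor from the previous round.
    _, predictor = split(prev_weights, selected_layers)
    # Request the global feature extractor for the selected layers.
    extractor = fetch_global(selected_layers)
    # Compose the local model: global extractor + own predictor.
    model = {**predictor, **extractor}
    # Train the entire model on local data.
    trained = train(model)
    # Only the shared part is uploaded; the rest stays on the client.
    to_server, kept_private = split(trained, selected_layers)
    return to_server, kept_private


# Toy stand-ins: the server returns ones, "training" adds 0.5 everywhere.
prev = {"lstm_0": [0.0], "lstm_1": [0.0], "dense": [9.0]}
fetch = lambda layers: {n: [1.0] for n in layers}
fake_train = lambda w: {n: [v + 0.5 for v in vals] for n, vals in w.items()}
uploaded, private = client_round(prev, ["lstm_0"], fetch, fake_train)
```

Note that the predictor layers are trained in every round but never leave the client, which is what makes the per-client choice of selected layers a personalisation knob as well as a communication knob.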

The weights are split based on the layers that are selected by the client to be shared. The split_layers function divides the weights into the feature extractor (weights that are in the selected layers list) and predictor (weights that are not in the list). This is achieved by traversing the model recursively. Thus, only a subset of the weights is shared with the server, minimising communication costs and enabling personalised training on clients. Additionally, this function is used when updating the shared part of the client model with the global model received from the server, ensuring that only globally aggregated layers are overwritten.

The split_layers function is further described in Algorithm 3, having as input the entire LSTM model and the selected_layers array (lines 2-4), and returning two sets of weights (line 9) representing the feature extractor (shared) and predictor (private) weights. First, the two arrays are initialised (line 7), and then the recursive function is called (line 8). The recursive function (lines 11-26) passes through the layers of the LSTM, and for each layer that has sublayers, it calls itself to build the two required arrays (lines 17-20). For the layers that have no sublayers, their corresponding weights are directly concatenated either to the feature extractor or to the predictor weights array (lines 21-26). This way, the weights are divided, and the clients can share with the server only the selected ones. A function implementing the inverse process rebuilds the entire LSTM model by reuniting the feature extractor and predictor weights. This function is needed on both the client and server side.
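A minimal sketch of the recursive split and its inverse is given below. It assumes a model represented as nested dictionaries with a `name`, an optional list of `sublayers` and an optional list of `weights`, and it assumes that a leaf is shared when it or any of its ancestors appears in the selected layers; both the representation and this inheritance rule are assumptions of the example rather than details taken from Algorithm 3.

```python
def split_layers(model, selected_layers):
    """Return (shared, private) flat weight lists in traversal order."""
    shared, private = [], []
    _split(model, set(selected_layers), False, shared, private)
    return shared, private


def _split(layer, selected, inherited, shared, private):
    is_shared = inherited or layer["name"] in selected
    children = layer.get("sublayers", [])
    if children:
        for child in children:
            _split(child, selected, is_shared, shared, private)
    else:
        # Leaf layer: append its weights to the matching array.
        (shared if is_shared else private).extend(layer.get("weights", []))


def merge_layers(model, selected_layers, shared, private):
    """Inverse of split_layers: write flat arrays back into the model."""
    _merge(model, set(selected_layers), False, iter(shared), iter(private))


def _merge(layer, selected, inherited, shared_it, private_it):
    is_shared = inherited or layer["name"] in selected
    children = layer.get("sublayers", [])
    if children:
        for child in children:
            _merge(child, selected, is_shared, shared_it, private_it)
    else:
        source = shared_it if is_shared else private_it
        layer["weights"] = [next(source) for _ in layer.get("weights", [])]


model = {"name": "model", "sublayers": [
    {"name": "lstm_0", "sublayers": [
        {"name": "kernel", "weights": [1.0, 2.0]},
        {"name": "bias", "weights": [3.0]},
    ]},
    {"name": "lstm_1", "weights": [4.0]},
    {"name": "dense", "weights": [5.0, 6.0]},
]}
shared, private = split_layers(model, ["lstm_0"])
# Overwrite only the shared part, e.g. with weights received from the server.
merge_layers(model, ["lstm_0"], [10.0, 20.0, 30.0], private)
shared_after, private_after = split_layers(model, ["lstm_0"])
```

Because both functions traverse the model in the same order, the flat arrays can be exchanged between client and server without sending any layer metadata beyond the selected layer names.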

4. Evaluation results

The consumers used for evaluation were selected from a real-world dataset that comprises time-series measurements of monitored energy over continuous periods for approximately 1000 prosumers, identified by unique IDs and having different energy scales.³¹ For evaluation, only the energy consumption component was considered. The consumption data is recorded in kWh at 15-minute intervals and spans an approximate period of four years (2015–2018), offering a high temporal resolution for consumption patterns. It contains a timestamp indicating the date and time at which the measurement was taken and the active energy measured at that timestamp. A subset of 40 consumers was selected through a multi-stage filtering process. First, consumers with missing values were excluded. Second, we removed consumers who have strong temporal drift between the training and validation periods (e.g., near-zero consumption throughout the validation period), as well as those with a high proportion of zero readings. This ensured that only consumers with continuous and consistent measurements were used in the federated process. Finally, to ensure diversity in consumption behaviour and have a non-IID scenario, the average load profiles of the remaining candidates were analysed, and the clients were selected to reflect a wide range of distinct usage patterns.

The time series of each client was first normalised using a MinMaxScaler applied independently per client. The scaler parameters were fitted on the training set of each client’s data and then applied to the validation and test splits. The training samples were created using a sliding window approach. Specifically, for each client, input sequences of length 96 were constructed, corresponding to the past 24 hours of energy consumption at a 15-minute resolution. Each input window $x_{t} = [y_{t - 95}, \dots, y_{t}]$ was paired with the next target values $[y_{t + 1}, \dots, y_{t + 96}]$, representing the consumption for the following 24 hours. The prediction model architecture consists of a three-layer LSTM stack (128, 64, 32 units) followed by a Dense layer with 128 ReLU units and a final output layer producing the full sequence forecast. The input size of the model corresponds to the window size mentioned before, and the output length is 96, representing the energy consumption forecasted for the following 24 hours. During training, the Adam optimiser is used with mean-squared error (MSE) as the primary loss function.
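The preprocessing described above can be sketched in plain Python as follows. The function names are hypothetical, and the min-max scaling is written out by hand instead of calling scikit-learn so that the example stays self-contained; like a scaler fitted on the training split only, it can produce values outside [0, 1] on later splits.

```python
def fit_min_max(train):
    """Fit scaling parameters on the training split only."""
    lo, hi = min(train), max(train)
    span = (hi - lo) or 1.0  # guard against a constant series
    return lo, span


def scale(series, lo, span):
    return [(v - lo) / span for v in series]


def make_windows(series, window=96, horizon=96):
    """Pair each input window with the `horizon` values that follow it."""
    inputs, targets = [], []
    for t in range(window, len(series) - horizon + 1):
        inputs.append(series[t - window:t])   # y_{t-95}, ..., y_t
        targets.append(series[t:t + horizon])  # y_{t+1}, ..., y_{t+96}
    return inputs, targets


# Small illustration with a window of 3 and a horizon of 2.
lo, span = fit_min_max([2.0, 4.0, 6.0])
scaled_test = scale([2.0, 6.0, 8.0], lo, span)
X, Y = make_windows([float(v) for v in range(10)], window=3, horizon=2)
```

With the paper's settings (window 96, horizon 96), a series of N readings yields N - 191 training pairs.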

We evaluated the prediction performance for the proposed strategy by simulating the FL process. We have considered three different groups of clients with different training durations: fast (10s), medium (20s), and slow (40s). Within each group, clients were further subdivided by the number of shared LSTM layers uploaded per round to see the impact on the communication cost reduction (see Table 1).

Table 1.

Selected clients and simulation setup.

Group	Selected layers	No. of clients
Fast (10s)	lstm_0, lstm_1, lstm_2	5
Fast (10s)	lstm_0, lstm_1	6
Fast (10s)	lstm_0	5
Medium (20s)	lstm_0, lstm_1, lstm_2	5
Medium (20s)	lstm_0, lstm_1	5
Medium (20s)	lstm_0	4
Slow (40s)	lstm_0, lstm_1, lstm_2	3
Slow (40s)	lstm_0, lstm_1	3
Slow (40s)	lstm_0	4

The simulation was run over a virtual time of 640 seconds, with each client participating according to its group. Figure 3 shows the staleness over the simulation period for each client group. Staleness measures the number of global model updates that occurred between a client’s request for the global model and its current update. Fast clients maintain low staleness, while slow clients remain near the maximum for the entire duration of the simulation.

Figure 3.

Staleness over simulation time for each group.
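The virtual-clock behaviour can be illustrated with a small event-queue sketch. It uses one client per group instead of the 40 clients of Table 1, assumes that a client requests a new global model immediately after uploading, and measures staleness as defined above; the function name `simulate` and the tie-breaking by client name are choices made for this example.

```python
import heapq


def simulate(durations, horizon=640):
    """durations: {client id: seconds per local round}.

    Returns the number of completed rounds per client and, for each upload,
    the number of global updates since that client's last model request.
    """
    version = 0  # number of global updates applied so far
    pulled_at = {c: 0 for c in durations}
    rounds = {c: 0 for c in durations}
    staleness_log = {c: [] for c in durations}

    # Each event is (finish time of the local round, client id).
    events = [(d, c) for c, d in durations.items()]
    heapq.heapify(events)
    while events:
        t, c = heapq.heappop(events)
        if t > horizon:
            continue  # round would end after the simulation window
        staleness_log[c].append(version - pulled_at[c])
        version += 1           # the server applies the update immediately
        rounds[c] += 1
        pulled_at[c] = version  # the client requests the new global model
        heapq.heappush(events, (t + durations[c], c))
    return rounds, staleness_log


rounds, staleness_log = simulate({"fast": 10, "medium": 20, "slow": 40})
```

Under these assumptions the three clients complete 64, 32 and 16 rounds respectively, and the slow client always uploads with higher staleness than the fast one, which is the qualitative pattern described for Figure 3.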

Figure 4 presents the evolution of the coefficient of determination (R²) and the symmetric mean absolute percentage error (sMAPE) during training, per local client round, for each client group and number of shared layers. The results are averaged within each group, with shaded regions indicating variability. Fast clients, which complete approximately 65 local training rounds within the simulation window, contribute most significantly to the global model. In this group, configurations sharing 2 layers achieve the best trade-off, with higher R² and lower sMAPE compared to both more localised (1 layer) and fully shared (3 layers) setups. This suggests that partial sharing provides a balance between global model generalisation and local adaptation. For the Medium group (32 rounds), the configuration with 3 shared layers shows the most stable and strongest performance. Slow clients (around 15 rounds) have slower convergence, with overall lower R² and higher sMAPE across all configurations. This behaviour is expected, as their limited participation leads to stale updates, which are frequently discarded or have reduced impact on the global aggregation. Although performance differences between sharing strategies are less pronounced in this group, configurations with more shared layers achieve higher training performance.

Figure 4.

R² and sMAPE during training by client group and shared layers.

The detailed evaluation results on the test set for the proposed method are presented in Table 2. Fast clients achieve the highest predictive accuracy, with Fast_3L and Fast_2L reaching R² values above 0.70 and the lowest normalised mean absolute error (NMAE) scores (0.18 and 0.19, respectively), indicating that faster clients, which contribute more updates, benefit the most from the model. Among medium-speed clients, Medium_3L has the highest R² of any group (0.7295). In contrast, slow clients have the weakest performance across all metrics, with R² values consistently below 0.47 and notably higher NMAE (up to 0.49 for Slow_2L), suggesting that the low participation of the slow clients affects the prediction quality on their data. The mean absolute error (MAE) values show high variance, reflecting the wide range of consumption magnitudes across clients in these groups, making NMAE the more reliable metric for cross-group comparison.

Table 2.

Evaluation results by category.

Client group	R²	MAE	NMAE
Fast_3L	0.7032 ± 0.1492	1.1435 ± 1.1431	0.1764 ± 0.0974
Fast_2L	0.7177 ± 0.1371	86.7016 ± 201.3182	0.1920 ± 0.1266
Fast_1L	0.5196 ± 0.1105	22.3526 ± 43.2938	0.2167 ± 0.1200
Medium_3L	0.7295 ± 0.1983	4.1734 ± 4.6624	0.1986 ± 0.1645
Medium_2L	0.5805 ± 0.2700	3.1177 ± 1.8102	0.2281 ± 0.1901
Medium_1L	0.6944 ± 0.1103	4.6902 ± 4.0115	0.2317 ± 0.1801
Slow_3L	0.4680 ± 0.2571	9.9728 ± 4.1610	0.3228 ± 0.0660
Slow_2L	0.4555 ± 0.0495	7.9559 ± 2.5575	0.4869 ± 0.2347
Slow_1L	0.4594 ± 0.1456	19.0880 ± 22.2393	0.3819 ± 0.1632
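For reference, the metrics used in this section can be sketched as follows. R², MAE and sMAPE follow their usual definitions; the normalisation used for NMAE is not stated in the text, so dividing the MAE by the mean absolute actual value is an assumption of this example.

```python
def mae(y_true, y_pred):
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def nmae(y_true, y_pred):
    # Assumption: normalise by the mean absolute actual consumption.
    scale = sum(abs(a) for a in y_true) / len(y_true)
    return mae(y_true, y_pred) / scale


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def smape(y_true, y_pred):
    """Symmetric MAPE in percent; terms with a zero denominator count as 0."""
    total = 0.0
    for a, p in zip(y_true, y_pred):
        denom = abs(a) + abs(p)
        if denom > 0:
            total += 2.0 * abs(p - a) / denom
    return 100.0 * total / len(y_true)


small = ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
large = ([100.0, 200.0, 300.0, 400.0], [100.0, 200.0, 300.0, 500.0])
```

The two toy consumers differ only by a factor of 100 in scale: their MAE differs by the same factor while their NMAE is identical, which illustrates why NMAE is preferred for comparing groups with different consumption magnitudes.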

5. Discussion

We first compared the proposed solution against the FedAvg baseline method on the same datasets and model configuration. The classic FedAvg baseline trains a fully shared global model by aggregating all layers, including all LSTM layers and the dense layer, across clients. In addition, we compared the model with a Personalised FedAvg in which the aggregation is limited to the LSTM layers while keeping the dense layer client-specific, ensuring a fair comparison in non-IID settings. The same client groups as for the proposed asynchronous method were used, whilst the FedAvg methods were trained synchronously. The NMAE metric comparison is shown in Figure 5. Among these three methods, classic FedAvg has the highest NMAE in all nine groups, with a significant gap for fast clients. The difference is smaller in the slow groups, whose clients participated less during the asynchronous training. Personalised FedAvg shows similar performance to our solution for medium clients and a smaller gap for the fast ones, whilst in all three slow groups, Personalised FedAvg achieves lower NMAE than our solution. Additionally, we compared our solution with FedAsync, a classic asynchronous FL method proposed by Xie et al.³⁸ In this approach, each model update is weighted by a staleness discount, the clients share all model layers, and the update quality is not considered in the weight. For evaluation, we used the same virtual clock and the uploading intervals defined by the client group (fast, medium and slow). The results show that our solution achieves lower NMAE than the classic FedAsync baseline across all client groups, with mean reductions ranging from 0.09 to 0.22 NMAE across groups (Table 3).

Figure 5.

NMAE comparison with personalised FedAvg, FedAvg and classic Async FL by client group.

Table 3 shows the paired t-test results comparing our solution with personalised FedAvg, classic FedAvg, and the classic FedAsync method. The table reports the mean difference in NMAE (ΔNMAE) of the proposed method relative to each baseline (PersFedAvg, FedAvg and FedAsync), along with the corresponding t-statistics and p-values obtained from paired t-tests assessing whether the observed differences differ significantly from zero. For each client, the individual ΔNMAE is computed by subtracting the baseline’s NMAE from that of our solution, so negative values indicate that our solution has the lower error. Group-level statistics are then obtained by applying a paired t-test to the subset of per-client deltas belonging to that group, where the t-statistic is the mean delta divided by its standard error, and the p-value is derived from a t-distribution with n−1 degrees of freedom. The p-values that indicate a statistically significant difference (p<0.05) are marked with *, whilst ns (not significant) means the difference could plausibly be due to random variation. Our solution outperforms classic FedAvg across all client groups, with negative ΔNMAE values ranging from -0.0035 to -0.1892, and statistically significant differences for fast clients. For Personalised FedAvg, no statistically significant differences are observed in most groups; the only exception is the Slow 2L group, where Personalised FedAvg performs significantly better, and it also achieves lower NMAE in the other slow groups, whose clients participate less in the asynchronous method. For fast and medium clients, the advantages of the asynchronous strategy and the communication reduction are therefore obtained with similar or better prediction performance. The prediction performance is affected only for slow clients, which is explained by their low participation in the asynchronous FL process. Compared to the classic asynchronous method, the differences are statistically significant at p < 0.05 in 6 out of 9 client groups, with the three non-significant cases attributed to the small size of these groups (3 or 4 clients). This highlights the benefits of the adaptive learning rate, which considers the performance on the validation pool, as well as the effect of personalisation enabled through partial weight sharing.

Table 3.

Paired t-test results for NMAE: Our solution vs personalised FedAvg, FedAvg and FedAsync.

Client group	n	ΔNMAE vs PersFedAvg	ΔNMAE vs FedAvg	ΔNMAE vs FedAsync	t-statistic vs PersFedAvg	t-statistic vs FedAvg	t-statistic vs FedAsync	p-value vs PersFedAvg	p-value vs FedAvg	p-value vs FedAsync
Fast 1L	5	-0.0474	-0.1781	-0.1801	-2.472	-3.062	-2.9567	0.069 (ns)	0.038*	0.042*
Fast 2L	6	-0.0111	-0.106	-0.1639	-0.8435	-2.7381	-2.8133	0.437 (ns)	0.041*	0.037*
Fast 3L	5	-0.0396	-0.1892	-0.2240	-2.3819	-4.1508	-4.7774	0.076 (ns)	0.014*	0.009*
Med 1L	4	0.027	-0.0916	-0.1642	0.8393	-0.9856	-1.7680	0.463 (ns)	0.397 (ns)	0.175 (ns)
Med 2L	5	0.023	-0.0709	-0.1164	0.6399	-1.8608	-2.7853	0.557 (ns)	0.136 (ns)	0.049*
Med 3L	5	-0.0094	-0.1403	-0.1949	-0.7051	-2.3048	-2.8332	0.520 (ns)	0.083 (ns)	0.047*
Slow 1L	4	0.0824	-0.0144	-0.1099	2.1076	-0.4862	-4.9237	0.126 (ns)	0.660 (ns)	0.016*
Slow 2L	3	0.0599	-0.1029	-0.2121	4.3313	-1.9128	-2.8019	0.049*	0.196 (ns)	0.107 (ns)
Slow 3L	3	0.0597	-0.0035	-0.0901	3.2893	-0.1073	-1.2883	0.081 (ns)	0.924 (ns)	0.327 (ns)
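
The group-level statistics described for Table 3 can be reproduced with a short sketch such as the one below. It is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the p-value is assumed to be two-sided (the text does not state the sidedness), and the t-distribution tail is obtained by numerically integrating its density with Simpson's rule so that only the Python standard library is needed.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value (assumed sidedness) via Simpson's rule on [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_mass)


def paired_t_test(ours, baseline):
    """Return (mean delta, t-statistic, p-value) for per-client NMAE scores.

    Each delta is ours - baseline, so a negative mean favours our solution.
    Assumes at least two clients and non-identical deltas.
    """
    deltas = [o - b for o, b in zip(ours, baseline)]
    n = len(deltas)
    mean_delta = mean(deltas)
    std_err = stdev(deltas) / math.sqrt(n)  # sample standard deviation / sqrt(n)
    t_stat = mean_delta / std_err
    return mean_delta, t_stat, two_sided_p(t_stat, n - 1)


# Hypothetical per-client NMAE values for one group of three clients
ours = [0.4, 0.3, 0.2]
baseline = [0.5, 0.5, 0.5]
print(paired_t_test(ours, baseline))
```

With only three to six clients per group, the degrees of freedom are very small, so a large mean delta can still yield a non-significant p-value, which is consistent with the pattern seen in the smallest groups of Table 3.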

Figure 6 shows the communication impact of each shared-layer configuration compared with the full model. For our solution, the sizes of the updates sent to the server were measured during the simulation and are reported in kilobytes. The Dense layers are the part of the model that is never shared with the server in our solution. For classic FedAvg and FedAsync, the full model (∼530 KB) is uploaded, whilst for the personalised version, only the LSTM layers are shared (∼500 KB). Compared with these baselines, our approach reduces the communication overhead from ∼500 KB or more to ∼450 KB, and even to ∼260 KB, depending on the number of layers selected by the clients. When larger models are used, the impact on communication can be even more significant, as they usually contain layers with a higher number of parameters. This reduction in communication overhead improves scalability, as its benefits become more significant in large-scale deployments with many clients and frequent updates. In such settings, lower communication costs improve overall efficiency, while the asynchronous mechanism reduces the impact of delayed updates. In addition, data heterogeneity increases as the number of clients grows, further motivating the use of personalisation to adapt the model to specific clients.

Figure 6.

Communication overhead per round by shared layers configuration.
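
To illustrate why the update size depends on the number of shared layers, the sketch below estimates the payload of a partial update from LSTM parameter counts. It is only a rough illustration under stated assumptions: the hidden sizes, the single input feature, the float32 encoding, the single-bias parameter formula, and the choice of sharing the first k LSTM layers are all hypothetical and are not taken from the evaluated model, so the printed sizes are not expected to match the measured values in Figure 6.

```python
BYTES_PER_PARAM = 4  # assumption: weights are sent as float32


def lstm_params(input_dim, hidden):
    """Parameters of one LSTM layer: four gates, each with input weights,
    recurrent weights and one bias vector (single-bias convention)."""
    return 4 * (hidden * (input_dim + hidden) + hidden)


def update_size_kb(input_dim, hidden_sizes, shared_layers):
    """Raw size in KB of an update containing the first `shared_layers` LSTM layers."""
    total = 0
    in_dim = input_dim
    for idx, hidden in enumerate(hidden_sizes):
        if idx < shared_layers:
            total += lstm_params(in_dim, hidden)
        in_dim = hidden  # the next layer consumes this layer's output
    return total * BYTES_PER_PARAM / 1024


# Hypothetical stack of three LSTM layers with 64 units and one input feature
hidden_sizes = [64, 64, 64]
for k in range(1, len(hidden_sizes) + 1):
    print(f"{k} shared layer(s): {update_size_kb(1, hidden_sizes, k):.1f} KB")
```

Real payloads are somewhat larger than this raw estimate because of serialisation metadata, which is one reason for measuring the update sizes during the simulation rather than deriving them analytically.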

6. Conclusion

In this paper, we propose an asynchronous FL solution with an adaptive aggregation mechanism and partial model sharing for energy consumption forecasting. This solution addresses the identified limitations of synchronous FL, including training delays caused by client heterogeneity and high communication overhead. Moreover, using a client-specific adaptive learning rate, it addresses challenges commonly associated with asynchronous approaches, such as update staleness and fairness across clients. The adaptive mechanism assigns dynamic learning rates to clients based on both update staleness and the performance of each client’s update measured on the server’s synthetic validation data. In addition, the partial aggregation approach, tailored for LSTM-based models, allows clients to selectively share only a subset of model layers, reducing communication overhead. The experimental results on real-world energy consumption data show that the proposed asynchronous strategy achieves predictive performance comparable to or better than the standard FedAvg baseline across client groups, with statistically significant improvements observed for fast clients. Compared to Personalised FedAvg, the proposed approach performs similarly for fast and medium clients, while having a lower predictive performance for slow clients, which is explained by their reduced participation in the asynchronous training process. In addition, the proposed solution outperforms a baseline asynchronous federated method in terms of prediction accuracy, with statistically significant improvements observed for most of the groups. The partial aggregation mechanism reduces the amount of data transmitted per training round, replacing the full model update with smaller updates whose size depends on the number of shared layers, while maintaining the prediction accuracy.
Future work includes evaluating the proposed asynchronous FL framework in larger-scale settings involving more clients with different computational capabilities and larger models. This would provide better insights into the impact on prediction performance, the reduction in communication costs, and the trade-off between them.

Footnotes

ORCID iDs

Liana Toderean

Tudor Cioara

Ionut Anghel

Ethical considerations

This article does not contain any studies with human or animal participants.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the project “Romanian Hub for Artificial Intelligence-HRIA”, Smart Growth, Digitization and Financial Instruments Program, MySMIS no. 334906.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Habbak, Mahmoud, Metwally, et al. Load forecasting techniques and their applications in smart grids. Energies 2023; 16(3): 1480. https://doi.org/10.3390/en16031480

2. Toderean, Cioara, Anghel, et al. Demand response optimisation for smart grid integrated buildings: review of technology enablers landscape and innovation challenges. Energy Build 2025; 326: 115067. https://doi.org/10.1016/j.enbuild.2024.115067

3. Dewangan, Abdelaziz, Biswal. Load forecasting models in smart grid using smart meter information: a review. Energies 2023; 16(3): 1404. https://doi.org/10.3390/en16031404

4. Badr, Mahmoud, Fang, et al. Privacy-preserving and communication-efficient energy prediction scheme based on federated learning for smart grids. IEEE Internet Things J 2023; 10(9): 7719–7736. https://doi.org/10.1109/jiot.2022.3230586

5. Savi, Olivadese. Short-term energy consumption forecasting at the edge: a federated learning approach. IEEE Access 2021; 9: 95949–95969. https://doi.org/10.1109/access.2021.3094089

6. Banabilah, Aloqaily, Alsayed, et al. Federated learning review: fundamentals, enabling technologies, and future applications. Inf Process Manag 2022; 59(6): 103061. https://doi.org/10.1016/j.ipm.2022.103061

7. Hosseini, Taheri, Akhavan, et al. Privacy-preserving federated learning: application to behind-the-meter solar photovoltaic generation forecasting. Energy Convers Manag 2023; 283: 116900. https://doi.org/10.1016/j.enconman.2023.116900

8. Toderean, Daian, Cioara, et al. Heuristic based federated learning with adaptive hyperparameter tuning for households energy prediction. Sci Rep 2025; 15(1): 12564. https://doi.org/10.1038/s41598-025-96443-3

9. Fekri, Grolinger, Mir. Asynchronous adaptive federated learning for distributed load forecasting with smart meter data. Int J Electr Power Energy Syst 2023; 153: 109285. https://doi.org/10.1016/j.ijepes.2023.109285

10. Zheng, Sumper, Aragüés-Peñalba, et al. Advancing power system services with privacy-preserving federated learning techniques: a review. IEEE Access 2024; 12: 76753–76780. https://doi.org/10.1109/access.2024.3407121

11. Ozturk. Forecasting energy consumption of Turkey by ARIMA model. J Asian Sci Res 2018; 8(2): 52–60. https://doi.org/10.18488/journal.2.2018.82.52.60

12. Jin, Yang, et al. Highly accurate energy consumption forecasting model based on parallel LSTM neural networks. Adv Eng Inform 2022; 51: 101442. https://doi.org/10.1016/j.aei.2021.101442

13. Alizadegan, Rashidi Malki, Radmehr, et al. Comparative study of long short-term memory (LSTM), bidirectional LSTM, and traditional machine learning approaches for energy consumption prediction. Energy Explor Exploit 2025; 43(1): 281–301. https://doi.org/10.1177/01445987241269496

14. Jrhilifa, Ouadi, Jilbab, et al. Forecasting smart home electricity consumption using VMD-Bi-GRU. Energy Effic 2024; 17(4): 35. https://doi.org/10.1007/s12053-024-10205-0

15. Pierre, Akim, Semenyo, et al. Peak electrical energy consumption prediction by ARIMA, LSTM, GRU, ARIMA-LSTM and ARIMA-GRU approaches. Energies 2023; 16(12): 4739. https://doi.org/10.3390/en16124739

16. Amalou, Mouhni, Abdali. Multivariate time series prediction by RNN architectures for energy consumption forecasting. Energy Rep 2022; 8: 1084–1091. https://doi.org/10.1016/j.egyr.2022.07.139

17. Antonesi, Cioara, Anghel, et al. Hybrid transformer model with liquid neural networks and learnable encodings for buildings’ energy forecasting. Energy AI 2025; 20: 100489. https://doi.org/10.1016/j.egyai.2025.100489

18. Nazir, Shaikh, Shah, et al. Forecasting energy consumption demand of customers in smart grid using temporal fusion transformer (TFT). Results Eng 2023; 17: 100888. https://doi.org/10.1016/j.rineng.2023.100888

19. Cao. Multi-task learning and temporal-fusion-transformer-based forecasting of building power consumption. Electronics 2023; 12(22): 4656. https://doi.org/10.3390/electronics12224656

20. Samuel. Optimising energy consumption through AI and cloud analytics: addressing data privacy and security concerns. World J Adv Eng Technol Sci 2024; 13(2): 789–806.

21. Wang, Luan, et al. Secure and efficient federated learning for smart grid with edge-cloud collaboration. IEEE Trans Ind Inform 2021; 18(2): 1333–1344. https://doi.org/10.1109/tii.2021.3095506

22. Połap, Srivastava, Jaszcz. Energy consumption prediction model for smart homes via decentralised federated learning with LSTM. IEEE Trans Consum Electron 2023; 70(1): 990–999.

23. Wang, Yun, Rayhana, et al. An adaptive federated learning system for community building energy load forecasting and anomaly prediction. Energy Build 2023; 295: 113215. https://doi.org/10.1016/j.enbuild.2023.113215

24. Zhou. Communication-efficient federated learning with compensated overlap-FedAvg. IEEE Trans Parallel Distrib Syst 2021; 33(1): 192–205. https://doi.org/10.1109/tpds.2021.3090331

25. Fekri, Grolinger, Mir. Distributed load forecasting using smart meter data: federated learning with recurrent neural networks. Int J Electr Power Energy Syst 2022; 137: 107669. https://doi.org/10.1016/j.ijepes.2021.107669

26. Moshawrab, Adda, Bouzouane, et al. Reviewing federated learning aggregation algorithms: strategies, contributions, limitations and future perspectives. Electronics 2023; 12(10): 2287. https://doi.org/10.3390/electronics12102287

27. Chiaro, Guzzo, et al. Model aggregation techniques in federated learning: a comprehensive survey. Future Gener Comput Syst 2024; 150: 272–293. https://doi.org/10.1016/j.future.2023.09.008

28. et al. A federated learning method for non-intrusive load monitoring based on Fed-Prox and Bi-GRU. In: International Conference on Neural Computing for Advanced Applications, Guilin, China, 5-7 July, 2024, pp. 239–254. Springer Nature.

29. Yang, Ghaderi. Byzantine-robust decentralized learning via remove-then-clip aggregation. Proc AAAI Conf Artif Intell 2024; 38(19): 21735–21743. https://doi.org/10.1609/aaai.v38i19.30173

30. Gai, Xue, Zhu, et al. An efficient data aggregation scheme with local differential privacy in smart grid. Digit Commun Netw 2022; 8(3): 333–342. https://doi.org/10.1016/j.dcan.2022.01.004

31. Chifu, Cioara, Anitei, et al. A federated learning model with the whale optimization algorithm for renewable energy prediction. Comput Electr Eng 2025; 123: 110259. https://doi.org/10.1016/j.compeleceng.2025.110259

32. Zhang, et al. SFFL: self-aware fairness federated learning framework for heterogeneous data distributions. Expert Syst Appl 2025; 269: 126418. https://doi.org/10.1016/j.eswa.2025.126418

33. Nightingale, Wang, Zobiri, et al. Effect of clustering in federated learning on non-IID electricity consumption prediction. In: IEEE PES ISGT-Europe, Novi Sad, Serbia, 10-12 October 2022, pp. 1–5. IEEE.

34. Guo, Liu, Sha, et al. PracMHBench: re-evaluating model-heterogeneous federated learning based on practical edge device constraints. In: ACM/IEEE Design Automation Conf (DAC), San Francisco, CA, USA, 2-25 June 2025, pp. 1–7. IEEE.

35. Alotaibi, Khan, Mahmood. Communication efficiency and non-independent and identically distributed data challenge in federated learning: a systematic mapping study. Appl Sci 2024; 14(7): 2720. https://doi.org/10.3390/app14072720

36. Zhang, et al. Fedhisyn: a hierarchical synchronous federated learning framework for resource and data heterogeneity. In: Proc Int Conf Parallel Processing, Bordeaux, France, 9 August-1 September 2022, pp. 1–11.

37. Zakerinia, Talaei, Nadiradze, et al. Communication-efficient federated learning with data and client heterogeneity. In: AISTATS: Conference on Artificial Intelligence and Statistics, Valencia, Spain, 2024. https://research-explorer.ista.ac.at/record/17093

38. Xie, Koyejo, Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019. https://doi.org/10.48550/arXiv.1903.03934

39. How asynchronous can federated learning be? In: IEEE/ACM Int Symp Quality Service (IWQoS), Oslo, Norway, 0-11 June 2022, pp. 1–11. IEEE.

40. Nilsson, Smith, Ulm, et al. A performance evaluation of federated learning algorithms. In: Workshop Distributed Infrastructures Deep Learning, Rennes, France, 10-11 December 2018, pp. 1–8.

41. Sun, Zhang, Pan, et al. Staleness-controlled asynchronous federated learning: accuracy and efficiency trade-off. IEEE Trans Mob Comput 2024; 23(12): 12621–12634. https://doi.org/10.1109/tmc.2024.3416216

42. Choi, Lee, et al. Staleness aware semi-asynchronous federated learning. J Parallel Distrib Comput 2024; 193: 104950. https://doi.org/10.1016/j.jpdc.2024.104950

43. Sun, Albelaihi, et al. Latency-aware semi-synchronous client selection and model aggregation for wireless federated learning. Future Internet 2023; 15(11): 352. https://doi.org/10.3390/fi15110352

44. Yang, Guliani, Beaufays, et al. Partial variable training for efficient on-device federated learning. In: IEEE ICASSP, Singapore, Singapore, 3-27 May 2022, pp. 4348–4352. IEEE.

45. Liu, Wang, Rong, et al. FedPA: an adaptively partial model aggregation strategy in federated learning. Comput Netw 2021; 199: 108468. https://doi.org/10.1016/j.comnet.2021.108468

46. Chen, Shin, et al. Efficient wireless federated learning with partial model aggregation. IEEE Trans Commun 2024; 72(10): 6271–6286. https://doi.org/10.1109/tcomm.2024.3396748