Abstract
The processes related to solid waste management (SWM) are being revised as new technologies emerge and are applied in the area to achieve greater environmental, social and economic sustainability for society. In this study, two robust review protocols (Population, Intervention, Comparison, Outcome, and Context (PICOC) and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)) were used to systematically analyze 62 documents extracted from the Web of Science database, in order to identify the main techniques and tools for Knowledge Discovery in Databases (KDD) and Data Mining (DM) as applied to SWM and to explore their technological potential to optimize the stages of collecting and transporting waste. The main challenges and opportunities of KDD and DM for SWM were also analyzed. The results show that the most used tools for SWM are MATLAB (29.7%) and GIS (13.5%), whereas the most used techniques are Artificial Neural Networks (35.8%), Linear Regression (16.0%) and Support Vector Machine (12.3%). In addition, 15.3% of the studies were conducted with data from China, 11.1% with data from India, and 9.7% analyzed and compared data from several other countries. Furthermore, the research showed that the main challenges in the field of study are related to the collection and treatment of data, whereas the opportunities appear to be linked mainly to the impact on the pillars of sustainable development. Thus, this study portrays important issues associated with the use of KDD and DM for optimal SWM and has the potential to assist and direct researchers and field professionals in future studies.
Introduction
The gradual pace of population growth, the urbanization process and the adoption of technologies significantly expand industrial production and the acquisition of consumer goods (Das and Bhattacharyya, 2015; Hannan et al., 2018; Ahmad and Kim, 2020). However, the same circumstances that drive companies also create a series of worldwide environmental, social and public health sustainability problems (Babaee Tirkolaee et al., 2019; Faccio et al., 2011). Proof of this scenario of change is evident in society’s lifestyle. Unrestrained consumption increases waste generation and impacts the three pillars of sustainable development, which in turn increases the need for waste collection and treatment solutions (Furstenau et al., 2020b; Sott et al., 2020a). In this sense, the integration between the three dimensions of sustainable development is of paramount importance in the discussion on the generation of solid waste, since they are guided by three basic principles: people, planet and profit. In this regard, our results show that the main challenges of waste management are related to the pillars of sustainability.
Solid waste generation indicators have shown great growth each year (Hoornweg and Bhada-Tata, 2012; Nowakowski et al., 2017), and as a result, government management must remain attentive to the management of this waste (Babaee Tirkolaee et al., 2019). In this context, several approaches are being used to assist in the analysis and decision-making related to solid waste management (SWM). One approach that stands out is the use of Knowledge Discovery in Databases (KDD).
The purpose of the KDD methodology is to detect valid, innovative, advantageous and coherent patterns. These concepts are also often attributed to another very widespread technique in the field of artificial intelligence, Data Mining (DM). Despite the divergence among some authors, this review adopts the approach presented by Fayyad et al. (1996). In this approach, KDD is treated as a process composed of a sequence of activities: Selection – defines and clarifies the context and purpose of the project to properly carry out data collection; Pre-processing – identifies and treats inconsistencies (such as incomplete records, incorrect values and inconsistent data); Transformation – uses techniques to optimize the performance of the model in the DM stage (such as generalization of attributes, discretization of variables and normalization); DM – consists of the application of data discovery and analysis algorithms that build models on the data; and Analysis/Assimilation – performs the model validation through the application of performance and quality measures (such as accuracy, error and confidence) (Fayyad et al., 1996; Kvasničková-Stanislavská et al., 2020; Leary et al., 2020).
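The five KDD activities described above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed studies: the records, the field names (`population`, `income`, `waste_t`) and the choice of a one-variable least-squares model for the DM stage are all assumptions made only to show how the stages chain together.

```python
# Illustrative KDD pipeline. All records and field names are hypothetical.
RAW = [
    {"population": 10_000, "income": 1_200, "waste_t": 310.0},
    {"population": 25_000, "income": 1_500, "waste_t": 760.0},
    {"population": 40_000, "income": None, "waste_t": 1_190.0},  # incomplete record
    {"population": 55_000, "income": 2_100, "waste_t": 1_660.0},
    {"population": 70_000, "income": 2_400, "waste_t": 2_100.0},
]


def select(records, fields):
    """Selection: keep only the attributes relevant to the question."""
    return [{f: r[f] for f in fields} for r in records]


def preprocess(records):
    """Pre-processing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records, field):
    """Transformation: min-max normalize one attribute to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in records]


def mine(records, x_field, y_field):
    """DM: fit y = a + b * x by ordinary least squares (assumed model)."""
    xs = [r[x_field] for r in records]
    ys = [r[y_field] for r in records]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


def evaluate(records, x_field, y_field, model):
    """Analysis/Assimilation: mean absolute error of the fitted model."""
    a, b = model
    errors = [abs(r[y_field] - (a + b * r[x_field])) for r in records]
    return sum(errors) / len(errors)


selected = select(RAW, ["population", "income", "waste_t"])
data = transform(preprocess(selected), "population")
model = mine(data, "population", "waste_t")
mae = evaluate(data, "population", "waste_t", model)
print(f"intercept={model[0]:.1f}, slope={model[1]:.1f}, MAE={mae:.1f}")
```

In practice the error would be measured on data held out from the fitting step; the sketch evaluates on the training records only to keep the example short.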
Regarding the use of KDD for urban solid waste management (USWM), research highlights the need for data collection, treatment and analysis related to the wide range of variables associated with the stages of managing the collection, transportation and disposal of waste. For this reason, data mining techniques are important tools for generating strategic knowledge for the field of study (Bagheri et al., 2019). Moreover, the growing social, governmental and academic concern with waste management reinforces the use of technologies and analytical tools for data collection and processing to promote sustainable development (Sharma and Jain, 2019). Therefore, the potential of DM goes beyond simple data analysis, as it allows for identifying relationships between variables and performing classification, prediction and causal analysis of data (Yang et al., 2019).
Given the concern with actions related to the increase in the generation of waste and the advantages associated with the application of DM in several scenarios, this work identifies the techniques and tools most used for the management of urban solid waste, and their contributions to the collection of solid waste according to the literature. In addition, the study explores the main challenges and opportunities regarding the application of these concepts to improve sustainability as related to this scenario. To obtain the results and ensure the quality and robustness of the research, a systematic literature review (SLR) was planned and performed based on the structures of the Population, Intervention, Comparison, Outcome, and Context (PICOC) protocol and the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) protocol.
This work is organized into the following sections: section ‘Materials and methods’ presents the methodology adopted and the steps taken; section ‘Results and discussion’ presents the results regarding the main DM techniques and tools used for SWM and the main challenges and opportunities related to the field of study. Finally, section ‘Conclusion’ comprises the conclusions, limitations, and suggestions for further research.
Materials and methods
To achieve the objective of this research and explore the use of KDD and DM for SWM, an SLR was carried out, supported by the PICOC and PRISMA protocols. The methodological procedures were adapted from Pollock and Berge (2018) and Sott et al. (2020a) and are described below.
PICOC protocol
Defining methods that assist in the development of research and ensure its effectiveness is essential (López-Robles et al., 2020). Doing so makes it possible to carry out steps that guarantee the quality and reproducibility of the study (da Silva et al., 2020; Sott et al., 2021). In this work, the PICOC protocol was used to support the process of elaborating research questions as well as defining the search string and the criteria for including and excluding documents. The attributes considered in the PICOC protocol are described in Table 1.
Attributes according to the PICOC protocol (adapted from Pollock and Berge, 2018).
DM: Data Mining; KDD: Knowledge Discovery in Databases; SWM: solid waste management.
Once the analysis attributes were defined, three research questions (RQ1 to RQ3) were composed to guide this work. The issues are described below and address the use of DM and KDD for SWM.
RQ1 – What are the DM and KDD techniques and tools used for SWM?
RQ2 – How can DM and KDD contribute to the optimization of solid waste collection?
RQ3 – What are the challenges and opportunities of using DM and KDD for SWM?
To answer the research questions, studies available in the Web of Science (WoS) Core Collection database were used. All indexes were considered, such as the Social Science Citation Index (SSCI) and the Science Citation Index Expanded (SCI-E), among others, since WoS is an indexed database with a large volume of quality research (Cobo et al., 2011; Furstenau et al., 2021a, 2021; Severo et al., 2021; Sott et al., 2020b). In the Population phase of the PICOC protocol, the search string ((‘data mining’ OR ‘knowledge-discovery’ OR ‘estimation’ OR ‘prediction’) AND (‘solid waste’)) was used to identify the documents associated with the field of study. In the Intervention phase, criteria for inclusion and exclusion of documents were stipulated in order to select only works related to the objective of this study. Accordingly, only documents of the types ‘article’, ‘article in press’ and ‘review’, written in English, published from January 2015 to 14 November 2020, and containing the search terms in the title, abstract, or keywords were considered. The sample period was defined to explore the recent research (last 6 years) in the field of study, because DM techniques have been widely explored in recent years (Kolling et al., 2021). To achieve breadth and consider the several DM and KDD approaches used to analyze characteristics and impacts related to SWM, no filters related to the application area were applied. The last three phases of the PICOC protocol (Comparison, Outcome and Context) were conducted together with the Eligibility phase of the PRISMA protocol. The documents were fully read to identify the DM and KDD approaches applied to SWM and answer the research questions.
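The inclusion criteria above can be expressed as a simple filter. The following Python sketch is only an approximation under stated assumptions: the record fields (`title`, `abstract`, `keywords`, `doc_type`, `language`, `published`) are hypothetical, the start date is assumed to be 1 January 2015, and plain substring matching stands in for the actual WoS search engine, whose matching rules are not reproduced here.

```python
from datetime import date

# Terms and limits taken from the search protocol described in the text.
TOPIC_TERMS = ("data mining", "knowledge-discovery", "estimation", "prediction")
REQUIRED_TERM = "solid waste"
ALLOWED_TYPES = {"article", "article in press", "review"}
# Assumption: "January 2015" is read as 1 January 2015.
START, END = date(2015, 1, 1), date(2020, 11, 14)


def matches_search_string(text: str) -> bool:
    """(any topic term) AND 'solid waste', using case-insensitive substrings."""
    text = text.lower()
    return REQUIRED_TERM in text and any(term in text for term in TOPIC_TERMS)


def is_included(record: dict) -> bool:
    """Apply document type, language, date and search-term criteria."""
    searchable = " ".join(
        [record["title"], record["abstract"], " ".join(record["keywords"])]
    )
    return (
        record["doc_type"].lower() in ALLOWED_TYPES
        and record["language"].lower() == "english"
        and START <= record["published"] <= END
        and matches_search_string(searchable)
    )


# Hypothetical record, used only to show the call.
example = {
    "title": "Prediction of municipal solid waste generation",
    "abstract": "",
    "keywords": ["forecasting"],
    "doc_type": "Article",
    "language": "English",
    "published": date(2019, 5, 1),
}
print(is_included(example))
```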
PRISMA protocol
In the second stage, the PRISMA protocol was used because it is a robust method that allows for the identification, screening, eligibility and inclusion of documents related to the research objective (McInnes et al., 2018; Sott et al., 2020a). PRISMA is characterized as a checklist of items that assists in reducing bias. It guides researchers towards the objectives of the study and ensures the elicitation of knowledge with data integrity and clarity of information (Stewart et al., 2015). Figure 1 shows the steps of the PRISMA protocol performed to identify, analyze and interpret all the evidence related to the research questions in this study.

Phases of the PRISMA protocol (adapted from Sott et al., 2020a).
The SLR allows us to understand the relevance of the field of study and the efforts of researchers from around the world concerning the use of KDD techniques for SWM. In the Identification phase, the combination of the selected search terms resulted in a total of 488 documents. During the download of the documents, three articles were added; they appeared as suggested reading because they had been accessed by other authors carrying out similar searches related to SWM. This brought the sample to 491 documents. Throughout the Screening phase, the collected sample was imported directly from the database into the EndNote reference manager, whose functionalities confirmed that there were no duplicate documents in the set. Subsequently, the title and abstract of each article were used to identify whether the methodology developed was in accordance with the scope of this work. This eliminated 417 documents, leaving 74 articles for analysis.
To discuss and answer the questions presented during the Eligibility phase, the articles were explored based on a checklist and divided into two groups: (a) articles that used strategies for knowledge extraction in the context of SWM processes, without defining a specific process as a focus; and (b) articles that used strategies for extracting knowledge in the context of SWM processes, but with a focus on optimizing the stages of collecting and transporting solid waste. Thus, there were 64 documents related to RQ1, 10 articles related to RQ2 and 62 documents related to RQ3. The 74 documents from the Eligibility phase were fully read, and 12 documents unrelated to the RQs were removed. Finally, in the Inclusion phase, the remaining 62 documents comprised the qualitative analysis and the results of this study.
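The document counts reported for the PRISMA phases can be checked with a few lines of arithmetic. The Python sketch below uses only the figures given in the text; the variable names are illustrative.

```python
# Counts reported in the text for each PRISMA phase.
identified = 488            # documents returned by the search string
added_manually = 3          # suggested-reading articles added during download
excluded_screening = 417    # removed after reading title and abstract
excluded_eligibility = 12   # removed after full reading (unrelated to the RQs)

screened = identified + added_manually
eligible = screened - excluded_screening
included = eligible - excluded_eligibility

print(f"screened={screened}, eligible={eligible}, included={included}")
```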
Results and discussion
Some papers were discarded in the abstract reading process. However, it is worth mentioning that in a representative number of them, the strategy used to explore the information contained in the data set employed multivariate analysis techniques (Adeniyi and Ighalo, 2020; Bayard et al., 2018; Liu et al., 2020; Ayodele et al., 2020; Sebastian and Dinesh Alappat, 2020; Zhang et al., 2018) or multicriteria analysis (Kang et al., 2018; Mitsos et al., 2018; Qiang et al., 2019; Vesely et al., 2016). Like most studies that addressed the stages of collecting and transporting solid waste and used techniques for extracting knowledge, these papers do not mention data mining in their development, which allows us to infer that there is an opportunity to explore the use of this technology to support decision-making in this context. These articles were fully read, but as they did not use KDD or DM, they were not considered in the analysis. Unrelated documents are identified in Figure 1 as articles deleted after reading RQ2: (8).
Figure 2 considers the 62 articles that were qualitatively analyzed and represents the evolution of publications in the last 6 years (2015–2020). It is possible to see that, in general, there was an increase in the number of publications related to the field of study in the period analyzed. In 2020, the number of documents is less than in 2019 due to the date of data collection (14 November 2020). Although there has been an increase in research related to the use of DM for SWM, the low number of documents shows the need for in-depth research to assist in the development of the field of study.

Evolution of publications on Data Mining (DM) for knowledge discovery (2015 until 14 November 2020).
Knowledge discovery in solid waste process databases
To select the variables in the data set, the authors substantiated their choices in one of the following ways: relying on results from previous publications; calculating the correlation between the variables; combining the previous two; or identifying in the literature the most associated variables and then validating them through correlation. The Pearson correlation coefficient was the method most used by the authors. In this regard, the aspects associated with solid waste in publications were: total number of inhabitants (Pérez-López et al., 2016); population density (Colvero et al., 2019; Coskuner et al., 2020); average minimum and maximum temperature (Kumar et al., 2016); precipitation rate (Vu et al., 2019b); per capita income; schooling; unemployment rate (Ceylan, 2020); waste category; residence size (Abbasi et al., 2019); amount collected by category of waste (Kontokosta et al., 2018) and emissions of gases into the atmosphere (Dimishkovsk et al., 2019). Among these, the correlation of waste generation with per capita income (Vu et al., 2019b) and with educational level (Kumar and Samadder, 2017) was strong and positive, indicating a link between socioeconomic profile, level of development and the generation of solid waste.
In this sense, it was possible to divide the data analysis approaches into two main scenarios: (a) estimating the monthly, annual, or seasonal generation of a municipality from a set of data (weeks, months and years) or (b) comparing the annual waste production of a group of countries combined in the data set. The sources of data for these surveys varied between government agency websites, statistical databases on sustainability indicators, other academic works that disseminated the analyzed material, SWM companies and surveys strategically designed to compose a heterogeneous data set that clearly represents the area studied.
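As a brief illustration of the variable-selection step, the Pearson correlation coefficient between a candidate predictor and waste generation can be computed as in the Python sketch below. The income and waste values are invented for the example and do not come from any of the cited studies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("samples must have the same length (at least 2)")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


# Hypothetical data: per capita income and per capita waste generation.
income = [1200, 1500, 1800, 2100, 2400]
waste_per_capita = [0.62, 0.70, 0.85, 0.91, 1.05]

r = pearson(income, waste_per_capita)
print(f"r = {r:.3f}")
```

A coefficient close to +1 or -1 would support keeping the variable as a predictor, whereas a value near 0 suggests little linear association.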
To develop a methodology and present results, different types of software and tools were used by the researchers. In this context, some authors did not expose the methodology that was used, so to account for these works, the label ‘Unidentified’ was adopted; for those who mentioned which software and tools were used, their presence counted in the sample. For works in which the solution used had not been identified, two situations were suggested: (a) the researchers decided to develop their own analysis tools; or (b) the researchers chose not to disclose the tool used. Among the proprietary tools, the two most referenced were MATLAB (29.7%) and ArcGIS (Geographic Information System) (13.5%) (Figure 3). Among the open-source tools, the use of Python (6.8%) stands out via the scikit-learn library and Weka (Waikato Environment for Knowledge Analysis) (5.4%) both through the graphical interface and through its Java API (Application Programming Interface for Java Language). All tools cited in the sample were represented graphically (74 investigated articles).

Representation of the use of tools in the 74 investigated articles.
As for the technique for predicting the generation of waste, the approaches adopted were: linear regression (LR), artificial neural networks (ANNs), genetic algorithms (GAs), Bayesian networks (BNs), decision trees (DTs), random forest (RF), fuzzy logic (FL), genetic programming (GP) and support vector machine (SVM). Figure 4 depicts techniques used to extract knowledge. The analysis showed that ANN (35.8%) was predominantly the most used technique for this context, followed by LR (16%) and SVM (12.3%).

Representation of the techniques used to extract knowledge.
Figure 5 illustrates, in percentages, the countries of origin of the data on SWM processes. The countries that stood out the most as an object of study in the data sets analyzed by the researchers to forecast waste generation (considering the 62 articles analyzed) were China (15.3%) and India (11.1%). It is also worth mentioning that in 8.3% of the publications the country of origin of the data was not identified, and that 9.7% of the studies analyzed and compared data from several other countries.

Representation of the countries in percentages in the analysis of survey data.
Data mining to support decision-making for the collection and transport of solid waste
Kannangara et al. (2018) consider that the ability to predict waste generation gives municipalities the benefit of scheduling and improving their waste management processes. For this reason, monitoring and recording the history of collected waste is essential to analyze the various aspects that can contribute to devising adequate strategies for the collection and transportation of solid waste. In this context, DM algorithms assist in data processing and provide efficiency in the processes and decision-making related to questions such as: on which days of the week is more waste generated? What is the best geographical position to allocate containers? DM can also optimize the route for garbage trucks (Ahmad and Kim, 2020); assist in reducing the emission of harmful gases associated with the collection and transportation of waste, as well as organizational costs and labour; and help solve issues related to waste management, such as recycling and reverse logistics (Sharma and Jain, 2019).
Within the scope of this review, the approaches adopted by researchers to estimate and predict generation were as follows: monthly generation of municipal solid waste (MSW) (Abbasi and El Hanandeh, 2016; Abbasi et al., 2019; Ahmmed et al., 2020; Ali and Ahmad, 2019; Araiza-Aguilar et al., 2020; Azarmi et al., 2018), annual MSW generation (Coskuner et al., 2020), seasonal generation of MSW (Ahmad and Kim, 2020), annual national generation (Ceylan, 2020), national generation of hazardous waste (Adamović et al., 2018), monthly generation of hospital waste (Çetinkaya et al., 2020), annual generation of hospital waste (Ceylan et al., 2020), generation of construction waste (Kupusamy et al., 2019; Li et al., 2016; Ram and Kalidindi, 2017), annual generation (kg/inhabitant/year) of separately collected packaging waste (Oliveira et al., 2019), generation of waste by type, biodegradable and non-biodegradable (Kumar and Samadder, 2017) and generation of recyclable waste (Vu et al., 2019a, 2019b).
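The first of these questions (on which days of the week more waste is generated) can be approached with a simple aggregation over historical collection records. The Python sketch below assumes a hypothetical log of collection dates and tonnes collected; the values are invented and serve only to show the grouping step that would precede any predictive model.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Hypothetical collection log: (collection date, tonnes collected).
LOG = [
    (date(2020, 11, 2), 41.0),   # Monday
    (date(2020, 11, 3), 35.5),   # Tuesday
    (date(2020, 11, 4), 33.0),   # Wednesday
    (date(2020, 11, 9), 43.0),   # Monday
    (date(2020, 11, 10), 36.5),  # Tuesday
    (date(2020, 11, 11), 32.0),  # Wednesday
]


def mean_by_weekday(log):
    """Average tonnes collected per weekday name."""
    grouped = defaultdict(list)
    for day, tonnes in log:
        grouped[WEEKDAYS[day.weekday()]].append(tonnes)
    return {name: sum(values) / len(values) for name, values in grouped.items()}


averages = mean_by_weekday(LOG)
busiest = max(averages, key=averages.get)
print(averages, busiest)
```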
Moreover, DM algorithms present methods to identify changes and trends in the generation of household waste based on operational data, making it possible to determine the influence of data, such as the total population (Ceylan, 2020; Oliveira et al., 2019), annual income per capita (Ceylan, 2020; Dai et al., 2020), literacy rate (Kolekar et al., 2017; Pérez-López et al., 2016), age group (Kannangara et al., 2018; Kolekar et al., 2017) and monthly consumption (Dai et al., 2020) expenses in the temporal variability of MSW generation (Kolekar et al., 2017). They also make it possible to establish a relationship between the rate of plastic waste generation and socioeconomic groups (Wu et al., 2020). The main objective of these techniques is to explore data, synthesize, and process the relationship between variables in order to discover and clarify patterns for researchers (Nair et al., 2016).
Despite the potential of such techniques and tools, the results show that KDD and DM are rarely applied in approaches to SWM. This can be explained by the fact that most authors, like Abdallah et al. (2020), cite challenges related to data collection, due to the lack of data or the inability of the data to represent complex characteristics.
Main challenges and opportunities
The following are the main challenges and opportunities linked to the extraction of knowledge from solid waste databases. New approaches and technologies have been adopted as a growth strategy, both in the production sector and in the services sector, contributing to the evolution of environmental strategies in the development of processes and products. Given the new practices, reducing waste and gaining efficiency are criteria to meet sustainable projects, through non-generation, minimizing the generation or recycling of waste and generated emissions, with environmental, economic, social and occupational health benefits (IPEA, 2020).
In the complex urban environment, where the origin of the waste is diverse, the transformation of environments caused by the modernization of the infrastructure of cities results in a series of environmental setbacks in which it is important to consider the management of exorbitant quantities of MSW (Coskuner et al., 2020; Jassim and Coskuner, 2007; Li et al., 2017). Moreover, among local authorities and third parties, the large number of stakeholders involved in the solid waste production chain makes the management process extremely complicated and fraught with failures (Kannangara et al., 2018). Considering several sources of information and particular aspects, one of the most significant and arduous problems in the forecast of MSW is the definition of factors that influence the generation of waste and allow for establishing a standard that underlies decision-making (Chhay et al., 2018; Niu et al., 2020).
In addition to all the advantages related to the use of DM for knowledge extraction, it is also worth mentioning that the technique consists of a long and laborious process, in which collecting reliable data (quantity and quality) and treating them in the correct way for significant results can be a great challenge (Bagheri et al., 2019; Colvero et al., 2019; Fernández-Braña et al., 2021; Hoque and Rahman, 2020; Kumar et al., 2018; Niu et al., 2020). According to Kannangara et al. (2018), the scarcity of data sources is related to infrastructure and waste management practices. As an example, in the case of information on the types of collection, the type most identified among the case studies portrays the collection process by using containers, which shows a lack of work that portrays other realities, such as the ‘door-to-door’ collection commonly found in most municipalities (Boskovic et al., 2016).
Even in the selection phase, it is evident that very small data sets increase the complexity of the analyses and constitute models with low precision as a result of the high variability of the attributes (Ceylan, 2020; Hartnett et al., 2019). Faced with this great challenge, some authors have proposed, as a solution to the evident lack of data collection on the waste management process, the use of cargo handling records from waste transportation companies (Ram and Kalidindi, 2017). Understanding that the extraction of strategic information can be the differential to leverage the business and preserve the environment is fundamental for both SWM companies and government management. In this sense, it is essential to invest in training that demonstrates the added value of storing, structuring, and managing data in all stages of the waste management system so that in the future it is possible to extract the maximum knowledge from the available data.
It is crucial that this perception becomes common sense among the authorities and is incorporated into laws that guarantee and regulate violations across the economic, social, and environmental biases (Ahmad and Kim, 2020; Kupusamy et al., 2019). Figure 6 shows the evolution of the sustainable pillars (social, environmental, and economic) based on the authors’ interpretation of the discussions presented in each investigated publication (62 articles). In 2015, few discussions were made regarding the impact of using DM on the sustainable pillars of waste management, and only the social and economic pillars were explored. On the other hand, in 2016 there was considerable growth in the number of publications that explored the facets of sustainable development, but their focus was given only to the environmental and economic pillars. In 2018 and 2019, the environmental pillar gained prominence, highlighting the concern of organizations with the environment. Researchers’ concerns with the different pillars of sustainable development symbolize the search for paths that meet organizational needs while protecting the environment and ensuring dignity and equity for members of society. From this perspective, it is possible to see that in 2020 the three pillars were almost equally considered.

Evolution of sustainable pillars within the analyzed publications.
In the context of environmental impact, the authors tried to estimate: the illegal disposal of waste (Yang, Fan, et al., 2019), the emissions of polluting gases from the inadequate final disposal of waste (Dimishkovsk et al., 2019; Kumar et al., 2016; Vu et al., 2018), soil pollution by heavy metals (Perez-Alonso et al., 2017) and the calorific value of waste as a potential source of energy (Baghban and Shamshirband, 2019; Bagheri et al., 2019; Boumanchar et al., 2019; Drudi et al., 2019; Li et al., 2020; Rostami and Baghban, 2018). In terms of economic impact, studies sought to predict the generation of waste to promote cost efficiency in the provision of services and to identify the best form of service provision (Pérez-López et al., 2016). With regard to social impact, research has sought to predict the generation of waste in order to distribute the location of containers for collection (Cavallin et al., 2020), to identify the socio-economic factors that influence waste generation (Buenrostro-Delgado et al., 2015; Chhay et al., 2018), to assess whether the association of socio-economic and demographic factors impacts the generation of waste (Colvero et al., 2019) and to predict the generation by type of waste, biodegradable or non-biodegradable (Kumar and Samadder, 2017; Oliveira et al., 2019).
Conclusion
This research presents an SLR about the use of KDD and DM for SWM. The PICOC and PRISMA protocols were used to ensure the robustness of the research and to analyze quantitatively and qualitatively 62 documents exported from the WoS database related to the field of study.
Through the analysis, it was possible to discover that the most used data mining tools for SWM are MATLAB (29.7%) and GIS (13.5%), whereas the most used techniques are ANNs (35.8%), LR (16.0%) and SVM (12.3%). The data used by the studies analyzed came mainly from China (15.3%) and India (11.1%). However, 9.7% of the studies used data from more than one country for analysis, modelling and predictions associated with the management, collection, transport and destination of solid waste.
Thus, it can be concluded that KDD and DM techniques can improve the SWM of municipalities and countries, as they can be used as strategies to collect and analyze large volumes of data associated with the stages of collection and transport management. Moreover, these analyses allow for mapping these steps through the identification of variables and analysis of scenarios to optimize processes and decision-making about the collection and transportation of waste. The still incipient use of such techniques for SWM faces several challenges, mainly linked to the collection and treatment of data. This is because the massive processes of structuring and analysis hinder, or lengthen the time required for, the assertive analyses needed to intervene in the complex scenarios that can be proposed. On the other hand, the impact of such approaches on the environmental, social and economic pillars of sustainable development instigates research in the field of study and encourages the search for social and organizational development as well.
This research listed the main KDD and DM techniques and tools applied to urban waste management. Future work can explore the applicability and effectiveness of each mentioned technique, such as LR, ANNs, GAs and BNs, among others, to discover the potential of each one to assist in SWM. Despite the influence of the collection and transport of solid waste and the impact of these actions for sustainable development, few efforts were devoted to integrating the three pillars, since most of the work focused on an individual pillar, highlighting the need for discussions that integrate social development, organizational economics and environmental protection.
This research was limited to using only relevant documents available on the WoS, and other databases can be explored in future works, such as Scopus and Science Direct databases, to cover research not covered in this study. In addition, works published in other media, such as books and conferences, can also be explored. It is important to mention that there is a large amount of data collection and content related to the field of study available on websites and reports from private and governmental organizations, which have not yet been addressed and can be explored in future research. In addition, we explored the past 6 years of research, and future studies may explore a longer period to analyze the evolution of the field of study over time. Besides, the advantages and disadvantages of KDD and DM for SWM can be explored in depth in future research.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was financed in part by the Coordination for the Improvement of Higher Education Personnel (CAPES) – Finance Code 001, Brazil.
