EARLY DEFECT DETECTION USING CLUSTERING ALGORITHMS

Product quality is a crucial issue for manufacturing companies, so it is essential to take note of any emerging product defects. In contrast to traditional methods, modern and constantly evolving data mining methods are now being used more frequently. The main objective of this paper is to detect the potential cause, or the area of the production process, where the majority of product defects arise. A dataset from a semiconductor manufacturing process has been used for this purpose. First, it was necessary to address dataset quality. Significant multicollinearity was found in the data; correlations and variance inflation factors have been used to detect and delete the collinear variables. The MICE-CART method has been used for imputation because the original dataset contained approximately 5% of randomly missing values. In further analysis, the K-means clustering method has been used to separate the failed products from the flawless ones. Following this, the hierarchical clustering method has been applied to the failed products to create groups of product defects with similar properties. The optimal number of clusters has been determined using the Bayesian Information Criterion (BIC). Five clusters of products have been created, although only three can be classed as important for further analysis. These groups of products should be directly subjected to analysis in the production process, which can assist in identifying the source of the defects.


Introduction
Recent decades have been marked by a turbulent and fast-changing environment. Nowadays, due to rapid technological changes, automation, and robotics, a new technological revolution is taking place. A new era of Industry 4.0 and smart factories is in progress, and companies now face many challenges, such as short product life cycles, volatile demand and high customisation (Gaub, 2016). High-value manufacturing processes are increasingly moving towards flexible, intelligent production systems. To compete in future markets, manufacturing companies should be able to produce small batch sizes of a product, or even a single item, in a timely and cost-effective manner. They need to have sufficient functionality, scalability, and connectivity with customers and suppliers to meet these requirements (Schumacher et al., 2016). At the same time, to maintain or strengthen its position on the market, a modern business needs to follow the principles of quality control in its actions. In addition, to meet such challenges, systems will become more complex and more difficult to monitor and control (Mabkhot et al., 2018). It is now common manufacturing practice to minimise the number of defects and errors in a process and to do things correctly at the first attempt. The ultimate aim is to reduce the number of defective products (Wang, 2013).
In this research, we focus on defects detected in manufacturing companies, specifically in semiconductor manufacturing. This paper proposes a data mining-based knowledge discovery approach that uses a sequence of two different types of clustering methods to detect the major groups of defective products with a similar cause.

Literature Review
Quality is a term that is complex and difficult to specify. The word quality has many meanings, such as a degree of excellence, conformance with requirements, the totality of the characteristics of an entity that impact its ability to satisfy stated or implied needs, fitness for use, freedom from defects, imperfections or contamination and delighting customers (Hoyle, 1994). Various authors explain this notion differently. One of the "gurus" in quality control, William Edwards Deming (1982), defines quality as a predictable degree of uniformity and dependability at low cost and suited to the market. According to the American Society for Quality and Goetsch and Davis (2010), quality denotes excellence in goods and services, especially to the degree that they conform to the requirements and satisfy customers. The definition of quality stated by the International Organisation for Standardization (ISO) is: "The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs" (AS/NZS ISO, 1994, p. 7). Put more simply, one can say that a product has good quality when it complies with the requirements specified by the client (Knowles, 2011). In this research, we view quality as the compliance of product properties and dimensions with pre-specified company standards. Product quality is a crucial issue for manufacturing companies. It is essential for customer satisfaction, and so is directly connected with the company's revenues and market share. Quality is also closely connected to company performance. Sadikoglu and Zehir (2010) stated that Quality Management (QM) is a systematic, proven approach to improvements in organisational performance. Numerous empirical studies have attempted to investigate the relationship between QM practices and company performance (Mehran and Mehran, 2013).
There are many traditional approaches to quality management, although nowadays data mining methods have become more useful and successful in manufacturing companies. Traditional methods such as Total Quality Management (TQM) focus on quality for customer satisfaction and concurrently sustain a company's competitive advantage in today's challenging and dynamic business environment (Yin et al., 2018). Another methodology, Lean Six Sigma, combines the Six Sigma techniques, which enable companies to reduce manufacturing defects, with the lean manufacturing principles, which help companies benefit from faster processing at lower cost and with superior quality (Dragulanescu and Popescu, 2012). However, even traditional "Six-Sigma" approaches cannot eliminate all the defects in manufacturing, only a very small share, given their limitation in dealing with complex and dynamic datasets. The Zero Defects concept developed by Philip Crosby (1979) means flawless production. The concept consists of preventing the occurrence of defects and flaws in all production stages. Quality management tools must be used to achieve this (Wang, 2013). All these methods are powerful tools for quality improvement, but in the era of Industry 4.0 and smart factories, which already have highly elaborate quality control processes, they are no longer appropriate. In smart factories, mass data are collected from various sensors, and critical manufacturing-related knowledge can be hidden in the data. An example of such knowledge is a set of rules for identifying defects in the quality of the products. Human operators may never find these rules through manual investigation, which means they may never discover such hidden knowledge from the data. Traditional data analysis methods are therefore no longer the best alternative (Wang et al., 2006).
Quality improvement (QI) of industrial products and processes requires collection and analyses of data to solve quality related manufacturing problems. Traditional statistical process control approaches are less effective than data mining, especially when dealing with multivariate and autocorrelated processes (Evans, 2015). With the continual increase in process complexity, this inefficiency is becoming more apparent. A special multivariate and autocorrelated process is a process occurring within a heterogeneous production environment (a variety of types of machines, pots, etc. used for the same task). This makes the quality control of such processes more difficult (Horvath and Vircikova, 2012). Although traditional data analysis tools have been successfully used to improve the quality of products and processes, better tools now exist to mine massive data sets collected through computerised systems in the industry (Köksal et al., 2011). Data mining tools can be highly beneficial for discovering interesting and useful patterns, even in complicated manufacturing processes. However, data accumulated in manufacturing plants has unique characteristics, such as an unbalanced distribution of the target attribute, and a small training set relative to the number of input features. Thus, conventional methods are inaccurate in quality improvement cases (Choudhary et al., 2009). Data mining tools are useful in many areas of manufacturing such as defect analysis, yield improvement, quality monitoring, and process control, etc. (Rokach et al., 2008). Data mining tools can be used to extract knowledge from process data sets. The knowledge acquired can be used to minimise the number of defective products and to achieve the desired level of process performance and product quality (Ramana and Reddy, 2012).

Data Mining Application for Defect Detection
There are many data mining methods that are useful for application in manufacturing. Neural networks, association rules, types of regression, rough set theory (RST) and clustering analyses are frequently used to solve defect detection problems in manufacturing. For example, the paper by Bhuvaneswari and Sabarathinam (2013) examines the detection of defects in manufactured ceramic tiles to ensure high-density quality. The problem is concerned with the automatic inspection of ceramic tiles using an Artificial Neural Network (ANN). A detailed comparison between traditional statistical methods, the RST approach, and the extended RST approach is presented by Tseng, Jothishankar and Tong (2004). The developed algorithm was applied to an industrial case study involving quality control of printed circuit boards (PCB), especially solder ball defects. The paper by Sabet et al. (2017) presents a method for identifying unknown patterns between the manufacturing process parameters and the defects of the output products. The proposed method of fuzzy association rules also identifies the relationships between the defects.
Many articles on the implementation of clustering methods for defect detection in manufacturing have already been published, such as Defect Segmentation of Semiconductor Wafer Image Using k-Means Clustering by Saad et al. (2015). The K-means partitional clustering method was used to identify and classify bearing defects by Yiakopoulos et al. (2011). Another example is the condition monitoring architecture for dynamical systems with unknown gradual faults proposed by Chammas et al. (2014), which uses a dynamical clustering algorithm that allows a continuous update of the operating modes of the system. The combination of clustering algorithms with other methods has also become more frequent. A method using the K-means clustering algorithm in combination with a self-organising map was used by Saludes-Rodil (2015) for the classification of surface defects in wire rod production. Yusof et al. (2018) applied principal component analysis (PCA) as a pre-processing method for hierarchical clustering analysis on the frequency spectrum of a vibration signal.
From the literature review, we conclude that combinations of clustering methods are applied to defect detection less often than the other data mining methods mentioned above or a single clustering analysis on its own. The aim of this paper is to propose a combination of two different types of clustering algorithm for the detection of poor-quality products and the creation of groups of products with a similar cause of errors. To achieve this aim, we have formulated the following research question: Can we identify groups of poor-quality products with the same cause of error by using a sequence of two different clustering algorithms?

Dataset
The dataset used in this paper is from a complex modern semiconductor manufacturing SECOM process (McCann and Johnston, 2008). These are records of the monitoring of signals/variables collected from sensors and process measurement points. However, not all these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information plus irrelevant information as well as noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then the feature selection can be applied to identify the most relevant signals. The process engineers can then use these signals to determine key factors contributing to yield excursions downstream in the process. The dataset presented in this case is a selection of those features where each example represents a single production entity with associated measured features.
There are 1567 examples taken from a wafer fabrication production line, including both products that failed and products that passed quality control. For product quality, 590 measuring sensors and process measurement points (variables) were used. In other words, each example is a vector of 590 sensor measurements. This results in a dataset of 1567 × 590 = 924530 values measured during the production process. For such a large volume of measurement data, an automatic fault detection technique is essential. The large amount of metrology data obtained from hundreds of sensors makes this dataset difficult to analyse accurately. Thus, our main focus is to devise a method based on data mining techniques to build an accurate model for fault detection. The data set also contains about 5% of missing values as well as collinear variables, both of which must be resolved before the clustering.
Various papers using this dataset have been already published, such as Feature Selection and Boosting Techniques to Improve Fault Detection Accuracy in the Semiconductor Manufacturing Process written by Kerdprasop and Kerdprasop (2011). In this paper, the authors investigate the application of data mining techniques such as decision tree induction, naïve Bayes analysis, logistic regression, and k-nearest neighbour classification to create an accurate model for fault case detection. Further research using the same data set is Quality prediction modelling for multistage manufacturing based on classification and association rule mining written by Kao et al. (2017) in which the authors introduce a framework for quality prediction modelling in a multistage manufacturing system (MMS) environment.
For the implementation of the cluster analysis, RStudio software has been used. First, we will clean the data by deleting irrelevant variables and imputing the missing values. Then, using the K-means clustering method, we will split the monitored products into two groups: one group of failed products and one group of flawless products. The cluster analysis will then be applied only to the dataset of failed products. We will apply the hierarchical clustering method and vary its settings.

Methodological Approach
First, it is necessary to prepare the data set for the following analysis. For this purpose, the method of data imputation will be chosen. Then we will apply different types of clustering methods on the data and make a comparison. Several variants of algorithm settings will be used.

Preparing and Cleaning Data
The simplest solution for the missing values problem is the reduction of the data set and the elimination of all missing values. This can be done by eliminating the samples (rows) with missing values (Kantardzic, 2003) or eliminating the attributes (columns) with missing values. Both approaches can be combined. Elimination of all samples with missing values is also known as complete case analysis (Kaiser, 2014). In this case, we will reduce the attributes because there are many constant attributes and collinear variables. We will use the Variance Inflation Factor (VIF) method for multicollinearity reduction in the dataset (Paul, 2006). We can compute the VIF with the formula $\mathrm{VIF}_i = 1 / (1 - R_i^2)$, where $R_i^2$ is the coefficient of determination obtained by regressing the i-th variable on all the other variables, and analyse the magnitude of multicollinearity by considering the size of $\mathrm{VIF}_i$. A rule of thumb is that if $\mathrm{VIF}_i > 5$ then multicollinearity is high (Kuther et al., 2005). Variables with a high VIF will be deleted.
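As an illustration of the VIF screening described above, the following minimal Python sketch computes each VIF from the diagonal of the inverse correlation matrix, which is equal to 1 / (1 - R_i^2). It is not the authors' RStudio code: the function names and the toy data are hypothetical, and complete columns without missing values are assumed.

```python
from math import sqrt

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centred = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sqrt(sum(v * v for v in c)) for c in centred]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centred[i], centred[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: some variables are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF_i = 1 / (1 - R_i^2), read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]

# Hypothetical toy data: the second column is almost exactly twice the first.
toy = [[1, 2, 3, 4, 5, 6],
       [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
       [5, 1, 4, 2, 6, 3]]
flagged = [i for i, value in enumerate(vif(toy)) if value > 5]  # rule of thumb: VIF > 5
print(flagged)
```

In practice the variable with the highest VIF is usually removed first and the factors are recomputed, because deleting one variable changes the VIF of all the others. Perfectly collinear columns make the correlation matrix singular, so they have to be removed before this step.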
After the reduction of the data set, we can proceed to the data imputation. There are two basic types of imputation: single and multiple. Single imputation refers to a single estimate of the missing value and is popular because it is conceptually simple and because the resulting sample has the same number of observations as the full data set (D'ambrosio et al., 2012). Some imputation methods result in biased parameter estimates, such as means, correlations, and regression coefficients, unless the data is MCAR (missing completely at random). The bias is often worse than with listwise deletion, the default in most software. An advantage of multiple imputation over single imputation and complete case methods is that multiple imputation is flexible and can be used in a wide variety of scenarios. Multiple imputation can be used in cases where the data is missing completely at random and even when the data is missing not at random. The primary method of multiple imputation is multiple imputation by chained equations (MICE). It is also known as "fully conditional specification" and "sequential regression multiple imputation" (Wulff and Ejlskov, 2017).
MICE is a popular approach to the imputation of missing data and is available to the user through the most commonly used software packages. MICE changes the imputation problem into a series of estimations where each variable takes its turn in being regressed on the other variables (Kaiser, 2014). MICE loops through the variables, predicting each variable from the others. This procedure provides excellent flexibility as each variable can be assigned a suitable distribution, e.g., Poisson, linear or binomial (Wulff and Ejlskov, 2017).
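The chained-equations idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not MICE itself: it handles only two variables, uses a linear model for each, fills in the predicted value instead of drawing from a predictive distribution, and produces a single completed dataset rather than several. The function names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def chained_imputation(a, b, sweeps=10):
    """Impute None entries in two columns by regressing each on the other in turn."""
    cols = [list(a), list(b)]
    missing = [[v is None for v in col] for col in cols]
    # Initialise every missing cell with the observed mean of its column.
    for col, miss in zip(cols, missing):
        observed = [v for v in col if v is not None]
        fill = sum(observed) / len(observed)
        for i, is_missing in enumerate(miss):
            if is_missing:
                col[i] = fill
    for _ in range(sweeps):
        for target in (0, 1):
            other = 1 - target
            # Fit on rows where the target was observed, using the current
            # (possibly imputed) values of the other column as the predictor.
            rows = [i for i, is_missing in enumerate(missing[target]) if not is_missing]
            intercept, slope = fit_line([cols[other][i] for i in rows],
                                        [cols[target][i] for i in rows])
            # Overwrite only the cells that were originally missing.
            for i, is_missing in enumerate(missing[target]):
                if is_missing:
                    cols[target][i] = intercept + slope * cols[other][i]
    return cols

# Hypothetical toy data in which the second column is twice the first.
imputed_a, imputed_b = chained_imputation([1, 2, 3, None, 5], [2, 4, None, 8, 10])
print(imputed_a, imputed_b)
```

A full MICE run would repeat this with random draws to obtain several completed datasets and then combine the analyses across them.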
Another notable local approach is the MICE-CART, which consists of multiple imputations by chained equations (MICE) and classification and regression trees (CART). It is a nonparametric approach made to perform multiple imputations through chained equations using sequential regression trees as the conditional models (Moorthy et al., 2014). In CART methodology, the best split is found over all possible splits generated by all predictors, which minimises the impurity of the response variable within the two sub-nodes where the impurity is a measure of deviance or variation for a numerical response (in regression trees) and a measure of heterogeneity or entropy for a categorical response (in classification trees) (Edwards and Finch, 2018).
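The split search at the heart of a regression tree can be illustrated as follows. This is a minimal Python sketch for a numeric response, using the sum of squared deviations from the node mean as the impurity; the function names and the toy data are hypothetical, and a real CART implementation would also apply stopping rules and recurse into the sub-nodes.

```python
def sum_squared_deviation(values):
    """Impurity of a numeric response: sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, response):
    """Search every predictor and threshold for the split with the lowest impurity.

    rows is a list of predictor vectors and response the numeric target.
    Returns (impurity, predictor_index, threshold).
    """
    best = None
    for j in range(len(rows[0])):
        levels = sorted(set(r[j] for r in rows))
        # Candidate thresholds lie midway between consecutive distinct values.
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            left = [y for r, y in zip(rows, response) if r[j] <= threshold]
            right = [y for r, y in zip(rows, response) if r[j] > threshold]
            impurity = sum_squared_deviation(left) + sum_squared_deviation(right)
            if best is None or impurity < best[0]:
                best = (impurity, j, threshold)
    return best

# Hypothetical toy data: the response jumps when the first predictor exceeds 6.5.
rows = [[1, 7], [2, 3], [3, 9], [10, 4], [11, 8], [12, 2]]
response = [5.0, 5.2, 4.8, 9.9, 10.1, 10.0]
impurity, predictor, threshold = best_split(rows, response)
print(predictor, threshold)
```

When such a tree is used as the conditional model inside chained equations, a missing value is typically imputed from the observed responses that fall into the same leaf as the incomplete case.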

Clustering
Clustering is an essential data mining tool for the analysis of Big Data. It aims to group data objects into classes (clusters) so that objects grouped in the same cluster are similar and consistent according to specific parameters (Zerhani et al., 2015). The task is to arrange a set of objects so that the objects in the same group are more related to each other than to those in other groups (clusters). Clustering belongs to unsupervised learning. Clustering algorithms can be classified into partition-based algorithms, hierarchical-based algorithms, density-based algorithms and grid-based algorithms (Chitra and Maheswar, 2017).

Hierarchical clustering
Hierarchical clustering is a recursive partitioning of a dataset into successively smaller clusters. The input is a weighted graph where the edge weights represent pairwise similarities or dissimilarities between data points (Tan et al., 2018). Hierarchical clustering is represented by a rooted tree where each leaf represents a data point and each internal node represents a cluster containing its descendant leaves. Computing a hierarchical clustering is a fundamental problem in data analysis; it is routinely used to analyse, classify, and pre-process large datasets (Cohen-Addad et al., 2018). There is extensive literature available on hierarchical clustering and its applications, although it is impossible to discuss most of it in this paper. For some applications, the reader may refer to, e.g., Hubert (1977), Felsenstein (2003) and Castro et al. (2004).

The key operation of the agglomerative hierarchical clustering algorithm is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques that we will discuss. Cluster proximity is typically defined with a particular type of cluster in mind. Many agglomerative hierarchical clustering techniques come from a graph-based view of clusters (Rani and Rohil, 2013).

Determining the number of clusters
To determine the number of clusters, we will use the Mclust method, where the number of mixing components and the covariance parameterisation are selected using the Bayesian Information Criterion (BIC). In one dimension, there are just two models: E for equal variance and V for varying variance. In the multivariate setting, the volume, shape, and orientation of the covariances can be constrained to be equal or variable across groups. Thus, fourteen possible models can be specified (Scrucca et al., 2016).
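For reference, the criterion can be written out as below. The form assumes the sign convention of the mclust package described by Scrucca et al. (2016), under which a larger BIC indicates a better model; the symbols are notation introduced here for illustration only. For a model M with G components, $\hat{L}_{M,G}$ is the maximised likelihood, $\nu_{M,G}$ the number of estimated parameters and $n$ the number of observations. The second line counts the parameters of the EEI model (diagonal covariance with equal volume and shape) in $d$ dimensions: $G - 1$ mixing proportions, $Gd$ mean coordinates and $d$ shared variances.

```latex
\mathrm{BIC}_{M,G} = 2 \ln \hat{L}_{M,G} - \nu_{M,G} \ln n
\qquad
\nu_{\mathrm{EEI},G} = (G - 1) + G d + d
```

The penalty term grows with both the number of components and the flexibility of the covariance structure, so a more constrained model can be preferred when the gain in likelihood from a richer one is small.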

Ward's method
We can also take a prototype-based view, in which each cluster is represented by a centroid. The centroid method uses the centroid (the centre of the group of cases) to determine the distance between clusters of cases. An alternative to the usual centroid method is Ward's method. This method also assumes that a cluster is represented by its centroid, but it measures the proximity between two clusters in terms of the increase in the sum of squared errors (SSE) that results from merging the two clusters. Similar to K-means, Ward's method attempts to minimise the sum of the squared distances of points from their cluster centroids (Tan et al., 2018).
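A minimal Python sketch of agglomerative clustering with Ward's criterion is given below. It relies on the known identity that merging clusters A and B increases the SSE by |A||B| / (|A| + |B|) times the squared Euclidean distance between their centroids. The function names and the toy data are hypothetical, and the naive pairwise search is meant for illustration on small inputs only.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def ward_cost(a, b):
    """Increase in SSE caused by merging clusters a and b."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap

def ward_clustering(points, k):
    """Start from singleton clusters and repeatedly merge the pair with the
    smallest Ward cost until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected by the deletion
    return clusters

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for cluster in ward_clustering(points, 2):
    print(cluster)
```

Recording the cost of every merge, rather than stopping at k clusters, would give the merge heights from which a dendrogram can be drawn.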

Partitional clustering
Partitional clustering is the most popular class of clustering algorithms and is also known as iterative relocation. These algorithms minimise a given clustering criterion by iteratively relocating data points between clusters until an optimal partition is attained (Chitra and Maheswar, 2017). A partitioning clustering algorithm splits the data points into k divisions, where each division represents a cluster and $k \le n$, where n is the number of data points. Partitioning methods are based on the idea that a cluster can be represented by a centre point. The partition is based on a certain objective function. The clusters are formed to optimise an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are "similar", whereas the objects in different clusters are "dissimilar". Partitioning clustering methods are useful for applications where a fixed number of clusters is required. K-means, PAM (Partitioning Around Medoids) and CLARA are some of the partitioning clustering algorithms (Popat et al., 2014).

K-means clustering
K-means is one of the most popular partition-based methods. It partitions the dataset into k disjoint subsets, where k is predetermined. The algorithm keeps adjusting the assignment of the objects to the closest current cluster mean until no new assignments of objects to clusters can be made (Elavarasi et al., 2011). One advantage of this algorithm is its simplicity, but it also has several drawbacks. It is often difficult to specify the number of clusters in advance. Since it works with squared distances, it is sensitive to outliers. Another drawback is that the centroids are not meaningful in most problems (Popat et al., 2014). In this algorithm, a cluster is represented by its centroid, which is the mean (average) of the points within the cluster. This only works efficiently with numerical attributes and can be negatively affected by a single outlier. K-means is widely used in scientific and industrial applications. The technique aims to partition n observations into k clusters in which every observation belongs to the cluster with the nearest mean (Chitra and Maheswar, 2017). The K-means algorithm has several significant properties: it is highly efficient in dealing with huge data sets, it only works with numeric values, and the resulting clusters have convex shapes. The method frequently terminates at a local optimum rather than the global optimum, which is one of its major disadvantages. Further disadvantages are that the algorithm can be used only when the mean of a cluster is defined and that it requires k, the number of clusters, to be specified in advance (Vijayalakshmi and Devi, 2012).
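The iterative relocation described above can be sketched as follows. This is a minimal Python version of the standard Lloyd iteration with squared Euclidean distance, not the implementation used in the paper; the function names and the toy data are hypothetical, and the initial centroids are passed in explicitly so that the example is deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_iter=100):
    """Alternate nearest-centroid assignment and mean update until the
    assignment stops changing or max_iter is reached."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical toy data; both starting centroids deliberately lie in the same group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = k_means(points, centroids=[(0, 0), (0, 1)])
print(labels)
```

Because the result depends on the starting centroids and may be a local optimum, the algorithm is commonly restarted from several initialisations and the solution with the lowest within-cluster sum of squares is kept.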
For the K-means method, there are several functions for measuring the distance between objects. The most common is the Euclidean distance, which computes the square root of the sum of the squared differences between the coordinates of a pair of objects, as follows: $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$, where m is the number of coordinates. The Manhattan distance, or city block distance, represents the distance between points in a city road grid. It computes the sum of the absolute differences between the coordinates of a pair of objects (Grabusts, 2011). There are also other distance measures, such as the Minkowski, Cosine and Chebyshev functions (Bora and Gupta, 2014).
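The distance functions named above can be written compactly in Python. The sketch below uses the standard textbook definitions; the function names are hypothetical, and the cosine measure is expressed as a distance (one minus the cosine similarity), which is one common convention among several.

```python
from math import sqrt

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    """General form: r = 1 gives Manhattan, r = 2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def cosine_distance(p, q):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = sqrt(sum(a * a for a in p))
    norm_q = sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)

p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))
```

Note that the mean is the minimiser of squared Euclidean distance specifically; with other distances the centroid update would need to change accordingly, for example to the coordinate-wise median for the Manhattan distance.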
There are many articles concerning the K-means method application topic, such as Constrained K-means Clustering with Background Knowledge (Wagstaff et al., 2001), Improving the Accuracy and Efficiency of the K-means Clustering Algorithm (Nazeer and Sebastian, 2009) or An Algorithm for Online K-Means Clustering (Liberty et al., 2016) and many others. There is also an interesting option of Merging K-means with hierarchical clustering for identifying general shaped groups proposed by Peterson et al. (2018) although this is not our aim at this time.

Data Preparation
Almost 5% of the data points in the dataset are missing because some sensors did not work properly. First, it is necessary to choose a method for imputing the missing values. While preparing the data for imputation, multicollinearity was found in the dataset. To locate and delete the collinear variables, we used the VIF method. Fifty-nine variables were perfectly correlated with other variables, so they had to be deleted because they give the same information as the other variables present in the data file. The table below (Table 1) shows the basic statistics of the VIF computed for each variable: the lowest and highest values of VIF, the median, the average and the quartiles. All variables with a VIF greater than five can be explained by other variables, which means that they can be deleted. After this data cleaning process, 55 variables remained, which can be reasonably included in the model.
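The removal of constant and perfectly correlated variables, which precedes the VIF screening, might look like the Python sketch below. The function names, the tolerance and the toy data are hypothetical, and the sketch assumes complete columns, whereas the real dataset contains missing values that would require pairwise-complete correlations.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dot = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return dot / (norm_x * norm_y)

def drop_redundant(columns, tol=1e-9):
    """Return the indices of the columns kept after removing constant columns
    and columns perfectly correlated with an earlier kept column."""
    kept = []
    for j, col in enumerate(columns):
        if max(col) == min(col):
            continue  # constant column carries no information
        if any(abs(abs(pearson(col, columns[i])) - 1.0) < tol for i in kept):
            continue  # duplicates the information of a kept column
        kept.append(j)
    return kept

# Hypothetical toy data: column 1 is constant and column 2 is twice column 0.
columns = [[1, 2, 3, 4], [5, 5, 5, 5], [2, 4, 6, 8], [4, 1, 3, 2]]
print(drop_redundant(columns))
```

Of each perfectly correlated pair, this sketch keeps the column that appears first; which member of a pair to retain is a modelling choice that the paper does not specify.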
After the deletion of collinear and constant variables, we can proceed to the imputation of missing values. The data in our dataset is missing at random, as can be seen in Figure 4. For the imputation of missing data, we have chosen the MICE-CART function, which is more accurate than the simple imputation of the mean, median or a constant value. MICE-CART improves upon the standard MICE approach by automatically accounting for interaction effects among the variables for which imputation is needed (Moorthy et al., 2014). We now have a complete dataset without collinear and constant variables, so the clustering analysis can begin.

Clustering Analysis
First, the clustering method is applied to the full dataset to recognise those products which passed quality control and the ones that failed. In this case, we want to have two clusters because we need to separate the products that passed from those that failed. The number of clusters intended is predetermined, so we will use the K-means method. After we determine the group of failed products, the hierarchical clustering method will be used for further analysis of the location of the origin of the defects.

K-Means Clustering for All Products
Applying the K-means clustering method produces two distinct clusters, as can be seen in Figure 5 (source: authors' own processing in RStudio). The two components on the axes of the plot are the result of applying principal component analysis to the data. These are linear combinations of the input variables, which account for most of the variability of the observations. We assume that the smaller (black) cluster represents the group of failed products evaluated at the quality control station. The second (red) cluster represents the group of products that are correct. According to this model, there are 1486 flawless products and 81 defective products. For the subsequent analysis, only the set of defective products will be used to identify groups of defective products with similar properties.
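The step of treating the smaller cluster as the failed products can be expressed in a few lines of Python. The sketch is illustrative only: the function name is hypothetical, and the label vector is synthetic, built merely to reproduce the cluster sizes reported above rather than taken from the actual clustering output.

```python
from collections import Counter

def minority_cluster(labels):
    """Return the label of the smallest cluster and the row indices assigned to it."""
    counts = Counter(labels)
    smallest = min(counts, key=counts.get)
    return smallest, [i for i, label in enumerate(labels) if label == smallest]

# Synthetic labels matching the reported sizes: 1486 flawless and 81 defective.
labels = [0] * 1486 + [1] * 81
label, failed_rows = minority_cluster(labels)
print(label, len(failed_rows), len(labels))
```

The identification of the smaller cluster with the failed products is an assumption of the analysis; where pass/fail labels are available, it could be checked by cross-tabulating them against the cluster assignment.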

Hierarchical Clustering for Failed Products
The set of failed products was determined from the previous analysis; now, we will identify the groups of products with similar parameters by analysing only the failed products in order to recognise which products have similar defects. Through consecutive analysis directly in the production process, the results can be used for easier detection of the point in the process where the defects or the potential causes of the defects occur. This time, we will use the hierarchical clustering analysis because we first need to determine the number of clusters that will be created.
In the following graph, we determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximisation, initialised by hierarchical clustering, for parameterised Gaussian mixture models. The plot shows the BIC traces for all the models considered (see Figure 6). We adjusted the range of the y-axis to remove those models with lower BIC values. The EEI curve clearly indicates the best option, and from its shape we determine that the optimal number of clusters is five. The hierarchical clustering method is now applied with five clusters, using Ward's method and the Euclidean distance. The following dendrogram (Figure 7) shows the solution of this analysis, in which five clusters have been created. The products were grouped according to their parameters, that is, the values measured in the different stages of the production process. The two smallest clusters may merely reflect inaccuracies in measurement or random employee mistakes. These defects would be difficult to analyse and result in only small costs for the company, so it is unnecessary to search for their cause at this stage.
The other three clusters are of more interest to us. The defects in the products in each of these clusters are probably caused by the same event in the production process. Such an event may be, for example, bad settings on the machine, human failure or defects in the material used. These errors in the production process can cause huge additional costs for the organisation, or a loss of profit or market position. To find the exact cause of these defects, it is necessary to analyse the production process and map the flow of material and resources.
Three considerable clusters of defective products with similar features or parameters appeared. At this point, it would be necessary to conduct an analysis of the production process, but this is impossible because the dataset was created by a third party. For this reason, we can only estimate the cause of the defects that arise.
From the resulting graph (Figure 7), it appears that there could be a connection between the products in the clusters based on their serial numbers. It is possible that a specific event in the production process, such as a machine failure, caused several consecutive defective products. To determine particular causes, we would need more information about the production process, such as a record of machine failures or a material quality review.

Conclusion
In the presented paper, we analysed the data concerning the scrap in semiconductor manufacturing in order to identify the main types and causes of defects in manufactured products. For this task, we used VIF and MICE-CART methods for data pre-processing (deletion of factors causing multicollinearity, imputation) and cluster analysis (K-means, hierarchical clustering).
Using the above-mentioned approaches, we detected 81 defective products from the total of 1567 products examined. The defective products have been divided into five clusters according to their similar properties. The results of the hierarchical clustering analysis suggest that there are three substantial sources of defects in the production process. We assume that the products in each of these groups have the same or a similar cause of error. For closer investigation of the cause, it will be necessary to analyse the production process itself. We would need more information about the production process, such as a mapping of material and resource flows, records of machine failures or a material quality review. The other two clusters are insignificant because they are too small. These defects may be caused by a random event or human error; the search for their cause would probably cost more than the potential savings from implementing the corrective action. We have also shown that a sequence of two different clustering algorithms can be an effective method of identifying and classifying defects in the manufacturing process.
The limitations of this research lie specifically in the nature and the source of the data. The dataset came from a public source, so the supplementary information about the production environment was very limited. The quality of the dataset can also be considered a limitation. We had to perform data imputation and, despite the nontrivial imputation method used, the results of the analysis could have been influenced by this. To further improve the accuracy of the research, it would in principle be possible to contact the authors of the dataset and obtain more details about the dataset, the manufacturing company involved and its processes.