Human-in-the-loop: Towards label embeddings for assessing classification difficulty

Abstract

Uncertainty in machine learning models is a timely and vast field of research. In supervised learning, uncertainty can already occur in the first stage of the training process, the annotation phase. This scenario is particularly evident when some instances cannot be definitively classified. In other words, there is inevitable ambiguity in the annotation step and hence, not necessarily a single ‘ground truth’ associated with each instance. This work approaches the problem from a statistical modelling perspective. The main idea is to drop the assumption of a ground truth label and instead embed the annotations into a multidimensional space. This embedding is derived from the empirical distribution of annotations within a Bayesian setup, modelled using a Dirichlet-Multinomial framework. We estimate the model parameters and posteriors using a stochastic Expectation Maximization algorithm with Markov Chain Monte Carlo (MCMC) steps. The methods developed in this article readily extend to various situations in which multiple annotators independently label instances. To showcase the generality of the proposed approach, we apply it to three benchmark datasets for image classification and natural language inference (NLI), in which multiple annotations per instance are available. Besides the embeddings, we can investigate the resulting correlation matrices, which reflect the semantic similarities of the original classes for all three exemplary datasets.

Keywords

Annotation uncertainty; classification and clustering; Dirichlet-Multinomial Model; label variation; multiple labels

1 Introduction

Machine Learning models are increasingly used for a growing number of applications, one of which is supervised classification, for example in the form of images or texts. While such models have achieved impressive standards in recent years in terms of accuracy, the assessment of uncertainty remains an active field of open problems and research challenges. Recent survey articles discussing the field include Gawlikowski et al. (2023) or Hüllermeier and Waegeman (2021). Uncertainty thereby has numerous and intertwined sources as discussed in Gruber et al. (2023) or Baan et al. (2023) and is heavily impacted at multiple stages of the common machine learning pipeline. Gruber et al. (2023) also explicitly emphasize the role of the data itself for appropriately assessing uncertainty entirely.

In the field of deep learning, multiple major streams of research related to the quantification of uncertainty exist. Besides ensemble methods and Bayesian approaches, evidential neural networks have gained attention as a deterministic method for uncertainty quantification (Sensoy et al., 2018). Specifically, these methods conceptualize learning as the acquisition of evidence, where each new training example adds support to a learned evidential distribution. A recent survey by Ulmer et al. (2023) provides an extensive overview of evidential deep learning and discusses its strengths and weaknesses in depth. However, some lines of work also advise caution when employing evidential networks for uncertainty quantification. Jürgens et al. (2024) state that generally, epistemic uncertainty is not reliably represented by those methods and Meinert et al. (2023) showcase the issue of overparameterization for evidential regression.

However, while some parts of the overall uncertainty are already heavily studied in research, less attention has been paid to one of the major prerequisites for training classification models, namely the (un)availability of reliable ground truth labels for the training data and their uncertainty. In fact, uncertainty already starts in the labelling process for supervised machine learning, where human annotators label images or texts. Any supervised model will rely on these ‘ground truth’ labels, inherently incorporating their associated uncertainty. We refer to this type of uncertainty as ‘label uncertainty’. Commonly, such gold labels are acquired with human labelling effort, leading to multiple annotations per instance. Depending on the complexity of the problem at hand, it might suffice to aggregate the annotations into a single ground truth label, for example, by majority voting. However, in many realistic application areas, such as the classification of complex images or the assessment of language and speech, this assumption does not hold true.

Of course, humans are naturally prone to errors, leading to unreliable annotations or mistakes and thus to label noise or label errors. The problem was already tackled and discussed early in the statistical literature; see, for example, Dawid and Skene (1979). In recent years, more and more methods have been developed for handling data despite human errors, for example in the context of neural networks (Dgani et al., 2018). Peterson et al. (2019) argue that incorporating human ambiguity can improve classification models in terms of robustness. However, training supervised machine learning models based on noisy or deficient labels can lead to poor performance and high uncertainties; see, for example, Frénay et al. (2014) or more recently Frénay and Verleysen (2014) for an overview. Also, the labels might introduce some bias, as shown by Jiang and Nachum (2020), that needs to be identified and corrected if possible. Different algorithms have been introduced to tackle the problem of noisy labels; see Algan and Ulusoy (2021) for an extensive survey on various methods.

However, ambiguity in annotations cannot always be attributed to the fallibility of human annotators. Instead, label variation is also likely to arise if the assumption of a singular ground truth label for each instance is questionable. In the context of language, Plank (2022) discusses the sources of label variation. Particularly, the authors argue that the absence of a singular ground truth is often reasonable and should not be considered erroneous by default. In this line, the survey by Uma et al. (2021) discusses the disagreement of annotators. The authors conclude that suitable evaluation methods are required if a single gold label cannot be assigned. Various works in multiple domains show that disregarding label variation and leaving it untreated can indeed lead to quality issues and uncertainty. The common approach to simply summarize the annotations into a single label does not only discard valuable information, it is also an inappropriate representation of the truth and introduces remarkable amounts of uncertainty, in particular in the ‘gold’ label (Uma et al., 2021 or Aroyo and Welty, 2013; Davani et al., 2022).

This problem is prevalent across various classification domains, specifically for application areas characterized by inherent ambiguity. To showcase this, let us first consider the domain of natural language processing (NLP) or more specifically NLI, where ambiguity is ubiquitous due to the subjective interpretation of language and speech. This issue has been already extensively discussed, see, for example, Plank (2022). NLI corresponds to the task of discerning the logical relationship between two sentences, typically whether one entails the other, contradicts it or is unrelated to it. Naturally, the perception of language differs for the human annotators causing high rates of disagreement (Pavlick and Kwiatkowski, 2019; Nie et al., 2020). Table 1 shows example cases of such ambiguities. Each sentence (left column, called ‘context’) is accompanied by a second sentence (middle column, called ‘statement’). A group of 100 human labellers are asked to classify the sentence pair into either C = contradiction, N = neutral or E = entailment. Contradiction means, that the ‘statement’ contradicts the ‘context’, entailment means that the two sentences mean the same just with different wording and neutral means that the contents of the two sentences are unrelated. The ambiguity in the labelling process is clearly reflected. Gruber et al. (2024) provide a statistical approach for modelling the data-generating process in order to gain a better understanding of the label uncertainty. However, their modelling approach assumes a latent ground truth label associated with each sentence pair. While this is only a modelling assumption, numerous works claim that the assumption of a single ground truth is not appropriate for NLI tasks and instead, a more realistic representation of the labels should be used, see, for example, Aroyo and Welty (2015), Uma et al. (2021) or Plank (2022).

Table 1

The table shows 4 examples of sentence pairs from ChaosSNLI, along with the annotations, see Gruber et al. (2024). Each pair of context and statement is classified by 100 human annotators with the categories ‘contradiction’ (C), ‘neutral’ (N) and ‘entailment’ (E).

Context/Premise	Statement/Hypothesis	Human Votes [C, N, E]
A man running a marathon talks to his friend.	There is a man running.	[0, 0, 100]
A black and white dog running through shallow water.	Two dogs running through water.	[42, 14, 44]
A woman holding a child in a purple shirt.	The woman is asleep at home.	[46, 53, 1]
An elderly woman crafts a design on a loom.	The woman is sewing.	[34, 31, 35]

Similar problems arise in the domain of image classification if either the categories or the images themselves are ambiguous. Depending on the nature of the problem, assigning a singular ground truth label is often simply impossible. We consider two examples here. First, we utilize a benchmark dataset on image classification; secondly, we extend our previous work on remote sensing image classification (Hechinger et al., 2024). In the interest of space, details on the latter example are found in the supplementary material. Focussing on image classification, there are situations without a distinct ground truth, as shown in Figure 1. The exemplary images are part of the benchmark dataset Cifar-10H, introduced by Peterson et al. (2019). The dataset is designed such that each instance is assigned to a single unambiguous class, at least in theory. Still, some images defy easy classification due to ambiguities caused by the size or quality of the picture, leading to high disagreement rates within the annotations. Consequently, relying solely on majority voting to assign a singular label in such cases does not accurately reflect the underlying truth. A similar problem occurs in the classification of satellite images, which is discussed in detail in the supplementary material. Finally, we want to emphasize that the three datasets are constructed such that the observations are not random samples from some ‘super-population’. This raises the question of whether any trained model can be deployed to novel data. We do not discuss this question in this article, since our focus is on modelling uncertainty and not on the deployment to new data when the training data are biased.

Figure 1

The figure shows exemplary images from Cifar-10H along with their original labels, where a high disagreement rate between the annotators could be observed hinting at the ambiguity of the images.

In this work, we propose to move away from the premise of a sole ground truth. The three examples considered in this work underpin that the assumption of a ground truth is not always appropriate. Therefore, we explicitly allow for ambiguity for each instance. In this article, we aim to statistically model such situations in a distributional framework. Namely, we employ a Dirichlet-Multinomial Model, as discussed in Minka (2000) or Mosimann (1962). Variants of this model class have, for example, been used for clustering of text documents (Yin and Wang, 2014) and genomics data (Holmes et al., 2012 or Harrison et al., 2020). Avetisyan and Fox (2012) deploy a Dirichlet-Multinomial Mixture Model to estimate survey response rates. Eswaran et al. (2017) also connected this model class to uncertainty quantification and modelled beliefs as Dirichlet distributions to capture uncertainty. In this work, we propose a Dirichlet-Multinomial Model to estimate embedded ground truth values that express the classification difficulty and uncertainty for the respective images based on human annotations. Specifically, we construct an embedding space so that each image is located in a K dimensional space, with K as the number of categories. To do so, we pursue an empirical Bayes approach in combination with MCMC sampling and a stochastic version of the Expectation Maximization (EM) algorithm for estimation, as proposed by Celeux et al. (1996). The presented strategy provides insights into the correlation (or confusion) patterns between different classes and simultaneously allows expression and quantification of uncertainty. The use of the embedding idea combined with machine learning has been recently published in Schweden et al. (2025) in the framework of remote sensing.

The article is structured as follows. Section 2 describes the distributional framework and the algorithm used for estimation. The results on two different datasets are reported in Section 3; the third example (remote sensing) is found in the supplementary material. We consider some possible further steps and applications in Section 4. Section 5 concludes the article with a detailed discussion. The code and the data are available via GitHub; see the Appendix for details.

2 Model

2.1 Notations

Each image i, with $i = 1, \dots, n$, is assessed by a set of annotators (labellers, voters) indexed by j, where $j = 1, \dots, J_{i}$. We consider the images as independent, and the same holds for the annotators. The labellers classify each image individually into a class k, where $k = 1, \dots, K$. The corresponding vote of annotator j is denoted by $V_{i j} \in \{1, \dots, K\}$. It is notationally helpful to rewrite this vote as the K dimensional indicator vector, which we denote in bold by $\mathbf{V}_{i j} = (1 \{V_{i j} = 1\}, \dots, 1 \{V_{i j} = K\})$, with $1 \{\cdot\}$ as the indicator function. This allows us to accumulate the annotators’ votes into $Y_{i} = (Y_{i 1}, \dots, Y_{i K})$ with $Y_{i k} = \sum_{j = 1}^{J_{i}} 1 \{V_{i j} = k\}$. This vector can be considered as the vote distribution for image i. To keep the notation simple, we will drop the index i from the number of voters per image and write J subsequently. We emphasize, though, that images can be labelled by different numbers of voters, as our examples demonstrate.
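The aggregation of individual votes into the count vector $Y_{i}$ can be illustrated with a minimal sketch. The function name `vote_counts` and the example votes are hypothetical and only serve to make the notation concrete.

```python
from collections import Counter


def vote_counts(votes, num_classes):
    """Aggregate the votes V_ij in {1, ..., K} of one instance into Y_i = (Y_i1, ..., Y_iK)."""
    counts = Counter(votes)
    return [counts.get(k, 0) for k in range(1, num_classes + 1)]


# Hypothetical votes of J_i = 5 annotators for a problem with K = 3 classes.
votes_i = [3, 1, 3, 3, 2]
y_i = vote_counts(votes_i, num_classes=3)
print(y_i)  # [1, 1, 3]
```

The entries of `y_i` sum to the number of annotators of that instance, so instances with different $J_{i}$ need no special treatment.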

2.2 Binary case: K = 2

For a more straightforward presentation and interpretation of our modelling strategy, we start with the binary case K = 2. We assume a binary label representation, which is embedded into the two-dimensional space. That is, each instance (image or text) is allocated as follows:

Z_{i} = (Z_{i 1}, Z_{i 2}) \in ℝ^{2} .

The vector $Z_{i}$ can be interpreted as embedding or embedded ground truth values, meaning that we represent the labelled instance i as a point in a two-dimensional space. To simplify the notation we will drop the index i in the following.

The embedding steers ambiguity as well as the uncertainty of the labelling process. This is achieved by relating Z to the coefficients of a Beta distribution. To be specific we define $α_{Z} = exp (Z_{1})$ and $β_{Z} = exp (Z_{2})$ as parameters of a Beta distribution, from which we draw the binomial parameter π as $π \sim Beta (α_{Z}, β_{Z})$ . Given π, we obtain the image labels by drawing from the binomial distribution $Y ∣ π \sim B (J, π)$ , where J is the number of votes or annotations of the respective image. Note that J can vary for different instances, which for simplicity of notation we ignore here. If π is close to 0 or 1, the image has no or little ambiguity, that is, it is easy to label. Note that π remains unobserved, so that given Z, we have

\begin{matrix} P (Y = y ∣ Z) \propto \int_{0}^{1} (\begin{matrix} J \\ y \end{matrix}) π^{y} (1 - π)^{J - y} π^{α_{Z} - 1} (1 - π)^{β_{Z} - 1} d π \\ \propto (\begin{matrix} J \\ y \end{matrix}) \frac{B (α_{Z} + y, β_{Z} + J - y)}{B (α_{Z}, β_{Z})}, \end{matrix}

(2.1)

where B(.) denotes the univariate Beta function.
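As a numerical illustration of (2.1), the following sketch evaluates the Beta-Binomial probability on the log scale using only the standard library. The function names and the chosen values of Z are assumptions made for this example; including the factor $1 / B (α_{Z}, β_{Z})$ makes the expression a properly normalized probability function.

```python
import math


def log_beta(a, b):
    """Logarithm of the univariate Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(y, J, z1, z2):
    """log P(Y = y | Z) with alpha_Z = exp(z1) and beta_Z = exp(z2)."""
    alpha, beta = math.exp(z1), math.exp(z2)
    log_binom = math.lgamma(J + 1) - math.lgamma(y + 1) - math.lgamma(J - y + 1)
    return log_binom + log_beta(alpha + y, beta + J - y) - log_beta(alpha, beta)


# Hypothetical embedding Z = (1.0, -0.5) and J = 10 annotations.
J = 10
probs = [math.exp(beta_binomial_logpmf(y, J, z1=1.0, z2=-0.5)) for y in range(J + 1)]
print(sum(probs))  # close to 1, up to floating point error
```

Working with `math.lgamma` instead of the Gamma function itself avoids overflow for larger values of $Z_{1}$, $Z_{2}$ or J.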

Within this model setup, we can derive a couple of interpretations. Interpreting $Z \in ℝ^{2}$ as ground truth, we obtain the Beta-Binomial model (2.1). This in turn allows us to derive the mean value of π through

E (π | Z) = \frac{exp (Z_{1})}{exp (Z_{1}) + exp (Z_{2})} .

Additionally, we can quantify uncertainty by calculating the variance, which is given by

Var (π ∣ Z) = \frac{exp (Z_{1}) exp (Z_{2})}{{(exp (Z_{1}) + exp (Z_{2}))}^{2} (exp (Z_{1}) + exp (Z_{2}) + 1)} .
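The two moments can be evaluated directly from an embedding. The sketch below uses hypothetical values of Z to illustrate how the variance shrinks as both coordinates grow, while the mean only depends on their difference.

```python
import math


def beta_mean_var(z1, z2):
    """Mean and variance of pi given Z = (z1, z2), with alpha = exp(z1), beta = exp(z2)."""
    a, b = math.exp(z1), math.exp(z2)
    s = a + b
    mean = a / s
    var = a * b / (s ** 2 * (s + 1.0))
    return mean, var


# Hypothetical embeddings: three balanced ones on different scales, one clear-cut.
for z in [(-2.0, -2.0), (0.0, 0.0), (3.0, 3.0), (3.0, -1.0)]:
    mean, var = beta_mean_var(*z)
    print(z, round(mean, 4), round(var, 6))
```

For the three balanced embeddings the mean stays at 0.5, but the variance decreases with increasing $Z_{1} = Z_{2}$, which matches the interpretation of the embedding as carrying both the class tendency and its certainty.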

Figure 2

The figure shows the mean (left) and log-variance (right) of the Beta-Binomial distribution for different values of Z, expressed through colour.

For different values of Z, we plot the mean and the (log)-variance of the Beta-Binomial distribution in Figure 2. The variance expresses the uncertainty, that is, how likely an image or text is misclassified given the data at hand. The larger $Z_{1}$, the more likely the instance is classified as ‘one’. On the contrary, the larger $Z_{2}$, the more likely the image or text is classified as ‘zero’. Moreover, the smaller the values of $Z_{1}$ and $Z_{2}$ and the smaller the difference between them, the larger the variance. Hence, the concrete location of $Z_{1}$ and $Z_{2}$ expresses how likely it is that we can classify an image or text into one category and how certain we are with respect to the class.

The quantity Z is latent, but we aim to draw information about Z given the votes Y. The two variables are connected via the Beta-Binomial model (2.1) and we can estimate the distribution of Z for given votes Y by drawing MCMC samples. To do so, we sample from the posterior as follows:

\begin{matrix} f (Z | Y = y) \propto f (y | Z) \cdot f_{prior} (Z) \\ \propto (\begin{matrix} J \\ y \end{matrix}) \frac{B (α_{Z} + y, β_{Z} + J - y)}{B (α_{Z}, β_{Z})} \cdot f_{prior} (Z) . \end{matrix}

As prior distribution for Z, we use a bivariate normal with mean μ and covariance matrix $Σ$ and estimate these parameters following an empirical Bayes approach. While this prior distribution is presumably too simple to completely capture the true hidden structure in the data and hence does not constitute the underlying data-generating process, it is a convenient modelling assumption. Postulating a multivariate Gaussian distribution for the latent embedded ground truth provides numerical stability during estimation and allows us to express uncertainty, even if only one annotation is available.
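To make the posterior sampling step concrete, the following sketch draws from $f (Z | Y = y)$ for a single instance with a random-walk Metropolis sampler. The choice of a random-walk proposal, the step size, the burn-in length and the fixed prior parameters are assumptions for illustration; the text above only states that MCMC samples are drawn.

```python
import math
import random


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_lik(y, J, z):
    """Beta-Binomial log-likelihood in Z; the binomial coefficient is constant in Z and omitted."""
    a, b = math.exp(z[0]), math.exp(z[1])
    return log_beta(a + y, b + J - y) - log_beta(a, b)


def log_prior(z, mu, sigma):
    """Bivariate normal log-density up to an additive constant (sigma is a symmetric 2x2 matrix)."""
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    d1, d2 = z[0] - mu[0], z[1] - mu[1]
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return -0.5 * quad


def metropolis(y, J, mu, sigma, n_samples=4000, step=0.5, seed=0):
    """Random-walk Metropolis sampler for f(Z | Y = y) (assumed proposal, not from the article)."""
    rng = random.Random(seed)
    z = list(mu)
    log_post = log_lik(y, J, z) + log_prior(z, mu, sigma)
    samples = []
    for _ in range(n_samples):
        prop = [z[0] + rng.gauss(0.0, step), z[1] + rng.gauss(0.0, step)]
        log_post_prop = log_lik(y, J, prop) + log_prior(prop, mu, sigma)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_post_prop - log_post)):
            z, log_post = prop, log_post_prop
        samples.append(tuple(z))
    return samples


# Hypothetical instance: 9 of J = 10 annotators voted for class 'one'.
samples = metropolis(y=9, J=10, mu=(0.0, 0.0), sigma=((1.0, 0.0), (0.0, 1.0)))
kept = samples[1000:]  # discard an assumed burn-in of 1000 draws
z_hat = (sum(s[0] for s in kept) / len(kept), sum(s[1] for s in kept) / len(kept))
print(z_hat)
```

In the empirical Bayes setting, μ and $Σ$ would not be fixed as here but re-estimated from the sampled embeddings of all instances.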

With $Y_{i}$ as the votes on image i, $J_{i}$ as the number of annotations on image i and $Z_{i} \in ℝ^{2}$ as the embedded ‘true location’ of image i within the two-dimensional space, we run the estimation algorithm laid out in Figure 3. We also refer to Appendix A for additional computational details. As a result, we obtain estimates ${\hat{z}}_{i}$ for each image by averaging the MCMC samples for instance i of the last iteration. The point itself expresses the classification difficulty for the respective image and is thus far more informative than a singular ground truth label.
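Since the algorithm of Figure 3 is not reproduced in the text, the following is only a rough sketch of how a stochastic EM loop with MCMC steps could be organized for the binary case. The random-walk Metropolis kernel, the number of iterations and draws, the small ridge added to the covariance and all function names are assumptions; the actual procedure and its computational details are those of Figure 3 and Appendix A.

```python
import math
import random


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_post(z, y, J, mu, sigma):
    """Unnormalized log posterior: Beta-Binomial likelihood times bivariate normal prior."""
    a, b = math.exp(z[0]), math.exp(z[1])
    loglik = log_beta(a + y, b + J - y) - log_beta(a, b)
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    d1, d2 = z[0] - mu[0], z[1] - mu[1]
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return loglik - 0.5 * quad


def mh_steps(z, y, J, mu, sigma, n_steps, step, rng):
    """Simulation step: a few random-walk Metropolis moves for one instance."""
    lp = log_post(z, y, J, mu, sigma)
    draws = []
    for _ in range(n_steps):
        prop = (z[0] + rng.gauss(0.0, step), z[1] + rng.gauss(0.0, step))
        lp_prop = log_post(prop, y, J, mu, sigma)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            z, lp = prop, lp_prop
        draws.append(z)
    return draws


def mean_and_cov(points, ridge=1e-3):
    """Maximization step: Gaussian mean and covariance of the draws (ridge keeps it invertible)."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    c11 = sum((p[0] - m1) ** 2 for p in points) / n + ridge
    c22 = sum((p[1] - m2) ** 2 for p in points) / n + ridge
    c12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / n
    return (m1, m2), ((c11, c12), (c12, c22))


def stochastic_em(data, n_iter=30, n_mcmc=50, step=0.5, seed=0):
    rng = random.Random(seed)
    mu, sigma = (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))
    z = [(0.0, 0.0) for _ in data]
    for _ in range(n_iter):
        draws = [mh_steps(z[i], y, J, mu, sigma, n_mcmc, step, rng)
                 for i, (y, J) in enumerate(data)]
        z = [d[-1] for d in draws]  # continue each chain from its last state
        mu, sigma = mean_and_cov([p for d in draws for p in d])
    # Estimate: average of the MCMC draws of the last iteration, per instance.
    z_hat = [(sum(p[0] for p in d) / len(d), sum(p[1] for p in d) / len(d)) for d in draws]
    return z_hat, mu, sigma


# Hypothetical data: (votes for class 'one', number of annotators J_i) per instance.
data = [(10, 10), (9, 10), (5, 10), (1, 10), (0, 10), (7, 11)]
z_hat, mu, sigma = stochastic_em(data)
for (y, J), z in zip(data, z_hat):
    print(y, J, round(z[0], 2), round(z[1], 2))
```

The sketch alternates between simulating the latent embeddings given the current prior parameters and updating μ and $Σ$ from the simulated values, so that information is shared across instances through the prior.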

Figure 3

Estimating the embeddings.

2.3 Multiclass case: K > 2

The binary model can now be easily extended to more than two classes by employing the Dirichlet distribution. Now, $Z = (Z_{1}, \dots, Z_{K}) \in ℝ^{K}$ is the embedded ground truth and we obtain the parameters $α_{k} = exp (Z_{k}), \forall k = 1, \dots, K$ . From these, we can draw $π \sim Dir (α_{1}, \dots, α_{K})$ . This results in the multinomial parameter vector $π = (π_{1}, \dots, π_{K})$ , where K corresponds to the number of classes. Given π, the votes are assumed to be drawn from a multinomial distribution $Y | π \sim Mult (π, J)$ with J denoting the number of votes. This leads to the following two probability functions:

\begin{array}{l} f (Y = y | π) = \frac{J!}{y_{1}! \dots y_{K}!} \prod_{k} π_{k}^{y_{k}} \\ f (π | α) = \frac{1}{B (α)} \prod_{k} π_{k}^{α_{k} - 1} . \end{array}

In this case, the function B(.) denotes the multivariate version of the Beta function. The vector π remains unobserved and we can calculate the probability of Y given Z by marginalizing over π as follows:

\begin{matrix} P (Y = y | α) = \int_{π} f (y | π) f (π | α) d π = \frac{J!}{y_{1}! \dots y_{K}!} \cdot \frac{1}{B (α)} \cdot \int_{π} \prod_{k} π_{k}^{y_{k} + α_{k} - 1} d π \\ = \frac{J!}{y_{1}! \dots y_{K}!} \cdot \frac{Γ (\sum α_{k})}{\prod_{k} Γ (α_{k})} \cdot \frac{\prod_{k} (Γ (α_{k} + y_{k}))}{Γ (\sum_{k} (α_{k} + y_{k}))} = \frac{J!}{y_{1}! \dots y_{K}!} \cdot \frac{Γ (\sum α_{k})}{Γ (\sum_{k} (α_{k} + y_{k}))} \cdot \prod_{k} \frac{Γ (α_{k} + y_{k})}{Γ (α_{k})} \\ = \frac{J!}{y_{1}! \dots y_{K}!} \cdot \frac{B (α + y)}{B (α)} . \end{matrix}
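The closed form above can be checked numerically. The sketch below evaluates the Dirichlet-Multinomial probability on the log scale; the function names and the embedding used for the Table 1 example are assumptions for illustration.

```python
import math
from itertools import product


def log_multivariate_beta(alpha):
    """Logarithm of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def dirichlet_multinomial_logpmf(y, z):
    """log P(Y = y | alpha) with alpha_k = exp(z_k) and J = sum(y)."""
    alpha = [math.exp(zk) for zk in z]
    J = sum(y)
    log_coef = math.lgamma(J + 1) - sum(math.lgamma(yk + 1) for yk in y)
    shifted = [a + yk for a, yk in zip(alpha, y)]
    return log_coef + log_multivariate_beta(shifted) - log_multivariate_beta(alpha)


# Sanity check: the probabilities of all vote vectors with J = 4 and K = 3 sum to one.
z = (0.5, -1.0, 0.2)  # hypothetical embedding
J = 4
total = sum(
    math.exp(dirichlet_multinomial_logpmf(y, z))
    for y in product(range(J + 1), repeat=3)
    if sum(y) == J
)
print(total)  # close to 1, up to floating point error

# Votes [C, N, E] of the second sentence pair in Table 1 under the same hypothetical embedding.
print(dirichlet_multinomial_logpmf((42, 14, 44), z))
```

For K = 2 the expression reduces to the Beta-Binomial probability of the binary case.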

Again, the embedded ground truth values Z can be estimated given the votes Y using MCMC samples with the stochastic EM algorithm. Following the binary case and assuming a multivariate Gaussian prior for the embeddings Z leads to the posterior distribution $f (Z | Y) \propto f (Y | Z) f_{prior} (Z)$ .

We obtain a Dirichlet-Multinomial Model by assuming a K-dimensional embedded ground truth Z for each image. The parameter $π = (π_{1}, \dots, π_{K})$ follows a Dirichlet distribution given Z and we can easily derive expectation and variance for all entries $Z_{k} \in Z, k = 1, \dots, K :$

\begin{array}{l} E (π_{k} | Z) = \frac{exp (Z_{k})}{\sum_{k^{'} = 1}^{K} exp (Z_{k^{'}})} \\ Var (π_{k} ∣ Z) = \frac{1}{1 + \sum_{k^{'} = 1}^{K} exp (Z_{k^{'}})} \cdot \frac{exp (Z_{k})}{\sum_{k^{'} = 1}^{K} exp (Z_{k^{'}})} \cdot (1 - \frac{exp (Z_{k})}{\sum_{k^{'} = 1}^{K} exp (Z_{k^{'}})}) . \end{array}
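The moments of π given Z can be computed as in the following sketch. The diagonal entries follow the variance expression above; the off-diagonal entries use the standard Dirichlet covariance $- E (π_{k}) E (π_{k^{'}}) / (1 + \sum_{l} exp (Z_{l}))$, which is a textbook result and not stated explicitly in the text. The embeddings are hypothetical.

```python
import math


def dirichlet_moments(z):
    """Mean vector and covariance matrix of pi ~ Dir(exp(z_1), ..., exp(z_K))."""
    alpha = [math.exp(zk) for zk in z]
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    K = len(z)
    cov = [
        [
            (mean[k] * (1.0 - mean[k]) if k == l else -mean[k] * mean[l]) / (1.0 + a0)
            for l in range(K)
        ]
        for k in range(K)
    ]
    return mean, cov


# Two hypothetical embeddings that differ only by a constant shift of 2.
mean_low, cov_low = dirichlet_moments((0.0, 0.5, -1.0))
mean_high, cov_high = dirichlet_moments((2.0, 2.5, 1.0))
print([round(m, 4) for m in mean_low], [round(cov_low[k][k], 5) for k in range(3)])
print([round(m, 4) for m in mean_high], [round(cov_high[k][k], 5) for k in range(3)])
```

Shifting all entries of Z by the same constant leaves the mean unchanged and reduces every variance, so the overall level of the embedding carries the certainty while the differences between entries carry the class tendency.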

Each entry of Z corresponds to one of the K classes. The concrete values can again be interpreted in two ways. On the one hand, $Z_{k}$ hints at how likely the image is classified into category k. On the other hand, the difference between the entries of Z , that is, the distance between classes k and $k^{'}$ for $k^{'} \neq k$ , corresponds to the certainty about the category k versus $k^{'}$ .

Following the estimation procedure described in Section 2.2 adapted to the multiclass case leads to values ${\hat{z}}_{i}$ for all images $i = 1, \dots, n$ . As above, these values form an embedding of the image in the K dimensional space. These embeddings express the classification (un)certainty of the individual images. As in the two dimensional case, the variance decreases with increasing values of the embedding. The concrete algorithm is comparable to the case $K = 2$ discussed above and therefore not explicitly laid out here again.

3 Results

To showcase the generality and versatility of the proposed approach in various applications, the proposed model is applied to the three datasets described in the introduction. The datasets are typical examples in the field of multiple annotations and annotator disagreement and hence provide ground for analyzing the uncertainty associated with the labels. Table 2 contains general information about the three datasets discussed in this section.

Table 2

Overview of the datasets.

Dataset	#Images N	#Classes K	#Distinct Annotation Patterns	#Annotations J
ChaosSNLI	1514	3	832	100
So2Sat LCZ42	159581	16	360	11
Cifar-10H	10000	10	3406	[50,63]

3.1 ChaosSNLI

First, we explore the advantages of the proposed methodology in the context of the classification of language, that is, the domain of NLI, as briefly introduced in Section 1. The multi-annotator dataset ChaosSNLI¹ is based on the development set of the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) and was introduced in the context of label ambiguity by Nie et al. (2020). It contains multiple annotations for sentence pairs, that is, pairs of premise and hypothesis. For each premise, three hypotheses are originally generated by an annotator, as an entailing, neutral and contradicting description of the premise. The resulting sentence pairs of premise and hypothesis can therefore be classified as entailment, neutral or contradiction. Note that a subjective ground truth, namely the original intention of the first annotator, is available for this specific dataset. However, it cannot be recovered by the annotators in many cases and the dataset ChaosSNLI especially showcases this problem as it contains sentence pairs exhibiting a high rate of disagreement. Particularly, N = 1514 sentence pairs are re-assessed by a large number of annotators, that is, J = 100, and assigned to one of the three classes, as shown in Table 1. Due to the ambiguous nature of language and the individual perception, the disagreement rate in the annotations is high and the original true label cannot be recovered reliably. For the classification of language, the existence of a single ‘gold’ label is especially doubtful and hence, the need for alternative and more appropriate representations of labels persists. Applying the methodology proposed in Section 2 allows us to estimate embedded ground truth vectors for the observations based on the provided annotations, which will be analyzed in this subsection. First, let us return to the exemplary sentence pairs provided in Table 1 and inspect the respective embeddings.
Figure 4 shows the estimated values for the exemplary sentence pairs, along with the observed annotations as well as the MCMC samples. While the estimated values for observation 34 (upper left plot) express clear affiliation to class entailment, all other embeddings reflect the ambiguity within the sentence pairs and also the associated annotations. Not only are the class-specific estimated values rather small and hence similar across the three categories, we also see a tendency towards class neutral for ambiguous instances, even though the majority vote might advocate otherwise. This expresses ambiguity with respect to classification and the ‘weaker’ interpretation of class neutral compared to the other two categories. If all three classes received annotations, it is semantically unlikely that the respective instance can be uniquely classified into either entailment or contradiction. Hence, the model favours class neutral in such situations.

Figure 4

The plots show the estimated embedded ground truth vectors for exemplary sentence pairs from the dataset ChaosSNLI. The actual estimated vector is shown as orange line, the green lines represent the MCMC samples and the actual annotations are shown as grey bars.

For this particular application, the classes themselves are by definition uncorrelated or negatively correlated. This property is also expressed by the resulting estimated embeddings. To visually inspect the results, we employ dimensionality reduction techniques for easier exploration. We specifically utilize principal component analysis (PCA) to extract the principal components from the estimated embeddings. PCA is a widely used technique for linear dimensionality reduction, commonly employed for exploratory analysis and visualization of high-dimensional data. For detailed information on the methodology, refer to, for example, Jolliffe (2002). Here, we especially focus on the visualization benefits of the technique. Namely, it is possible to plot the observations in a so-called two-dimensional biplot after applying PCA. Figure 5 shows the respective plot for the dataset ChaosSNLI. The estimated embeddings are projected onto the two-dimensional space spanned by the first two principal components. The embeddings of the instances are visualized as scattered points, where their overall similarity is expressed by proximity. In this case, no specific clustering is apparent. By colouring the scattered points according to the observed majority vote, we observe some overlap of the individual classes in the two-dimensional embedding space. Additionally, the vectors correspond to the original variables, that is, the categories and dimensions of the embeddings. The angles between the vectors express the degree of correlation between the variables, that is, small angles suggest a high positive correlation. However, in this case, the angles between variable neutral and the other two variables are roughly 90°, indicating no correlation between the quantities. In contrast, the angle between variables entailment and contradiction is close to 180°, hence expressing a negative correlation. Of course, this is reasonable from a semantic perspective and in line with the interpretation of the classes.
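The biplot construction can be sketched as follows, assuming NumPy is available. The embeddings are simulated with a hypothetical correlation structure (first and third column strongly negatively correlated, second column independent) and are not the estimates from ChaosSNLI; the scaling of the arrows by the singular values is one common biplot convention and an assumption here.

```python
import numpy as np


def pca_biplot(embeddings, n_components=2):
    """Return PCA scores (points), variable arrows and explained variance ratios."""
    X = np.asarray(embeddings, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each embedding dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    # Arrows scaled so that their inner products approximate the covariances.
    arrows = Vt[:n_components].T * (S[:n_components] / np.sqrt(len(X) - 1))
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return scores, arrows, explained


def angle_deg(u, v):
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


# Simulated embeddings with hypothetical columns (contradiction, neutral, entailment).
rng = np.random.default_rng(0)
t = rng.normal(size=300)
z = np.column_stack([
    t + 0.3 * rng.normal(size=300),
    rng.normal(size=300),
    -t + 0.3 * rng.normal(size=300),
])
scores, arrows, explained = pca_biplot(z)
print(angle_deg(arrows[0], arrows[2]))  # wide angle: negative correlation
print(angle_deg(arrows[0], arrows[1]))  # roughly a right angle: little correlation
```

The rows of `scores` are the scatter points of the biplot and the rows of `arrows` are the vectors of the original dimensions, so that angles between arrows can be read as described above.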

ID	Context/Premise	Statement/Hypothesis
34	A man running a marathon talks to his friend.	There is a man running.
1168	A black and white dog running through shallow water.	Two dogs running through water.
1177	A woman holding a child in a purple shirt.	The woman is asleep at home.
1371	An elderly woman crafts a design on a loom.	The woman is sewing.

3.2 Cifar-10H

The second dataset is a version of Cifar-10, a popular benchmark dataset for image classification, as introduced by Krizhevsky et al. (2009). The subset Cifar-10H² as introduced by Peterson et al. (2019) contains multiple annotations for images in the test set, reflecting the uncertainty stemming from differences in human perception. Here, the natural images are categorized into unambiguous classes, see Table 2. The original dataset has been extended with soft labels, that is, multiple annotations, to achieve better generalization for classification models, specifically on out-of-sample datasets, see Peterson et al. (2019) and Battleday et al. (2020). Therefore, N = 10000 images of K = 10 classes from the test set of Cifar-10 were annotated by 2571 Amazon Mechanical Turk workers. After an initial training phase, each worker annotated 200 images, 20 per category. To identify and remove low-performance annotators, attention checks were introduced after every 20 trials.

Figure 5

The biplot shows the estimated embeddings for ChaosSNLI, projected into two dimensions. The scatterpoints represent the instances, coloured by majority vote. The original dimensions are represented as arrows.

The current setting differs from the previously discussed dataset. Most images belong to one of the unambiguous categories and can be reliably classified by untrained annotators. Nevertheless, it is helpful to additionally inspect label embeddings reflecting the individual human perception, which can still be ambiguous. Due to the small size of the images, the pictured class is also not always identifiable, as shown in Figure 1. The dataset contains a high degree of human consensus due to its nature but also enough images where the annotation is still uncertain. This is also visible in the majority votes. While each class originally contains 1000 images, the number of images classified into the categories according to the majority vote varies slightly between 981 and 1015. Additionally, the images can be easily assessed and evaluated against the annotations, in contrast to the dataset presented previously. By applying the proposed model we generate embeddings of the images in the appropriate label space, which contain a notion of uncertainty and reflect the original annotations, without the loss of information by taking the majority voting.

Returning to the exemplary images from Figure 1, the estimated ground truth embeddings are shown in Figure 6. The upper plot in Figure 6 shows the label embedding for an image of a ship, which is clearly visible in the picture and therefore also identifiable by the annotators. This is reflected in the respective label embedding with a high positive value for class ship. For the second image, two annotators did not agree with the others and labelled the picture of a frog as horse or deer. Most of the labellers could correctly assign the label frog. The estimated label embedding reflects this by assigning the highest value to the class frog and small positive values to the other two classes. Nevertheless, the classification of the image is easy, which is expressed by the embedding. This does not hold for the images, which correspond to the two lower plots. The correct label for the third image was cat, correctly identified by the majority vote. Nevertheless, the image is quite ambiguous, which is reflected in the annotations and therefore also in the label embedding. Using only the majority vote would therefore lead to a correct label while losing a large amount of information about the inherent uncertainty. For the last image, the annotators did not agree at all and could not recover the label deer. In fact, almost every class received votes. In this case, the label embedding clearly reflects this confusion by assigning similar values close to zero to all classes, reflecting the classification uncertainty.

Figure 6

The plots show the estimated embeddings (orange) for exemplary images of the dataset Cifar-10H, along with the votes (bars) and the MCMC samples (green).

Next, we repeat the analyses carried out for the previous dataset, that is, we project the estimated embeddings onto a two-dimensional biplot via PCA and calculate their correlation matrix. The biplot of the projected embeddings is displayed in Figure 7a and shows a clear separation between classes referring to animals and classes referring to objects. This is also expressed by the correlation matrix, shown in Figure 7b. Again, this reflects possible similarities between images from correlated classes, which arise from individual human perception despite the clear separation of the classes by definition.

Figure 7

The subfigures show additional results for the dataset Cifar-10H via the biplot and the correlation matrix of the estimated embeddings.
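The two summaries used here, a PCA projection for the biplot and the correlation matrix of the estimated embeddings, can be sketched as follows. This is only an illustration under stated assumptions: the array `z_hat` is filled with random placeholder values rather than actual estimates, and PCA is computed via a singular value decomposition of the centred embeddings, which may differ from the implementation used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 4  # hypothetical number of instances and classes

# Placeholder for the estimated embeddings (n x K); random values only.
z_hat = rng.normal(size=(n, K))

# PCA via SVD of the column-centred embeddings.
centered = z_hat - z_hat.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

scores = centered @ Vt[:2].T            # instance coordinates in the biplot (n x 2)
loadings = Vt[:2].T                     # class directions in the biplot (K x 2)
explained = s[:2] ** 2 / np.sum(s ** 2) # share of variance of the first two components

# Correlation matrix between the K embedding dimensions (classes).
corr = np.corrcoef(z_hat, rowvar=False)

print(scores.shape, loadings.shape, corr.shape)
```

Plotting `scores` as points and the rows of `loadings` as arrows gives a biplot; strongly correlated classes in `corr` correspond to arrows pointing in similar directions.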

3.3 Results on So2Sat LCZ42

Finally, for the results on the remote sensing example, we refer to the supplementary material.

4 Outlook

Naturally, the question arises of how to use the information gained from embedding the ambiguous labels into a multidimensional space. The two main goals are to improve the supervised model tasked with the corresponding classification problem and, possibly, to refine its uncertainty estimates. In many applications, training the classification model on averaged labels or labels obtained via majority voting is still common practice. In the case of high annotator disagreement due to ambiguities, this can lead to major problems related to the associated uncertainties (Davani et al., 2022; Plank, 2022; Baan et al., 2023). Koller et al. (2024) propose to instead integrate the annotation uncertainty via the empirical distribution of the annotations. Their work shows that incorporating this uncertainty leads to better generalization and calibration of the classification model. However, the benefit of the empirical distribution strongly depends on the number of annotations and is limited to the observed disagreement for a single instance. The idea of estimating label embeddings via a distributional approach presented in this work offers a possibility to overcome these limitations. In particular, it is possible to train a classification model directly on the estimated embeddings $\hat{z} = ({\hat{z}}_{1}, \dots, {\hat{z}}_{n})$, which result from the estimation process as the mean of the MCMC samples in the last iteration. These embedded ground truth values retain information about all annotations for the respective observation and additionally incorporate knowledge about the annotations globally, across all instances. This leads to a sounder representation of the labels, expressing uncertainties due to ambiguities of the instances themselves as well as ambiguities due to the similarity of specific categories. Hence, this approach naturally handles images that cannot be directly classified into one class only.
To integrate the embeddings into a deep learning framework, several strategies are available. While it is possible to directly learn the embedded ground truth vectors via a regression framework, reformulating the label embeddings into a Dirichlet distribution also allows us to stay within the world of classification. Either way, by incorporating the embeddings as labels in a machine learning framework, we expect the model to be better calibrated and to yield more expressive predictive uncertainties. While this is beyond the scope of this article, the results presented here serve as a valuable starting point for future work.
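As a rough illustration of the two strategies, the sketch below first averages hypothetical MCMC draws into embeddings and then computes a regression loss and a classification loss against them. Several parts are assumptions made only for this example: the draws and the network outputs are random placeholders, and the classification variant maps embeddings to soft labels with a softmax, which is a simple stand-in and not the Dirichlet reformulation mentioned above.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between soft targets and softmax(logits)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())

rng = np.random.default_rng(1)
S, n, K = 100, 5, 3  # hypothetical: MCMC draws, instances, classes

# Placeholder MCMC draws from the last iteration (S x n x K).
samples = rng.normal(size=(S, n, K))

# Estimated embedding per instance: mean over the MCMC draws.
z_hat = samples.mean(axis=0)

# Stand-in for the outputs of some classifier on the same n instances.
logits = rng.normal(size=(n, K))

# Strategy 1 (regression): predict the embedding vectors directly.
loss_reg = float(np.mean((logits - z_hat) ** 2))

# Strategy 2 (classification): softmax soft labels, an assumption of this sketch.
soft_targets = softmax(z_hat)
loss_cls = soft_cross_entropy(logits, soft_targets)

print(loss_reg, loss_cls)
```

In a real training loop, either loss would be minimized with respect to the network parameters; which variant yields better calibration is left open here, as in the text.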

5 Discussion

Classification models commonly depend on labelled training data, that is, each instance is linked to an established ground truth label or ‘gold’ label. Generating these ground truth labels requires substantial human effort and is prone to errors, which cause uncertainty. However, unreliable labels cannot always be attributed to human failure. In many applications, assigning a single label is unrealistic or even impossible due to the ambiguity of the instances themselves. A single ground truth label often cannot account for the complexity of, for example, images or sentences. This is often expressed through a high rate of disagreement in the annotations received from human labellers. Hence, the single-label approach results in a substantial loss of information and introduces additional uncertainty into the classification process. Therefore, moving beyond this limiting assumption is necessary in certain applications. This can be done by considering more flexible and adaptive strategies.

This article focuses on classifying text or images, addressing the specific case where we cannot assume that every observation can be uniquely classified into one class. Based on multiple annotations per observation, we propose to embed the observations into a K-dimensional space instead of restricting them to a single label.

The proposed estimation procedure leads to interesting results, as reported in Section 3. We estimate label embeddings for three different datasets, emphasizing the generality of our approach and its usefulness in diverse settings. First, we apply the method to the dataset ChaosSNLI from the domain of language classification. The dataset contains especially ambiguous sentence pairs and a high number of annotations per instance. The assumption of a single gold label is especially doubtful for the classification of language, due to its inherent ambiguity and subjectivity. Instead, multi-dimensional embeddings can serve as a more appropriate representation of the underlying truth. Second, we move away from expert labels and inspect the performance of our model on a crowd-sourced dataset, namely the multiply annotated dataset Cifar-10H. The results show that even the classification of images into well-separated and naturally distinguishable categories could benefit from using label embeddings instead of hard-coded labels. Third, we apply the proposed method to the earth observation dataset So2Sat LCZ42, as provided in the supplementary material. Here, not only do the satellite images themselves exhibit a high degree of ambiguity, but the categories are also similar in terms of their composition, which complicates the assignment of a single label even further.

The proposed model and the estimation framework are very flexible and hence, the presented work can be easily adapted to any classification problem with multiple annotations.

These insights can be valuable in multiple regards and pave the way for future research in various directions. While the presented results already deliver interesting insights into the annotation tasks, they of course rather serve as a preprocessing step for further work. The long-term goal is to use label embeddings within a complete machine-learning framework. In particular, we are interested in training classification models on multi-dimensional embeddings instead of single labels, that is, incorporating information about label uncertainty directly into the model. This work can also serve as a basis for analyzing different design choices for label generation for image classification problems. The trade-off between the number of instances and the number of annotators is a well-known problem, related to experimental design. For problems with a high degree of ambiguity, determined by the proposed model, acquiring more annotations instead of more instances is beneficial. Vice versa, if classification is ‘easy’, that is, the embeddings reflect clear class affiliations, a smaller number of annotations might be sufficient and one should concentrate on generating more labelled instances instead. This boils down to the question ‘more labels or more cases’ with some first ideas in the field of NLP discussed in Gruber et al. (2024). We believe that our modelling framework could be of great benefit for future steps towards better handling label uncertainty for machine learning models.

Acknowledgements

The present contribution is supported by the Helmholtz Association under the joint research school ‘HIDSS-006 - Munich School for Data Science@Helmholtz, TUM and LMU’. The last author also acknowledges the Munich Center for Machine Learning (MCML).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

References

Algan and Ulusoy (2021) Image classification with deep learning in the presence of noisy labels: A survey. Knowledge-Based Systems, 215, 106771.

Aroyo and Welty (2013) Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. WebSci2013. ACM.

Aroyo and Welty (2015) Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1), 15–24.

Avetisyan and J-P Fox (2012) The Dirichlet-multinomial model for multivariate randomized response data and small samples. Psicologica: International Journal of Methodology and Experimental Psychology, 33(2), 362–90.

Baan, Daheim, Ilia, Ulmer, Li H.-S, Fernández, Plank, Sennrich, Zerva and Aziz (2023) Uncertainty in natural language generation: From theory to applications. arXiv preprint arXiv:2307.15703.

Battleday, Peterson and Griffiths (2020) Capturing human categorization of natural images by combining deep networks and cognitive models. Nature Communications, 11, 5418.

Bowman, Angeli, Potts and Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics. doi:10.18653/v1/D15-1075. URL http://aclweb.org/anthology/D15-1075.

Celeux, Chauveau and Diebolt (1996) Stochastic versions of the EM algorithm: An experimental study in the mixture case. Journal of Statistical Computation and Simulation, 55, 287–314. doi:10.1080/00949659608811772.

Davani, Díaz and Prabhakaran (2022) Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10, 92–110.

Dawid and Skene (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28, 20–8.

Dgani, Greenspan and Goldberger (2018) Training a neural network based on unreliable human annotation of medical images. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 39–42. IEEE.

Eswaran, Günnemann and Faloutsos (2017) The power of certainty: A Dirichlet-multinomial model for belief propagation. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 144–52. SIAM.

Frénay and Verleysen (2014) Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25, 845–69. doi:10.1109/TNNLS.2013.2292894.

Frénay and Kabán (2014) A comprehensive introduction to label noise. ESANN 2014 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 23–5. Bruges.

Gawlikowski, Tassi C. R. N, Ali, Lee, Humt, Feng, Kruspe, Triebel, Jung, Roscher, et al. (2023) A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56, 1513–89.

Gruber, Schenk, Schierholz, Kreuter and Kauermann (2023) Sources of uncertainty in machine learning - a statisticians’ view. arXiv preprint arXiv:2305.16703.

Gruber, Hechinger, Aßenmacher, Kauermann and Plank (2024) More labels or cases? Assessing label variation in natural language inference. In The Third Workshop on Understanding Implicit and Underspecified Language. URL https://openreview.net/forum?id=9vL3GBWt9w.

Harrison, Calder, Shastry and Buerkle (2020) Dirichlet-multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data. Molecular Ecology Resources, 20, 481–97.

Hechinger, Zhu and Kauermann (2024) Categorising the world into local climate zones: Towards quantifying labelling uncertainty for machine learning models. Journal of the Royal Statistical Society Series C: Applied Statistics, 73, 143–61.

Holmes, Harris and Quince (2012) Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PloS One, 7, e30126.

Hüllermeier and Waegeman (2021) Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110, 457–506.

Jiang and Nachum (2020) Identifying and correcting label bias in machine learning. In International Conference on Artificial Intelligence and Statistics, pages 702–12. PMLR.

Jolliffe (2002) Principal Component Analysis for Special Types of Data. 2nd edition. New York: Springer.

Jürgens, Meinert, Bengs, Hüllermeier and Waegeman (2024) Is epistemic uncertainty faithfully represented by evidential deep learning methods? arXiv preprint arXiv:2402.09056.

Koller, Kauermann and Zhu (2024) Going beyond one-hot encoding in classification: Can human uncertainty improve model performance in earth observation? IEEE Transactions on Geoscience and Remote Sensing, 62, 1–11. doi:10.1109/TGRS.2023.3336357.

Krizhevsky and Hinton (2009) Learning multiple layers of features from tiny images. 7. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Martin, Quinn and Park (2011) MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42, 22. doi:10.18637/jss.v042.i09.

Meinert, Gawlikowski and Lavin (2023) The unreasonable effectiveness of deep evidential regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9134–42.

Minka (2000) Estimating a Dirichlet distribution. Technical Report. MIT. https://vismod.media.mit.edu/pub/tpminka/papers/minkadirichlet.ps.gz

Mosimann (1962) On the compound multinomial distribution, the multivariate β-distribution and correlations among proportions. Biometrika, 49, 65–82.

Nie, Zhou and Bansal (2020) What can we learn from collective human opinions on natural language inference data? URL http://arxiv.org/abs/2010.03532. arXiv:2010.03532 [cs].

Pavlick and Kwiatkowski (2019) Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7, 677–694.

Peterson, Battleday, Griffiths and Russakovsky (2019) Human uncertainty makes classification more robust. Proceedings of the IEEE International Conference on Computer Vision, pages 9616–25. doi:10.1109/ICCV.2019.00971.

Plank (2022) The ‘problem’ of human label variation: On ground truth in data, modelling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.731.

Schweden, Hechinger, Kauermann and Zhu (2025) Can uncertainty quantification benefit from label embeddings? A case study on local climate zone classification. IEEE Transactions on Geoscience and Remote Sensing, 63, 1–14.

Sensoy, Kaplan and Kandemir (2018) Evidential deep learning to quantify classification uncertainty. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada. https://proceedings.neurips.cc/paper/2018/file/a981f2b708044d6fb4a71a1463242520-Paper.pdf.

Ulmer, Hardmeier and Frellsen (2023) Prior and posterior networks: A survey on evidential deep learning methods for uncertainty estimation. Transactions on Machine Learning Research. https://openreview.net/forum?id=xqS8k9E75c

Uma, Fornaciari, Hovy, Paun, Plank and Poesio (2021) Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72, 1385–1470.

Yin and Wang (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, pages 233–42.
