Abstract
Semantic image interpretation (SII) is the process of generating meaningful descriptions of the content of images. Background knowledge (BK), in the form of logical theories, is extremely useful for SII. State-of-the-art algorithms for SII mainly adopt a bottom-up approach, which generates semantic interpretations of images starting from their low-level features. In these approaches BK is used only at a late stage for both enriching the semantic descriptions and improving image retrieval. In this paper, we show how BK plays an important role also during the early phase of SII. To this aim, we propose: (i) a reference framework where a semantic image description is a partial model of the BK. The elements of the partial model are grounded (linked) to a (set of) image segment(s). (ii) A loss function that evaluates how well this partial model fits the picture; (iii) a clustering-based optimization process that searches the partial model that better fits a picture. BK is used to prune branches of the search space that correspond to partial models which are inconsistent with BK. To evaluate our approach, we built a gold standard dataset of 203 pictures annotated with complex objects and their parts. We also evaluated our method on a reference dataset in Computer Vision, namely, the
Introduction
Semantic image interpretation (SII) is the task of generating a meaningful explanation of the content of images [18]. SII is much more than image labelling—the description of image content with a (set of) label(s)—SII aims at detecting the objects in images, their types and the relations between them.
The main challenge in SII is bridging the so called semantic gap [22], which is the complex correlation between low-level image features and high-level semantic concepts. In recent years, it became clear that ontological knowledge about image context and content plays a key role in bridging the semantic gap [35, 37]. Examples of useful ontological knowledge are: knowledge about objects qualities, e.g., color, shape, relative size, knowledge about topological and spatial properties of objects, e.g., the context where objects usually appear, the relative position where an object is likely to be; relational knowledge,e.g., the parts of complex objects or the co-occurrences of objects; taxonomical knowledge, i.e., hierarchies of object types. In the following we use the term background knowledge (BK) to indicate the ontological knowledge exploited in SII. Background knowledge is nowadays largely available in the form of RDF resources and OWL ontologies with the spread of the Semantic Web (SW) and Linked Open Data.
Most of the current works on SII that exploit BK [3, 35] are based on the so called bottom-up approach. Object types and relations are obtained by lifting up, at the semantic level, low-level image features such as colours, texture and contours. BK is used to check the consistency of the image descriptions generated by the bottom-up algorithm. BK is also used to enrich the description with new facts that logically follows from the ones obtained by the bottom-up approach. The main drawback of these approaches is that BK cannot affect the process of constructing/searching for the image interpretation, with the effect that non-optimal explanations could be generated. In this paper we want to overcome this limitation by making BK more active in the generation/searching process of SII. To this aim, we propose the following: A reference framework where the semantic interpretation of an image is a partial model of an ontology representing the BK. The elements of the partial model are grounded (linked) to a (set of) image segment(s). Figures 1 and 2 show the input and the output of our SII approach. A loss function that measures the semantic gap between a partial model (the semantics) and the image content (low-level features). The partial model that minimizes the loss best fits the image content. An incremental clustering-based algorithm that approximates the minimum of the loss function and, thus, returns the partial model that better fits a picture content. BK is used to prune the search space and to guide the algorithm.
In a nutshell, given an ontology 𝒪, representing our BK, and a labelled picture 𝒫—a semantically segmented picture where each segment has been assigned with a set of weighted semantic labels—(see Fig. 1) our method incrementally builds a partial model ℐ
p
, consistent with 𝒪, that minimizes a loss function. The evaluation of our method investigates these aspects: the quality of the predicted partial models with respect to the standard We compared the performance on these datasets with a baseline where the method uses only numeric features. The joint use of semantic and numeric information outperforms the baseline. We tested if and how the cardinality axioms of the ontology affect the performance. The best performance are obtained by removing the cardinality axioms.
Moreover, we implemented the entire pipeline from an image to its partial model. The input of our clustering algorithm (an image with detected parts of objects) is provided by a deep learning-based object detector [14] (Fast R-CNN). We compared the composite objects detected by our method with the composite objects detected by Fast R-CNN. From a qualitative analysis it emerges that there are cases where R-CNN on parts plus our method performs better than R-CNN on composite objects.
This work extends and improves our previous work [9] in the following aspects: we explicitly define the loss as a function of the semantic and euclidean distance between segments of the picture. This improvement lead to an increase of the performance. We extend the evaluation to a larger dataset and to a standard dataset of the Computer Vision community. Finally, we implemented the entire pipeline from an image to its partial model where the input of our algorithm is obtained with the use of a standard object detector.
Related work
Among the vast literature on SII, we concentrate on two groups of approaches. In the first group BK is encoded in a logical theory and logical reasoning is exploited during SII. In the other group, BK is textual knowledge exploited by neural systems during SII. The output is a textual (not formal) description of the image content and key terms occurring in the text are aligned with the corresponding regions of thepicture.
Logic-based approaches
The seminal work of [30] proposed to formalize the whole SII process as a reasoning task in First Order Logic (FOL). This approach assumes that basic elements of the scene and their spatial relations are already identified by some low-level image processing. The description image content is derived, via pure logical reasoning, from these basic facts. Lately [32] observed that a complete description of the world cannot be obtained from a picture and proposed to represent the interpretation of images content with a partial model of a knowledge base. The process of generating a picture interpretation was a pure logical process and information of low-level image features were ignored. Neumann and Möller [26] proposed to integrate low-level features in the BK represented as a Description Logic (DL) theory. Axioms represent the connection between semantic types and low-level features (via data properties and concrete domains). For example, the standard dimension of a plate is formalized by a restriction on the data property
To summarize, our work tries to take the best of the above-mentioned works: it uses a very expressive DL, it deals with multiple labels coming from a low-level image analysis. It embeds logical constraints and numeric information into a machine learning algorithm to jointly reason on semantic and numeric features. It uses abduction to introduce new elements that better explain the image content, it is based on an unsupervised approach.
Caption-based approaches
Alternative methods for SII rely on neural networks. The combination of a Convolutional Neural Network over image regions and a Recurrent Neural Network over sentences allows the learning of a model to align image regions with fragments of text [20]. This enables the automatic generation of captions describing the content of the image with the alignment between image regions and words. Another recent model for caption generation that learns words-image alignments first extracts a set of features from the image and then a long short-term memory network produces a caption word by word [36].
From the alignment between images and words it is possible to extract a graph that describes the image, but the lack of a formal semantics does not allow to express high-level constraints on the produced description, e.g., a person cannot have more than two legs. Moreover, it is not possible to perform complex reasoning and check the consistency of the image description. These methods need large annotated dataset for training and the alignment they produce is not fine-grained, e.g., the noun phrase “three dogs” is aligned with a single region showing three dogs but the model is not able to recognize the presence of three different individuals of type “dog”. In addition, only few objects are aligned with thetext.
Semantic image interpretation problem
We start from semantically segmented pictures which labels are potential object classes by using state-of-the-art semantic segmentation algorithms or object detectors, e.g., [23, 14]. Each segment has a set of weighted labels. Weights represent the level of confidence of the output of the object detection. Labels are taken from the signature Σ used to specify the BK. Formally: a labelled picture is a pair 𝒫=〈S, L〉 where S = {s1, …, s n } is a set of segments of the picture 𝒫, a segment is a set of pixels, and L is a function that associates to each segment s ∈ S a set L (s) ⊆ Σ × (0, 1] of weighted labels 〈l, w〉.
Starting from a semantically labelled picture, to generate a semantic interpretation of the image the following issues have to be solved: to select the correct label among those associated to each segment; to decide if two segments with the same label are part of the same object; to find the semantic relations between the objects corresponding to segments; to cluster simple objects in a composite object, e.g., clustering the segments labelled with “wheel”, “windows” and “door” into a composite segment of type “car”.
To accomplish these tasks, we exploit low-level features of the picture and BK. We suppose that BK is encoded in a DL [2] knowledge base 𝒦ℬ on the signature Σ used for labelling segments.
We briefly introduce one of the most common description logics called
1
. Given a signature Σ = Σ
C
⊎ Σ
R
⊎ Σ
I
, composed of three disjoint sets of symbols for concepts, relations (or roles) and individuals respectively, a concept is defined by the following grammar:
An interpretation ℐ of Σ is a pair 〈Δℐ, · ℐ〉, where Δℐ is a non empty set called the interpretation domain of ℐ, and ·ℐ is a function that maps concepts names in subsets of Δℐ, relation names in subsets of Δℐ × Δℐ and individuals in elements of Δℐ. The interpretation function is extended to all concepts as follows:
An interpretation is a complete abstract description of the state of the world in terms of existing objects (i.e., the elements of Δℐ), object types (i.e., the interpretations via ·ℐ of the symbols in Σ
C
) and relations between objects (i.e., the interpretations of the symbols in Σ
R
). A knowledge base 𝒦ℬ on Σ is a set of TBox and ABox axioms. ℐ is a model of a knowledge base 𝒦ℬ if it satisfies all the axioms in 𝒦ℬ. The axioms of the knowledge base are constraints on the states of the world. For instance, the axiom “
To better explain our proposal, we use a simple running example of Fig. 1. The input required by our method is a labelled picture with each segment having simple (not composite) object types and weights, and an ontology that relates composite objects with their parts. Notice that, if we search, at this stage, for a picture containing a horse and a person the picture will not be returned. Intuitively our goal is to infer the fact that the picture contains a horse and a person along with their parts. I.e., we want to generate a graph and an alignment like the ones of Fig. 2.
Pictures provide partial views of the state of the world. E.g., due to occlusions, only one leg of a person might be visible. Consequently, picture content should be represented with a partial view of a model, i.e., a partial model 2 .
ℐ
p
= 〈Δℐ
p
, · ℐ
p
〉 is a partial model of ; 𝒢 ⊆ Δ
ℐp
× S is a left-total
3
relation called grounding relation.
The grounding of every d ∈ Δℐ
p
, denoted by 𝒢 (d), is the set {s ∈ S ∣ 〈d, s〉 ∈ 𝒢}.
Figure 2 shows a semantically labelled picture that describes our running example. The partial model contains a horse with four legs, a muzzle and a tail, and a person with a leg, an arm and a face. The grounding of the parts are the corresponding initial segments, whereas the grounding of the horse and the person is the union of the segments associated to their parts.
Definition 2 does not provide any criteria to select the partial model that describes the content of a picture. We therefore need a criteria to decide whether a partial model is a good explanation of the picture content. We introduce a loss function ℒ𝒦ℬ that measures the “distance” between the partial model and the image content: the most plausible partial model is the partial model that minimizes ℒ𝒦ℬ:
Semantic image interpretation problem Given a knowledge base , a labelled picture and a loss function ℒ𝒦ℬ, the semantic image interpretation problem is finding a partial model ℐ p and a grounding 𝒢 that minimize ℒ𝒦ℬ(𝒫, ℐ p , 𝒢).
The loss function ℒ𝒦ℬ measures the (dis)agreement between a partial model ℐ
p
and a labelled picture 𝒫 aligned by 𝒢. The higher ℒ𝒦ℬ(𝒫, ℐ
p
, 𝒢) the less the agreement between ℐ
p
and 𝒫. For instance, if the element d ∈ Δℐ
p
of ℐ
p
is grounded to the segment s, i.e., 𝒢(d) = {s}, ℒ𝒦ℬ is lower when ℐ
p
assigns to d the types that correspond to the labels of s with higher weights. Similarly, ℒ𝒦ℬ penalizes the partial models that satisfy R(d, d′) when the low-level features of 𝒢 (d) and 𝒢(d′) are in disagreement with the relation R. E.g., ℒ𝒦ℬ penalizes the models that satisfy
Clustering is the problem of grouping a set of input elements into groups (clusters) so that the intra-cluster similarity is maximised and the inter-cluster similarity is minimised [19]. Intra-cluster similarity is a measure of the similarity among the elements within the same cluster. Inter-cluster similarity measures the similarity among different clusters. Hierarchical clustering is a generalisation of clustering where clusters can be recursively clustered in higher-level clusters. Recognising the presence of complex objects and their parts, starting from a known set of atomic objects, can be seen as a hierarchical clustering problem with the additional task of typing the intermediate nodes. More precisely: the clustering solution associated to a semantically interpreted picture is equal to where each . If we assume the hasPart relation is inverse functional, transitive and irreflexive 4 the clustering is guaranteed to be hierarchical.
Clustering algorithms are based on a distance measure between input elements. We propose a distance measure δ(d, d′) that combines the Euclidean distance δ𝒢(d, d′) between the centroids of 𝒢(d) and 𝒢(d′) called grounding distance, and a semantic quasidistance
5
δ𝒦ℬ(d, d′) between the types of d and d′ in ℐ
p
. As grounding distance we use the norm on the centroids of the segments: (the centroids are scaled to the interval [0, 1])
6
. For the semantic quasidistance we specialize the Hirst and St-Onge measure (HSO) defined in [16]. Here concepts have a big distance if (1) the ontology path (number of arcs) between them is high and (2) this path has a large number of changes of directions. If a path is composed by upward (or downward) ISA arcs then it has no changes of directions. Whereas, if a path is composed by composing upward, downward ISA arcs with a partOf arc then the path has changes of directions. With this idea in mind our semantic quasidistance δ𝒦ℬ(d, d′) assigns a small value to concepts constrained with the hasPart relation. Whereas δ𝒦ℬ(d, d′) assigns a larger values to pairs of concepts with no hasPart constraint between them or with a negative hasPart constraint. We define δ𝒦ℬ(d, d′) as
Minimizing Equation (2) analytically is not possible as ℒ𝒦ℬ is not expressed in an analitical form. We therefore developed the
Starting from , with |S| clusters, the algorithm passes through O (|S|) levels of the lattice to reach the bottom. Every level contains clusterings with n clusters. For every level the algorithm performs O(n2) operations due to the super-clusterings generation and the greedy choice. Thus, the whole algorithm visits O(|S|3) nodes of the lattice. In the running example, the algorithm starts with 𝒞0 = {{l1} , {l2} , {l3} , {l4} , {l5} , {a1} , {m1} , {t1} , {f1}} (for simplicity we use individuals instead of segments).
Δ
ℐp
is defined as a set of 9 individuals corresponding to simple objects,
Δ
ℐp
is extended with 9 new individuals corresponding to composite objects, then the part-whole assertions are added (e.g., hasPart ( then Finally
Then PWCA enters the while loop and the best child clustering of , selected by the
Datasets
To evaluate our approach we need a gold standard dataset of segmented images, such that segments are correctly labelled with parts and whole object types and the part-whole relations are explicitly specified. One such a dataset is the
The images of the
Evaluation criteria
To compute the performance of a SII algorithm (SIIA) w.r.t. a gold standard dataset 𝒟, we need to define a distance measure between the output of SIIA and the annotations on the elements of 𝒟. We suppose that both, the output of SSIA, and the annotations on 𝒟, are represented with ABoxes. We, therefore, need to define a distance between ABoxes. Let be the ABox that represents the output of SIIA on 𝒫 and let the ABox associated to the picture 𝒫 in the dataset 𝒟. We define the following two measures.
For both measures the F1 is defined as usual.
Experiments
We run PWCA on the images of both
To evaluate the impact of semantic information in the problem of recognition of the part-whole relation in images, we run different configurations of PWCA. PWCA where the parameters α and β have been optimized (PWCAbest(α,β)); PWCA with β = 1.0 and α optimized, i.e, no semantic information is taken into account (PWCAbest(α),β=1); PWCA with optimal parameters α and β where the ontology has been extended with a set of axioms that restrict the number of parts of the same class for each complex object (e.g., a cow has exactly four legs). These axioms are called cardinality axioms (PWCAbest(α,β) +CA). Our previous version of PWCA [9] (PWCA
SOM
);
Table 2 shows the details of the evaluation. We run the experiments on an Intel Xeon E5-1660 v3 3.00 GHZ, 16 core, 32 GB DDR4.
The performance of PWCAbest(α,β) on the
To show the impact of background knowledge on the performance of PWCA, we compared PWCA with a
Most of the Semantic Web resources about meronomy (
Finally, we compare PWCA with
The above results support the intuition that it is worth to jointly use numeric and semantic features of simple objects (parts) to detect composite objects in images.
Towards a whole SII pipeline
PWCA covers only the final part of the SII pipeline. Indeed PWCA takes in input a picture where segments about parts are already given and labelled. The previous section reports the evaluation of PWCA under the assumption of perfect semantic segmentation of parts. In this section, we consider the realistic situation where semantic segmentation is performed automatically. To this aim, we first construct a full SII pipeline by composing a deep-learning-based object detector (Fast-RCNN) [14] for the detection of parts with PWCA. We call this pipeline R-CNN+PWCA. We then compare its capability at recognising complex objects against the baseline that recognises directly complex objects (without considering their parts). For this baseline we consider Fast-RCNN trained on complex objects.
The SII pipeline
As mentioned above we built two alternative SII pipelines (shown in Fig. 3). The first one, called R-CNN+PWCA, takes into account part-whole ontological information. The second one, called R-CNN, recognises complex objects directly from the picture without considering part-whole relations. In the following we provide more details of the two pipelines. Both pipelines are based on a deep learning state-of-the-art object detection tool, called fast Region-based Convolutional Networks (Fast R-CNN) [14].
The flow of the R-CNN+PWCA pipeline (shown in the top of Fig. 3) is the following: An input image is first processed by Fast R-CNN for the semantic detection of object parts. To this extent, we train Fast R-CNN on all the classes of the Every bounding box returned by Fast R-CNN has a weighted label indicating the class of the object. We modified this functionality in order to obtain the top 5 weighted labels for bounding box. The result of the previous step is a semantically labelled picture. This picture is given in input to PWCA that returns the best partial model and the grounding 𝒢*. From and 𝒢* we compute the bounding box of each composite object of as follows: the bounding box associated to a composite object p
j
is the smallest bounding box (SBB) surrounding the segments (bounding boxes) that correspond through G to the parts of p
j
in .
The second pipeline (shown in the bottom part of Fig. 3) is implemented by directly applying Fast R-CNN to the image. Indeed, the labels about composite objects in the
We compared the SBBs returned by PWCA and the output of the default Fast-RCNN with the ground truth of the
Evaluation
To compare the output of R-CNN+PWCA with the one of R-CNN we compared the bounding boxes of composite objects returned by both pipelines. We use the area overlap as a measure to compare a bounding boxes B
p
with the bounding box of the ground truth B
gt
defined as
Discussion
Differently from most of the approaches in object detection in images PWCA does not need a training set. PWCA is a completely unsupervised method that can be easily adapted to specific domains by providing an appropriated part-whole ontology of the domain. For this reason, PWCA can be easily extended by exploiting information available in
PWCA deals with multiple and noisy labels on parts. Given a segment labelled with more labels, PWCA selects its label taking into account also the labels assigned to the sibling segments. In this way PWCA can discard labels with the highest weight in favour of labels with lower weight when this optimizes a global labelling of the elements of the cluster.
To the best of our knowledge, most of the semantic image interpretation systems do not compare the predicted structure with a ground truth. For this reason we developed the
The heuristic of grouping simple objects according to their geometric and semantic proximity could be also applied to the relation participate-in an event. In this case, the simple objects are the participants to an event in the image, whereas the composite objects are the events themselves. We can conduct the same evaluation of the part-whole relation. Moreover, we can construct the same pipeline Fast R-CNN (trained on participants to an event) and PWCA to have a complete SII system for event detection.
PWCA suffers of some limitations. On the one hand, it is able to retrieve false negatives of the object detectors (the inference of the composite object from its parts), but on the other hand it is not able to discard false positives, e.g., a wrong segment with label “eye” in the middle of a meadow. Indeed, given a segment of a simple object PWCA deduces the presence of a composite object. PWCA is not able to detect the simple object as false positive and to discard it. This needs to be studied. Moreover, the set of labels returned by an object detector could suffer of inconsistency, e.g., a segment could be classified both with “dog” and “cat” with different weights. We can apply the work in [8] to avoid these inconsistencies.
Conclusions
We proposed a well-founded and general framework for SII that integrates semantic information with low-level numeric features. An image is interpreted as a partial model of a knowledge base. The construction of the partial model is guided by an incremental clustering algorithm that mixes semantic and numeric distances. We applied the framework to the specific task of recognizing composite objects from their parts. The evaluation on the
Footnotes
1
The approach is independent from a specific DL.
3
Every logical individual is associated to at least one segment. A segment can have no connection with a logical individual, this allows the framework to handle, and possibly discard, false positive segments of the labelled picture.
4
Standard ontological assumptions of classical mereology.
5
Semantic distance is not required to be symmetric.
6
Using the centroid of G(d) as a numeric feature is enough to show the effectiveness of our approach. The approach can be generalised by considering other features like shape, texture, color, etc. The use of the centroid and the norm tends to group objects close in the space. This assumption is based on the Law of Proximity of Gestalt Psychology [
] that states that parts of the same object are usually close.
7
The interpretation ℐ
p
is represented with an ABox as follows: if d ∈ lℐ
p
, with l ∈ Σ
C
, then l (
