Integration of numeric and symbolic information for semantic image interpretation

Abstract

Semantic image interpretation (SII) is the process of generating meaningful descriptions of the content of images. Background knowledge (BK), in the form of logical theories, is extremely useful for SII. State-of-the-art algorithms for SII mainly adopt a bottom-up approach, which generates semantic interpretations of images starting from their low-level features. In these approaches BK is used only at a late stage, both for enriching the semantic descriptions and for improving image retrieval. In this paper, we show that BK also plays an important role during the early phase of SII. To this aim, we propose: (i) a reference framework where a semantic image description is a partial model of the BK, whose elements are grounded (linked) to a (set of) image segment(s); (ii) a loss function that evaluates how well this partial model fits the picture; (iii) a clustering-based optimization process that searches for the partial model that best fits a picture. BK is used to prune the branches of the search space that correspond to partial models inconsistent with BK. To evaluate our approach, we built a gold standard dataset of 203 pictures annotated with complex objects and their parts. We also evaluated our method on a reference dataset in Computer Vision, namely the PASCAL-Part dataset. The results are positive. This evaluation assumes a perfect detection of parts. To understand the impact of a realistic (and noisy) part detection on our algorithm, we carried out a preliminary evaluation by implementing the entire SII pipeline, where part detection is performed by a recent deep learning architecture trained for detecting parts. A qualitative analysis shows that, in some cases, recognizing complex objects starting from their parts gives better results than detecting complex objects directly.

Keywords

Information extraction, computer vision, semantic image interpretation, ontologies, clustering

1 Introduction

Semantic image interpretation (SII) is the task of generating a meaningful explanation of the content of images [18]. SII is much more than image labelling, i.e., the description of image content with a (set of) label(s): SII aims at detecting the objects in images, their types and the relations between them.

The main challenge in SII is bridging the so-called semantic gap [22], which is the complex correlation between low-level image features and high-level semantic concepts. In recent years, it became clear that ontological knowledge about image context and content plays a key role in bridging the semantic gap [35, 37]. Examples of useful ontological knowledge are: knowledge about object qualities, e.g., color, shape, relative size; knowledge about topological and spatial properties of objects, e.g., the context where objects usually appear, the relative position where an object is likely to be; relational knowledge, e.g., the parts of complex objects or the co-occurrences of objects; taxonomical knowledge, i.e., hierarchies of object types. In the following we use the term background knowledge (BK) to indicate the ontological knowledge exploited in SII. With the spread of the Semantic Web (SW) and Linked Open Data, background knowledge is nowadays largely available in the form of RDF resources and OWL ontologies.

Most of the current works on SII that exploit BK [3, 35] are based on the so-called bottom-up approach. Object types and relations are obtained by lifting up, at the semantic level, low-level image features such as colours, texture and contours. BK is used to check the consistency of the image descriptions generated by the bottom-up algorithm. BK is also used to enrich the description with new facts that logically follow from the ones obtained by the bottom-up approach. The main drawback of these approaches is that BK cannot affect the process of constructing/searching for the image interpretation, with the effect that non-optimal explanations could be generated. In this paper we want to overcome this limitation by making BK more active in the generation/searching process of SII. To this aim, we propose the following:

A reference framework where the semantic interpretation of an image is a partial model of an ontology representing the BK. The elements of the partial model are grounded (linked) to a (set of) image segment(s). Figures 1 and 2 show the input and the output of our SII approach.

A loss function that measures the semantic gap between a partial model (the semantics) and the image content (low-level features). The partial model that minimizes the loss best fits the image content.

An incremental clustering-based algorithm that approximates the minimum of the loss function and, thus, returns the partial model that best fits the picture content. BK is used to prune the search space and to guide the algorithm.

In a nutshell, given an ontology 𝒪, representing our BK, and a labelled picture 𝒫 (a semantically segmented picture where each segment has been assigned a set of weighted semantic labels, see Fig. 1), our method incrementally builds a partial model ℐ_p, consistent with 𝒪, that minimizes a loss function. The evaluation of our method investigates these aspects:

the quality of the predicted partial models with respect to the standard PASCAL-Part dataset [5] which contains part-whole relations between objects. Moreover, we tested our method on a manually built dataset of 203 images containing more composite objects per image than the PASCAL-Part dataset. We defined two measures of performance in order to test the ability of the method to reconstruct composite objects from their parts. The evaluation shows encouraging results on both datasets.

We compared the performance on these datasets with a baseline where the method uses only numeric features. The joint use of semantic and numeric information outperforms the baseline.

We tested if and how the cardinality axioms of the ontology affect the performance. The best performance is obtained by removing the cardinality axioms.

Moreover, we implemented the entire pipeline from an image to its partial model. The input of our clustering algorithm (an image with detected parts of objects) is provided by a deep learning-based object detector [14] (Fast R-CNN). We compared the composite objects detected by our method with the composite objects detected by Fast R-CNN. From a qualitative analysis it emerges that there are cases where R-CNN on parts plus our method performs better than R-CNN on composite objects.

This work extends and improves our previous work [9] in the following aspects. We explicitly define the loss as a function of the semantic and Euclidean distance between segments of the picture; this improvement led to an increase in performance. We extend the evaluation to a larger dataset and to a standard dataset of the Computer Vision community. Finally, we implemented the entire pipeline from an image to its partial model, where the input of our algorithm is obtained with a standard object detector.

2 Related work

Among the vast literature on SII, we concentrate on two groups of approaches. In the first group BK is encoded in a logical theory and logical reasoning is exploited during SII. In the other group, BK is textual knowledge exploited by neural systems during SII. The output is a textual (not formal) description of the image content, and key terms occurring in the text are aligned with the corresponding regions of the picture.

2.1 Logic-based approaches

The seminal work of [30] proposed to formalize the whole SII process as a reasoning task in First Order Logic (FOL). This approach assumes that basic elements of the scene and their spatial relations are already identified by some low-level image processing. The description of the image content is derived, via pure logical reasoning, from these basic facts. Later, [32] observed that a complete description of the world cannot be obtained from a picture and proposed to represent the interpretation of image content with a partial model of a knowledge base. The process of generating a picture interpretation was a purely logical process and information about low-level image features was ignored. Neumann and Möller [26] proposed to integrate low-level features in the BK represented as a Description Logic (DL) theory. Axioms represent the connection between semantic types and low-level features (via data properties and concrete domains). For example, the standard dimension of a plate is formalized by a restriction on the data property size of the concept of plate. SII is done via deductive reasoning and preferences between partial models have also been considered. Espinosa Peraldi et al. [28] proposed the use of abductive reasoning for SII. This technique infers the preferred partial model (explanation) starting from the observations derived from low-level image processing. The preferred partial model of an image is the one that contains more evidence and fewer hypotheses. This method requires a set of DL rules for defining what is abducible, which need to be manually crafted. More recently [1] combined abductive reasoning with morphological reasoning. In this case BK is represented by concept lattices and the approach recognizes objects and their spatial relations.

BK encoded in fuzzy logic has also been exploited for SII. Hudelot et al. [17] adopted a fuzzy DL ontology of spatial relations to discover objects and spatial relations between them in a picture by combining bottom-up and top-down methods. Hudelot et al. [7] used fuzzy logic to resolve the inconsistency of the labels extracted from low-level image analysis. The approaches of [25] and [13] improved object detection by exploiting the semantic dependencies between object types encoded in a knowledge base. The work of [38] employed a knowledge base representation for reasoning about objects and predicting their affordances. The method builds a knowledge base (represented with a Markov Logic Network, MLN) by gathering information about affordances and attributes of objects from the Web and from datasets of images. In [27] an MLN is built to classify objects and to perform queries about properties of objects given some evidence from automatic annotators (e.g., shape, color, size). In both cases the inference is not very rich due to the limited expressivity of the knowledge base: formulas are simple Horn clauses with only one literal in the body.

In [4] a Conditional Random Field (CRF), which exploits BK and numeric features of the image, is used to build the semantic interpretation of the image. To the best of our knowledge, this work is the closest to ours. However, there are some important differences. Our work is based on an unsupervised approach, whereas [4] needs a dataset to learn the parameters of the CRF. Our work handles multiple labels for the detected objects, whereas [4] does not handle such uncertainty coming from the object detectors. Finally, our work is able to reconstruct composite objects from the detection of (some of) their parts with a sort of abductive reasoning.

To summarize, our work tries to take the best of the above-mentioned works: it uses a very expressive DL; it deals with multiple labels coming from a low-level image analysis; it embeds logical constraints and numeric information into a machine learning algorithm to jointly reason on semantic and numeric features; it uses abduction to introduce new elements that better explain the image content; and it is based on an unsupervised approach.

2.2 Caption-based approaches

Alternative methods for SII rely on neural networks. The combination of a Convolutional Neural Network over image regions and a Recurrent Neural Network over sentences allows the learning of a model that aligns image regions with fragments of text [20]. This enables the automatic generation of captions describing the content of the image, together with the alignment between image regions and words. Another recent model for caption generation that learns word-image alignments first extracts a set of features from the image; a long short-term memory network then produces a caption word by word [36].

From the alignment between images and words it is possible to extract a graph that describes the image, but the lack of a formal semantics does not allow expressing high-level constraints on the produced description, e.g., a person cannot have more than two legs. Moreover, it is not possible to perform complex reasoning and check the consistency of the image description. These methods need large annotated datasets for training and the alignment they produce is not fine-grained, e.g., the noun phrase “three dogs” is aligned with a single region showing three dogs but the model is not able to recognize the presence of three different individuals of type “dog”. In addition, only a few objects are aligned with the text.

3 Semantic image interpretation problem

We start from pictures that have been semantically segmented by state-of-the-art semantic segmentation algorithms or object detectors, e.g., [23, 14], and whose segment labels are potential object classes. Each segment has a set of weighted labels. Weights represent the level of confidence of the output of the object detection. Labels are taken from the signature Σ used to specify the BK. Formally: a labelled picture is a pair 𝒫 = 〈S, L〉 where S = {s₁, …, s_n} is a set of segments of the picture 𝒫, a segment is a set of pixels, and L is a function that associates to each segment s ∈ S a set L(s) ⊆ Σ × (0, 1] of weighted labels 〈l, w〉.
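The definition of a labelled picture can be made concrete with a small data structure. The following is only a sketch: the class name `LabelledPicture`, its fields and the method `add_segment` are hypothetical and do not come from the authors' implementation. It enforces the two constraints stated above, namely that labels belong to Σ and that weights lie in (0, 1].

```python
from dataclasses import dataclass, field


@dataclass
class LabelledPicture:
    """A labelled picture P = <S, L> (illustrative encoding only)."""

    # S: segment name -> set of pixels, each pixel an (x, y) pair
    segments: dict = field(default_factory=dict)
    # L: segment name -> set of weighted labels (label, weight)
    labels: dict = field(default_factory=dict)

    def add_segment(self, name, pixels, weighted_labels, signature):
        """Add a segment after checking L(s) is a subset of Sigma x (0, 1]."""
        for label, weight in weighted_labels:
            if label not in signature:
                raise ValueError(f"label {label!r} is not in the signature")
            if not 0 < weight <= 1:
                raise ValueError(f"weight {weight} is outside (0, 1]")
        self.segments[name] = frozenset(pixels)
        self.labels[name] = set(weighted_labels)


# Hypothetical example: one segment with two competing labels.
signature = {"Leg", "Tail", "Horse", "Person"}
picture = LabelledPicture()
picture.add_segment("s1", {(10, 12), (10, 13)},
                    {("Leg", 0.7), ("Tail", 0.3)}, signature)
```

Keeping all weighted labels, rather than only the most confident one, is what later allows the interpretation process to revise a detector's first choice.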

Starting from a semantically labelled picture, to generate a semantic interpretation of the image the following issues have to be solved: to select the correct label among those associated to each segment; to decide if two segments with the same label are part of the same object; to find the semantic relations between the objects corresponding to segments; to cluster simple objects in a composite object, e.g., clustering the segments labelled with “wheel”, “windows” and “door” into a composite segment of type “car”.

To accomplish these tasks, we exploit low-level features of the picture and BK. We suppose that BK is encoded in a DL [2] knowledge base 𝒦ℬ on the signature Σ used for labelling segments.

We briefly introduce one of the most common description logics, called $SHIQ$.¹ Given a signature Σ = Σ_C ⊎ Σ_R ⊎ Σ_I, composed of three disjoint sets of symbols for concepts, relations (or roles) and individuals respectively, a $SHIQ$ concept is defined by the following grammar: $C, D ::= A \mid \neg C \mid C ⊓ D \mid C ⊔ D \mid \exists R . C \mid \forall R . C \mid (\geq n) R . C \mid (\leq n) R . C$ where A ∈ Σ_C, R ∈ Σ_R, and n is a nonnegative integer. We assume that Σ_R is closed under inverse role, i.e., if R ∈ Σ_R then R⁻ (the inverse of R) is in Σ_R. Axioms are expressions of the following forms:

TBox axioms		ABox axioms	
C ⊑ D	concept inclusion	C(a)	class assertion
R ⊑ S	role inclusion	R(a, b)	role assertion

An interpretation ℐ of Σ is a pair 〈Δ^ℐ, ·^ℐ〉, where Δ^ℐ is a non-empty set called the interpretation domain of ℐ, and ·^ℐ is a function that maps concept names to subsets of Δ^ℐ, relation names to subsets of Δ^ℐ × Δ^ℐ and individuals to elements of Δ^ℐ. The interpretation function is extended to all $SHIQ$ concepts as follows: $\begin{matrix} (\neg C)^{ℐ} & = & Δ^{ℐ} \setminus C^{ℐ} \\ (C ⊓ D)^{ℐ} & = & C^{ℐ} \cap D^{ℐ} \\ (C ⊔ D)^{ℐ} & = & C^{ℐ} \cup D^{ℐ} \\ (\exists R . C)^{ℐ} & = & \{d \in Δ^{ℐ} ∣ R^{ℐ}(d) \cap C^{ℐ} \neq \emptyset\} \\ (\forall R . C)^{ℐ} & = & \{d \in Δ^{ℐ} ∣ R^{ℐ}(d) \subseteq C^{ℐ}\} \\ ((\geq n) R . C)^{ℐ} & = & \{d \in Δ^{ℐ} ∣ \#(R^{ℐ}(d) \cap C^{ℐ}) \geq n\} \\ ((\leq n) R . C)^{ℐ} & = & \{d \in Δ^{ℐ} ∣ \#(R^{ℐ}(d) \cap C^{ℐ}) \leq n\} \end{matrix}$ where $R^{ℐ}(d) = \{d' ∣ 〈d, d'〉 \in R^{ℐ}\}$ and #(A) is the cardinality of the set A. An axiom φ is satisfied by ℐ, in symbols ℐ ⊨ φ, if:

TBox axioms		ABox axioms	
ℐ ⊨ C ⊑ D	iff C^ℐ ⊆ D^ℐ	ℐ ⊨ C(a)	iff a^ℐ ∈ C^ℐ
ℐ ⊨ R ⊑ S	iff R^ℐ ⊆ S^ℐ	ℐ ⊨ R(a, b)	iff 〈a^ℐ, b^ℐ〉 ∈ R^ℐ

An interpretation is a complete abstract description of the state of the world in terms of existing objects (i.e., the elements of Δ^ℐ), object types (i.e., the interpretations via ·^ℐ of the symbols in Σ_C) and relations between objects (i.e., the interpretations of the symbols in Σ_R). A knowledge base 𝒦ℬ on Σ is a set of TBox and ABox axioms. ℐ is a model of a knowledge base 𝒦ℬ if it satisfies all the axioms in 𝒦ℬ. The axioms of the knowledge base are constraints on the states of the world. For instance, the axiom “Person ⊑ (=2) hasPart.Leg”, where (=2) abbreviates the conjunction of (≥2) and (≤2), states that every person has exactly two legs. This implies that worlds where people have only one leg or three legs are impossible.
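To illustrate the semantics above, the following sketch evaluates concepts over a small finite interpretation and checks a concept inclusion. The tuple encoding of concepts, the dictionary layout of the interpretation and all function names are assumptions made for this example; inverse roles are omitted for brevity, and a real system would delegate this to a DL reasoner.

```python
def successors(interp, role, d):
    """R^I(d): all d' such that (d, d') belongs to the extension of the role."""
    return {b for (a, b) in interp["roles"].get(role, set()) if a == d}


def ext(concept, interp):
    """Extension of a concept in a finite interpretation.

    A concept is an atomic name (str) or a tuple: ("not", C), ("and", C, D),
    ("or", C, D), ("some", R, C), ("all", R, C), ("min", n, R, C),
    ("max", n, R, C). This encoding is illustrative only.
    """
    domain = interp["domain"]
    if isinstance(concept, str):
        return set(interp["concepts"].get(concept, set()))
    op = concept[0]
    if op == "not":
        return domain - ext(concept[1], interp)
    if op == "and":
        return ext(concept[1], interp) & ext(concept[2], interp)
    if op == "or":
        return ext(concept[1], interp) | ext(concept[2], interp)
    if op in ("some", "all"):
        _, role, filler = concept
        target = ext(filler, interp)
        if op == "some":
            return {d for d in domain if successors(interp, role, d) & target}
        return {d for d in domain if successors(interp, role, d) <= target}
    if op in ("min", "max"):
        _, n, role, filler = concept
        target = ext(filler, interp)
        count = {d: len(successors(interp, role, d) & target) for d in domain}
        if op == "min":
            return {d for d, k in count.items() if k >= n}
        return {d for d, k in count.items() if k <= n}
    raise ValueError(f"unknown constructor: {op}")


def satisfies_inclusion(c, d, interp):
    """I satisfies C ⊑ D iff the extension of C is a subset of that of D."""
    return ext(c, interp) <= ext(d, interp)


# "Exactly two legs" written as the conjunction of (>= 2) and (<= 2).
exactly_two_legs = ("and", ("min", 2, "hasPart", "Leg"),
                    ("max", 2, "hasPart", "Leg"))

two_legged = {
    "domain": {"p1", "l1", "l2", "l3"},
    "concepts": {"Person": {"p1"}, "Leg": {"l1", "l2", "l3"}},
    "roles": {"hasPart": {("p1", "l1"), ("p1", "l2")}},
}
three_legged = {
    "domain": {"p1", "l1", "l2", "l3"},
    "concepts": {"Person": {"p1"}, "Leg": {"l1", "l2", "l3"}},
    "roles": {"hasPart": {("p1", "l1"), ("p1", "l2"), ("p1", "l3")}},
}
```

In this toy setting `satisfies_inclusion("Person", exactly_two_legs, two_legged)` holds, while the same check on `three_legged` fails, which mirrors the statement that worlds with three-legged people are excluded by the axiom.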

To better explain our proposal, we use the simple running example of Fig. 1. The input required by our method is a labelled picture, where each segment has simple (not composite) object types and weights, and an ontology that relates composite objects with their parts. Notice that if we search, at this stage, for a picture containing a horse and a person, the picture will not be returned. Intuitively, our goal is to infer the fact that the picture contains a horse and a person along with their parts. That is, we want to generate a graph and an alignment like the ones of Fig. 2.

Pictures provide partial views of the state of the world. E.g., due to occlusions, only one leg of a person might be visible. Consequently, picture content should be represented with a partial view of a model, i.e., a partial model.²

Definition 1. (Partial model). Let ℐ and ℐ′ be two interpretations of the signatures Σ and Σ′ respectively, with Σ ⊆ Σ′; ℐ′ is an extension of ℐ, or equivalently ℐ′ extends ℐ, if Δ^ℐ ⊆ Δ^{ℐ′}, a^ℐ = a^{ℐ′}, C^ℐ = C^{ℐ′} ∩ Δ^ℐ and R^ℐ = R^{ℐ′} ∩ (Δ^ℐ × Δ^ℐ), for all a ∈ Σ_I, C ∈ Σ_C and R ∈ Σ_R. ℐ_p is a partial model for a knowledge base 𝒦ℬ, in symbols ℐ_p ⊨_p 𝒦ℬ, if there is a model ℐ of 𝒦ℬ that extends ℐ_p.
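The extension condition of Definition 1 can be checked directly on finite interpretations. The sketch below assumes a hypothetical dictionary encoding (keys `domain`, `individuals`, `concepts`, `roles`) and only tests whether one given interpretation extends another; deciding whether some model of 𝒦ℬ extends ℐ_p additionally requires a reasoner and is not covered here.

```python
def restrict(pairs, domain):
    """Restriction of a binary relation to domain x domain."""
    return {(a, b) for (a, b) in pairs if a in domain and b in domain}


def extends(big, small):
    """True if `big` extends `small` in the sense of Definition 1.

    The symbols listed in `small` play the role of its signature.
    """
    dom = small["domain"]
    if not dom <= big["domain"]:
        return False
    for name, element in small.get("individuals", {}).items():
        if big.get("individuals", {}).get(name) != element:
            return False
    for concept, members in small["concepts"].items():
        if members != big["concepts"].get(concept, set()) & dom:
            return False
    for role, pairs in small["roles"].items():
        if pairs != restrict(big["roles"].get(role, set()), dom):
            return False
    return True


# What the picture shows: a person with a single visible leg.
visible = {
    "domain": {"p1", "l1"},
    "concepts": {"Person": {"p1"}, "Leg": {"l1"}},
    "roles": {"hasPart": {("p1", "l1")}},
}
# A complete world in which the occluded leg l2 also exists.
complete = {
    "domain": {"p1", "l1", "l2"},
    "concepts": {"Person": {"p1"}, "Leg": {"l1", "l2"}},
    "roles": {"hasPart": {("p1", "l1"), ("p1", "l2")}},
}
```

Here `extends(complete, visible)` holds: the one-legged view is not itself a model of an axiom requiring two legs, but it is a partial model because a two-legged world extends it.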

Definition 2. (Semantically interpreted picture). Given a knowledge base 𝒦ℬ with signature Σ and a labelled picture 𝒫 = 〈S, L〉, a semantically interpreted picture is a triple $𝕊 = (𝒫, ℐ_{p}, 𝒢)$ where:

ℐ_p = 〈Δ^{ℐ_p}, ·^{ℐ_p}〉 is a partial model of 𝒦ℬ;

𝒢 ⊆ Δ^{ℐ_p} × S is a left-total³ relation called grounding relation.

The grounding of every d ∈ Δ^{ℐ_p}, denoted by 𝒢(d), is the set {s ∈ S ∣ 〈d, s〉 ∈ 𝒢}.

Figure 2 shows a semantically interpreted picture that describes our running example. The partial model contains a horse with four legs, a muzzle and a tail, and a person with a leg, an arm and a face. The groundings of the parts are the corresponding initial segments, whereas the grounding of the horse and of the person is the union of the segments associated to their parts.

Definition 2 does not provide any criterion to select the partial model that describes the content of a picture. We therefore need a criterion to decide whether a partial model is a good explanation of the picture content. We introduce a loss function ℒ_𝒦ℬ that measures the “distance” between the partial model and the image content: the most plausible partial model $ℐ_{p}^{*}$ is the partial model that minimizes ℒ_𝒦ℬ: $(ℐ_{p}^{*}, 𝒢^{*}) = \underset{ℐ_{p} ⊨_{p} 𝒦ℬ,\ 𝒢 \subseteq Δ^{ℐ_{p}} \times S}{argmin}\ ℒ_{𝒦ℬ}(𝒫, ℐ_{p}, 𝒢)$ (1) ℒ_𝒦ℬ takes into account the agreement between the low-level image features of 𝒫 and the high-level semantic features contained in ℐ_p, with respect to the low-level/high-level mapping 𝒢.

Semantic image interpretation problem. Given a knowledge base 𝒦ℬ, a labelled picture 𝒫 and a loss function ℒ_𝒦ℬ, the semantic image interpretation problem is finding a partial model ℐ_p and a grounding 𝒢 that minimize ℒ_𝒦ℬ(𝒫, ℐ_p, 𝒢).

4 Clustering-based loss function

The loss function ℒ_𝒦ℬ measures the (dis)agreement between a partial model ℐ_p and a labelled picture 𝒫 aligned by 𝒢. The higher ℒ_𝒦ℬ(𝒫, ℐ_p, 𝒢), the lower the agreement between ℐ_p and 𝒫. For instance, if the element d ∈ Δ^{ℐ_p} of ℐ_p is grounded to the segment s, i.e., 𝒢(d) = {s}, ℒ_𝒦ℬ is lower when ℐ_p assigns to d the types that correspond to the labels of s with higher weights. Similarly, ℒ_𝒦ℬ penalizes the partial models that satisfy R(d, d′) when the low-level features of 𝒢(d) and 𝒢(d′) are in disagreement with the relation R. E.g., ℒ_𝒦ℬ penalizes the models that satisfy close(d, d′) when the relative distance between 𝒢(d) and 𝒢(d′) is high. As can be seen from the above examples, the definition of ℒ_𝒦ℬ depends heavily on the semantics of the classes and the relations in 𝒦ℬ. In the following, we concentrate on the part-whole relation and we define a loss function that is used to recognise the presence of complex objects starting from their parts. The problem of recognising complex objects once simpler objects (parts) have been detected can be seen as a clustering problem, and we specify the loss function in terms of a clustering optimisation function.

Clustering is the problem of grouping a set of input elements into groups (clusters) so that the intra-cluster similarity is maximised and the inter-cluster similarity is minimised [19]. Intra-cluster similarity is a measure of the similarity among the elements within the same cluster. Inter-cluster similarity measures the similarity among different clusters. Hierarchical clustering is a generalisation of clustering where clusters can be recursively clustered in higher-level clusters. Recognising the presence of complex objects and their parts, starting from a known set of atomic objects, can be seen as a hierarchical clustering problem with the additional task of typing the intermediate nodes. More precisely: the clustering solution associated to a semantically interpreted picture $𝕊 = (𝒫, ℐ_{p}, 𝒢)$ is equal to $𝒞 = \{C_{d} ∣ d \in Δ^{ℐ_{p}}\}$ where each $C_{d} = \{s \in 𝒢(d') ∣ d' \in Δ^{ℐ_{p}}, 〈d, d'〉 \in hasPart^{ℐ_{p}}\}$. If we assume the hasPart relation is inverse functional, transitive and irreflexive,⁴ the clustering 𝒞 is guaranteed to be hierarchical.
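As a small illustration of how a clustering is read off a semantically interpreted picture, the sketch below collects, for every whole d, the segments that ground its parts. The function name and the dictionary encoding are hypothetical; for brevity only non-empty clusters are returned, whereas the definition above also yields an empty C_d for individuals without parts.

```python
def clustering_from_model(has_part, grounding):
    """Build C_d = {s in G(d') | (d, d') in hasPart} for every whole d.

    has_part  : set of (whole, part) pairs, the extension of hasPart
    grounding : dict mapping each individual to its set of segments
    """
    clusters = {}
    for whole, part in has_part:
        clusters.setdefault(whole, set()).update(grounding[part])
    return clusters


# Hypothetical fragment of the running example: a horse and a person.
has_part = {("h1", "t1"), ("h1", "m1"), ("p1", "f1")}
grounding = {"t1": {"seg_tail"}, "m1": {"seg_muzzle"}, "f1": {"seg_face"}}
clusters = clustering_from_model(has_part, grounding)
```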

Clustering algorithms are based on a distance measure between input elements. We propose a distance measure δ(d, d′) that combines the Euclidean distance δ_𝒢(d, d′) between the centroids of 𝒢(d) and 𝒢(d′), called grounding distance, and a semantic quasidistance⁵ δ_𝒦ℬ(d, d′) between the types of d and d′ in ℐ_p. As grounding distance we use the squared L₂ norm of the difference between the centroids of the segments: $δ_{𝒢}(d, d') = \| cent(𝒢(d)) - cent(𝒢(d')) \|_{2}^{2}$ (the centroids are scaled to the interval [0, 1]).⁶

For the semantic quasidistance we specialize the Hirst and St-Onge measure (HSO) defined in [16]. Here two concepts are far apart if (1) the ontology path (number of arcs) between them is long and (2) this path has a large number of changes of direction. A path composed only of upward (or only of downward) ISA arcs has no changes of direction, whereas a path that combines upward and downward ISA arcs with a partOf arc does have changes of direction. With this idea in mind, our semantic quasidistance δ_𝒦ℬ(d, d′) assigns a small value to concepts constrained by the hasPart relation, and larger values to pairs of concepts with no hasPart constraint between them or with a negative hasPart constraint. We define δ_𝒦ℬ(d, d′) as $\min \{ par_{𝒦ℬ}(C, C') ∣ 〈C, w〉 \in L(𝒢(d)),\ 〈C', w'〉 \in L(𝒢(d')),\ \text{for no } D, D' \in Σ_{C},\ D ⊑ C,\ D' ⊑ C' \}$ with $par_{𝒦ℬ}(C, C') = \begin{cases} 1 & \text{if } 𝒦ℬ ⊨ C ⊑ \exists hasPart . C' \\ \infty & \text{if } 𝒦ℬ ⊨ C ⊑ \neg \exists hasPart . C' \\ γ & \text{otherwise} \end{cases}$ where the parameter γ, according to the definition of HSO, can be defined as γ = pathLength + k · changesDirections, with k a constant. Examples of semantic quasidistance between the elements of the partial model of Fig. 2 are the following: $\begin{matrix} δ_{𝒦ℬ}(h_{1}, t_{1}) & = & par_{𝒦ℬ}(Horse, Tail) = 1 \\ δ_{𝒦ℬ}(h_{1}, f_{1}) & = & par_{𝒦ℬ}(Horse, Face) = \infty \\ δ_{𝒦ℬ}(h_{1}, p_{1}) & = & par_{𝒦ℬ}(Horse, Person) = γ . \end{matrix}$

Following [19] we define $ℒ_{𝒦ℬ}(𝒫, ℐ_{p}, 𝒢) = α \cdot Λ + (1 - α) \cdot Γ$ (2) which is a blending, according to the parameter α ∈ [0, 1], of the intra-cluster error sum Λ and the inter-cluster error sum Γ. Λ is defined by (3): $Λ = \sum_{〈p, d〉 \in hasPart^{ℐ_{p}}} \left( β\, δ_{𝒢}(p, d) + (1 - β) \frac{δ_{𝒦ℬ}(p, d)}{w(d, 𝒢, ℐ_{p})} \right)$ (3) with w(d, 𝒢, ℐ_p) = w_l, if l is the most specific concept that ℐ_p assigns to d and w_l is the only weight associated to the label l in 𝒢(d). Otherwise w(d, 𝒢, ℐ_p) is undefined. The parameter β ∈ [0, 1] mixes the geometric and the semantic contribution. Finally, Γ is defined by (4): $Γ = \sum_{〈p, p'〉 \in (\exists hasPart . ⊤)^{ℐ_{p}}} δ_{𝒢}(p, p')$ (4)
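To make Equations (2) to (4) concrete, here is a minimal numeric sketch. Several simplifications are assumptions of this example and not part of the paper: δ_𝒦ℬ(p, d) is replaced by par applied to the single concept already assigned to each individual, par is looked up in two hand-written sets instead of being entailed by a reasoner, Γ is summed over unordered pairs of distinct wholes, and all names (`make_par`, `loss`, the dictionaries) are hypothetical.

```python
import math


def grounding_distance(c1, c2):
    """Squared L2 distance between two centroids scaled to [0, 1]."""
    return (c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2


def make_par(has_part_axioms, forbidden, gamma):
    """par(C, C'): 1 if C must have a part C', infinity if it cannot, gamma otherwise."""
    def par(whole, part):
        if (whole, part) in has_part_axioms:
            return 1.0
        if (whole, part) in forbidden:
            return math.inf
        return gamma
    return par


def loss(has_part, centroid, concept, weight, par, alpha, beta):
    """alpha * Lambda + (1 - alpha) * Gamma, as in Equation (2)."""
    lam = 0.0
    for whole, part in has_part:
        geometric = beta * grounding_distance(centroid[whole], centroid[part])
        # Skip the semantic term when beta == 1 to avoid computing 0 * infinity.
        semantic = 0.0
        if beta < 1:
            semantic = (1 - beta) * par(concept[whole], concept[part]) / weight[part]
        lam += geometric + semantic
    # Assumption: Gamma ranges over unordered pairs of distinct wholes.
    wholes = sorted({whole for whole, _ in has_part})
    gam = sum(grounding_distance(centroid[a], centroid[b])
              for i, a in enumerate(wholes) for b in wholes[i + 1:])
    return alpha * lam + (1 - alpha) * gam


par = make_par({("Horse", "Tail"), ("Horse", "Muzzle"), ("Person", "Face")},
               {("Horse", "Face")}, gamma=5.0)
centroid = {"h": (0.0, 0.0), "t": (0.0, 0.0), "m": (1.0, 0.0),
            "p": (0.0, 1.0), "f": (0.0, 1.0)}
concept = {"h": "Horse", "t": "Tail", "m": "Muzzle", "p": "Person", "f": "Face"}
weight = {"t": 0.5, "m": 1.0, "f": 1.0}

one_object = loss({("h", "t"), ("h", "m")}, centroid, concept, weight,
                  par, alpha=0.5, beta=0.5)
two_objects = loss({("h", "t"), ("h", "m"), ("p", "f")}, centroid, concept,
                   weight, par, alpha=0.5, beta=0.5)
```

Dividing the semantic term by the label weight means that a low-confidence label makes the corresponding part-whole assertion more expensive, so the search is nudged toward assignments supported by confident detections.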

5 Minimising the loss function

Minimizing Equation (2) analytically is not possible, as ℒ_𝒦ℬ is not expressed in an analytical form. We therefore developed the PartWholeClusteringAlgorithm (in short PWCA, Algorithm 1), which performs a greedy search on the lattice of clusterings of the set of segments {s₁, …, s_n} of 𝒫. For each clustering 𝒞, PWCA generates a partial model-grounding pair 〈ℐ_p, 𝒢〉 that minimizes ℒ_𝒦ℬ(𝒫, ℐ_p, 𝒢). PWCA starts from the initial clustering 𝒞₀ = {{s₁}, …, {s_n}}. At each iteration PWCA (i) considers a clustering 𝒞; (ii) generates all super-clusterings of 𝒞 by merging any pair of clusters in 𝒞 (agglomerative step performed by AgglomerativeMatrix(𝒞)); and (iii) selects the super-clustering 𝒞′ whose associated partial model-grounding pair (computed by BestMod) minimizes the loss function. PWCA terminates when |𝒞| = 1, i.e., no super-clusterings can be generated from 𝒞, or when there is no partial model that can be associated to the super-clusterings of 𝒞. PWCA returns the partial model-grounding pair with minimal loss among those that have been generated.

BestMod generates a partial model-grounding pair 〈ℐ_p, 𝒢〉 from a clustering 𝒞 = {C₁, …, C_n} as follows. The domain of ℐ_p is Δ^{ℐ_p} = {d_s ∣ s ∈ C_j and C_j ∈ 𝒞} ∪ {p_j ∣ C_j ∈ 𝒞}; it contains the element d_s for each segment s of 𝒫, and an additional element p_j that corresponds to the parent (the composite object) of the cluster C_j. The grounding 𝒢(d_s) is {s}, while the grounding of p_j is given by 𝒢(p_j) = {⋃_{s ∈ C_j} s}. At this step, BestMod selects one concept for every d ∈ Δ^{ℐ_p} according to the weighted multiple labels associated to the segments of 𝒫. The function ca_j, called the concept assignment function for C_j, is defined as $ca_{j} : C_{j} \cup 𝒢(p_{j}) \to Σ_{C}$ such that for all s ∈ C_j, 〈ca_j(s), w〉 ∈ L(s). Let CA(C_j) be the set of all concept assignment functions for C_j. The best concept assignment function for cluster C_j ∈ 𝒞 is the one such that: $ca_{j}^{*} = \underset{ca_{j} \in CA(C_{j})}{argmin} \sum_{s \in C_{j}} \frac{par_{𝒦ℬ}(ca_{j}(p_{j}), ca_{j}(s))}{wl(s, ca_{j}(s))}$ (5) where wl(s, l) = w if 〈l, w〉 ∈ L(s), and wl(s, l) is undefined otherwise. For every C_j ∈ 𝒞 the algorithm computes Equation (5), obtaining the best concept assignment $ca_{j}^{*}$ for C_j. Then PWCA constructs the interpretation function for every l ∈ Σ_C: $l^{ℐ_{p}} = \{d \in Δ^{ℐ_{p}} ∣ ca_{j}^{*}(d) = l \text{ for some } C_{j} \in 𝒞\}$ and hasPart^{ℐ_p} = {〈p_j, d〉 : 𝒢(d) ⊆ C_j}. In this manner, PWCA can recover from the detection errors that would arise if every segment were simply classified according to its label with the greatest weight. The step of introducing a new logical individual, with type given by Equation (5), that “explains” a cluster according to 𝒦ℬ can be seen as an abduction operation. As a final step BestMod checks if ℐ_p is a partial model of 𝒦ℬ. This is done by checking the consistency of 𝒦ℬ extended with the ABox that represents ℐ_p,⁷ using a standard DL reasoner, e.g., Pellet [33].
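Equation (5) can be illustrated with a brute-force search over all concept assignments of one cluster. This is a sketch under assumptions: the candidate parent types are passed in explicitly, `par` is a hand-written stand-in for the entailment-based par function, and the names `best_assignment`, `toy_par` and the example weights are hypothetical.

```python
import math
from itertools import product


def best_assignment(cluster, labels, parent_types, par):
    """Exhaustively minimise the sum over s of par(parent, ca(s)) / wl(s, ca(s)).

    cluster      : iterable of segment names
    labels       : dict segment -> set of (label, weight)
    parent_types : candidate concepts for the parent p_j
    par          : callable (whole_concept, part_concept) -> cost
    Returns (cost, parent_type, {segment: label}), or None if every
    assignment has infinite cost.
    """
    segments = sorted(cluster)
    options = [sorted(labels[s]) for s in segments]
    best = None
    for parent in sorted(parent_types):
        for choice in product(*options):
            cost = sum(par(parent, label) / w for label, w in choice)
            if cost < math.inf and (best is None or cost < best[0]):
                assignment = {s: label for s, (label, _) in zip(segments, choice)}
                best = (cost, parent, assignment)
    return best


def toy_par(whole, part):
    """Stand-in for par: 1 for known parts, infinity for excluded ones."""
    parts = {"Person": {"Face", "Leg", "Arm"}, "Horse": {"Leg", "Tail", "Muzzle"}}
    excluded = {("Person", "Tail"), ("Person", "Muzzle"), ("Horse", "Face")}
    if part in parts[whole]:
        return 1.0
    if (whole, part) in excluded:
        return math.inf
    return 5.0


labels = {"f1": {("Face", 0.9)}, "l4": {("Tail", 0.6), ("Leg", 0.4)}}
result = best_assignment({"f1", "l4"}, labels, {"Person", "Horse"}, toy_par)
```

In this example the detector prefers “Tail” for segment l4, yet the only finite-cost explanation of the cluster is a Person whose parts are a Face and a Leg, so the lower-weight label is selected. Because the sum decomposes over segments once the parent type is fixed, an implementation could also pick the best label per segment independently instead of enumerating the full product.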

AgglomerativeMatrix generates all possible clusterings that can be obtained by joining any pair of clusters in 𝒞 = {C₁, …, C_n}. The result is represented in a matrix A whose elements are $a_{kh} = (𝒞 \setminus \{C_{k}, C_{h}\}) \cup \{C_{k} \cup C_{h}\}$. For every clustering in A the BestMod procedure generates the corresponding partial model and grounding. The algorithm performs a greedy choice: it selects the clustering whose partial model and grounding minimize the loss function (see the combination of argmin with BestMod, line 8). During these operations of clustering generation and local minimum selection, the partial model $ℐ_{p}^{*}$ and grounding 𝒢^* that minimize the loss are stored (lines 10, 11, 12) to be returned at the end (line 13). The algorithm terminates if there are no clusterings to process, i.e., PWCA reaches the bottom of the lattice or all the clusterings generate inconsistent interpretations.
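The agglomerative step and the greedy loop described above can be sketched as follows. This is not the authors' Algorithm 1: `best_mod` is a hypothetical callable standing in for BestMod that returns a pair (loss, model) or `None` when no consistent partial model exists, and the symmetric matrix A is represented simply as the list of its distinct entries.

```python
from itertools import combinations


def super_clusterings(clustering):
    """All clusterings obtained by merging one pair of clusters.

    A clustering is a frozenset of frozensets of segment names.
    """
    result = []
    for a, b in combinations(sorted(clustering, key=sorted), 2):
        rest = clustering - {a, b}
        result.append(frozenset(rest | {a | b}))
    return result


def greedy_search(segments, best_mod):
    """Greedy descent on the lattice of clusterings, keeping the best result seen."""
    current = frozenset(frozenset([s]) for s in segments)
    best_overall = best_mod(current)
    while len(current) > 1:
        scored = [(best_mod(c), c) for c in super_clusterings(current)]
        scored = [(r, c) for r, c in scored if r is not None]
        if not scored:
            break  # every super-clustering is inconsistent
        result, current = min(scored, key=lambda rc: rc[0][0])
        if best_overall is None or result[0] < best_overall[0]:
            best_overall = result
    return best_overall


def toy_best_mod(clustering):
    """Toy stand-in for BestMod: the loss is lowest with exactly two clusters."""
    return (abs(len(clustering) - 2), clustering)


outcome = greedy_search(["s1", "s2", "s3", "s4"], toy_best_mod)
```

The search continues below the best level (here down to a single cluster) and only remembers the minimum, which matches the description of storing the best pair during the descent and returning it at the end.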

Starting from 𝒞₀, with |S| clusters, the algorithm passes through O(|S|) levels of the lattice to reach the bottom. Every level contains clusterings with n clusters. For every level the algorithm performs O(n²) operations due to the super-clusterings generation and the greedy choice. Thus, the whole algorithm visits O(|S|³) nodes of the lattice. In the running example, the algorithm starts with 𝒞₀ = {{l₁}, {l₂}, {l₃}, {l₄}, {l₅}, {a₁}, {m₁}, {t₁}, {f₁}} (for simplicity we use individuals instead of segments). BestMod(𝒫, 𝒞₀, 𝒦ℬ) computes the partial model $ℐ_{p}^{*}$ and the grounding 𝒢^* corresponding to 𝒞₀ as follows:

Δ^{ℐ_p} is defined as a set of 9 individuals corresponding to simple objects, l₁, …, f₁, one for every cluster in 𝒞₀;

Δ^{ℐ_p} is extended with 9 new individuals corresponding to composite objects, p₁, …, p₉, one for every cluster in 𝒞₀;

then the part-whole assertions are added (e.g., hasPart(p₁, l₁), hasPart(p₇, m₁), and hasPart(p₉, f₁));

then BestMod assigns the best concept types to all individuals in Δ^{ℐ_p} according to Equation (5) (e.g., Leg(l₁), Arm(a₁), …, Muzzle(m₁), (Horse ⊔ Person)(p₁), Horse(p₇), and Person(p₉));

finally, BestMod checks the consistency of $ℐ_{p}^{*}$ (represented with an ABox) with the Pellet reasoner.

Then PWCA enters the while loop and the best child clustering of 𝒞₀, selected by the argmin, is 𝒞′ = {{l₁}, {l₂}, {l₃}, {l₅}, {a₁}, {m₁}, {t₁}, {f₁, l₄}}. The BestMod procedure generates a new partial model (in a new ABox) from 𝒞′. According to the steps above, every element in every cluster is first introduced as a logical individual in the ABox. Then there is the introduction of individuals corresponding to the parents, the introduction of the part-whole assertions and the assignment of the concept types (Equation (5)), e.g., the cluster {f₁, l₄} has parent p′₁ of type Person. At the end of the while loop the algorithm outputs the partial model corresponding to the clustering {{a₁, l₁}, {f₁, l₄, l₅}, {m₁, t₁, l₂, l₃}}. In this partial model the first cluster is represented with a logical individual of type Person with two hasPart relations with individuals of type Arm and Tail. The second cluster is still represented with a Person with the hasPart relation with three individuals of type Face, Leg and Leg. The last cluster is represented with an individual of type Horse having as parts logical individuals of type Muzzle, Tail, Leg and Leg. Notice that this clustering partially matches the ground truth of Fig. 2. Indeed the types of the parents are computed correctly, while a horse leg is assigned to a person.
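The O(|S|³) bound on the number of visited nodes can be checked with a short count. Assuming that at a level with n clusters AgglomerativeMatrix produces one candidate per unordered pair of clusters, and that the greedy path visits the levels n = |S|, |S| − 1, …, 2, the total number of candidate clusterings is at most:

```latex
\sum_{n=2}^{|S|} \binom{n}{2}
  \;=\; \binom{|S|+1}{3}
  \;=\; \frac{(|S|+1)\,|S|\,(|S|-1)}{6}
  \;=\; O(|S|^{3})
```

For the running example with |S| = 9 this gives at most 1 + 3 + 6 + 10 + 15 + 21 + 28 + 36 = 120 candidate clusterings, each of which requires one call to BestMod and hence one consistency check.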

6 Evaluation

6.1 Datasets

To evaluate our approach we need a gold standard dataset of segmented images, such that segments are correctly labelled with part and whole object types and the part-whole relations are explicitly specified. One such dataset is the PASCAL-Part dataset [5]. Image segments of this dataset are labelled with classes of animals, vehicles, indoor objects and their parts. However, labels for parts in PASCAL-Part are very specific, e.g., “left lower leg” and “right hand”. Since we are not interested in such a fine-grained distinction, we merged the segments of the images that refer to the same part into a single segment, e.g., two segments labelled with “left lower leg” and “left front leg” of the same leg have been merged into a segment labelled with “leg”. Then we split the dataset into a PASCAL-Part training set (7687 images) and a PASCAL-Part test set (2416 images) such that they contain 80% and 20% of the objects of every class, respectively. We perform the evaluation only on the test set; the training set does not play any role in this experiment and will be used for the further experiments described in the next section.

The images of PASCAL-Part contain few composite objects (an average of 2.10 composite objects per image), so in many cases there is little variability between classes of composite objects. Since we want to test our algorithm on images that contain more composite objects and more parts, we developed a new dataset called the PartOf dataset. Using LabelMe we annotated 203 pictures with simple objects, composite objects and their part-whole relations. The labels of the PartOf dataset include classes of buildings, trees, people, street vehicles, and their parts. We also manually built two simple ontologies about the meronymy of the classes of objects in the datasets. In a large-scale setting a part-whole ontology can be automatically extracted from Semantic Web resources like WordNet [12] or Yago [24]. Table 1 summarises the main differences and figures of these datasets. The PartOf dataset contains more complex objects and parts per picture than PASCAL-Part, which makes it more challenging for PWCA than the PASCAL-Part test set. The datasets of the experiments and the ontologies are available at https://dkm.fbk.eu/technologies/knowpic.

6.2 Evaluation criteria

To compute the performance of an SII algorithm (SIIA) w.r.t. a gold standard dataset 𝒟, we need to define a distance measure between the output of SIIA and the annotations on the elements of 𝒟. We suppose that both the output of SIIA and the annotations on 𝒟 are represented with ABoxes; we therefore need to define a distance between ABoxes. Let $𝒜_{𝒫}^{SIIA}$ be the ABox that represents the output of SIIA on a picture 𝒫 and let $𝒜_{𝒫}^{𝒟}$ be the ABox associated with 𝒫 in the dataset 𝒟. We define the following two measures.

Grouping (GRP) measures how good SIIA is at grouping parts of the same composite object. $\begin{matrix} {prec}_{GRP} & = & \frac{1}{| 𝒟 |} \sum_{𝒫 \in 𝒟} \frac{| sibl (𝒜_{𝒫}^{SIIA}) \cap sibl (𝒜_{𝒫}^{𝒟}) |}{| sibl (𝒜_{𝒫}^{SIIA}) |} \\ {rec}_{GRP} & = & \frac{1}{| 𝒟 |} \sum_{𝒫 \in 𝒟} \frac{| sibl (𝒜_{𝒫}^{SIIA}) \cap sibl (𝒜_{𝒫}^{𝒟}) |}{| sibl (𝒜_{𝒫}^{𝒟}) |} \end{matrix}$ where, for any ABox 𝒜, $sibl (𝒜) = \{ \langle d, d^{'} \rangle ∣ \exists d^{″} : \{ hasPart (d^{″}, d), hasPart (d^{″}, d^{'}) \} \subseteq 𝒜 \}$.

Complex-object prediction (COP) measures how good SIIA is at predicting the type of a composite object. $\begin{matrix} {prec}_{COP} & = & \frac{1}{| 𝒟 |} \sum_{𝒫 \in 𝒟} \frac{| ptype (𝒜_{𝒫}^{SIIA}) \cap ptype (𝒜_{𝒫}^{𝒟}) |}{| ptype (𝒜_{𝒫}^{SIIA}) |} \\ {rec}_{COP} & = & \frac{1}{| 𝒟 |} \sum_{𝒫 \in 𝒟} \frac{| ptype (𝒜_{𝒫}^{SIIA}) \cap ptype (𝒜_{𝒫}^{𝒟}) |}{| ptype (𝒜_{𝒫}^{𝒟}) |} \end{matrix}$ where, for any ABox 𝒜, $ptype (𝒜) = \{ \langle d, C \rangle ∣ \exists d^{'} : \{ hasPart (d^{'}, d), C (d^{'}) \} \subseteq 𝒜 \}$. Intuitively, ptype(𝒜) is the set of pairs ⟨d, C⟩ such that, according to 𝒜, d is a part of an element of type C.

For both measures the F1 is defined as usual.
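As an illustration of the two measures, the following Python sketch computes precision, recall and F1 for GRP and COP on a toy example. It makes several assumptions not stated in the text: ABoxes are encoded as sets of tuples (`("Person", "p1")` for concept assertions, `("hasPart", "p1", "f1")` for role assertions), reflexive pairs ⟨d, d⟩ are kept in sibl because the definition read literally allows them, an empty denominator yields 0, and F1 is taken as the harmonic mean of the dataset-level precision and recall.

```python
from itertools import product


def sibl(abox):
    """Pairs of parts that share the same whole."""
    parts_of = {}
    for assertion in abox:
        if assertion[0] == "hasPart":
            _, whole, part = assertion
            parts_of.setdefault(whole, set()).add(part)
    return {pair for parts in parts_of.values() for pair in product(parts, repeat=2)}


def ptype(abox):
    """Pairs (part, type of the whole the part belongs to)."""
    types_of = {}
    for assertion in abox:
        if len(assertion) == 2:  # concept assertion (C, d)
            concept, individual = assertion
            types_of.setdefault(individual, set()).add(concept)
    return {
        (assertion[2], concept)
        for assertion in abox
        if assertion[0] == "hasPart"
        for concept in types_of.get(assertion[1], ())
    }


def precision_recall_f1(pairs, extract):
    """pairs: list of (predicted ABox, gold ABox); extract: sibl or ptype."""
    precisions, recalls = [], []
    for predicted, gold in pairs:
        p, g = extract(predicted), extract(gold)
        common = len(p & g)
        precisions.append(common / len(p) if p else 0.0)
        recalls.append(common / len(g) if g else 0.0)
    prec = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


# Toy picture: the gold standard puts leg l5 on the horse,
# the prediction assigns it to the person.
gold = {
    ("Person", "g1"), ("Horse", "g2"),
    ("hasPart", "g1", "f1"), ("hasPart", "g1", "l4"),
    ("hasPart", "g2", "m1"), ("hasPart", "g2", "l5"),
}
predicted = {
    ("Person", "p1"), ("Horse", "p2"),
    ("hasPart", "p1", "f1"), ("hasPart", "p1", "l4"), ("hasPart", "p1", "l5"),
    ("hasPart", "p2", "m1"),
}
grp = precision_recall_f1([(predicted, gold)], sibl)
cop = precision_recall_f1([(predicted, gold)], ptype)
```

Both measures depend only on the individuals denoting parts, so the names chosen for the parent individuals (`g1` versus `p1` here) do not affect the comparison.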

6.3 Experiments

We ran PWCA on the images of both the PASCAL-Part test set and the PartOf dataset, after removing the information about the part-of relation and the complex objects. We then measured the distance between the output produced by PWCA and the full annotations in the datasets.

To evaluate the impact of semantic information on the recognition of the part-whole relation in images, we ran four configurations of PWCA:

PWCA where the parameters α and β have been optimized (PWCA_{best(α,β)});

PWCA with β = 1.0 and α optimized, i.e., no semantic information is taken into account (PWCA_{best(α),β=1});

PWCA with optimal parameters α and β where the ontology has been extended with a set of axioms that restrict the number of parts of the same class for each complex object (e.g., a cow has exactly four legs). These axioms are called cardinality axioms (PWCA_{best(α,β)+CA});

our previous version of PWCA [9] (PWCA_SOM).

Table 2 shows the details of the evaluation. We ran the experiments on an Intel Xeon E5-1660 v3 3.00 GHz, 16 core, 32 GB DDR4.

The performance of PWCA_{best(α,β)} on the PASCAL-Part test set and on the PartOf dataset (first and fourth lines of Table 2, respectively) is encouraging. For both datasets we use the parameters α = 0.5, β = 0.3, which maximize the performance on the PartOf dataset. The parameters tuned on the “hardest” dataset generalize well to a standard dataset: we obtain better results on the PASCAL-Part test set than on the PartOf dataset.

To show the impact of background knowledge on the performance of PWCA, we compared PWCA with a knowledge-blind baseline (PWCA_{best(α),β=1}). This is obtained by setting β = 1.0 in (3) with the effect of cancelling out the impact of the semantic quasidistance in the loss function. Comparing this baseline with the results of other settings, we can see that semantic features significantly improve the F1 measure on both datasets.
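The effect of β = 1.0 can be made concrete with a small sketch. Equation (3) is not reproduced in this section, so the convex form below (β weighting the numeric distance and 1 − β the semantic quasidistance) is an assumption that is merely consistent with the description above, and the function name `combined_distance` is hypothetical. The role of α is not modelled.

```python
def combined_distance(numeric: float, semantic: float, beta: float) -> float:
    """Assumed convex combination of a numeric and a semantic distance."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # With beta = 1.0 the semantic term is multiplied by zero and vanishes.
    return beta * numeric + (1.0 - beta) * semantic


knowledge_blind = combined_distance(numeric=2.0, semantic=5.0, beta=1.0)
with_semantics = combined_distance(numeric=2.0, semantic=5.0, beta=0.3)
```

Under this assumed form, the knowledge-blind baseline ranks candidate groupings by the numeric distance alone, whereas β = 0.3 lets the semantic term dominate.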

Most Semantic Web resources about meronymy (e.g., WordNet [12]) have no cardinality axioms, i.e., axioms of the form wholeObject ⊑ (= n) hasPart.part. To check whether such axioms are really necessary, we tested how adding them to the ontology affects the performance of PWCA. We therefore ran PWCA with the ontology containing all cardinality axioms (PWCA_{best(α,β)+CA}). For both datasets we used the optimal parameters α = 0.5, β = 0.3. Adding all cardinality axioms to the ontologies slightly improves the precision of the GRP measure at the price of a dramatic decrease in recall, whereas it does not affect COP. We conclude that PWCA works better with simple part-whole ontologies without cardinality axioms.

Finally, we compare PWCA with our previous work [9] (PWCA_SOM), which solves the same task with a non-parametric clustering algorithm based on Self-Organizing Maps (SOM) [21]. We tested PWCA_SOM on our dataset and, as Table 2 shows, PWCA outperforms our previous algorithm. Moreover, PWCA_SOM does not guarantee that images are interpreted as partial models that satisfy the knowledge base: in our implementation, 8% of the cases generate interpretations that are not models. PWCA, instead, always returns partial models.

The above results support the intuition that it is worthwhile to jointly use numeric and semantic features of simple objects (parts) to detect composite objects in images.

7 Towards a whole SII pipeline

PWCA covers only the final part of the SII pipeline. Indeed, PWCA takes as input a picture in which the segments of the parts are already given and labelled. The previous section reports the evaluation of PWCA under the assumption of perfect semantic segmentation of parts. In this section, we consider the realistic situation where semantic segmentation is performed automatically. To this aim, we first construct a full SII pipeline by composing a deep-learning-based object detector (Fast R-CNN) [14], used for the detection of parts, with PWCA. We call this pipeline R-CNN+PWCA. We then compare its ability to recognise complex objects against a baseline that recognises complex objects directly (without considering their parts). For this baseline we consider Fast R-CNN trained on complex objects.

7.1 The SII pipeline

As mentioned above, we built two alternative SII pipelines (shown in Fig. 3). The first one, called R-CNN+PWCA, takes into account part-whole ontological information. The second one, called R-CNN, recognises complex objects directly from the picture without considering part-whole relations. In the following we provide more details on the two pipelines. Both pipelines are based on a state-of-the-art deep learning object detection tool, the Fast Region-based Convolutional Network (Fast R-CNN) [14].

The flow of the R-CNN+PWCA pipeline (shown in the top of Fig. 3) is the following:

An input image is first processed by Fast R-CNN for the semantic detection of object parts. To this end, we train Fast R-CNN on all the classes of the PASCAL-Part ontology by using the PASCAL-Part training set, obtaining a mean Average Precision of 0.41 on the PASCAL-Part test set. For further information about performance evaluation in object detection we refer the reader to [11].

Every bounding box returned by Fast R-CNN has a weighted label indicating the class of the object. We modified this functionality in order to obtain the top 5 weighted labels for each bounding box.

The result of the previous step is a semantically labelled picture. This picture is given as input to PWCA, which returns the best partial model $ℐ_{p}^{*}$ and the grounding 𝒢^*. From $ℐ_{p}^{*}$ and 𝒢^* we compute the bounding box of each composite object of $ℐ_{p}^{*}$ as follows:

the bounding box associated with a composite object p_j is the smallest bounding box (SBB) surrounding the segments (bounding boxes) that correspond through 𝒢^* to the parts of p_j in $ℐ_{p}^{*}$.
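The SBB computation in the last step can be sketched as follows. The box encoding `(x_min, y_min, x_max, y_max)` and the function name `smallest_bounding_box` are assumptions made for this illustration; the text does not specify the representation used in the implementation.

```python
from typing import Iterable, Tuple

# Assumed encoding of an axis-aligned box: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def smallest_bounding_box(part_boxes: Iterable[Box]) -> Box:
    """Smallest axis-aligned box enclosing all the boxes of the parts."""
    boxes = list(part_boxes)
    if not boxes:
        raise ValueError("a composite object needs at least one part box")
    return (
        min(box[0] for box in boxes),
        min(box[1] for box in boxes),
        max(box[2] for box in boxes),
        max(box[3] for box in boxes),
    )


# Two hypothetical part boxes grouped under the same composite object.
sbb = smallest_bounding_box([(10, 20, 50, 60), (40, 10, 90, 55)])
```

Because the SBB is built only from detected parts, it can only grow as more parts of the object are detected, never shrink.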

The second pipeline (shown in the bottom part of Fig. 3) is implemented by directly applying Fast R-CNN to the image. Indeed, the labels of composite objects in the PASCAL-Part dataset are the same classes on which the default Fast R-CNN is trained, i.e., the 20 classes of the PASCAL VOC Challenge [11]. Thus, we run the default Fast R-CNN on the PASCAL-Part test set in order to detect composite objects.

We compared the SBBs returned by PWCA and the output of the default Fast R-CNN with the ground truth of the PASCAL-Part test set.

7.2 Evaluation

To compare the output of R-CNN+PWCA with that of R-CNN we compared the bounding boxes of composite objects returned by both pipelines. We use the area overlap as a measure to compare a predicted bounding box B_p with the bounding box of the ground truth B_gt, defined as $a_{o} = \frac{area (B_{p} \cap B_{gt})}{area (B_{p} \cup B_{gt})} .$ An overlap close to 1 indicates a good prediction for B_p, see [11]. Figure 4 shows some cases where R-CNN+PWCA is comparable to (or better than) R-CNN. We can see that in some cases R-CNN is not able to recognize any object in the image, while R-CNN+PWCA is able to detect the presence of an object from the detection of some of its parts alone: the more parts are detected, the more accurate the bounding box prediction is. Moreover, for the images of Fig. 4 we measured the area overlap with the ground truth, see the first two columns of Table 3. From this qualitative analysis it emerges that there are cases where the predictions of R-CNN+PWCA are more accurate than those of Fast R-CNN. This encouraging result suggests that PWCA can be combined with standard object detectors to improve the accuracy of the detections. In addition, we compared the computational time of the R-CNN+PWCA pipeline against that of R-CNN. The latter is obviously faster, but the time of our pipeline is still acceptable, see the last two columns of Table 3.
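The area overlap can be computed as in the following sketch, which assumes axis-aligned boxes encoded as `(x_min, y_min, x_max, y_max)`; the function names are hypothetical and returning 0 for a degenerate union is an assumption.

```python
from typing import Tuple

# Assumed encoding of an axis-aligned box: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def area_overlap(predicted: Box, ground_truth: Box) -> float:
    """Intersection area divided by union area of two boxes."""
    width = min(predicted[2], ground_truth[2]) - max(predicted[0], ground_truth[0])
    height = min(predicted[3], ground_truth[3]) - max(predicted[1], ground_truth[1])
    intersection = max(0.0, width) * max(0.0, height)
    union = area(predicted) + area(ground_truth) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: half of each box overlaps the other.
overlap = area_overlap((0, 0, 10, 10), (5, 0, 15, 10))
```

In this example the intersection has area 50 and the union has area 150, so the overlap is 1/3.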

8 Discussion

Unlike most approaches to object detection in images, PWCA does not need a training set. PWCA is a completely unsupervised method that can be easily adapted to specific domains by providing an appropriate part-whole ontology of the domain. For this reason, PWCA can be easily extended by exploiting information available in WordNet [12] or Yago [24]. Both contain a taxonomy of classes, and the former also formalizes the meronymy of the classes. Providing a part-whole ontology for a set of classes is usually less expensive than producing training sets of images for the same set of classes. More expressive ontologies can be automatically obtained from Web resources (e.g., corpora, Web documents, RDF repositories) with the use of ontology learning techniques (e.g., [6, 29]).

PWCA deals with multiple and noisy labels on parts. Given a segment with several candidate labels, PWCA selects its label by also taking into account the labels assigned to the sibling segments. In this way PWCA can discard the label with the highest weight in favour of a label with a lower weight when this optimizes the global labelling of the elements of the cluster.
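The following toy sketch illustrates how sibling labels can override the highest-weighted label of a segment. It is not the optimisation performed by PWCA, which minimises the loss function defined earlier in the paper; the scoring rule (sum of the weights of labels compatible with a candidate whole), the dictionary encoding of the ontology and the name `best_cluster_labelling` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Labelling = Tuple[str, List[str], float]  # (whole type, part labels, score)


def best_cluster_labelling(
    cluster: List[Dict[str, float]],
    ontology: Dict[str, Set[str]],
) -> Optional[Labelling]:
    """Pick the whole type whose admissible part labels have the highest total weight."""
    best: Optional[Labelling] = None
    for whole, allowed_parts in ontology.items():
        labels: List[str] = []
        score = 0.0
        for candidates in cluster:
            compatible = {l: w for l, w in candidates.items() if l in allowed_parts}
            if not compatible:
                break  # this whole type cannot explain the segment
            label = max(compatible, key=compatible.get)
            labels.append(label)
            score += compatible[label]
        else:
            if best is None or score > best[2]:
                best = (whole, labels, score)
    return best


# Hypothetical part-whole ontology and weighted labels for three sibling segments.
ontology = {"Person": {"Face", "Arm", "Leg"}, "Horse": {"Muzzle", "Tail", "Leg"}}
cluster = [{"Muzzle": 0.5, "Face": 0.4}, {"Arm": 0.9}, {"Leg": 0.8}]
result = best_cluster_labelling(cluster, ontology)
```

Here the first segment is labelled Face although Muzzle has the higher weight, because its siblings (an arm and a leg) are only compatible with a person.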

To the best of our knowledge, most semantic image interpretation systems do not compare the predicted structure with a ground truth. For this reason we developed the GRP and COP measures (the investigation of other methods for comparing logical interpretations is left as future work). Other systems evaluate their performance on common tasks such as the prediction of object properties [27, 38], caption generation [4, 36], and image search [4]. This suggests directing our future experiments towards these tasks.

The heuristic of grouping simple objects according to their geometric and semantic proximity could also be applied to the relation of participating in an event. In this case, the simple objects are the participants in an event in the image, whereas the composite objects are the events themselves. We can conduct the same evaluation as for the part-whole relation. Moreover, we can construct the same pipeline of Fast R-CNN (trained on participants in an event) and PWCA to obtain a complete SII system for event detection.

PWCA suffers from some limitations. On the one hand, it is able to recover false negatives of the object detector (by inferring the composite object from its parts); on the other hand, it is not able to discard false positives, e.g., a wrong segment with label “eye” in the middle of a meadow. Indeed, given a segment of a simple object, PWCA deduces the presence of a composite object; it is not able to recognise the simple object as a false positive and discard it. This needs to be studied. Moreover, the set of labels returned by an object detector could be inconsistent, e.g., a segment could be classified as both “dog” and “cat” with different weights. We can apply the work in [8] to avoid these inconsistencies.

9 Conclusions

We proposed a well-founded and general framework for SII that integrates semantic information with low-level numeric features. An image is interpreted as a partial model of a knowledge base. The construction of the partial model is guided by an incremental clustering algorithm that mixes semantic and numeric distances. We applied the framework to the specific task of recognizing composite objects from their parts. The evaluation on the PASCAL-Part dataset and on a new dataset of 203 labelled pictures shows good results and improvements with respect to our previous work [9]. Our approach shows how the explicit combination of the semantic information of a knowledge base with the numeric information of an image improves on methods based only on numeric features in the task of part-whole recognition. Moreover, we constructed a whole pipeline from an image to its partial model for part-whole detection, combining a state-of-the-art object detector [14] (to obtain labelled pictures) with PWCA. Our aim was to test PWCA on the detection of composite objects. A qualitative analysis of the results shows that PWCA is able to reconstruct the bounding box of a composite object from the presence of a few parts. This bounding box can be as accurate as (or, in some cases, more accurate than) the one returned by an object detector.

Footnotes

1. The approach is independent from a specific DL.

2. Our definition slightly differs from the one of [].

3. Every logical individual is associated to at least one segment. A segment can have no connection with a logical individual; this allows the framework to handle, and possibly discard, false positive segments of the labelled picture.

4. Standard ontological assumptions of classical mereology.

5. Semantic distance is not required to be symmetric.

6. Using the centroid of G(d) as a numeric feature is enough to show the effectiveness of our approach. The approach can be generalised by considering other features like shape, texture, color, etc. The use of the centroid and the $L_{2}^{2}$ norm tends to group objects close in the space. This assumption is based on the Law of Proximity of Gestalt Psychology [] that states that parts of the same object are usually close.

7. The interpretation ℐ_p is represented with an ABox as follows: if $d \in l^{ℐ_{p}}$, with l ∈ Σ_C, then l(d) is in the ABox; if $(p, d) \in hasPart^{ℐ_{p}}$, with hasPart ∈ Σ_R, then hasPart(p, d) is in the ABox.

References

1. Atif, Hudelot, and Bloch. Explanatory reasoning for image understanding using formal concept analysis and description logics, Systems, Man, and Cybernetics: Systems, IEEE Transactions on 44(5) (2014), 552–570.

2. Baader, Calvanese, McGuinness D.L., Nardi, and Patel-Schneider P.F., editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York, NY, USA, 2003.

3. Bannour and Hudelot. Towards ontologies for image interpretation and annotation. In Martinez José M., editor, 9th International Workshop on Content-Based Multimedia Indexing, CBMI 2011, Madrid, Spain, 2011, pp. 211–216. IEEE, 2011.

4. Chen, Zhou Q.-Y., and Prasanna. Understanding web images by object relation network. In Proceedings of the 21st International Conference on World Wide Web, WWW’12, New York, NY, USA, 2012, pp. 291–300. ACM.

5. Chen, Mottaghi, Liu, Fidler, Urtasun, and Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

6. Cimiano, Mädche, Staab, and Völker. Ontology learning. In Handbook on ontologies, Springer Berlin Heidelberg, 2009, pp. 245–267.

7. Dasiopoulou, Kompatsiaris, and Strintzis M.G. Applying fuzzy DLs in the extraction of image semantics, J Data Semantics 14 (2009), 105–132.

8. Deng, Ding, Jia, Frome, Murphy, Bengio, Li, Neven, and Adam. Large-scale object classification using label relation graphs. In Computer Vision – ECCV 2014, Springer, 2014, pp. 48–64.

9. Donadello and Serafini. Mixing low-level and semantic features for image interpretation. In Agapito Lourdes, Bronstein M. Michael, and Rother Carsten, editors, Computer Vision – ECCV 2014 Workshops, volume 8926 of Lecture Notes in Computer Science, Springer International Publishing, 2014, pp. 283–298. Best paper award.

10. Espinosa, Kaya, and Möller. Logical formalization of multimedia interpretation. In Paliouras Georgios, Spyropoulos Constantine D., and Tsatsaronis George, editors, Knowledge-Driven Multimedia Information Extraction and Ontology Evolution, volume 6050 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2011, pp. 110–133.

11. Everingham, Eslami S.M.A., Van Gool, Williams C.K.I., Winn, and Zisserman. The PASCAL visual object classes challenge: A retrospective, International Journal of Computer Vision 111(1) (2015), 98–136.

12. Fellbaum, editor. WordNet: an electronic lexical database. MIT Press, 1998.

13. Forestier, Wemmert, and Puissant. Coastal image interpretation using background knowledge and semantics, Computers & Geosciences 54 (2013), 88–96.

14. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.

15. Gould, Rodgers, Cohen, Elidan, and Koller. Multi-class segmentation with relative location prior, International Journal of Computer Vision 80(3) (2008), 300–316.

16. Hirst and St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms, WordNet: An Electronic Lexical Database 305 (1998), 305–332.

17. Hudelot, Atif, and Bloch. Fuzzy spatial relation ontology for image interpretation, Fuzzy Sets and Systems 159(15) (2008), 1929–1951.

18. Hudelot, Maillot, and Thonnat. Symbol grounding for semantic image interpretation: From image data to semantics. In Proc. of the 10th IEEE Intl. Conf. on Computer Vision Workshops, ICCVW ’05. IEEE Computer Society, 2005.

19. Jung, Park, Du D.-Z., and Drake B.L. A decision criterion for the optimal number of clusters in hierarchical clustering, Journal of Global Optimization 25(1) (2003), 91–111.

20. Karpathy and Li F.-F. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.

21. Kohonen. The self-organizing map, Proc. of the IEEE 78(9) (1990), 1464–1480.

22. Liu, Zhang, Lu, and Ma W.-Y. A survey of content-based image retrieval with high-level semantics, Pattern Recognition 40(1) (2007), 262–282.

23. Long, Shelhamer, and Darrell. Fully convolutional networks for semantic segmentation, arXiv preprint arXiv:1411.4038, 2014.

24. Mahdisoltani, Biega, and Suchanek F.M. YAGO3: A knowledge base from multilingual Wikipedias. In Proc. of the Conf. on Innovative Data Systems Research, 2015.

25. Marszalek and Schmid. Semantic hierarchies for visual object recognition. In Computer Vision and Pattern Recognition, 2007.

26. Neumann and Möller. On scene interpretation with description logics, Image and Vision Computing 26(1) (2008), 82–101. Cognitive Vision-Special Issue.

27. Nyga, Balint-Benczedi, and Beetz. PR2 looking at things - ensemble learning for unstructured information processing with Markov logic networks. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, IEEE, 2014, pp. 3916–3923.

28. Espinosa Peraldi I.S., Kaya, and Möller. Formalizing multimedia interpretation based on abduction over description logic ABoxes. In Proc. of the 22nd Intl. Workshop on Description Logics (DL 2009), volume 477 of CEUR Workshop Proceedings. CEUR-WS.org, 2009.

29. Petrucci. Information extraction for learning expressive ontologies. In The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31 - June 4, 2015. Proceedings, 2015, pp. 740–750.

30. Reiter and Mackworth A.K. A logical framework for depiction and image interpretation, Artificial Intelligence 41(2) (1989), 125–155.

31. Russell B.C., Torralba, Murphy K.P., and Freeman W.T. LabelMe: A database and web-based tool for image annotation, Int J Comput Vision 77(1–3) (2008), 157–173.

32. Schroder and Neumann. On the logics of image interpretation: model-construction in a formal knowledge-representation framework. In Image Processing, 1996. Proceedings, Int Conf on, 1 (1996), 785–788.

33. Sirin, Parsia, Grau B.C., Kalyanpur, and Katz. Pellet: A practical OWL-DL reasoner, Web Semant 5(2) (2007), 51–53.

34. Smith, von Ehrenfels, and Verlag. Foundations of Gestalt theory. Philosophia Verlag Munich, Germany, 1988.

35. Town. Ontological inference for image and video analysis, Mach Vision Appl 17(2) (2006), 94–115.

36. Xu, Ba, Kiros, Cho, Courville A.C., Salakhutdinov, Zemel R.S., and Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.

37. Yuille and Oliva. Frontiers in computer vision: NSF white paper, November 2010. http://www.frontiersincomputervision.com/WhitePaperInvite.pdf

38. Zhu, Fathi, and Fei-Fei. Reasoning about object affordances in a knowledge base representation. In Fleet David, Pajdla Tomas, Schiele Bernt, and Tuytelaars Tinne, editors, Computer Vision – ECCV 2014, volume 8690 of Lecture Notes in Computer Science, Springer International Publishing, 2014, pp. 408–424.