Resource List w/ Abstract

[pdf]: paper PDF online link
[repo]: paper PDF repo link
[github]: github link
[web]: website link

Paper
Tutorials
Datasets

Paper

2009 {#1}

Distant Supervision for Relation Extraction without Labeled Data

[pdf] [repo]
Mintz et al. (2009.08)
ACL'09
Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distance in their expression.

2013 {#1}

Translating Embeddings for Modeling Multi-relational Data

[pdf] [repo]
Bordes et al. (2013.12)
NIPS'13
We consider the problem of embedding entities and relationships of multi-relational data in the low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples.

2014 {#2}

Knowledge Graph Embedding by Translating on Hyperplanes

[pdf] [repo]
Wang et al. (2014.07)
AAAI'14
We deal with embedding a large scale knowledge graph composed of entities and relations into a continuous vector space. TransE is a promising method proposed by recently, which is very efficient while achieving state-of-the-art predictive performance. We discuss some mapping properties of relations which should be considered in embedding, such as reflexive, one-to-many, many-to-one, and many-to-many. We note that TransE does not do well in dealing with these properties. Some complex models are capable of preserving these mapping properties but sacrifice efficiency in the process. To make a good trade-off between model capacity and efficiency, in this paper we propose TransH which models a relation as a hyperplane together with a translation operation on it. In this way, we can well preserve the above mapping properties of relations with almost the same model complexity of TransE. Additionally, as a practical knowledge graph is often far from completed, how to construct negative examples to reduce false negative labels in training is very important. Utilizing the one-to-many/many-to-one mapping property of a relation, we propose a simple trick to reduce the possibility of false negative labeling. We conduct extensive experiments on link prediction, triplet classification and fact extraction on benchmark datasets like WordNet and Freebase. Experiments show TransH delivers significant improvements over TransE on predictive accuracy with comparable capability to scale up.

Relation Classification via Convolutional Deep Neural Network

[pdf] [repo]
Zeng et al. (2014.08)
COLING'14
The state-of-the-art methods used for relation classification are primarily based on statistical machine learning, and their performance strongly depends on the quality of the extracted features. The extracted features are often derived from the output of pre-existing natural language processing (NLP) systems, which leads to the propagation of the errors in the existing tools and hinders the performance of these systems. In this paper, we exploit a convolutional deep neural network (DNN) to extract lexical and sentence level features. Our method takes all of the word tokens as input without complicated pre-processing. First, the work tokens are transformed to vectors by looking up word embeddings. Then, lexical level features are extracted according to the given nouns. Meanwhile, sentence level features are learned using a convolutional approach. These two level features are concatenated to form the final extracted feature vector. Finally, the features are fed into a softmax classifier to predict the relationship between two marked nouns. The experimental results demonstrate that our approach significantly outperforms the state-of-the-art methods.

2015 {#3}

Distance Supervision for Relation Extraction via Piecewise Convolutional Neural Networks

[pdf] [repo]
Zeng et al. (2015.08)
EMNLP'15
Two problems arise when using distant supervision for relation extraction. First, in this method, an already existing knowledge base is heuristically aligned to texts, and the alignment results are treated as labeled data. However, the heuristic alignment can fail, resulting in wrong label problem. In addition, in previous approaches, statistical models have typically been applied to ad hoc features. The noise that originates from feature extraction process can cause poor performance. In this paper, we propose a novel model dubbed the Piecewise Convolutional Neural Networks (PCNNs) with multi-instance learning to address these two problems. To solve the first problem, distant supervised relation extraction is treated as a multi-instance problem in which the uncertainty of instance labels is taken into account. To address the latter problem, we avoid feature engineering and instead adopt convolutional architecture with piecewise max pooling to automatically learn relevant features. Experiments show that our method is effective and outperforms several competitive baseline methods.

Knowledge Graph Embedding via Dynamic Mapping Matrix

[pdf] [repo]
Ji et al. (2015.10)
ACL'15
Knowledge graphs are useful resources for numerous AI applications, but they are far from completeness. Previous work such as TransE, TransH and TransR/CTransR regard a relation as translation from head entity to tail entity and the CTransR achieves state-of-the-art performance. In this paper, we propose a more fine-grained model named TransD, which is an improvement of TransR/CTransR. In TransD, we use two vectors to represent a named symbol object (entity and relation). The first one represents the meaning of a(n) entity (relation), the other one is used to construct mapping matrix dynamically. Compared with TransR/CTransR, TransD not only considers the diversity of relations, but also entities. TransD has less parameters and has no matrix-vector multiplication operations, which makes it can be applied on large scale graphs. In experiments, we evaluate our model on two typical tasks including triplets classification and link prediction. Evaluation results show that our approach outperforms state-of-the-art methods.

Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries

[pdf] [repo]
Zhu et al. (2015.11)
arXiv
The complexity of the visual world creates significant challenges for comprehensive visual understanding. In spite of recent successes in visual recognition, today's vision systems would still struggle to deal with visual queries that require a deeper reasoning. We propose a knowledge base (KB) framework to handle an assortment of visual queries, without the need to train new classifiers for new tasks. Building such a large-scale multimodal KB presents a major challenge of scalability. We cast a large-scale MRF into a KB representation, incorporating visual, textual and structured data, as well as their diverse relations. We introduce a scalable knowledge base construction system that is capable of building a KB with half billion variables and millions of parameters in a few hours. Our system achieves competitive results compared to purpose-built models on standard recognition and retrieval tasks, while exhibiting greater flexibility in answering richer visual queries.

2016 {#4}

Knowledge Graph Completion with Adaptive Sparse Transfer Matrix

[pdf] [repo]
Ji et al. (2016.02)
AAAI'16
We model knowledge graphs for their completion by encoding each entity and relation into a numerical space. All previous work including Trans(E, H, R, and D) ignore the heterogeneity (some relations link many entity pairs and others do not) and the imbalance (the number of head entities and that of tail entities in a relation could be different) of knowledge graphs. In this paper, we propose a novel approach TranSparse to deal with the two issues. In TranSparse, transfer matrices are replaced by adaptive sparse matrices, whose sparse degrees are determined by the number of entities (or entity pairs) linked by relations. In experiments, we design structured and unstructured sparse patterns for transfer matrices and analyze their advantages and disadvantages. We evaluate our approach on triplet classification and link prediction tasks. Experimental results show that TranSparse outperforms Trans(E, H, R, and D) significantly, and achieves state-of-the-art performance.

Order-Embeddings of Image and Language

[pdf] [repo] [github]
Vendrov et al. (2016.03)
ICLR'16
Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modelling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.

Neural Relation Extraction with Selective Attention over Instances

[pdf] [repo]
Lin et al. (2016.08)
ACL'16
Distant supervised relation extraction has been widely used to find novel relational facts from text. However, distant supervision inevitably accompanies with the wrong labelling problem, and these noisy data will substantially hurt the performance of relation extraction. To alleviate this issue, we propose a sentence-level attention-based model for relation extraction. In this model, we employ convolutional neural networks to embed the semantics of sentences. Afterwards, we build sentence-level attention over multiple instances, which is expected to dynamically reduce the weights of those noisy instances. Experimental results on real-world datasets show that, our model can make full use of all informative sentences and effectively reduce the influence of wrong labelled instances. Our model achieves significant and consistent improvements on relation extraction as compared with baselines.

TransG: A Generative Model for Knowledge Graph Embedding

[pdf] [repo]
Xiao et al. (2016.08)
ACL'16
Recently, knowledge graph embedding, which projects symbolic entities and relations into continuous vector space, has become a new, hot topic in artificial intelligence. This paper proposes a novel generative model (TransG) to address the issue of multiple relation semantics that a relation may have multiple meanings revealed by the entity pairs associated with the corresponding triples. The new model can discover latent semantics for a relation and leverage a mixture of relation-specific component vectors to embed a fact triple. To the best of our knowledge, this is the first generative model for knowledge graph embedding, and at the first time, the issue of multiple relation semantics is formally discussed. Extensive experiments show that the proposed model achieves substantial improvements against the state-of-the-art baselines.

2017 {#1}

Image-embodied Knowledge Representation Learning

[pdf] [repo]
Xie et al. (2017.08)
IJCAI'17
Entity images could provide significant visual information for knowledge representation learning. Most conventional methods learn knowledge representations merely from structured triples, ignoring rich visual information extracted from entity images. In this paper, we propose a novel Image-embodied Knowledge Representation Learning model (IKRL), where knowledge representations are learned with both triple facts and images. More specially, we first construct representations for all images of an entity with a neural image encoder. These image representations are then integrated into an aggregated image-based representation via an attention-based method. We evaluate out IKRL models on knowledge graph completion and triple classification. Experimental results demonstrate that our models outperform all baselines on both tasks, which indicates the significance of visual information for knowledge representations and the capability of our models in learning knowledge representations with images.

2018 {#5}

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems

[pdf] [repo]
Saha et al. (2018.01)
AAAI'18
While multimodal conversation agents are gaining importance in several domains such as retail, travel etc., deep learning research in this area has been limited primarily due to the lack of availability of large-scale, open chatlogs. To overcome this bottleneck, in this paper we introduce the task of multimodal, domain-aware conversation, and propose the MMD benchmark dataset. This dataset was gathered by working in close coordination with large number of domain experts in the retail domain. These experts suggested various conversations flows and dialog states which are typically seen in multimodal conversations in the fashion domain. Keeping these flows and states in mind, we created a dataset consisting of over 150K conversation sessions between shoppers and sale agents, with the help of in-house annotators using a semi-automated manually intense iterative process. With this dataset, we propose 5 new sub-tasks for multimodal conversations along with their evaluation methodology. We also propose two multimodal neural models in the encode-attend-decode paradigm and demonstrate their performance on two of the sub-tasks, namely text response generation and best image response selection. These experiments serve to establish baseline performance and open new research directions for each of these sub-tasks. Further, for each of the sub-tasks, we present a 'per-state evaluation' of 9 most significant dialog states, which would enable more focused research into understanding the challenges and complexities involved in each of these states.

A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning

[pdf] [repo]
Mousselly-Sergieh et al. (2018.06)
SEM'18
Current methods for knowledge graph (KG) representation learning focus solely on the structure of the KG and do not exploit any kind of external information, such as visual and linguistic information corresponding to the KG entities. In this paper, we propose a multimodal translation-based approach that defines the energy of a KG triple as the sum of sub-energy functions that leverage both multimodal (visual and linguistic) and structural KG representations. Next, a ranking-based loss is minimized using a simple neural network architecture. Moreover, we introduce a new large-scale dataset for multimodal KG representation learning. We compared the performance of our approach to other baselines on two standard tasks, namely knowledge graph completion and triple classification, using our as well as the WN9-IMG dataset. The results demonstrate that our approach outperforms all baselines on both tasks and datasets.

Multimodal Unsupervised Image-to-Image Translation

[pdf] [repo]
Huang et al. (2018.09)
ECCV'18
Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any examples of corresponding image pairs. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT.

Embedding Multimodal Relational Data for Knowledge Base Completion

[pdf] [repo]
Pezeshkpour et al. (2018.11)
EMNLP'18
Representing entities and relations in an embedding space is well-studied approach for machine learning on relational data. Existing approaches, however, primarily focus on simple link structure between a finite set of entities, ignoring the variety of data types that are often used in knowledge bases, such as text, images, and numerical values. In this paper, we propose multimodal knowledge base embeddings (MKBE) that use different neural encoders for this variety of observed data, and combine them with existing relational models to learn embeddings of the entities and multimodal data. Further, using these learned embeddings and different neural decoders, we introduce a novel multimodal imputation model to generate missing multimodal values, like text and images, from information in the knowledge base. We enrich existing relational datasets to create two novel benchmarks that contain additional information such as textual descriptions and images of the original entities. We demonstrate that our models utilize this additional information effectively to provide more accurate link prediction, achieving state-of-the-art results with a considerable gep of 5-7% over existing methods. Further, we evaluate the quality of our generated multimodal values via a user study.

Associative Multichannel Autoencoder for Multimodal Word Representation

[pdf] [repo]
Wang et al. (2018.11)
EMNLP'18
In this paper we address the problem of learning multimodal word representations by integrating textual, visual and auditory inputs. Inspired by the re-constructive and association nature of human memory, we propose a novel associative multichannel autoencoder (AMA). Our model first learns the associations between textual and perceptual modalities, so as to predict the missing perceptual information of concepts. Then the textual and predicted perceptual representations are fused through reconstructing their original and associated embeddings. Using a gating mechanism our model assigns different weights to each modality according to the different concepts. Results on six benchmark concept similarity tests show that the proposed method significantly outperforms strong unimodal baselines and state-of-the-art multimodal models.

2019 {#10}

MMKG: Multi-Modal Knowledge Graphs

[pdf] [repo]
Liu et al. (2019.03)
ESWC'19
We present MMKG, a collection of three knowledge graphs that contain both numerical features and (links to) images for all entities as well as entity alignments between pairs of KGs. Therefore, multi-relational link prediction and entity matching communities can benefit from this resource. We believe this data set has the potential to facilitate the development of novel multi-modal learning approaches for knowledge graphs. We validate the utility of MMKG in the sameAs link prediction task with an extensive set of experiments. These experiments show that the task at hand benefits from learning of multiple feature types.

Shared Predictive Cross-Modal Deep Quantization

[pdf] [repo]
Yang et al. (2019.04)
arXiv
With explosive growth of data volume and ever-increasing diversity of data modalities, cross-modal similarity search, which conducts nearest neighbor search across different modalities, has been attracting increasing interest. This paper presents a deep compact code learning solution for efficient cross-modal similarity search. Many recent studies have proven that quantization based approaches perform generally better than hashing based approaches on single-modal similarity search. In this work, we propose a deep quantization approach, which is among the early attempts of leveraging deep neural networks into quantization based cross-modal similarity search. Our approach, dubbed shared predictive deep quantization (SPDQ), explicitly formulates a shared subspace across different modalities and two private subspaces for individual modalities, representations in the shared subspace and the private subspaces are learned simultaneously by embedding them to a reproducing kernel Hilbert space where the mean embedding of different modality distributions can be explicitly compared. Additionally, in the shared subspace, a quantizer is learned to produce the semantics preserving compact codes with the help of label alignment. Thanks to this novel network architecture in cooperation with supervised quantization training, SPDQ can preserve intra- and inter-modal similarities as much as possible and greatly reduce quantization error. Experiments on two popular benchmarks corroborate that our approach outperforms state-of-the-art methods.

Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs

[pdf] [repo]
Rubio et al. (2019.05)
AKBC'19
A visual-relational knowledge graph (KG) is a multi-relational graph whose entities are associated with images. We explore novel machine learning approaches for answering visual-relational queries in web-extracted knowledge graphs. To this end, we have created ImageGraph, a KG with 1330 relation types, 14870 entities, and 829,931 images crawled from the web. With visual-relational KGs such as ImageGraph one can introduce novel probabilistic query types in which images are treated as first-class citizens. Both the prediction of relations between unseen images as well as multi-relational image retrieval can be expressed with specific families of visual-relational queries. We introduce novel combinations of convolutional networks and knowledge graph embedding methods to answer such queries. We also explore a zero-shot learning scenario where an image of an entirely new entity is linked with multiple relations to entities of an existing KG. The resulting multi-relational grounding of unseen entity images into a knowledge graph serves as a semantic entity representation. We conduct experiments to demonstrate that the proposed methods can answer these visual-relational queries efficiently and accurately.

Multimodal Data Enhanced Representation Learning for Knowledge Graphs

[pdf] [repo]
Wang et al. (2019.06)
IJCNN'19
Knowledge graph, or knowledge base, plays an important role in a variety of applications in the field of artificial intelligence. In both research and application of knowledge graph, knowledge representation learning is one of the fundamental tasks. Existing representation learning approaches are mainly based on structural knowledge between entities and relations, while knowledge among entities per se is largely ignored. Though a few approaches integrated entity knowledge while learning representations, these methods lack the flexibility to apply to multimodalities. To tackle this problem, in this paper, we propose a new representation learning method, TransAE, by combining multimodal autoencoder with TransE model, where TransE is a simple and effective representation learning method for knowledge graphs. In TransAE, the hidden layer of autoencoder is used as the representation of entities in the TransE model, thus it encodes not only the structural knowledge, but also the multimodal knowledge, such as visual and textual knowledge, into the final representation. Compared with traditional methods based on only structured knowledge, TransAE can significantly improve the performance in the sense of link prediction and triplet classification. Also, TransAE has the ability to learn representations for entities out of knowledge base in zero-shot. Experiments on various tasks demonstrate the effectiveness of our proposed TransAE method.

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

[pdf] [repo] [github]
Guo et al. (2019.10)
ACM-MM'19
Image captioning attempts to generate a sentence composed of several linguistic words, which are used to describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we propose to explicitly model the object interactions in semantics and geometry based on Graph Convolutional Networks (GCNs), and fully exploit the alignment between linguistic words and visual semantic units for image captioning. Particularly, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a semantic (geometrical) interaction between two objects. Accordingly, the semantic (geometrical) context-aware embeddings for each unit are obtained through the corresponding GCN learning processors. At each time step, a context gated attention module takes as inputs the embeddings of the visual semantic units and hierarchically align the current word with these units by first deciding which type of visual semantic unit (object, attribute, or interaction) the current word is about, and then finding the most correlated visual semantic units under this type. Extensive experiments are conducted on the challenging MS-COCO image cpationing dataset, and superior results are reported when comparing to state-of-the-art approaches. The code is publicly available at https://github.com/ltguo19/VSUA-Captioning.

A New Benchmark and Approach for Fine-grained Cross-media Retrieval

[pdf] [repo] [github]
He et al. (2019.10)
ACM-MM'19
Cross-media retrieval is to return the results of various media types corresponding to the query of any media type. Existing researches generally focus on coarse-grained cross-media retrieval. When users submit an image of "Slaty-backed Gull" as a query, coarse-grained cross-media retrieval treats it as "Bird", so that users can only get the results of "Bird", which may include other bird species with similar appearance (image and video), descriptions (text) or sounds (audio), such as "Herring Gull". Such coarse-grained cross-media retrieval is not consistent with human lifestyle, where we generally have the fine-grained requirement of returning the exactly relevant results of "Slaty-backed Gull" instead of "Herring Gull". However, few researches focus on fine-grained cross-media retrieval, which is a highly challenging and practical task. Therefore, in this paper, we first construct a new benchmark for fine-grained cross-media retrieval, which consists of 200 fine-grained subcategories of the "Bird", and contains 4 media types, including image, text, video and audio. To the best of our knowledge, it is the first benchmark with 4 media types for fine-grained cross-media retrieval. Then, we propose a uniform deep model, namely FGCrossNet, which simultaneously learns 4 types of media without discriminative treatments. We jointly consider three constraints for better common representation learning: classification constraint ensures the learning of discriminative features for fine-grained subcategories, center constraint ensures the compactness characteristic of the features of the same subcategory, and ranking constraint ensures the sparsity characteristic of the features of different subcategories. Extensive experiments verify the usefulness of the new benchmark and the effectiveness of our FGCrossNet. The new benchmark and the source code of FGCrossNet will be made available at https://github.com/PKU-ICST-MIPL/FGCrossNet_ACMMM2019.

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

[pdf] [repo]
Huang et al. (2019.10)
ACM-MM'19
Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is a strong motivation in developing versatile algorithms to learn from large corpora with fewer annotations. In this paper, we propose a novel framework to leverage automatically extracted regional semantics from un-annotated images as additional weak supervision to learn visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between annotated and un-annotated portions of visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin for cross-modal retrieval tasks on the sparse Flickr30K and MS-COCO datasets. It is also worth noting that, despite using only 20% of the annotations, the proposed model can achieve competitive performance (Recall at 10 > 80.0% for 1K and > 70% for 5K text-to-image retrieval) compared to the benchmarks trained with the complete annotations.

Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding

[pdf] [repo]
Liu et al. (2019.10)
ACM-MM'19
Weakly supervised referring expression grounding (REG) aims at localizing the referential entity in an image according to linguistic query, where the mapping between the image region (proposal) and the query is unknown in the training stage. In referring expressions, people usually describe a target entity in terms of its relationship with other contextual entities as well as visual attributes. However, previous weakly supervised REG methods rarely pay attention to the relationship between the entities. In this paper, we propose a knowledge-guided pairwise reconstruction network (KPRN), which models the relationship between the target entity (subject) and contextual entity (object) as well as grounds these two entities. Specifically, we first design a knowledge extraction module to guide the proposal selection of subject and object. The prior knowledge is obtained in a specific form of semantic similarities between each proposal and the subject/object. Second, guided by such knowledge, we design the subject and object attention module to construct the subject-object proposal pairs. The subject attention excludes the unrelated proposals from the candidate proposals. The object attention selects the most suitable proposal as the contextual proposal. Third, we introduce a pairwise attention and an adaptive weighting scheme to learn the correspondence between these proposal pairs and the query. Finally, a pairwise reconstruction module is used to measure the grounding for weakly supervised learning. Extensive experiments on four large-scale datasets show our method outperforms existing state-of-the-art methods by a large margin.

Multimodal Dialog System: Generating Responses via Adaptive Decoders

[pdf] [repo]
Nie et al. (2019.10)
ACM-MM'19
On the shoulders of textual dialog systems, the multimodal ones, recently have engaged increasing attention, especially in the retail domain. Despite the commercial value of multimodal dialog systems, they will suffer from the following challenges: 1) automatically generate the right responses in appropriate medium forms; 2) jointly consider the visual cues and the side information while selecting product images; and 3) guide the response generation with multi-faceted and heterogeneous knowledge. To address the aforementioned issues, we present a Multimodal diAloG system with adaptIve deCoders, MAGIC for short. In particular, MAGIC first judges the response type and the corresponding medium form via understanding the intention of the given multimodal context. Hereafter, it employs adaptive decoders to generate the desired responses: a simple recurrent neural network (RNN) is applied to generating general responses, then a knowledge-aware RNN decoder is designed to encode the multiform domain knowledge to enrich the response, and the multimodal response decoder incorporates an image recommendation model with jointly considers the textual attributes and the visual images via a neural model optimized by the max-margin loss. We comparatively justify MAGIC over a benchmark dataset. Experiment results demonstrate that MAGIC outperforms the existing methods and achieves the state-of-the-art performance.

Multi-modal Knowledge-aware Hierarchical Attention Network for Explainable Medical Question Answering

[pdf] [repo]
Zhang et al. (2019.10)
ACM-MM'19
Online healthcare services can offer public ubiquitous access to the medical knowledge, especially with the emergence of medical question answering websites, where patients can get in touch with doctors without going to hospital. Explainability and accuracy are two main concerns for medical question answering. However, existing methods mainly focus on accuracy and cannot provide good explanation for retrieved medical answers. This paper proposes a novel Multi-Modal Knowledge-aware Hierarchical Attention Network (MKHAN) to effectively exploit multi-modal knowledge graph (MKG) for explainable medical question answering. MKHAN can generate path representation by composing the structural, linguistics, and visual information of entities, and infer the underlying rationale of question-answer interactions by leveraging the sequential dependencies within a path from MKG. Furthermore, a novel hierarchical attention network is proposed to discriminate the salience of paths endowing our model with explainability. We build a large-scale multi-modal medical knowledge graph and two real-world medical question answering datasets, the experimental results demonstrate the superior performance on our approach compared with the state-of-the-art methods.

2020 {#15}

Multimodal Sentiment Analysis based on Multi-head Attention Mechanism

[pdf] [repo]
Xi et al. (2020.01)
ICMLSC'20
Multimodal sentiment analysis is still a promising area of research, which has many issues needed to be addressed. Among them, extracting reasonable unimodal features and designing a robust multimodal sentiment analysis model is the most basic problem. This paper presents some novel ways of extracting sentiment features from visual, audio and text, furthermore use these features to verify the multimodal sentiment analysis model based on multi-head attention mechanism. The proposed model is evaluated on Multimodal Opinion Utterances Dataset (MOUD) corpus and CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) corpus for multimodal sentiment analysis. Experimental results prove the effectiveness of the proposed approach. The accuracy of the MOUD and MOSI datasets is 90.43% and 82.71%, respectively. Compared to the state-of-the-art models, the improvements of the performance are approximately 2 and 0.4 points.

MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding

[pdf] [repo]
Park et al. (2020.01)
WACV'20
Visual-semantic embedding enables various tasks such as image-text retrieval, image captioning, and visual question answering. The key to successful visual-semantic embedding is to express visual and textual data properly by accounting for their intricate relationship. While previous studies have achieved much advance by encoding the visual and textual data into a joint space where similar concepts are closely located, they often represent data by a single vector ignoring the presence of multiple important components in an image or text. Thus, in addition to the joint embedding space, we propose a novel multi-head self-attention network to capture various components of visual and textual data by attending to important parts in data. Our approach achieves the new state-of-the-art results in image-text retrieval tasks on MS-COCO and Flickr30K datasets. Through the visualization of the attention maps that capture distinct semantic components at multiple positions in the image and the text, we demonstrate that our method achieves an effective and interpretable visual-semantic joint space.

MULE: Multimodal Universal Language Embedding

[pdf] [repo]
Kim et al. (2020.02)
AAAI'20
Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of MULE on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 21.9% on a single-language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.

Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

[pdf] [repo]
Mai et al. (2020.02)
AAAI'20
Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction loss and classification loss. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

[pdf] [repo]
Zhang et al. (2020.03)
JSTSP/IEEE
Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in their input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modelling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality singal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of a broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

GAIA: A Fine-grained Multimedia Knowledge Extraction System

[pdf] [repo]
Li et al. (2020.07)
ACL'20
We present the first comprehensive, open source multimedia knowledge extraction system that takes a massive stream of unstructured, heterogeneous multimedia data from various and languages as input, and creates a coherent, structured knowledge base, indexing entities, relations, and events, following a rich, fine-grained ontology. Our system, GAIA, enables seamless search of complex graph queries, and retrieves multimedia evidence including text, images and videos. GAIA achieves top performance at the recent NIST TAC SM-KBP2019 evaluation. The system is publicly available at GitHub and DockerHub, with complete documentation.

MMEA: Entity Alignment for Multi-modal Knowledge Graph

[pdf] [repo]
Chen et al. (2020.08)
KSEM'20
Entity Alignment plays an essential role in the knowledge graph (KG) integration. Through large efforts have been made on exploring the association of relational embeddings between different knowledge graphs, they may fail to effectively describe and integrate the multi-modal knowledge in the real application scenario. To that end, in his paper, we propose a novel solution called Multi-Modal Entity Alignment (MMEA) to address the problem of entity alignment in a multi-modal view. Specifically, we first design a novel multi-modal knowledge embedding method to generate the entity representations of relational, visual and numerical knowledge, respectively. Along this line, multiple representations of different types of knowledge will be integrated via a multi-modal knowledge fusion module. Extensive experiments on two public datasets clearly demonstrate the effectiveness of the MMEA model with a significant margin compared with the state-of-the-art methods.

Construction of Multi-modal Chinese Tourism Knowledge Graph

[pdf] [repo]
Xie et al. (2020.08)
EasyChair
This paper proposes a construction process of Multi-modal Chinese Tourism Knowledge Graph (MCTKG). In order to maintain the semantic consistency of heterogeneous data sources, the tourism ontology. The construction process of MCTKG includes following modules: ontology construction, entity alignment, tourism route automatic generation, sharing platform establishment. In ontology construction, semi-automatic fine-tuning operations are carried out, which optimize and simplify the concept hierarchy relationship abstracted from the obtained data resources, and some new concept hierarchy relationships are manually defined according to the actual application scenarios. In entity alignment, we adopt a method based on n-gram distance to align entities from different sources, and fuse and cross-validate their attributes. In addition, based on the concepts of attraction and tourism style in the knowledge graph, we propose a tourism route generation algorithm, which could automatically schedule the tourism routes by incorporating the characteristics of attraction and tourists' demands. Lastly, a sharing platform is established, which provides open data access and query interface.

Multimodal Learning with Incomplete Modalities by Knowledge Distillation

[pdf] [repo]
Wang et al. (2020.08)
KDD'20
Multimodal learning aims at utilizing information from a variety of data modalities to improve the generalization performance. One common approach is to seek the common information that is shared among different modalities for learning, whereas we can also fuse the supplementary information to leverage modality-specific information. Through the supplementary information is often desired, mosting existing multimodal approaches can only learn from samples with complete modalities, which wastes a considerable amount of data collected. Otherwise, model-based imputation needs to be used to complete the missing values and yet may introduce undesired noise, especially when the sample size is limited. In this paper, we proposed a framework based on knowledge distillation, utilizing the supplementary information from all modalities, and avoiding imputation and noise associated with it. Specifically, we first train models on each modality independently using all the available data. Then the trained models are used as teachers to teach the student model, which is trained with the samples having complete modalities. We demonstrate the effectiveness of the proposed method in extensive empirical studies on both synthetic datasets and real-world datasets.

Fake News Detection via Knowledge-driven Multimodal Graph Convolutional Networks

[pdf] [repo]
Wang et al. (2020.10)
ICMR'20
Nowadays, with the rapid development of social media, there is a great deal of news produced every day. How to detect fake news automatically from a large multimedia posts has become very important for people, the government and news recommendation sites. However, most of the existing approaches either extract features from the text of the post which is a single modality or simply concatenate the visual features and textual features of a post to get a multimodal feature and detect fake news. Most of them ignore the background knowledge hidden in the text content of the post which facilitates fake news detection. To address these issues, we propose a novel Knowledge-drive Multimodal Graph Convolutional Network (KMGCN) to model teh semantic representations by jointly modeling the textual information, knowledge concepts and visual information into a unified framework for fake news detection. Instead of viewing text content as word sequences normally, we convert them into a graph, which can model non-consecutive phrases for better obtaining the composition of semantics. Besides, we not only convert visual information as nodes of graphs but also retrieve external knowledge from real-world knowledge graph as nodes of graphs to provide complementary semantics information to improve fake news detection. We utilize a well-designed graph convolutional network to extract the semantic representation of these graphs. Extensive experiments on two public real-world datasets illustrate teh validation of our approach.

Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations

[pdf] [repo]
Chiou et al. (2020.10)
arXiv
Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanism, it is believed that external visual common-sense knowledge is beneficial for reasoning visual relationships of objects in images, which is however rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language common-sense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection systems. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel module, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets. The source code will be publicly available soon.

Multimodal Knowledge Graph for Deep Learning Papers and Code

[pdf] [repo]
Kannan et al. (2020.10)
CIKM'20
Keeping up with the rapid growth of Deep Learning (DL) research is a daunting task. While existing scientific literature search systems provide text search capabilities and can identify similar papers, gaining an in-depth understanding of a new approach or an application is much more complicated. Many publications leverage multiple modalities to convey their findings and spread their ideas - they include pseudocode, tables, images and diagrams in addition to text, and often make publicly accessible their implementations. It is important to be able to represent and query them as well. We utilize RDF knowledge graphs (KGs) to represent multimodal information and enable expressive querying over modalities. In our demo we present an approach for extracting KGs from different modalities, namely text, architecture images, and source code. We show how graph queries can be used to get insights into different facets (modalities) of a paper, and its associated code implementation. Our innovation lies in the multimodal nature of the KG we create. While our work is of direct interest to DL researchers and practitioners, our approaches can also be leveraged in other scientific domains.

Multi-modal Knowledge Graphs for Recommender Systems

[pdf] [repo]
Sun et al. (2020.10)
CIKM'20
Recommender systems have shown great potential to solve the information explosion problem and enhance user experience in various online applications. To tackle data sparsity and cold start problems in recommender systems, researchers propose knowledge graphs (KGs) based recommendations by leveraging valuable external knowledge as auxiliary information. However, most of these works ignore the variety of data types (e.g., texts and images) in multi-modal knowledge graphs (MMKGs). In this paper, we propose Multi-modal Knowledge Graph Attention Network (MKGAT) to better enhance recommender systems by leveraging multi-modal knowledge. Specifically, we propose a multi-modal graph attention technique to conduct information propagation over MMKGs, and then use the resulting aggregated embedding representation for recommendation. To the best of our knowledge, this is the first work that incorporates multi-modal knowledge graph into recommender systems. We conduct extensive experiments on two real datasets from different domains, results of which demonstrate that our model MKGAT can successfully employ MMKGs to improve the quality of recommender system.

Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis

[pdf] [repo] [github]
Tsai et al. (2020.11)
EMNLP'20
The human language can be expressed through multiple sources of information known as modalities, including tones of voice, facial gestures, and spoken language. Recent multimodal learning with strong performances on human-centric tasks such as sentiment analysis and emotion recognition are often block-box, with very limited interpretability. In this paper we propose Multimodal Routing, which dynamically adjusts weights between input modalities and output representations differently for each input sample. Multimodal routing can identify relative importance of both individual modalities and cross-modality features. Moreover, the weight assignment by routing allows us interpret modality-prediction relationships not only globally (i.e. general trends over the whole dataset), but also locally for each single input sample, meanwhile keeping competitive performance compared to state-of-the-art methods.

Incorporating Multimodal Information in Open-Domain Web Keyphrase Extraction

[pdf] [repo] [github]
Wang et al. (2020.11)
EMNLP'20
Open-domain Keyphrase extraction (KPE) on the Web is a fundamental yet complex NLP task with a wide range of practical applications within the field of Information Retrieval. In contrast to other document types, web page designs are intended for easy navigation and information finding. Effective designs encode within the layout and formatting signals that point to where the important information can be found. In this work, we propose a modeling approach that leverages these multi-modal signals to aid the KPE task. In particular, we leverage both lexical and visual features (e.g. size, font, position) at the micro-level to enable effective strategy induction, and meta-level features that describe pages at a macro-level to aid in strategy selection. Our evaluation demonstrates that a combination of effective strategy induction and strategy selection within this approach for the KPE task outperforms state-of-the-art models. A qualitative post-hoc analysis illustrates how these features function within the model.

Deep Multimodal Fusion by Channel Exchanging

[pdf] [repo] [github]
Wang et al. (2020.12)
NeurIPS'20
Deep multimodal fusion by using multiple sources of data for classification or regression has exhibited a clear advantage over the unimodal counterpart on various applications. Yet, current methods including aggregation-based and alignment-based fusion are still inadequate in balancing the trade-off between inter-modal fusion and intra-modal processing, incurring a bottleneck of performance improvement. To this end, this paper proposes Channel-Exchanging-Network (CEN), a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities. Specifically, the channel exchanging process is self-guided by individual channel importance that is measured by the magnitude of Batch-Normalization (BN) scaling factor during training. The validity of such exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows our multimodal architecture to be almost as compact as a unimodal network. Extensive experiments on semantic segmentation via RGB-D data and image translation through multi-domain input verify the effectiveness of our CEN compared to current state-of-the-art methods. Detailed ablation studies have also been carried out, which provably affirm the advantage of each component we propose.

2021 {#20}

Visual Pivoting for (Unsupervised) Entity Alignment

[pdf] [repo]
Liu et al. (2021.02)
AAAI'21
This work studies the use of visual semantic representations to align entities in heterogeneous knowledge graphs (KGs). Images are natural components of many existing KGs. By combining visual knowledge with other auxiliary information, we show that the proposed new approach, EVA, creates a holistic entity representation that provides strong signals for cross-graph entity alignment. Besides, previous entity alignment methods require human labelled seed alignment, restricting availability. EVA provides a completely unsupervised solution by leveraging the visual similarity of entities to create an intial seed dictionary (visual pivots). Experiments on benchmark data sets DBP15k and DWY15k show that EVA offers state-of-the-art performance on both monolingual and cross-lingual entity alignment tasks. Furthermore, we discover that images are particularly useful to align long-tail KG entities, which inherently lack the structural contexts that are necessary for capturing the correspondences.

RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER

[pdf] [repo] [github]
Sun et al. (2021.02)
AAAI'21
Recently multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. However, most of the multimodal methods use attention mechanisms to extract visual clues regardless of whether the text and image are relevant. Practically, the irrelevant text-image pairs account for a large proportion in tweets. The visual clues that are unrelated to the texts will exert uncertain or even negative effects on multimodal model learning. In this paper, we introduce a method for text-image relation propagation into the multimodal BERT model. We integrate soft or hard gates to select visual clues and propose a multitask algorithm to train on the MNER datasets. In the experiments, we deeply analyze the changes in the visual attention before and after the use of text-image relation propagation. Our model achieves state-of-the-art performance on the MNER datasets. The source code is available online.

Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation

[pdf] [repo] [github]
Zhu et al. (2021.04)
EACL'21
One of the most challenging topics in Natural Language Processing (NLP) is visually-grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is such a task where an agent follows natural language instructions and navigates a real-life urban environment. Due to the lack of human-annotated instructions that illustrate intricate urban scenes, outdoor VLN remains a challenging task to solve. This paper introduces a Multimodal Text Style Transfer (MTST) learning approach and leverages external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of the instructions generated by Google Map API, the pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach significantly outperforms the baseline models on the outdoor VLN task, improving task completion rate by 8.7% relatively on the test set.

A New View of Multi-modal Language Analysis: Audio and Video Features as Text Styles

[pdf] [repo]
Sun et al. (2021.04)
EACL'21
Imposing the style of one image onto another is called style transfer. For example, the style of a Van Gogh painting might be imposed on a photograph to yield an interesting hybrid. This paper applies the adaptive normalization used for image style transfer to language semantics, i.e., the style is the way the words are said (tone of voice and facial expressions) and these are style-transferred onto the text. The goal is to learn richer representations for multi-modal utterances using style-transferred multi-modal features. The proposed Style-Transfer Transformer (STT) grafts a stepped styled adaptive layer-normalization onto a transformer network, the output from which is used in sentiment analysis and emotion recognition problems. In addition to achieving performance on par with the state-of-the-art (but using less than a third of the model parameters), we examine the relative contributions of each mode when used in the downstream applications.

Adaptive Fusion Techniques for Multimodal Data

[pdf] [repo] [github]
Sahu et al. (2021.04)
EACL'21
Effective fusion of data from multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. In this paper, we propose adaptive fusion techniques that aim to model context from different modalities effectively. Instead of defining a deterministic fusion operation, such as concatenation, for the network, we let the network decide how to combine a given set of multimodal features more effectively. We propose two networks: 1) Auto-Fusion, which learns to compress information from different modalities while preserving the context, and 2) GAN-Fusion, which regularizes the learned latent space given context from complementing modalities. A quantitative evaluation on the tasks of multimodal machine translation and emotion recognition suggests that our lightweight, adaptive networks can better model context from other modalities than existing methods, many of which employ massive transformer-based networks.

LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding

[pdf] [repo]
Xu et al. (2021.06)
ACL'21
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms strong baselines and archieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD, CORD, SROIE, Kleister-NDA, RVL-CDIP, and DocVQA

Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases

[pdf] [repo]
Cao et al. (2021.06)
ACL'21
Previous literatures show that pre-trained masked language models (MLMs) such as BERT can achieve competitive factual knowledge extraction performance on some datasets, indicating that MLMs can potentially be a reliable knowledge source. In this paper, we conduct a rigorous study to explore the underlying predicting mechanisms of MLMs over different extraction paradigms. By investigating the behaviors of MLMs, we find that previous decent performance mainly owes to the biased prompts which overfit dataset artifacts. Furthermore, incorporating illustrative cases and external contexts improve knowledge prediction mainly due to entity type guidance and golden answer leakage. Our findings shed light on the underlying predicting mechanisms of MLMs, and strongly question the previous conclusion that current MLMs can potentially serve as reliable factual knowledge bases.

KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation

[pdf] [repo]
Xing et al. (2021.06)
ACL'21
We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge from multimodal inputs of images and texts. We adapt the generative BART architecture to a multimodal model with visual and textual inputs. We further develop novel pretraining tasks to improve the model performance on the Visual Commonsense Generation (VCG) task. In particular, our pretraining task of Knowledge-based Commonsense Generation (KCG) boosts model performance on the VCG task by leveraging commonsense knowledge from a large langauge model pretrained on external commonsense knowledge graphs. To the best of our knowledge, we are the first to propose a dedicated task for improving model performance on the VCG task. Experimental results show that our model reaches state-of-the-art performance on the VCG task by applying these novel pretraining tasks.

GEM: A General Evaluation Benchmark for Multimodal Tasks

[pdf] [repo]
Su et al. (2021.06)
ACL'21
In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Different from existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME that mainly focus on natural langauge tasks, GEM is a large-scale vision-language benchmark, which consists of GEM-I for image-language tasks and GEM-V for video-language tasks. Comparing with existing multimodal datasets such as MSCOCO and Flicker30K for image-language tasks, YouCook2 and MSR-VTT for video-langauge tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.

KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA

[pdf] [repo]
Marino et al. (2021.06)
CVPR'21
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image. In this work we study open-domain knowledge, the setting when the knowledge required to answer a question in not given/annotated, neither at training nor test time. We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can be learned effectively from unsupervised language pretraining and supervised training data with transformer-based models. Second, explicit, symbolic knowledge encoded in knowledge bases. Our approach combines both-exploiting the powerful implicit reasoning of transformer models for answer prediction, and integrating symbolic representations from a knowledge graph, while never losing their explicit semantics to an implicit embedding. We combine disverse sources of knowledge to cover the wide variety of knowledge needed to solve knowledge-based questions. We show our approach, KRISP (Knowledge Reasoning with Implicit and Symbolic rePresentations), significantly outperforms state-of-the-art on OK-VQA, the largest available dataset for open-domain knowledge-based VQA. We show with extensive ablations that while our model successfully exploits implicit knowledge reasoning, the symbollic answer module which explicitly connects the knowledge graph to the answer vocabulary is critical to the performance of our method and generalizes to rare answers.

Amalgamating Knowledge from Heterogeneous Graph Neural Networks

[pdf] [repo]
Jing et al. (2021.06)
CVPR'21
In this paper, we study a novel knowledge transfer task in the domain of graph neural networks (GNNs). We strive to train a multi-talented student GNN, without accessing human annotations, that amalgamates knowledge from a couple of teacher GNNs with heterogeneous architectures and handling distinct tasks. The student derived in this way is expected to integrate the expertise from both teachers while maintaining a compact architecture. To this end, we propose an innovative approach to train a slimmable GNN that enables learning from teachers with varying feature dimensions. Meanwhile, to explicitly align topological semantics between the student and teachers, we introduce a topological attribution map (TAM) to highlight the structural saliency in a graph, based on which the student imitates the teachers' ways of aggregating information from neighbors. Experiments on seven datasets across various tasks, including multi-label classification and joint segmentation-classification, demonstrate that the learned student, with a lightweight architecture, achieves gratifying results on par with and sometimes even superior to those of the teachers in their specializations.

Multimodal Contrastive Training for Visual Representation Learning

[pdf] [repo]
Yuan et al. (2021.06)
CVPR'21
We develop an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity preservation objectives. Unlike existing visual pre-training methods, which solve a proxy prediction task in a single domain, our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlations simultaneously, hence improving the quality of learned visual representations. By including multimodal training in a unified framework with different types of contrastive losses, our method can learn moore powerful and generic visual features. We first train our model on COCO and evaluate the learned visual representations on various downstream tasks including image classificationo, object detection, and instance segmentation. For example, the visual representationos pre-trained ono COCO by our method achieve state-of-the-art top-1 validation accuracy of 55.3% on ImageNet classification, under the common transfer protocol. We also evaluate out method on the large-scale Stock images dataset and show its effectiveness on multi-label image tagging, and cross-modal retrieval tasks.

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

[pdf] [repo]
Wang et al. (2021.06)
CVPR'21
Weakly supervised phrase grouding aims at learning region-phrase correspondences using only image-sentence pairs. A major challenge thus lies in the missing links between image regions and sentence phrases during training. To address this challenge, we leverage a generic object detector at training time, and propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching. Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed. Importantly, our region-phrase score function is learned by distilling from soft matching scores between the detected object names and candidate phrases within an image-sentence pair, while the image-sentence score function is supervised by ground-truth image-sentence pair. The design of such score functions removes the need of object detection at test time, thereby significantly reducing the inference cost. Without bells and whistles, our approach achieves state-of-the-art results on visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.

Explicit Knowledge Incorporation for Visual Reasoning

[pdf] [repo]
Zhang et al. (2021.06)
CVPR'21
Existing explainable and explicit visual reasoning methods only perform reasoning based on visual evidence but do not take into account knowledge beyond that is in the visual sence. To address the knowledge gap between visual reasoning methods and the semantic complexity of real-world images, we present the first explicit visual reasoning method that incorporates external knowledge and models high-order relational attention for improved generalizability and explainability. Specifically, we propose a knowledge incorporation network that explicitly creates and includes new graph nodes for entities and predicates from external knowledge bases to enrich the semantics of the scene graph used in explicit reasoning. We then create a novel Graph-Relate module to perform high-order relational attention on the enriched scene graph. By explicitly introducing structured external knowledge and high-order relational attention, our method demonstrates significant generalizability and explainability over the state-of-the-art visual reasoning approaches on the GQA and VQAv2 datasets.

Is Visual Context Really Helpful for Knowledge Graph? A Representation Learning Perspective

[pdf] [repo]
Wang et al. (2021.10)
ACM-MM'21
Visual modality recently has aroused extensive attention in the fields of knowledge graph and multimedia because a lot of real-world knowledge is multi-modal in nature. However, it is currently unclear to what extent the visual modality can improve the performance of knowledge graph tasks over unimodal models, and equally treating structural and visual features may encode too much irrelevant information from images. In this paper, we probe the utility of the auxiliary visual context from knowledge graph representation learning perspective by designing a Relation Sensitive Multi-modal Embedding model, RSME for short. RSME can automatically encourage or filter the influence of visual context during the representation learning. We also examine the effect of different visual feature encoders. Experimental results validate the superiority of our approach compared to the state-of-the-art methods. On the basis of in-depth analysis, we conclude that under appropriate circumstances models are capable of leveraging the visual input to generate better knowledge graph embeddings and vice versa.

Interactive Machine Comprehension with Dynamic Knowledge Graphs

[pdf] [repo]
Yuan et al. (2021.11)
EMNLP'21
Interactive machine reading comprehension (iMRC) is machine comprehension tasks where knowledge sources are partially observable. An agent must interact with an environment sequentially to gather necessary knowledge in order to answer a question. We hypothesize that graph representations are good inductive biases, which can serve as an agent's memory mechanism in iMRC tasks. We explore four different categories of graphs that can capture text information at various levels. We describe methods that dynamically build and update these graphs during information gathering, as well as neural models to encode graph representations in RL agents. Extensive experiments on iSQuAD suggest that graph representations can result in significant performance improvements for RL agents.

BiQUE: Biquaternionic Embeddings of Knowledge Graphs

[pdf] [repo]
Guo et al. (2021.11)
EMNLP'21
Knowledge graph embeddings (KGEs) compactly encode multi-relational knowledge graphs (KGs). Existing KGE models rely on geometric operations to model relational patterns. Euclidean (circular) rotation is useful for modeling patterns such as symmetry, but cannot represent hierarchical semantics. In contrast, hyperbolic models are effective at modeling hierarchical relations, but do not perform as well on patterns on which curcular rotation excels. It is crucial for KGE models to unify multiple geometric transformations so as to fully cover the multifarious relations in KGs. To do so, we propose BiQUE, a novel model that employs biquaternions to integrate multiple geometric transformations, viz., scaling, translation, Euclidean rotation, and hyperbolic rotation. BiQUE makes the best trade-offs among geometric operators during training, picking the best one (or their best combination) for each relation. Experiments on five datasets show BiQUE's effectiveness.

Open Knowledge Graphs Canonicalization using Variational Autoencoders

[pdf] [repo]
Dash et al. (2021.11)
EMNLP'21
Noun phrases and Relation phrases in open knowledge graphs are not canonicalized, leading to an explosion of redundant and ambiguous subject-relation-object triples. Existing approaches to solve this problem take a two-step approach. First, they generate embedding representations for both noun and relation phrases, then a clustering algorithm is used to group them using the embeddings as features. In this work, we propose Canonicalizing Using Variational Autoencoders (CUVA), a joint model to learn both embeddings and cluster assignments in an end-to-end approach, which leads to a better vector representation for the noun and relation phrases. Our evaluation over multiple benchmarks shows that CUVA outperforms the existing state-of-the-art approaches. Moreover, we introduce CANONICNELL, a novel dataset to evaluate entity canonicalization systems.

End-to-End Entity Resolution and Question Answering Using Differentiable Knowledge Graphs

[pdf] [repo]
Oliya et al. (2021.11)
EMNLP'21
Recently, end-to-end (E2E) trained models for question answering over knowledge graphs (KGQA) have delivered promising results using only a weakly supervised dataset. However, these models are trained and evaluated in a setting where hand-annotated question entities are supplied to the model, leaving the important and non-trivial task of entity resolution (ER) outside the scope of E2E learning. In this work, we extend the boundaries of E2E learning for KGQA to include the training of an ER component. Our model only needs the question text and answer entities to train, and delivers a stand-alone QA model that does not require an additional ER component to be supplied during runtime. Our approach is fully differentiable, thanks to its reliance on a recent method for building differentiable KGs. We evaluate our E2E trained model on two public datasets and show that it comes close to baseline models that use hand-annotated entities.

Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT

[pdf] [repo]
Meng et al. (2021.11)
EMNLP'21
Infusing factural knowledge into pretrained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate our MoP with three biomedical BERTs (SciBERT, BioBERT, PubmedBERT) on six downstream tasks (inc. NLI, QA, Classification), and the results show that our MoP consistently enhances the underlying BERTs in task performance, and achieves new SOTA performances on five evaluated datasets.

2022 {#28}

Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective

[pdf] [repo]
Salin et al. (2022.01)
AAAI'22
In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that those models are able to pinpoint fine-grained multimodal differences. Finally, we also notice that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate experiments.

Multi-Modal Knowledge Graph Construction and Application: A Survey

[pdf] [repo]
Zhu et al. (2022.02)
IEEE
Recent years have witnessed the resurgence of knowledge engineering which is featured by the fast growth of knowledge graphs. However, most of existing knowledge graphs are represented with pure symbols, which hurts the machine's capability to understand the real world. The multi-modalization of knowledge graphs is an inevitable key step towards the realization of human-level machine intelligence. The results of this endeavor are Multi-Modal Knowledge Graphs (MMKGs). In this survey on MMKGs constructed systematically review the challenges, progresses and opportunities on the construcion and application of MMKGs respectively, with detailed analysis of the strength and weakness of different solutions. We finalize this survey with open research problems relevant to MMKGs.

Multimodal Continual Graph Learning with Neural Architecture Search

[pdf] [repo]
Cai et al. (2022.04)
WWW'22
Continual graph learning is rapidly emerging as an important role in a variety of real-world applications such as online product recommendation systems and social media. While achieving great success, existing works on continual graph learning ignore the information from multiple modalities (e.g., visual and textual features) as well as the rich dynamic structural information hidden in the ever-changing graph data and evolving tasks. However, considering multimodal continual graph learning with evolving topological structures poses great challenges: i) it is unclear how to incorporate the multimodal information into continual graph learning and ii) it is nontrivial to design models that can capture the structure-evolving dynamics in continual graph learning. To tackle these challenges, in this paper we propose a novel Multimodal Structure-evolving Continual Graph Learning (MSCGL) model, which continually learns both the model architecture and the corresponding parameters for Adaptive Multimodal Graph Neural Network (AdaMGNN). To be concrete, our proposed MSCGL model simultaneously takes social information and multimodal information into account to build the multimodal graphs. In order for continually adapting to new tasks without forgetting the old ones, our MSCGL model explores a new strategy with joint optimization of Neural Architecture Search (NAS) and Group Sparse Regularization (GSR) across different tasks. These two parts interact with each other reciprocally, where NAS is expected to explore more promising architecutres and GSR is in charge of preserving important information from the previous tasks. We conduct extensive experiments over two real-world multimodal continual graph scenarios to demonstrate the superiority of the proposed MSCGL model. Empirical experiments indicate that both the architectures and weight sharing across different tasks play important roles in affecting the model performances.

Domino: Discovering Systematic Errors with Cross-model Embeddings

[pdf] [repo]
Eyuboglu et al. (2022.04)
ICLR'22
Machine learning models that achieve high overall accuracy often make systematic errors on important subsets (or slices) of data. Identifying underperforming slice is particularly challenging when working with high-dimensional inputs (e.g. images, audio), where important slices are often unlableled. In order to address this issue, recent studies have proposed automated slice discovery methods (SDMs), which leverage learned model representations to mine input data for slices on which a model performs poorly. To be useful to a pracitioner, these methods must identify slices that are both underperforming and coherent (i.e. united by a human-understandable concept). However, no quantitative evaluation framework currently exists for rigorously assessing SDMs with respect to these criteria. Additionally, prior qualitative evaluations have shown that SDMs often identify slices that are incoherent. In this work, we address these challenges by first designing a principled evaluation framework that enables a quantitative comparison of SDMs across 1235 slices discovery settings in three input domains (natural images, medical images, and time-series data). Then, motivated by the recent development of powerful cross-modal representaton learning approaches, we present Domino, an SDM that leverages cross-modal embeddings and a novel error-aware mixture model to discover and describe coherent slices. We find that Domino accurately identifies 36% of the 1235 slices in our framework - a 12 percentage point improvement over prior methods. Further, Domino is the first SDM that can provide natural language descriptions of identified slices, correctly generating the exact name of the slice in 35% of settings.

WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types

[pdf] [repo] [dataset]
Wang et al. (2022.05)
ACL'22
Multimodal Entity Linking (MEL) which aims at linking mentions with multimodal contexts to the referent entities from a knowledge base (e.g., Wikipedia), is an essential task for many multimodal applications. Although much attention has been paid to MEL, the shortcomings of existing MEL datasets including limited contextual topics and entity types, simplified mention ambiguity, and restricted availability, have caused great obstacles to the research and application of MEL. In this paper, we present WikiDiverse, a high-quality human-annotated MEL dataset with diversified contextual topics and entity types from Wikines, which uses Wikipedia as the corresponding knowledge base. A well-tailored annotation procedure is adopted to ensure the quality of the dataset. Based on WikiDiverse, a sequence of well-designed MEL models with intra-modality and inter-modality attentions are implemented, which utilize the visual information of images more adaquately than existing MEL models do. Extensive experiments analyses are conducted to investigated the contributions of different modalities in terms of MEL, facilitating the future research on this task.

Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

[pdf] [repo]
Wu et al. (2022.05)
ACL'22
The ability to sequence unordered events is evidence of comprehension and reasoning about real world tasks/procedures. It is essential for applications such as task planning and multi-source instruction summarization. It often requires thorough understanding of temporal common sense and multimodal information, since these procedures are often conveyed by a combination of texts and images. While humans are capable of reasoning about and sequencing unordered procedural instructions, the extent to which the current machine learning methods process such capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequenceig unordered multimodal instructions by curating datasets from online instructional manuals and collecting comprehensive human annotations. We find current state-of-the-art models not only perform significantly worse than humans but also seem incapable of efficiently utilizing multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequence-aware pretraining techniques exploiting the sequential alignment properties of both texts and images, resulting in >5% improvements on perfect match ratio.

Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension

[pdf] [repo]
Zhang et al. (2022.05)
ACL'22
Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step. Comprehending PMDs and inducing their representations for the downstream reasoning tasks is designated as Procedural MultiModal Machine Comprehension (M3C). In this study, we approach Procedural M3C at a fine-grained level (compared with existing explorations at a document or sentence level), that is, entity. With delicate consideration, we model entity both in its temporal and cross-modal relation and propose a novel Temporal-Modal Entity Graph (TMEG). Specifically, a heterogeneous graph structure is formulated to capture textual and visual entities and trace their temporal-modal evolution. In addition, a graph aggregation module is introduced to conduct graph encoding an reasoning. Comprehensive experiments across three Procedural M3C tasks are conducte on a traditional dataset RecipeQA and our new dataset CraftQA, which can better evaluate the generalization of TMEG.

Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion

[pdf] [repo] [github]
Chen et al. (2022.05)
SIGIR'22
Multimodal Knowledge Graphs (MKGs), which organize visual-text factual knowledge, have recently been successfully applied to tasks such as information retrieval, question answering, and recommendation system. Since most MKGs are far from complete, extensive knowledge graph completion studies have been proposed focusing on the multimodal entity, relation extraction and link prediction. However, different tasks and modalities require changes to the model architecture, and not all images/objects are relevant to text input, which hinders the applicability to diverse real-world scenarios. In this paper, we propose a hybrid transformer with multi-level fusion to address those issues. Specifically, we leverage a hybrid transformer architecture with unified input-output for diverse multimodal knowledge graph completion tasks. Moreover, we propose multi-level fusion, which integrates visual and text representation via coarse-grained prefix-guided interaction and fine-grained correlation-aware fusion modules. We conduct extensive experiments to validate that our MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.

Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction

[pdf] [repo] [github]
Chen et al. (2022.05)
NAACL'22
Multimodal named entity recognition and relation extraction (MNER and MRE) is a fundamental and crucial branch in information extraction. However, existing approaches for MNER and MRE usually suffer from error sensitivity when irrelevant object images incorporated in texts. To deal with these issues, we propose a novel Hierarchical Visual Prefix fusion Network (HVPNeT) for visual-enhanced entity and relation extraction, aiming to achieve more effective and robust performance. Specifically, we regard visual representation as pluggable visual prefix to guide the textual representation for error insensitive forecasting decision. We further propose a dynamic gated aggregation strategy to achieve hierarchical multi-scaled visual features as visual prefix for fusion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, and achieve state-of-the-art performance.

ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

[pdf] [repo]
Wang et al. (2022.05)
NAACL'22
Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most of the work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality and are not aligned in the same space. As text representations take the most important role in MNER, in this paper, we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions and optical characters as visual contexts, concatenates them with the input texts are a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities since they are both represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal input and textual input views so that the MNER model can be more practical in dealing with text-only inputs and robust to noises from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering

[pdf] [repo]
Ding et al. (2022.06)
CVPR'22
Knowledge-based visual question answering requires the ability of associating external knowledge for open-ended cross-modal scene understanding. One limitation of existing solutions is that hey capture relevant knowledge from text-only knowledge bases, which merely contain facts expressed by first-order predicates or language descriptions while lacking complex but indispensable multimodal knowledge for visual understanding. How to construct vision-relevant and explainable multimodal knowledge for the VQA scenario has been less studied. In this paper, we propose MuKEA to represent multimodal knowledge by an explicit triplet to correlate visual objects and fact answers with implicit relations. To bridge the heterogeneous gap, we propose three objective losses to learn the triplet representations from complementary views: embedding structure, topological relation and semantic space. By adopting a pre-training and fine-tuning learning strategy, both basic and domain-specific multimodal knowledge are progressively accumulated for answer prediction. We outperform the state-of-the-art by 3.35% and 6.08% respectively on two challenging knowledge-required datasets: OK-VQA and KRVQA. Experimental results prove the complementary benefits of the multimodal knowledge with existing knowledge bases and the advantages of our end-to-end framework over the existing pipeline methods.

WebQA: Multihop and Multimodal QA

[pdf] [repo]
Chang et al. (2022.06)
CVPR'22
Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches, requires fundamental advances in visual representation learning, knowledge aggregation, and langauge generation. In this work, we introduce WebQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models which lack language groundable visual representations for novel objects and the ability to reason, yet trieval for humans. WebQA mirrors the way humans use the web: 1) Ask a question, 2) Choose sources to aggregate, and 3) Produce a fluent langauge response. This is the behavior we should be expecting from IoT devices and digital assistants. Existing work prefers to assume that a model can either reason about knowledge in images or in text. WebQA includes a secondary text-only QA task to ensure improved visual performance does not comesat the cost of language understanding. Our challenge for the community is to create unified multimodal reasoning models that answer questions regardless of the source modality, moving us closer to digital assistants that not only query language knowledge but also the richer visual online world.

Are Multimodal Transformers Robust to Missing Modality?

[pdf] [repo]
Ma et al. (2022.06)
CVPR'22
Multimodal data collected from the real world are often imperfect due to missing modalities. Therefore multimodal models that are robust against modal-incomplete data are highly preferred. Recently, Transformer models have shown great success in processing multimodal data. However, existing work has been limited to either architecture designs or pre-training strategies; whether Transformer models are naturally robust against missing-modal data has rarely been investigated. In this paper, we present the first-of-its-kind work to comprehensively investigate the behavior of Transformers in the presence of modal-incomplete data. Unsurprising, we find Transformer models are sensitive to missing modalities while different modal fusion strategies will significantly affect the robustness. What surprised us is that the optimal fusion strategy is dataset dependent even for the same Transformer model; there does not exist a universal strategy that works in general cases. Based on these findings, we propose a principle method to improve the robustness of Transformer models by automatically searching for an optimal fusion strategy regarding input data. Experimental validations on three benchmarks support the superior performance of the proposed method.

Multi-modal Siamese Network for Entity Alignment

[pdf] [repo] [github]
Chen et al. (2022.08)
KDD'22
The booming of multi-modal knowledge graphs (MMKGs) has raised the imperative demand for multi-modal entity alignment techniques, which faciliate the integration of multiple MMKGs from separate data sources. Unfortunately, prior arts harness multi-modal knowledge only via the heuristic merging of uni-modal feature embeddings. Therefore, inter-modal cues concealed in multi-modal knowldge could be largely ignored. To deal with that problem, in this paper, we propose a novel Multi-modal Siamese Network for Entity Alignment (MSNEA) to align entities in different MMKGs, in which multi-modal knowledge could be comprehensively leveraged by the exploitation of inter-modal effect. Specifically, we first devise a multi-modal knowledge embedding module to extract visual, relational, and attribute features of entities to generate holistic entity representations for distinct MMKGs. During this procedure, we employ inter-modal enhancement mechanisms to integrate visual features to guide relational feature learning and adaptively assign attention weights to capture valuable attributes for alignment. Afterwards, we design a multi-modal contrastive learning module to achieve inter-modal enhancement fusion with avoiding the overwhelming impact of weak modalities. Experimental results on two public datasets demonstrate that our proposed MSNEA provides state-of-the-art performance with a large margin compared with competitive baselines.

Cross-modal Knowledge Graph Contrastive Learning for Machine Learning Method Recommendation

[pdf] [repo]
Cao et al. (2022.10)
ACM-MM'22
The explosive growth of machine learning (ML) methods is overloading users with choices for learning tasks. Method recommendation aims to alleviate this problem by selecting the most appropriate ML methods for given learning tasks. Recent research shows that the descriptive and structural information of the knowledge graphs (KGs) can significantly enhance the performance of ML method recommendation. However, existing studies have not fully explored the descriptive information in KGs, nor have they effectively exploited the descriptive and structural information to provide the necessary supervision. To address these limitations, we distinguish descriptive attributes from the traditional relationships in KGs with the rest as structural connections to expand the scope of KG descriptive information. Based on this insight, we propose the Cross-modal Knowledge Graph Contrastive (CKGC) learning approach, which regards information from descriptive attributes and structural connections as two modalities, learning informative node representations by maximizing the agreement between the descriptive view and the structural view. Through extensive experiments, we demonstrate that CKGC significantly outperforms the state-of-the-art baselines, achieving around 2% higher accurate click-through-rate (CTR) prediction, over 30% more accurate top-10 recommendation, and over 50% more accurate top-20 recommendation compared to the best performing existing approach.

Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition

[pdf] [repo]
Jia et al. (2022.10)
ACM-MM'22
Multimodal named entity recognition (MNER) is a vision-language task where the system is required to detect entity spans and corresponding entity types given a sentence-image pair. Existing methods capture text-image relations with various attention mechanisms that only obtain implicit alignments between entity types and image regions. To locate regions more accurately and better model cross-/within-modal relations, we propose a machine reading comprehension based framework for MNER, namely MRC-MNER. By utilizing queries in MRC, our framework can provide prior information about entity types and image regions. Specifically, we design two stages, Query-Guided Visual Grounding and Multi-Level Modal Interaction, to align fine-grained type-region information and simulate text-image/inner-text interactions respectively. For the former, we train a visual grounding model via transfer learning to extract region candidates that can be further integrated into the second stage to enhance token representations. For the latter, we design text-image and inner-text interaction modules along with three sub-tasks for MRC-MNER. To verify the effectiveness of our model, we conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MRC-MNER outperforms the current state-of-the-art moedls on Twitter2017, and yields competitive results on Twitter2015.

Relation-enhanced Negative Sampling for Multimodal Knowledge Graph Completion

[pdf] [repo]
Xu et al. (2022.10)
ACM-MM'22
Knowledge Graph Completion (KGC), aiming to infer the missing part of Knowledge Graphs (KGs), has long been treated as a crucial task to support downstream applications of KGs, especially for the multimodal KGs (MKGs) which suffer the incomplete relations due to the insufficient accumulation of multimodal corpus. Though a few research attentions have been paid to the completion task of MKGs, there is still a lack of specially designed negative sampling strategies tailored to MKGs. Meanwhile, though effective negative sampling strategies have been widely regarded as a crucial solution for KGC to alleviate the vanishing gradient problem, we realize that, there is a unique challenge for negative sampling in MKGs about how to model the effect of KG relations during learning the complementary semantics among multiple modalities as an extra context. In this case, traditional negative sampling techniques which only consider the structural knowledge may fail to deal with the multimodal KGC task. To that end, in this paper, we propose a MultiModal Relation-enhanced Negative Sampling (MM-RNS) framework for multimodal KGC task. Especially, we design a novel knowledge-guided cross-modal attention (KCA) mechanism, which provides bi-directional attention for visual & textual features via integrating relation embedding. Then, an effective contrastive semantic sampler is devised after consolidating the KCA mechanism with contrastive learning. In this way, a more similar representation of semantic features between positive samples, as well as a more diverse representation between negative samples under different relations could be learned. Afterwards, a masked gumbel-softmax optimization mechanism is utilized for solving the non-differentiability of sampling process, which provides effective parameter optimization compared with traditional sample strategies. Extensive experiments on three multimodal KGs demonstrate that our MMRNS framework could significantly outperform the state-of-the-art baseline methods, which validates the effectiveness of relation guides in multimodal KGC task.

Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER

[pdf] [repo]
Zhao et al. (2022.10)
ACM-MM'22
Multimodal Named Entity Recognition (MNER) aims to locate and classify named entities mentioned in a (text, image) pair. However, dominant work independently models the internal matching relations in a pair of image and text, ignoring the external matching relations between different (text, image) pairs inside the dataset, though such relations are crucial for alleviating image noise in MNER task. In this paper, we primarily explore two kinds of external matching relations between different (text, image) pairs, i.e., inter-modal relations and intra-modal relations. On this basis, we propose a Relation-enhanced Graph Convolutional Network (R-GCN) for the MNER task. Specifically, we first construct an inter-modal relation graph and an intra-modal relation graph to gather the image informaton most relevant to the current text and image from the dataset, respectively. And then, multimodal interaction and fusion are leveraged to predict the NER label sequences. Extensive experimental results show that our model consistently outperforms state-of-the-art works on two public datasets.

OTKGE: Multi-modal Knowledge Graph Embeddings via Optimal Transport

[pdf] [repo]
Cao et al. (2022.11)
NeurIPS'22
Multi-modal knowledge graph embeddings (KGE) have caught more and more attention in learning representations of entities and relations for link prediction tasks. Different from previous uni-modal KGE approaches, multi-modal KGE can leverage expressive knowledge from a wealth of modalities (image, text, etc.), leading to more comprehensive representations of real-world entities. However, the critical challenge along this course lies in that the multi-modal embedding spaces are usually heterogeneous. In this sense, direct fusion will destroy the inherent spatial structure of different modal embeddings. To overcome this challenge, we revisit multi-modal KGE from a distributional alignment perspective and propose optimal transport knowledge graph embeddings (OTKGE). Specifically, we model the multi-modal fusion procedure as a transport plan moving different modal embeddings to a unified space by minimizing the Wasserstein distance between multi-modal distributions. Theoretically, we show that by minimizing the Wasserstein distance between the individual modalities and the unified embedding space, the final results are guaranteed to maintain consistency and comprehensiveness. Moreover, experimental results on well-established multi-modal knowledge graph completion benchmarks show that our OTKGE achieves state-of-the-art performance.

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching

[pdf] [repo]
Kim et al. (2022.11)
NeurIPS'22
Despite surprising performance on zero-shot transfer, pre-training a large-scale multimodal model is often prohibitive as it requires a huge amount of data and computing resources. In this paper, we propose a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP-ViT) into a small target model (e.g., ResNet-18). For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model by matching the relative similarity distribution across text prompt embeddings. To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts. Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%).

REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

[pdf] [repo]
Lin et al. (2022.11)
NeurIPS'22
This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that using regional information in a better way can significantly improve the performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA even though these two tasks share the common spirit, i.e., rely on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual features are extracted either from the whole image or in a sliding window manner for retrieving knowledge, and the important relationship within/among object regions is neglected; 2) visual features are not well utilized in the final answering model, which is counter-intuitive to some extent. Based on these observations, we propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions not only in the knowledge retrieval stage but also in the answering model. The key motivation is that object regions and inherent relationship are important for knowledge-based VQA. We perform extensive experiments on the standard OK-VQA dataset and achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin (+3.6%). We also conduct detailed analysis and show the necessity of regional information in different framework components for knowledge-based VQA.

MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching

[pdf] [repo]
Huang et al. (2022.11)
NeurIPS'22
Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which are trained on millions or billions of paired images and texts. Different from them, this paper studies a new scenario as unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model training. To deal with this, we propose a simple yet effective method namely Multimodal Aligned Conceptual Knowledge (MACK), which is inspired by the knowledge use in human brain. In can be directly used as general knowledge to correlate images and texts even without model training, or further fine-tuned based on unpaired images and texts to better generalize to certain datasets. In addition, we extend it as a re-ranking method, which can be easily combined with existing image-text matching models to substantially improve their performance.

Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

[pdf] [repo]
Liang et al. (2022.11)
NeurIPS'22
We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and texts) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model intialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separated by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness.

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

[pdf] [repo]
Lu et al. (2022.11)
NeurIPS'22
When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (SCIENCEQA), a new benchmark that consists of ~21k multimodal multiple choice questions with diverse science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain-of-thought (CoT) to mimic the multi-hop reasoning process when answering SCIENCEQA questions. SCIENCEQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input: we observe that it improves the few-shot performance by GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.

Contrastive Language-Image Pre-Training with Knowledge Graphs

[pdf] [repo]
Pan et al. (2022.11)
NeurIPS'22
Recent years have witnessed the fast development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performances when transferred to downstream tasks. Nevertheless, existing approaches mainly focus on pre-training with simple image-text pairs, while neglecting the semantic connections between concepts from different modalities. In this paper, we propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model. Through introducing knowledge-based objectives in the pre-training process and utilizing different types of knowledge graphs as training data, our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning alibility across scenarios and modalities. Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines.

Rethinking Knowledge Graph Evaluation Under the Open-World Assumption

[pdf] [repo]
Yang et al. (2022.11)
NeurIPS'22
Most knowledge graphs (KGs) are incomplete, which motivates one important research topic on automatically complementing knowledge graphs. However, evaluation of knowledge graph completion (KGC) models often ignores the incompleteness - facts in the test set ranked against all unknown triplets which may contain a large number of missing facts not included in the KG yet. Treating all unknown triplets as false is called the closed-world assumption. This closed-world assumption might negatively affect the fairness and consistency of the evaluation metrics. In this paper, we study KGC evaluation under a more realistic setting, namely the open-world assumption, where unknown triplets are considered to include many missing facts not included in the training or test sets. For the currently most used metrics such as mean reciprocal rank (MRR) and Hits@K, we point out that their behavior may be unexpected under the open-world assumption. Spcifically, with not many missing facts, their numbers show a logarithmic trend with respect to the true strength of the model, and thus, the metric increase could be insignificant in terms of reflecting the true model improvement. Further, considering the variance, we show that the degradation in the reported numbers may result in incorrect comparisons between different models, where stronger models may have lower metric numbers. We validate the phenomenon both theoretically and experimentally. Finally, we suggest possible causes and solutions for this problem.

MoSE: Modality Split and Ensemble for Multimodal Knowledge Graph Completion

[pdf] [repo] [github]
Zhao et al. (2022.12)
EMNLP'22
Multimodal knowledge graph completion (MKGC) aims to predict missing entities in MKGs. Previous works usually share relation representation across modalities. This results in mutual interference between modalities during training, since for a pair of entities, the relation from one modality probably contradicts that from another modality. Furthermore, making a unified prediction based on the shared relation representation treats the input in different modalities equally, while their importance to the MKGC task should be different. In this paper, we propose MoSE, a Modality Split representation learning and Ensemble inference framework for MKGC. Specifically, in the training phase, we learn modality-split relation embeddings for each modality instead of a single modality-shared one, which alleviates the modality interference. Based on these embeddings, in the inference phase, we first make modality-split predictions and then exploit various ensemble methods to combine the predictions with different weights, which models the modality importance dynamically. Experimental results on three KG datasets show that MoSE outperforms state-of-the-art MKGC methods.

A Span-based Multimodal Variational Autoencoder for Semi-supervised Multimodal Named Entity Recognition

[pdf] [repo] [github]
Zhou et al. (2022.12)
EMNLP'22
Multimodal named entity recognition (MNER) on social media is a challenging task which aims to extract named entities in free text and incorporate images to classify them into user-defined types. The existing semi-supervised named entity recognition methods focus on the text modal and are utilized to reduce labeling costs in traditional NER. However, the previous methods are not efficient for semi-supervised MNER. Because the MNER task is defined to combine the text information with image one and needs to consider the mis-match between the posted text and image. To fuse the text and image features for MNER effectively under semi-supervised setting, we propose a novel span-based multimodal variational autoencoder (SMVAE) model for semi-supervised MNER. The proposed method exploits modal-specific VAEs to model text and image latent features, and utilizes product-of-experts to acquire multimodal features. In our approach, the implicit relations between labels and multimodal features are modeled by multimodal VAE. Thus, the useful information of unlabeled data can be exploited in our method under semi-supervised setting. Experimental results on two benchmark datasets demonstrate that our apporach not only outperforms baselines under supervised setting, but also improves MNER performance with less labeled data than existing semi-supervised methods.

2023 {#20}

MKVSE: Multimodal Knowledge Enhanced Visual-Semantic Embedding for Image-Text Retrieval

[pdf] [repo] [github]
Feng et al. (2023.02)
TOMM'23
Image-text retrieval aims to take the text (image) query to retrieve the semantically relevant images (texts), which is fundamental and critical in the search system, online shopping, and social network. Existing works have shown the efectiveness of visual-semantic embedding and unimodal knowledge exploiting (e.g. textual knowledge) in connecting the image and text. However, they neglect the implicit multimodal knowledge relations between these two modalities when the image contains information that is not directly described in the text, hindering the ability to connect the image and text with the implicit semantic relations. For instance, an image shows a person next to the "tap" but the pairing text description may only include the word "wash", missing the washing tool "tap". The implicit semantic relation between image object "tap" and text word "wash" can help to connect the above image and text. To suiciently utilize the implicit multimodal knowledge relations, we propose a Multimodal Knowledge enhanced Visual-Semantic Embedding (MKVSE) approach building a multimodal knowledge graph to explicitly represent the implicit multimodal knowledge relations and injecting it to visual-semantic embedding for image-text retrieval task. The contributions in this paper can be summarized as follows: (1) Multimodal Knowledge Graph (MKG) is proposed to explicitly represent the implicit multimodal knowledge relations between the image and text as intra-modal semantic relations and inter-modal co-occurrence relations. Intra-modal semantic relations provide synonymy information that is implicit in the unimodal data such as the text corpus. And inter-modal co-occurrence relations characterize the co-occurrence correlations (such as temporal, causal, and logical) which are implicit in image-text pairs. These two relations help establishing reliable image-text connections in the higher-level semantic space. (2) Multimodal Graph Convolution Networks (MGCN) is proposed to reason on the MKG in two steps to suiciently utilize the implicit multimodal knowledge relations. In the irst step, MGCN focuses on the intra-modal relations to distinguish other entities in the semantic space. In the second step, MGCN focuses on the inter-modal relations to connect multimodal entities based on co-occurrence correlations. The two-step reasoning manner can suiciently utilize the implicit semantic relations between two modal entities to enhance the embeddings of the image and text. Extensive experiments are conducted on two widely-used datasets, namely Flickr30K and MSCOCO, to demonstrate the superiority of the proposed MKVSE approach in achieving state-of-the-art performances.

Learning Representations of Bi-level Knowledge Graph for Reasoning beyond Link Prediction

[pdf] [repo]
Chung et al. (2023.02)
AAAI'23
Knowledge graphs represent known facts using triplets. While existing knowledge graph embedding methods only consider the connections between entities, we propose considering the relationships between triplets. For example, let us consider two triplets $T_1$ and $T_2$ where $T_1$ is (Academy_Awards, Nominates, Avatar) and $T_2$ is (Avatar, Wins, Academy_Awards). Given these two base-level triplets, we see that $T_1$ is a prerequisite for $T_2$. In this paper, we define a higher-level triplet to represent a relationship between triplets, e.g., <$T_1$, PrerequisiteFor, $T_2$> where PrerequisiteFor is a higher-level relation. We define a bi-level knowledge graph that consists of the base-level and the higher-level triplets. We also propose a data augmentation strategy based on the random walks on the bi-level knowledge graph to augment plausible triplets. Our model called BiVE learns embeddings by taking into account the structures of the base-level and the higher-level triplets, with additional consideration of the augmented triplets. We propose two new tasks: triplet prediction and conditional link prediction. Given a triplet $T_1$ and a higher-level relation, the triplet prediction predicts a triplet that is likely to be connected to $T_1$ by the higher-level relation, e.g., <$T_1$, PrerequisiteFor, ?>. The conditional link prediction predicts a missing entity in a triplet conditioned on another triplet, e.g., <$T_1$, PrerequsiteFor, (Avatar, Wins, ?)>. Experimental results show that BiVE significantly outperforms all other methods in the two new tasks and the typical base-level link prediction in real-world bi-level knowledge graphs.

Enabling Knowledge Refinement upon New Concepts in Abductive Learning

[pdf] [repo]
Huang et al. (2023.02)
AAAI'23
Recently there are great efforts on leveraging machine learning and logical reasoning. Many approaches start from a given knowledge base, and then try to utilize the knowledge to help machine learning. In real practice, however, the given knowledge base can often be incomplete or even noisy, and thus, it is crucial to develop the ability of knowledge refinement or enhancement. This paper proposes to enable the Abductive Learning (ABL) paradigm to have the ability of knowledge refinement/enhancement. In particular, we focus on the problem that, in contrast to closed-environment tasks where a fixed set of symbols are enough to represent the concepts in the domain, in open-environment tasks new concepts may emerge. Ignoring those new concepts can lead to significant performance decay, whereas it is challenging to identify new concepts and add them to the existing knowledge base with potential conflicts resolved. We propose the ABL_nc approach which exploits machine learning in ABL to identify new concepts from data, exploits knowledge graph to match them with entities, and refines existing knowledge base to resolve conflicts. The refined/enhanced knowledge base can then be used in the next loop of ABL and help improve the performance of machine learning. Experiments on three neuro-symbolic learning tasks verified effectiveness of the proposed approach.

Multi-Modal Knowledge Hypergraph for Diverse Image Retrieval

[pdf] [repo]
Zeng et al. (2023.02)
AAAI'23
The task of keyword-based diverse image retrieval has received considerable attention due to its wide demand in real-world scenarios. Existing methods either rely on a multi-stage re-ranking strategy based on human design to diversify results, or extend sub-semantics via an implicit generator, which either relies on manual labor or lacks explainability. To learn more diverse and explainable representations, we capture sub-semantics in an explicit manner by leveraging the multi-modal knowledge graph (MMKG) that contains richer entities and relations. However, the huge domain gap between the off-the-shelf MMKG and retrieval datasets, as well as the semantic gap between images and texts, make the fusion of MMKG difficult. In this paper, we pioneer a degree-free hypergraph solution that models many-to-many relations to address the challenge of heterogeneous sources and heterogenous modalities. Specifically, a hyperlink-based solution, Multi-Modal Knowledge Hyper Graph (MKHG) is proposed, which bridges heterogeneous data via various hyperlinks to diversify sub-semantics. Among them, a hypergraph construction module first customizes various hyperedges to link the heterogeneous MMKG and retrieval databases. A multi-modal instance bagging module then explicitly selects instances to diversify the semantics. Meanwhile, a diverse concept aggregator flexibly adapts key sub-semantics. Finally, several losses are adopted to optimize the semantic space. Extensive experiments on two real-world datasets have well verified the effectiveness and explainability of our proposed method.

Attribute-Consistent Knowledge Graph Representation Learning for Multi-Modal Entity Alignment

[pdf] [repo]
Li et al. (2023.04)
WWW'23
The multi-modal entity alignment (MMEA) aims to find all equivalent entity pairs between multi-modal knowledge graphs (MMKGs). Rich attributes and neighboring entities are valuable for the alignment task, but existing works ignore contextual gap problems that the aligned entities have different numbers of attributes on spcific modality when learning entity representations. In this paper, we propose a novel attribute-consistent knowledge graph representation learning framework for MMEA (ACK-MMEA) to compensate the contextual gaps through incorporating consistent alignment knowledge. Attribute-consistent KGs (ACKGs) are first constructed via multi-modal attribute uniformization with merge and generate operators so that each entity has one and only one uniform feature in each modality. The ACKGs are then fed into a relation-aware graph neural network with random dropouts, to obtain aggregated relation representations and robust entity representations. In order to evaluate the ACK-MMEA facilitated for entity alignment, we specially design a joint alignment loss for both entity and attribute evaluation. Extensive experiments conducted on two benchmark datasets show that our approach achieves excellent performance compared to its competitors.

CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

[pdf] [repo]
Yao et al. (2023.04)
WWW'23
Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g. multimodal retrieval and recommendation. However, existing models suffer from the problem of generating "over-generic" descriptions, such as their tendency to generate repetitive sentences with common concepts for different images. These generic descriptions fail to provide sufficient textual semantics for ever-changing web images. Inspired by the recent success of Vision-Language Pre-training (VLP) models that learn diverse image-text concept alignment during pretraining, we explore leveraging their cross-modal pre-trained knowledge to automatically enrich the textual semantics of image descriptions. With no need for additional human annotations, we propose a plug-and-play framework, i.e. CapEnrich, to complement the generic image descriptions with more semantic details. Specifically, we first propose an automatic data-building strategy to get desired training sentences, based on which we then adopt prompting strategies, i.e. learnable and template prompts, to incentivize VLP models to generate more textual details. For learnable templates, we fix the whole VLP model and only tune the prompt vectors, which leads to two advantages: 1) the pre-training knowledge of VLP model can be reserved as much as possible to describe diverse visual concepts; 2) only lightweight trainable parameters are required, so it is friendly to low data resources. Extensive experiments show that our method significantly improves the descriptiveness and diversity of generated sentences for web images.

Structure Pretraining and Prompt Tuning for Knowledge Graph Transfer

[pdf] [repo]
Zhang et al. (2023.04)
WWW'23
Knowledge graphs (KG) are essential background knowledge provides in many tasks. When designing models for KG-related tasks, one of the key tasks is to devise the Knowledge Representation and Fusion (KRF) module that learns the representation of elements from KGs and fuses them with task representations. While due to the difference of KGs and perspectives to be considered during fusion across tasks, duplicate and ad hoc KRF modules design are conducted among tasks. In this paper, we propose a novel knowledge graph pretraining model KGTransformer that could serve as a uniform KRF module in diverse KG-related tasks. We pretrain KGTransformer with three self-supervised tasks with sampled sub-graphs as input. For utilization, we propose a general prompt-tuning mechanism regarding task data as a triple prompt to allow flexible interactions between task KGs and task data. We evaluate pretrained KGTransformer on three tasks, triple classification, zero-shot image classification, and question answering. KGTransformer consistently achieves better results than specifically designed task models. Through experiments, we justify that the pretrained KGTransformer could be used off the shelf as a general and effective KRF module across KG-related tasks.

MMKGR: Multi-hop Multi-modal Knowledge Graph Reasoning

[pdf] [repo]
Zheng et al. (2023.04)
ICDE'23
Multi-modal knowledge graphs (MKGs) include not only the relation triplets, but also related multi-modal auxiliary data (i.e., texts and images), which enhance the diversity of knowledge. However, the natural incompleteness has significantly hindered the applications of MKGs. To tackle the problem, existing studies employ the embedding-based reasoning models to infer the missing knowledge after fusing the multi-modal features. However, the reasoning performance of these methods is limited due to the following problems: (1) ineffective fusion of multi-modal auxiliary features; (2) lack of complex reasoning ability as well as inability to conduct the multi-hop reasoning which is able to infer more missing knowledge. To overcome these problems, we propose a novel model entitled MMKGR (Multi-hop Multi-modal Knowledge Graph Reasoning). Specifically, the model contains the following two components: (1) a unified gate-attention network which is designed to generate effective multi-modal complementary features through sufficient attention interaction and noise reduction; (2) a complementary feature-aware reinforcement learning method which is proposed to predict missing elements by performing the multi-hop reasoning process, based on the features obtained in component (1). The experimental results demonstrate that MMKGR outperforms the state-of-the-art approaches in the MKG reasoning task.

Do as I can, not as I get: Topology-aware multi-hop reasoning on multi-modal knowledge graphs

[pdf]
Zheng et al. (2023.04)
ArXiv
Multi-modal knowledge graph (MKG) includes triplets that consists of entities and relations and multi-modal auxiliary data. In recent years, multi-hop knowledge graph reasoning (MMKGR) based on reinforcement learning (RL) has received extensive attention because it addresses the intrinsic incompleteness of MKG in an interpretable manner. However, its performance is limited by empirically designed rewards and sparse relations. In addition, this method has been designed for the transductive setting where test entities have been seen during training, and it works poorly in the inductive setting where test entities do not appear in the training set. To overcome these issues, we propose TMR (Topology-aware Multi-hop Reasoning), which can conduct MKG reasoning under inductive and transductive settings. Specifically, TMR mainly consists of two components. (1) The topology-aware inductive representation captures information from the directed relations of unseen entities, and aggregates query-related topology features in an attentive manner to generate the fine-grained entity-independent features. (2) After completing multi-modal feature fusion, the relation-augment adaptive RL conducts multi-hop reasoning by eliminating manual rewards and dynamically adding actions. Finally we construct new MKG datasets with different scales for inductive reasoning evalution. Experimental results demonstrate that TMP outperforms state-of-the-art MKGR methods under both inductive and transductive settings.

Direct Fact Retrieval from Knowledge Graphs without Entity Linking

[pdf] [repo]
Baek et al. (2023.07)
ACL'23
There has been a surge of interest in utilizing Knowledge Graphs (KGs) for various natural language processing/understanding tasks. The conventional mechanism to retrieve facts in KGs usually involves three steps: entity span detection, entity disambiguation, and relation classification. However, this approach requires additional labels for training each of the three subcomponents in addition to pairs of input texts and facts, and also may accumulate errors propagated from failures in previous steps. To tackle these limitations, we propose a simple knowledge retrieval framework, which directly retrieves facts from the KGs given the input text based on their representational similarities, which we refer to as Direct Fact Retrieval (DiFaR). Specifically, we first embed all facts in KGs onto a dense embedding space by using a language model trained by only pairs of input texts and facts, and then provide the nearest facts in response to the input text. Since the fact, consisting of only two entities and one relation, has little context to encode, we propose to further refine ranks of top-K retrieved facts with a reranker that contextualizes the input text and the fact jointly. We validate out DiFaR framework on multiple fact retrieval tasks, showing that it significantly outperforms relevant baselines that use the three-step approach.

FACTKG: Fact Verification via Reasoning on Knowledge Graphs

[pdf] [repo]
Kim et al. (2023.07)
ACL'23
In real world applications, knowledge graphs (KG) are widely used in various domains (e.g. medical applications and dialogue agents). However, for fact verification, KGs have not been adequately utilized as a knowledge source. KGs can be a valuable knowledge source in fact verification due to their reliability and broad applicability. A KG consists of nodes and edges which makes it clear how concepts are linked together, allowing machines to reason over chains of topics. However, there are many challenges in understanding how these machine-readable concepts map to information in text. To enable the community to better use KGs, we introduce a new dataset, FACTKG: Fact Verification via Reasoning on Knowledge Graphs. It consists of 108k natural language claims with five types of reasoning: One-hop, Conjunction, Existence, Multi-hop, and Negation. Furthermore, FACTKG contains various linguistic patterns, including colloquial style claims as well as written style claims to increase practicality. Lastly, we develop a baseline approach and analyze FACTKG over these reasoning types. We believe FACTKG can advance both reliability and practicality in KG-based fact verification.

To Copy Rather Than Memorize: A Vertical Learning Paradigm for Knowledge Graph Completion

[pdf] [repo] [github]
Li et al. (2023.07)
ACL'23
Embedding models have shown great power in knowledge graph completion (KGC) task. By learning structural constraints for each training triple, these methods implicitly memorize instrinsic relation rules to infer missing links. However, this paper points out that the multihop relation rules are inherent deficiencies of such implicit memorization strategy, making embedding models underperform in predicting links between distant entity pairs. To alleviate this problem, we present Vertical Learning Paradigm (VLP), which extends embedding models by allowing to explicitly copy target information from related factual triples for more accurate prediction. Rather than solely relying on the implicit memory, VLP directly provides additional cues to improve the generalization ability of embedding models, especially making the distant link prediction significantly easier. Moreover, we also propose a novel relative distance based negative sampling technique (ReD) for more effective optimization. Experiments demonstrate the validity and generality of our proposals on two standard benchmarks.

End-to-end Knowledge Retrieval with Multi-modal Queries

[pdf] [repo] [github]
Luo et al. (2023.07)
ACL'23
We investigate knowledge retrieval with multimodal queries, i.e. queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model "ReViz" that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion without being dependent on the intermediate modules such as object detectors or caption generators. We introduce a new pretraining task that is effective for learning knowledge retrieval with multimodal queries and also improves performance on downstream tasks. We demonstrate superior performance in retrieval on two datasets (ReMuQ and OK-VQA) under zero-shot settings as well as further improvements when finetuned on these datasets.

Combo of Thinking and Observing for Outside-Knowledge VQA

[pdf] [repo]
Si et al. (2023.07)
ACL'23
Outside-knowledge visual question answering is a challenging task that requires both the acquisition and the use of open-ended real-world knowledge. Some existing solutions draw external knowledge into the cross-modality space which overlooks the much vaster textual knowledge in natural-language space, while others transform the image into a text that further fuses with the textual knowledge into the natural-language space and completely abandons the use of visual features. In this paper, we are inspired to constrain the cross-modality space into the same space of natural-language space which makes the visual features preserved directly, and the model still benefits from the vast knowledge in natural-language space. To this end, we propose a novel framework consisting of a multimodal encoder, a textual encoder and an answer decoder. Such structure allows us to introduce more types of knowledge including explicit and implicit multimodal and textual knowledge. Extensive experiments validate the superiority of the proposed method which outperforms the state-of-the-art by 6.17% accuracy. We also conduct comprehensive ablations of each component, and systematically study the roles of varying types of knowledge.

VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question Answering

[pdf] [repo]
Yao et al. (2023.07)
ACL'23
We present Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that integrates human into the loop to edit and debug the knowledge base (KB) queries. VisKoP not only provides a neural program induction module, which converts natural language questions into knowledge oriented program language (KoPL), but also maps KoPL programs into graphical elements. KoPL programs can be edited with simple graphical operators, such as "dragging" to add knowledge operators and "slot filling" to designate operator arguments. Moreover, VisKoP provides auto-completion for its knowledge base schema and users can easily debug the KoPL program by checking its intermediate results. To facilitate the practical KBQA on a million-entity-level KB, we design a highly efficient KoPL execution engine for the back-end. Experiment results show that VisKoPus is highly efficient and user interaction can fix a large portion of wrong KoPL programs to acquire the correct answer.

Multi-Grained Multimodal Interaction Network for Entity Linking

[pdf] [repo]
Luo et al. (2023.07)
KDD'23
Multimodal entity linking (MEL) task, which aims at resolving ambiguous mentons to a multimodal knowledge graph, has attracted wide attention in recent years. Though large efforts have been made to explore the complementary effect among multiple modalities, however, they may fail to fully absorb the comprehensive expression of abbreviated textual context and implicit visual indication. Even worse, the inevitable noisy data may cause inconsistency of different modalities during the learning process, which severely degenerates the performance. To address the above issues, in this paper, we propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task. Specifically, the unified inputs of mentions and entities are first encoded by textual/visual encoders separately, to extract global descriptive features and local detailed features. Then, to derive the similarity matching score for each mention-entity pair, we device three interaction units to comprehensively explore the intra-modal interaction and inter-modal fusion among features of entities and mentions. In particular, three modules, namely the Text-based Global-Local interaction Unit (TGLU), Vision-based DuaL interaction Unit (VDLU) and Cross-Modal Fusion-based interaction Unit (CMFU) are designed to capture and integrate the fine-grained representation lying in abbreviated text and implicit visual cues. Afterwards, we introduce a unit-consistency objective function via constrastive learning to avoid inconsistency and model degredation. Experimental results on three public benchmark datasets demonstrate that our solution outperforms various state-of-the-art baselines, and ablation studies verify the effectiveness of designed modules.

Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation

[pdf] [repo]
Wu et al. (2023.07)
KDD'23
Zero-Shot Learning (ZSL), which aims at automatically recognizing unseen objects, is a promising learning paradigm to understand new real-world knowledge for machines continuously. Recently, the Knowledge Graph (KG) has been proven as an effective scheme for handling the zero-shot task with large-scale and non-attribute data. Prior studies always embed relationships of seen and unseen obejcts into visual information from existing knowledge grpahs to promote the cognitive ability of the unseen data. Actually, real-world knowledge is natually formed by multimodal facts. Compared with ordinary structural knowledge from a graph perspective, multimodal KG can provide cognitive systems with fine-grained knowledge. For example, the text description and visual content can depict more critical details of a fact than only depending on knowledge triplets. Unfortunately, this multimodal fine-grained knowledgeg is largely unexploited due to the bottleneck of feature alignment between different modalities. To that end, we propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings via a designed dense attention module and self-calibration loss. It makes the semantic transfer process of our ZSL framework learns more differentiated knowledge between entities. Our model also gets rid of the performance limitation of only using rough global features. We conduct extensive experiments and evaluate our model on large-scale real-world data. The experimental results clearly demonstrate the effectiveness of the proposed model in standard zero-shot classification tasks.

An Empirical Study on the Language Modal in Visual Question Answering

[pdf] [repo]
Peng et al. (2023.08)
IJCAI'23
Generalization beyond in-domain experience to out-of-distribution data is of paramount significance in the AI domain. Of late, state-of-the-art Visual Question Answering (VQA) models have shown impressive performance on in-domain data, partially due to the language priors bias which, however, hinders the generalization ability in practice. This paper attempts to provide new insights into the influence of language modality on VQA performance from an empirical study perspective. To achieve this, we conducted a series of experiments on six models. The results of these experiments revealed that, 1) apart from prior bias caused by question types, there is a notable influence of postfix-related bias in inducing biases, and 2) training VQA models with word-sequence-related variant questions demonstrated improved performance on the out-of-distribution benchmark, and the LXMERT even achieved a 10-point gain without adopting any debiasing methods. We delved into the underlying reasons behind these experimental results and put forward some simple proposals to reduce the models' dependency on language priors. The experimental results demonstrated the effectiveness of our proposed method in improving performance on the out-of-distribution benchmark, VQA-CPv2. We hope this study can inspire novel insights for future research on designing bias-reduction approaches.

Prompt Learns Prompt: Exploring Knowledge-Aware Generative Prompt Collaboration for Video Captioning

[pdf] [repo]
Yan et al. (2023.08)
IJCAI'23
Fine-tuning large vision-language models is a challenging task. Prompt tuning approaches have been introduced to learn fixed textual or visual prompts while freezing the pre-trained model in downstream tasks. Despite the effectiveness of prompt tuning, what do those learnable prompts learn remains unexplained. In this work, we explore whether prompts in the fine-tuning can learn knowledge-aware prompts from the pre-training, by designing two sets of prompts -- one in pre-training and the other in fine-tuning. Specifically, we present a Video-Language Prompt tuning (VL-Prompt) approach for video captioning, which first effeciently pre-train a video-language model to extract key information (e.g., actions and objects) with flexibly generated Knowledge-Aware Prompt (KAP). Then, we design a Video-Language Prompt (VLP) to utilize the knowledge from KAP and finetune the model to generate full captions. Experimental results show the superior performance of our approach over several state-of-the-art baselines. We further demonstrate that the video-language prompts are well learned from the knowledge-aware prompts.

Construction and Applications of Billion-Scale Pre-Trained Multimodal Business Knowledge Graph

[pdf] [repo]
Deng et al. (2023.11)
ICDE'23
Business Knowledge Graphs (KGs) are important to many enterprises today, providing factual knowledge and structure data that steer many products and make them more intelligent. Despite their promising benefits, building business KG necessitates solving prohibitive issues of deficient structure and multiple modalities. In this paper, we advance the understanding of the practical challenges related to building KG in non-trivial real-world systems. We introduce the process of building an open business knowledge graph (OpenBG) derived from a well-known enterprise, Alibaba Group. Specifically, we define a core ontology to cover various abstract products and consumption demands, with fine-grained taxonomy and multimodal facts in deployed applications. OpenBG is an open business KG of unprecedented scale: 2.6 billion triples with more than 88 million entities covering over 1 million core classes/concepts and 2681 types of relations. We release all the open resources (OpenBG benchmarks) derived from it for the community and report experimental results of KG-centric tasks. We also run up an online competition based on OpenBG benchmarks, and has attracted thousands of teams. We further pre-train OpenBG and apply it to many KG-enhanced downstream tasks in business scenarios, demonstrating the effectiveness of billion-scale multimodal knowledge for e-commerce.

IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling

[pdf] [repo]
Wen et al. (2023.11)
IEEE'23
Multimodal knowledge graph link prediction aims to improve the accuracy and efficiency of link prediction tasks for multimodal data. However, for complex multimodal information and sparse training data, it is usually difficult to achieve interpretability and high accuracy simultaneously for most methods. To address this difficulty, a new model is developed in this paper, namely interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling (IMKGA-SM). First, a multi-modal fine-grained fusion method is proposed, and VGG16 and Optical Character Recognition (OCR) techniques are adopted to effectively extract text information from images and images. Then, the knowledge graph link prediction task is modelled as an offline reinforcement learning Markov decision model, which is then abstracted into a unified sequence framework. An interactive perception-based reward expectation mechanism and special causal masking mechanism are designed, which "converts" the query into an inference path. Then an autoregressive dynamic gradient adjustment mechanism is proposed to alleviate the insufficient problem of multimodal optimization. Finally, two datasets are adopted for experiments, and the popular SOTA baselines are used for comparison. The results show that the developed IMKGA-SM achieves much better performance than SOTA baselines on multimodal link prediction datasets of different sizes.

2024 {#10}

Continual Vision-Language Retrieval via Dynamic Knowledge Rectification

[pdf] [repo]
Cui et al. (2024.01)
AAAI'24
The recent large-scale pre-trained models like CLIP have aroused great concern in vision-language tasks. However, when required to match image-text data collected in a streaming manner, namely Continual Vision-Language Retrieval (CVRL), their performances are still limited due to the catastrophic forgetting of the learned old knowledge. To handle this issue, advanced methods are proposed to distil the affinity knowledge between images and texts from the old model to the new one for anti-forgetting. Unfortunately, existing approaches neglect the impact of incorrect affinity, which prevents the balance between the anti-forgetting of old knowledge and acquistion of new knowledge. Therefore, we propose a novel framework called Dynamic Knowledge Rectification (DKR) that simultaneously achieves incorrect knowledgeg filtering and rectification. Specifically, we first filter the incorrect affinity knowledge calculated by the old model on the new data. Then a knowledge rectification method is designed to rectify the incorrect affinities while preserving the correct ones. In particular, for the new data that can only be correctly retrieved by the new model, we rectify them with the corresponding new affinity to protect them from negative transfer. Additionally, for those that can not be retrieved by either the old or the new model, we introduce paired ground-truth labels to promote the acquisition of both old and new knowledge. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our DKR and its superiority against state-of-the-art methods.

KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning

[pdf] [repo]
Mondal et al. (2024.01)
AAAI'24
Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT) that enables step-by-step thinking. Extending LLMs with multimodal capabilities is the recent interest, but incurs computational cost and requires substantial hardware resources. To address these challenges, we propose KAM-CoT a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding reducing hallucinations and enhancing the quality of answers. This knowledge-augmented CoT reasoning empowers the model to handle questions requiring external context, providing more informed answers. Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness.

Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge

[pdf] [repo]
Kim et al. (2024.01)
AAAI'24
Open Set Recognition (OSR) poses significant challenges in distinguishing known from unknown classes. In OSR, the overconfidence problem has become a persistent obstacle, where visual recognition models often misclassify unknown objects as known objects with high confidence. This issus stems from the fact that visual recognition models often lack the integration of common-sense knowledge, a feature that is naturally present in language-based models but lacking in visual recognition systems. In this paper, we propose a novel apporach to enhance OSR performance by distilling common-sense knowledge into visual prompts. Utilizing text prompts that embody common-sense knowledge about known classes, the proposed visual prompt is learned by extracting semantic common-sense features and aligning them with image features from visual recognition models. The unique aspect of this work is the training of individual visual prompts for each class to encapsulate this common-sense knowledge. Our methodology is model-agnostic, capable of enhancing OSR across various visual recongition models, and computationally light as it focuses solely on training the visual prompts. This research introduces a method for addressing OSR, aiming at a more systematic integration of visual recognition systems with common-sense knowledge. The obtained results indicate an enhancement in recognition accuracy, suggesting the applicability of this apporach in practical settings.

Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding

[pdf] [repo] [github]
Liu et al. (2024.01)
AAAI'24
Much effort has been devoted to building multi-modal knowledge graphs by visualizing entities on images, but ignoring the multi-modal information of the relation between entities. Hence, in this paper, we aim to construct a new large-scale multi-modal knowledge graph with triplet facts grounded on images that reflect not only entities but also their relations. To achieve this purpose, we propose a novel pipeline method, including triplet fact filtering, image retrieving, entity-based image filtering, relation-based image filtering, and image clustering. In this way, a multi-modal knowledge graph named ImgFact is constructed, which contains 247,732 triplet facts and 3,730,805 images. In experiments, the manual and automatic evaluations prove the reliable quality of our ImgFact. We further use the obtrained images to enhance model performance on two tasks. In particular, the model optmized by our ImgFact achieves an impressive 8.38% and 9.87% improvement over the solutions enhanced by an existing multimodal knowledge graph and VisualChatGPT on F1 of relation classification.

Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search

[pdf] [repo]
Liang et al. (2024.01)
AAAI'24
Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solutions for cross-modal search. However, the existing methods neglecet the implicit fine-grained multimodal knowledge relations between different modalities such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between the image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason on the multi-modal knowledge graph to suffciently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings by multihead attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic consistency preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final cross-modal unified hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.

LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation

[pdf] [repo]
Shang et al. (2024.01)
AAAI'24
Recently, an enormous amount of research has emerged on multimodal knowledge graph completion (MKGC), which seeks to extract knowledge from multimodal data and predict the most plausible missing facts to complete a given multimodal knowledge graph (MKG). However, existing MKGC approaches largely ignore that visual information may introduce noise and lead to uncertainty when adding them to the traditional KG embeddings due to the contribution of each associated image to entity is different in diverse link scenarios. Moreover, treating each triple independently when learning entity embeddings leads to local structural and the whole graph information missing. To address these challenges, we propose a novel link aware fusion and aggregation based multimodal knowledge graph completion model named LAFA, which is composed of link aware fusion module and link aware aggregation module. The link aware fusion module alleviates noise of irrelevant visual information by calculating the importance between an entity and its associated images in different link scenarios, and fuses the visual and structural embeddings according to the importance through our proposed modality embedding fusion mechanism. The link aware aggregation module assigns neighbor structural information to a given central entity by calculating the importance between the entity and its neighbors, and aggregating the fused embeddings through linear combination according to the importance. Extensive experiments on standard datasets validate that LAFA can obtain state-of-the-art performance.

Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey

[pdf] [repo] [github]
Chen et al. (2024.02)
arXiv
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the semantic web community's exploration into multi-modal dimensions unlocking new avenues for innovation. In this survey, we carefully review over 300 articles, focusing on KG-aware research in two principal aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into the MMKG realm. We begin by defining KGs and MMKGs, then expore their construction progress. Our review includes two primary task categories: KG-aware multi-modal learning tasks, such as Image Classification and Visual Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph Completion and Entity Alignment, highlighting specific research trahjectories. For most of these tasks, we provide definitions, evaluation benchmarks, and additionally outline essential insights for conducting relevant research. Finally, we discuss current challenges and identify emerging trends, such as progress in Large Language Modeling and Multi-modal Pre-training strategies. This survey aims to serve as a comprehensive reference for researchers already involved in or considering delving into KG and multi-modal learning research, offering insights into the evolving landscape of MMKG research and supporting future work.

Unleashing the Power of Imbalanced Modality Information for Multi-modal Knowledge Graph Completion

[pdf] [repo] [github]
Zhang et al. (2024.02)
COLING'24
Multi-modal Knowledge graph completion (MMKGC) aims to predict the missing triples in the multi-modal knowledge graphs by incorporating structural, visual, and textual information of entities into the discriminant models. The information from different modalities will work together to measure the triple plausibility. Existing MMKGC methods overlook the imbalance problem of modality information among entities, resulting in inadequate modal fusion and inefficient utilization of the raw modality information. To address the mentioned problems, we propose Adaptive Multi-modal Fusion and Modality Adversarial Training (AdaMF-MAT) to unleash the power of imbalanced modality information for MMKGC. AdaMF-MAT achieves multi-modal fusion with adaptive modality weights and further generate adversarial samples by modality-adversarial training to enhance the imbalanced modality information. Our apporoach is a co-design of the MMKGC model and train strategy which can outperform 19 recent MMKGC methods and achieve new state-of-the-art results on three public MMKGC benchmarks.

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

[pdf] [repo]
Chen et al. (2024.06)
CVPR'24
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issus, we devise a dual-level visual knowledge enhanced multimodal large language model (LION), which empowers the MLLM by injecting visual knowledge in two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator cooperated with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme contributes to the mutual promotion between these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual evidence. We facilitate the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence caused by imperfect predicted tags, we propose a soft prompting method by embedding a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).

Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

[pdf] [repo]
Lin et al. (2024.07)
arXiv
In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs, limiting their ability to answer questions requiring an understanding of detailed or localized visual elements. Drawing inspiration from the Retrieval-Augmented Generation (RAG) concept, this paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models (e.g., instance segmentation/OCR models), into MLLMs. This is a promising yet underexplored direction for enhancing MLLMs' performance. Our approach diverges from concurrent works, which transform external knowledge into additional text prompts, necessitating the model to indirectly learn the correspondence between visual content and text coordinates. Instead, we propose embedding fine-grained knowledge information directly into a spatial embedding map as a visual prompt. This design can be effortlessly incorporated into various MLLMs, such as LLaVA and Mipha, considerably improving their visual understanding performance. Through rigorous experiments, we demonstrate that our method can enhance MLLM performance across nine benchmarks, amplifying their fine-grained context-aware capabilities.

Tutorials

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge Extraction or Knowledge Hallucination?

[pdf]
Heiko Paulheim
Universität Mannheim

Online Multimodal Knowledge Discovery

[pdf]
Ilharco et al.
ICDM'20

Multimodal Knowledge Graphs: Automatic Extraction & Applications

[pdf] [repo]
Shih-Fu Chang
CVPR'19 & Uni. Columbia

Towards Building Large-Scale Multimodal Knowledge Bases

[pdf] [repo]
Dihong Gong
Uni. Florida

How To Build a Knowledge Graph

[pdf] [repo]
Elias Karle & Umutcan Simsek
Uni Innsbruck

Mining Knowledge Graphs from Text

[pdf]
[repo-Tutorial1-Introduction]
[repo-Tutorial2-A-InformationExtraction]
[repo-Tutorial2-B-InformationExtraction]
[repo-Tutorial3-Construction]
[repo-Tutorial4-Embedding]
[repo-Tutorial5-Conclusion]
Jay Pujara & Sameer Singh
WSDM‘18

Injecting Prior Information and Multiple Modalities into Knowledge Base Embeddings

[pdf] [repo]
Sameer Singh
Uni California, Irvine

Files

resource_list_abstract.md

Latest commit

History