Publications
Mojtaba Valizadeh, Philip John Gorinski, Ignacio Iacobacci, Martin Berger (2024).
Correct and Optimal: the Regular Expression Inference Challenge.
IJCAI 2024, Jeju, South Korea. [PDF]
We propose regular expression inference (REI) as a challenge for code/language modelling, and the wider machine learning community. REI is a supervised machine learning (ML) and program optimisation task, and poses the problem of finding minimal regular expressions from examples: Given two finite sets of strings P and N and a cost function cost(·), the task is to generate an expression r that accepts all strings in P and rejects all strings in N, while no other such expression r' exists with cost(r') < cost(r).
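To make the task statement concrete, here is a minimal, purely illustrative Python sketch of the REI consistency check and cost comparison; the regex dialect, full-match semantics, and length-based cost are assumptions for this sketch, not the challenge's actual definitions.

import re

def is_consistent(r: str, P: set[str], N: set[str]) -> bool:
    # r must accept every positive example and reject every negative one.
    return (all(re.fullmatch(r, s) is not None for s in P)
            and all(re.fullmatch(r, s) is None for s in N))

def cost(r: str) -> int:
    # Placeholder cost: expression length. The challenge defines its own cost(.).
    return len(r)

# Toy instance with a few hand-written candidate expressions.
P, N = {"ab", "aab"}, {"b", "ba"}
candidates = ["a+b", "a*b", "(a|b)*"]
valid = [r for r in candidates if is_consistent(r, P, N)]
print(min(valid, key=cost))  # prints "a+b"; a full REI solver must also prove no cheaper consistent r' exists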
Philip John Gorinski, Matthieu Zimmer, Gerasimos Lampouras, Derrick Goh Xin Deik, Ignacio Iacobacci (2023).
Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis.
Findings of EMNLP 2023, Singapore. [PDF]
The advent of large pre-trained language models in the domain of Code Synthesis has shown remarkable performance on various benchmarks, treating the problem of Code Generation in a fashion similar to Natural Language Generation, trained with a Language Modelling (LM) objective. In addition, the property of programming language code being precisely evaluable with respect to its semantics -- through the use of Unit Tests to check its functional correctness -- lends itself to using Reinforcement Learning (RL) as a further training paradigm. Previous work has shown that RL can be applied as such to improve models' coding capabilities; however, such RL-based methods rely on a reward signal based on defined Unit Tests, which are much harder to obtain compared to the huge crawled code datasets used in LM objectives. In this work, we present a novel approach to automatically obtain data consisting of function signatures and associated Unit Tests, suitable for RL training of Code Synthesis models. We also introduce a simple yet effective Actor-Critic RL training scheme and show that it, in conjunction with automatically generated training data, leads to an improvement of a pre-trained code language model's performance of up to 9.9% over the original underlying code synthesis LM, and up to 4.3% over RL-based models trained with standard PPO or CodeRL.
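As a rough illustration of how unit tests can yield a scalar reward for RL training (the function name, the test format, and the pass-rate reward are assumptions made for this sketch, not the paper's exact setup):

def unit_test_reward(candidate_code: str, tests: list[tuple[str, object]]) -> float:
    # Execute the generated code and return the fraction of unit tests it passes.
    # `tests` is a hypothetical list of (call_expression, expected_value) pairs.
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
    except Exception:
        return -1.0                       # unrunnable code receives a penalty
    passed = 0
    for call, expected in tests:
        try:
            if eval(call, namespace) == expected:
                passed += 1
        except Exception:
            pass                          # a crashing test contributes nothing
    return passed / len(tests)

# Usage: reward a generated `add` implementation against two tests.
code = "def add(a, b):\n    return a + b"
print(unit_test_reward(code, [("add(1, 2)", 3), ("add(-1, 1)", 0)]))  # -> 1.0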
Yunjie He, Philip John Gorinski, Ieva Staliūnaitė, Pontus Stenetorp (2023).
Graph Attention with Hierarchies for Multi-hop Question Answering.
Preprint, arXiv. [PDF]
Multi-hop QA (Question Answering) is the task of finding the answer to a question across multiple documents. In recent years, a number of Deep Learning-based approaches have been proposed to tackle this complex task, as well as a few standard benchmarks to assess models' Multi-hop QA capabilities.
In this paper, we focus on the well-established HotpotQA benchmark dataset, which requires models to perform answer span extraction as well as support sentence prediction.
We present two extensions to the state-of-the-art Graph Neural Network (GNN) based model for HotpotQA, Hierarchical Graph Network (HGN): (i) we complete the original hierarchical structure by introducing new edges between the query and context sentence nodes; (ii) in the graph propagation step, we propose a novel extension to Hierarchical Graph Attention Network -- GATH (Graph ATtention with Hierarchies) -- that makes use of the graph hierarchy to update the node representations in a sequential fashion.
Experiments on HotpotQA demonstrate the effectiveness of the proposed modifications and support our assumptions about the effects of model-related variables.
Alexander I. Cowen-Rivers, Philip John Gorinski, Aivar Sootla, Asif Khan, Liu Furui, Jun Wang, Jan Peters, Haitham Bou Ammar (2022).
Structured Q-learning For Antibody Design.
RL4RealLife Workshop at NeurIPS 2022, New Orleans, LA, USA. [PDF]
Optimizing combinatorial structures is core to many real-world problems, such as those encountered in life sciences. For example, one of the crucial steps involved in antibody design is to find an arrangement of amino acids in a protein sequence that improves its binding with a pathogen. Combinatorial optimization of antibodies is difficult due to extremely large search spaces and non-linear objectives. Even for modest antibody design problems, where proteins have a sequence length of eleven, we are faced with searching over 2.05 × 10^14 structures. Applying traditional Reinforcement Learning algorithms such as Q-learning to combinatorial optimization results in poor performance. We propose Structured Q-learning (SQL), an extension of Q-learning that incorporates structural priors for combinatorial optimization. Using a molecular docking simulator, we demonstrate that SQL finds high binding energy sequences and performs favourably against baselines on eight challenging antibody design tasks, including designing antibodies for SARS-CoV.
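The quoted search-space size follows directly from the combinatorics, assuming one of the 20 standard amino acids at each of the 11 sequence positions:

20^{11} = 2.048 \times 10^{14} \approx 2.05 \times 10^{14}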
Ieva Staliūnaitė, Philip John Gorinski, Ignacio Iacobacci (2022).
Relational Graph Convolutional Neural Networks for Multihop Reasoning: A Comparative Study.
Preprint, arXiv. [PDF]
Multihop Question Answering is a complex Natural Language Processing task that requires multiple steps of reasoning to find the correct answer to a given question. Previous research has explored the use of models based on Graph Neural Networks for tackling this task. Various architectures have been proposed, including Relational Graph Convolutional Networks (RGCN). For these, many node types and relations between them have been introduced, such as simple entity co-occurrences, modelling coreferences, or "reasoning paths" from questions to answers via intermediary entities. Nevertheless, a thorough analysis of which relations, node types, embeddings and architecture are the most beneficial for this task is still missing. In this paper, we explore a number of RGCN-based Multihop QA models, graph relations, and node embeddings, and empirically assess the influence of each on Multihop QA performance on the WikiHop dataset.
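For readers unfamiliar with RGCNs, the standard relational propagation rule of Schlichtkrull et al. (restated here for context, not taken from this paper) updates each node representation from its neighbours under every relation type:

h_i^{(l+1)} = \sigma\Big( W_0^{(l)} h_i^{(l)} + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} \Big)

where \mathcal{N}_i^r is the set of neighbours of node i under relation r and c_{i,r} is a normalisation constant.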
Ieva Staliūnaitė, Philip John Gorinski, Ignacio Iacobacci (2021).
Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation.
AAAI 2021, Virtual. [PDF]
Determining the plausibility of causal relations between clauses is a commonsense reasoning task that requires complex inference ability. The general approach to this task is to train a large pretrained language model on a specific dataset. However, the available training data for the task is often scarce, which leads to instability of model training or reliance on the shallow features of the dataset. This paper presents a number of techniques for making models more robust in the domain of causal reasoning. Firstly, we perform adversarial training by generating perturbed inputs through synonym substitution. Secondly, based on a linguistic theory of discourse connectives, we perform data augmentation using a discourse parser for detecting causally linked clauses in large text, and a generative language model for generating distractors. Both methods boost model performance on the Choice of Plausible Alternatives (COPA) dataset, as well as on a Balanced COPA dataset, which is a modified version of the original data that has been developed to avoid superficial cues, leading to a more challenging benchmark. We show a statistically significant improvement in performance and robustness on both datasets, even with only a small number of additionally generated data points.
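A minimal sketch of synonym-substitution perturbation, using WordNet as the synonym source; the resource choice, the substitution probability, and the lack of part-of-speech filtering are simplifying assumptions rather than the paper's actual procedure.

import random
from nltk.corpus import wordnet  # requires nltk with the "wordnet" corpus downloaded

def perturb(tokens: list[str], p: float = 0.2) -> list[str]:
    # Replace each token with a random WordNet synonym with probability p.
    out = []
    for tok in tokens:
        synonyms = {l.name().replace("_", " ")
                    for syn in wordnet.synsets(tok)
                    for l in syn.lemmas()} - {tok}
        if synonyms and random.random() < p:
            out.append(random.choice(sorted(synonyms)))
        else:
            out.append(tok)
    return out

print(perturb("the man knocked over the glass because he was clumsy".split()))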
Yusheng Tian, Philip John Gorinski (2020).
Improving End-to-End Speech-to-Intent Classification with Reptile.
Interspeech 2020, Shanghai, China. [PDF]
End-to-end spoken language understanding (SLU) systems have many advantages over conventional pipeline systems, but collecting in-domain speech data to train an end-to-end system is costly and time consuming. One question arises from this: how to train an end-to-end SLU with limited amounts of data? Many researchers have explored approaches that make use of other related data resources, typically by pre-training parts of the model on high-resource speech recognition. In this paper, we suggest improving the generalization performance of SLU models with a non-standard learning algorithm, Reptile. Though Reptile was originally proposed for model-agnostic meta learning, we argue that it can also be used to directly learn a target task and result in better generalization than conventional gradient descent. In this work, we apply Reptile to the task of end-to-end spoken intent classification. Experiments on four datasets of different languages and domains show improvement of intent prediction accuracy, both when Reptile is used alone and used in addition to pre-training.
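For reference, a compact sketch of the Reptile update applied to a single target task, as the abstract suggests; the model, data iterator, hyperparameters, and the use of PyTorch are placeholders for illustration, not necessarily the paper's setup.

import copy
import torch
import torch.nn.functional as F

def reptile_step(model, batches, inner_lr=1e-3, outer_lr=0.1, inner_steps=5):
    # Run a few SGD steps on a copy of the model...
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = next(batches)               # assumed iterator yielding (features, labels)
        loss = F.cross_entropy(adapted(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # ...then move the original weights part of the way towards the adapted ones:
    # theta <- theta + outer_lr * (theta_adapted - theta)
    with torch.no_grad():
        for p, q in zip(model.parameters(), adapted.parameters()):
            p += outer_lr * (q - p)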
Rafael Kourdis, Gabriel Gordon-Hall, Philip John Gorinski (2020).
αVIL: Learning to Leverage Auxiliary Tasks for Multitask Learning.
Preprint, arXiv. [PDF]
Multitask Learning is a Machine Learning paradigm that aims to train a range of (usually related) tasks with the help of a shared model. While the goal is often to improve the joint performance of all training tasks, another approach is to focus on the performance of a specific target task, while treating the remaining ones as auxiliary data from which to possibly leverage positive transfer towards the target during training. In such settings, it becomes important to estimate the positive or negative influence auxiliary tasks will have on the target. While many ways have been proposed to estimate task weights before or during training, they typically rely on heuristics or extensive search of the weighting space. We propose a novel method called α-Variable Importance Learning (αVIL) that is able to adjust task weights dynamically during model training, by making direct use of task-specific updates of the underlying model's parameters between training epochs. Experiments indicate that αVIL is able to outperform other Multitask Learning approaches in a variety of settings. To our knowledge, this is the first attempt at making direct use of model updates for task weight estimation.
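One possible, deliberately simplified reading of the idea (combining per-task parameter updates with weights chosen against the target task) is sketched below; the brute-force grid search and all names are illustrative and should not be taken as the paper's actual alpha-update rule.

import copy
import itertools
import torch

def apply_weighted_deltas(model, deltas, alphas):
    # Return a copy of `model` with sum_t alphas[t] * deltas[t] added to its parameters.
    # `deltas[t]` is a list of per-parameter tensors recorded for task t during the last epoch.
    candidate = copy.deepcopy(model)
    with torch.no_grad():
        for i, p in enumerate(candidate.parameters()):
            for task_delta, a in zip(deltas, alphas):
                p += a * task_delta[i]
    return candidate

def choose_alphas(model, deltas, target_val_loss, grid=(0.0, 0.5, 1.0)):
    # Pick the task weights that minimise the target task's validation loss
    # after applying the weighted updates (grid search purely for illustration).
    best_loss, best_alphas = float("inf"), None
    for alphas in itertools.product(grid, repeat=len(deltas)):
        loss = target_val_loss(apply_weighted_deltas(model, deltas, alphas))
        if loss < best_loss:
            best_loss, best_alphas = loss, alphas
    return best_alphas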
Gabriel Gordon-Hall, Philip John Gorinski, Shay B. Cohen (2020).
Learning Dialog Policies from Weak Demonstrations.
Proceedings of ACL 2020, Seattle, WA, USA. [PDF]
Deep reinforcement learning is a promising approach to training a dialog manager, but current methods struggle with the large state and action spaces of multi-domain dialog systems. Building upon Deep Q-learning from Demonstrations (DQfD), an algorithm that scores highly in difficult Atari games, we leverage dialog data to guide the agent to successfully respond to a user's requests. We make progressively fewer assumptions about the data needed, using labeled, reduced-labeled, and even unlabeled data to train expert demonstrators. We introduce Reinforced Fine-tune Learning, an extension to DQfD, enabling us to overcome the domain gap between the datasets and the environment. Experiments in a challenging multi-domain dialog system framework validate our approaches, which achieve high success rates even when trained on out-of-domain data.
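For context, the large-margin supervised term that DQfD adds on demonstration data can be sketched as follows (PyTorch assumed; the margin value and the omission of the TD and n-step terms are simplifications of the full DQfD objective):

import torch

def dqfd_margin_loss(q_values, expert_actions, margin=0.8):
    # q_values: (batch, num_actions) tensor of Q(s, .); expert_actions: (batch,) demonstrated actions.
    # Encourages Q(s, a_E) to exceed Q(s, a) + margin for every non-demonstrated action a.
    idx = torch.arange(q_values.size(0), device=q_values.device)
    margins = torch.full_like(q_values, margin)
    margins[idx, expert_actions] = 0.0
    q_expert = q_values[idx, expert_actions]
    return (torch.max(q_values + margins, dim=1).values - q_expert).mean()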
Gabriel Gordon-Hall, Philip John Gorinski, Gerasimos Lampouras, Ignacio Iacobacci (2020).
Show Us the Way: Learning to Manage Dialog from Demonstrations.
The Eighth Dialog System Technology Challenge (DSTC8) Workshop at AAAI 2020, New York, NY, USA. [PDF]
We present our submission to the End-to-End Multi-Domain Dialog Challenge Track of the Eighth Dialog System Technology Challenge. Our proposed dialog system adopts a pipeline architecture, with distinct components for Natural Language Understanding, Dialog State Tracking, Dialog Management and Natural Language Generation. At the core of our system is a reinforcement learning algorithm which uses Deep Q-learning from Demonstrations to learn a dialog policy with the help of expert examples. We find that demonstrations are essential to training an accurate dialog policy where both state and action spaces are large. Evaluation of our Dialog Management component shows that our approach is effective, beating supervised and reinforcement learning baselines.
Philip John Gorinski, Honghan Wu, Claire Grover, Richard Tobin, Conn Talbot, Heather Whalley, Cathie Sudlow, William Whiteley, and Beatrice Alex (2019).
Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches.
Presented at the Healthcare Text Analytics Conference (HealTAC) 2019, Cardiff, Wales. [PDF]
This work investigates multiple approaches to Named Entity Recognition (NER) for text in Electronic Health Record (EHR) data. In particular, we look into the application of (i) rule-based, (ii) deep learning and (iii) transfer learning systems for the task of NER on brain imaging reports with a focus on records from patients with stroke. We explore the strengths and weaknesses of each approach, develop rules and train on a common dataset, and evaluate each system’s performance on common test sets of Scottish radiology reports from two sources (brain imaging reports in ESS – Edinburgh Stroke Study data collected by NHS Lothian as well as radiology reports created in NHS Tayside). Our comparison shows that a hand-crafted system is the most accurate way to automatically label EHR, but machine learning approaches can provide a feasible alternative where resources for a manual system are not readily available.
Philip John Gorinski (2018).
Automatic Movie Analysis and Summarization.
PhD Thesis at the University of Edinburgh, School of Informatics, ILCC, Edinburgh, Scotland. [PDF]
Automatic movie analysis is the task of employing Machine Learning methods to the field of screenplays, movie scripts, and motion pictures to facilitate or enable various tasks throughout the entirety of a movie’s life-cycle. From helping with making informed decisions about a new movie script with respect to aspects such as its originality, similarity to other movies, or even commercial viability, all the way to offering consumers new and interesting ways of viewing the final movie, many stages in the life-cycle of a movie stand to benefit from Machine Learning techniques that promise to reduce human effort, time, or both. Within this field of automatic movie analysis, this thesis addresses the task of summarising the content of screenplays, enabling users at any stage to gain a broad understanding of a movie from greatly reduced data. The contributions of this thesis are four-fold: (i) We introduce ScriptBase, a new large-scale data set of original movie scripts, annotated with additional meta-information such as genre and plot tags, cast information, and log- and tag-lines. To our knowledge, ScriptBase is the largest data set of its kind, containing scripts and information for almost 1,000 Hollywood movies. (ii) We present a dynamic summarisation model for the screenplay domain, which allows for extraction of highly informative and important scenes from movie scripts. The extracted summaries allow for the content of the original script to stay largely intact and provide the user with its important parts, while greatly reducing the script-reading time. (iii) We extend our summarisation model to capture additional modalities beyond the screenplay text. The model is rendered multi-modal by introducing visual information obtained from the actual movie and by extracting scenes from the movie, allowing users to generate visual summaries of motion pictures. (iv) We devise a novel end-to-end neural network model for generating natural language screenplay overviews. This model enables the user to generate short descriptive and informative texts that capture certain aspects of a movie script, such as its genres, approximate content, or style, allowing them to gain a fast, high-level understanding of the screenplay. Multiple automatic and human evaluations were carried out to assess the performance of our models, demonstrating that they are well-suited for the tasks set out in this thesis, outperforming strong baselines. Furthermore, the ScriptBase data set has started to gain traction, and is currently used by a number of other researchers in the field to tackle various tasks relating to screenplays and their analysis.
Philip John Gorinski and Mirella Lapata (2018).
What's this Movie about? A Joint Neural Network Architecture for Movie Content Analysis.
In Proceedings of NAACL-HLT 2018, New Orleans, LA, USA. [PDF]
This work takes a first step toward movie content analysis by tackling the novel task of movie overview generation. Overviews are natural language texts that give a first impression of a movie, describing aspects such as its genre, plot, mood, or artistic style. We create a dataset that consists of movie scripts, attribute-value pairs for the movies’ aspects, as well as overviews, which we extract from an online database. We present a novel end-to-end model for overview generation, consisting of a multi-label encoder for identifying screenplay attributes, and an LSTM decoder to generate natural language sentences conditioned on the identified attributes. Automatic and human evaluation show that the encoder is able to reliably assign good labels for the movie’s attributes, and the overviews provide descriptions of the movie’s content which are informative and faithful.
Philip John Gorinski and Mirella Lapata (2015).
Movie Script Summarization as Graph-based Scene Extraction.
In Proceedings of NAACL-HLT 2015, Denver, CO, USA. [PDF]
Philip John Gorinski, Josef Ruppenhofer, and Caroline Sporleder (2013).
Towards Weakly Supervised Resolution of Null Instantiations.
In Proceedings of IWCS 2013 - Long Papers, Potsdam, Germany. [PDF]
This paper addresses the task of finding antecedents for locally uninstantiated arguments. To resolve such null instantiations, we develop a weakly supervised approach that investigates and combines a number of linguistically motivated strategies that are inspired by work on semantic role labeling and coreference resolution. The performance of the system is competitive with the current state-of-the-art supervised system.
Vera Demberg, Asad B. Sayeed, Philip J. Gorinski, and Nikolaos Engonopoulos (2012).
Syntactic surprisal affects spoken word duration in conversational contexts.
In Proceedings of EMNLP-CoNLL '12, Jeju Island, South Korea. [PDF]
We present results of a novel experiment to investigate speech production in conversational data that links speech rate to information density. We provide the first evidence for an association between syntactic surprisal and word duration in recorded speech. Using the AMI corpus which contains transcriptions of focus group meetings with precise word durations, we show that word durations correlate with syntactic surprisal estimated from the incremental Roark parser over and above simpler measures, such as word duration estimated from a state-of-the-art text-to-speech system and word frequencies, and that the syntactic surprisal estimates are better predictors of word durations than a simpler version of surprisal based on trigram probabilities. This result supports the uniform information density (UID) hypothesis and points a way to more realistic artificial speech generation.
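For clarity, surprisal here is the standard information-theoretic quantity; the syntactic variant conditions the probability on the incremental parser's analysis of the sentence prefix, whereas the trigram baseline conditions only on the two preceding words:

\mathrm{surprisal}(w_i) = -\log P\left(w_i \mid w_1, \ldots, w_{i-1}\right)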
Josef Ruppenhofer, Philip John Gorinski, and Caroline Sporleder (2011).
In search of missing arguments: A linguistic approach.
In Proceedings of RANLP 2011, Hissar, Bulgaria. [PDF]
Semantic argument structures are often incomplete in that core arguments are not locally instantiated. However, many of these implicit arguments can be linked to referents in the wider context. In this paper we explore a number of linguistically motivated strategies for identifying and resolving such null instantiations (NIs). We show that a more sophisticated model for identifying definite NIs can lead to noticeable performance gains over the state-of-the-art for NI resolution.
Caroline Sporleder, Linlin Li, Philip John Gorinski, and Xaver Koch (2010).
Idioms in Context: The IDIX Corpus.
In Proceedings of LREC 2010, Valletta, Malta. [PDF]
Idioms and other figuratively used expressions pose considerable problems to natural language processing applications because they are very frequent and often behave idiosyncratically. Consequently, there has been much research on the automatic detection and extraction of idiomatic expressions. Most studies focus on type-based idiom detection, i.e., distinguishing whether a given expression can (potentially) be used idiomatically. However, many expressions such as break the ice can have both literal and non-literal readings and need to be disambiguated in a given context (token-based detection). So far relatively few approaches have attempted context-based idiom detection. One reason for this may be that few annotated resources are available that disambiguate expressions in context. With the IDIX corpus, we aim to address this. IDIX is available as an add-on to the BNC and disambiguates different usages of a subset of idioms. We believe that this resource will be useful both for linguistic and computational linguistic studies.