Alex Spengler
PhD in Informatics
Laboratoire d'Informatique de Paris 6 (LIP6)
Machine learning and (Bayesian) statistics, (Web) information extraction & visualization.
Alex Spengler studied Computer Science in Karlsruhe, Edinburgh and Paris.
In December 2011 he finished his PhD in Informatics, supervised by Professor Patrick Gallinari from the
Laboratoire d'Informatique at Université Pierre et Marie Curie,
Paris, France and Professor Bernhard Schölkopf from the Max
Planck Institute for Biological Cybernetics, Tübingen,
Germany.
His research interests are centered on the application of Bayesian statistics
to questions in document analysis, specifically Web content analysis.
Stefania Rubrichi, Silvana Quaglini, Alex Spengler and Patrick Gallinari
Extracting Information from Summary of Product Characteristics for Improving Drugs
Prescription Safety
In Proceedings of the 13th Conference on Artificial Intelligence in Medicine. Bled, Slovenia, 2011
Information about medications is critical in supporting decision-making during the
prescription process and thus in improving the safety and quality of care. The
Summary of Product Characteristics (SPC) represents the basis of information for
health professionals on how to use medicines. However, this information is locked
in free-text and, as such, cannot be actively accessed and elaborated by computerized
applications. In this work, we propose a machine learning based system for the automatic
recognition of drug-related entities (active ingredient, interaction effects, etc.) in SPCs,
focusing on drug interactions. Our approach learns to classify this information in a
structured prediction framework, relying on conditional random fields. The classifier is
trained and evaluated using a corpus of a hundred SPCs. They have been hand-annotated with
thirteen semantic labels that have been derived from a previously developed domain ontology.
Our evaluations show that the model exhibits high overall performance, with an
average F1-measure of about 90%.
@inproceedings{rubrichi2011extracting,
author = {Rubrichi, Stefania and Quaglini, Silvana and Spengler, Alex and Gallinari, Patrick},
title = {Extracting Information from Summary of Product Characteristics for Improving Drugs Prescription Safety},
booktitle = {AIME '11: 13th Conference on Artificial Intelligence in Medicine},
series = {Lecture Notes in Computer Science},
year = {2011},
pages = {327--337},
volume = {6747},
location = {Bled, Slovenia},
doi = {10.1007/978-3-642-22218-4_42},
publisher = {Springer},
}
Alex Spengler and Patrick Gallinari
Document Structure Meets Page Layout: Loopy Random Fields for Web News Content Extraction
In Proceedings of the 10th ACM Symposium on Document Engineering. Manchester, UK, 2010
Web content extraction is concerned with the automatic identification
of semantically interesting web page regions. To generalize to pages
from unknown sites, it is crucial to exploit not only the local characteristics
of a particular web page region, but also the rich interdependencies
that exist between the regions and their latent semantics. We therefore
propose a loopy conditional random field which combines semantic
intra-page dependencies derived from both document structure
and page layout, uses a realistic set of local and relational features
and is efficiently learnt in the tree-based reparameterization framework.
The results of our empirical analysis on a corpus of real-world news
web pages from 177 distinct sites with multiple annotations on DOM
node level demonstrate that our combination of document structure and
layout-driven interdependencies leads to a significant error reduction
on the semantically interesting regions of a web page.
@inproceedings{spengler2010document,
author = {Spengler, Alex and Gallinari, Patrick},
title = {Document Structure Meets Page Layout: Loopy Random Fields for Web News Content Extraction},
booktitle = {DocEng '10: Proceedings of the 10th ACM symposium on Document engineering},
year = {2010},
isbn = {978-1-4503-0231-9},
pages = {151--160},
location = {Manchester, United Kingdom},
doi = {10.1145/1860559.1860590},
publisher = {ACM},
address = {New York, NY, USA},
}
Alex Spengler, Antoine Bordes and Patrick Gallinari
A Comparison of Discriminative Classifiers for Web News Content Extraction
In Proceedings of RIAO 2010, 9th Int. Conf. on Adaptivity, Personalization and Fusion of Heterogeneous Information. Paris, France, 2010
Until now, approaches to web content extraction have focused on
random field models, largely neglecting large margin
methods. Structured large margin methods, however, have recently
shown great practical success. We compare, for the first time,
greedy and structured support vector machines with conditional
random fields on a real-world web news content extraction task,
showing that large margin approaches are indeed competitive with
random field models.
@inproceedings{spengler2010comparison,
author = {Spengler, Alex and Bordes, Antoine and Gallinari, Patrick},
title = {A Comparison of Discriminative Classifiers for Web News Content Extraction},
booktitle = {Proceedings of RIAO 2010, 9th Int. Conf. on Adaptivity, Personalization and
Fusion of Heterogeneous Information},
year = {2010},
isbn = {},
pages = {},
doi = {},
publisher = {CID},
address = {Paris, France},
}
Alex Spengler and Patrick Gallinari
Learning to Extract Content from News Webpages
The 2009 IEEE International Symposium on Mining and Web (MAW2009). Bradford, England, UK, 2009
We consider the problem of content extraction from online news
webpages. To explore to what extent the syntactic markup and the
visual structure of a webpage facilitate the extraction of its
content, we compare two state-of-the-art classifiers as first
instantiations of a general framework that allows for proper model
comparison. To this end, we introduce the publicly available
News600 corpus, a set of 604 real-world news webpages which have
been annotated with 30 semantic labels. An empirical analysis of the
two models on this dataset shows that the inclusion of structural
information is indeed advantageous.
@inproceedings{spengler2009learning,
author = {Alex Spengler and Patrick Gallinari},
title = {Learning to Extract Content from News Webpages},
booktitle = {International Conference on Advanced Information Networking and Applications Workshops},
volume = {0},
year = {2009},
isbn = {978-0-7695-3639-2},
pages = {709--714},
doi = {10.1109/WAINA.2009.97},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
}
Alex Spengler
Probabilistic Web Content Analysis. Representation of Content Semantics
in the Bayesian Diagnostic Paradigm.
PhD thesis, Université Pierre et Marie Curie, France, 2011
An automatic identification of meaningful content sections on web pages, such as titles, paragraphs, advertisements, product images or user comments, facilitates a large number of applications, ranging from speech rendering for the visually impaired through contextual advertising to structured web search. Ultimately, such an identification always necessitates both a partitioning of the content and a classification of the resulting partitions into a number of application-dependent semantic categories. We hence propose to approach the analysis of web content in an interdependent classification framework, integrating semantic coherence, just as in segmentation, via interaction features which describe the semantic configuration of two or more semantically atomic content regions.
One of the major obstacles to gaining meaningful access to web contents is their semantically inappropriate organisation and markup. As a consequence, it is generally impossible to characterise an interesting content region with certainty. In this thesis, we propose to treat the uncertainties arising in an analysis of web content in a coherent probabilistic framework, the Bayesian diagnostic paradigm. We attempt to illuminate the conditions under which a probability model might be justified, deriving its form of representation from assumptions about observable quantities such as region features and semantics, utilising the concepts of exchangeability, conditional independence and sufficiency. In particular, we examine different Markovian dependencies between the semantic content categories within individual web pages and discuss how to take into account the structure that exists between pages and sites.
We also present an informal feature analysis which elucidates the manifold information available in the content, structure and style of a web page. Such an analysis is an essential prerequisite to both formal probabilistic modelling and high predictive performance. Furthermore, we introduce a new, publicly available data set of 604 real-world news web pages from 206 distinct sites with accurate annotations based on 30 distinct semantic categories, termed the NEWS600 corpus. Finally, we conduct a series of experiments on the NEWS600 corpus to empirically compare a number of different approaches for web news content classification. The results demonstrate that even relatively simple models in our framework achieve significantly better results than the current state of the art.
@phdthesis{spengler2011probabilistic,
author = {Spengler, Alexander A.},
title = {Probabilistic Web Content Analysis. Representation of Content Semantics
in the Bayesian Diagnostic Paradigm.},
school = {{Université Pierre et Marie Curie, France}},
year = {2011},
month = {December}
}
Alex Spengler
Maximum Margin Markov Networks for XML Tag Relabelling
MSc thesis, Karlsruhe Institute of Technology, Germany, 2005
This thesis explores a number of different
discriminative models for learning to label the segments of
semi-structured documents. This is interesting as it allows querying
heterogeneous data collections using predefined, homogeneous segment
semantics. We focus on the recently introduced M3N framework, a
parameter estimation technique for models with interdependent outputs
which maximises the margin of the decision boundary. An instance of
this framework, similar to Sequential Minimal Optimisation, is
discussed. We hypothesise that this instance outperforms other models
such as conditional random fields and multiclass Support Vector
Machines. Experiments on an XML relabelling task confirm that this is
indeed the case. Although high-dimensional output spaces and large
tree-width Markov networks remain problematic, this is a highly
promising result, indicating that these algorithms render maximum
margin Markov networks (M3Ns) more than theoretically sound and
probabilistically principled.
@mastersthesis{spengler2005maximum,
author = {Spengler, Alex},
title = {Maximum Margin {M}arkov Networks for {XML} Tag Relabelling},
school = {{Karlsruhe Institute of Technology, Germany}},
year = {2005},
month = {December}
}
Alex Spengler
Neonatal Baby Monitoring
MSc thesis, University of Edinburgh, UK, 2003
In this thesis we investigate the use of probabilistic graphical models for
neonatal baby monitoring applications. In particular, we concentrate
on detecting artefact patterns in physiological data using a
conditional Gaussian approach. We describe a system that learns the
necessary parameters from the given data and produces marginal
posterior probabilities for the latent variables that have been used
to model the artefact processes. It should be emphasised that the
current system does not include the temporal evolution of the measured
signals, but we indicate how this can be done within the presented
framework. We also discuss our approach in the context of prior work
and present ways to overcome identified problems.
@mastersthesis{spengler2003neonatal,
author = {Spengler, Alex},
title = {Neonatal Baby Monitoring},
school = {{University of Edinburgh, UK}},
year = {2003},
month = {September}
}
Arboreal - A Firefox/Firebug extension for DOM tree visualization.
News600 - A corpus of 604 semantically labeled news web pages for web content extraction.
Alex Spengler gratefully acknowledges support from Microsoft
Research through its European PhD Scholarship Programme.