Alex Spengler
PhD Candidate
Laboratoire d'Informatique de Paris 6
Machine learning and (Bayesian) statistics, information extraction & visualization.
Alex Spengler studied Computer Science and Mathematics in Karlsruhe, Edinburgh and Paris. In April 2006 he began his PhD under the supervision of Professor Patrick Gallinari from the Laboratoire d'Informatique at Université Pierre et Marie Curie, Paris, France and Professor Bernhard Schölkopf from the Max Planck Institute for Biological Cybernetics, Tübingen, Germany. His research interests are centered on the intersection between machine learning and information extraction, focussing on structured prediction for document transformation.
Spengler, Alex and Gallinari, Patrick (2009): Learning to Extract Content from News Webpages. The 2009 IEEE International Symposium on Mining and Web (MAW2009). Bradford, England, UK.
We consider the problem of content extraction from online news
webpages. To explore to what extent the syntactic markup and the
visual structure of a webpage facilitate the extraction of its
content, we compare two state-of-the-art classifiers as first
instantiations of a general framework that allows for proper model
comparison. To this end, we introduce the publicly available
News600 corpus, a set of 604 real-world news webpages which have
been annotated with 30 semantic labels. An empirical analysis of the
two models on this dataset shows that the inclusion of structural
information is indeed advantageous.
@inproceedings{spengler2009learning,
author = {Alex Spengler and Patrick Gallinari},
title = {Learning to Extract Content from News Webpages},
booktitle = {Advanced Information Networking and Applications Workshops, International Conference on},
year = {2009},
isbn = {978-0-7695-3639-2},
pages = {709--714},
doi = {10.1109/WAINA.2009.97},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA}
}
Spengler, Alex (2005): Maximum Margin Markov Networks for
XML Tag Relabelling. MSc thesis, University of Karlsruhe (TH),
Germany.
This thesis explores a number of different
discriminative models for learning to label the segments of
semi-structured documents. This is interesting as it allows querying
heterogeneous data collections using predefined, homogeneous segment
semantics. We focus on the recently introduced M3N framework, a
parameter estimation technique for models with interdependent outputs
which maximises the margin of the decision boundary. An instance of
this framework, similar to Sequential Minimal Optimisation, is
discussed. We hypothesise that this instance outperforms other models
such as conditional random fields and multiclass Support Vector
Machines. Experiments on an XML relabelling task confirm that this is
indeed the case. Although high-dimensional output spaces and large
tree-width Markov networks remain problematic, this is a highly
promising result, indicating that these algorithms render maximum
margin Markov networks (M3Ns) more than theoretically sound and
probabilistically principled.
@mastersthesis{spengler2005maximum,
author = {Spengler, Alex},
title = {Maximum Margin {M}arkov Networks for {XML} Tag Relabelling},
school = {{University of Karlsruhe (TH), Germany}},
year = {2005},
month = {December}
}
Spengler, Alex (2003): Neonatal Baby Monitoring. MSc
thesis, University of Edinburgh, UK.
In this thesis we investigate the use of probabilistic graphical models for
neonatal baby monitoring applications. In particular, we concentrate
on detecting artefact patterns in physiological data using a
conditional Gaussian approach. We describe a system that learns the
necessary parameters from the given data and produces marginal
posterior probabilities for the latent variables that have been used
to model the artefact processes. It should be emphasised that the
current system does not include the temporal evolution of the measured
signals, but we indicate how this can be done within the presented
framework. We also discuss our approach in the context of prior work
and present ways to overcome identified problems.
@mastersthesis{spengler2003neonatal,
author = {Spengler, Alex},
title = {Neonatal Baby Monitoring},
school = {{University of Edinburgh, UK}},
year = {2003},
month = {September}
}
Nabu - A Firefox/Firebug extension for visual HTML annotation.
News600 - A corpus of 604 semantically labeled news web pages for web content extraction.
Alex Spengler gratefully acknowledges support from Microsoft Research through its European PhD Scholarship Programme.
© Copyright 2009 Alex Spengler. All rights reserved.