PDFX: fully-automated PDF-to-XML conversion of scientific literature

Alexandru Constantin, Steve Pettifer, Andrei Voronkov

In: Proceedings of the 2013 ACM Symposium on Document Engineering: The 13th ACM Symposium on Document Engineering; 10-13 Sep 2013; Auditorium Santa Apollonia, Via San Gallo, 25, 50129 Firenze, Italy. Florence, Italy: ACM; 2013. p. 177-180.

Abstract

PDFX is a rule-based system designed to reconstruct the logical structure of scholarly articles in PDF form, regardless of their formatting style. The system's output is an XML document that describes the input article's logical structure in terms of title, sections, tables, references, etc. and also links it to geometrical typesetting markers in the original PDF, such as paragraph and column breaks. The key aspect of the presented approach is that the rule set used relies on relative parameters derived from font and layout specifics of each article, rather than on a template-matching paradigm. The system thus obviates the need for domain- or layout-specific tuning or prior training, exploiting only typographical conventions inherent in scientific literature. Evaluated against a significantly varied corpus of articles from nearly 2000 different journals, PDFX gives a 77.45 F1 measure for top-level heading identification and 74.03 for extracting individual bibliographic items. The service is freely available for use at http://pdfx.cs.man.ac.uk/.
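
The idea of rules driven by relative, per-article parameters rather than fixed templates can be illustrated with a minimal sketch. This is not PDFX's actual rule set: the `Line` record, the `body_font_size` and `find_headings` helpers, and the 1.15 ratio are all hypothetical, chosen only to show how a threshold might be derived from each article's own dominant font size.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one extracted text line."""
    text: str
    font_size: float  # in points


def body_font_size(lines):
    """Return the font size that covers the most characters in the article."""
    weights = Counter()
    for line in lines:
        weights[line.font_size] += len(line.text)
    return weights.most_common(1)[0][0]


def find_headings(lines, ratio=1.15):
    """Flag lines whose font is noticeably larger than this article's body font.

    The threshold is relative to the article itself, so the same rule
    applies to layouts with different absolute font sizes.
    (The ratio 1.15 is an assumed value, not one taken from the paper.)
    """
    base = body_font_size(lines)
    return [line.text for line in lines
            if line.text.strip() and line.font_size >= ratio * base]


# Two articles with different absolute sizes; 10 pt is body text in the
# first but heading text in the second.
article_a = [
    Line("1 Introduction", 12.0),
    Line("Body text that is long enough to dominate the character count.", 10.0),
    Line("More body text set in the same font size as the line before.", 10.0),
]
article_b = [
    Line("Methods", 10.0),
    Line("Small body text that dominates the pages of this article.", 8.0),
    Line("Another line of small body text in the same font size.", 8.0),
]

headings_a = find_headings(article_a)
headings_b = find_headings(article_b)
```

Because the threshold is computed from each article's own statistics, no journal-specific template is needed; a real system would presumably combine many such cues (font weight, spacing, numbering) rather than font size alone.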

Bibliographic metadata

Conference title:
The 13th ACM Symposium on Document Engineering
Conference venue:
Auditorium Santa Apollonia, Via San Gallo, 25, 50129 Firenze, Italy
Conference start date:
2013-09-10
Conference end date:
2013-09-13
Publisher:
ACM
Place of publication:
Florence, Italy
Proceedings start page:
177
Proceedings end page:
180
Proceedings pagination:
177-180
Contribution total pages:
4

Record metadata

Manchester eScholar ID:
uk-ac-man-scw:218911
Created by:
Constantin, Alexandru
Created:
7th February, 2014, 12:47:25
Last modified by:
Constantin, Alexandru
Last modified:
30th July, 2014, 18:32:10
