Manchester eScholar Services

Supported by The University of Manchester Library

    Integrating text-mining approaches to identify entities and extract events from the biomedical literature

    Gerner, Lars Martin Anders

    [Thesis]. Manchester, UK: The University of Manchester; 2012.

    Abstract

    The amount of biomedical literature available is increasing at an exponential rate and is becoming increasingly difficult to navigate. Text-mining methods can potentially mitigate this problem through the systematic and large-scale extraction of structured information from inherently unstructured biomedical text. This thesis reports the development of four text-mining systems that, by building on each other, have enabled the extraction of information about a large number of published statements in the biomedical literature. The first system, LINNAEUS, enables highly accurate detection (“recognition”) and identification (“normalization”) of species names in biomedical articles. Building on LINNAEUS, we implemented a range of improvements in the GNAT system, enabling high-throughput gene/protein detection and identification. Using gene/protein identification from GNAT, we developed the Gene Expression Text Miner (GETM), which extracts information about gene expression statements. Finally, building on GETM as a pilot project, we constructed the BioContext integrated event extraction system, which was used to extract information about over 11 million distinct biomolecular processes in 10.9 million abstracts and 230,000 full-text articles. The ability to detect negated statements in the BioContext system enables the preliminary analysis of potential contradictions in the biomedical literature. All tools (LINNAEUS, GNAT, GETM, and BioContext) are available under open-source software licenses, and LINNAEUS and GNAT are available as online web services. All extracted data (36 million BioContext statements, 720,000 GETM statements, 72,000 contradictions, 37 million mentions of species names, 80 million mentions of gene names, and 57 million mentions of anatomical location names) is available for bulk download. The data extracted by GETM and BioContext is also available to biologists through easy-to-use search interfaces.

    Additional content not available electronically

    • Supplementary file 1: List of acronyms used by LINNAEUS and associated species probabilities.
    • Supplementary file 2: List of additional synonyms used for LINNAEUS in addition to data in the NCBI Taxonomy.
    • Supplementary file 3: List of species terms occurring in the English language that are frequently false positives (FPs).
    • Supplementary file 4: Details of the inter-annotator agreement calculations for the LINNAEUS evaluation corpus.
    • Supplementary file 5: List of potentially misspelled species names in MEDLINE.
    • Supplementary file 6: List of the 100 most frequently mentioned species (with absolute and relative frequency numbers).
    • Supplementary file 7: GETM evaluation corpus.
    • Supplementary file 8: B+G evaluation corpus for BioContext.
    • Supplementary file 9: List of detected contradictions in the data extracted by BioContext.

    (These files are located on a CD attached to the thesis.)

    Bibliographic metadata

    Degree type:
    Doctor of Philosophy
    Degree programme:
    PhD Bioinformatics
    Location:
    Manchester, UK
    Total pages:
    172
    Language:
    en

    Record metadata

    Manchester eScholar ID:
    uk-ac-man-scw:158970
    Created by:
    Gerner, Lars
    Created:
    15th April, 2012, 21:56:55
    Last modified by:
    Gerner, Lars
    Last modified:
    1st June, 2012, 12:54:42