Extracting patient data from tables in medical literature

How to extract patients' BMI, body weight, and patient counts

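MEDLINE is a database and PubMed is a search engine. PMC (PubMed Central), established in 2000 and managed by NCBI, is a free full-text archive of biomedical and life sciences literature that holds electronic copies of the print journals collected by NLM. The core principle of PMC is open access to full text. PMC does not publish articles itself; it collects full text deposited by participating journals and by authors who fall under research funder requirements.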

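Research on new drug development and disease treatment produces a large volume of clinical trial literature. The tables in these papers describe participants' baseline information, trial arms, the names and side effects of the tested drugs, and interactions between biomedical substances, such as drug-drug or protein-protein interactions.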

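The second largest source is the drug label databases (Drugs@FDA, FDA Online Label Repository, and DailyMed). The question there is how to extract drug incompatibilities from the label files in these databases.

Documents in MEDLINE, PMC, and DailyMed can all be downloaded in XML format.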

https://github.com/cstubben/pmcXML/blob/2b0fb3465dff00dbb13f3552d4c21f4de5ec3a70/R/pmcOAI.R

https://dtd.nlm.nih.gov/archiving/tag-library/3.0/index.html
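Because these sources can be downloaded as XML, a first extraction pass can read tables directly from the markup. The following is a minimal sketch, assuming the article uses the `table-wrap`, `label`, and `caption` elements of the NLM tag set with an XHTML-style `table` inside. The sample fragment is hand-written for illustration and is not a real PMC article; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written fragment in the style of the NLM tag set.
# It is an illustrative assumption, not a real PMC article.
SAMPLE = """
<article>
  <body>
    <table-wrap id="T1">
      <label>Table 1</label>
      <caption><p>Baseline characteristics (n = 40)</p></caption>
      <table>
        <thead>
          <tr><th>Variable</th><th>Placebo</th><th>Drug</th></tr>
        </thead>
        <tbody>
          <tr><td>BMI (kg/m2)</td><td>27.1</td><td>26.8</td></tr>
        </tbody>
      </table>
    </table-wrap>
  </body>
</article>
"""


def cell_text(node):
    """Join all text inside a node, collapsing whitespace."""
    return " ".join("".join(node.itertext()).split())


def extract_tables(xml_string):
    """Return one dict per <table-wrap>: label, caption, and rows of cell text."""
    root = ET.fromstring(xml_string)
    tables = []
    for wrap in root.iter("table-wrap"):
        caption = wrap.find("caption")
        rows = [
            [cell_text(c) for c in tr if c.tag in ("th", "td")]
            for tr in wrap.iter("tr")
        ]
        tables.append({
            "label": wrap.findtext("label", default=""),
            "caption": cell_text(caption) if caption is not None else "",
            "rows": rows,
        })
    return tables


tables = extract_tables(SAMPLE)
```

This sketch ignores `rowspan` and `colspan` attributes and XML namespaces, both of which real articles may use.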

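Clinical trial literature

MEDLINE covers 5,600 journals in 30 languages and holds more than 27,000,000 records
MEDLINE contains nearly 800,000 clinical trial publications
How can researchers cope with literature that grows exponentially?

Can data mining help?

Mining techniques for large-scale text processing
Most current mining techniques focus on plain text; tables and figures are ignored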

DailyMed: This website contains 102740 drug listings as submitted to the Food and Drug Administration (FDA).

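Objective: to provide a reference for pharmacy staff choosing among drug label databases according to their needs. Methods: the information organization of three major open-access drug label databases, Drugs@FDA, FDA Online Label Repository, and DailyMed, was collected and compared in three respects: search functions, content shown in search results, and data sources and service goals. Results and conclusions: DailyMed offers the richest search functions. In search result content, DailyMed data is the most highly structured, followed by FDA Online Label Repository, while Drugs@FDA is only partially structured. All three databases support copying and printing pages; of the three, DailyMed has the friendliest interface, is the most open, and offers download of the full data set. Regarding data sources and service goals, Drugs@FDA and FDA Online Label Repository are both developed by the FDA, and DailyMed by the U.S. National Library of Medicine. Drugs@FDA draws on labels that have passed strict FDA review and gives the most complete description of any single drug. FDA Online Label Repository holds the original labels that manufacturers submit to the FDA; its content is the most recent and even includes drugs not yet marketed. DailyMed draws on the information printed on the packaging of marketed drugs and also includes labels of many drugs that are marketed but have not undergone strict review, so it covers the most drugs.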

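Tables in medical documents

72% of the articles in PMC contain tables, with 3.1 tables per article
Drug labels in NLM's DailyMed database have 4.1 tables per label
Tables usually relate to the design and conclusions of clinical trials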

Representation for visualisation. Tables are primarily used so that data can be viewed easily. Most methods and languages that support describing tables, including XML, HTML and LaTeX, are designed with the focus on visualisation. In mark-up languages, tables contain a lot of information about what a table should look like (see Figure 1.8), but very little about how the table entries relate (Thompson 1996, Hurst & Douglas 1997). Since the focus is on visual representation, a table author only needs to attend to the visual appearance of the table and can ignore describing the functions of areas or the relationships between them. Therefore, reading and computational analysis of tables described in this manner require a method that is able to disentangle the visual structure before further analysis.

Variety of table structural layouts and visual relationships. There is no "common" table structure. The combination of cell arrangement, spanning, content, and function (headings or data) determines how the table is read and understood. Cells can span several other cells both horizontally and vertically. Some examples of tabular structures are presented in Figure 1.9. Cells in a table are visually related, presenting multiple dimensions and annotations of the data, in contrast to linear textual information. Tables are flexible in their structure, providing authors with means to shape them according to their data presentation needs. This flexibility makes automated detection of functional areas (functional analysis of a table) and resolution of inter-cell relationships (structural analysis of a table) challenging. Table layouts and their visual relationships are not specific to any domain, and complex tables can be found in any domain. However, domain-specific knowledge is useful for detecting functional areas or resolving relationships within the table. Evaluating and creating new machine learning-based approaches to disentangle visual layouts would be simpler if gold standard annotated corpora existed. Several annotation schemas have been proposed for annotating textual resources, and over time they have been standardised. At the moment there is almost no commonly accepted annotation schema for tables, nor research on how table content should be annotated while preserving the structure and the relationships between cells.
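Spanning cells are one concrete reason why the visual structure has to be disentangled before analysis. The sketch below shows one possible normalisation step: expanding cells that span rows or columns into a rectangular grid, so that every data cell can later be related to its headers by position. The input representation, a `(text, rowspan, colspan)` tuple per cell, and the function name are assumptions made for illustration.

```python
def expand_spans(rows):
    """Expand cells with row/column spans into a rectangular grid.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan) tuple.
    Every grid position covered by a spanning cell receives a copy of its text.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


example = [
    [("", 2, 1), ("Placebo", 1, 2), ("Drug", 1, 2)],
    [("Mean", 1, 1), ("SD", 1, 1), ("Mean", 1, 1), ("SD", 1, 1)],
    [("BMI", 1, 1), ("27.1", 1, 1), ("3.2", 1, 1), ("26.8", 1, 1), ("3.0", 1, 1)],
]
grid = expand_spans(example)
```

After expansion, the header path of the cell "3.2" can be read off column 2 as "Placebo" then "SD".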

Variety of value presentation formats. Values in cells can be presented using various syntactic formats. While some authors present mean and standard deviation in one cell using the plus-minus (±) sign (e.g. 12±2), some use brackets (e.g. 12(2)) and some use two separate cells. Extracting these values requires knowledge of the value presentation patterns that article authors most frequently use in tables. The formats may differ between research domains, and the same format is often used to present different things in different domains. For example, a value such as 12±2 in biomedical literature usually indicates a mean or median with standard deviation, while in computer science it is often used for mean and standard error.
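A small pattern matcher makes the point concrete. This is a sketch, assuming only three formats (plus-minus, brackets, and a single number); the pattern set and names are illustrative and far from complete. Note that the code can only recover the two numbers, not whether the second one is a standard deviation or a standard error, which depends on the domain.

```python
import re

NUM = r"[-+]?\d+(?:\.\d+)?"
# "12 ± 2", "12 +/- 2" or "12 +- 2"
PLUS_MINUS = re.compile(rf"^\s*({NUM})\s*(?:±|\+/-|\+-)\s*({NUM})\s*$")
# "12 (2)" or "12(2)"
BRACKETS = re.compile(rf"^\s*({NUM})\s*\(\s*({NUM})\s*\)\s*$")
SINGLE = re.compile(rf"^\s*({NUM})\s*$")


def parse_value(cell):
    """Return (central value, dispersion or None, format name), or None if nothing matches."""
    for name, pattern in (("plus_minus", PLUS_MINUS), ("brackets", BRACKETS)):
        m = pattern.match(cell)
        if m:
            return float(m.group(1)), float(m.group(2)), name
    m = SINGLE.match(cell)
    if m:
        return float(m.group(1)), None, "single"
    return None
```

For example, `parse_value("12±2")` and `parse_value("12 (2)")` yield the same pair of numbers with different format names.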

The goal of table mining is to make table information easily accessible and to interpret tables automatically. The process of making published information from various sources structured, managed, searchable and easily accessible in the future, while maintaining its value, is commonly referred to as data curation (Choudhury 2008). If done manually, data curation is a laborious and expensive task. Automated or assisted curation can speed up the curation process by more than 70% and help make information computationally interpretable (Alex et al. 2008).

In order to process tables and extract information from them successfully, these challenges need to be addressed.

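Challenges of table data mining

Large and heterogeneous content
Diverse layouts
Diverse data types
Visual presentation that is easily misread
Lack of annotated training data
How to extract automatically from XML, TXT, PDF, and JPG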

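Case study

Extract patient counts, patient BMI, and weight data from tables in clinical trial literature
A multi-layered mining method for processing the information in tables
- enables large-scale, semi-automated extraction
- post-processes and reorganizes the data in the tables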

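Our chosen approach

Rule-based
- rules are derived from manual analysis of the collected tables
Three steps
- table detection
- functional analysis
- information extraction/semantic analysis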

Method overview

Processing pipeline

Table - Latest

Descriptive information
- table name/title
- footnote/footer/legend
Navigation area
- row header/stub
- super-row/sub-row-header
- column header
Table data

In this thesis, Wang's definitions have been slightly modified: the stub head is treated as part of the stub. It is worth noting that not all tables contain all four parts. For example, in a large proportion of tables the stub and row headers can be absent, or the column headers are not "boxed". In addition to these definitions, this thesis uses the following table terms: header, column, row, title, caption, superheader, nested header, subheader, block, cell and element. Figure 1 illustrates these definitions.

An element is a single word or number on a PDF page. The difference between a cell and an element is that one cell can contain several elements. A block contains several cells. Although the title and caption may not be part of the table proper, they carry important information about its content, so they are included in the definitions and in the extraction process. They are especially needed for functional and semantic processing of the table data, where they must be extracted and linked to the table.

A superheader is a header that spans multiple columns and has other column headers below it (typically nested headers, each associated with a single column). A subheader is a cell that usually sits on a row containing no table body elements, and it is associated with all the stub elements below it until the next subheader is found. Only in tables where the stub contains more than one column may subheaders exist on rows that also contain body data. The left-to-right style of writing used by all western languages is guarantee enough that the stub can be trusted to be located at the left end of the table, in Column 1. There are of course exceptions, but the percentage of tables where the stub columns are not at the left end is negligible. Slightly more commonly, a duplicate of the stub can exist in the middle of a table or in its rightmost column.

Table - Hurst

(i) the stub, which contains the row headers and subheaders;
(ii) the boxhead, which contains the column headers (excluding the stub head);
(iii) the stub head, which contains the header for the stub;
(iv) the body, which contains the actual data of the table.
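The four regions above can be sketched as a simple data structure. This is a minimal illustration, assuming the table is already a rectangular grid of strings and that the number of header rows and stub columns is known; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableRegions:
    stub_head: list  # header cells above the stub
    boxhead: list    # column headers, excluding the stub head
    stub: list       # row headers and subheaders
    body: list       # data cells


def split_regions(grid, n_header_rows=1, n_stub_cols=1):
    """Cut a rectangular grid into stub head, boxhead, stub, and body."""
    header_rows = grid[:n_header_rows]
    data_rows = grid[n_header_rows:]
    return TableRegions(
        stub_head=[row[:n_stub_cols] for row in header_rows],
        boxhead=[row[n_stub_cols:] for row in header_rows],
        stub=[row[:n_stub_cols] for row in data_rows],
        body=[row[n_stub_cols:] for row in data_rows],
    )


regions = split_regions([
    ["Variable", "Placebo", "Drug"],
    ["Age (years)", "54.2", "55.0"],
    ["BMI (kg/m2)", "27.1", "26.8"],
])
```

Tables without a stub can be handled in this sketch by passing `n_stub_cols=0`, which leaves the stub and stub head empty.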

Table type - List

List: a table in list form

Table type - Matrix

Matrix: a table in matrix form

Table type - Super-row

Super-row: a table with super-rows

Table type - multi-tables

Table model

ref: heuristics rely mainly on the arrangement of cells, spanning cells, the presence of special characters (e.g. horizontal lines) or empty cells, and cell similarity.
- 1. Identifying table boundaries in digital documents via sparse line detection.
- 2. A comparison of two unsupervised table recognition methods from digital scientific articles
- 3. Detecting Table Region in PDF Documents Using Distant Supervision
- 1. Heuristics based on structural cues such as whether a line contains white space, alignment within lines, and the similarity of patterns between lines.
- 2. Heuristics based on structural cues such as the proportion of special characters in a text line and relative positions within and between lines.
- 3. Table localization based on features such as keywords, reading order, and line sparsity.
- 4. Machine learning methods (decision trees, HMM, CRF) that locate tables using layout features such as the number of rows and columns, the types of line intersections, the presence of borders and cells, left/right text alignment, line spacing, similarity and repetition of cell content, the presence of images/hyperlinks/controls, the presence of headers, titles, and so on.
- Automatic selection of table areas in documents for information extraction
- 2000, PhD thesis: the interpretation of tables in text
- 2015: Table understanding using a rule engine
- 2015: On methods and tools of table detection, extraction and annotation in PDF documents
- 2015: TEXUS: a task-based approach for table extraction and understanding
- 2016: A corpus of tables in full-text biomedical research publications
- optical character recognition: 1999: T-recs table recognition and validation approach.
- decision trees: 1999: Learning to recognize tables in free text.
- Support Vector Machines: 2008: Discriminating meaningful web tables from decorative tables using a composite kernel.
- heuristics: 2005: pdf2table: A method to extract table information from pdf *

Table detection && physical layout analysis

Goal: distinguish table regions from non-table regions (text, images, charts)
Non-DL
- contiguous lines 1
- contiguous text blocks 2
- distant supervision 3
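To make the "contiguous lines" idea tangible, here is a much simplified sketch for plain-text input: a line is treated as sparse when it contains several wide gaps, and a run of consecutive sparse lines becomes a candidate table region. This is a toy heuristic written for illustration, not the method of the cited paper; the thresholds and function names are assumptions.

```python
import re

# A run of two or more spaces between two non-space characters.
GAP = re.compile(r"\S {2,}(?=\S)")


def is_sparse(line, min_gaps=2):
    """A line counts as sparse when it has at least `min_gaps` wide gaps."""
    return len(GAP.findall(line)) >= min_gaps


def candidate_table_regions(lines, min_lines=2):
    """Return (start, end) index pairs, end exclusive, of contiguous sparse-line runs."""
    regions, start = [], None
    for i, line in enumerate(lines):
        if is_sparse(line):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_lines:
                regions.append((start, i))
            start = None
    if start is not None and len(lines) - start >= min_lines:
        regions.append((start, len(lines)))
    return regions


page = [
    "Baseline characteristics are shown below.",
    "Variable      Placebo    Drug",
    "Age (years)   54.2       55.0",
    "BMI (kg/m2)   27.1       26.8",
    "Values are means.",
]
regions = candidate_table_regions(page)
```

With `min_gaps=2` the sketch misses two-column tables; lowering it to 1 finds them at the cost of more false positives in ordinary text.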

Table detection && physical layout analysis (continued)

Goal: distinguish table regions from non-table regions (text, images, charts)
DL
- decision trees, SVM, heuristics (unsupervised vs supervised)
- DIA, region segmentation (pixel-based or texture-based segmentation 4), treating the PDF page as an image

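Table functional analysis && logical layout analysis

Goal: determine which cells are the table name, title, and headers, identify rows, columns, and the cell grid, and use inter-cell relationships to reconstruct the table's logical structure
Determine the table type: List, Matrix, super-row, multi-table
DL
- decision trees, CRF
- requires annotated training data
Heuristics (layout + domain knowledge)
- word clustering 1
- partitioning the projection histograms 2 && X-Y cut 3
- rules specific to each table type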
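The projection-based heuristics can be illustrated with a single cut along the x-axis: words are projected onto the horizontal axis, and columns are separated wherever the projection is empty for a wide enough stretch. The sketch below assumes word extents `(x0, x1)` are already available from a PDF or OCR layer; the input format, the `min_gap` threshold, and the function name are illustrative assumptions. A full X-Y cut would apply the same step recursively, alternating between axes.

```python
def column_cuts(boxes, min_gap=5.0):
    """Split horizontal extents into columns at gaps in the x-axis projection.

    `boxes` is an iterable of (x0, x1) extents of words or text chunks.
    Returns a list of (left, right) column extents, ordered left to right.
    """
    columns = []
    for x0, x1 in sorted(boxes):
        if columns and x0 - columns[-1][1] < min_gap:
            # Overlaps or nearly touches the current column: extend it.
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            # A gap of at least min_gap starts a new column.
            columns.append([x0, x1])
    return [tuple(c) for c in columns]


words = [(10, 60), (12, 55), (100, 130), (102, 128), (180, 200)]
columns = column_cuts(words)
```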

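Table functional analysis && logical layout analysis (continued)

Goal: determine which cells are the table name, title, and headers, identify rows, columns, and the cell grid, and use inter-cell relationships to reconstruct the table's logical structure
Header recognition cues
- numbers
- repetition
- letters
- other (data type, font size, text alignment, separator lines, subheaders, etc.)
Table Identification Guidelines 4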
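Of the cues listed, the numeric one is the easiest to sketch: header rows tend to contain words, while body rows contain mostly numbers. The following minimal example counts the leading rows whose non-stub cells are mostly non-numeric. It assumes a rectangular grid with a single stub column; the 0.5 threshold and the function names are assumptions, and the other cues (repetition, font, alignment) are not covered.

```python
import re

# A plain number, optionally signed, with an optional decimal part or percent sign.
NUMERIC = re.compile(r"^\s*[-+]?\d+(?:[.,]\d+)?\s*%?\s*$")


def numeric_fraction(cells):
    """Fraction of non-empty cells that look numeric."""
    filled = [c for c in cells if c.strip()]
    if not filled:
        return 0.0
    return sum(bool(NUMERIC.match(c)) for c in filled) / len(filled)


def count_header_rows(grid, threshold=0.5):
    """Count leading rows whose non-stub cells are mostly non-numeric."""
    for i, row in enumerate(grid):
        if numeric_fraction(row[1:]) >= threshold:
            return i
    return len(grid)


table = [
    ["", "Placebo", "Placebo", "Drug", "Drug"],
    ["", "Mean", "SD", "Mean", "SD"],
    ["BMI", "27.1", "3.2", "26.8", "3.0"],
    ["Age", "54.2", "8.1", "55.0", "7.9"],
]
n_headers = count_header_rows(table)
```

Tables whose headers are themselves numeric, such as years or dose levels, defeat this cue, which is one reason several cues are combined.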

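Table semantic analysis

Goal: identify the concepts and attributes that the data in each cell represents, and the relationships among them
Different levels
- business type of the table (lab reports, invoices, etc.)
- cell value type (such as number, currency, percentage, named entity, and non-numeric words)
- keywords list
- domain specific ontology and lexicons 1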
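At the cell level, value typing can start from simple patterns. The sketch below distinguishes percentage, currency, plain number, empty, and text cells. The patterns and category names are illustrative assumptions; recognising named entities would additionally need a lexicon or ontology and is not attempted here.

```python
import re

# Order matters: more specific patterns are tried before the plain number.
PATTERNS = [
    ("percentage", re.compile(r"^[-+]?\d+(?:\.\d+)?\s*%$")),
    ("currency", re.compile(r"^[$€£¥]\s*\d+(?:,\d{3})*(?:\.\d+)?$")),
    ("number", re.compile(r"^[-+]?\d+(?:,\d{3})*(?:\.\d+)?$")),
]


def classify_cell(text):
    """Return a coarse syntactic type for one cell."""
    value = text.strip()
    if not value:
        return "empty"
    for name, pattern in PATTERNS:
        if pattern.match(value):
            return name
    return "text"
```

For instance, `classify_cell("45%")` gives `"percentage"` while `classify_cell("Placebo")` falls through to `"text"`.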

Information extraction

Extract the variables of interest and their values from tables
Templates

Extracting patient counts

Heuristic approach
Search the caption, headers, and cells
In the caption
- n = %d
- %d Adj* (patients|participants|subjects|individuals)
In headers
- n = %d
- counts can be partial and need adding up
In cells
- the stub contains a defined word or phrase
- counts can be partial and need adding up
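The caption and header patterns above can be sketched with regular expressions. This is one possible reading, not the original implementation: `Adj*` is approximated by allowing up to three intervening words, a caption match is preferred over header matches, and per-arm counts in headers are summed. All names and these choices are assumptions.

```python
import re

# "n = %d"
N_EQUALS = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)
# "%d Adj* patients": a number, up to three intervening words, then a population noun.
COUNT_NOUN = re.compile(
    r"\b(\d+)\s+(?:[A-Za-z-]+\s+){0,3}(?:patients|participants|subjects|individuals)\b",
    re.IGNORECASE,
)


def counts_in(text):
    """Return all counts matched in `text`, trying 'n = %d' first."""
    found = [int(m) for m in N_EQUALS.findall(text)]
    if not found:
        found = [int(m) for m in COUNT_NOUN.findall(text)]
    return found


def total_patients(caption, headers):
    """Prefer a count from the caption; otherwise add up partial counts in headers."""
    in_caption = counts_in(caption)
    if in_caption:
        return in_caption[0]
    per_arm = [n for h in headers for n in counts_in(h)]
    return sum(per_arm) if per_arm else None
```

Adding up header counts double-counts when a table also has a "Total (n = ...)" column, so a real rule set would need to exclude such columns.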

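Extracting BMI data

Use a keyword list (BMI, body mass index) and a blacklist (change, increase)
A keyword in the stub or a header indicates that the data may be present
Valid value range: 14-40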
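These three rules combine naturally into a small filter. The sketch below uses the keyword list, blacklist, and 14-40 range stated above, and assumes each row is a list whose first item is the stub label followed by plain numeric strings; that input format and the function names are assumptions for illustration.

```python
KEYWORDS = ("bmi", "body mass index")
BLACKLIST = ("change", "increase")
BMI_RANGE = (14.0, 40.0)


def is_bmi_label(label):
    """A label qualifies if it has a keyword and no blacklisted word."""
    text = label.lower()
    return any(k in text for k in KEYWORDS) and not any(b in text for b in BLACKLIST)


def extract_bmi(rows):
    """rows: list of [stub label, value, value, ...] with values as strings."""
    results = []
    for label, *cells in rows:
        if not is_bmi_label(label):
            continue
        for cell in cells:
            try:
                value = float(cell)
            except ValueError:
                continue  # not a plain number
            if BMI_RANGE[0] <= value <= BMI_RANGE[1]:
                results.append(value)
    return results


rows = [
    ["BMI (kg/m2)", "27.1", "26.8"],
    ["Change in BMI", "-1.2", "0.3"],
    ["Body mass index", "55", "31.0"],
    ["Age", "54", "55"],
]
bmi_values = extract_bmi(rows)
```

The range check is what rejects the value 55 in the third row, while the blacklist rejects the whole "Change in BMI" row.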

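Extracting patient weight

Use a keyword list () and a blacklist ()
A keyword in the stub or a header indicates that the data may be present
A label containing a blacklisted word is rejected
A value range check does not work
- some people weigh 40-150 kg
- pounds (lbs): 80-130 lbs
- infants: 1500-5000 g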
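Since the examples above differ mainly in their units, one possible mitigation is to read the unit from the label and convert to kilograms before any further checks. The sketch below is an assumption layered on top of the notes, not part of the described method; it only covers units that appear explicitly in the label, and the function name is hypothetical.

```python
import re

# Conversion factors to kilograms.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "lbs": 0.45359237}
UNIT = re.compile(r"\b(kg|g|lbs?)\b", re.IGNORECASE)


def weight_in_kg(value, label):
    """Convert `value` to kilograms using the unit named in `label`, if any."""
    m = UNIT.search(label)
    if m is None:
        return None  # unit unknown: leave the decision to other rules
    return value * TO_KG[m.group(1).lower()]
```

Even after conversion, adult and infant populations do not share one plausible range, so the unit alone does not make a single range check sufficient.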

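Results

The corpus contains 1273 documents and 3573 tables
Each table has 80 cells on average
Evaluation of functional and structural processing
- 100 randomly selected tables of each type
Evaluation of information extraction
- patient counts
  - 758 contained data
  - 50 randomly selected documents
- BMI and weight
  - 113 documents contain this information

Results of functional analysis

Results of information extraction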

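Discussion

Better-scoped values, such as BMI, can be modelled more precisely and give better performance
Whitelists and blacklists should be defined as completely as possible
Presentation formats and methods vary widely
Some layouts are easily misread
Overall results are good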

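Summary

Large-scale table mining was used to collect population information from clinical trials
Tables were classified by layout
Only patient counts, BMI, and weight were used as demonstrations
The results are strong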