Representation for visualisation. Tables are primarily designed so that data can
be easily viewed. Most methods and languages that support describing tables, including
XML, HTML or LaTeX, are designed with a focus on visualisation. In mark-up
languages, tables contain a lot of information about what a table should look like (see
Figure 1.8), but very little about how the table entries relate (Thompson 1996, Hurst
& Douglas 1997). Since the focus is on visual representation, a table author only
needs to attend to the visual appearance of the table and can leave the functions of
areas and the relationships between cells undescribed. Therefore, reading and
computational analysis of tables described in this manner require a method that is able
to disentangle the visual structure before further analysis.
Variety of table structural layouts and visual relationships. There is no "common"
table structure. The combination of cell arrangement, spanning, content,
and function (headings or data) determines how the table is read and understood. Cells
can span several other cells both horizontally and vertically. Some examples
of tabular structures are presented in Figure 1.9. Cells in a table are visually
related, presenting multiple dimensions and annotations of the data, in contrast to linear
textual information. Tables are flexible in their structure, providing authors with
means to shape them according to their data presentation needs. This flexibility makes
automated detection of functional areas (functional analysis of a table) and resolution
of inter-cell relationships (structural analysis of a table) challenging. Table layouts and
their visual relationships are not specific to any domain. Complex tables can be found
in any domain. However, domain-specific knowledge is useful for detecting functional
areas or resolving relationships within the table. Evaluation and creation of new machine
learning-based approaches to disentangle visual layouts would be simplified if gold
standard annotated corpora existed. Several annotation schemas have been proposed
for annotating textual resources, and over time they have been standardised. At the
moment there is almost no commonly accepted annotation schema for tables, nor
research on how table content should be annotated while preserving the structure and
relationships between the cells.
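As a minimal sketch of one step of the structural analysis described above, the
following Python function expands cells that span several rows or columns into a
rectangular grid, so that every grid position carries the text of the cell covering it.
The (text, rowspan, colspan) input format and the name expand_spans are assumptions
made for illustration; they are not taken from any particular mark-up parser.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input format: each cell is (text, rowspan, colspan),
# and each row lists its cells in the order they appear in the mark-up.
Cell = Tuple[str, int, int]


def expand_spans(rows: List[List[Cell]]) -> List[List[Optional[str]]]:
    """Copy every spanning cell into each grid position it covers."""
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a cell spanning down from above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions that no cell covers are reported as None.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


example_rows = [
    [("Group", 2, 1), ("Weight", 1, 2)],  # two-level header
    [("Mean", 1, 1), ("SD", 1, 1)],
    [("A", 1, 1), ("70", 1, 1), ("5", 1, 1)],
]
example_grid = expand_spans(example_rows)
```

Once spans are expanded, each data cell can be paired with the header cells in its
column and row, which is one possible starting point for resolving inter-cell
relationships.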
Variety of value presentation formats. Values in cells can be presented using various
syntactic representation formats. While some authors may present mean and standard
deviation in one cell using the plus-minus (±) sign (e.g. 12±2), some will use
brackets (e.g. 12(2)) and some will use two separate cells. Extraction of these values
requires knowledge of the value presentation patterns that article authors most
frequently use in tables. The value presentation formats may differ between
research domains. Often the same representation format is used for presenting
different things in different domains. For example, a presentation such as 12±2 in
biomedical literature would usually indicate mean or median with standard deviation,
while in the computer science domain it may often be used for mean and standard error.
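The two single-cell formats mentioned above can be recognised with a small set of
regular expressions. The sketch below is illustrative only: the pattern list and the
function name parse_value_with_spread are assumptions, and real tables would need
more patterns. The function deliberately returns a neutral (value, spread) pair,
because deciding whether the second number is a standard deviation or a standard
error depends on the domain.

```python
import re
from typing import Optional, Tuple

# A signed integer or decimal number, e.g. "12", "-3.5".
NUMBER = r"[-+]?\d+(?:\.\d+)?"

# Illustrative patterns only; real tables use many more formats.
PATTERNS = [
    # "12±2", "12 ± 2" or the ASCII fallback "12 +/- 2"
    re.compile(rf"^\s*({NUMBER})\s*(?:±|\+/-)\s*({NUMBER})\s*$"),
    # "12(2)" or "12 (2)"
    re.compile(rf"^\s*({NUMBER})\s*\(\s*({NUMBER})\s*\)\s*$"),
]


def parse_value_with_spread(cell: str) -> Optional[Tuple[float, float]]:
    """Return (value, spread) if the cell matches a known pattern, else None."""
    for pattern in PATTERNS:
        match = pattern.match(cell)
        if match:
            return float(match.group(1)), float(match.group(2))
    return None
```

Cells that match neither pattern return None, so values split over two separate
cells would have to be handled at the table level rather than the cell level.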
The goal of table mining is to make table information easily accessible and to interpret
tables automatically. The process of making published information from literature
in various sources structured, managed, searchable and easily accessible in the future,
while maintaining its value, is commonly referred to as data curation (Choudhury 2008). If
done manually, data curation is a laborious and expensive task. Automated or assisted
curation can speed up the curation process by more than 70% and help make information
computationally interpretable (Alex et al. 2008).
In order to successfully process tables and extract information from them, these
challenges need to be addressed.