3. Analysis And Computation Of Attributes

The inputs to the analysis step are a multipage document image and, optionally, a metadata file. Currently, the system accepts PDF and TIFF files. First, the document is preprocessed with commercial software that performs OCR and layout analysis. The output of the preprocessing, a collection of document elements, is further analyzed to assign logical labels to the elements, such as title, subtitles, author names, abstract, figures, and figure captions. Besides visual information, the analysis step also determines audible document information from the document image and metadata. Audible information is anything that can be converted to synthesized speech, for example figure captions, keywords, authors' names, and the publication name. We compute the keywords of a document with TF-IDF analysis [4].
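As a rough illustration of the keyword step, the following sketch ranks the terms of one document by TF-IDF against a small reference corpus. The tokenizer, the smoothed IDF formula, and the function names are assumptions made for this example; the weighting variant actually used in [4] may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(document, corpus, top_k=3):
    """Return the top_k terms of `document` ranked by TF-IDF against `corpus`.

    `corpus` is a list of strings. The smoothed IDF below is one common
    variant, chosen here for illustration only.
    """
    doc_tokens = tokenize(document)
    term_counts = Counter(doc_tokens)
    corpus_terms = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_terms)

    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for terms in corpus_terms if term in terms)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]
```

Terms that are frequent in the document but rare in the corpus rise to the top, which is why common words such as "the" are ranked below content words.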

Optimization problems related to documents generally involve spatial constraints, such as optimizing layout and size for readability and reducing spacing [2][5]. In such frameworks, information attributes are commonly associated with different parts of a document. Since our framework optimizes not only the spatial presentation but also the temporal presentation, we associate time attributes with each document element in addition to information attributes.

Document elements are divided into the following three groups: purely visual, Ev; purely audible, Ea; and synchronized audiovisual, Eav. Visual elements include document elements such as figures and graphs without any captions. Audible elements are elements that can be communicated easily in the audio channel without a visual representation. Examples of audible elements include keywords and the number of pages. Audiovisual elements are presented on the audio and visual channels simultaneously. Examples include figures with captions.

3.1. Time Attributes

Given a document element e, the time attribute t(e) is the approximate duration that is sufficient for a user to comprehend the element. To compute time attributes for figures without captions, we assume that complex figures take longer to comprehend. The complexity of a figure element e is measured by the figure entropy H(e), which is computed using the Multi-resolution Bit Distribution described in [6]. The time attribute for a figure element is computed as t(e) = α H(e) / H(P), where H(e) is the figure entropy, H(P) is the entropy of the entire page, and α is a time constant. The time required to comprehend a photo might differ from that required for a graph; therefore, a different α can be used for each figure type. We do not distinguish between figure types in this paper, and α is fixed to 4 seconds, which is the average time a user spent on a figure in our experiments.
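The formula above can be sketched as a small function. The entropies are taken as inputs here, since the Multi-resolution Bit Distribution of [6] is not reproduced; the function name and the error handling are assumptions made for illustration.

```python
ALPHA_SECONDS = 4.0  # time constant from the text: 4 seconds for all figure types


def figure_time_attribute(figure_entropy, page_entropy, alpha=ALPHA_SECONDS):
    """Compute t(e) = alpha * H(e) / H(P) for a figure without a caption.

    `figure_entropy` is H(e) and `page_entropy` is H(P). How these are
    computed is outside this sketch; any consistent entropy measure works
    for demonstrating the ratio.
    """
    if page_entropy <= 0:
        raise ValueError("page entropy H(P) must be positive")
    return alpha * figure_entropy / page_entropy
```

For instance, a figure that accounts for a quarter of the page entropy receives 1 second with α = 4. Passing a different `alpha` per figure type (photo, graph) is how the per-type constants mentioned in the text could be supported.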

The time attribute for a text document element (e.g., title, abstract) is the duration of the visual effects necessary to show the text segment to the user at a readable resolution. In previous experiments, we determined that text should be at least 7 points high in order to be readable [2]. If text is not readable when the whole document is fitted into the display area (i.e., thumbnail view), then a zoom operation is performed that fits the text to the display area. If even zooming in to the whole text is not sufficient for readability, then the view zooms into a part of the text, and a pan operation is carried out to show the user the remainder of the text. To compute time attributes for text elements, the document image is first downsampled to fit the display area. Then Z(e) is determined as the zoom factor that is necessary to bring the height of the smallest font in the text to the minimum readable height. Finally, the time attribute for a visual element e ∈ Ev is computed as follows:

t(e) = SSC × ne,        if Z(e) = 1
t(e) = SSC × ne + Zc,   if Z(e) > 1        (1)

where ne is the number of characters in e, Zc is the zoom time (fixed to 1 second in our implementation), and SSC (Speech Synthesis Constant) is the average time required to play back one synthesized character. SSC is computed as follows: (1) synthesize a text document with a known number of characters, K; (2) measure the total time, T, that it takes for the synthesized speech to be spoken; and (3) compute SSC = T/K. The SSC may change depending on the language, the synthesizer that is used, and the synthesizer options (female vs. male voice, accent type, talk speed, etc.). With the AT&T speech SDK that we used to prototype Multimedia Thumbnails, SSC is 75 ms for a female voice. The computation of t(e) remains the same even if a text element cannot be shown with one zoom operation and both zoom and pan operations are required. In such cases, the presentation time is used by first zooming into a portion of the text, for example the first m characters, and keeping the focus on that portion for SSC × m seconds. The remainder of the time, i.e., SSC × (ne − m), is spent on the pan operation.
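The zoom factor and the zoom/pan time split described above can be sketched as follows. This is a minimal illustration under stated assumptions: font heights and the 7-point threshold are expressed in the same units after downsampling, Z(e) is clamped to 1 when the text is already readable, and all function names are hypothetical.

```python
MIN_READABLE_HEIGHT = 7.0  # minimum readable text height from [2], in points
SSC_SECONDS = 0.075        # speech synthesis constant reported in the text


def fit_scale(page_w, page_h, display_w, display_h):
    """Scale factor that fits the whole page into the display area."""
    return min(display_w / page_w, display_h / page_h)


def zoom_factor(smallest_font_height, scale, min_height=MIN_READABLE_HEIGHT):
    """Z(e): zoom needed to bring the smallest font up to the readable height.

    Clamping to 1.0 (no zoom when the text is already readable) is an
    assumption of this sketch.
    """
    displayed_height = smallest_font_height * scale
    return max(1.0, min_height / displayed_height)


def zoom_pan_split(n_chars, m_visible, ssc=SSC_SECONDS):
    """Split the time SSC * ne into focus time and pan time.

    The view holds on the first m characters for SSC * m seconds and pans
    over the remaining text for SSC * (ne - m) seconds.
    """
    m = min(m_visible, n_chars)
    return ssc * m, ssc * (n_chars - m)
```

For a 612 × 792 page shown on a 306 × 396 display, the fit scale is 0.5, so 10-point text is displayed at 5 points and needs a zoom factor of 1.4 to become readable.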

The time attribute for an audible document element, e ∈ Ea, is computed in a similar fashion: t(e) = SSC × ne, where SSC is the speech synthesis constant and ne is the number of characters in the document element.
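A minimal sketch of the SSC calibration and the audible time attribute follows. The function names are hypothetical, and counting every character of the string (including spaces and punctuation) toward ne is an assumption; the text does not say which characters are counted.

```python
def speech_synthesis_constant(total_seconds, n_chars):
    """SSC = T / K: average playback time per synthesized character.

    `total_seconds` (T) is the measured duration of the synthesized speech
    for a calibration text with `n_chars` (K) characters.
    """
    if n_chars <= 0:
        raise ValueError("calibration text must contain at least one character")
    return total_seconds / n_chars


def audible_time_attribute(text, ssc):
    """t(e) = SSC * ne for an audible element.

    Assumption: ne is the length of the string, spaces included.
    """
    return ssc * len(text)
```

With a calibration text of 1000 characters that takes 75 seconds to speak, SSC is 0.075 seconds, which matches the 75 ms figure reported above; a 40-character caption then takes about 3 seconds.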

Audiovisual elements are composed of an audio component, A(e), and a visual component, V(e). The time attribute for an audiovisual element is computed as the maximum of the time attributes of its visual and audible components:

t(e) = max(t(V(e)),t(A(e))) .

For example, t(e) of a figure element with a caption is computed as the maximum of the time required to comprehend the figure and the duration of the synthesized figure caption.
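The rule for the three element groups can be sketched with a small data structure. The `Element` class and its field names are hypothetical; a missing component is represented by `None`, so a purely visual or purely audible element simply returns the time of its only component.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """A document element with optional visual and audible time attributes."""
    visual_time: Optional[float] = None   # t(V(e)) in seconds, if a visual part exists
    audible_time: Optional[float] = None  # t(A(e)) in seconds, if an audible part exists


def time_attribute(element):
    """t(e) = max(t(V(e)), t(A(e))), over whichever components are present."""
    times = [t for t in (element.visual_time, element.audible_time) if t is not None]
    if not times:
        raise ValueError("element has neither a visual nor an audible component")
    return max(times)
```

A figure that needs 2.5 seconds to comprehend and has a caption that takes 4 seconds to speak therefore gets t(e) = 4 seconds, so the figure stays on screen until the caption has finished.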

3.2. Information Attributes

An information attribute indicates how much information a particular document part contains for the user. This depends strongly on the user's viewing or browsing style and on the task at hand. For example, information in the abstract could be very important if the task is to understand the document, but it may be less important if the task is merely to determine whether the document has been seen before.

Figure 2. Percentage of users who viewed different parts of the documents.

To understand how important the information in different parts of a document is, we performed an observational user study. Nine users participated in the study, and each was shown three documents (giving us 27 separate data points) with the task of understanding the contents of a document in a limited time. Users browsed the high-resolution PDF documents on a small (PDA-size) display. The users' navigation behaviors were recorded and analyzed to determine which document parts were viewed during browsing. Figure 2 shows the percentage of users who viewed various document parts. This initial experiment gives us an idea of how much users value different document elements. For example, 100% of the users read the title, whereas very few users looked at the references, the publication name, and the date. We use these results to assign information attributes to text elements according to the fraction of users who viewed them. For example, the title has an information value of 1.0, whereas references are given the value 0.13.
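The mapping from viewing behavior to information attributes can be sketched as a simple fraction. The session format and the function name are assumptions, and the data in the usage note are invented purely for illustration; they are not the study's measurements.

```python
def information_attributes(sessions, parts):
    """Return, for each part, the fraction of sessions in which it was viewed.

    `sessions` is a list of sets; each set holds the names of the document
    parts one user viewed in one document (an assumed log format).
    """
    if not sessions:
        raise ValueError("at least one browsing session is required")
    n_sessions = len(sessions)
    return {
        part: sum(1 for viewed in sessions if part in viewed) / n_sessions
        for part in parts
    }
```

With four made-up sessions in which every user viewed the title, two viewed the abstract, and one viewed the references, the function yields 1.0, 0.5, and 0.25 respectively, mirroring how a part viewed by all users receives the value 1.0.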