1. Introduction
Mobile devices are becoming ubiquitous and are increasingly used for tasks that were traditionally performed on desktop and laptop computers. Even though the memory and computational capabilities of these devices will continue to improve, their small displays and limited input capabilities are likely to remain the major bottlenecks for many mobile applications. These limitations have led to research on user interfaces and applications that generally addresses two issues: (1) overcoming non-recognizability of information on small screens [1]-[7] and (2) assisting navigation when input capability is limited [8].
Prior research has addressed these issues for browsing web pages [1]-[3], photos [4][5], and videos [6] on small displays; we review this work in Section 2. Besides web pages, photos, and video clips, formatted documents are another type of "information carrier." Current solutions for viewing documents on small displays consist mainly of running a document viewer such as MSWord or Acrobat Reader. However, zooming and scrolling in such a document viewer are difficult to perform on small mobile devices.
Most techniques used to adapt content to mobile devices reformat the content by changing the document layout [1]-[3][7]. The layout of a document, defined by page breaks, line breaks, columns, margins, etc., communicates semantic information about the document, such as hierarchical structure and reading order, much as hyperlinks in web pages encode structure. Users may find it difficult to recognize a document or navigate through it if its layout has been changed [9].
Automatic browsing through content without reformatting has been applied to photos [4][5]. In contrast to photographs of natural scenes or people, document images typically contain a large amount of high-frequency data such as text. Salient points in document images that are useful for navigation may include the title, authors, abstract, figures, and references section [10], in contrast to people's faces and foreground objects in typical photographs. Moreover, text in documents is meant to be read in a predetermined reading order, and image and text units may be linked, e.g., a figure and its caption. These properties are document-specific and absent from generic photos.
To address the problem of viewing documents on small displays, we propose a new document visualization called Multimedia Thumbnail (MMNail). An MMNail can be seen as a short video clip that gives a guided tour through a document. In an MMNail visualization, both the visual and audio channels of a mobile device are used to communicate document information. The visual channel presents dense spatial information by zooming into and panning over the most important document elements, such as the title and figures, while the audio channel communicates speech-synthesizable document information, so-called audible information, such as keywords and figure captions. In this way, the recognizability problem is addressed by keeping text readable and figures comprehensible after zooming and panning, and the navigation problem is addressed by minimizing the navigational input required from the user.

Figure 1. Multimedia Thumbnails present documents on small displays by automatically zooming into important document parts, such as (a) the title and (b) a figure, as well as transcoding some document text, such as figure captions, into an audio signal via synthesized speech.
The MMNail generation algorithm consists of three steps: extracting semantic visual and audible document elements; optimizing the selection of document elements with respect to given time, display, and application constraints; and synthesizing the selected document elements into a playable MMNail. In earlier work we introduced the initial MMNail algorithm [11], performed an exploratory study of users' document browsing behavior on a mobile device, and collected initial feedback on Multimedia Thumbnail examples [12].
In this paper we improve the MMNail generation algorithm as well as the playback interface based on that initial user feedback. Our previous work concentrated on scanned documents stored in a raster-scan representation. Here we propose a way of generating MMNails for input formats such as HTML and MSWord that utilizes the semantic information present in these representations, as described in Section 3.1. In Section 3.2 we explain how to optimally select document information for a given visual and audio channel. Our earlier user study [12] made it clear that an MMNail optimized for a single time duration could not satisfy all user needs. We address this problem by introducing MMNails that are scalable with respect to time duration, as described in Section 3.3. A flexible user interface for MMNail playback is introduced in Section 4, and examples of MMNails are given in Section 5.
To evaluate our system design and implementation, we performed a new user study in which we compared viewing of regular formatted documents in a PDF viewer with viewing of Multimedia Thumbnails. Our results indicate a statistically significant improvement (a p-value of 0.05 in a paired-sample t-test) in document comprehension with MMNail viewing over PDF viewing, as reported in detail in Section 6. A summary of our work and a discussion of applications and future directions are given in Section 7.
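To make the selection step of the pipeline more concrete, the following is a minimal sketch of choosing document elements under a time constraint. It is not the optimization of Section 3.2; it assumes, purely for illustration, that each element has a hypothetical integer presentation duration and importance score, that the visual and audio channels play in parallel and can be optimized independently, and that selection reduces to a 0/1 knapsack problem per channel. The names Element, select_elements, and select_for_channels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A document element; all fields are illustrative assumptions."""
    name: str          # e.g. "title" or "figure 1 caption"
    channel: str       # "visual" or "audio"
    duration: int      # seconds needed to present the element
    importance: float  # hypothetical relevance score


def select_elements(elements, time_budget):
    """0/1 knapsack: maximize total importance within time_budget seconds."""
    # best[t] holds (score, chosen elements) for at most t seconds.
    best = [(0.0, [])] * (time_budget + 1)
    for el in elements:
        # Iterate downward so each element is used at most once.
        for t in range(time_budget, el.duration - 1, -1):
            score, chosen = best[t - el.duration]
            candidate = score + el.importance
            if candidate > best[t][0]:
                best[t] = (candidate, chosen + [el])
    return best[time_budget][1]


def select_for_channels(elements, time_budget):
    """Select per channel, assuming the channels play concurrently."""
    result = {}
    for channel in ("visual", "audio"):
        candidates = [el for el in elements if el.channel == channel]
        picked = select_elements(candidates, time_budget)
        result[channel] = [el.name for el in picked]
    return result


# Hypothetical elements and scores, not taken from any real document.
example = [
    Element("title", "visual", 2, 10.0),
    Element("figure 1", "visual", 4, 8.0),
    Element("abstract", "visual", 5, 9.0),
    Element("keywords", "audio", 3, 5.0),
    Element("figure 1 caption", "audio", 4, 6.0),
]
selection = select_for_channels(example, time_budget=6)
print(selection)
```

With a six-second budget, this sketch keeps the title and the figure on the visual channel and the figure caption on the audio channel. Treating the channels independently ignores links between elements, such as a figure and its caption, which a fuller formulation would need to model.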






