3.2. MMNail Optimization

Initially we designed an ad-hoc algorithm that selected certain parts of documents based on the target MMNail duration. For example, if MMNail is 20 seconds we included title and page thumbnails and if it is 30 seconds, we included title, thumbnails and figures, and so on. However after generating MMNails for various different documents and receiving feedback from users it became clear that (1) documents vary in content greatly, e.g., some documents have readable titles that do not need zooming in, some documents have many figures with long captions, and some other documents don't have any figures, and (2) MMNail generation depends on user preferences and target applications [12]. Therefore, it is important to develop an MMNail generator which is easily expandable to include different document elements and configurable for personal preferences and target application.

In this section, we present a generalized optimization framework which selects document elements to form an MMNail based on time, application, user, and display size constraints. An overview of the optimizer is presented in Figure 3. In the optimizer, first, for each document element we compute a time attribute, i.e. time required to display the element, and an information attribute, i.e. information content of the element. The computation of attributes is described in detail in Sections 3.2.1 and 3.2.2. Display constraints of the viewing device are taken into account when computing time attributes. For example, it takes longer time to present a text paragraph in a readable form in a small viewing area than in a large viewing area due to increased amount of zooming and scrolling in the small viewing area. Similarly, target application and user preferences need to be taken into account when computing information attributes. For example, for some tasks the abstract or keyword elements can have higher importance than other elements such as a body text paragraph.

figure 3
Figure 3. Overview of the MMNail optimizer.

The optimization module selects elements from the set of document elements, such that the total information content of the MMNail is maximized, subject to a time constraint. Let the information content of an element e be denoted by I(e), the time required to present e by t(e), the set of available document elements by E, and the target MMNail duration by T. The optimization problem is

maximizeequation 1 pt. 1 
subject toequation 1 pt. 2(1)
  x(e)∈{0,1}, eE 

with optimization variables x(e), eE . For an element e, x(e)=1 means e is selected to be included in the MMNail, and x(e)=0 means e is not selected for inclusion.

The problem (1) is a '0-1 knapsack' problem. Therefore it is a hard combinatorial optimization problem [23]. In our implementation, a greedy approximation is employed to solve (1). First, the document elements eE are sorted according to the ratio I(e)/t(e) in descending order, i.e., equation where m is the number of elements in E . Then the document elements are included into the MMNail, starting from the first element, until the sum of the time attributes of the included elements is the same or very close to the target duration T. For practical purposes, this approximation of problem (1) should work quite well, as we expect the individual elements to have much shorter display time than the total MMNail duration.

3.2.1. Time Attributes

The time attribute of a document element can be interpreted as the approximate duration that is sufficient for a user to comprehend that element. Computation of time attributes depends on the type of the document element.

The time attribute of a text document element (e.g., title) is determined to be the duration of the visual effects necessary to show the text segment to the user at a readable resolution. In our experiments, text was determined to be at least 6 pixels high in order to be readable on an LCD (Apple Cinema) display. If text is not readable once the whole document is fitted into the display area (i.e. in a thumbnail view), a zoom operation is performed. If even zooming into the text such that the entire text region still fits on the display is not sufficient for readability, then zooming into a part of the text is performed. A pan operation is carried out in order to show the user the remainder of the text. In order to compute time attributes for text elements, first the document image is down-sampled to fit the display area. Then a zoom factor Z(e)is determined as the factor that is necessary to scale the height of the smallest font in the text to the minimum readable height. Finally the time attribute for a visual element e that contains text is computed as

equation 2

where ne is number of characters in e, ZC is zoom time (in our implementation this is fixed to be 1 second), and SSC (Speech Synthesis Constant) is the average time required to play back the synthesized audio character. SSC is computed as first synthesizing a text segment containing k characters, next measuring the total time it takes for the synthesized speech to be spoken out, τ, and then computing SSC=τ/k. The SSC constant may change depending on the language choice, synthesizer that is used, and the synthesizer options (female vs. male voice, accent type, talk speed, etc). Using the AT&T speech SDK [21], SSC is found to be equal to 75 ms when a female voice was used. The computation of t(e) remains the same even if an element cannot be shown with one zoom operation and both zoom and pan operations are required. In such cases, the complete presentation of the element consists of first zooming into a portion of the text, for example the first me out of a total of ne characters, and keeping the focus on the text for SSCxme seconds. Then the remainder of the time, i.e. SSCx(ne-me), is spent on the pan operation.

The time attribute for an audible text document element e, e.g., a keyword, is computed as

t(e) = SSC x ne  (2)

where SSC is the speech synthesis constant and ne is the number of characters in the document element.

For computing time attributes for figures without any captions, we make the assumption that complex figures take a longer time to comprehend. The complexity of a visual figure element e is measured by the figure entropy H(e) that is computed extracting bits from a low-bitrate layer of the JPEG2000 compressed image as described in [22]. The time attribute for a figure element is computed as t(e) = αH(e)/H, where H(e) is the figure entropy, H is the mean entropy, and α is a time constant. H is empirically determined by measuring the average entropy for a large collection of document figures. Time required to comprehend a photo might be different than that of a graph or a table, therefore, different α can be used for these different figure types. Moreover, high level content analysis, such as face detection, can be applied to assign time attributes to figures. We do not perform content analysis or distinguish different figure types in this paper and α is fixed to 4 seconds, which is the average time a user spends on a figure in our experiments.

An audiovisual element e is composed of an audio component, A(e), and a visual component, V(e). A time attribute for an audiovisual element is computed as the maximum of time attributes for its visual and audible components: t(e) = max(t(V(e)),t(A(e))) , where t(V(e)) is computed as in (2) and t(A(e)) as in (3). For example, t(e) of a figure element is computed as the maximum of time required to comprehend the figure and the duration of synthesized figure caption.

3.2.2. Information Attributes

An information attribute determines how much information a particular document element contains for the user. Clearly this very much depends on the user's viewing/browsing style, target application, and the task on hand. For example, information in the abstract could be very important if the task is to understand the document, but it may not be as important if the task is merely to determine if the document has been seen before.

In order to understand how important the information is in different parts of a document, we performed an exploratory and descriptive user study that we reported in [12]. The users were given two tasks: a document browsing task, where users were asked to browse documents in a small viewing area in order to determine if they have seen these documents before, and a document understanding task, where users were asked to browse documents for a limited time in order to answer questions about the document content. The users' navigation behaviors were recorded and analyzed in order to understand what document parts were viewed during browsing [12]. Table 2 shows the percentage of users who viewed various document parts when performing the two tasks. This experiment gave us an idea about how much users value different document elements. For example, 100% of the users read the title in the document understanding task, whereas very few users looked at the references, publication name and the date. We use these results to assign information attributes to text elements. For example in the document understanding task, the title is assigned the information value of 1.0 based on 100% viewing, and references are given the value 0.13 based on 13% viewing.

Table 2. Percentage of users who viewed different parts of the source documents for document search and understanding tasks.

Document PartViewing percentage for search taskViewing percentage for understanding task
Title83%100%
Abstract13%87%
Figures38%93%
First page thumbnail83%73%
References8%13%
Publication name4%7%
Publication date4%7%

3.2.3. Two-Stage Optimization

After the time and the information attributes are computed for the visual, audible, and audiovisual elements, the optimizer produces the best MMNail by selecting a combination of elements. The best thumbnail is one that maximizes the total information content of the thumbnail and can be displayed in the given time.

A document element e belongs to either the set of purely visual elements Ev, the set of purely audible elements Ea, or the set of synchronized audiovisual elements Eav. A Multimedia Thumbnail has two presentation channels, visual and audio. Purely visual elements and purely audible elements can be played simultaneously through the visual and audio channel, respectively. On the other hand, displaying a synchronized audiovisual element requires both channels. For a feasible MMNail, playing of an audiovisual element can not coincide with playing of a visual or an audible element at any time.

figure 4

Our method to produce an MMNail consists of two stages. In the first stage we select purely visual and synchronized audiovisual elements to fill the video channel. This leaves the audio channel partially filled, as illustrated in Figure 4. In the second stage we select purely audible elements to fill the partially filled audio channel.

The optimization problem of the first stage is of type (1), using a the set of visual and audiovisual document elements, EvEav , as the element set.

maximizeequation 4 pt. 1 
subject toequation 4 pt. 2(4)
  x(e)∈{0,1}, eEvEav 

We solve this problem approximately using the linear programming relaxation as shown for the problem (1). The selected purely visual and synchronized audiovisual elements are placed in time in the order they occur in the document. The first stage optimization almost fills the visual channel, and fills the audio channel partially.

In the second stage we need to select purely audio elements to fill the audio channel which has separate empty time intervals. Let the total time duration to be filled in the audio channel be . If the selected purely audible elements have a total display time of approximately , it may still be difficult to place the elements in the audio channel because the empty time duration is not contiguous. Therefore, a conservative approach is taken and instead of we use βTˆ , where β ∈ [0,1] . Further, we only consider the purely audio elements which do not have a long display time to be included in the MMNail. We do this by comparing the display time of an element with the average length of the empty intervals of the audio channel Tˆ / R , where R is the number of empty intervals, and consider the element set a = {e ∈ Ea | t(e)γTˆ / R} , where γ ∈[0, R] . The optimization problem of the second stage becomes

maximizeequation 5 pt. 1 
subject toequation 5 pt. 2(5)
  x(e)∈{0,1}, eE^a 

with optimization variables x(e), ea . This problem is of the type (1) and it is approximately solved by the greedy approximation as shown earlier.

In our implementation we set β = 1/ 2 and γ =1. For most documents, setting β = 1/ 2 results in selection of audio elements such that a Δn exists where each audio element can be placed in a continuous audio segment. If this is not the case, the second stage of the optimization is solved again for a smaller γ until all selected audio element's complete time span can be preserved.

Instead of the two stage optimization approach, we can formulate a single optimization problem to choose the visual, audiovisual, and the audible elements simultaneously. In this case, the optimization problem is

maximizeequation 6 pt. 1 
subject toequation 6 pt. 2(6)
 equation 6 pt. 3 
  x(e)∈{0,1}, eEaEvEav 

where x(e), eEaEvEav are the optimization variables.

The greedy approximation described to solve the relaxed problem (1) can not be applied to problem (6), but (6) can be relaxed to a linear program (LP), so that any generic LP solver can be applied. The advantage of solving the two stage optimization problem is that the calibration of the information attributes of the purely audible elements becomes independent of the information attributes of the visual elements. This means that multiplying the information content of all the audible elements by a positive number does not affect the solution of the two stage optimization problem; but will change the solution of the problem (6).

Readers should note that the two stage optimization gives selection of purely visual elements a priority over that of purely audible elements. If it is desired that audible elements have priority over visual elements (e.g. for consumption of document content while driving a car), the first stage of the optimization can be used to select audiovisual and purely audible elements, and the second stage is used to optimize selection of purely visual elements.