A Paper-Based Interface for Video Browsing and Retrieval
An example of using Video Paper to analyze a news broadcast, such as the one shown in Figure 1, is looking for a story about Idaho. A quick glance over the page lets the user spot the map in the lower left corner and focus attention on that section, eliminating the need to look elsewhere in the document.
2. System Architecture
The Video Paper system architecture is shown in Figure 2. Television programs are recorded from satellite transmissions, broadcast sources, or cable. The recording process preserves the closed caption content. The server converts the video data to MPEG2 format and generates the video paper representation.
Video paper contains key frames and a formatted version of the closed caption. For every key frame, a bar code is generated that identifies the video recording and the position of the key frame within it, so that scanning the bar code starts replay at the corresponding position.
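The paper does not specify how the recording and position are packed into a bar code. As a minimal sketch, assume a hypothetical fixed-width text payload: the letter `V`, a 6-digit recording id, and an 8-digit offset in milliseconds. Any symbology that carries ASCII text could render such a string. `KeyFrameRef`, `encode_payload`, and `decode_payload` are illustrative names, not part of the described system.

```python
from dataclasses import dataclass

# Hypothetical payload layout (not from the paper):
# "V" + 6-digit recording id + 8-digit offset in milliseconds.
PREFIX = "V"
ID_DIGITS = 6
OFFSET_DIGITS = 8
DIGITS = "0123456789"


@dataclass(frozen=True)
class KeyFrameRef:
    recording_id: int
    offset_ms: int  # position of the key frame within the recording


def encode_payload(ref: KeyFrameRef) -> str:
    """Build the text to be rendered as a bar code next to a key frame."""
    if not 0 <= ref.recording_id < 10 ** ID_DIGITS:
        raise ValueError("recording id out of range")
    if not 0 <= ref.offset_ms < 10 ** OFFSET_DIGITS:
        raise ValueError("offset out of range")
    return f"{PREFIX}{ref.recording_id:0{ID_DIGITS}d}{ref.offset_ms:0{OFFSET_DIGITS}d}"


def decode_payload(text: str) -> KeyFrameRef:
    """Recover the recording id and replay position from a scanned bar code."""
    body = text[len(PREFIX):]
    if (
        not text.startswith(PREFIX)
        or len(body) != ID_DIGITS + OFFSET_DIGITS
        or not all(c in DIGITS for c in body)
    ):
        raise ValueError(f"not a key frame bar code: {text!r}")
    return KeyFrameRef(int(body[:ID_DIGITS]), int(body[ID_DIGITS:]))
```

For example, `encode_payload(KeyFrameRef(42, 12345))` yields `"V00004200012345"`, and decoding that string returns the same reference. A fixed-width layout keeps the printed symbol short and makes malformed scans easy to reject.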
The remote control consists of a PDA (e.g., Compaq iPAQ) with a bar code reader and a wireless interface (e.g., 802.11b). Software on the PDA decodes scanned bar codes and sends commands to the server, which controls replay of the video on a television attached to the server's video rendering card. In addition to the bar codes attached to key frames, the video paper includes meta control bar codes that pause the replay, rewind, fast forward, or display the closed caption on the television.
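The PDA-side dispatch can be sketched as a mapping from decoded bar code text to a server command. This is a minimal sketch under stated assumptions: the meta control codes (`CPAUSE`, `CREWIND`, `CFFWD`, `CCAPTION`), the key frame payload layout, and the newline-delimited JSON over TCP transport are all hypothetical, since the paper names the actions but not their encoding or the wire protocol.

```python
import json
import socket

# Hypothetical meta control bar code values; the paper lists the actions
# (pause, rewind, fast forward, show closed caption) but not their encoding.
META_CODES = {
    "CPAUSE": "pause",
    "CREWIND": "rewind",
    "CFFWD": "fast_forward",
    "CCAPTION": "show_caption",
}


def command_for_scan(text: str) -> dict:
    """Translate a decoded bar code into a command for the video server."""
    if text in META_CODES:
        return {"action": META_CODES[text]}
    # Assumed key frame layout: "V" + 6-digit recording id + 8-digit offset (ms).
    if len(text) == 15 and text[0] == "V" and text[1:].isdecimal():
        return {"action": "play", "recording": int(text[1:7]), "offset_ms": int(text[7:])}
    raise ValueError(f"unrecognized bar code: {text!r}")


def encode_message(command: dict) -> bytes:
    """Serialize one command as a JSON line so the server can split on newlines."""
    return (json.dumps(command, sort_keys=True) + "\n").encode("utf-8")


def send_command(command: dict, host: str, port: int) -> None:
    """Send one command to the server over the wireless network (assumed TCP)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(encode_message(command))
```

Keeping the translation step (`command_for_scan`) separate from the transport (`send_command`) means the same decoding logic could drive a different replay target without change.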
We also developed a portable version of the Video Paper system in which the MPEG2 video data is written to a PCMCIA disk drive that is attached to an iPAQ along with a bar code reader. A modified version of the remote control software starts video replay on the iPAQ itself instead of on a separate television. This allows the Video Paper system to be used where no network connection is available, such as on a train.
3. Key Frame Selection Algorithm
The key frame selection algorithm is an important part of maximizing the usefulness of video paper documents. Following a clustering step, video frames are input to a routine that recognizes the presence of a fixed set of categories, such as human faces, buildings, animals, crowds, logos, and text. The routine can also detect specific instances of each category; for example, it can recognize a specific person's face, the Empire State Building, horses, or the Ricoh logo.
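The paper does not say which features drive the clustering step or how recognition results determine the final choice of key frames. The sketch below fills those gaps with explicit assumptions: frames are represented by color histograms, consecutive frames are grouped while they stay within an L1 distance `threshold` of the cluster's first frame, each cluster contributes the frame nearest its mean histogram, and candidates are ranked by how many categories a hypothetical `detect` callable (standing in for the category recognizer) reports.

```python
from typing import Callable, Sequence

Histogram = Sequence[float]


def l1_distance(a: Histogram, b: Histogram) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_frames(hists: Sequence[Histogram], threshold: float) -> list[list[int]]:
    """Group consecutive frames that stay close to the first frame of their cluster."""
    clusters: list[list[int]] = []
    for i, h in enumerate(hists):
        if clusters and l1_distance(hists[clusters[-1][0]], h) <= threshold:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


def representative(cluster: list[int], hists: Sequence[Histogram]) -> int:
    """Pick the frame whose histogram is closest to the cluster's mean histogram."""
    n = len(cluster)
    bins = len(hists[cluster[0]])
    mean = [sum(hists[i][k] for i in cluster) / n for k in range(bins)]
    return min(cluster, key=lambda i: l1_distance(hists[i], mean))


def select_key_frames(
    hists: Sequence[Histogram],
    threshold: float,
    detect: Callable[[int], set[str]],  # hypothetical: frame index -> category labels
    max_frames: int,
) -> list[int]:
    """Return up to max_frames key frame indices, in temporal order."""
    candidates = [representative(c, hists) for c in cluster_frames(hists, threshold)]
    # Prefer frames with more recognized categories; sorted() is stable,
    # so ties keep their temporal order.
    ranked = sorted(candidates, key=lambda i: -len(detect(i)))
    return sorted(ranked[:max_frames])
```

With `max_frames` set by the space available on a page, this favors visually distinct shots that contain recognizable content such as faces or logos, which is the kind of frame that supports the quick visual scan described earlier.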