3. Meta Data Creation

Efficient access to a recorded meeting is essential for users searching for specific information. Frequently, meetings are boring, unstructured affairs that are not amenable to a hit-or-miss search strategy. After fast-forwarding a few times through a meeting video while looking for something, most people will give up unless what they are seeking is important enough to justify the time required.

Our goal is to augment the audio and video information with meta data that enables a goal-directed search strategy in which users can easily navigate to the specific point in the recording that provides the information they're looking for. In addition, a user interface presents the meta data on a time-line and provides an easy means for browsing it and selectively replaying the audio or video.

The use of meta data to help navigate video has been investigated by others. The Informedia project automatically applied a variety of analyses, including speech recognition and natural language processing, to TV footage [6]. The Broadcast News Navigator [7] derived information from the audio, video, and closed caption streams and performed linguistic analysis that improved the accessibility of the data. Such multi-track information was also used for video editing [8] and has been applied to browsing recorded meetings [9].

3.1. Real-Time Sound Localization

To avoid handling and saving multiple channels of audio data, sound localization is performed in real time. The audio signal is processed in segments of 25 msec. Since we are interested only in human speech, segments that do not contain speech in at least one of the channels are ignored.

Following speech detection, 360-degree sound localization is calculated as follows. For each pair of microphones on the diagonal, an angle between 0 and 180 degrees is calculated based on phase difference. This angle defines a cone of confusion centered at the midpoint of the diagonal. In theory, the intersection of two cones computed from both diagonal pairs defines the azimuth and elevation of the sound source. Unfortunately, the angle computed by each pair is not perfect. Moreover, phase difference measured on a finite sampling rate over a small baseline is discrete, and the angular resolution over all directions is non-uniform. Higher resolution is obtained near the center, and lower towards both ends. Therefore, we need to compute the intersection of two cones of unequal thickness, if they intersect at all. Furthermore, we want to take into consideration the confidence associated with each angle estimate.
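The non-uniform resolution can be illustrated with a short sketch. It is not the recorder's implementation: it assumes a far-field source, a speed of sound of 343 m/s, and a delay between the two microphones that is quantized to whole samples. The function name, the 0.2 m baseline, and the 44.1 kHz sampling rate are hypothetical values chosen only for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def reachable_angles(baseline_m, sample_rate_hz):
    """Angles in degrees (0 to 180) that one microphone pair can report
    when the inter-microphone delay is quantized to whole samples."""
    max_lag = int(baseline_m * sample_rate_hz / SPEED_OF_SOUND)
    angles = []
    for lag in range(-max_lag, max_lag + 1):
        # Far-field model: cos(angle) = c * delay / baseline.
        cos_angle = lag * SPEED_OF_SOUND / (sample_rate_hz * baseline_m)
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
        angles.append(math.degrees(math.acos(cos_angle)))
    return sorted(angles)

# Hypothetical geometry: 0.2 m baseline sampled at 44.1 kHz.
angles = reachable_angles(0.2, 44100)
steps = [b - a for a, b in zip(angles, angles[1:])]
center_step = steps[len(steps) // 2]  # spacing next to 90 degrees
end_step = steps[0]                   # spacing at the low end of the range
```

Under these assumptions the spacing between neighboring angles is several times larger at the ends of the range than near 90 degrees, which is why the two cones have unequal thickness.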

Figure 5. Illustration of the steps in generating view selection metadata. Audio data are divided into blocks of 30 seconds in Step 1. Audio segments are grouped in Step 2, followed by clustering in Step 3. Afterward, groups are assigned to clusters.

To resolve these issues, we use an accumulator over the parameter space of azimuth by elevation. Azimuth ranges from 0 to 360 degrees and elevation ranges from 0 to 90 degrees. For each (azimuth, elevation) entry covered by a cone, the entry is incremented by the confidence associated with that cone. The highest scoring entry in the accumulator corresponds to the best parameter estimate. All entries in the accumulator are decayed by a factor at the end of each segment. However, in trying to estimate both azimuth and elevation, we found the solution unstable and sensitive to the quantization chosen. Furthermore, it does not account for the fact that sound sources close to the middle are detected more accurately than those close to either end. Therefore, the scores at all elevations are summed up for each azimuth, and the best azimuth is returned if its score exceeds a threshold. For each segment where speech is found, a triplet of time-stamp, angle and score, denoted as (t, θi, wi), is written to a file. We observed that this process runs in real time, consuming approximately 25% to 40% of the CPU on a 933 MHz PC.
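The accumulator described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the 5-degree bin sizes, the decay factor of 0.8, the class and method names, and the assumption that each cone arrives as a list of (azimuth, elevation) pairs it covers are all hypothetical.

```python
class LocalizationAccumulator:
    """Confidence-weighted votes over (azimuth, elevation) cells."""

    def __init__(self, az_step=5, el_step=5, decay=0.8):
        # Bin sizes and decay factor are illustrative assumptions.
        self.az_step = az_step
        self.el_step = el_step
        self.decay = decay
        self.n_az = 360 // az_step
        self.n_el = 90 // el_step + 1
        self.cells = [[0.0] * self.n_el for _ in range(self.n_az)]

    def vote(self, covered, confidence):
        """Add one cone's confidence to every cell it covers.
        covered: iterable of (azimuth_deg, elevation_deg) pairs."""
        touched = set()
        for az, el in covered:
            touched.add((int(az % 360) // self.az_step,
                         int(el) // self.el_step))
        for i, j in touched:  # each cell is incremented once per cone
            self.cells[i][j] += confidence

    def best_azimuth(self, threshold):
        """Sum scores over elevation; return (azimuth_deg, score) for the
        best azimuth bin, or None if its score does not exceed threshold."""
        scores = [sum(row) for row in self.cells]
        best = max(range(self.n_az), key=lambda i: scores[i])
        if scores[best] <= threshold:
            return None
        return best * self.az_step, scores[best]  # lower edge of the bin

    def end_segment(self):
        """Decay all entries at the end of a segment."""
        for row in self.cells:
            for j in range(len(row)):
                row[j] *= self.decay
```

In use, both diagonal pairs would call `vote` once per speech segment, followed by `best_azimuth` and then `end_segment`.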

3.2. Automatic View Selection

At the end of a recording, the result of sound localization is further processed to produce a sequence of viewing instructions for the virtual camera to generate a normal perspective view during playback (see Figure 4). The objective of this view selection process is to create natural looking shots like those produced by a cameraperson. The steps for generating this sequence of instructions are described below.

Raw sound localization results are filtered, grouped, clustered, and smoothed to form viewing instructions. Since the initial analysis is performed on short audio segments (25 msec), the results of speech detection and sound localization are sporadic. To find real speech utterances, we use only data where speech is detected in at least 5 consecutive segments. For each group of contiguous segments, a direction is calculated as the average of the segment azimuths weighted by their scores. The total weight of the segments is assigned as the weight for the group. Consequently, groups containing more segments and more reliable sound direction estimates have larger weights.
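The filtering and grouping step can be sketched as below. The sketch assumes that each triplet carries an integer segment index in place of the time-stamp, and it averages directions on the circle so that azimuths near 0 and 360 degrees combine sensibly; the text does not say how wrap-around is handled at this step, so that choice, like the function names, is an assumption.

```python
import math

def circular_weighted_mean(angles_deg, weights):
    """Weighted mean direction in degrees, computed on the unit circle."""
    x = sum(w * math.cos(math.radians(a)) for a, w in zip(angles_deg, weights))
    y = sum(w * math.sin(math.radians(a)) for a, w in zip(angles_deg, weights))
    return math.degrees(math.atan2(y, x)) % 360.0

def group_segments(triplets, min_run=5):
    """triplets: (segment_index, azimuth_deg, score), sorted by index.
    Returns (first_index, last_index, direction_deg, total_weight) for
    every run of at least min_run consecutive speech segments."""
    groups, run = [], []
    for trip in list(triplets) + [None]:  # None closes the final run
        if trip is not None and (not run or trip[0] == run[-1][0] + 1):
            run.append(trip)
            continue
        if len(run) >= min_run:
            angles = [a for _, a, _ in run]
            weights = [w for _, _, w in run]
            groups.append((run[0][0], run[-1][0],
                           circular_weighted_mean(angles, weights),
                           sum(weights)))
        run = [trip] if trip is not None else []
    return groups
```

Runs shorter than `min_run` are dropped, which corresponds to discarding sporadic detections.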

In the next step, the directions of these groups are clustered using the ISODATA algorithm to find the general speaker locations. We use a modified version in which cluster means are calculated as weighted averages to take group weights into consideration, and the distance measure is an angular distance that wraps around at 360 degrees. New clusters are formed for points more than a threshold away from the nearest cluster center. Clusters are merged if their centers are closer than a threshold, currently 30 degrees.
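The following sketch shows a simplified clustering loop in the spirit of this description. It is not a full ISODATA implementation: it keeps only the wrap-around distance, the weighted means, the creation of new clusters, and the merging of close clusters. The 30-degree merge threshold comes from the text; the new-cluster threshold, the iteration limit, and all names are assumptions.

```python
import math

def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def weighted_circular_mean(points):
    """points: list of (angle_deg, weight)."""
    x = sum(w * math.cos(math.radians(a)) for a, w in points)
    y = sum(w * math.sin(math.radians(a)) for a, w in points)
    return math.degrees(math.atan2(y, x)) % 360.0

def cluster_directions(points, new_threshold=30.0, merge_threshold=30.0,
                       iterations=10):
    """Return cluster centers (degrees) for (direction_deg, weight) points."""
    centers = []
    for _ in range(iterations):
        members = [[] for _ in centers]
        for angle, weight in points:
            k = None
            if centers:
                k = min(range(len(centers)),
                        key=lambda i: angular_distance(angle, centers[i]))
            if k is None or angular_distance(angle, centers[k]) > new_threshold:
                centers.append(angle)  # start a new cluster at this point
                members.append([(angle, weight)])
            else:
                members[k].append((angle, weight))
        clusters = [(weighted_circular_mean(m), sum(w for _, w in m))
                    for m in members if m]
        merged = True
        while merged:  # repeatedly merge any pair of close centers
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if angular_distance(clusters[i][0], clusters[j][0]) < merge_threshold:
                        pair = [clusters[i], clusters[j]]
                        clusters[i] = (weighted_circular_mean(pair),
                                       pair[0][1] + pair[1][1])
                        del clusters[j]
                        merged = True
                        break
                if merged:
                    break
        new_centers = [c for c, _ in clusters]
        if new_centers == centers:  # assignments no longer change the centers
            break
        centers = new_centers
    return centers
```

For example, groups at 350, 355, 5 and 10 degrees fall into one cluster centered near 359 degrees rather than being split at the 0/360 boundary.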

After clustering, every group is assigned to a cluster. This is roughly equivalent to unsupervised speaker clustering based on their (angular) location. To allow for speaker movement, this operation is performed on audio data of a chosen block size. Currently, we use a block size of 30 seconds in our system. It should be pointed out that the real goal of clustering is to identify distinct shot directions. Therefore, it is acceptable to use a single shot centered between two speakers if they are sitting nearby rather than centering exactly on the speaker.

Having obtained clusters of shot directions, we perform the final step to generate the viewing instructions. First, neighboring groups that belong to the same cluster are merged to cover any silent period between them. Neighboring groups that belong to different clusters are extended to meet halfway, each covering half of the silent period. This allows the virtual camera to focus on the speaker before the actual speech starts. The result of this process is a sequence of view angles corresponding to shots. To avoid rapid switching of camera angles, shots shorter than 2 seconds are removed and treated as silence. The same merging and extension procedure is then used to cover that period.
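A minimal sketch of this final step is given below. It assumes that groups arrive as time-sorted (start, end, cluster) tuples with times in seconds, and that the removal of short shots is applied once; the function names are hypothetical.

```python
def cover_silence(groups):
    """Merge same-cluster neighbors and let different-cluster neighbors
    meet halfway across the silent period between them."""
    shots = []
    for start, end, cluster in groups:
        if shots and shots[-1][2] == cluster:
            # Same cluster: extend the previous shot across the gap.
            shots[-1] = (shots[-1][0], end, cluster)
        elif shots:
            # Different cluster: split the silent period in half.
            mid = (shots[-1][1] + start) / 2.0
            shots[-1] = (shots[-1][0], mid, shots[-1][2])
            shots.append((mid, end, cluster))
        else:
            shots.append((start, end, cluster))
    return shots

def build_shots(groups, min_duration=2.0):
    """groups: time-sorted list of (start_sec, end_sec, cluster_id)."""
    shots = cover_silence(groups)
    # Shots that are too short are dropped and treated as silence,
    # then the remaining shots are extended again to cover the gaps.
    kept = [s for s in shots if s[1] - s[0] >= min_duration]
    return cover_silence(kept)
```

For instance, a brief interjection from a second direction that yields a shot under 2 seconds is absorbed into the surrounding shot of the main speaker.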

In contrast to the work of [1], where regions containing the largest motion are selected, our system focuses on the speaker. The information we obtain at the end of clustering can be improved by speaker identification and displayed as speaker segments, as in [2]. Compared to the virtual director of [4], speaker directions are detected automatically instead of being annotated manually.

3.3. Meeting Location Recognition

In systems based on an instrumented conference room, most meetings take place in one location. For meetings recorded with a portable recorder, by contrast, the ability to identify the location can provide a very useful retrieval cue for searching. Furthermore, we do not want to rely on the presence of an outside data source, like a GPS signal.

The first problem that needs to be addressed is background extraction. Since the recorder is manually operated, it is unreasonable to assume that a clean shot of the background can be obtained with no person in the room. Secondly, certain fixtures in the room such as electronic whiteboards and window curtains can be at different locations in different recordings. Moreover, the appearance of different rooms in the same company can be quite similar. In addition, the panoramic view of the room is dependent on the position of the recorder. The size of objects can change dramatically depending on the distance to the camera.

The approach includes background extraction and image matching. We use adaptive background modeling to extract the background [10]. Our algorithm is based on an extension of the method of [11]. A Gaussian mixture approximates the distribution of values at every pixel over time. For each Gaussian constituent, its likelihood of being background is estimated based on its variance, frequency of occurrence, color and neighborhood constraints. Therefore, an image of the background can be constructed based on the most likely background Gaussian at every pixel. Since this background estimate changes over time, for example due to the movement of objects in the room, we extract a new image every time a significant change in the background model is detected. These images are dewarped into a panoramic cylindrical projection as shown in Figure 6.
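As a rough illustration of per-pixel mixture modeling, the sketch below tracks a single grayscale pixel and picks the background component by its weight (frequency of occurrence) and variance only. It follows the general style of adaptive mixture models and is not the extension described here: the color and neighborhood constraints are omitted, and the learning rate, component limit, initial variance, matching rule, and all names are assumptions.

```python
import math

class PixelMixture:
    """Adaptive Gaussian mixture for one grayscale pixel (illustrative)."""

    def __init__(self, max_components=3, learning_rate=0.05,
                 init_var=225.0, match_sigmas=2.5):
        self.max_components = max_components
        self.lr = learning_rate
        self.init_var = init_var
        self.match_sigmas = match_sigmas
        self.components = []  # each entry is [weight, mean, variance]

    def update(self, value):
        # Find the first component within match_sigmas standard deviations.
        matched = None
        for comp in self.components:
            if abs(value - comp[1]) <= self.match_sigmas * math.sqrt(comp[2]):
                matched = comp
                break
        for comp in self.components:
            comp[0] *= 1.0 - self.lr
        if matched is not None:
            matched[0] += self.lr
            diff = value - matched[1]
            matched[1] += self.lr * diff
            matched[2] = max(1.0, matched[2] + self.lr * (diff * diff - matched[2]))
        else:
            if len(self.components) >= self.max_components:
                weakest = min(self.components, key=lambda c: c[0])
                self.components.remove(weakest)
            self.components.append([self.lr, float(value), self.init_var])
        total = sum(c[0] for c in self.components)
        for comp in self.components:
            comp[0] /= total  # keep the weights summing to one

    def background_value(self):
        """Mean of the component that is both frequent and stable."""
        best = max(self.components, key=lambda c: c[0] / math.sqrt(c[2]))
        return best[1]
```

Running one such model per pixel and reading `background_value` everywhere would give a background image in which briefly visible foreground objects, such as a person walking past, are suppressed.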

Figure 6. From top to bottom, examples of a panoramic video frame, the extracted foreground and the background image.

To identify the location, the background images are matched against room templates in the database. Since the number of placements for the recorder in a particular room is usually limited, the templates are organized by placement and stored separately. In our case, one template is obtained from each end of a table in a conference room. We match the templates with the backgrounds of the meeting recordings by comparing their color histograms. The histograms are formed in the HSV color space because distance values in this space approximate human perception. The color space is represented with 256 bins: Hue is quantized into 16 bins, and Saturation and Value are quantized into 4 bins each (16 × 4 × 4 = 256).
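The 256-bin quantization can be sketched with the standard library. The bin ordering (Hue as the most significant index), the normalization of the histogram, and the function names are assumptions made for this example.

```python
import colorsys

def hsv_bin(r, g, b):
    """Map an 8-bit RGB pixel to one of 256 bins: 16 Hue x 4 Saturation x 4 Value."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    h_bin = min(int(h * 16), 15)  # clamp so that a value of 1.0 stays in range
    s_bin = min(int(s * 4), 3)
    v_bin = min(int(v * 4), 3)
    return (h_bin * 4 + s_bin) * 4 + v_bin

def color_histogram(pixels):
    """Normalized 256-bin histogram of a list of (r, g, b) pixels."""
    hist = [0.0] * 256
    for r, g, b in pixels:
        hist[hsv_bin(r, g, b)] += 1.0
    n = len(pixels)
    return [count / n for count in hist] if n else hist
```

Normalizing by the pixel count makes histograms from images of different sizes comparable.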

Several background images are extracted for each meeting and an intersection histogram is computed using the histograms of these images. The intersection histogram is compared using Euclidean distance with each template in the database to find the closest matching meeting room. Employing an intersection histogram allows us to further eliminate the non-stationary objects in the meeting room and smooth out any background extraction errors. The use of multiple templates for each room provides a robust method for location identification. In our experiments, we successfully identified the 5 meeting rooms that we have in our research facility. We plan to include more results in the final version of the paper. We are also investigating improvements to the algorithm by using the size and the layout of the meeting room to address the issue of distinguishing rooms with similar colors.
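The matching step can be sketched as follows. The sketch assumes that the intersection histogram is the bin-wise minimum over the background histograms, which is one plausible reading of the description, and that templates are stored in a dictionary keyed by (room, placement); the names and the key layout are hypothetical.

```python
import math

def intersection_histogram(histograms):
    """Bin-wise minimum over several histograms of equal length."""
    return [min(values) for values in zip(*histograms)]

def euclidean_distance(h1, h2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def identify_room(background_histograms, templates):
    """templates: dict mapping (room, placement) -> histogram.
    Returns (room, distance) for the closest template."""
    query = intersection_histogram(background_histograms)
    best_key = min(templates,
                   key=lambda k: euclidean_distance(query, templates[k]))
    return best_key[0], euclidean_distance(query, templates[best_key])
```

Taking the minimum per bin keeps only the color mass present in every background image, so objects that appear in some images but not in others contribute little to the match.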