3.4. Meeting Description Document

Inarguably, having the time, location, main topic, and participant list of a multimedia meeting document, such as the one shown in Figure 8, helps users browse, search and access a large collection of meeting documents. In our system, users generate meeting content description documents semiautomatically. This is accomplished by extracting metadata automatically where possible and giving users the opportunity to either confirm the accuracy of the data or re-enter it.

Figure 8. Meeting description document.

The most basic meeting document metadata includes a description of the meeting, its time, date, and location. The date and time of the meeting are automatically obtained from the time stamp of the meeting document. To improve the start and end time accuracy, we also employ image and audio processing to detect exactly when the meeting started (e.g. when a speaker or motion is detected for the first time) and stopped. The location of the meeting is found automatically by using the method described in Section 3.3.

Automatic extraction of meeting title and description is more difficult. This can be accomplished by comparing the participant list and the time/location information with those of previous meetings and suggesting meeting descriptions, such as "regular group meeting." Moreover, if a presentation is detected in the meeting or there is a scheduled talk, this information can also be used to suggest meeting descriptions.

3.4.1. Localization of Meeting Participants

Locating meeting participants is a non-trivial problem, especially because a clean shot of the background is not available and participants are likely to move very little. We address this problem by using sound localization to find the approximate location of each meeting participant. The precise location of each face is then found by identifying the skin regions around this approximate location.

Skin pixels are detected in the normalized RG-space [5]. Small holes in skin-colored regions are removed by a morphological closing and then connected component analysis identifies face region candidates. In environments with complex backgrounds, many objects, such as wood, clothes, and walls, may have colors similar to skin. Therefore, further analysis of the skin-colored regions, using known facts about luminance variation [13] and geometric features of faces [14]-[16], is performed to further prune non-face regions. Some example face localization results are shown in Figure 9.
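The first two stages of this pipeline can be sketched as follows. This is a minimal illustration, not the system's implementation: the threshold ranges `R_RANGE` and `G_RANGE` are hypothetical values that would need tuning, the function names are ours, and the morphological closing and the later luminance and geometry tests are omitted for brevity.

```python
from collections import deque

# Hypothetical skin ranges in normalized rg-space; real values must be tuned.
R_RANGE = (0.36, 0.47)
G_RANGE = (0.28, 0.36)


def is_skin(red, green, blue):
    """Classify one RGB pixel by its normalized (r, g) chromaticity."""
    total = red + green + blue
    if total == 0:
        return False
    r = red / total
    g = green / total
    return R_RANGE[0] <= r <= R_RANGE[1] and G_RANGE[0] <= g <= G_RANGE[1]


def skin_mask(image):
    """image: 2-D list of (R, G, B) tuples. Returns a 2-D list of booleans."""
    return [[is_skin(*pixel) for pixel in row] for row in image]


def connected_components(mask):
    """Return the 4-connected regions of a mask as lists of (row, col)."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited skin pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            region = []
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions
```

Each returned region is a face candidate. Normalizing by R+G+B removes most of the brightness dependence, which is why a fixed chromaticity range can tolerate moderate lighting changes.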

Figure 9. Face localization in various meetings.

3.4.2. Best-shot Selection

One of our goals is to find representative shots of the meeting attendees that can be included in the meeting description document. It is possible to extract many shots of the participants from the video sequence. However, generally not all of these shots are presentable. It is desirable to obtain frames where the individual is not occluded and facing the camera. We find such frames by first extracting several still shots of the speaker, one when she/he first starts speaking, one from when she/he finishes speaking (for the first time) and one between these two times. These shots are then evaluated to pick the best shot of a participant.

The best shot is selected by evaluating the size of the face region and the percentage of skin pixels detected in the best-fitted ellipse around the face region. Larger faces with more skin pixels are considered better shots. An example is shown in Figure 10. Currently, the resolution of the captured video is not sufficient to accurately detect the eye and mouth regions. However, once higher resolution video is available, the selection of the best attendee shots can be improved by testing the visibility of the mouth and eyes. This can also be combined with the geometry of the face to detect whether or not a person is looking straight ahead.
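As an illustration of this selection rule, the sketch below ranks candidate shots by face size and skin fraction. The text does not state how the two criteria are combined, so multiplying them is an assumption made here; the `Shot` record and its field names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    frame_index: int      # frame the still was taken from
    face_area: int        # pixels in the detected face region
    skin_in_ellipse: int  # skin pixels inside the best-fitted ellipse
    ellipse_area: int     # all pixels inside the best-fitted ellipse


def shot_score(shot):
    """Assumed score: face area weighted by the skin fraction in the ellipse."""
    if shot.ellipse_area == 0:
        return 0.0
    skin_fraction = shot.skin_in_ellipse / shot.ellipse_area
    return shot.face_area * skin_fraction


def best_shot(shots):
    """Return the candidate with the highest score."""
    return max(shots, key=shot_score)
```

With three candidates taken at the start, middle, and end of a speaker's first turn, `best_shot` returns the one whose face is both large and mostly unoccluded. An occluded or turned face lowers the skin fraction and therefore the score.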

Figure 10. An example of best-shot selection.

3.4.3. Participant Identification

Recognizing people, even more specifically recognizing meeting participants using audiovisual data, has been an active research topic in recent years [13][18]. Nevertheless, face recognition and speaker identification may fail quite often because of poor lighting conditions, poor microphone quality, camera position, low video resolution, or even simply because a particular person looks different that day. The low resolution of our portable meeting recorder makes face recognition from video unreliable. Currently, after obtaining a set of participant shots, we present our best guess to the user and let the user confirm or change the people ID results.

3.5. Searching and Browsing with Visual and Audio Content

Searching and browsing audiovisual information is a time-consuming task. In our meeting recorder system, after each meeting is recorded, the audio file is automatically sent out for transcription. This step is expected to be removed when automatic speech recognition systems become more accurate. The transcription is then used to perform a text-based search. Even though searching the transcriptions is a powerful way to access the spoken meeting content, it may not be sufficient for finding visual and audio events, such as a person getting up to write something on the whiteboard or an emotional discussion. Therefore, we provide the user with a visual representation of the visual and audio activity in the meeting recordings.

3.5.1. Visual Activity Analysis

Motion content in video can be used to efficiently search and browse particular events in a video sequence as demonstrated in various applications such as sports events and news broadcasting [18]-[20]. In meeting sequences, most of the time there is minimal motion. High motion segments usually correspond to significant events such as a participant getting up to make a presentation, someone joining the meeting, etc. Therefore, providing a visualization of the activity in a meeting enables efficient meeting browsing.

Several motion activity descriptors exist in the literature. Some of these descriptors are based on the magnitudes and directions of the motion vectors in the MPEG bitstream [21]. However, these descriptors have a strong dependence on the bit rate and video encoder parameters. MPEG-7 defines a motion activity descriptor, which describes the amount of motion as well as the number and size of the active regions in a frame [22][23]. Visualization of this descriptor value is not intuitive. The visual activity measure we employ uses the local luminance changes in a video sequence. A large luminance difference between two consecutive frames is generally an indication of a significant content change, such as when somebody gets up to present, leaves the room, etc. However, other events, such as dimming the lights or all the participants moving slightly, may result in a large luminance difference between two frames. In order to eliminate such events, we define the visual activity as the luminance changes in a small window rather than luminance change in the whole frame.

The luminance changes are found by computing the luminance difference between consecutive intra-coded (I) frames. We employ I-frames because the luminance values in I-frames are coded without prediction from the other frames, and they are therefore independently decodable [24][25]. We compute luminance differences on the average values of 8x8 pixel blocks obtained from the DC coefficients. The DC coefficients are extracted from the MPEG bit stream without full decompression. Average values of the 8x8 pixel blocks are found by first compensating for the DC prediction and then scaling by 8.
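The two steps in the last sentence can be illustrated with a simplified sketch. It assumes that each DC term is coded as a difference from the previous block's DC term, that "scaling by 8" means dividing by 8 (for an orthonormal 8x8 DCT the DC coefficient is 8 times the block mean), and that the predictor starts at a hypothetical mid-gray value of 1024. Actual MPEG bit stream parsing, including the predictor reset rules, is more involved and is not shown.

```python
def block_averages(dc_differentials, initial_predictor=1024):
    """Undo differential DC coding for a run of blocks, then convert each
    reconstructed DC coefficient to the block's average luminance.

    initial_predictor is an assumed starting value (8 * 128, mid-gray).
    """
    averages = []
    predictor = initial_predictor
    for diff in dc_differentials:
        predictor += diff                  # compensate for the DC prediction
        averages.append(predictor / 8.0)   # DC = 8 * mean for an 8x8 block
    return averages


def block_differences(previous_averages, current_averages):
    """Absolute luminance change per block between two consecutive I-frames."""
    return [abs(cur - prev)
            for prev, cur in zip(previous_averages, current_averages)]
```

For example, the differentials `[0, 16, -8]` reconstruct to DC values 1024, 1040, and 1032, that is, block averages of 128, 130, and 129.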

Because the video in our system is donut shaped, the pixels in the outer parts of the video contain less object information (i.e. more pixels per object). Therefore, the pixel values are weighted according to their location to compensate for this when computing the frame differences. The weights are assigned according to the parabolic properties of the mirror as follows:

equation 1

where r is the radius of the DC coefficient location in frame-centered polar coordinates and Rmax is the maximum radius of the donut image. The coefficients that do not contain any information (the locations that correspond to the outside of the mirror area) are given a weight of zero.

We employ a window size of 9x9 DC coefficients, which corresponds to a 72x72 pixel area. The weighted luminance difference is computed for every possible location of this window in a video frame. The local visual activity, Φ , is defined as the maximum of these differences as follows:

equation 2

where W and H are the width and height of the video frame (in number of DC blocks), L is the size of the local activity window (in number of DC blocks), ω(r) is the weight of the DC block at location r (in polar coordinates), and Aij is the average luminance value in a block at location (ix8,jx8).
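A minimal sketch of this windowed maximum is given below. It assumes that the difference inside a window is the sum of weighted absolute block differences between two consecutive I-frames, which is one plausible reading of the definition. The weight function ω(r) is passed in by the caller, and the function and parameter names are hypothetical.

```python
import math


def local_visual_activity(previous, current, window, weight):
    """Maximum weighted luminance change over all window positions.

    previous, current: 2-D lists (H rows by W columns) of block averages
    window: window size L, in DC blocks
    weight: callable mapping the radius of a block (in DC blocks, measured
            from the frame center) to its weight
    """
    height = len(current)
    width = len(current[0])
    center_y = (height - 1) / 2.0
    center_x = (width - 1) / 2.0

    # Weighted absolute difference for every DC block.
    diff = [[weight(math.hypot(y - center_y, x - center_x))
             * abs(current[y][x] - previous[y][x])
             for x in range(width)]
            for y in range(height)]

    # Slide the L x L window over every possible position and keep the maximum.
    best = 0.0
    for top in range(height - window + 1):
        for left in range(width - window + 1):
            total = sum(diff[y][x]
                        for y in range(top, top + window)
                        for x in range(left, left + window))
            best = max(best, total)
    return best
```

Because only the strongest window is kept, a change spread thinly over the whole frame, such as dimming the lights, adds little to any single window, while a person standing up concentrates a large change in one place. For long recordings a summed-area table would avoid recomputing overlapping window sums.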

Figure 11 shows a plot of the local visual activity measure for a meeting video. Large values of visual activity correspond to important visual events. As shown in the figure, most peaks of the visual activity measure correspond to significant visual events, for example, a person taking his place at the table (Figure 11.a), another person leaving the meeting room (Figure 11.c), and someone entering the room (Figure 11.d and Figure 11.e). On the other hand, the video segment shown in Figure 11.b has no visual significance. This segment has a large activity value because the person moved close to the camera and, because of the perspective, appeared as a large moving object. Excluding such segments from the important visual events is possible only if we compensate for the distance of the objects from the camera, using techniques such as stereovision.

Figure 11. Examples of high visual activity scores corresponding to significant visual events.

3.5.2. Audio Analysis

Our system enables navigation of meeting content based on the magnitude of audio and speaker changes. High speech volume often corresponds to meeting segments involving discussions or high emotion. Being able to browse the meeting using speaker changes allows the user to skim through the audio efficiently and listen only to the speakers he/she is interested in.

There are many techniques that segment audio to obtain speaker segments and acoustic classes [26]-[30]. In [26], Arons gives an overview of audio segmentation. Pfau et al. propose an HMM-based speaker segmentation method using a mixture of Gaussians [27]. Error rates of 20% are reported even in controlled environments. Kimber et al. propose an audio browsing tool based on acoustic classes [28]. In [29], Tritschler et al. perform speaker clustering using a Bayesian Information Criterion. It is reported that speaker segmentation is particularly difficult when the speakers are distant from the microphone, the room has many reflective surfaces, training data is not available, and/or multiple speakers talk at the same time.

In our system, many of these obstacles are present. Our experiments showed that basing speaker segmentation on the results of sound localization performed much better than using audio features for speaker clustering. Currently we are working on combining sound localization with people tracking to further improve speaker segmentation.
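One way to derive speaker segments from sound localization output is sketched below. It is an illustration only, since the segmentation procedure is not detailed here. It assumes the localizer yields one direction-of-arrival estimate (azimuth in degrees) per audio frame, with `None` for silence, and that participants stay roughly in place; the 20-degree tolerance is a hypothetical value.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def segment_by_direction(azimuths, tolerance_deg=20.0):
    """Group consecutive frames whose sound direction stays near the
    direction observed at the start of the segment.

    Returns a list of (start_frame, end_frame_exclusive, reference_azimuth).
    """
    segments = []
    start = None
    reference = None
    for i, angle in enumerate(azimuths):
        if angle is None:
            # Silence closes any open segment.
            if start is not None:
                segments.append((start, i, reference))
                start = None
            continue
        if start is None:
            start, reference = i, angle
        elif angular_distance(angle, reference) > tolerance_deg:
            # Direction jumped: treat it as a speaker change.
            segments.append((start, i, reference))
            start, reference = i, angle
    if start is not None:
        segments.append((start, len(azimuths), reference))
    return segments
```

Segments with similar reference directions can then be attributed to the same seat and hence the same participant, without requiring any speaker training data.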