<pre>Internet Engineering Task Force (IETF) A. Romanow
Request for Comments: 7205 Cisco
Category: Informational S. Botzko
ISSN: 2070-1721 M. Duckworth
Polycom
R. Even, Ed.
Huawei Technologies
April 2014
<span class="h1">Use Cases for Telepresence Multistreams</span>
Abstract
Telepresence conferencing systems seek to create an environment that
gives users (or user groups) that are not co-located a feeling of co-
located presence through multimedia communication that includes at
least audio and video signals of high fidelity. A number of
techniques for handling audio and video streams are used to create
this experience. When these techniques are not similar,
interoperability between different systems is difficult at best, and
often not possible. Conveying information about the relationships
between multiple streams of media would enable senders and receivers
to make choices to allow telepresence systems to interwork. This
memo describes the most typical and important use cases for sending
multiple streams in a telepresence conference.
Status of This Memo
This document is not an Internet Standards Track specification; it is
published for informational purposes.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Not all documents
approved by the IESG are a candidate for any level of Internet
Standard; see <a href="./rfc5741#section-2">Section 2 of RFC 5741</a>.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
<a href="http://www.rfc-editor.org/info/rfc7205">http://www.rfc-editor.org/info/rfc7205</a>.
<span class="grey">Romanow, et al. Informational [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to <a href="https://www.rfc-editor.org/bcp/bcp78">BCP 78</a> and the IETF Trust's Legal
Provisions Relating to IETF Documents
(<a href="http://trustee.ietf.org/license-info">http://trustee.ietf.org/license-info</a>) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
<a href="#section-1">1</a>. Introduction . . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-3">3</a>
<a href="#section-2">2</a>. Overview of Telepresence Scenarios . . . . . . . . . . . . . <a href="#page-4">4</a>
<a href="#section-3">3</a>. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-6">6</a>
<a href="#section-3.1">3.1</a>. Point-to-Point Meeting: Symmetric . . . . . . . . . . . . <a href="#page-7">7</a>
<a href="#section-3.2">3.2</a>. Point-to-Point Meeting: Asymmetric . . . . . . . . . . . <a href="#page-7">7</a>
<a href="#section-3.3">3.3</a>. Multipoint Meeting . . . . . . . . . . . . . . . . . . . <a href="#page-9">9</a>
<a href="#section-3.4">3.4</a>. Presentation . . . . . . . . . . . . . . . . . . . . . . <a href="#page-10">10</a>
<a href="#section-3.5">3.5</a>. Heterogeneous Systems . . . . . . . . . . . . . . . . . . <a href="#page-11">11</a>
<a href="#section-3.6">3.6</a>. Multipoint Education Usage . . . . . . . . . . . . . . . <a href="#page-12">12</a>
<a href="#section-3.7">3.7</a>. Multipoint Multiview (Virtual Space) . . . . . . . . . . <a href="#page-14">14</a>
<a href="#section-3.8">3.8</a>. Multiple Presentation Streams - Telemedicine . . . . . . <a href="#page-15">15</a>
<a href="#section-4">4</a>. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . <a href="#page-16">16</a>
<a href="#section-5">5</a>. Security Considerations . . . . . . . . . . . . . . . . . . . <a href="#page-16">16</a>
<a href="#section-6">6</a>. Informative References . . . . . . . . . . . . . . . . . . . <a href="#page-16">16</a>
<span class="grey">Romanow, et al. Informational [Page 2]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-3" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
<span class="h2"><a class="selflink" id="section-1" href="#section-1">1</a>. Introduction</span>
Telepresence applications try to provide a "being there" experience
for conversational video conferencing. Often, this telepresence
application is described as "immersive telepresence" in order to
distinguish it from traditional video conferencing and from other
forms of remote presence not related to conversational video
conferencing, such as avatars and robots. The salient
characteristics of telepresence are often described as: being actual
sized, providing immersive video, preserving interpersonal
interaction, and allowing non-verbal communication.
Although telepresence systems are based on open standards such as RTP
[<a href="./rfc3550" title=""RTP: A Transport Protocol for Real-Time Applications"">RFC3550</a>], SIP [<a href="./rfc3261" title=""SIP: Session Initiation Protocol"">RFC3261</a>], H.264 [<a href="#ref-ITU.H264" title=""Advanced video coding for generic audiovisual services"">ITU.H264</a>], and the H.323 [<a href="#ref-ITU.H323" title=""Packet-based Multimedia Communications Systems"">ITU.H323</a>]
suite of protocols, they cannot easily interoperate with each other
without operator assistance and expensive additional equipment that
translates from one vendor's protocol to another.
The basic features that give telepresence its distinctive
characteristics are implemented in disparate ways in different
systems. Currently, telepresence systems from diverse vendors
interoperate to some extent, but this is not supported in a
standards-based fashion. Interworking requires that translation and
transcoding devices be included in the architecture. Such devices
increase latency, reducing the quality of interpersonal interaction.
Use of these devices is often not automatic; it frequently requires
substantial manual configuration and a detailed understanding of the
nature of underlying audio and video streams. This state of affairs
is not acceptable for the continued growth of telepresence -- these
systems should have the same ease of interoperability as do
telephones. Thus, a standard way of describing the multiple streams
constituting the media flows and the fundamental aspects of their
behavior would allow telepresence systems to interwork.
This document presents a set of use cases describing typical
scenarios. Requirements will be derived from these use cases in a
separate document. The use cases are described from the viewpoint of
the users. They are illustrative of the user experience that needs
to be supported. It is possible to implement these use cases in a
variety of different ways.
Many different scenarios need to be supported. This document
describes in detail the most common and basic use cases. These will
cover most of the requirements. There may be additional scenarios
that bring new features and requirements that can be used to extend
the initial work.
<span class="grey">Romanow, et al. Informational [Page 3]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-4" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
Point-to-point and multipoint telepresence conferences are
considered. In some use cases, the number of screens is the same at
all sites; in others, the number of screens differs at different
sites. Both use cases are considered. Also included is a use case
describing display of presentation material or content.
The multipoint use cases may include a variety of systems from
conference room systems to handheld devices, and such a use case is
described in the document.
This document's structure is as follows: <a href="#section-2">Section 2</a> gives an overview
of scenarios, and <a href="#section-3">Section 3</a> describes use cases.
<span class="h2"><a class="selflink" id="section-2" href="#section-2">2</a>. Overview of Telepresence Scenarios</span>
This section describes the general characteristics of the use cases
and what the scenarios are intended to show. The typical setting is
a business conference, which was the initial focus of telepresence.
Recently, consumer products have also begun to appear. We
specifically do not include in our scenarios the physical
infrastructure aspects of telepresence, such as room construction,
layout, and decoration. Furthermore, these use cases do not describe
all the aspects needed to create the best user experience (for
example, the human factors).
We also specifically do not attempt to precisely define the
boundaries between telepresence systems and other systems, nor do we
attempt to identify the "best" solution for each presented scenario.
Telepresence systems are typically composed of one or more video
cameras and encoders and one or more display screens of large size
(diagonal around 60 inches). Microphones pick up sound, and audio
codec(s) produce one or more audio streams. The cameras used to
capture the telepresence users are referred to as "participant
cameras" (and likewise for screens). There may also be other
cameras, such as for document display. These will be referred to as
"presentation cameras" or "content cameras", which generally have
different formats, aspect ratios, and frame rates from the
participant cameras. The presentation streams may be shown on
participant screens or on auxiliary display screens. A user's
computer may also serve as a virtual content camera, generating an
animation or playing a video for display to the remote participants.
We describe such a telepresence system as sending one or more video
streams, audio streams, and presentation streams to the remote
system(s).
<span class="grey">Romanow, et al. Informational [Page 4]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-5" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
The fundamental parameters describing today's typical telepresence
scenarios include:
1. The number of participating sites
2. The number of visible seats at a site
3. The number of cameras
4. The number and type of microphones
5. The number of audio channels
6. The screen size
7. The screen capabilities -- such as resolution, frame rate,
aspect ratio
8. The arrangement of the screens in relation to each other
9. The number of primary screens at each site
10. Type and number of presentation screens
11. Multipoint conference display strategies -- for example, the
camera-to-screen mappings may be static or dynamic
12. The camera point of capture
13. The cameras' fields of view and how they spatially relate to each
other
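As a non-normative illustration, the parameters above can be collected into a simple per-site data structure; all the field names and the example values below are hypothetical, invented for this sketch rather than drawn from any protocol:

```python
from dataclasses import dataclass

@dataclass
class SiteDescription:
    """Hypothetical summary of the per-site parameters listed above."""
    visible_seats: int
    num_cameras: int
    num_microphones: int
    num_audio_channels: int
    screen_diagonal_inches: float    # immersive systems: around 60 inches
    screen_resolution: tuple         # (width, height) in pixels
    screen_frame_rate: float
    screen_aspect_ratio: str         # e.g., "16:9"
    num_primary_screens: int
    num_presentation_screens: int
    display_mapping: str = "static"  # "static" or "dynamic" (parameter 11)

# Example: a typical 3-screen immersive room seating 6 participants
room = SiteDescription(
    visible_seats=6, num_cameras=3, num_microphones=3,
    num_audio_channels=2, screen_diagonal_inches=65.0,
    screen_resolution=(1920, 1080), screen_frame_rate=60.0,
    screen_aspect_ratio="16:9", num_primary_screens=3,
    num_presentation_screens=1,
)
print(room.num_cameras)  # 3
```

A real interoperability solution would of course exchange such information in a standardized form; the sketch only shows the kind of data involved.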
As discussed in the introduction, the basic features that give
telepresence its distinctive characteristics are implemented in
disparate ways in different systems.
There is no agreed-upon way to adequately describe the semantics of
how streams of various media types relate to each other. Without a
standard for stream semantics to describe the particular roles and
activities of each stream in the conference, interoperability is
cumbersome at best.
In a multiple-screen conference, the video and audio streams sent
from remote participants must be understood by receivers so that they
can be presented in a coherent and life-like manner. This includes
the ability to present remote participants at their actual size for
their apparent distance, while maintaining correct eye contact,
<span class="grey">Romanow, et al. Informational [Page 5]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-6" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
gestural cues, and simultaneously providing a spatial audio sound
stage that is consistent with the displayed video.
The receiving device that decides how to render incoming information
needs to understand a number of variables such as the spatial
position of the speaker, the field of view of the cameras, the camera
zoom, which media stream is related to each of the screens, etc. It
is not simply that individual streams must be adequately described
(to a large extent, the means for this already exist), but rather
that the semantics
of the relationships between the streams must be communicated. Note
that all of this is still required even if the basic aspects of the
streams, such as the bit rate, frame rate, and aspect ratio, are
known. Thus, this problem has aspects considerably beyond those
encountered in interoperation of video conferencing systems that have
a single camera/screen.
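As a non-normative sketch of why per-stream descriptions alone are insufficient, the cross-stream relationships might be expressed as metadata accompanying the streams; the field names below are invented for illustration only:

```python
# Each stream can already describe itself (codec, bit rate, resolution),
# but coherent rendering also needs cross-stream relationships, such as
# the spatial ordering of cameras and audio/video pairing.
streams = [
    {"id": "video-1", "kind": "video", "spatial_index": 0},  # leftmost camera
    {"id": "video-2", "kind": "video", "spatial_index": 1},
    {"id": "video-3", "kind": "video", "spatial_index": 2},  # rightmost camera
    {"id": "audio-1", "kind": "audio", "paired_video": "video-1"},
]

def render_order(streams):
    """Order the video streams left to right for display at the receiver."""
    videos = [s for s in streams if s["kind"] == "video"]
    return [s["id"] for s in sorted(videos, key=lambda s: s["spatial_index"])]

print(render_order(streams))  # ['video-1', 'video-2', 'video-3']
```

Without something playing the role of `spatial_index` and `paired_video` here, a receiver cannot reconstruct the sender's sound stage and seating arrangement, however well each individual stream is described.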
<span class="h2"><a class="selflink" id="section-3" href="#section-3">3</a>. Use Cases</span>
The use cases focus on typical implementations. There are a number
of possible variants for these use cases; for example, the audio
supported may differ at the end points (such as mono or stereo versus
surround sound), etc.
Many of these systems offer a "full conference room" solution, where
local participants sit at one side of a table and remote participants
are displayed as if they are sitting on the other side of the table.
The cameras and screens are typically arranged to provide a panoramic
view of the remote room (left to right from the local user's
viewpoint).
The sense of immersion and non-verbal communication is fostered by a
number of technical features, such as:
1. Good eye contact, which is achieved by careful placement of
participants, cameras, and screens.
2. Camera field of view and screen sizes are matched so that the
images of the remote room appear to be full size.
3. The left side of each room is presented on the right screen at
the far end; similarly, the right side of the room is presented
on the left screen. The effect of this is that participants of
each site appear to be sitting across the table from each other.
If 2 participants on the same site glance at each other, all
participants can observe it. Likewise, if a participant at one
site gestures to a participant on the other site, all
participants observe the gesture itself and the participants it
includes.
<span class="grey">Romanow, et al. Informational [Page 6]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-7" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
<span class="h3"><a class="selflink" id="section-3.1" href="#section-3.1">3.1</a>. Point-to-Point Meeting: Symmetric</span>
In this case, each of the 2 sites has an identical number of screens,
with cameras having fixed fields of view, and 1 camera for each
screen. The sound type is the same at each end. As an example,
there could be 3 cameras and 3 screens in each room, with stereo
sound being sent and received at each end.
Each screen is paired with a corresponding camera. Each camera/
screen pair is typically connected to a separate codec, producing an
encoded stream of video for transmission to the remote site, and
receiving a similarly encoded stream from the remote site.
Each system has one or multiple microphones for capturing audio. In
some cases, stereophonic microphones are employed. In other systems,
a microphone may be placed in front of each participant (or pair of
participants). In typical systems, all the microphones are connected
to a single codec that sends and receives the audio streams as either
stereo or surround sound. The number of microphones and the number
of audio channels are often not the same as the number of cameras.
Also, the number of microphones is often not the same as the number
of loudspeakers.
The audio may be transmitted as multi-channel (stereo/surround sound)
or as distinct and separate monophonic streams. Audio levels should
be matched, so the sound levels at both sites are identical.
Loudspeaker and microphone placements are chosen so that the sound
"stage" (orientation of apparent audio sources) is coordinated with
the video. That is, if a participant at one site speaks, the
participants at the remote site perceive her voice as originating
from her visual image. In order to accomplish this, the audio needs
to be mapped at the received site in the same fashion as the video.
That is, audio received from the right side of the room needs to be
output from loudspeaker(s) on the left side at the remote site, and
vice versa.
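The mirrored audio mapping described above can be sketched in a few lines; the position labels and stream names are hypothetical:

```python
def mirror_audio_channels(channels):
    """Swap left and right so the remote sound stage matches the mirrored video.

    `channels` maps a capture position at the sending site to a stream;
    the returned dict maps playback positions at the receiving site.
    """
    swap = {"left": "right", "right": "left", "center": "center"}
    return {swap[pos]: stream for pos, stream in channels.items()}

captured = {"left": "audio-1", "center": "audio-2", "right": "audio-3"}
playback = mirror_audio_channels(captured)
print(playback["right"])  # audio-1: captured on the left, played on the right
```

The same left/right reversal applies to the video streams, which is what keeps a speaker's voice co-located with her image.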
<span class="h3"><a class="selflink" id="section-3.2" href="#section-3.2">3.2</a>. Point-to-Point Meeting: Asymmetric</span>
In this case, each site has a different number of screens and cameras
than the other site. The important characteristic of this scenario
is that the number of screens is different between the 2 sites. This
creates challenges that are handled differently by different
telepresence systems.
This use case builds on the basic scenario of 3 screens to 3 screens.
Here, we use the common case of 3 screens and 3 cameras at one site,
and 1 screen and 1 camera at the other site, connected by a point-to-
point call. The screen sizes and camera fields of view at both sites
<span class="grey">Romanow, et al. Informational [Page 7]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-8" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
are basically similar, such that each camera view is designed to show
2 people sitting side by side. Thus, the 1-screen room has up to 2
people seated at the table, while the 3-screen room may have up to 6
people at the table.
The basic considerations of defining left and right and indicating
relative placement of the multiple audio and video streams are the
same as in the 3-3 use case. However, handling the mismatch in the
number of screens and cameras between the 2 sites requires more
complicated maneuvers.
For the video sent from the 1-camera room to the 3-screen room, the
usual approach is simply to use 1 of the 3 screens and keep the
second and third screens inactive or, for example, have them display
the current date. This would maintain the "full-size" image of the
remote side.
For the other direction, the 3-camera room sending video to the
1-screen room, there are more complicated variations to consider.
Here are several possible ways in which the video streams can be
handled.
1. The 1-screen system might simply show only 1 of the 3 camera
images, since the receiving side has only 1 screen. 2 people are
seen at full size, but 4 people are not seen at all. The choice
of which one of the 3 streams to display could be fixed, or could
be selected by the users. It could also be made automatically
based on who is speaking in the 3-screen room, such that the
people in the 1-screen room always see the person who is
speaking. If the automatic selection is done at the sender, the
transmission of streams that are not displayed could be
suppressed, which would avoid wasting bandwidth.
2. The 1-screen system might be capable of receiving and decoding
all 3 streams from all 3 cameras. The 1-screen system could then
compose the 3 streams into 1 local image for display on the
single screen. All 6 people would be seen, but smaller than full
size. This could be done in conjunction with reducing the image
resolution of the streams, such that encode/decode resources and
bandwidth are not wasted on streams that will be downsized for
display anyway.
3. The 3-screen system might be capable of including all 6 people in
a single stream to send to the 1-screen system. For example, it
could use PTZ (Pan Tilt Zoom) cameras to physically adjust the
cameras such that 1 camera captures the whole room of 6 people.
Or, it could recompose the 3 camera images into 1 encoded stream
to send to the remote site. These variations also show all 6
people but at a reduced size.
<span class="grey">Romanow, et al. Informational [Page 8]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-9" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
4. Or, there could be a combination of these approaches, such as
simultaneously showing the speaker in full size with a composite
of all 6 participants in a smaller size.
The receiving telepresence system needs to have information about the
content of the streams it receives to make any of these decisions.
If the systems are capable of supporting more than one strategy,
there needs to be some negotiation between the 2 sites to figure out
which of the possible variations they will use in a specific point-
to-point call.
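The negotiation described above can be viewed as a simple capability intersection: each site advertises the strategies it supports, and the call uses one both sides accept. The strategy names and the preference order below are illustrative only, not defined by any protocol:

```python
def choose_strategy(local_prefs, remote_capabilities):
    """Pick the first locally preferred strategy the remote side supports.

    Returns None if the sites share no strategy; a real call would then
    fall back to some basic single-stream mode.
    """
    for strategy in local_prefs:
        if strategy in remote_capabilities:
            return strategy
    return None

# 1-screen site prefers a sender-side composition (variation 3 above);
# the 3-screen site offers only variations 1 and 2.
prefs = ["single-composed-stream", "speaker-only", "receiver-composition"]
offered = {"speaker-only", "receiver-composition"}
print(choose_strategy(prefs, offered))  # speaker-only
```

The point of the sketch is that both the capability sets and the semantics of each named strategy must be mutually understood for the negotiation to succeed.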
<span class="h3"><a class="selflink" id="section-3.3" href="#section-3.3">3.3</a>. Multipoint Meeting</span>
In a multipoint telepresence conference, there are more than 2 sites
participating. Additional complexity is required to enable media
streams from each participant to show up on the screens of the other
participants.
Clearly, there are a great number of topologies that can be used to
display the streams from multiple sites participating in a
conference.
One major objective for telepresence is to be able to preserve the
"being there" user experience. However, in multi-site conferences,
it is often (in fact, usually) not possible to simultaneously provide
full-size video, eye contact, and common perception of gestures and
gaze by all participants. Several policies can be used for stream
distribution and display: all provide good results, but they all make
different compromises.
One common policy is called site switching. Let's say the speaker is
at site A and the other participants are at various "remote" sites.
When the room at site A is shown, all the camera images from site A are
forwarded to the remote sites. Therefore, at each receiving remote
site, all the screens display camera images from site A. This can be
used to preserve full-size image display, and also provide full
visual context of the displayed far end, site A. In site switching,
there is a fixed relation between the cameras in each room and the
screens in remote rooms. The room or participants being shown are
switched from time to time based on who is speaking or by manual
control, e.g., from site A to site B.
Segment switching is another policy choice. In segment switching
(assuming still that site A is where the speaker is, and "remote"
refers to all the other sites), rather than sending all the images
from site A, only the speaker at site A is shown. The camera images
of the current speaker and previous speakers (if any) are forwarded
to the other sites in the conference. Therefore, the screens in each
<span class="grey">Romanow, et al. Informational [Page 9]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-10" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
site are usually displaying images from different remote sites -- the
current speaker at site A and the previous ones. This strategy can
be used to preserve full-size image display and also capture the non-
verbal communication between the speakers. In segment switching, the
display depends on the activity in the remote rooms (generally, but
not necessarily based on audio/speech detection).
A third possibility is to reduce the image size so that multiple
camera views can be composited onto one or more screens. This does
not preserve full-size image display, but it provides the most visual
context (since more sites or segments can be seen). Typically in
this case, the display mapping is static, i.e., each part of each
room is shown in the same location on the display screens throughout
the conference.
Other policies and combinations are also possible. For example,
there can be a static display of all screens from all remote rooms,
with part or all of one screen being used to show the current speaker
at full size.
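The first two switching policies can be contrasted in a short sketch; the site names, stream names, and the speech-activity input are all hypothetical:

```python
def site_switching(streams_by_site, speaking_site):
    """Site switching: forward every camera stream from the speaking site,
    preserving the fixed camera-to-screen mapping."""
    return list(streams_by_site[speaking_site])

def segment_switching(streams_by_site, recent_speakers):
    """Segment switching: forward one stream per recent speaker (site,
    camera index), most recent first."""
    return [streams_by_site[site][cam] for site, cam in recent_speakers]

streams = {"A": ["A-left", "A-center", "A-right"],
           "B": ["B-left", "B-center", "B-right"]}

print(site_switching(streams, "A"))
# ['A-left', 'A-center', 'A-right']  (all screens show site A)
print(segment_switching(streams, [("A", 1), ("B", 0)]))
# ['A-center', 'B-left']  (screens mix segments from different sites)
```

In practice, the selection input would come from audio/speech detection or manual control, and the receiver still needs to know which policy is in effect to place the streams sensibly.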
<span class="h3"><a class="selflink" id="section-3.4" href="#section-3.4">3.4</a>. Presentation</span>
In addition to the video and audio streams showing the participants,
additional streams are used for presentations.
In systems available today, generally only one additional video
stream is available for presentations. Often, this presentation
stream is half-duplex in nature, with presenters taking turns. The
presentation stream may be captured from a PC screen, or it may come
from a multimedia source such as a document camera, camcorder, or a
DVD. In a multipoint meeting, the presentation streams for the
currently active presentation are always distributed to all sites in
the meeting, so that the presentations are viewed by all.
Some systems display the presentation streams on a screen that is
mounted either above or below the 3 participant screens. Other
systems provide screens on the conference table for observing
presentations. If multiple presentation screens are used, they
generally display identical content. There is considerable variation
in the placement, number, and size of presentation screens.
In some systems, presentation audio is pre-mixed with the room audio.
In others, a separate presentation audio stream is provided (if the
presentation includes audio).
<span class="grey">Romanow, et al. Informational [Page 10]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-11" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
In H.323 [<a href="#ref-ITU.H323" title=""Packet-based Multimedia Communications Systems"">ITU.H323</a>] systems, H.239 [<a href="#ref-ITU.H239" title=""Role management and additional media channels for H.300-series terminals"">ITU.H239</a>] is typically used to
control the video presentation stream. In SIP systems, similar
control mechanisms can be provided using the Binary Floor Control
Protocol (BFCP) [<a href="./rfc4582" title=""The Binary Floor Control Protocol (BFCP)"">RFC4582</a>] for the presentation token. These
mechanisms are suitable for managing a single presentation stream.
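The half-duplex, turn-taking behavior these mechanisms provide can be modeled as a single floor token; this is a toy sketch, not the actual BFCP or H.239 state machine:

```python
class PresentationToken:
    """Toy floor-control token: one presenter at a time, taking turns."""
    def __init__(self):
        self.holder = None

    def request(self, site):
        # Grant only if the floor is free; a real floor control protocol
        # would queue or arbitrate among pending requests.
        if self.holder is None:
            self.holder = site
            return True
        return False

    def release(self, site):
        if self.holder == site:
            self.holder = None

token = PresentationToken()
print(token.request("site-A"))  # True: floor granted
print(token.request("site-B"))  # False: site-A still holds the floor
token.release("site-A")
print(token.request("site-B"))  # True: site-B may now present
```

A single such token is exactly what limits today's systems to one presentation stream at a time, which motivates the multiple-stream uses below.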
Although today's systems remain limited to a single video
presentation stream, there are obvious uses for multiple presentation
streams:
1. Frequently, the meeting convener is following a meeting agenda,
and it is useful for her to be able to show that agenda to all
participants during the meeting. Other participants at various
remote sites are able to make presentations during the meeting,
with the presenters taking turns. The presentations and the
agenda are both shown, either on separate screens, or perhaps
rescaled and shown on a single screen.
2. A single multimedia presentation can itself include multiple
video streams that should be shown together. For instance, a
presenter may be discussing the fairness of media coverage. In
addition to slides that support the presenter's conclusions, she
also has video excerpts from various news programs that she shows
to illustrate her findings. She uses a DVD player for the video
excerpts so that she can pause and reposition the video as
needed.
3. An educator may present a multiscreen slide show, which requires
that the placement of the images on the multiple screens at each
site be consistent.
There are many other examples where multiple presentation streams are
useful.
<span class="h3"><a class="selflink" id="section-3.5" href="#section-3.5">3.5</a>. Heterogeneous Systems</span>
It is common in meeting scenarios for people to join the conference
from a variety of environments, using different types of endpoint
devices. A multiscreen immersive telepresence conference may include
someone on a PC-based video conferencing system, a participant
calling in by phone, and (soon) someone on a handheld device.
What experience/view will each of these devices have?
Some may be able to handle multiple streams, and others can handle
only a single stream. (Here, we are not talking about legacy
systems, but rather systems built to participate in such a
conference, although they are single stream only.) In a single video
<span class="grey">Romanow, et al. Informational [Page 11]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-12" ></span>
<span class="grey"><a href="./rfc7205">RFC 7205</a> Telepresence Use Cases April 2014</span>
stream, one or more compositions may be present, depending on
the available screen space on the device. In most cases, an
intermediate transcoding device will be relied upon to produce a
single stream, perhaps with some kind of continuous presence.
Bit rates will vary -- the handheld device and phone having lower bit
rates than PC and multiscreen systems.
Layout is accomplished according to different policies. For
example, a handheld device and a PC may each receive the active
speaker stream. The decision can be made explicitly by the
receiver, or by the sender if the receiver supplies some kind of
rendering hint. The same is true for audio: the endpoint receives
either a mixed stream or, if mixing is not available in the network,
streams for a number of the loudest speakers.
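A sender-side selection of this kind can be sketched in a few lines. The device classes, bit-rate ceilings, and hint format below are assumptions made for illustration, not values defined by this document:

```python
# Illustrative sender-side layout policy for a heterogeneous
# conference.  Device classes and bit-rate ceilings are assumptions.
POLICY = {
    "handheld":    {"max_streams": 1, "max_kbps": 384},
    "phone":       {"max_streams": 0, "max_kbps": 64},   # audio only
    "pc":          {"max_streams": 1, "max_kbps": 1500},
    "multiscreen": {"max_streams": 3, "max_kbps": 6000},
}

def select_streams(device, available, hint=None):
    """Pick video streams for a receiver.

    `hint` is an optional receiver-supplied rendering hint listing the
    stream ids it wants; otherwise the sender defaults to the active
    speaker.
    """
    limit = POLICY[device]["max_streams"]
    if limit == 0:
        return []
    if hint:
        return hint[:limit]
    # Default policy: active-speaker stream first (stable order after).
    ordered = sorted(available, key=lambda s: not s["active_speaker"])
    return [s["id"] for s in ordered[:limit]]

streams = [{"id": "cam1", "active_speaker": False},
           {"id": "cam2", "active_speaker": True}]
print(select_streams("handheld", streams))  # ['cam2']
```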
For the PC-based conferencing participant, the user's experience
depends on the application. It could be single stream, similar to a
handheld device but with a bigger screen. Or, it could be multiple
streams, similar to an immersive telepresence system but with a
smaller screen. Control for manipulation of streams can be local in
the software application, or in another location and sent to the
application over the network.
The handheld device is the most constrained case. How will that
participant be viewed and heard? It should be an equal participant,
though its bandwidth will be significantly less than that of an
immersive system. A
receiver may choose to display output coming from a handheld device
differently based on the resolution, but that would be the case with
any low-resolution video stream, e.g., from a powerful PC on a bad
network.
The handheld device will send and receive a single video stream,
which could be a composite or a subset of the conference. The
handheld device could say what it wants or could accept whatever the
sender (conference server or sending endpoint) thinks is best. The
handheld device will have to signal any actions it wants to take in
the same way that an immersive system signals such actions.
<span class="h3"><a class="selflink" id="section-3.6" href="#section-3.6">3.6</a>. Multipoint Education Usage</span>
The importance of this example is that the multiple video streams are
not used to create an immersive conferencing experience with
panoramic views at all the sites. Instead, the multiple streams are
dynamically used to enable full participation of remote students in a
university class. In some instances, the same video stream is
displayed on multiple screens in the room; in other instances, an
available stream is not displayed at all.
The main site is a university auditorium that is equipped with 3
cameras. One camera is focused on the professor at the podium. A
second camera is mounted on the wall behind the professor and
captures the class in its entirety. The third camera is co-located
with the second and is designed to capture a close-up view of a
questioner in the audience. It automatically zooms in on that
student using sound localization.
Although the auditorium is equipped with 3 cameras, it is only
equipped with 2 screens. One is a large screen located at the front
so that the class can see it. The other is located at the rear so
the professor can see it. When someone asks a question, the front
screen shows the questioner. Otherwise, it shows the professor
(ensuring everyone can easily see her).
The remote sites are typical immersive telepresence rooms, each with
3 camera/screen pairs.
All remote sites display the professor on the center screen at full
size. A second screen shows the entire classroom view when the
professor is speaking. However, when a student asks a question, the
second screen shows the close-up view of the student at full size.
Sometimes the student is in the auditorium; sometimes the speaking
student is at another remote site. The remote systems never display
the students that are actually in that room.
If someone at a remote site asks a question, then the screen in the
auditorium will show the remote student at full size (as if they were
present in the auditorium itself). The screen in the rear also shows
this questioner, allowing the professor to see and respond to the
student without needing to turn her back on the main class.
When no one is asking a question, the screen in the rear briefly
shows a full-room view of each remote site in turn, allowing the
professor to monitor the entire class (remote and local students).
The professor can also use a control on the podium to see a
particular site -- she can choose either a full-room view or a
single-camera view.
Realization of this use case does not require any negotiation between
the participating sites. Endpoint devices (and a Multipoint Control
Unit (MCU), if present) need to know who is speaking and what video
stream includes the view of that speaker. The remote systems need
some knowledge of which stream should be placed in the center. The
ability of the professor to see specific sites (or for the system to
show all the sites in turn) would also require the auditorium system
to know what sites are available and to be able to request a
particular view of any site. Bandwidth is optimized if video that is
not being shown at a particular site is not distributed to that site.
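The display-selection rules of this use case can be summarized as a small decision function. The site, screen, and stream names below are purely illustrative:

```python
def auditorium_front(questioner_site):
    # Front screen: the questioner when there is one, else the
    # professor, so the class can always see whoever is speaking.
    return "questioner-closeup" if questioner_site else "professor"

def remote_layout(site, questioner_site):
    """Layout for a 3-screen remote site.

    The center screen always shows the professor; the second screen
    shows the classroom view unless a question is being asked at some
    other site.  A remote site never displays its own students.
    """
    layout = {"center": "professor"}
    if questioner_site and questioner_site != site:
        layout["second"] = "questioner-closeup"
    else:
        # No question in progress, or the questioner is local: keep
        # the classroom view (never show a site's own students).
        layout["second"] = "classroom"
    return layout

print(remote_layout("site-a", questioner_site="site-b"))
# {'center': 'professor', 'second': 'questioner-closeup'}
```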
<span class="h3"><a class="selflink" id="section-3.7" href="#section-3.7">3.7</a>. Multipoint Multiview (Virtual Space)</span>
This use case describes a virtual space multipoint meeting with good
eye contact and spatial layout of participants. The use case was
proposed very early in the development of video conferencing systems,
as described in 1983 by Allardyce and Randall [<a href="#ref-virtualspace">virtualspace</a>], whose
report introduced the term "virtual space"; the use case is
illustrated in Figure 2-5 of that report. The virtual space expands
the point-to-point case by having all multipoint conference
participants "seated" in a virtual room. Each participant has a
fixed "seat" in the virtual room, so each participant expects to see
a different view, with a different participant on his left and right
side. Today, the use case is implemented in multiple
telepresence-type video conferencing systems on the market; the main
difference between the results obtained with modern systems and
those from 1983 is larger screen sizes.
Virtual space multipoint as defined here assumes endpoints with
multiple cameras and screens. Usually, there is the same number of
cameras and screens at a given endpoint. A camera is positioned
above each screen. A key aspect of virtual space multipoint is how
the cameras are aimed: each camera is aimed at the same area,
covering the participants at the site. Thus, each
camera takes a picture of the same set of people but from a different
angle. Each endpoint sender in the virtual space multipoint meeting
therefore offers a choice of video streams to remote receivers, each
stream representing a different viewpoint. For example, a camera
positioned above a screen to a participant's left may capture the
participant's left ear, while at the same time a camera positioned
above a screen to the participant's right captures the participant's
right ear.
Since a sending endpoint has a camera associated with each screen, an
association is made between the receiving stream output on a
particular screen and the corresponding sending stream from the
camera associated with that screen. These associations are repeated
for each screen/camera pair in a meeting. The result of this system
is a horizontal arrangement of video images from remote sites, one
per screen. The image shown on each screen is paired with the
output of the camera above that screen, resulting in excellent eye
contact.
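The pairing described above can be sketched as a simple mapping, assuming screens and cameras numbered left to right and one remote site per screen (all names here are invented for the sketch):

```python
# Sketch of the screen/camera association in a virtual space meeting.
# Screens and cameras are numbered 0..N-1 left to right; the camera
# above screen k provides the view sent to the participant shown on
# screen k, so a glance at that screen lands in the matching camera.
def build_associations(num_pairs, remote_sites):
    assert len(remote_sites) <= num_pairs
    assoc = {}
    for k, site in enumerate(remote_sites):
        assoc[k] = {
            "show": f"{site}/viewpoint-{k}",  # stream shown on screen k
            "send": f"camera-{k}",            # local stream sent to site
        }
    return assoc

pairs = build_associations(3, ["site-b", "site-c", "site-d"])
print(pairs[1]["send"])  # camera-1
```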
<span class="h3"><a class="selflink" id="section-3.8" href="#section-3.8">3.8</a>. Multiple Presentation Streams - Telemedicine</span>
This use case describes a scenario where multiple presentation
streams are used. In this use case, the local site is a surgery room
connected to one or more remote sites that may have different
capabilities. At the local site, 3 main cameras capture the whole
room (the typical 3-camera telepresence case). Also, multiple
presentation inputs are available: a surgery camera that is used to
provide a zoomed view of the operation, an endoscopic monitor, a
fluoroscope (X-ray imaging), an ultrasound diagnostic device, an
electrocardiogram (ECG) monitor, etc. These devices are used to
provide multiple local video presentation streams to help the surgeon
monitor the status of the patient and assist in the surgical process.
The local site may have 3 main screens and one (or more) presentation
screen(s). The main screens can be used to display the remote
experts. The presentation screen(s) can be used to display multiple
presentation streams from local and remote sites simultaneously. The
3 main cameras capture different parts of the surgery room. The
surgeon can decide the number, the size, and the placement of the
presentations displayed on the local presentation screen(s). He can
also indicate which local presentation captures are provided for the
remote sites. The local site can send multiple presentation captures
to remote sites, and it can receive from them multiple presentations
related to the patient or the procedure.
One type of remote site is a single- or dual-screen and one-camera
system used by a consulting expert. In the general case, the remote
sites can be part of a multipoint telepresence conference. The
presentation screens at the remote sites allow the experts to see the
details of the operation and related data. Like the main site, the
experts can decide the number, the size, and the placement of the
presentations displayed on the presentation screens. The
presentation screens can display presentation streams from the
surgery room, from other remote sites, or from local presentation
streams. Thus, the experts can also send presentation streams of
their own, carrying medical records, pathology data, reference
material, and analysis.
Another type of remote site is a typical immersive telepresence room
with 3 camera/screen pairs, allowing more experts to join the
consultation. These sites can also be used for education. The
teacher, who is not necessarily the surgeon, and the students are in
different remote sites. Students can observe and learn the details
of the whole procedure, while the teacher can explain and answer
questions during the operation.
All remote education sites can display the surgery room. Another
option is to display the surgery room on the center screen, and the
rest of the screens can show the teacher and the student who is
asking a question. For all the above sites, multiple presentation
screens can be used to enhance visibility: one screen for the zoomed
surgery stream and the others for medical image streams, such as MRI
images, cardiograms, ultrasonic images, and pathology data.
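One possible way to express the screen assignment just described, with stream and screen names invented for the sketch:

```python
# Illustrative assignment of presentation streams to the presentation
# screens at a remote site in the telemedicine use case.
def assign_presentations(screens, streams):
    """Give the zoomed surgery stream its own screen first; remaining
    screens take the medical-image streams in their original order."""
    queue = sorted(streams, key=lambda s: s != "surgery-zoom")
    return {screen: stream for screen, stream in zip(screens, queue)}

print(assign_presentations(
    ["pres-1", "pres-2", "pres-3"],
    ["mri", "surgery-zoom", "ecg", "ultrasound"]))
# {'pres-1': 'surgery-zoom', 'pres-2': 'mri', 'pres-3': 'ecg'}
```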
<span class="h2"><a class="selflink" id="section-4" href="#section-4">4</a>. Acknowledgements</span>
The document has benefitted from input from a number of people
including Alex Eleftheriadis, Marshall Eubanks, Tommy Andre Nyquist,
Mark Gorzynski, Charles Eckel, Nermeen Ismail, Mary Barnes, Pascal
Buhler, and Jim Cole.
Special acknowledgement to Lennard Xiao, who contributed the text for
the telemedicine use case, and to Claudio Allocchio for his detailed
review of the document.
<span class="h2"><a class="selflink" id="section-5" href="#section-5">5</a>. Security Considerations</span>
While there are likely to be security considerations for any solution
for telepresence interoperability, this document has no security
considerations.
<span class="h2"><a class="selflink" id="section-6" href="#section-6">6</a>. Informative References</span>
[<a id="ref-ITU.H239">ITU.H239</a>] ITU-T, "Role management and additional media channels for
H.300-series terminals", ITU-T Recommendation H.239,
September 2005.
[<a id="ref-ITU.H264">ITU.H264</a>] ITU-T, "Advanced video coding for generic audiovisual
services", ITU-T Recommendation H.264, April 2013.
[<a id="ref-ITU.H323">ITU.H323</a>] ITU-T, "Packet-based Multimedia Communications Systems",
ITU-T Recommendation H.323, December 2009.
[<a id="ref-RFC3261">RFC3261</a>] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
A., Peterson, J., Sparks, R., Handley, M., and E.
Schooler, "SIP: Session Initiation Protocol", <a href="./rfc3261">RFC 3261</a>,
June 2002.
[<a id="ref-RFC3550">RFC3550</a>] Schulzrinne, H., Casner, S., Frederick, R., and V.
Jacobson, "RTP: A Transport Protocol for Real-Time
Applications", STD 64, <a href="./rfc3550">RFC 3550</a>, July 2003.
[<a id="ref-RFC4582">RFC4582</a>] Camarillo, G., Ott, J., and K. Drage, "The Binary Floor
Control Protocol (BFCP)", <a href="./rfc4582">RFC 4582</a>, November 2006.
[<a id="ref-virtualspace">virtualspace</a>]
Allardyce, L. and L. Randall, "Development of
Teleconferencing Methodologies with Emphasis on Virtual
Space Video and Interactive Graphics", April 1983,
<<a href="http://www.dtic.mil/docs/citations/ADA127738">http://www.dtic.mil/docs/citations/ADA127738</a>>.
Authors' Addresses
Allyn Romanow
Cisco
San Jose, CA 95134
US
EMail: allyn@cisco.com
Stephen Botzko
Polycom
Andover, MA 01810
US
EMail: stephen.botzko@polycom.com
Mark Duckworth
Polycom
Andover, MA 01810
US
EMail: mark.duckworth@polycom.com
Roni Even (editor)
Huawei Technologies
Tel Aviv
Israel
EMail: roni.even@mail01.huawei.com
</pre>