In my project called rest of words, I am building a browser-based video chat application that lets physically isolated users talk with each other in a more creative and effortless way. In this blog post I document my process so far: my research on and understanding of WebRTC, the system architecture I sketched while gradually learning more about WebRTC and machine learning for live streams, and the prototype I have built so far.

Research on Web Real-Time Communication

How does WebRTC solve the problems when peer A tries to talk with peer B?

These problems arise when A is inside a local area network behind a firewall, or when the two peers try to connect from different ends under different constraints. I will describe the main process of how WebRTC solves this as plainly as I can, in my own words, which I hope is less abstract, even though parts of how this process happens are still hazy to me.


  1. Exchange Session Description

    The media data is described with SDP (Session Description Protocol). The two clients might want to know which media formats they both support, so that they can decode the stream after receiving it; they have to talk to each other to reach that agreement. The server that helps the peers do this is called a signaling server. The common methods look like createOffer( ) and createAnswer( ); they negotiate details such as the version, the session name, and the transmission channel, to ensure the two peers talk in the same format (a code sketch of this exchange follows this list).

  2. Negotiate network information of peers

    If peer A and peer B want to connect to each other directly, they need to know each other's address. In an ideal world, all peers would sit directly on the Internet, so they could connect point to point. But in the real world, this process often happens when we are inside a local area network behind a firewall, so we need to go through NAT (Network Address Translation) to find a bridge. The way to do that could be put like this: peer A and peer B both send a signal to a public IP address, so the server at that address learns the IPs of peer A and peer B. This server then helps discover ways for the peers to talk to each other as directly as possible using the ICE (Interactive Connectivity Establishment) framework, which contains methods like STUN and TURN and has NAT as a prerequisite. From my understanding, STUN can detect the public IP address of a client as translated by NAT, but STUN won't always work due to some limitations; then TURN (Traversal Using Relays around NAT) steps in to complete what STUN cannot.

    The signaling server, which I call the peer server in my project, does the exchange and negotiation of SDP data and network information; it also serves as a manager for all chat rooms and the peer list, handling peers joining and leaving. It must be reachable by all peers.

  3. Build a connection

    After the exchange and negotiation of SDP data and network information, the peers may be able to build a peer-to-peer connection, depending on the results of the negotiation. If they are in the same local area network, they may talk to each other without STUN. If they are in different local area networks, they need STUN to detect their public IP addresses so they can communicate. And if the STUN connection is blocked by some limitation, they need the help of a TURN server.
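Here is a minimal sketch of this offer/answer and candidate exchange using the browser's RTCPeerConnection API. The signaling object is my placeholder for whatever channel the signaling server exposes (a WebSocket, for example); it is not part of WebRTC itself.

// Minimal offer/answer sketch. `signaling` is a placeholder for the
// signaling channel (e.g. a WebSocket wrapper), not a WebRTC API.
const pc = new RTCPeerConnection();

async function startCall(localStream) {
  // Add local tracks so they are described in the SDP we generate.
  localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send({ type: 'offer', sdp: pc.localDescription });
}

// As ICE gathers candidate addresses (step 2), relay them over signaling too.
pc.onicecandidate = (event) => {
  if (event.candidate) signaling.send({ type: 'candidate', candidate: event.candidate });
};

signaling.onmessage = async (msg) => {
  if (msg.type === 'offer') {            // callee: agree on formats and answer
    await pc.setRemoteDescription(msg.sdp);
    const answer = await pc.createAnswer();
    await pc.setLocalDescription(answer);
    signaling.send({ type: 'answer', sdp: pc.localDescription });
  } else if (msg.type === 'answer') {    // caller: store the agreed answer
    await pc.setRemoteDescription(msg.sdp);
  } else if (msg.type === 'candidate') { // either side: try the candidate address
    await pc.addIceCandidate(msg.candidate);
  }
};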

Protocol definitions mentioned above:

NAT - Network Address Translation: When we send a request from a private device inside the local area network at home to a public IP address on the Internet, the request carries our IP information in the IP headers of the packets. With the help of NAT, the hidden addresses (usually private IP addresses) are rewritten into a single public IP address, which makes the outgoing packets appear to originate from the routing device itself rather than from the hidden host.

ICE - Interactive Connectivity Establishment is a technique for finding ways for two computers to talk to each other as directly as possible in a peer-to-peer connection. In P2P connections we want to avoid communicating through a central server (which would slow down communication and be expensive), but direct communication between client applications on the Internet is very tricky due to network address translators (NAT), firewalls, and other network barriers. Links: ICE examples, Better explanations of STUN and TURN.

NAT gateways track outbound requests from a private network and maintain the state of each established connection so they can later direct responses from the peer on the public network to the peer in the private network, which would otherwise not be directly addressable. This tracking makes it possible for Session Traversal Utilities for NAT (STUN) to discover a connectable address for a client. If a firewall does not allow direct media traffic between the peers, Traversal Using Relays around NAT (TURN) places a third-party server in between to relay messages between the two clients.
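In the RTCPeerConnection API, this fallback chain is just configuration: you hand ICE a list of STUN/TURN servers and it tries the most direct route first. The URLs and credentials below are placeholders for illustration only, not real servers.

// ICE configuration with a STUN server (public address discovery) and a
// TURN server (relay fallback). These URLs/credentials are placeholders.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.example.org:3478' },
    { urls: 'turn:turn.example.org:3478', username: 'user', credential: 'secret' }
  ]
});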

SIP - Session Initiation Protocol is a signaling protocol used to establish a “session” between two or more participants, modify that session, and eventually terminate it. For example, when we call someone and start a video chat, the call itself could be called a “session”. The messages are text-based, and the actual data transmission is done by TCP or UDP at layer five, the session layer of the OSI model. The Session Description Protocol (SDP) controls which of the protocols is used.

(These definitions of SIP and ICE are heavily borrowed from the links listed under Protocol Definitions Reference in the last part.)



The system architecture of this project

  • Peer server: Built with peer.js, which enables a peer to create a media stream connection to a remote peer, carrying a peer identification and basic self-introduction information as a new peer. No peer-to-peer data goes through the server; it acts only as a connection broker.

  • Socket server: Built with WebSocket, enabling the exchange of text or binary data between the clients and the server.

  • P2P media stream connection: the following diagram (and the code sketch after it) explains how a new peer joins the peer-to-peer media stream connection:

    • the new peer A gets a peer_id from the peer server and instantiates itself as a “peer”

    • peer A emits its peer_id to the socket server

    • the socket server emits the peer_id to all connected clients

    • remote peers (represented by peer N here) call peer A with its id, sending their own streams at the same time

    • peer A answers the call with its local stream

    • the connected clients display the remote streams they received alongside the local stream

[Diagram: the sequence a new peer follows to join the P2P media stream connection]
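In code, the same sequence might look like the sketch below, using the peer.js and socket.io client APIs. The event name 'peer_id' and the addConnection( ) helper are placeholders of mine, not library functions.

// Socket server side: relay a new peer's id to all other connected clients.
io.on('connection', (socket) => {
  socket.on('peer_id', (id) => socket.broadcast.emit('peer_id', id));
});

// Client side: announce ourselves, call newcomers, answer incoming calls.
const socket = io();      // connect to the socket server
const peer = new Peer();  // register with the peer server and get a peer_id

peer.on('open', (id) => socket.emit('peer_id', id));      // steps 1-2

socket.on('peer_id', (remoteId) => {                      // step 4, remote side
  const call = peer.call(remoteId, localStream);          // call with our stream
  call.on('stream', (s) => addConnection(remoteId, s));   // step 6
});

peer.on('call', (call) => {                               // step 5
  call.answer(localStream);                               // answer with local stream
  call.on('stream', (s) => addConnection(call.peer, s));  // step 6
});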
 

Backlog of the prototype I have made so far


In this prototype (clickable) I handled most of the client side and got a clearer picture of the server side, but I have not yet made a usable server of my own. I used the p5 live media library hosted on https://p5livemedia.itp.io (including a peer server and socket.io), which provides a “hybrid” API over a basic peer server and socket server, dealing with stream and string data. Below I list how I used the existing API to handle the three types of communication.

  1. Stream: send and receive video/audio live streams

Callback:

  • Get the stream from other users and push it into the list of all connections we have; the items in this list look like this:

    {'video': myVideo, 'background': mybackgroundVideo, 'name': 'Me', 'x': 10}

  • Classify the stream with pre-trained machine-learning models (PoseNet, BodyPix, and Teachable Machine).

  • Call the function that sends the name data and the one that receives the name data to be displayed on the front end.

  • Call the function that sends the location data, which uses the nose location of the user, and the one that receives the location data so each user's video moves with their nose (a code sketch of this step follows this list).
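For instance, the nose-location step might look like the sketch below with the ml5 poseNet API. liveMedia.send( ) stands in for the library's string-sending method, and allConnections['me'] for my own entry in the connection list; both names are assumptions of mine.

// Track the nose with ml5 poseNet and broadcast its x position.
// liveMedia.send() and allConnections['me'] are my placeholder names.
const poseNet = ml5.poseNet(myVideo, () => console.log('poseNet ready'));

poseNet.on('pose', (results) => {
  if (results.length > 0) {
    const nose = results[0].pose.nose;  // {x, y} of the nose keypoint
    allConnections['me'].x = nose.x;    // move my own video locally
    liveMedia.send(JSON.stringify({ dataType: 'location', x: nose.x }));
  }
});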

2. Data: strings and values with a tag identifier and a user id

Callback: refresh the fields in the connection list according to the data type.

For example, every item in the list stores its name and location; when we receive the data:

let d = JSON.parse(data);
if (d.dataType == 'name') { allConnections[id].name = d.name; }
else if (d.dataType == 'location') { allConnections[id].x = d.x; }
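The sending side mirrors this. For example, when the user edits the name input field (liveMedia.send( ) is again my stand-in for the library's data method, and nameInput is assumed to be the p5 input element from feature 6 below):

// Broadcast a name change to all peers when the input field changes.
nameInput.changed(() => {
  liveMedia.send(JSON.stringify({ dataType: 'name', name: nameInput.value() }));
});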

3. Disconnected

Print out the id of the user who lost the connection and delete them from the list of all connections we have.
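A minimal sketch of that cleanup, assuming the library passes the disconnected user's id to the callback:

// Remove a disconnected user from the connection list.
function gotDisconnect(id) {
  console.log('disconnected: ' + id);
  delete allConnections[id];
}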

The existing features of this prototype:

  1. real-time video streams of two users

  2. facial expression to tint the video

  3. facial expression to trigger the video sticker

  4. nose movement to move your video while the mouse is dragged

  5. background filter

  6. change your name in the input field


Upcoming features:

  1. Connecting to and disconnecting from the peer server controlled from the client side, not automatically

  2. Turn on/off video

  3. Show/hide chat

  4. Database identification so each user can see their own chat history

  5. Host the server in the cloud and try to keep it running forever (or at least for a while)

  6. Develop my own signaling server and socket server, both because the library I use now only provides turning the live media instance on while there also needs to be a theLiveMediaInstance.stop( ), and to support the features mentioned above

  7. Solve the asynchronization issue of getting the peer server connection, the video stream, and the models

 

I found that I get different results when I run this sketch on different networks. At NYU Shanghai there was less chance of errors. At ByteDance, I could not visit the sketch at all. At home, unexpected errors keep appearing because of the asynchronous arrival of the various objects my sketch requests. When I ran traceroute to editor.p5js.org, I got * * * 64 times before it ended. In further development, I will try to solve this problem with better use of sync/async in JS and work on the network path design.
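One likely fix is to await every asynchronous resource before joining the room, instead of letting the peer connection, the video stream, and the models race against each other. A sketch of that idea, where loadModels( ) and joinRoom( ) are placeholders for my own setup steps:

// Gate the connection on every async resource finishing first.
// loadModels() and joinRoom() are placeholders for my setup steps.
async function setupCall() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  const models = await loadModels(stream); // e.g. poseNet + bodyPix wrapped in a Promise
  joinRoom(stream, models);                // only now connect to the peer server
}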

 

Reference:

The library I used:

  1. p5.js from the p5 server

  2. Live media library hosted on https://p5livemedia.itp.io (including peer server and socket.io)

  3. BodyPix and PoseNet from the ml5 library

  4. Teachable Machine model hosted on https://teachablemachine.withgoogle.com

Coding reference:

  1. Zoom annotations with Machine Learning and p5.js from Shiffman

  2. p5Live Multiple with Data from Dano

  3. The idea of using nose position data to trigger the video, inspired by Rae

The learning materials to understand WebRTC:

  1. Get Started with WebRTC

  2. Send data between browsers with WebRTC data channels

  3. WebRTC from basics to advanced (WebRTC 由浅入深)

  4. Build a WebRTC test environment step by step for a 1-to-1 video call (手把手搭建WebRTC测试环境,实现1对1视频通话)

The learning materials to build a video chat web application:

  1. How to Build a Video Chat Application with Node.js, Socket.io and TypeScript

    The difficulties I met when trying that:

    • Building the src directory structure, including routes, HTML, servers, and package.json

    • Understanding TypeScript

    The concepts I learnt from it (explained in the first chapter):

    • STUN and TURN servers

    • ICE candidates

    • JSEP's architecture

  2. Build the backend services needed for a WebRTC app

  3. Signaling and video calling

  4. Send data between browsers with WebRTC data channels

Protocol Definitions Reference:

  1. What is SIP – Session Initiation Protocol

  2. Wikipedia, Network address translation

  3. Interactive Connectivity Establishment

  4. ICE examples

  5. Better explanations of STUN and TURN