Using neural networks for enhancing video call quality

One of the things our tech team is always fascinated by is the range of problems neural networks can be applied to. And even though our day-to-day work is conversational text and chatbots that improve customer service for our clients, we are inherently curious and take on other challenges through experimental side projects.

So it is no wonder that a few months ago, during one of our casual Friday afternoon brainstorming sessions, the talk drifted to the poor quality of video calls on an iPhone. The popularity of entertainment and social media applications has paved the way for increased video usage, and mobile video in conference calls is growing as well. Yet while Apple’s FaceTime is a great solution, we were rather unhappy with the quality of the calls.

Yes, the top corner of the screen shows 4G and the speed is supposed to be blazingly fast, but the truth is that most of the time it is really 3G and the lag is noticeable. True, 5G could make calls better, but that is out of our hands and most likely still years away. With that in mind, we started an experimental project to see how we could use our neural networks to make video calls better.

The value of such a solution is improving the quality of video streamed between phones. Artificial intelligence has been applied to improving video quality in general (e.g. Magic Pony, which was acquired by Twitter earlier this year), and our focus was to apply specific AI techniques to improving video calls. The key idea is to run a neural network on the sender’s side (e.g. a dedicated app on the phone) that compresses the video signal. The compressed signal is sent over the network and decoded again on the receiver’s side (again, an app on the phone). The encoding and decoding compress the video into data packets small enough to be transferred over slower networks, while the decoded video still maintains its quality.
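
To make that flow concrete, here is a minimal sketch in Python. It is not our production code: the 256-dimensional latent size and the random linear maps standing in for trained encoder and decoder networks are placeholders purely for illustration.

```python
# Sketch of the encode -> transmit -> decode flow: a learned encoder on the sender's
# phone maps each frame to a small latent vector, only that vector is transmitted,
# and a matching decoder on the receiver's phone reconstructs the frame.
# The random linear maps below are hypothetical stand-ins for trained networks.
import numpy as np

FRAME_SHAPE = (64, 64, 3)   # scaled-down frame, as in our experiments
LATENT_DIM = 256            # assumed size of the compressed representation

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((LATENT_DIM, np.prod(FRAME_SHAPE))) * 0.01
W_dec = rng.standard_normal((np.prod(FRAME_SHAPE), LATENT_DIM)) * 0.01

def encode(frame):
    """Sender side: compress one frame into a small latent vector."""
    return W_enc @ frame.reshape(-1)

def decode(latent):
    """Receiver side: reconstruct an approximate frame from the latent vector."""
    return (W_dec @ latent).reshape(FRAME_SHAPE)

frame = rng.random(FRAME_SHAPE)                    # stand-in for one camera frame
latent = encode(frame)                             # this is what goes over the network
payload = latent.astype(np.float16).tobytes()      # ~0.5 KB instead of ~12 KB raw
received = np.frombuffer(payload, dtype=np.float16).astype(np.float64)
reconstructed = decode(received)
print(len(payload), "bytes sent per frame; reconstructed shape:", reconstructed.shape)
```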

We made a few FaceTime video calls, each around 5 minutes long, giving us 18,000 frames (or pictures) per call. After scaling these frames down to a tiny 64×64 format, we started modeling the data with autoencoders on our deep learning–specialized TitanX GPU (a sketch of such a model follows below). The reason we chose autoencoders is that regular compression algorithms do not distinguish between the objects in the video: they compress each frame as it is, irrespective of what is in it. With autoencoders we can take the nature of the objects in the video into account and differentiate between the important and less important parts of a frame, which can result in better video quality. For example, in a FaceTime call the face of the person speaking is the focus, while the background is not that important, so it makes sense to optimize for the quality of the face rather than the background. Such focusing can make video transport over the network more efficient and also help deliver a better viewing experience to customers.
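
As an illustration, a convolutional autoencoder for 64×64 frames could look roughly like the PyTorch sketch below. The layer sizes, latent width and training settings here are assumptions for the sake of the example rather than the exact configuration we ran, and the framework choice is likewise just for illustration.

```python
# A minimal convolutional autoencoder for 64x64 RGB frames.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        # Encoder: 64x64x3 frame -> small latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),    # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 16 -> 8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )
        # Decoder: latent vector -> reconstructed 64x64x3 frame
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# One training step on a batch of frames (pixel values scaled to [0, 1]):
model = FrameAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.rand(16, 3, 64, 64)   # stand-in for a batch of call frames
loss = nn.functional.mse_loss(model(frames), frames)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```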

Video calls and AI

What we’ve learned thus far is that the conventional cost functions used for reconstructing images might not be sufficient for accurately encoding images containing faces. Faces carry a lot of detailed information that is very important to the user but can be overlooked by a traditional cost function. As a result, our work has shifted towards cost functions that preserve the covariance structure of the reconstructed images. Our initial tests show that cost functions optimized to preserve covariance structure yield considerably more accurate reconstructions.
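
We won’t spell out the exact formulation here, but one simple way to penalize the loss of covariance structure is to compare the covariance of local patches in the reconstruction against that of the original frame and add the mismatch to the usual pixel-wise error. The patch size and weighting in the sketch below are placeholders, not the values from our experiments.

```python
# Sketch of a cost function that adds a covariance-structure penalty to plain MSE.
import torch
import torch.nn.functional as F

def patch_covariance(x, patch_size=8):
    """Covariance of local patch content, treating each patch as one sample.

    x: (batch, channels, height, width)
    returns: (batch, d, d) where d = channels * patch_size**2
    """
    patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (b, d, n_patches)
    centered = patches - patches.mean(dim=2, keepdim=True)
    n_patches = patches.shape[2]
    return centered @ centered.transpose(1, 2) / (n_patches - 1)

def covariance_preserving_loss(reconstruction, target, cov_weight=1.0):
    """Pixel-wise MSE plus a penalty on mismatched local covariance structure."""
    mse = F.mse_loss(reconstruction, target)
    cov_diff = patch_covariance(reconstruction) - patch_covariance(target)
    return mse + cov_weight * cov_diff.pow(2).mean()
```

In a training loop like the one sketched earlier, such a function could simply take the place of the plain MSE term.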