Packet Framing
We now know that the overhead required to wrap our payload into a packet and send it over the network is quite substantial. But we have some wiggle room. Instead of little payloads, we can send bigger payloads. If our overhead for every unencrypted packet is 54 bytes, then the best way to reduce bandwidth utilization is to reduce the number of packets being sent.
Let's say that the 10 bytes of payload constituted 20 milliseconds of audio. That means that for every 1000 milliseconds (1 second) of audio, we'd be sending 50 packets (1000/20). Now, the payload in each of those packets is 10 bytes and the overhead is 54 bytes. Therefore, every second, we'd be sending 500 bytes of payload (10 bytes of payload * 50 packets) and 2,700 bytes of overhead (54 * 50). The total number of bytes we'd send every second is therefore 3,200 bytes (500 + 2,700).
However, let's say we changed the payload size and instead of sending audio every 20 milliseconds, we send at 40 millisecond intervals. Now, the number of packets we'd send every second halves to 25 packets. The total amount of payload we send is still 500 bytes (now 20 bytes per payload but only 25 packets in a second - i.e. 20 * 25 = 500) but the total overhead is now only 1,350 bytes (54 * 25); a full 50% reduction in overhead!
The tradeoff, though, is that we now have to buffer 40 milliseconds of audio before we send it, and receivers have to wait that much longer to receive it. Doing so increases audio latency, which is typically undesirable, but in a push-to-talk environment this modest increase in latency is rarely noticeable to users.
There's a bigger risk here, though. If a 20 millisecond packet gets lost on the network, the receiver only loses 20 milliseconds of audio. If a 40 millisecond packet is lost, the audio dropout is larger. We can go even bigger: we can send 60 millisecond packets, or 80, or even 100 millisecond packets. Consider if we sent 100 millisecond packets: the payload size for 100 milliseconds would now be 50 bytes (10 bytes for 20 milliseconds * 5 ... because 100 / 20 = 5), and we'd only be sending 10 packets. Therefore, in a second, our bandwidth consumption would be 500 payload bytes (as always) plus 540 bytes of taxes. A total of 1,040 bytes in a second vs 3,200!
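The arithmetic above follows one simple formula, so it can be sketched in a few lines of Python. The 54-byte overhead and 10-bytes-per-20-ms payload figures come from the text; the function name and structure are illustrative, not part of any real library:

```python
# Per-second bandwidth for different packet framing sizes.
# Figures from the text: 54 bytes of per-packet overhead (unencrypted)
# and an audio payload of 10 bytes per 20 ms of audio.
OVERHEAD_BYTES = 54
PAYLOAD_BYTES_PER_20MS = 10

def bandwidth_per_second(frame_ms: int) -> tuple[int, int, int]:
    """Return (payload_bytes, overhead_bytes, total_bytes) sent per second."""
    packets_per_second = 1000 // frame_ms
    payload_per_packet = PAYLOAD_BYTES_PER_20MS * (frame_ms // 20)
    payload = payload_per_packet * packets_per_second
    overhead = OVERHEAD_BYTES * packets_per_second
    return payload, overhead, payload + overhead

for frame_ms in (20, 40, 100):
    payload, overhead, total = bandwidth_per_second(frame_ms)
    print(f"{frame_ms} ms framing: {payload} payload + {overhead} overhead = {total} bytes/sec")
```

Running this reproduces the numbers above: 3,200 bytes/sec at 20 ms framing, 1,850 at 40 ms, and 1,040 at 100 ms. The payload total stays fixed at 500 bytes; only the per-packet tax changes.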
All of this is equally relevant if our streams are encrypted—but with extras added on for the encryption initialization vector and padding. Basically, we can treat encryption as simply another, albeit optional, tax.
Important: As these numbers get bigger and networks drop or delay packets, audio quality begins to suffer. So we need to be very careful about which numbers we choose.
The considerations above ultimately bring us to packet framing, which refers to the size of the audio payload that we transmit. You'll see in the bandwidth utilization tables below how sizing changes bandwidth consumption. These numbers are as accurate as possible but, as with anything technical, there's always some wiggle room to account for.
For instance, CODECs such as G.711 and GSM always produce a reliably-sized output. Other CODECs such as AMR and Opus are variable in size, meaning that they actively try to reduce the amount of traffic based on how much the sender is saying, their volume level, the complexity of their voice, and so on. With these so-called VBR (Variable Bit Rate) CODECs you may sometimes see lower output sizes than listed in the tables, but you should rarely see higher values. It's best, though, to add a "fudge factor" of 1 Kbps or so to the numbers used for planning purposes.
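For planning, the per-second byte counts above need converting into Kbps with that fudge factor applied. A minimal sketch, assuming the common 1000-bits-per-kilobit convention (some tools use 1024) and a hypothetical helper name:

```python
# Convert a per-second byte count into Kbps for capacity planning,
# adding the ~1 Kbps "fudge factor" suggested for VBR codecs.
def planning_kbps(bytes_per_second: float, fudge_kbps: float = 1.0) -> float:
    return (bytes_per_second * 8) / 1000 + fudge_kbps

# 20 ms framing from the example above: 3,200 bytes/sec on the wire,
# i.e. 25.6 Kbps, plus the 1 Kbps fudge factor for planning.
print(planning_kbps(3200))
```

The same call with 1,040 bytes/sec (100 ms framing) yields roughly 9.3 Kbps, which shows how much headroom larger framing buys on a constrained link.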