## Data compression

 Quick links 3.3.1 Number bases 3.3.2 Converting number bases 3.3.3 Units of information 3.3.4 Binary arithmetic 3.3.5 Character encoding 3.3.6 Representing images 3.3.7 Representing sound 3.3.8 Data compression

### Syllabus content

Explain what data compression is. Understand why data may be compressed and that there are different ways to compress data.

Students should understand that it is common for data to be compressed and should be able to explain why it may be necessary or desirable to compress data.

Explain how data can be compressed using Huffman coding. Be able to interpret Huffman trees.

Students should be familiar with the process of using a tree to represent the Huffman code.

Students should be able to interpret a given Huffman tree to determine the code used for a particular node within the tree.

Be able to calculate the number of bits required to store a piece of data compressed using Huffman coding.

Be able to calculate the number of bits required to store a piece of uncompressed data in ASCII.

Students should be familiar with carrying out calculations to determine the number of bits saved by compressing a piece of data using Huffman coding.

Explain how data can be compressed using run length encoding (RLE).

Students should be familiar with the process of using frequency/data pairs to reduce the amount of data stored.

Represent data in RLE frequency/data pairs.

Students could be given a bitmap representation and they would be expected to show the frequency and value pairs for each row,

eg 0000011100000011

would become 5 0 3 1 6 0 2 1.

## Starter 1

This is monochrome image (2-bit) Calculate the file size using the information given in the previous work.

This is a colour (3-bit image) (There are no marks for comments on the quality of the image please!) Calculate the file size using the information given in the previous work.

How many different colours can a 3-bit image have?

## Explanation

### Digital audio

This is the graphic representation ofsome digital music stored on a PC. in fact this is the graphic representation of part of the tune shown under the image. Digital audio is collected from the real world by sampling the sound periodically (but very often) and converting the sound at that time into a number. If this is done often enough and quickly enough then the stored sound will be a reasonable representation of the actual sound.

We hear sound as our ears process the changes in air pressure. The sound that made the air pressure change can also be recorded via a mircophone, via an Analogue to Digital converter (ADC) The sound is sampled by the computer.

#### Digital audio quality

Factors that affect the quality of digital audio include:

• sample rate - the number of audio samples captured every second
• bit depth - the number of bits available for each clip
• bit rate - the number of bits used per second of audio

#### Sample rate

The sample rate is how many samples, or measurements, of the sound are taken each second. The more samples that are taken, the more detail about where the waves rise and fall is recorded and the higher the quality of the audio. Also, the shape of the sound wave is captured more accurately.

Each sample represents the amplitude of the digital signal at a specific point in time. The amplitude is stored as either an integer or afloating point number and encoded as a binary number. A common audio sample rate for music is 44,100 samples per second. The unit for the sample rate is hertz (Hz). 44,100 samples per second is 44,100 hertz or 44.1 kilohertz (kHz).

Telephone networks and VOIP services can use a sample rate as low as 8 kHz. This uses less data to represent the audio. At 8 kHz, the human voice can still be heard clearly - but music at this sample rate would sound low quality. #### Bit depth

Bit depth is the number of bits available for each sample. The higher the bit depth, the higher the quality of the audio. Bit depth is usually 16 bits on a CD and 24 bits on a DVD.

A bit depth of 16 has a resolution of 65,536 possible values, but a bit depth of 24 has over 16 million possible values.

16-bit resolution means each sample can be any binary value between 0000 0000 0000 0000 and 1111 1111 1111 1111. 32,768 + 16,384 + 8192 + 4096 + 2048 + 1024 + 512 + 256 + 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 65,536

24 bit means the maximum binary number is 1111 1111 1111 1111 1111 1111 which creates 16,777,215 possible values.

When an audio file is created it has to be encoded as a particular file type. Uncompressed audio files are made when high-quality recordings are created. High-quality audio will be created as a PCM and stored in a file format such as WAV or AIFF.

#### Bit rate

The bit rate of a file tells us how many bits of data are processed every second. Bit rates are usually measured in kilobits per second (kbps).

#### Calculating bit rate

The bit rate is calculated using the formula:

Frequency × bit depth × channels = bit rate

A typical, uncompressed high-quality audio file has a sample rate of 44,100 samples per second, a bit depth of 16 bits per sample and 2 channels of stereo audio. The bit rate for this file would be:

44,100 samples per second × 16 bits per sample × 2 channels = 1,411,200 bits per second (or 1,411.2 kbps)

A four-minute (240 second) song at this bit rate would create a file size of:

14,411,200 × 240 = 338,688,000 bits (or 40.37 megabytes)

#### Compression

Compression is a useful tool for reducing file sizes. When images, sounds or videos are compressed, data is removed to reduce the file size. This is very helpful when streaming and downloading files.

Streamed music and downloadable files, such as MP3s, are usually between 128 kbps and 320 kbps - much lower than the 1,411 kbps of an uncompressed file.

Videos are also compressed when they are streamed over a network. Streaming HD video requires a high-speed internet connection. Without it, the user would experience buffering and regular drops in quality. HD video is usually around 3 mbps. SD is around 1,500 kbps.

Compression can be lossy or lossless.

Lossless compression means that as the file size is compressed, the audio quality remains the same - it does not get worse. Also, the file can be restored back to its original state. FLAC and ALAC areopen source lossless compression formats. Lossless compression can reduce file sizes by up to 50% without losing quality.

Lossy compression permanently removes data. For example, a WAV file compressed to an MP3 would be lossy compression. The bit rate could be set at 64 kbps, which would reduce the size and quality of the file. However, it would not be possible to recreate a 1,411 kbps quality file from a 64 kbps MP3.

With lossy compression, the original bit depth is reduced to remove data and reduce the file size. The bit depth becomes variable.

MP3 and AAC are lossy compressed audio file formats widely supported on different platforms. MP3 and AAC are both patented codecs. Ogg Vorbis is an open source alternative for lossy compression.

Not all audio file formats will work on all media players.

#### Digital video

A digital film is created from a series of static images played at a high speed. Digital films are usually around 24 frames per second but can be anything up to around 100 frames per second or more.

Films have a frame rate per second (fps). This is similar tosample rate. HD film is normally 50 or 60 fps. This can also be measured in frequency (Hz). TV and computer screens have a specification in Hz to indicate the frame rate they support.

Digital films also have a bit rate that accounts for the total audio and image data processed every second.

# Video compression

Videos are compressed in order to:

• reduce the resolution
• reduce the dimensions
• reduce the bit rate

Data lost during the compression process can cause poor picture quality or even random coloured blocks that appear and disappear on the screen. These blocks are called artefacts.

Examples of popular lossy video file formats include MP4 and MOV. Video file formats use codecs to carry out compression algorithms on the video's picture and audio data.

# Codecs and compression algorithms

Codecs are programs that encode data as usable files, whetherimagesaudio or video. Compression codecs are designed to remove data without losing quality (where possible). Algorithms work out what data can be removed and reduce file size.

# Run length encoding (RLE)

One of the simplest examples of compression is RLE. RLE is a basic form of data compression that converts consecutive identical values into a code consisting of the character and the number marking the length of the run. The more similar values there are, the more values can be compressed. The sequence of data is stored as a single value and count.

For example, for a minute of a scene filmed at a beach there would be similar colours on screen for the duration of the shot, such as the blues of the sky and the sea, and the yellows of the sand. Each frame would contain similar data so the file doesn't need to record all the colours each time. Compression software understands that it's seeing the same colours over and over again so it can recycle some of the data it has captured before, rather than storing every detail of every frame.

## Exercise

Use this link and explain in your book the difference between lossless and lossy compression in as few words as possible.

## From ASCII Coding to Huffman Coding

Many programming languages use ASCII coding for characters (ASCII stands for American Standard Code for Information Interchange). Some recent languages, e.g., Java, use UNICODE which, because it can encode a bigger set of characters, is more useful for languages like Japanese and Chinese which have a larger set of characters than are used in English.

We'll use ASCII encoding of characters as an example. In ASCII, every character is encoded with the same number of bits: 8 bits per character. Since there are 256 different values that can be encoded with 8 bits, there are potentially 256 different characters in the ASCII character set -- note that 28 = 256. The common characters, e.g., alphanumeric characters, punctuation, control characters, etc., use only 7 bits; there are 128 different characters that can be encoded with 7 bits. In C++ for example, the type char is divided into subtypes unsigned-char and (the default signed) char. As we'll see, Huffman coding compresses data by using fewer bits to encode more frequently occurring characters so that not all characters are encoded with 8 bits. In Java there are no unsigned types and char values use 16 bits (Unicode compared to ASCII). Substantial compression results regardless of the character-encoding used by a language or platform.

### A Simple Coding Example

We'll look at how the string "go go gophers" is encoded in ASCII, how we might save bits using a simpler coding scheme, and how Huffman coding is used to compress the data resulting in still more savings.

With an ASCII encoding (8 bits per character) the 13 character string "go go gophers" requires 104 bits. The table below on the left shows how the coding works.

coding a message
ASCII coding   3-bit coding
char ASCII binary
g 103 1100111
o 111 1101111
p 112 1110000
h 104 1101000
e 101 1100101
r 114 1110010
s 115 <1110011
space 32 1000000
char code binary
g 0 <000
o 1 001
p 2 010
h 3 011
e 4 100
r 5 101
s 6 110
space 7 111

The string "go go gophers" would be written (coded numerically) as 103 111 32 103 111 32 103 111 112 104 101 114 115. Although not easily readable by humans, this would be written as the following stream of bits (the spaces would not be written, just the 0's and 1's)

1100111 1101111 1100000 1100111 1101111 1000000 1100111 1101111 1110000 1101000 1100101 1110010 1110011

Since there are only eight different characters in "go go gophers", it's possible to use only 3 bits to encode the different characters. We might, for example, use the encoding in the table on the right above, though other 3-bit encodings are possible.

Now the string "go go gophers" would be encoded as 0 1 7 0 1 7 0 1 2 3 4 5 6 or, as bits:

000 001 111 000 001 111 000 001 010 011 100 101 110 111

By using three bits per character, the string "go go gophers" uses a total of 39 bits instead of 104 bits. More bits can be saved if we use fewer than three bits to encode characters like g, o, and space that occur frequently and more than three bits to encode characters like e, p, h, r, and s that occur less frequently in "go go gophers". This is the basic idea behind Huffman coding: to use fewer bits for more frequently occurring characters. We'll see how this is done using a tree that stores characters at the leaves, and whose root-to-leaf paths provide the bit sequence used to encode the characters.

### Towards a Coding Tree

 Using a tree (actually a binary trie, more on that later) all characters are stored at the leaves of a complete tree. In the diagram to the right, the tree has eight levels meaning that the root-to-leaf path always has seven edges. A left-edge (black in the diagram) is numbered 0, a right-edge (blue in the diagram) is numbered 1. The ASCII code for any character/leaf is obtained by following the root-to-leaf path and catening the 0's and 1's. For example, the character 'a', which has ASCII value 97 (1100001 in binary), is shown with root-to-leaf path of right-right-left-left-left-left-right. The structure of the tree can be used to determine the coding of any leaf by using the 0/1 edge convention described. If we use a different tree, we get a different coding. As an example, the tree below on the right yields the coding shown on the left.

<
char binary
'g' 10
'o' 11
'p' 0100
'h' 0101
'e' 0110
'r' 0111
's' 000
'  ' 001 Using this coding, "go go gophers" is encoded (spaces wouldn't appear in the bitstream) as:

10 11 001 10 11 001 10 11 0100 0101 0110 0111 000

This is a total of 37 bits, which saves two bits from the encoding in which each of the 8 characters has a 3-bit encoding that is shown above! The bits are saved by coding frequently occurring characters like 'g' and 'o' with fewer bits (here two bits) than characters that occur less frequently like 'p', 'h', 'e', and 'r'.

The character-encoding induced by the tree can be used to decode a stream of bits as well as encode a string into a stream of bits. You can try to decode the following bitstream; the answer with an explanation follows:

01010110011100100001000101011001110110001101101100000010101 011001110110

To decode the stream, start at the root of the encoding tree, and follow a left-branch for a 0, a right branch for a 1. When you reach a leaf, write the character stored at the leaf, and start again at the top of the tree. To start, the bits are 010101100111. This yieldsleft-right-left-right to the letter 'h', followed (starting again at the root) with left-right-right-left to the letter 'e', followed by left-right-right-right to the letter 'r'. Continuing until all the bits are processed yields

her sphere goes here

### Prefix codes and Huffman Codes

When all characters are stored in leaves, and every interior/(non-leaf) node has two children, the coding induced by the 0/1 convention outlined above has what is called the prefix property: no bit-sequence encoding of a character is the prefix of any other bit-sequence encoding. This makes it possible to decode a bitstream using the coding tree by following root-to-leaf paths. The tree shown above for "go go gophers" is an optimal tree: there are no other trees with the same characters that use fewer bits to encode the string "go go gophers". There are other trees that use 37 bits; for example you can simply swap any sibling nodes and get a different encoding that uses the same number of bits. We need an algorithm for constructing an optimal tree which in turn yields a minimal per-character encoding/compression. This algorithm is called Huffman coding, and was invented by D. Huffman in 1952. It is an example of a greedy algorithm.

## Huffman Coding

We'll use Huffman's algorithm to construct a tree that is used for data compression. In the previous section we saw examples of how a stream of bits can be generated from an encoding, e.g., how "go go gophers" was written as 1011001101100110110100010101100111000. We also saw how the tree can be used to decode a stream of bits. We'll discuss how to construct the tree here.

We'll assume that each character has an associated weight equal to the number of times the character occurs in a file, for example. In the "go go gophers" example, the characters 'g' and 'o' have weight 3, the space has weight 2, and the other characters have weight 1. When compressing a file we'll need to calculate these weights, we'll ignore this step for now and assume that all character weights have been calculated. Huffman's algorithm assumes that we're building a single tree from a group (or forest) of trees. Initially, all the trees have a single node with a character and the character's weight. Trees are combined by picking two trees, and making a new tree from the two trees. This decreases the number of trees by one at each step since two trees are combined into one tree. The algorithm is as follows:

1. Begin with a forest of trees. All trees are one node, with the weight of the tree equal to the weight of the character in the node. Characters that occur most frequently have the highest weights. Characters that occur least frequently have the smallest weights.
2. Repeat this step until there is only one tree:

Choose two trees with the smallest weights, call these trees T1 and T2. Create a new tree whose root has a weight equal to the sum of the weights T1 + T2 and whose left subtree is T1 and whose right subtree is T2.
3. The single tree left after the previous step is an optimal encoding tree.

We'll use the string "go go gophers" as an example. Initially we have the forest shown below. The nodes are shown with a weight/count that represents the number of times the node's character occurs. We pick two minimal nodes. There are five nodes with the minimal weight of one, it doesn't matter which two we pick. In a program, the deterministic aspects of the program will dictate which two are chosen, e.g., the first two in an array, or the elements returned by a priority queue implementation. We create a new tree whose root is weighted by the sum of the weights chosen. We now have a forest of seven trees as shown here: Choosing two minimal trees yields another tree with weight two as shown below. There are now six trees in the forest of trees that will eventually build an encoding tree. Again we must choose the two trees of minimal weight. The lowest weight is the 'e'-node/tree with weight equal to one. There are three trees with weight two, we can choose any of these to create a new tree whose weight will be three. Now there are two trees with weight equal to two. These are joined into a new tree whose weight is four. There are four trees left, one whose weight is four and three with a weight of three. Two minimal (three weight) trees are joined into a tree whose weight is six. In the diagram below we choose the 'g' and 'o' trees (we could have chosen the 'g' tree and the space-'e' tree or the 'o' tree and the space-'e' tree.) There are three trees left. The minimal trees have weights of three and four, these are joined into a tree whose weight is seven leaving two trees. Finally, the last two trees are joined into a final tree whose weight is thirteen, the sum of the two weights six and seven. Note that this tree is different from the tree we used to illustrate Huffman coding above, and the bit patterns for each character are different, but the total number of bits used to encode "go go gophers" is the same. The character encoding induced by the last tree is shown below where again, 0 is used for left edges and 1 for right edges.

char binary
'g' 00
'o' 01
'p' 1110
'h' 1101
'e' 101
'r' 1111
's' 1100
'  ' 100

The string "go go gophers" would be encoded as shown (with spaces used for easier reading, the spaces wouldn't appear in the real encoding).

00 01 100 00 01 100 00 01 1110 1101 101 1111 1100

Once again, 37 bits are used to encode "go go gophers". There are several trees that yield an optimal 37-bit encoding of "go go gophers". The tree that actually results from a programmed implementation of Huffman's algorithm will be the same each time the program is run for the same weights (assuming no randomness is used in creating the tree).

### Why is Huffman Coding Greedy?

Huffman's algorithm is an example of a greedy algorithm. It's called greedy because the two smallest nodes are chosen at each step, and this local decision results in a globally optimal encoding tree. In general, greedy algorithms use small-grained, or local minimal/maximal choices to result in a global minimum/maximum. Making change using U.S. money is another example of a greedy algorithm.

• Problem: give change in U.S. coins for any amount (say under \$1.00) using the minimal number of coins.

• Solution (assuming coin denominations of \$0.25, \$0.10, \$0.05, and \$0.01, called quarters, dimes, nickels, and pennies, respectively): use the highest-value coin that you can, and give as many of these as you can. Repeat the process until the correct change is given.

• Example: make change for \$0.91. Use 3 quarters (the highest coin we can use, and as many as we can use). This leaves \$0.16. To make change use a dime (leaving \$0.06), a nickel (leaving \$0.01), and a penny. The total change for \$0.91 is three quarters, a dime, a nickel, and a penny. This is a total of six coins, it is not possible to make change for \$0.91 using fewer coins.

The solution/algorithm is greedy because the largest denomination coin is chosen to use at each step, and as many are used as possible. This locally optimal step leads to a globally optimal solution. Note that the algorithm does not work with different denominations. For example, if there are no nickels, the algorithm will make change for \$0.31 using one quarter and six pennies, a total of seven coins. However, it's possible to use three dimes and one penny, a total of four coins. This shows that greedy algorithms are not always optimal algorithms.

## Video tutorial

Provided you overlook the sniffing this is a good explanation of Huffman code.

## Exercise

1. Use the Huffman algorithm to encode "an aardvark ate alan's apple" as efficiently as possible.

1. show that extended ASCII would encode this as 224 characters.
2. calculate the compression by using Huffman code.

2. Here is the tree, generate the code for each character. Then decode this message. 3. Show that this becomes this 4. Construct the code table from this tree. 5. Here is the frequency table...

 space 6 n 4 i 3 a 3 e 3 f 3 h 2 s 2 m 2 o 2 t 1 x 1 p 1 l 1 r 1 u 1 c 1 d 1 g 1

and this is the tree Construct the code from the tree. Click on the tree to see it full size.

6. The Huffman encoding algorithm is an optimal compression algorithm when only the frequency of individual letters are used to compress the data. Here's an example of optimized Huffman coding using the French subject string "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs". Construct the code table from the diagram (which you should copy into your book). ```Just in case you're interested the translation of the French is "I like to go on the water's edge on Thursdays or odd days"
```

7. Here is the tree (you can click on the image to make it bigger) Decode this mesage from the BBC. (Perhaps you can use Excel to do the hard work.)

 111010 1100 101 111011 111001 1000 11011 11011 01 00101 101 00100 111100 101 1100 1001 0000 1000 0011 01 101 1001 111111 01 111000 101 01 0000 01 0011 111110 1000 1001 0001 101 01 111101 1100 01 0001 11010 01 11010

RLE

## 3.1 Fundamentals of algorithms ## 3.2 Programming ## 3.3 Fundamentals of data representation ## 3.4 Computer systems ## 3.5 Fundamentals of computer networks ## 3.6 Fundamentals of cyber security ## 3.7 Ethical, legal and environmental impacts of digital technology on wider society, including issues of privacy ## 3.8 Aspects of software development  