This module discusses some basic computer (science) background, with the main goal of explaining the difference between binary and text files. Knowing this distinction is important for command line and Git usage.

A computer is a collection of devices whose main component is the CPU, the Central Processing Unit. A CPU consists of billions of transistors organized such that it can perform logic and arithmetic operations. A transistor is essentially an electronic switch with three terminals: if the voltage on one terminal (the control input) is high, current can flow between the other two terminals.

This basic operation of switching between high and low voltage is the foundation of computers. We denote the voltage level with a bit: 1 for high voltage and 0 for low voltage. The two values are also written as true and false, usually in the context of logic operations.

Grouping bits allows counting in the binary numeral system:

number  bits
0       0
1       1
2       10
3       11
4       100
5       101
etc.
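
The table above can be reproduced with a few lines of Python. This is only a small sketch; the helper name to_bits is chosen here for illustration and is not part of any library.

```python
def to_bits(n):
    """Return the binary digits of a non-negative integer as a string."""
    # format(n, "b") is Python's built-in way to write an integer in base 2.
    return format(n, "b")

# Print the numbers 0 to 5 next to their binary representation.
for number in range(6):
    print(number, to_bits(number))
```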

In the decimal numeral system we have 10 commonly used symbols ‘0’, ‘1’, ‘2’, …, ‘9’ called digits. We want to stress that we only mean the symbol here without attributing a value to it. To represent the value 1 we use the digit ‘1’, for representing 9 we use the digit ‘9’, for representing 10 we use a sequence of two digits ‘1’ and ‘0’, and for representing the value 345, we use a sequence of three digits ‘3’, ‘4’, and ‘5’. The representation “345” as the number 345 actually means 3 * 10^2 + 4 * 10^1 + 5 * 10^0, and with three digits we can represent 10^3 values.
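
The positional rule can be written out as a short Python function. This is a minimal sketch: the name positional_value and its base parameter are made up for this illustration, and the function only handles the digit symbols ‘0’ to ‘9’.

```python
def positional_value(digits, base=10):
    """Sum digit * base**position, where position 0 is the rightmost digit."""
    total = 0
    for position, symbol in enumerate(reversed(digits)):
        total += int(symbol) * base ** position
    return total

# 3 * 10**2 + 4 * 10**1 + 5 * 10**0
print(positional_value("345"))

# Number of different values that three decimal digits can represent.
print(10 ** 3)
```

The same function works for other bases up to 10, for example positional_value("101010", base=2).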

In the binary numeral system, we only need two digits, namely ‘1’ and ‘0’, and the binary number “101010” represents 1 * 2^5 + 0 * 2^4 + 1 * 2^3 + 0 * 2^2 + 1 * 2^1 + 0 * 2^0, which is 42. With 6 binary digits, or bits for short, we can represent 2^6 = 64 values.
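
Python can do this conversion directly, which is a convenient way to check the arithmetic. The variable name value below is only for illustration.

```python
# Interpret the string "101010" as a base-2 number.
value = int("101010", 2)
print(value)

# Convert the value back to its binary digits.
print(format(value, "b"))

# Number of different values that 6 bits can represent.
print(2 ** 6)
```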

We often group 8 bits together in what is called a byte. A byte can then represent 2^8 = 256 values. We can conveniently denote the contents of a byte in the hexadecimal numeral system, which uses 16 symbols to denote a value: the 10 digits and the symbols ‘a’, ‘b’, …, ‘f’ or ‘A’, ‘B’, …, ‘F’ for the values 10 to 15. Since 16 * 16 is 256, we only need two hexadecimal symbols to represent a byte. To indicate that a number is hexadecimal, we prefix it with “0x”, so “0x2A” is 2 * 16^1 + 10 * 16^0 = 42. Similarly, the prefix “0b” indicates a binary number, so “0b101010” is a binary number representing 42. Numbers without a prefix can be assumed to be decimal.
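
Python happens to use the same “0x” and “0b” prefixes for its number literals, so the notation can be tried out directly. A short sketch (the name largest_byte is just for illustration):

```python
# Three notations for the same value.
print(0x2A, 0b101010, 42)

# Convert a decimal number to hexadecimal and binary notation.
print(hex(42))   # 0x2a
print(bin(42))   # 0b101010

# Two hexadecimal symbols cover every value a byte can hold (0 to 255).
largest_byte = 0xFF
print(largest_byte)
```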

Besides representing numbers, we would also like to represent text. For example, we could decide that 0 stands for the letter ‘a’, 1 for the letter ‘b’, etc. Such a mapping defines a so-called character encoding. Luckily, there is a commonly used standard for this, the American Standard Code for Information Interchange, or ASCII, which uses 7 bits to represent 128 characters, including control characters. A common and compatible extension is Unicode, which allows many more than 128 characters.
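
To see a character encoding in action, the following sketch uses Python's built-in ord, chr, and str.encode. The example strings are arbitrary, and UTF-8 is used here as one common way of storing Unicode text as bytes.

```python
# ASCII assigns the letter 'a' the number 97.
print(ord("a"))
print(chr(97))

# Encoding a string turns each character into one or more bytes.
ascii_bytes = "Git".encode("ascii")
print(list(ascii_bytes))

# A character outside ASCII needs more than one byte in UTF-8.
utf8_bytes = "é".encode("utf-8")
print(list(utf8_bytes))

# For plain ASCII text, UTF-8 produces exactly the same bytes.
print("Git".encode("utf-8") == ascii_bytes)
```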

Binary vs. text files

You can make a broad distinction between two kinds of files on a computer: binary files and text files. Text files are files that contain only characters, encoded as ASCII or Unicode. Examples are .txt, .html, and .md files, and most files that contain source code for programming, such as .py, .c, or .java. In short, a text file contains bytes such that every byte (or sequence of bytes) can be interpreted as a character.

Binary files, however, are files that contain bytes that cannot necessarily be interpreted as characters. Although binary files may contain characters (and most of them do), they can also contain all kinds of sequences of bytes that cannot be translated back to characters. Typically these files can only be interpreted by programs that understand how to interpret the sequence of bytes. Examples are .doc, .xls, .exe.
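
The distinction can be illustrated with a rough check on a sequence of bytes. This is only a heuristic sketch under stated assumptions: the function name looks_like_text is hypothetical, it assumes text is stored as UTF-8, and real tools may use different rules to decide whether a file is binary.

```python
def looks_like_text(data):
    """Heuristic: True if the bytes decode as UTF-8 and contain no NUL byte."""
    # NUL bytes are rare in text but common in binary formats.
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Some byte sequence could not be translated to characters.
        return False
    return True

# Bytes of a small Python source file: every byte maps to a character.
source_bytes = "print('hello')\n".encode("utf-8")
print(looks_like_text(source_bytes))

# Arbitrary bytes that do not form valid UTF-8 text.
other_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])
print(looks_like_text(other_bytes))
```

To apply the check to a real file, read it in binary mode, for example with open(path, "rb").read(), where path is the file you want to inspect.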

Command line tools are especially useful for manipulating text and text files. In the next module we will look at command line tools and text files.