Read a text file into one big sequence of bytes

Bibliography

  1. HOWTOs
    1. Fetch Internet Resources Using The urllib Package
    2. Unicode characters
  2. urllib.request.urlopen returns an http.client.HTTPResponse
  3. Conversion functions
    1. decode changes bytes to str
    2. encode changes str to bytes

Character codes vs. bytes

Each character is represented by (i.e., stored in the computer as) a number. For examples, see the ASCII chart: lowercase m is represented by 109; lowercase o is represented by 111; lowercase è (with a grave accent) is represented by 232. Letters in other alphabets (Russian, Greek, etc.) are represented by numbers up in the thousands.

Information is transmitted across the Internet as a sequence of little numbers called bytes. A byte is comfortable holding a small number such as 109, but a number of 128 or more, such as 232, is not sent as a single byte. The solution is to break down the 232 into the pair of bytes 195 and 168. We therefore say that è is a multi-byte character, while plain old m is a single-byte character.

There’s a formula, called UTF-8, that tells you how to break down a big number into bytes, and conversely, how to recombine the bytes back into the original big number. The two bytes 195 and 168 were created by applying this formula. The function decode knows the formula, and can recombine the bytes 195 and 168 back into the original big number 232 that stands for è.

string of characters: m o v è d
each character represented by one number: 109 111 118 232 100
each number represented by one or more bytes: 109 111 118 195 168 100
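
The three rows above can be reproduced in Python. The following is a minimal sketch, not part of the original script: ord gives the number that represents a character, encode applies UTF-8 to get the bytes, and decode reverses it. The last few lines spell out the two-byte case of the UTF-8 formula by hand, which applies only to numbers from 128 to 2047. The variable names word, codes, code, firstByte, secondByte and recombined are made up for this illustration.

```python
word = "movèd"

# Each character represented by one number.
codes = [ord(character) for character in word]
print(codes)                     # [109, 111, 118, 232, 100]

# Each number represented by one or more bytes, using UTF-8.
sequenceOfBytes = word.encode("utf-8")
print(list(sequenceOfBytes))     # [109, 111, 118, 195, 168, 100]

# decode recombines the bytes back into the original characters.
s = sequenceOfBytes.decode("utf-8")
print(s == word)                 # True

# The two-byte case of the formula, done by hand for è.
code = ord("è")                                    # 232
firstByte = 0b11000000 | (code >> 6)               # 192 + 3 = 195
secondByte = 0b10000000 | (code & 0b00111111)      # 128 + 40 = 168
print(firstByte, secondByte)     # 195 168

# Recombining the two bytes gives back the original number.
recombined = ((firstByte & 0b00011111) << 6) | (secondByte & 0b00111111)
print(recombined)                # 232
```

The first byte carries the upper bits of the number and the second byte carries the lower six bits, which is why neither byte alone stands for a character.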

Decode the sequence of bytes into a string of characters.

The input file
http://oit2.scps.nyu.edu/~meretzkm/python/string/romeo.txt
contains Romeo and Juliet I, i, 84–86. The input file is downloaded from the Internet as a sequence of bytes, not as a string of characters. In lines 31 and 35 of the Python script, the variable sequenceOfBytes is a sequence of bytes. In line 35, the variable s is a string of characters.

onebigsequenceofbytes.py
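
The script itself is not reproduced on this page. As a rough sketch of the same idea, and not the original script (so the line numbers cited above do not apply to it), the download and the decoding might be separated as shown below. The function names readText and downloadText are hypothetical, and the encoding is assumed to be UTF-8, as the Content-Type header in the output indicates. To keep the example runnable without a network connection, an io.BytesIO object stands in for the file-like object that urllib.request.urlopen returns.

```python
import io
import urllib.request


def readText(fileObject, encoding="utf-8"):
    """Read everything from a binary file-like object and decode it."""
    sequenceOfBytes = fileObject.read()     # one big sequence of bytes
    s = sequenceOfBytes.decode(encoding)    # a string of characters
    return s


def downloadText(url, encoding="utf-8"):
    """Download the resource at url and return it as a string of characters."""
    with urllib.request.urlopen(url) as fileObject:
        return readText(fileObject, encoding)


# Offline demonstration: io.BytesIO plays the role of the downloaded file.
fakeDownload = io.BytesIO(b"mov\xc3\xa8d prince.\n")
print(readText(fakeDownload), end="")       # movèd prince.
```

With a network connection, downloadText could be called with the URL of the input file in place of the offline demonstration.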

The output is in iambic pentameter, with 10 syllables per line.

status = 200
msg = OK
Request fulfilled, document follows

Date           Sat, 14 Sep 2019 13:18:04 GMT
Server         Apache/2.4.10 (Fedora) mod_jk/1.2.40 PHP/5.5.26 mod_wsgi/3.5 Python/3.3.2
Last-Modified  Mon, 26 Jun 2017 11:06:58 GMT
ETag           "87-552daf312d880"
Accept-Ranges  bytes
Content-Length 135
Connection     close
Content-Type   text/plain; charset=UTF-8

On pain of torture, from those bloody hands
Throw your mistempered weapons to the ground,
And hear the sentence of your movèd prince.

Things to try

  1. Let’s see the difference between the sequence of bytes sequenceOfBytes and the string of characters s. Insert the following code after creating the sequenceOfBytes.
    print(f"type(sequenceOfBytes) = {type(sequenceOfBytes)}")
    print(f"len(sequenceOfBytes) = {len(sequenceOfBytes)}")
    print(sequenceOfBytes)
    print()
    
    Insert the following code immediately after creating the s.
    print(f"type(s) = {type(s)}")
    print(f"len(s) = {len(s)}")
    

    print prints a sequence of bytes with a prefix b to remind you that you’re seeing a sequence of bytes instead of a string of characters. But other than that, print makes every effort to make the sequence of bytes look just like a string of characters. One by one, print prints the character that is represented by the value of each byte—when there is a character that is represented by the value of that byte. For example, lowercase m is represented by the byte 109.

    But no ASCII character is represented by the byte 195. print therefore prints the 195 and the following byte, 168, in hexadecimal (as \xc3 and \xa8; see the escape sequences \x for hexadecimal and \n for newline). If the 195 and 168 were recombined into a single number, the single number would represent the character è, but print does not perform this recombination. The recombination is performed by decode.

    Note that the sequenceOfBytes contains 135 bytes, but the s contains only 134 characters. That’s because the character è is encoded as two bytes.

    type(sequenceOfBytes) = <class 'bytes'>
    len(sequenceOfBytes) = 135
    b'On pain of torture, from those bloody hands\nThrow your mistempered weapons to the ground,\nAnd hear the sentence of your mov\xc3\xa8d prince.\n'
    
    type(s) = <class 'str'>
    len(s) = 134
    On pain of torture, from those bloody hands
    Throw your mistempered weapons to the ground,
    And hear the sentence of your movèd prince.
    
  2. Split the string into separate lines and print them one by one.
    for i, line in enumerate(s.splitlines(), start = 84):
        print(i, line)
    
    84 On pain of torture, from those bloody hands
    85 Throw your mistempered weapons to the ground,
    86 And hear the sentence of your movèd prince.
    
  3. What type of file-like object did we put into the variable fileObject? Here is the type, the base class of the type, the base class of the base class of the type, etc., all the way back down to class object.
    print(f"type(fileObject) = {type(fileObject)}")
    print()
    
    for t in type(fileObject).mro(): #method resolution order
        print(t)                     #t is a type
    
    print()
    
    type(fileObject) = <class 'http.client.HTTPResponse'>
    
    <class 'http.client.HTTPResponse'>
    <class 'io.BufferedIOBase'>
    <class '_io._BufferedIOBase'>
    <class 'io.IOBase'>
    <class '_io._IOBase'>
    <class 'object'>
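
The first two experiments can also be tried without downloading anything. The sketch below rebuilds the three lines of verse from the output shown earlier and encodes them as UTF-8 to stand in for the downloaded sequenceOfBytes. The names verse and numberedLines are made up for this illustration; the real script gets its bytes from the Internet instead.

```python
verse = (
    "On pain of torture, from those bloody hands\n"
    "Throw your mistempered weapons to the ground,\n"
    "And hear the sentence of your movèd prince.\n"
)

# Stand-in for the bytes that would arrive from the Internet.
sequenceOfBytes = verse.encode("utf-8")
s = sequenceOfBytes.decode("utf-8")

print(len(sequenceOfBytes))                  # 135 bytes
print(len(s))                                # 134 characters

# The one extra byte comes from the two-byte character è.
print(sequenceOfBytes.count(b"\xc3\xa8"))    # 1

# Number the lines starting at 84, as in the second experiment.
numberedLines = [f"{i} {line}" for i, line in enumerate(s.splitlines(), start=84)]
for numberedLine in numberedLines:
    print(numberedLine)
```

The byte count matches the Content-Length header of 135 in the output above.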