Each character is represented by (i.e., stored in the computer as) a number. For examples, see the ASCII chart: lowercase m is represented by 109; lowercase o is represented by 111; lowercase è (with a grave accent) is represented by 232. Letters in other alphabets (Russian, Greek, etc.) are represented by numbers up in the thousands.
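As a quick check of the numbers above, Python's built-in functions ord and chr convert between a character and the number that represents it. This is only a small sketch; the Russian letters are sample characters picked for illustration, not taken from the chart.

```python
# ord() gives the number that represents a character;
# chr() goes the other way, from the number back to the character.
numbers = {c: ord(c) for c in "moè"}
for c, n in numbers.items():
    print(c, n)

# Two Russian letters, chosen here only as examples of larger numbers.
for c in "ЖЯ":
    print(c, ord(c))

print(chr(109))   # the character represented by 109
```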
Information is transmitted across the Internet as a sequence of little numbers called bytes. A byte is capable of holding a small number such as 109, but a byte would be uncomfortable holding a larger number such as 232: in the encoding described next, a single byte on its own stands only for numbers up to 127. The solution is to break down the 232 into the pair of bytes 195 and 168. We therefore say that è is a multi-byte character, while plain old m is a single-byte character.
There’s a formula, called UTF-8, that tells you how to break down a big number into bytes, and conversely, how to recombine the bytes back into the original big number. The two bytes 195 and 168 were created by applying this formula. The function decode knows the formula, and can recombine the bytes 195 and 168 back into the original big number 232 that stands for è.
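To make the formula less mysterious, here is a rough sketch of the part of UTF-8 that handles numbers from 128 through 2047, which are split into exactly two bytes. The function names split_into_two_bytes and recombine_two_bytes are made up for this illustration; in practice you would let encode and decode do the work.

```python
def split_into_two_bytes(n):
    """Break a number in the range 128..2047 into two UTF-8 bytes."""
    if not 128 <= n <= 2047:
        raise ValueError("this sketch handles only two-byte characters")
    first = 0b11000000 | (n >> 6)            # marker bits 110, then the high 5 bits
    second = 0b10000000 | (n & 0b00111111)   # marker bits 10, then the low 6 bits
    return first, second

def recombine_two_bytes(first, second):
    """Recombine two UTF-8 bytes into the original number."""
    return ((first & 0b00011111) << 6) | (second & 0b00111111)

pair = split_into_two_bytes(232)
print(pair)                               # (195, 168)
print(recombine_two_bytes(*pair))         # 232

# Python's own encode and decode apply the same rule.
print(list("è".encode("utf-8")))          # [195, 168]
print(bytes([195, 168]).decode("utf-8"))  # è
```

Numbers below 128 travel as a single byte, and numbers above 2047 need three or four bytes; the sketch leaves those cases out.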
string of characters: | m | o | v | è | | d |
---|---|---|---|---|---|---|
each character represented by one number: | 109 | 111 | 118 | 232 | | 100 |
each number represented by one or more bytes: | 109 | 111 | 118 | 195 | 168 | 100 |
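The three rows of the table can be reproduced with a few lines of Python. This is a minimal sketch; the variable names word, numbers, and byte_values are chosen just for this example.

```python
word = "movèd"

numbers = [ord(c) for c in word]           # one number per character
byte_values = list(word.encode("utf-8"))   # one or more bytes per number

print(list(word))     # ['m', 'o', 'v', 'è', 'd']
print(numbers)        # [109, 111, 118, 232, 100]
print(byte_values)    # [109, 111, 118, 195, 168, 100]
```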
The input file http://oit2.scps.nyu.edu/~meretzkm/python/string/romeo.txt contains Romeo and Juliet I, i, 84–86.

The input file is downloaded from the Internet as a sequence of bytes, not as a string of characters.
In lines 31 and 35 of the Python script, the variable sequenceOfBytes is a sequence of bytes. In line 35, the variable s is a string of characters.
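The step from bytes to string can be tried without a network connection. The sketch below uses an in-memory stream (io.BytesIO) as a stand-in for the downloaded file, and it assumes that the script gets its bytes with read and its string with decode("utf-8"); the real script may differ in detail.

```python
import io

# Stand-in for the downloaded file: an in-memory stream of bytes.
# (An assumption for this sketch; the real script reads from the Internet.)
fileObject = io.BytesIO("movèd prince.\n".encode("utf-8"))

sequenceOfBytes = fileObject.read()      # a sequence of bytes
s = sequenceOfBytes.decode("utf-8")      # a string of characters

print(type(sequenceOfBytes))   # <class 'bytes'>
print(type(s))                 # <class 'str'>
print(s, end="")
```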
The output is in iambic pentameter, with 10 syllables per line.
    status = 200
    msg = OK
    Request fulfilled, document follows
    Date Sat, 14 Sep 2019 13:18:04 GMT
    Server Apache/2.4.10 (Fedora) mod_jk/1.2.40 PHP/5.5.26 mod_wsgi/3.5 Python/3.3.2
    Last-Modified Mon, 26 Jun 2017 11:06:58 GMT
    ETag "87-552daf312d880"
    Accept-Ranges bytes
    Content-Length 135
    Connection close
    Content-Type text/plain; charset=UTF-8
    On pain of torture, from those bloody hands
    Throw your mistempered weapons to the ground,
    And hear the sentence of your movèd prince.
Compare the sequenceOfBytes and the string of characters s. Insert the following code after creating the sequenceOfBytes.
    print(f"type(sequenceOfBytes) = {type(sequenceOfBytes)}")
    print(f"len(sequenceOfBytes) = {len(sequenceOfBytes)}")
    print(sequenceOfBytes)
    print()

Insert the following code immediately after creating the s.

    print(f"type(s) = {type(s)}")
    print(f"len(s) = {len(s)}")
print prints a sequence of bytes with a prefix b to remind you that you’re seeing a sequence of bytes instead of a string of characters. But other than that, print makes every effort to make the sequence of bytes look just like a string of characters. One by one, print prints the character that is represented by the value of each byte, when there is an ASCII character that is represented by the value of that byte. For example, lowercase m is represented by the byte 109. But no ASCII character is represented by the byte 195, so print prints the 195 and the following byte in hexadecimal (C3 and A8, shown as \xc3 and \xa8; see the escape sequences \x for hexadecimal and \n for newline). If the 195 and 168 were recombined into a single number, the single number would represent the character è, but print does not perform this recombination. The recombination is performed by decode.
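Here is a tiny experiment that shows this behavior of print on a hand-made sequence of four bytes. The variable name sample is just for this sketch.

```python
# Four bytes: lowercase m, the two bytes of è, and a newline.
sample = bytes([109, 195, 168, 10])

print(sample)                   # b'm\xc3\xa8\n'
print(sample.decode("utf-8"))   # mè, followed by the newline

# The same bytes, one at a time, in decimal and in hexadecimal.
for b in sample:
    print(b, hex(b))
```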
Note that the sequenceOfBytes contains 135 bytes, but the s contains only 134 characters. That’s because the character è is encoded as two bytes.
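One way to see where the extra byte comes from is to list the characters that need more than one byte. The sketch below works on the last line of the passage only, and the name multi is made up for this example.

```python
s = "And hear the sentence of your movèd prince.\n"
sequenceOfBytes = s.encode("utf-8")

# The characters that UTF-8 encodes as more than one byte.
multi = [c for c in s if len(c.encode("utf-8")) > 1]

print(f"len(s) = {len(s)}")
print(f"len(sequenceOfBytes) = {len(sequenceOfBytes)}")
print(f"multi-byte characters: {multi}")
```

Each two-byte character adds exactly one to the difference between the two lengths.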
    type(sequenceOfBytes) = <class 'bytes'>
    len(sequenceOfBytes) = 135
    b'On pain of torture, from those bloody hands\nThrow your mistempered weapons to the ground,\nAnd hear the sentence of your mov\xc3\xa8d prince.\n'

    type(s) = <class 'str'>
    len(s) = 134
    On pain of torture, from those bloody hands
    Throw your mistempered weapons to the ground,
    And hear the sentence of your movèd prince.
    for i, line in enumerate(s.splitlines(), start = 84):
        print(i, line)
    84 On pain of torture, from those bloody hands
    85 Throw your mistempered weapons to the ground,
    86 And hear the sentence of your movèd prince.
What is the type of fileObject? Here is the type, the base class of the type, the base class of the base class of the type, etc., all the way back to class object.
    print(f"type(fileObject) = {type(fileObject)}")
    print()

    for t in type(fileObject).mro():   #method resolution order
        print(t)                       #t is a type

    print()
    type(fileObject) = <class 'http.client.HTTPResponse'>

    <class 'http.client.HTTPResponse'>
    <class 'io.BufferedIOBase'>
    <class '_io._BufferedIOBase'>
    <class 'io.IOBase'>
    <class '_io._IOBase'>
    <class 'object'>
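The same walk through the base classes can be tried on any object, without downloading anything. The sketch below uses an in-memory stream of bytes as a stand-in; it is not an HTTPResponse, so its list of classes will be shorter than the one above.

```python
import io

# Stand-in object for this sketch: an in-memory stream of bytes.
stream = io.BytesIO(b"abc")

classes = type(stream).mro()   # method resolution order
for t in classes:
    print(t)                   # t is a type
```

Whatever the object, the list starts with the object's own type and ends with class object.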