Type Neutral Codec API¶
(Note: this article was substantially rewritten after some initial feedback from Armin Ronacher. As always, old versions are available on BitBucket)
One of the complaints with Python 3 is that it broke the old idiom for
many text-to-text and binary-to-binary transforms: the encode()
and
decode()
methods of 8-bit and Unicode string objects.
In Python 2, these methods were fairly thin shells around the type-neutral
codecs
module. Both 8-bit and Unicode strings had both methods and
the type of the return value was based on the specific encoding passed in.
In Python 3, these convenience methods have instead been incorporated
directly into the text model of the language. Text strings only have an
encode()
method, and that method can only be used with codecs that
produce bytes objects. Similarly bytes and bytearray objects only have a
decode()
method which can only be used with codecs that produce string
objects.
For example (Python 2.7):
>>> x = u'Hello World!'.encode("rot-13").encode("koi8-r").encode("bz2")
>>> x
'BZh91AY&SY]\xc2\xf0\xb7\x00\x00\x01\x97\x80`\x00\x00\x10\x02\x00\x12\x000 \x001\x06LA\x06\x98\x9a\x166$\x1et\xf1w$S\x85\t\x05\xdc/\x0bp'
>>> x.decode("bz2").decode("koi8-r").decode("rot-13")
u'Hello World!'
If you try the first or last step of that chain in Python 3, it fails:
>>> x = "Hello World!".encode("rot-13").encode("koi8-r").encode("bz2")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: encoder did not return a bytes object (type=str)
>>> x = "Hello World!".encode("koi8-r").encode("bz2")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
This means the old text-to-text and binary-to-binary transforms can only be
accessed via the type neutral codecs
module APIs. To make matters
even more annoying, the shorthand aliases for most of those codecs are
still missing, even though the codecs themselves were restored for Python
3.2.
There is a suggestion that this be replaced directly with a similarly
method-based transform
/untransform
API, but I’m now convinced that’s
a bad idea being considered only due to the precedent set by Python 2.
Instead, I believe it makes more sense to take a step back and consider a
fully type-neutral solution, just like the codecs
module itself.
The simple alternative I plan to propose is introducing a pair of top level functions in the codecs module that are type neutral alternatives to the type restricted str and bytes convenience functions. The semantics would be equivalent to these pure Python versions:
def encode(input, encoding, errors='strict'):
encoder = getencoder(encoding)
result, len_consumed = encoder(input, errors)
if len_consumed < len(input):
... # Copy str.encode behaviour for this case
return result
def decoder(input, encoding, errors='strict'):
decoder = getdecoder(encoding)
result, len_consumed = decoder(input, errors)
if len_consumed < len(input):
... # Copy bytes.decode behaviour for this case
return result
Getting Cute with Codec Pipelines¶
Armin assures me the following example isn’t all that useful in practice, but it was a fun exercise in exploring what is possible when working directly with the codecs API.
Below is a sketch of a simple CodecPipeline
that works on both Python 2
and Python 3. It accepts an arbitrary number of codec names as positional
parameters, as well as the error handling scheme as a keyword-only
parameter:
import codecs
class CodecPipeline(object):
"""Chains multiple codecs into a single encode/decode operation"""
def __init__(self, *names, **kwds):
self.default_errors = self._bind_kwds(**kwds)
encoders = []
decoders = []
self.codecs = names
for name in names:
info = self._lookup_codec(name)
encoders.append(info.encode)
decoders.append(info.decode)
self.encoders = encoders
decoders.reverse()
self.decoders = decoders
def _bind_kwds(self, errors=None):
if errors is None:
errors = "strict"
return errors
def _lookup_codec(self, name):
# Work around for http://bugs.python.org/issue15331 in 3.x
try:
return codecs.lookup(name)
except LookupError:
return codecs.lookup(name + "_codec")
def __repr__(self):
names = self.codecs
errors = self.default_errors
if not names:
return "{}(errors={!r})".format(type(self).__name__, errors)
return "{}({}, errors={!r})".format(type(self).__name__,
", ".join(map(repr, names)),
errors)
def encode(self, input, errors=None):
"""Apply all encoding operations in the pipeline"""
if errors is None:
errors = self.default_errors
result = input
for encode in self.encoders:
result, __ = encode(result, errors)
return result
def decode(self, input, errors=None):
"""Apply all decoding operations in the pipeline"""
if errors is None:
errors = self.default_errors
result = input
for decode in self.decoders:
result,__ = decode(result, errors)
return result
And using it in Python 2 looks like this:
>>> cp = CodecPipeline("rot-13", "koi8-r", "bz2")
>>> cp
CodecPipeline('rot-13', 'koi8-r', 'bz2', errors='strict')
>>> cp.encode(u'Hello World!')
'BZh91AY&SY]\xc2\xf0\xb7\x00\x00\x01\x97\x80`\x00\x00\x10\x02\x00\x12\x000 \x001\x06LA\x06\x98\x9a\x166$\x1et\xf1w$S\x85\t\x05\xdc/\x0bp'
>>> cp.decode(cp.encode(u'Hello World!'))
u'Hello World!'
Python 3 looks almost identical, aside from the lack of the u
prefix on
the string literals (and, in Python 3.3, such prefixes are once again legal
on the input front).
>>> cp = CodecPipeline.from_chain("rot-13", "koi8-r", "bz2")
>>> cp
CodecPipeline('rot-13', 'koi8-r', 'bz2', errors='strict')
>>> cp.encode('Hello World!')
'BZh91AY&SY]\xc2\xf0\xb7\x00\x00\x01\x97\x80`\x00\x00\x10\x02\x00\x12\x000 \x001\x06LA\x06\x98\x9a\x166$\x1et\xf1w$S\x85\t\x05\xdc/\x0bp'
>>> cp.decode(cp.encode(u'Hello World!'))
'Hello World!'