How To Add Custom Build Steps and Commands To setup.py

A setup.py script using distutils / setuptools is the standard way to package Python code. Often, however, we need to perform custom actions such as code generation, running tests, profiling, or building documentation, and we’d like to integrate these actions into setup.py. In other words, we’d like to add custom steps to setup.py build or setup.py install, or add a new command to setup.py altogether.

Let’s see how this is done.

Adding Custom setup.py Commands and Options

Let’s implement a custom command that runs Pylint on all Python files in our project. The high-level idea is:

  1. Implement the command as a subclass of distutils.cmd.Command;

  2. Add the newly defined command class to the cmdclass argument to setup().

To see this in action, let’s add the following to our setup.py:

import distutils.cmd
import distutils.log
import os
import setuptools
import subprocess


class PylintCommand(distutils.cmd.Command):
  """A custom command to run Pylint on all Python source files."""

  description = 'run Pylint on Python source files'
  user_options = [
      # The format is (long option, short option, description).
      ('pylint-rcfile=', None, 'path to Pylint config file'),
  ]

  def initialize_options(self):
    """Set default values for options."""
    # Each user option must be listed here with its default value.
    self.pylint_rcfile = ''

  def finalize_options(self):
    """Post-process options."""
    if self.pylint_rcfile:
      assert os.path.exists(self.pylint_rcfile), (
          'Pylint config file %s does not exist.' % self.pylint_rcfile)

  def run(self):
    """Run command."""
    command = ['/usr/bin/pylint']
    if self.pylint_rcfile:
      command.append('--rcfile=%s' % self.pylint_rcfile)
    command.append(os.getcwd())
    self.announce(
        'Running command: %s' % str(command),
        level=distutils.log.INFO)
    subprocess.check_call(command)


setuptools.setup(
    cmdclass={
        'pylint': PylintCommand,
    },
    # Usual setup() args.
    # ...
)

Now, running python setup.py --help-commands will show:

Standard commands:
  ...
Extra commands:
  pylint: run Pylint on Python source files
  ...

We can now run the command we just defined with:

$ python setup.py pylint

…or with a custom option:

$ python setup.py pylint --pylint-rcfile=.pylintrc

To learn more, you can check out documentation on inheriting from distutils.cmd.Command as well as the source code of some built-in commands, such as build_py.

Adding Custom Steps to setup.py build

Let’s say we are really paranoid about code style and we’d like to run Pylint as part of setup.py build. We can do this in the following manner:

  1. Create a subclass of setuptools.command.build_py.build_py (or distutils.command.build_py.build_py if using distutils) that invokes our new Pylint command;

  2. Add the newly defined command class to the cmdclass argument to setup().

For example, we can implement the following in our setup.py:

import setuptools.command.build_py


class BuildPyCommand(setuptools.command.build_py.build_py):
  """Custom build command."""

  def run(self):
    self.run_command('pylint')
    setuptools.command.build_py.build_py.run(self)


setuptools.setup(
    cmdclass={
        'pylint': PylintCommand,
        'build_py': BuildPyCommand,
    },
    # Usual setup() args.
    # ...
)

For more examples, I encourage you to check out the setuptools source code.
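
The same pattern extends to other commands. For example, here is a sketch (reusing the PylintCommand defined earlier, which must also be registered in cmdclass) that lints before setup.py install:

```python
import setuptools.command.install


class InstallCommand(setuptools.command.install.install):
  """Custom install command that lints before installing."""

  def run(self):
    # Chain to our custom command, then run the standard install steps.
    self.run_command('pylint')
    setuptools.command.install.install.run(self)
```

Registering it under the 'install' key of cmdclass, next to 'pylint', makes python setup.py install lint first.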

Python: Multiprocessing and Exceptions

Python’s multiprocessing module provides an interface for spawning and managing child processes that is familiar to users of the threading module. One problem with the multiprocessing module, however, is that exceptions raised in spawned child processes don’t print their stack traces.

Consider the following snippet:

import multiprocessing
import somelib

def f(x):
  return 1 / somelib.somefunc(x)

if __name__ == '__main__':
  with multiprocessing.Pool(5) as pool:
    print(pool.map(f, range(5)))

and the following error message:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(pool.map(f, range(5)))
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
    raise self._value
ZeroDivisionError: division by zero

What triggered the ZeroDivisionError? Did somelib.somefunc(x) return 0, or did some other computation in somelib.somefunc() cause the exception? You will notice that we only see the stack trace of the main process, whereas the stack trace of the code that actually triggered the exception in the worker processes is not shown at all.

Luckily, Python provides a handy traceback module for working with exceptions and stack traces. All we have to do is catch the exception inside the worker process, and print it. Let’s change the code above to read:

import multiprocessing
import traceback
import somelib

def f(x):
  try:
    return 1 / somelib.somefunc(x)
  except Exception as e:
    print('Caught exception in worker process (x = %d):' % x)

    # This prints the type, value, and stack trace of the
    # current exception being handled.
    traceback.print_exc()

    print()
    # A bare raise preserves the original traceback when re-raising.
    raise

if __name__ == '__main__':
  with multiprocessing.Pool(5) as pool:
    print(pool.map(f, range(5)))

Now, if you run the same code again, you will see something like this:

Caught exception in worker thread (x = 0):
Traceback (most recent call last):
  File "test.py", line 7, in f
    return 1 / somelib.somefunc(x)
  File "/path/to/somelib.py", line 2, in somefunc
    return 1 / x
ZeroDivisionError: division by zero

Traceback (most recent call last):
  File "test.py", line 16, in <module>
    print(pool.map(f, range(5)))
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
    raise self._value
ZeroDivisionError: division by zero

The printed traceback reveals somelib.somefunc() to be the actual culprit.

In practice, you may want to save the exception and the stack trace somewhere. For that, you can use the file argument of print_exc in combination with StringIO. For example:

import io  # Use the StringIO module in Python 2
import logging
import traceback
...

def Work(...):
  try:
    ...
  except Exception as e:
    exc_buffer = io.StringIO()
    traceback.print_exc(file=exc_buffer)
    logging.error(
        'Uncaught exception in worker process:\n%s',
        exc_buffer.getvalue())
    raise
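
If you find yourself wrapping many worker functions this way, the try/except boilerplate can be factored into a decorator. A minimal sketch (the name log_traceback is mine):

```python
import functools
import traceback


def log_traceback(func):
  """Print the worker's own traceback before re-raising."""
  @functools.wraps(func)
  def wrapper(*args, **kwargs):
    try:
      return func(*args, **kwargs)
    except Exception:
      # Print the traceback from inside the worker, where it is
      # still available, then let the pool propagate the exception.
      traceback.print_exc()
      raise
  return wrapper
```

Decorating the top-level function passed to pool.map then gives you worker-side tracebacks without touching the function body.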

Unicode I/O and Locales in Python

I recently ran into a weird error when running some Python code in a chroot jail.

s = '你好'
with open('/tmp/asdf', 'w') as f:
  f.write(s)

gave me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The same happened with interprocess I/O:

with subprocess.Popen(
    '/usr/bin/cat',
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True) as proc:
  (cmd_stdout, cmd_stderr) = proc.communicate('你好')

gave me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/subprocess.py", line 578, in check_output
    output, unused_err = process.communicate(timeout=timeout)
  File "/usr/lib/python3.3/subprocess.py", line 908, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/usr/lib/python3.3/subprocess.py", line 479, in _eintr_retry_call
    return func(*args)
  File "/usr/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

It turns out that Python str objects are encoded to/decoded from raw bytes during I/O (print, file I/O, IPC, etc.) using the default system locale encoding. The advantage is that, if your system locale is set up correctly, everything just works - there’s no explicit encoding/decoding between strings and bytes. The downside is that Python code that runs fine on one machine can fail mysteriously on another.
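
You can check which encoding Python will pick by default before the error bites:

```python
import locale
import sys

# The encoding used for text-mode open() when no encoding= is given.
print(locale.getpreferredencoding())

# The encoding attached to the standard streams.
print(sys.stdout.encoding)
```

Under LANG=C, both typically report an ASCII variant such as 'ANSI_X3.4-1968'.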

In my case, the chroot jail yielded:

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

Solution A

The simplest solution is to set the system locale, either just for the Python program or for your shell. For example,

# Run ./my_program.py with a custom LANG value.
LANG=en_US.UTF-8 ./my_program.py

or

# Set locale for current shell session.
export LANG=en_US.UTF-8
./my_program.py

In fact, it’s probably a good idea to add the export line to your ~/.bashrc, or to configure locales in whatever way your Linux distro prescribes.

Solution B

On the other hand, you can explicitly set the encoding used during I/O in your Python code.

For file I/O, in Python 3.x, you can set the encoding argument of open:

# Python 3.x
with open('/tmp/asdf', 'w', encoding='utf-8') as f:
  f.write('你好')

In Python 2.x, you can use codecs.open:

# Python 2.x
import codecs
with codecs.open('/tmp/asdf', 'w', encoding='utf-8') as f:
  f.write('你好')

Alternatively, you can use raw mode for file I/O:

with open('/tmp/asdf', 'wb') as f:
  f.write('你好'.encode('utf-8'))
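
When reading the data back, the same rule applies: stay in binary mode and decode explicitly, so the system locale never enters the picture. A round-trip sketch:

```python
# Write raw bytes, then read them back and decode explicitly.
with open('/tmp/asdf', 'wb') as f:
  f.write('你好'.encode('utf-8'))

with open('/tmp/asdf', 'rb') as f:
  s = f.read().decode('utf-8')
```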

For IPC with subprocess, you must not use universal_newlines=True, as that will always attempt to encode/decode using the system locale. Instead:

with subprocess.Popen(
    '/usr/bin/cat',
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE) as proc:
  (cmd_stdout_bytes, cmd_stderr_bytes) = proc.communicate('你好'.encode('utf-8'))
  (cmd_stdout, cmd_stderr) = (
      cmd_stdout_bytes.decode('utf-8'), cmd_stderr_bytes.decode('utf-8'))
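
As an aside, if you can rely on Python 3.6 or newer, Popen itself accepts an encoding argument, which keeps the convenience of text mode while pinning the codec explicitly. A sketch, using a small Python child process as a portable stand-in for cat:

```python
import subprocess
import sys

# A child that echoes stdin to stdout byte-for-byte, like cat,
# regardless of the child's own locale.
child = [sys.executable, '-c',
         'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']

with subprocess.Popen(
    child,
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    encoding='utf-8') as proc:
  (cmd_stdout, cmd_stderr) = proc.communicate('你好')
```

Here communicate() accepts and returns str, with UTF-8 used on both pipes no matter what the system locale says.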