Thursday, 26 July 2012

A potential security vulnerability in logging.fileConfig()

Although now superseded by the dictConfig() API proposed in PEP 391, the much older fileConfig() API has been widely used to configure logging. It is also possible to do on-the-fly reconfiguration of logging through logging’s listen() API, which listens on a socket for configuration data in either the format understood by fileConfig(), or JSON format for use with dictConfig(). Although listen() only ever binds to localhost and so cannot be reached from a remote machine, it is possible in certain scenarios, if you use the listen() function, that code which you don’t control will be executed in the process which calls listen(). To see how, consider a shared hosting environment where multiple users, who have no connection with each other other than being hosted on the same server, run processes. Ordinarily these processes are isolated from each other, but if one user’s process uses listen(), another user’s process could connect to the listened-on socket and send arbitrary configuration data. Because fileConfig() calls eval() under the hood to convert from text elements in the configuration text to Python objects for use in the configuration operation, potentially arbitrary code could be sent in a logging configuration from a malicious user to a victim, perhaps resulting in undesirable consequences for the victim when that code is run by eval(). (Thanks are due to Graham Dumpleton for reporting this vulnerability.)

Why does fileConfig() use eval(), and how can the vulnerability be tackled? Although recent versions of Python provide the ast module to assist with processing Python source code, this module was not available when fileConfig() was written. The type of configuration data which needs to be evaluated can be more complex than a simple parser can cope with: for example, a configuration may refer to a handler’s class, whether it be a stdlib-included one such as handlers.WatchedFileHandler, or a user-defined or third-party handler such as mailinglogger.MailingLogger. The configuration can also specify the constructor arguments for a handler, which can also be fairly involved to parse in ad hoc fashion. For this reason, eval() was used.

Users have already been alerted to the vulnerability via an update to the documentation, but that is perhaps not enough. One way of addressing the vulnerability is to not use eval(), but some other evaluation mechanism which is more sandboxed. One suggestion which has been made is to use ast.literal_eval(), but this only copes with Python literals – it does no name lookup, and so can’t cope with expressions like handlers.WatchedFileHandler.

As an experiment, I put together a simple expression evaluator which uses the ast module to parse input text and probably provides the bare minimum evaluation of AST nodes to support common use cases:

This shows a simple evaluator together with a simple test harness which evaluates user-entered expressions in the context of the logging package:

However, limiting the scope of evaluation is not the complete answer, and perhaps not the correct answer. For example, a malicious user could still send a bogus configuration which, for example, just turns the verbosity of all loggers off, or configures a huge number of bogus loggers. This would certainly be allowed even by a limited-evaluation scheme if a user legitimately wanted to do so; but if a malicious user sends the exact same “legal” configuration, it is still a security exploit because the consequences may be undesirable to the victim.

But how is the listener to know whether or not the configuration is coming from a legitimate source (a client process controlled by the same user who is running the process which uses listen())  or a malicious user (a client process controlled by some other user)? The simplest answer would appear to be a shared secret: When listen() is called, it is passed a text passphrase, which is also known to legitimate clients. When handling a configuration request via the socket, the configuration is checked to see if it contains the passphrase. If it does, the request is processed; otherwise, it is ignored.

In the fileConfig() input data, the passphrase could be provided via a passphrase=secret entry in the [default] section. In the dictConfig() input data, the passphrase could be provided against the passphrase key in the dict which is passed to dictConfig(). The checking would be done in the request handler code before calling fileConfig() or dictConfig(). If the passphrase argument to the listen() call is None (the default, preserving the current behaviour) no passphrase checking would be done.

As always, comments are welcome. Although this vulnerability is limited to specific scenarios, it would be good to address it as soon as possible. As Python 3.3 has entered beta, and is therefore in feature freeze, it is likely that a fix will go into Python 3.4 – the vulnerability is not currently considered serious enough to warrant fixing in earlier versions.

Sunday, 29 April 2012

Introspecting the logger hierarchy

Brandon Rhodes’ post about his logging_tree project got me thinking about providing this sort of functionality in the logging package. Though introspection of the logger hierarchy has been mentioned a couple of times in newsgroup postings over the years, it has never actually been requested as a fleshed-out feature, and until Brandon’s post, it was below my radar. Now, thanks to him, I have given it some attention.

There are a number of things which need to be considered when providing some way of introspecting the logger hierarchy:

  • In many systems, the logging hierarchy doesn’t change once it’s configured. However, there are also systems where it does – logging provides the facility for on-the-fly reconfiguration without having to restart a Python process. Therefore, a general introspection facility must act as a snapshot – the state of the hierarchy at a particular time when the snapshot was requested.
  • The introspection facility must allow both built-in and user-defined Logger, Handler, Formatter, Filter instances to return useful information.
  • The data returned by the introspection facility should be, as far as practicable, immutable. Users should not expect to get direct access to individual logging elements (as there are already specific APIs for doing that).
  • The introspection facility should perhaps eschew any presentation of the hierarchy; this is likely to be very application-dependent, and it is debatable whether it is worth providing a “default” presentation. Example presentations would be a textual printout (such as logging_tree provides) or a depiction of the hierarchy in a browser-based or desktop GUI.

I’ve made an attempt at addressing the above points when implementing my own take on Brandon’s work. Although I’ve no definite plan to include this functionality in Python 3.3, I’m publishing this post in order to get feedback about the approach used, and also whether something like this belongs in the core package.

The basic API is provided by a single function, get_snapshot(), which returns a dictionary. The keys of that dictionary are logger names, with '' being the key for the root logger. The corresponding values are instances of namedtuple subclasses; they are snapshots of the loggers, handlers, formatters and filters.

Here’s how it might work. Running the code snippet

logging.getLogger('logtree.test')
logging.basicConfig(filename='logtree.log')
result = get_snapshot()
tw = textwrap.TextWrapper()
tw.subsequent_indent = ' ' * 4
for name, info in sorted(result.items()):
    print(tw.fill('%r -> %r' % (name, info)))
    print('-'*40)

would display something like

'' -> logging_RootLoggerInfo(class_='logging.RootLogger', level=30,
    propagate=1, children=(('logtree',
    logging_PlaceHolderInfo(class_='logging.PlaceHolder',
    children=(('logtree.test',
    logging_LoggerInfo(class_='logging.Logger', name='logtree.test',
    level=0, handlers=(), disabled=0, propagate=1, filters=(),
    children=())),))),), filters=(),
    handlers=(logging_FileHandlerInfo(class_='logging.FileHandler',
    encoding=None, level=0, delay=None,
    filename='C:\\temp\\logtree.log', mode='a', filters=(),
    formatter=logging_FormatterInfo(class_='logging.Formatter',
    style='%', datefmt=None,
    format='%(levelname)s:%(name)s:%(message)s')),))
----------------------------------------
'logtree' -> logging_PlaceHolderInfo(class_='logging.PlaceHolder',
    children=(('logtree.test',
    logging_LoggerInfo(class_='logging.Logger', name='logtree.test',
    level=0, handlers=(), disabled=0, propagate=1, filters=(),
    children=())),))
----------------------------------------
'logtree.test' -> logging_LoggerInfo(class_='logging.Logger',
    name='logtree.test', level=0, handlers=(), disabled=0,
    propagate=1, filters=(), children=())
----------------------------------------

Of course, this is very simplified formatting, and can be much improved on, but it gives the gist of what’s being returned. The various classes whose names end in Info are the namedtuple subclasses, and the corresponding instances represent particular logging objects. You can see that specific attributes are displayed; these can be customised quite flexibly.

Customising is done via a configuration dictionary – if you don’t supply one, a default is used. The full definition of get_snapshot() is:

def get_snapshot(config=None):
    return Snapper(config).snapshot()

If no configuration dictionary is specified, a default one – available at Snapper.DEFAULT_CONFIG – is used. You can make a copy of that, add or change elements in your copy, and then pass that in to get_snapshot. (You could also subclass the Snapper class if you really need to, but in most cases, you shouldn’t need to.)

The configuration dictionary currently has two keys - attrmap and format – whose values are dictionaries. The dictionary is used when converting each logging object to its snapshot representation. Each object, and each attribute of that object, is passed recursively through the conversion process. This process works roughly as follows:

  1. A dict instance is converted to another dict instance whose keys and values have passed through the conversion process.
  2. Instances of list and tuple are converted to a tuple instance whose elements are the converted source elements.
  3. Instances of int, float, and str are passed through unchanged. None is passed through unchanged. (If needed, this could also be applied to complex and decimal.Decimal types, but it seems unlikely these would be used in a logging configuration.)
  4. For instances, a qualified class name is computed from instance.__class__.__name__ and instance.__module__ (if present). This is used as the key to the attrmap and format configuration sub-dictionaries. The format dictionary is checked first – if the key is present, the value should be a callable which takes a single argument – the instance – and returns the representation. This is used, for example, to convert internal implementation class instances such as PercentStyle, StrFormatStyle and StringTemplateStyle (in Python 3.2 and later) to the strings '%', '{' and '$'. (The value for config['attrmap']['logging.PercentStyle'] is lambda x: '%'.) If the key is not present in format, then attrmap is checked for the key. If present, the value should be a space-separated list of attribute names to bring into the snapshot; if an attribute name starts with an underscore, you can represent this as e.g. _fmt:format, where the instance attribute in the logging object will be _fmt but the field name in the corresponding namedtuple will be format. (Field names in named tuples can’t start with an underscore.) All named tuple classes have the field name class_ included, which holds the class name of the type represented by the named tuple.

This arrangement allows user-defined handler classes, for example, to be easily accommodated by just adding corresponding entries in the configuration dictionary. A part of the default configuration dictionary is shown below:

DEFAULT_CONFIG = {
    'attrmap': {
        'TextIOWrapper': 'name mode encoding', # for Python 3.x
        'file': 'name mode encoding', # for Python 2.x
        'logging.Formatter': 'datefmt _fmt:format _style:style',
        'logging.Logger': 'disabled filters handlers level name '
                          'propagate children',
        'logging.StreamHandler': 'filters formatter level stream',
        'logging.FileHandler': 'filters formatter level '
            'baseFilename:filename mode encoding delay',
        'logging.handlers.SMTPHandler': 'filters formatter level mailhost '
            'mailport fromaddr toaddrs username password subject secure '
            'timeout',
    },
    'format': {
        'logging.PercentStyle': lambda x : '%',
        'logging.StrFormatStyle': lambda x : '{',
        'logging.StringTemplateStyle': lambda x : '$',
    },
}

Just as an exercise, I coded up an implementation of a presentation which echoes Brandon’s formatting. You can see these in this gist, which contains two files: logtree.py, which contains the introspection implementation, and lttest.py, a simple test which defines a logging configuration similar to Django’s default configuration and prints out representations using Brandon’s logging_tree and the logtree.py implementation. The differences between the representations are not especially significant, and relate to the representation of individual handlers:

SHGHFV~A

Comments are welcome on the approach used, the desirability (or otherwise) of adding the logtree.py functionality to the stdlib, or any other pertinent point.

Wednesday, 7 March 2012

Structured logging, standardised

Moves are afoot in the Linux world to standardise structured logging – logging in such a way that log event data can be parsed in a standard fashion to extract useful information for later analysis. One such project is Fedora’s lumberjack, which aims to implement the specifications proposed in MITRE’s CEE (Common Event Expression). Another Fedora initiative is embodied in the ELAPI project, which covers a slightly wider problem domain.

The basic idea is to format messages using structured formats such as JSON or XML. These messages can be sent conventionally across existing transport mechanisms such as syslog, and the aim is that in due course, structured messages can be routed to structured message stores (e.g. MongoDB databases) rather than plain text files.

Python logging already supports structured logging in a generic way, with third-party handlers available both for RDBMS and MongoDB. However, a key element of the emerging standards is interoperability, so it may be that some Python application developers will need to support these standards at some point.

This support can be provided using a custom formatter class. For example,let’s look at using the CEE standard and assume that JSON will be used as the structured format, and syslog as the transport. A CEE event must contain, as a minimum, the following fields:

Name Description
p_proc The name of the process that produced the event.
p_sys The hostname of the system that generated the event.
time The time the event occurred. This timestamp should have microsecond granularity (1E-6 seconds) and include the timezone GMT offset.

The time format needed is not quite the same as the default provided by logging, but it should be easy enough to provide using the datetime class. The complete CEE schema is quite involved and I won’t attempt to detail it here – you can use the links provided above if you need more information.

Here is an example script showing how you can use a formatter to provide CEE-compliant log output:

#!/usr/bin/env python
import datetime
import json
import logging

import os
import socket
import sys

import pytz

class CEEFormatter(logging.Formatter):
    cee_token = ' cee: '

    def __init__(self, fmt=None, datefmt=None, tz=None):

        # We don't use super because logging classes are
        # old-style classes on 2.x
        logging.Formatter.__init__(self, fmt, datefmt)

        self.hostname = socket.gethostname()
        self.procname = os.path.basename(sys.argv[0])
        self.tz = tz

    def format(self, record):
        result = logging.Formatter.format(self, record)
        structured = self.format_structured(record)
        if structured:
            result += '%s%s' % (self.cee_token, json.dumps(structured))
        return result

    def get_event_time(self, record):
        return datetime.datetime.fromtimestamp(record.created,
                                               tz=self.tz).isoformat()

    def format_structured(self, record):
        result = {
            'Event': {
                'p_proc': self.procname,
                'p_sys': self.hostname,
                'time': self.get_event_time(record)
            },
        }
        return result

def main():
    if len(sys.argv) > 1:
        try:
            tz = pytz.timezone(sys.argv[1])
        except pytz.UnknownTimeZoneError:
            print('Unknown timezone %r, using Europe/London' % sys.argv[1])
            tz = pytz.timezone('Europe/London')
    else:
        tz = pytz.timezone('Europe/London')

    formatter = CEEFormatter('%(message)s', tz=tz)
    record = logging.makeLogRecord(dict(msg='Hello, world!'))

    print(formatter.format(record))

if __name__ == '__main__':
    sys.exit(main())

In the above script, the CEEFormatter class represents just one way in which you can implement a formatter to implement CEE-compatible logging. The CEE structured data is appended to the message, which can (for example) be sent via a SysLogHandler to a local or remote host. The cee_token is a value which indicates to a CEE-aware receiver that the message contains structured data. This is given as "cee:" in the CEE online documentation, but I’ve seen uses of "@cee:" as well. Whatever the standard decides upon, you can set this at a per-instance level for testing etc.

The format method of the CEEFormatter class calls the superclass method to format the message, and then calls format_structured with the record to get a dict indicating what the CEE structure should be. If the method returns a non-empty dict, it is formatted into JSON and appended to the message, preceded by the CEE token. In the above example, the absolutely minimal CEE-compatible message is formatted.

The get_event_time method is provided to factor out the event time format, just in case it has minor format variations you want to support – in which case you need only override this method. The example shows use of pytz to set up timezones, but this may not be needed in your case.

In practice, you can use one of the established methods (extra parameter to logging call, LoggerAdapter or Filter) to pass context information in the LogRecord, which can be used by the format_structured method to add additional, context-dependent information to the structured portion of the message.

If you run the above script (converted into an executable using ln –sT ceefmt.py ceefmt; chmod u+x ceefmt), you might see results similar to the following.

$ ./ceefmt
Hello, world! cee: {"Event": {"time": "2012-03-07T20:21:16.612857+00:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt Europe/Rome
Hello, world! cee: {"Event": {"time": "2012-03-07T21:21:29.021881+01:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt America/Chicago
Hello, world! cee: {"Event": {"time": "2012-03-07T14:21:41.745214-06:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt America/Anchorage
Hello, world! cee: {"Event": {"time": "2012-03-07T11:21:47.686027-09:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt Pacific/Auckland
Hello, world! cee: {"Event": {"time": "2012-03-08T09:21:57.687873+13:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}

Comments welcome.

Tuesday, 13 December 2011

Improved flexibility for log file rotation

The RotatingFileHandler and TimedRotatingFileHandler classes provide basic log file rotation functionality, but there are times when more functionality is required than is provided in these classes. One requirement is to compress log files when they are rotated. Currently, if you want this kind of functionality, you have to subclass the relevant handler and override the doRollover method. This works, but requires you to copy and paste the entire method, and then adapt it to your needs; clearly, this is not ideal. An example of this can be found in this ActiveState recipe for a TimedCompressedRotatingFileHandler. This uses the .zip format to compress log files.

In order to provide improved flexibility for log file processing (which will generally be compression, but in practice any transformation could be performed), I propose to add two new methods to these handlers (in their common base class, BaseRotatingHandler).

The first method is named rotate. It takes two arguments, source and dest. The source argument is the name of the log file about to be rotated (e.g. test.log), and the dest argument names the target of the rotation (e.g test.log.1). Under normal circumstances, the file named by dest should not exist, having been explicitly removed by the rollover logic. the default behaviour is what happens currently – a call to os.rename(source, dest).

Since we are potentially transforming log files, we might also want to change their extension, so that they are more usable with e.g. file system viewers like Nautilus or Windows Explorer. To facilitate this, a second method is provided, called filename. This takes a single argument – the filename of a destination file in a rotation (for example, test.log.1)  – and returns a modified version (for example, adding an extension – test.log.1.gz). The default behaviour is to return the name unchanged.

The definition of the new methods is as follows.

    def filename(self, name):

        """
        Modify the filename of a log file when rotating.

        This is provided so that a custom filename can be provided,
        say by adding an extension.

        The default implementation returns the name unchanged.

        :param name: The default name for the destination log file.
        """

    def rotate(self, source, dest):

        """
        When rotating, rotate the current log in the file named by
        source to the file named by dest.

        On entry, the file named by source should exist and the file
        named by dest should not exist.

        On exit, the file named by source should not exist and the

        file named by dest should exist.

        The default implementation simply renames source to dest.

        :param source: The source filename. This is normally the base
                       filename, e.g. 'test.log'
        :param dest:   The destination filename. This is normally
                       what the source is rotated to, e.g. 'test.log.1'.
        """

This will allow a custom rotation strategy to be implemented relatively easily: for example, if you want to gzip rotated log files, you can ensure that filename returns test.log.1.gz when passed test.log.1, and that rotate gzips test.log to test.log.1.gz.

The ActiveState example I cited earlier used the .zip format, but what if you want to use .gz, .bz2, .7z or some other compression algorithm? You would need to support all of them in a single handler, or write multiple handler classes for each format.

To avoid the need for users to subclass the rotating handlers, I further propose to allow configurability of the rotate and filename methods by providing two new handler attributes (both, by default, set to None in BaseRotatingHandler).

The first handler attribute is named rotator. You can set it to a callable that takes two arguments, source and dest. The meanings of these arguments are the same as for the rotate method. If a callable is provided in the rotator attribute, it is called by the rotate method to perform the rotation.

The second handler attribute is called namer. You can set this to a callable which takes a single argument – the filename of a destination file in a rotation (for example, test.log.1)  – and modifies it to, for example, add an extension. If a callable is set, then it is called by the filename method to return the new name.

An example of how you can define a namer and rotator is given in the following snippet, which shows zlib-based compression of the log file.

def namer(name):
    return name + ".gz"

def rotator(source, dest):
    with open(source, "rb") as sf:
        data = sf.read()
        compressed = zlib.compress(data, 9)
        with open(dest, "wb") as df:
            df.write(compressed)
    os.remove(source)

rh = logging.handlers.RotatingFileHandler(...)
rh.rotator = rotator
rh.namer = namer

I’m aware that these are not “true” .gz files, as they are bare compressed data, with no “container” such as you’d find in an actual gzip file. This snippet is just for illustration purposes.

Note: The namer function is called quite a few times during rollover, so it should be as simple and as fast as possible. It should also return the same output every time for a given input, otherwise the rollover behaviour may not work as expected. If either the namer or rotator function raises an exception, this will be handled in the same way as any other exception during an emit() call, i.e. via the handleError method of the handler.

The internal logic for determining which files to delete (when a backupCount is specified on a rotating file handler) uses, in the case of  TimedRotatingFileHandler, regular expressions for the date/time-based suffix. The regular expressions used have been updated to allow an optional file extension consisting of a period followed by one or more letters and digits.

These changes will soon appear in Python 3.3’s development repository. Your comments are welcome.

Thursday, 9 June 2011

Configurability tweaks for SysLogHandler and Formatter

There have been some recent changes to SysLogHandler and Formatter which will appear in Python releases.

  1. SysLogHandler currently appends a NUL byte to messages it sends to syslog daemons. This is because earlier versions of these daemons expected the NUL terminator to be present, even though it’s not part of the relevant specification (RFC 5424). More recent releases of these daemons don’t expect the NUL byte to be there, but strip it off if it’s present. And some even more recent daemons follow the specification more strictly, and pass the NUL byte on as part of the message. In order to facilitate dealing with all the various possible destinations for SysLogHandler messages, a new class-level attribute, append_nul, has been added to SysLogHandler. The default value is True, thus preserving current behaviour; however, you can set it to False for specific handler instances where you don’t want the NUL byte to be appended. This change is expected to appear in Python 3.2.1, which should be released soon.
  2. In the same vein, the date formatting logic used by Formatter has been changed to make it more configurable. Previously, the default ISO 8601 format was hard-coded as in this example: 2010-09-06 22:38:15,292 where the part before the comma is handled by a strptime format string ('%Y-%m-%d %H:%M:%S'), and the part after the comma is a millisecond value. Because strptime does not have a format placeholder for milliseconds, the millisecond value is appended using another format string, '%s,%03d' – both of these format strings have been hardcoded into the Formatter.formatTime method. That’s set to change with Python 3.3 – these strings will be defined as class-level attributes which can be overridden at the instance level when desired. The names of the attributes are default_time_format (for the strptime format string) and default_msec_format (for appending the millisecond value).

Monday, 11 April 2011

Added functionality for basicConfig() in Python 3.3

The logging.basicConfig() function allows users to easily configure either a console handler or file handler for the root logger, and thereby have them usable by all loggers in their application. While this functionality meets the commonest use cases, there are some slightly less common use cases which can be accommodated without too much work. For example:

  • Configuring a FileHandler with a specific encoding, or with a delay parameter which defers opening the file until the first time it’s actually used.
  • Configuring a SysLogHandler. This could be useful, say, for cron jobs or other admin scripts.
  • Configuring multiple handlers with minimal work.

Accordingly, basicConfig() in Python 3.3 will have added functionality in the form of a new handlers keyword argument. This defaults to None, but when specified it is expected to be an iterable of already-created handler instances, which are to be added to the root logger.

Any handler in the iterable which does not have a formatter assigned will be assigned the formatter created by basicConfig(). This allows users to specify formatters if they need to, but defaults to a common formatter for all handlers which have no formatter specified.

This change to basicConfig() also includes improved checking for parameter compatibility: it doesn’t make sense to specify handlers together with stream or filename parameters, or to specify stream and filename parameters together. In Python 3.3, checks have been added so that these invalid combinations will raise a ValueError.

Examples of usage:

logging.basicConfig(handlers=[logging.FileHandler('test.log', 'w', 'utf-8')])

logging.basicConfig(handlers=[logging.SysLogHandler()])

logging.basicConfig(handlers=[logging.StreamHandler(), logging.FileHandler('test.log')])

I’d welcome feedback on this new functionality, which has already been checked into Python’s Mercurial repository.

Logging documentation for Python 2.7 reorganised

The logging documentation for Python 2.7 has been reorganised into the same structure as that for Python 3.x. The library reference is supposed to be just that – a reference – so the tutorials have been moved to the HOWTO section. The configuration and handler documentation has also been moved into separate pages. The organisation is now as follows:

There’s a prominent sidebar at the start of the reference documentation pages pointing to the HOWTO sections.

I hope the results are easier to navigate and to understand. Please feel free to give your feedback.