Thursday, 26 July 2012

A potential security vulnerability in logging.fileConfig()

Although now superseded by the dictConfig() API proposed in PEP 391, the much older fileConfig() API has been widely used to configure logging. It is also possible to reconfigure logging on the fly through logging’s listen() API, which listens on a socket for configuration data either in the format understood by fileConfig(), or in JSON format for use with dictConfig(). Although listen() only ever binds to localhost and so cannot be reached from a remote machine, in certain scenarios using listen() can result in code you don’t control being executed in the process which calls it. To see how, consider a shared hosting environment where multiple users, whose only connection is being hosted on the same server, run processes. Ordinarily these processes are isolated from each other, but if one user’s process uses listen(), another user’s process could connect to the listened-on socket and send arbitrary configuration data. Because fileConfig() calls eval() under the hood to convert text elements in the configuration to Python objects, a malicious user could send arbitrary code to a victim in a logging configuration, with potentially undesirable consequences for the victim when that code is run by eval(). (Thanks are due to Graham Dumpleton for reporting this vulnerability.)

Why does fileConfig() use eval(), and how can the vulnerability be tackled? Although recent versions of Python provide the ast module to assist with processing Python source code, this module was not available when fileConfig() was written. The type of configuration data which needs to be evaluated can be more complex than a simple parser can cope with: for example, a configuration may refer to a handler’s class, whether it be a stdlib-included one such as handlers.WatchedFileHandler, or a user-defined or third-party handler such as mailinglogger.MailingLogger. The configuration can also specify the constructor arguments for a handler, which can also be fairly involved to parse in ad hoc fashion. For this reason, eval() was used.

Users have already been alerted to the vulnerability via an update to the documentation, but that is perhaps not enough. One way of addressing the vulnerability is to not use eval(), but some other evaluation mechanism which is more sandboxed. One suggestion which has been made is to use ast.literal_eval(), but this only copes with Python literals – it does no name lookup, and so can’t cope with expressions like handlers.WatchedFileHandler.
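The limitation is easy to demonstrate:

```python
import ast

# Literals and literal containers are fine...
print(ast.literal_eval("{'level': 10, 'propagate': False}"))

# ...but any name lookup is rejected, so a handler class reference
# such as this cannot be evaluated:
try:
    ast.literal_eval('handlers.WatchedFileHandler')
except ValueError as e:
    print('rejected:', e)
```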

As an experiment, I put together a simple expression evaluator which uses the ast module to parse input text and provides probably the bare minimum evaluation of AST nodes to support common use cases, together with a simple test harness which evaluates user-entered expressions in the context of the logging package.
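A minimal sketch of such an evaluator might look like this (my own illustration, not the original code; it uses ast.Constant and so needs Python 3.8 or later, whereas the original would have handled ast.Num, ast.Str and friends):

```python
import ast
import logging
import logging.handlers


def restricted_eval(source, namespace):
    """Evaluate an expression, allowing only literals, tuple/list/dict
    displays, lookups of names in ``namespace`` and attribute access.
    Anything else (calls, operators, subscripts...) is rejected."""
    tree = ast.parse(source, mode='eval')

    def ev(node):
        if isinstance(node, ast.Constant):      # str, int, float, None, ...
            return node.value
        if isinstance(node, ast.Name):          # names resolve via namespace only
            if node.id in namespace:
                return namespace[node.id]
            raise NameError('name %r is not defined' % node.id)
        if isinstance(node, ast.Attribute):     # e.g. handlers.WatchedFileHandler
            if node.attr.startswith('_'):       # block dunder/private traversal
                raise ValueError('access to %r disallowed' % node.attr)
            return getattr(ev(node.value), node.attr)
        if isinstance(node, ast.Tuple):
            return tuple(ev(e) for e in node.elts)
        if isinstance(node, ast.List):
            return [ev(e) for e in node.elts]
        if isinstance(node, ast.Dict):
            return dict((ev(k), ev(v)) for k, v in zip(node.keys, node.values))
        raise ValueError('disallowed construct: %s' % type(node).__name__)

    return ev(tree.body)

# Evaluate an expression in the context of the logging package:
print(restricted_eval('handlers.WatchedFileHandler', vars(logging)))
```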

However, limiting the scope of evaluation is not the complete answer, and perhaps not the correct answer. A malicious user could still send a bogus configuration which, for example, turns off all loggers, or configures a huge number of bogus loggers. Such a configuration would certainly be allowed by a limited-evaluation scheme, since a user might legitimately want to do either of these things; but if a malicious user sends the exact same “legal” configuration, it is still a security exploit, because the consequences may be undesirable to the victim.

But how is the listener to know whether the configuration is coming from a legitimate source (a client process controlled by the same user who is running the process which uses listen()) or a malicious user (a client process controlled by some other user)? The simplest answer would appear to be a shared secret: when listen() is called, it is passed a text passphrase, which is also known to legitimate clients. When handling a configuration request via the socket, the configuration is checked to see if it contains the passphrase. If it does, the request is processed; otherwise, it is ignored.

In the fileConfig() input data, the passphrase could be provided via a passphrase=secret entry in the [default] section. In the dictConfig() input data, the passphrase could be provided against the passphrase key in the dict which is passed to dictConfig(). The checking would be done in the request handler code before calling fileConfig() or dictConfig(). If the passphrase argument to the listen() call is None (the default, preserving the current behaviour) no passphrase checking would be done.
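A sketch of how the request handler’s check might look (the helper name and the handling of both formats are my assumptions; the eventual implementation may well differ):

```python
import configparser
import json


def has_valid_passphrase(payload, secret):
    """Return True if the received configuration carries the shared secret.

    dictConfig-style (JSON) payloads are expected to carry it under a
    top-level 'passphrase' key; fileConfig-style payloads in a 'passphrase'
    entry of the [default] section.  Illustrative sketch only.
    """
    if secret is None:            # no passphrase configured: accept anything,
        return True               # preserving the current behaviour
    text = payload.decode('utf-8')
    try:
        config = json.loads(text)
    except ValueError:            # not JSON, so try the fileConfig format
        parser = configparser.ConfigParser()
        parser.read_string(text)
        return parser.get('default', 'passphrase', fallback=None) == secret
    return isinstance(config, dict) and config.get('passphrase') == secret
```

The listener’s request handler would call this before handing the payload to fileConfig() or dictConfig(), ignoring the request when it returns False.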

As always, comments are welcome. Although this vulnerability is limited to specific scenarios, it would be good to address it as soon as possible. As Python 3.3 has entered beta, and is therefore in feature freeze, it is likely that a fix will go into Python 3.4 – the vulnerability is not currently considered serious enough to warrant fixing in earlier versions.

Sunday, 29 April 2012

Introspecting the logger hierarchy

Brandon Rhodes’ post about his logging_tree project got me thinking about providing this sort of functionality in the logging package. Though introspection of the logger hierarchy has been mentioned a couple of times in newsgroup postings over the years, it has never actually been requested as a fleshed-out feature, and until Brandon’s post, it was below my radar. Now, thanks to him, I have given it some attention.

There are a number of things which need to be considered when providing some way of introspecting the logger hierarchy:

  • In many systems, the logging hierarchy doesn’t change once it’s configured. However, there are also systems where it does – logging provides the facility for on-the-fly reconfiguration without having to restart a Python process. Therefore, a general introspection facility must act as a snapshot – the state of the hierarchy at the time the snapshot was requested.
  • The introspection facility must allow both built-in and user-defined Logger, Handler, Formatter and Filter instances to return useful information.
  • The data returned by the introspection facility should be, as far as practicable, immutable. Users should not expect to get direct access to individual logging elements (as there are already specific APIs for doing that).
  • The introspection facility should perhaps eschew any presentation of the hierarchy; this is likely to be very application-dependent, and it is debatable whether it is worth providing a “default” presentation. Example presentations would be a textual printout (such as logging_tree provides) or a depiction of the hierarchy in a browser-based or desktop GUI.

I’ve made an attempt at addressing the above points when implementing my own take on Brandon’s work. Although I’ve no definite plan to include this functionality in Python 3.3, I’m publishing this post in order to get feedback about the approach used, and also whether something like this belongs in the core package.

The basic API is provided by a single function, get_snapshot(), which returns a dictionary. The keys of that dictionary are logger names, with '' being the key for the root logger. The corresponding values are instances of namedtuple subclasses; they are snapshots of the loggers, handlers, formatters and filters.

Here’s how it might work. Running the code snippet

import logging
import textwrap

from logtree import get_snapshot

logging.getLogger('logtree.test')
logging.basicConfig(filename='logtree.log')
result = get_snapshot()
tw = textwrap.TextWrapper()
tw.subsequent_indent = ' ' * 4
for name, info in sorted(result.items()):
    print(tw.fill('%r -> %r' % (name, info)))
    print('-' * 40)

would display something like

'' -> logging_RootLoggerInfo(class_='logging.RootLogger', level=30,
    propagate=1, children=(('logtree',
    logging_PlaceHolderInfo(class_='logging.PlaceHolder',
    children=(('logtree.test',
    logging_LoggerInfo(class_='logging.Logger', name='logtree.test',
    level=0, handlers=(), disabled=0, propagate=1, filters=(),
    children=())),))),), filters=(),
    handlers=(logging_FileHandlerInfo(class_='logging.FileHandler',
    encoding=None, level=0, delay=None,
    filename='C:\\temp\\logtree.log', mode='a', filters=(),
    formatter=logging_FormatterInfo(class_='logging.Formatter',
    style='%', datefmt=None,
    format='%(levelname)s:%(name)s:%(message)s')),))
----------------------------------------
'logtree' -> logging_PlaceHolderInfo(class_='logging.PlaceHolder',
    children=(('logtree.test',
    logging_LoggerInfo(class_='logging.Logger', name='logtree.test',
    level=0, handlers=(), disabled=0, propagate=1, filters=(),
    children=())),))
----------------------------------------
'logtree.test' -> logging_LoggerInfo(class_='logging.Logger',
    name='logtree.test', level=0, handlers=(), disabled=0,
    propagate=1, filters=(), children=())
----------------------------------------

Of course, this is very simplified formatting, and can be much improved on, but it gives the gist of what’s being returned. The various classes whose names end in Info are the namedtuple subclasses, and the corresponding instances represent particular logging objects. You can see that specific attributes are displayed; these can be customised quite flexibly.

Customising is done via a configuration dictionary. The full definition of get_snapshot() is:

def get_snapshot(config=None):
    return Snapper(config).snapshot()

If no configuration dictionary is specified, a default one – available at Snapper.DEFAULT_CONFIG – is used. You can make a copy of that, add or change elements in your copy, and then pass that in to get_snapshot. (You could also subclass the Snapper class if you really need to, but in most cases, you shouldn’t need to.)

The configuration dictionary currently has two keys, attrmap and format, whose values are themselves dictionaries. These are used when converting each logging object to its snapshot representation. Each object, and each attribute of that object, is passed recursively through the conversion process, which works roughly as follows:

  1. A dict instance is converted to another dict instance whose keys and values have passed through the conversion process.
  2. Instances of list and tuple are converted to a tuple instance whose elements are the converted source elements.
  3. Instances of int, float, and str are passed through unchanged. None is passed through unchanged. (If needed, this could also be applied to complex and decimal.Decimal types, but it seems unlikely these would be used in a logging configuration.)
  4. For instances, a qualified class name is computed from instance.__class__.__name__ and instance.__module__ (if present). This is used as the key into the format and attrmap configuration sub-dictionaries. The format dictionary is checked first: if the key is present, the value should be a callable which takes a single argument – the instance – and returns its representation. This is used, for example, to convert internal implementation class instances such as PercentStyle, StrFormatStyle and StringTemplateStyle (in Python 3.2 and later) to the strings '%', '{' and '$'. (The value for config['format']['logging.PercentStyle'] is lambda x: '%'.) If the key is not present in format, then attrmap is checked for it. If present, the value should be a space-separated list of attribute names to bring into the snapshot. If an attribute name starts with an underscore, you can map it to a different field name using e.g. _fmt:format, where the instance attribute in the logging object is _fmt but the field name in the corresponding namedtuple is format. (Field names in named tuples can’t start with an underscore.) All named tuple classes include the field class_, which holds the class name of the type represented by the named tuple.
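To make the steps concrete, here is a sketch of the recursive conversion (my own illustration of the process described above, not the actual logtree.py code):

```python
from collections import namedtuple


def convert(obj, config):
    """Recursively convert a logging object to an immutable snapshot."""
    if isinstance(obj, dict):                               # step 1
        return dict((convert(k, config), convert(v, config))
                    for k, v in obj.items())
    if isinstance(obj, (list, tuple)):                      # step 2
        return tuple(convert(e, config) for e in obj)
    if obj is None or isinstance(obj, (int, float, str)):   # step 3
        return obj
    # step 4: a qualified class name keys into format, then attrmap
    key = obj.__class__.__name__
    module = getattr(obj.__class__, '__module__', None)
    if module:
        key = '%s.%s' % (module, key)
    if key in config['format']:
        return config['format'][key](obj)
    attrs, fields = [], ['class_']
    for item in config['attrmap'].get(key, '').split():
        attr, _, field = item.partition(':')                # e.g. '_fmt:format'
        attrs.append(attr)
        fields.append(field or attr)
    cls = namedtuple(key.replace('.', '_') + 'Info', fields)
    return cls(key, *[convert(getattr(obj, a, None), config) for a in attrs])
```

With the attrmap entry for logging.Formatter shown below, converting a formatter yields a logging_FormatterInfo named tuple like the one in the output shown earlier.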

This arrangement allows user-defined handler classes, for example, to be easily accommodated by just adding corresponding entries in the configuration dictionary. A part of the default configuration dictionary is shown below:

DEFAULT_CONFIG = {
    'attrmap': {
        'TextIOWrapper': 'name mode encoding', # for Python 3.x
        'file': 'name mode encoding', # for Python 2.x
        'logging.Formatter': 'datefmt _fmt:format _style:style',
        'logging.Logger': 'disabled filters handlers level name '
                          'propagate children',
        'logging.StreamHandler': 'filters formatter level stream',
        'logging.FileHandler': 'filters formatter level '
            'baseFilename:filename mode encoding delay',
        'logging.handlers.SMTPHandler': 'filters formatter level mailhost '
            'mailport fromaddr toaddrs username password subject secure '
            'timeout',
    },
    'format': {
        'logging.PercentStyle': lambda x : '%',
        'logging.StrFormatStyle': lambda x : '{',
        'logging.StringTemplateStyle': lambda x : '$',
    },
}

Just as an exercise, I coded up a presentation which echoes Brandon’s formatting. You can see these in this gist, which contains two files: logtree.py, which contains the introspection implementation, and lttest.py, a simple test which defines a logging configuration similar to Django’s default configuration and prints out representations using Brandon’s logging_tree and the logtree.py implementation. The differences between the representations are not especially significant, and relate to the representation of individual handlers.


Comments are welcome on the approach used, the desirability (or otherwise) of adding the logtree.py functionality to the stdlib, or any other pertinent point.

Wednesday, 7 March 2012

Structured logging, standardised

Moves are afoot in the Linux world to standardise structured logging – logging in such a way that log event data can be parsed in a standard fashion to extract useful information for later analysis. One such project is Fedora’s lumberjack, which aims to implement the specifications proposed in MITRE’s CEE (Common Event Expression). Another Fedora initiative is embodied in the ELAPI project, which covers a slightly wider problem domain.

The basic idea is to format messages using structured formats such as JSON or XML. These messages can be sent conventionally across existing transport mechanisms such as syslog, and the aim is that in due course, structured messages can be routed to structured message stores (e.g. MongoDB databases) rather than plain text files.

Python logging already supports structured logging in a generic way, with third-party handlers available both for RDBMS and MongoDB. However, a key element of the emerging standards is interoperability, so it may be that some Python application developers will need to support these standards at some point.

This support can be provided using a custom formatter class. For example, let’s look at using the CEE standard, assuming that JSON will be used as the structured format and syslog as the transport. A CEE event must contain, as a minimum, the following fields:

  • p_proc – The name of the process that produced the event.
  • p_sys – The hostname of the system that generated the event.
  • time – The time the event occurred. This timestamp should have microsecond granularity (1E-6 seconds) and include the timezone GMT offset.

The time format needed is not quite the same as the default provided by logging, but it should be easy enough to provide using the datetime class. The complete CEE schema is quite involved and I won’t attempt to detail it here – you can use the links provided above if you need more information.

Here is an example script showing how you can use a formatter to provide CEE-compliant log output:

#!/usr/bin/env python
import datetime
import json
import logging
import os
import socket
import sys

import pytz

class CEEFormatter(logging.Formatter):
    cee_token = ' cee: '

    def __init__(self, fmt=None, datefmt=None, tz=None):
        # We don't use super() because logging classes are
        # old-style classes on Python 2.x
        logging.Formatter.__init__(self, fmt, datefmt)

        self.hostname = socket.gethostname()
        self.procname = os.path.basename(sys.argv[0])
        self.tz = tz

    def format(self, record):
        result = logging.Formatter.format(self, record)
        structured = self.format_structured(record)
        if structured:
            result += '%s%s' % (self.cee_token, json.dumps(structured))
        return result

    def get_event_time(self, record):
        return datetime.datetime.fromtimestamp(record.created,
                                               tz=self.tz).isoformat()

    def format_structured(self, record):
        result = {
            'Event': {
                'p_proc': self.procname,
                'p_sys': self.hostname,
                'time': self.get_event_time(record)
            },
        }
        return result

def main():
    if len(sys.argv) > 1:
        try:
            tz = pytz.timezone(sys.argv[1])
        except pytz.UnknownTimeZoneError:
            print('Unknown timezone %r, using Europe/London' % sys.argv[1])
            tz = pytz.timezone('Europe/London')
    else:
        tz = pytz.timezone('Europe/London')

    formatter = CEEFormatter('%(message)s', tz=tz)
    record = logging.makeLogRecord(dict(msg='Hello, world!'))

    print(formatter.format(record))

if __name__ == '__main__':
    sys.exit(main())

In the above script, the CEEFormatter class represents just one way in which you can implement a formatter to implement CEE-compatible logging. The CEE structured data is appended to the message, which can (for example) be sent via a SysLogHandler to a local or remote host. The cee_token is a value which indicates to a CEE-aware receiver that the message contains structured data. This is given as "cee:" in the CEE online documentation, but I’ve seen uses of "@cee:" as well. Whatever the standard decides upon, you can set this at a per-instance level for testing etc.

The format method of the CEEFormatter class calls the superclass method to format the message, and then calls format_structured with the record to get a dict indicating what the CEE structure should be. If the method returns a non-empty dict, it is formatted into JSON and appended to the message, preceded by the CEE token. In the above example, the absolutely minimal CEE-compatible message is formatted.

The get_event_time method is provided to factor out the event time format, just in case it has minor format variations you want to support – in which case you need only override this method. The example shows use of pytz to set up timezones, but this may not be needed in your case.

In practice, you can use one of the established methods (extra parameter to logging call, LoggerAdapter or Filter) to pass context information in the LogRecord, which can be used by the format_structured method to add additional, context-dependent information to the structured portion of the message.

If you run the above script (converted into an executable using ln -sT ceefmt.py ceefmt; chmod u+x ceefmt), you might see results similar to the following.

$ ./ceefmt
Hello, world! cee: {"Event": {"time": "2012-03-07T20:21:16.612857+00:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt Europe/Rome
Hello, world! cee: {"Event": {"time": "2012-03-07T21:21:29.021881+01:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt America/Chicago
Hello, world! cee: {"Event": {"time": "2012-03-07T14:21:41.745214-06:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt America/Anchorage
Hello, world! cee: {"Event": {"time": "2012-03-07T11:21:47.686027-09:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
$ ./ceefmt Pacific/Auckland
Hello, world! cee: {"Event": {"time": "2012-03-08T09:21:57.687873+13:00", "p_proc": "ceefmt", "p_sys": "eta-jaunty"}}
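To illustrate the context-passing mentioned above, here is a standalone stand-in for the format_structured method which merges per-record context supplied via the extra parameter (the attribute name cee_fields is my invention for illustration, not part of CEE, and the process/host values are fixed for brevity):

```python
import json
import logging


def format_structured(record, procname='ceefmt', hostname='localhost'):
    """Build a minimal CEE Event dict, folding in any context that a
    logging call attached to the record via extra={'cee_fields': {...}}."""
    event = {'p_proc': procname, 'p_sys': hostname, 'time': record.created}
    event.update(getattr(record, 'cee_fields', {}))
    return {'Event': event}

# A call such as logger.info('Hello', extra={'cee_fields': {...}}) sets the
# attribute on the record; makeLogRecord lets us simulate that here:
record = logging.makeLogRecord(dict(msg='Hello, world!'))
record.cee_fields = {'p_app': 'myapp', 'action': 'login'}
print(json.dumps(format_structured(record), sort_keys=True))
```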

Comments welcome.