Unicode File Names in Python 2.7

Today I wrote a small script to find and delete duplicate files. To do this, I needed to iterate over the files in a specific folder and calculate an MD5 checksum for each one:

import os

for folder, subs, files in os.walk(path):
    for filename in files:
        file_path = os.path.join(folder, filename)
        with open(file_path, 'rb') as fh:
            ...
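
The checksum step itself is elided above. As a rough sketch of what it might look like (the function name md5_of_file and the chunk_size parameter are my own illustration here, not the original script), the file can be read in chunks so that large files are never loaded into memory all at once:

```python
import hashlib


def md5_of_file(file_path, chunk_size=65536):
    """Return the hex MD5 digest of the file at file_path.

    Hypothetical helper: the name and chunk_size are assumptions
    made for this sketch.
    """
    digest = hashlib.md5()
    with open(file_path, 'rb') as fh:
        # Read fixed-size chunks until read() returns an empty byte string.
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest are then candidates for being duplicates.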

If the source folder contains a file or a folder with non-ASCII (Unicode) characters in its name, running the code results in this bummer:

IOError: [Errno 22] invalid mode ('rb') or filename: 'files\\????????? ????? ????????.txt'

The solution was easy. As it turned out, os.walk can return names either as byte strings (str) or as Unicode strings (unicode). It chooses based on the type of the “top path” argument. Originally I was passing a byte string. Once I decoded it from UTF-8 into a unicode object, the problem was solved:

for folder, subs, files in os.walk(unicode(path, 'utf-8')):
    for filename in files:
        file_path = os.path.join(folder, filename)
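
To avoid repeating the decode at every call site, the fix can be wrapped in a small helper. This is only a sketch under my own assumptions: the names to_text and iter_file_paths are made up for illustration, and it assumes that a byte-string path is UTF-8 encoded, which may not hold for paths coming from a Windows console.

```python
import os


def to_text(path, encoding='utf-8'):
    """Decode a byte-string path to text; pass text through unchanged.

    Hypothetical helper. The default encoding is an assumption.
    """
    if isinstance(path, bytes):  # on Python 2.7, bytes is an alias of str
        return path.decode(encoding)
    return path


def iter_file_paths(top, encoding='utf-8'):
    """Yield the full path of every file under top as a text string."""
    # Passing a text "top" makes os.walk return text names as well.
    for folder, subs, files in os.walk(to_text(top, encoding)):
        for filename in files:
            yield os.path.join(folder, filename)
```

Because os.walk is always given a text string, every yielded path is text too and can be passed straight to open().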
