Automating Text File Encoding Conversion with Python
If you’ve ever worked with text files from various sources, you’ve probably encountered issues with file encoding. For me, this problem became evident when I downloaded subtitles for TV shows that weren’t in UTF-8 encoding. All my devices are set to use UTF-8, so dealing with encoding mismatches can quickly become frustrating — especially when you have to update subtitles for your favorite TV show with 24 episodes per season and 6 seasons in total! Manually opening each file in a text editor and saving it with the correct encoding was a waste of time.
To save time and effort, I decided to create a Python package to automate the process of converting text file encodings.
The Problem: Non-UTF-8 Encodings #
Subtitles for older or region-specific TV shows often come in non-UTF-8 encodings, like Windows-1250 or ISO-8859-2. While most players can handle these encodings, they usually require switching the player to the correct encoding and then reverting the setting afterward. To avoid this hassle, I decided to standardize everything to UTF-8 and convert all subtitles to this format. Manually converting files every time is tedious and error-prone, so I needed a tool that could:
- Handle batch processing of multiple files.
- Automatically detect and process all files in a folder.
- Be easy to use and configurable for different encodings.
The Solution: `change_encoding` package #
The result of my efforts is a Python package simply called `change_encoding`. It's designed to be straightforward and effective. Here's what it does:
- Converts all `.srt` files in a specified folder from one encoding to another.
- Creates a separate folder for the converted files, named based on the destination encoding, preserving the originals.
- Handles encoding errors gracefully, ensuring the process doesn't break halfway.
Installation #
You can install the package by cloning the GitHub repository and installing it:

```shell
git clone https://github.com/SoftwareWitchcraft/change_encoding
cd change_encoding
pip3 install .
```
Usage #
Once installed, the `chenc` command will be available on your computer, so you can use the package directly from the command line:

```shell
chenc /path/to/folder windows-1250 utf-8
```
Here:
- `/path/to/folder` is the directory containing the text files to convert.
- `windows-1250` is the source encoding (you can change this based on your files).
- `utf-8` is the target encoding.
If you don't specify the encodings, the package defaults to converting from `windows-1250` to `utf-8`.
NOTE: Please refer to the Python documentation for a list of all supported encodings.
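Since the package relies on Python's codec machinery, you can check whether an encoding name is valid before running a conversion. This small stdlib sketch is not part of the package; the helper name `is_valid_encoding` is my own:

```python
import codecs

def is_valid_encoding(name: str) -> bool:
    """Return True if Python recognizes `name` as a codec."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

print(is_valid_encoding("windows-1250"))  # True
print(is_valid_encoding("not-a-codec"))   # False
```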
How It Works #
The package works by:
- Scanning the specified folder for `.srt` files.
- Reading each file with the specified source encoding.
- Writing the content to a new file in the target encoding.
- Saving the converted files in a subfolder, named based on the destination encoding, within the original directory (for example, in a `utf-8` subfolder).
Here's a simple code snippet that forms the core functionality:

```python
def convert_file_encoding(input_file, output_file, from_encoding, to_encoding):
    with open(input_file, 'r', encoding=from_encoding) as infile:
        content = infile.read()
    with open(output_file, 'w', encoding=to_encoding) as outfile:
        outfile.write(content)
```
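The batch step that ties these stages together could be sketched like this. This is my own illustration of the described behavior, not the package's actual source; the `convert_folder` name and the skip-on-error policy are assumptions:

```python
from pathlib import Path

def convert_file_encoding(input_file, output_file, from_encoding, to_encoding):
    with open(input_file, 'r', encoding=from_encoding) as infile:
        content = infile.read()
    with open(output_file, 'w', encoding=to_encoding) as outfile:
        outfile.write(content)

def convert_folder(folder, from_encoding="windows-1250", to_encoding="utf-8"):
    """Convert every .srt file in `folder`, writing the results to a
    subfolder named after the target encoding (e.g. `utf-8/`)."""
    folder = Path(folder)
    out_dir = folder / to_encoding
    out_dir.mkdir(exist_ok=True)
    converted = []
    for srt in sorted(folder.glob("*.srt")):
        target = out_dir / srt.name
        try:
            convert_file_encoding(srt, target, from_encoding, to_encoding)
            converted.append(target)
        except UnicodeDecodeError as exc:
            # Skip the unreadable file instead of aborting the whole batch.
            print(f"Skipping {srt.name}: {exc}")
    return converted
```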
Extending the Package #
While the package is currently tailored for my needs, there’s potential for extending its functionality:
- Auto-detection of source encoding: using libraries like `chardet` or `charset-normalizer`.
- Support for additional file types: beyond `.srt`, the package could handle `.txt` or `.csv` files.
- Recursively scanning the subfolders
- …
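As a dependency-free alternative to `chardet` or `charset-normalizer`, a crude form of auto-detection is to try a short list of candidate encodings and keep the first one that decodes without error. A sketch; the function name and the candidate list are my assumptions:

```python
def guess_encoding(raw: bytes, candidates=("utf-8", "windows-1250", "iso-8859-2")):
    """Return the first candidate encoding that decodes `raw` cleanly,
    or None if none fit. Order matters: try the strictest codec first,
    since permissive single-byte codecs accept almost any input."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

This is far less reliable than the statistical detection the libraries perform, but it needs nothing beyond the standard library.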
Conclusion #
The `change_encoding` package has simplified a tedious process in my workflow, allowing me to enjoy my TV shows without hassle. Now I can process an entire folder of subtitle files in seconds, ensuring they're ready to use on all my devices.
If you’re facing similar encoding issues, feel free to try it out or adapt it to your needs. You can find the code on GitHub.