Embedding Images with Arbitrary Data & md5sums

Published: Friday, September 06, 2024
Last Modified: Sunday, September 08, 2024

For every JPG or PNG image in a directory, we can embed selected fields from its associated JSON file (output from gallery-dl’s --write-metadata flag) into the EXIF field UserComment.

import os
from PIL import Image, ExifTags
import json

image_directory = '/'  # set this to the directory holding the images and their .json files

for filename in os.listdir(image_directory):
    json_filename = f'{filename}.json'

    image_path = os.path.join(image_directory, filename)
    json_path = os.path.join(image_directory, json_filename)

    if filename.lower().endswith(('.png', '.jpg', '.jpeg')) and os.path.isfile(json_path):

        d = {}
        with open(json_path, mode='r') as f:
            data = json.load(f)
            d['download_url'] = data.get('url', '')
            d['origin_url'] = data.get('link', '')
            d['auto_alt_text'] = data.get('auto_alt_text', '')
            d['created_at'] = data.get('created_at', '')
            d['description'] = data.get('description', '')
            d['grid_title'] = data.get('grid_title', '')
            d['dominant_color'] = data.get('dominant_color', '')

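        with Image.open(image_path) as img:
            exif = img.getexif()
            # Store the selected metadata as JSON-encoded bytes in UserComment
            exif[ExifTags.Base.UserComment] = json.dumps(d).encode()

            # Save under a new name so the original image is left untouched
            new_image_path = os.path.join(image_directory, f'modified_{filename}')
            img.save(new_image_path, exif=exif)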

To extract the embedded data, just do a regular dict lookup on the image’s EXIF data and decode the bytes.

exif[ExifTags.Base.UserComment].decode()
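Since the payload was written with json.dumps(d).encode(), the decoded bytes are a JSON string, and json.loads turns that back into a dict. A minimal sketch, assuming the raw bytes come from the lookup above; parse_user_comment is a hypothetical helper name (not part of Pillow), and the URL is a made-up stand-in value:

```python
import json


def parse_user_comment(raw):
    """Turn the bytes stored in UserComment back into a dict.

    parse_user_comment is a hypothetical helper, not a Pillow function.
    Returns an empty dict if the bytes are not UTF-8 JSON holding an object.
    """
    try:
        parsed = json.loads(raw.decode('utf-8'))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Stand-in for exif[ExifTags.Base.UserComment]; the URL is a made-up value
raw = json.dumps({'origin_url': 'https://example.com/pin/1'}).encode()
print(parse_user_comment(raw)['origin_url'])
```

Falling back to an empty dict keeps a batch job going when an image carries a UserComment that some other tool wrote in a different format.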

Remarks:

  • I’ve tested this with PNGs and JPGs, and it works for both.
  • This changes the md5sum of the image. I’ve also noticed that running img.save() alone, without touching the EXIF data, alters an image’s md5sum. The snippet below demonstrates this. I figure PIL (Pillow) must be altering the file structure somehow.

from PIL import Image
import hashlib

image_path = '/home/user/Desktop/temp/flickr.jpg'

def get_md5sum(filename):
    # Hash the file's raw bytes; the with block closes the file handle
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

print(f'md5sum, initial: {get_md5sum(image_path)}')

with Image.open(image_path) as img:
    img.save(image_path) # save() will alter the md5sum
    print(f'md5sum, final:   {get_md5sum(image_path)}')

Is this expected? Not really, because cp flickr1.jpg flickr2.jpg leaves flickr1.jpg and flickr2.jpg with the same md5sum. So what is PIL doing? For the first few iterations of this script, PIL modifies 4 bytes in the image. After several runs, the byte counts of the two images start to differ, and the differences begin to compound somehow.

cmp -l -b flickr1.jpg flickr2.jpg

byte     val1  val2
237,697  117   77
237,700  57    63
294,748  317   313
294,751  214   215
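
The same comparison can be done from Python when cmp is not at hand. A rough sketch, assuming both files fit in memory; diff_bytes is a hypothetical helper name, and the two small temporary files are stand-ins for flickr1.jpg and flickr2.jpg. Like cmp -l, it reports 1-based offsets and octal byte values:

```python
import os
import tempfile


def diff_bytes(path_a, path_b):
    """List (offset, octal_a, octal_b) for each differing byte.

    diff_bytes is a hypothetical helper. Offsets are 1-based and values
    are octal strings. Only the overlapping length is compared, so a
    difference in file size has to be checked separately.
    """
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        a, b = fa.read(), fb.read()
    return [
        (i, format(x, 'o'), format(y, 'o'))
        for i, (x, y) in enumerate(zip(a, b), start=1)
        if x != y
    ]


# Demo on two tiny temporary files that differ in their third byte
with tempfile.TemporaryDirectory() as tmp:
    p1 = os.path.join(tmp, 'a.bin')
    p2 = os.path.join(tmp, 'b.bin')
    with open(p1, 'wb') as f:
        f.write(b'ABCD')
    with open(p2, 'wb') as f:
        f.write(b'ABXD')
    demo_diff = diff_bytes(p1, p2)
    for offset, val1, val2 in demo_diff:
        print(offset, val1, val2)
```

Running it on the image before and after each img.save() would show whether the same offsets keep changing from run to run.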

Why is this happening? We will have to ask the PIL devs.
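
As a baseline for the cp comparison above, here is a small sketch showing that a byte-for-byte copy keeps the md5sum. It assumes shutil.copyfile as the Python stand-in for cp and uses random bytes in place of real image data; md5sum is a hypothetical helper name:

```python
import hashlib
import os
import shutil
import tempfile


def md5sum(path, chunk_size=65536):
    """Return the hex md5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'original.bin')
    dst = os.path.join(tmp, 'copy.bin')
    with open(src, 'wb') as f:
        f.write(os.urandom(4096))  # random bytes standing in for image data
    shutil.copyfile(src, dst)  # byte-for-byte copy, like cp
    copies_match = md5sum(src) == md5sum(dst)
    print(f'copy has the same md5sum: {copies_match}')
```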

Comments

Friday, September 06, 2024
The md5sum thing isn't true. I edited the UserComment field for a random image using exiftool and diffed the hashes to make sure. Maybe you had a freak hash collision? I'm interested in the specific image and json where you found this
Neet Ventures
Sunday, September 08, 2024
Unfortunately, I've since lost track of which image I was testing this on, but I did more testing on another image, and you're totally right. I also found some weird anomaly with PIL, which I describe in the now updated post.