Avatar
Just a blog site.

Building Deterministic Zip Files with Built-In Commands

Zip files are not deterministic by nature, and this can cause some problems when you’re trying to do what you gotta do.

One specific use-case (my use case) was knowing when our continuous integration should deploy a new version of our AWS Lambda functions.

Why Aren’t Zip Files Deterministic

Non-determinism in zip files is not due to any of the underlying algorithms, but due to the other elements involved in building the zip file.

“Extra Information”

By default, zip will include “extra information” in the zip file. This depends on OS and such, but can include owner, creation time, access time etc.

$ openssl rand -base64 32 > a.txt
$ openssl rand -base64 32 > b.txt
$ zip - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
c9623960732a0be4afeee97b5b34f6d1
$ zip - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
857161ad22e6d31460a1ff39a9e1d955

The md5s are different, even though the files are the same.

To correct this part of the puzzle, use the -X (or --no-extra) flag.

$ openssl rand -base64 32 > a.txt
$ openssl rand -base64 32 > b.txt
$ zip -X - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
54f3bd89a49f0b7d06e387e76c67f77c
$ zip -X - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
54f3bd89a49f0b7d06e387e76c67f77c

Permissions

The permissions on files, with identical contents, will create different zip files.

$ openssl rand -base64 32 > a.txt
$ openssl rand -base64 32 > b.txt
$ zip -X - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
987e6c2956d6d34a9d6d4999af0a1c31
$ chmod +x a.txt
$ zip -X - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
8d1addee9f37fa71eaa349597222c57f
$ chmod -x a.txt
$ zip -X - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
987e6c2956d6d34a9d6d4999af0a1c31

Timestamp

The -X flag still includes the timestamp of the file, so two files with the same contents, but different timestamps, will create different zip files.

$ openssl rand -base64 32 > a.txt
$ openssl rand -base64 32 > b.txt
$ zip -X - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
9281791dffcb940fa2818722bc79a74c
$ touch -t 201301250000 a.txt
$ zip -X - a.txt b.txt | md5
  adding: a.txt (deflated -3%)
  adding: b.txt (deflated -3%)
6b55b43c7119aa413206b8953763d85c

In order to create a deterministic zip file, we need to make sure that the timestamp of each included file stays the same.

To do this we could simply set the timestamp of all the zipped files to a constant value.

$ openssl rand -base64 32 > a.txt
$ openssl rand -base64 32 > b.txt
$ ls -l
total 16
-rw-r--r--  1 pat  staff    45B Jan 25 11:43 a.txt
-rw-r--r--  1 pat  staff    45B Jan 25 11:43 b.txt
$ find . -exec touch -t 201301250000 {} +
$ ls -l
total 16
-rw-r--r--  1 pat  staff    45B Jan 25  2013 a.txt
-rw-r--r--  1 pat  staff    45B Jan 25  2013 b.txt

But in my case, I needed something a bit more clever, and I suspect you will as well…

Generating Meaningful Timestamps with Git

Setting the timestamp to a constant value does solve the problem of generating a deterministic zip, but it’s not very meaningful. What if we could get the timestamp for the most recent modification to an included file.

We can with git! (Probably other version control also, but that’s left as an exercise for the reader).

This uses the ls-files git command, pipes the output to the log git command and formats the output to just output the author date.

$ git ls-files -z . | xargs -0 -n1 -I{} -- git log -1 --format="%ad" {}
Sat Jun 17 13:34:28 2017 -0700
Thu Dec 13 14:56:15 2018 -0800
Fri May 5 13:01:16 2017 -0700
Thu Dec 13 14:31:07 2018 -0800
Sat Jun 17 10:32:11 2017 -0700
Mon Jun 19 11:16:41 2017 -0700
Thu May 18 14:38:53 2017 -0700
Tue Oct 9 09:39:04 2018 -0700
Thu Dec 20 07:50:42 2018 -0800
Thu Dec 13 15:47:04 2018 -0800
Mon Jul 10 12:48:59 2017 -0700
Tue Aug 1 15:49:51 2017 -0700
Sat Oct 6 14:48:55 2018 -0700
Wed Jan 23 18:12:33 2019 -0800
Tue Jul 11 21:55:10 2017 -0700

We now need to format the date into something we can sort, and provide as input to touch -t.

(Using the --date flag requires git version 2.6 or higher.)

$ git ls-files -z . | \
  xargs -0 -n1 -I{} -- git log -1 --date=format:"%Y%m%d%H%M" --format="%ad" {} | \
  sort -r | \
  head -n 1
201901231812

We can now put this together with touch.

$ find . -exec touch -t `git ls-files -z . | \
  xargs -0 -n1 -I{} -- git log -1 --date=format:"%Y%m%d%H%M" --format="%ad" {} | \
  sort -r | head -n 1` {} +

This will recursively set the timestamp of all files to the timestamp of the timestamp of the last change in git.

Reusable Command for CircleCI

Here is the code as a reusable command for CircleCI.

commands:
  #
  # Deterministic Zip
  #
  zip_r_deterministic:
    description: "Create a deterministic zip of a directory."
    parameters:
      before:
        type: string
        default: ""
      dir:
        type: string
      out:
        type: string
        description: Relative to CIRCLE_WORKING_DIRECTORY
      after:
        type: string
        default: ""
      date:
        type: string
        default: |
          `git ls-files -z . | xargs -0 -n1 -I{} -- git log -1 --date=format:"%Y%m%d%H%M" --format="%ad" {} | sort -r | head -n 1`
    steps:
      - when:
          condition: << parameters.before >>
          steps:
            - run:
                name: Zip Deterministic - PreProcess
                command: |
                  cd << parameters.dir >>
                  << parameters.before >>
      - run:
          name: Zip Deterministic - << parameters.dir >>
          command: |
            cd << parameters.dir >>
            export TEAK_ZIP_R_DETERMINISTIC_DATETIME=<< parameters.date >>
            find . -exec touch -t $TEAK_ZIP_R_DETERMINISTIC_DATETIME {} +
            zip -rX /tmp/teak_zip_r_deterministic.zip .
            touch -t $TEAK_ZIP_R_DETERMINISTIC_DATETIME /tmp/teak_zip_r_deterministic.zip
            cd -
            mv /tmp/teak_zip_r_deterministic.zip << parameters.out >>
      - when:
          condition: << parameters.after >>
          steps:
            - run:
                name: Zip Deterministic - PostProcess
                command: |
                  cd << parameters.dir >>
                  << parameters.after >>

This command also sets the timestamp of the resulting zip file to the timestamp of the last changed file that it contains.

This makes it easy to use the aws s3 sync to update only when there is an update to the contained code.

jobs:
  build:
    docker:
      - image: circleci/node:6.10.2-browsers

    steps:
      - checkout

      - run:
          name: Install AWS CLI
          command: sudo apt-get -y -qq install awscli

      - run:
          name: Update git
          command: |
            sudo sh -c "echo 'deb http://ftp.debian.org/debian jessie-backports main' >> /etc/apt/sources.list.d/git.list"
            sudo apt-get update && sudo apt-get -t jessie-backports install "git"

      - run: mkdir packages
      - zip_r_deterministic:
          before: yarn run prepare-for-lambda-zip
          dir: origin
          out: packages/origin.zip
      - zip_r_deterministic:
          before: yarn run prepare-for-lambda-zip
          dir: edge
          out: packages/edge.zip

      - run:
          name: Upload Lambda Packages to S3
          command: |
            aws s3 sync packages s3://your-lambda-packages

I included how to update git on the (very old) version of Debian that CircleCI uses, since that’s a headache.

Learning!

I hope this helped demystify the process of creating zip files, and some of the reasons why they can be non-deterministic.

If it didn’t, well, write your own damn blog.

all tags