Convert a large paper_trail dataset from YAML to JSON

During an upgrade of Rails from 6 to 7, I was tasked with converting a large amount of existing PaperTrail data from the default, but no longer recommended, YAML format to JSON.

The process will likely differ depending on your existing setup, but I wanted to share a useful pattern. There are compromises, yes, but this approach was fast and required a reasonable maintenance window of around 10 minutes per 1M rows in our versions table, where PaperTrail stores its data by default.

Part of the requirement was to make this change in a single deploy. There was a good reason for this, but it did pose some challenges.

This is how it works.

  1. Create a single migration to:

    • rename the columns object and object_changes to old_object and old_object_changes
    • create new columns to replace them, of type JSONB
    • kick off the data conversion
  2. Refactor existing logic that expects the old format

  3. Upgrade the PaperTrail gem and turn on the JSON serializer, so all new records are saved in the new format
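Step 3 is mostly configuration. Once the data is converted, switching PaperTrail to its built-in JSON serializer is a one-liner; the initializer path below is just the usual convention:

```ruby
# config/initializers/paper_trail.rb
# Serialize all new version records as JSON instead of the YAML default.
PaperTrail.serializer = PaperTrail::Serializers::JSON
```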

Let’s look at the code.

Create the migration

class ConvertVersionsObjectData < ActiveRecord::Migration[6.1]
  def up
    rename_column :versions, :object, :old_object
    rename_column :versions, :object_changes, :old_object_changes
    add_column :versions, :object, :jsonb
    add_column :versions, :object_changes, :jsonb
    add_column :versions, :new_format, :boolean, default: false

    QueueConvertVersionObjectData.perform_later
  end

  def down
    remove_column :versions, :new_format
    remove_column :versions, :object
    remove_column :versions, :object_changes
    rename_column :versions, :old_object, :object
    rename_column :versions, :old_object_changes, :object_changes
  end
end

Background Jobs

Queue the jobs

class QueueConvertVersionObjectData < ApplicationJob
  queue_as :default

  def perform
    PaperTrail::Version
      .where(new_format: false)
      .and(PaperTrail::Version.where.not(old_object: nil, old_object_changes: nil))
      .ids
      .in_groups_of(1000, false) do |version_ids|
        ConvertVersionObjectData.perform_later(version_ids)
      end
  end
end
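The queueing job leans on Active Support's `in_groups_of` (with `false` to skip padding the final group) to split the ids into batches of 1,000. Outside Rails, the same chunking can be sketched with plain Ruby's `each_slice`; the numbers here are purely illustrative:

```ruby
# Simulate batching 2,500 version ids into jobs of at most 1,000 ids each.
ids = (1..2500).to_a

batches = ids.each_slice(1000).to_a

# 2,500 ids become three batches: 1,000 + 1,000 + 500.
batch_sizes = batches.map(&:length)
```

Each batch would then be handed to one `perform_later` call, so a failure only re-runs that slice of the table.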

Convert the data

class ConvertVersionObjectData < ApplicationJob
  queue_as :default

  def perform(version_ids)
    versions = []

    PaperTrail::Version.where(id: version_ids).find_each do |version|
      versions << { id: version.id,
                    object: YAML.unsafe_load(version.old_object || 'null'),
                    object_changes: YAML.unsafe_load(version.old_object_changes || 'null'),
                    new_format: true }
    end

    PaperTrail::Version.upsert_all(versions) if versions.any?
  end
end
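At the heart of the job is the YAML-to-JSON conversion itself. A standalone sketch using only the Ruby standard library, with a made-up attributes hash standing in for a real old_object value:

```ruby
require 'yaml'
require 'json'

# A hypothetical example of what PaperTrail stores in the old YAML object column.
old_object = <<~YAML
  ---
  id: 42
  name: Widget
  price_cents: 1999
YAML

# Same call the job makes to parse the legacy column.
attributes = YAML.unsafe_load(old_object)

# What ends up in the new jsonb column, then read back out again.
round_trip = JSON.parse(JSON.generate(attributes))
```

For plain hashes of scalars like this, the JSON round trip preserves the data exactly, which is what the tests mentioned below should assert against real rows.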


Yes, renaming the columns is not an ideal solution. However, we could afford a decent-sized maintenance window and needed a single deploy.

Using an upsert drastically increased performance, making a single request for every 1,000 records. Bear in mind that if you are using Postgres as your database, you will need to pass along all columns with not-null constraints.

You can also write tests to ensure that running the conversion does not change existing data, and that the converted data matches the original. Hashdiff is great for this.

I hope you find this useful.

Dec 16, 2023