Deleting a Massive Amount of Data in Rails

I’ve been an avid Ruby on Rails fan since 2012 when I first got into software development. Since then I’ve had the pleasure of working with some brilliant people and building challenging pieces of software. It’s been a great time but there are certainly times where the Rails framework is less than ideal.

Let’s talk about database management. ActiveRecord is a great tool. When I was learning software development I was delighted to know that I never had to write SQL directly. I had no clue what an ORM was and didn’t even fully understand the superpowers that ActiveRecord had bestowed upon me. All I knew is that I didn’t have to learn two languages, I just had to know Ruby.

Rails gave me the power of scopes where I could store my increasingly verbose database queries into reusable ammo for my database accessing arsenal across my codebase. It was readable, it was maintainable, and it got the job done. If it as all of those things, than why am I now saying that it’s less than ideal?

The truth about ActiveRecord is that it’s slow. Especially when you’re working with HUGE amounts of data. The other day I was faced with the task of deleting about 31 million records from a PostgreSQL database. There were around 5 parent records for these 31 million that also had to be deleted. There was a typical destroy method in the controller for this parent model that deleted it through ActiveRecord. When delete was called on a parent millions of children would also be deleted through a dependent: :destroy in the parent’s model.

I hit delete and watched my server hang for close to 5 hours. I didn’t time it at first because I had no idea it was going to take that amount of time. When you set up a has_many relationship to have a dependent: :destroy it instantiates an instance for every child class and deletes it individually in case that class also has dependencies that need to be destroyed. If you know for sure that children classes don’t have any dependencies that need to be deleted you can just use dependent: :delete_all. Here’s an example with a user that has many posts. Let’s say you want to delete all the posts, and those posts don’t have any dependencies.

Running a destroy action on the user with a dependent: :destroy argument on it’s has many relationship to posts would result in something like this.

DELETE FROM "posts" WHERE "posts"."id" = $1 [["id", 1]]
DELETE FROM "posts" WHERE "posts"."id" = $1 [["id", 2]]
DELETE FROM "posts" WHERE "posts"."id" = $1 [["id", 3]]
DELETE FROM "posts" WHERE "posts"."id" = $1 [["id", 4]]
DELETE FROM "posts" WHERE "posts"."id" = $1 [["id", 5]]
DELETE FROM "users" WHERE "users"."id" = $1 [["id", 4]]

Running a destroy action on the user with a dependent: :delete_all argument on it’s has many relationship to posts would result in something like this.

DELETE FROM "posts" WHERE "posts"."user_id" = $1 [["user_id", 4]]
DELETE FROM "users" WHERE "users"."id" = $1 [["id", 4]]

For a user that has only a couple of posts, this isn’t a huge performance problem, but when you’re taking about thousands to hundreds of thousands of records it adds up.

When children records don’t have any dependencies that would have to be deleted from the database use dependent: :delete_all instead of dependent: :destroy.

Now there’s also the scenario that I ran into which means that neither of these options are the correct way to handle the problem of deleting dependent records. When you’re talking about millions of records being deleted, you should setup a background task to handle it. This is helpful because it won’t block the users experience by making them wait to continue interacting with your application.

What I ended up doing was setting up a Redis Server to handle a Resque queue. On the destroy method of the aforementioned parent controller, I enqueued a task, passing in the id of the parent record in the database.

This task looks something like this…

def self.perform(user_id)
	user = User.find(user_id)
	number_of_posts = user.posts.count

	# Determine how many batches need to be run
	number_of_iterations = (posts.to_f / 1000).ceil

	# Delete posts
	(1..number_of_iterations).each do |i|
	    user.posts.limit(1000).delete_all
	end

	user.destroy
end

When dealing with MASSIVE amounts of data it’s good to batch the deletion. If you delete more than 5,000 rows in a single transaction, your database will lock. This means the entire table is inaccessible by any other running process for the duration of the transaction. This can mean some serious issues for the users of your site while a DELETE is happening.

If you’re dealing with more than 5,000 records at a time move the deletion of those records to a background process where you batch the deletion of those records.

Now we’re able to handle all different types of scenarios when removing records from a database using ActiveRecord. Happy deleting!

Need help with a project? We'd love to hop on board. Hire Us.

Get In
Touch

Must be a valid email address.