Something something Dropbox - Part 1

Fri 01 March 2013

Something that has never made sense to me is the unicorn status of Dropbox. To be clear, I'm not being salty and hating on them - I'm sure the engineering effort that went into creating that service involved substantial creativity. What puzzles me is this: given that the competitor landscape includes Google Drive (which, for the record, has never failed me), why on earth would anyone still opt for Dropbox? I mean, Google Drive also integrates seamlessly with a bunch of Google's other products which, even as a Dropbox user, you are very likely to be using, e.g. Gmail. I dunno, I guess their timing was good, or I guess I'm just too locked in to the Google ecosystem to understand...

This discussion spilled over last evening and ended with the inevitable "if you feel that strongly, why don't you try building it yourself?" Challenge accepted. Well, at least, part of it. I was only intrigued enough to implement two of its features:

  • automatic backup and syncing across multiple devices and
  • support for simultaneous editing.

I wasn't concerned with the engineering problems of scale because a) I don't know enough to deal with those and b) it was highly unlikely that anyone except me would use what I ended up making. This post is an admittedly hacky exploration of the automatic backup and sync problem.

To break down the problem, what we need to do is detect when a file is modified and sync those changes with a backup of the file that's sitting somewhere on the interwebz. Originally, I used an AWS EC2 instance to host my file backups. But then, since Google Drive clients on Linux (at least, Ubuntu) are a joke, I decided to use Google Drive itself.

Part 1: Detecting filetree modifications within a particular directory

I expected this part to be the most challenging: writing a script that runs in the background and periodically checks for modifications to the tree structure of a particular directory, making it resource-efficient, figuring out whether there is a way to use interrupts instead of polling, etc. But wait, doesn't Linux itself keep track of file modifications? Surely this has to be implemented in the OS itself. It is, via inotify - a kernel subsystem that watches files and directories and reports filesystem events on them to user-space programs. What's more, with support for event flags like IN_CREATE, IN_MODIFY, IN_OPEN etc., I could get fine-grained control over the triggers for my file sync. Huh, that sounds like it solves all my problems. The steps from there were fairly straightforward.
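
For contrast, here is a rough sketch of the polling approach I was dreading before inotify showed up. It is not what I ended up using; the function name changed_since and the marker-file trick are just assumptions for illustration. The idea is to keep a marker file and ask find for everything modified after it.

```shell
#!/bin/sh
# Polling sketch: list files under a directory that were modified after a
# marker file was last touched. The name changed_since and the marker path
# are hypothetical, chosen for illustration only.

# Print every regular file under $1 whose mtime is newer than marker file $2.
changed_since() {
    dir=$1
    marker=$2
    find "$dir" -type f -newer "$marker"
}

# Example polling loop (commented out so the sketch can be sourced safely):
# marker=$(mktemp)
# while sleep 30; do
#     touch "$marker.next"               # timestamp taken before the scan
#     changed=$(changed_since /path/to/directory "$marker")
#     mv "$marker.next" "$marker"        # next scan starts from that timestamp
#     [ -n "$changed" ] && echo "$changed"
# done
```

Even this toy version walks the whole tree every 30 seconds whether or not anything changed, which is exactly the waste that kernel-level notifications avoid.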

First, we install incron, which provides incrontab, basically a crontab whose entries are triggered by file notifications instead of a schedule -

$ sudo apt-get install incron

incrontab functions very much like crontab, including the -l and -e flags for common operations. The next step is to add an entry using incrontab -e. However, you first need to make sure that you are among the set of users allowed to edit it. To do that, edit the incron.allow file like so -

$ sudo vim /etc/incron.allow

... and add your username (in my case, this was varad) to the file. Once you save it, run incrontab -e and add an entry to track changes in your directory of choice like so -

/path/to/directory IN_MODIFY <command to execute on detecting change>

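After experimenting with a bunch of flags, I concluded that IN_MODIFY suited this use case the best. However, there is a more comprehensive list of possible triggers detailed here. One caveat: inotify watches are not recursive, so an entry like this only picks up changes to files directly inside the directory, not inside its subdirectories.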
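
If you want to run the same flag experiment yourself, a small logging handler makes it easy to see which events actually fire. The sketch below is an assumption on my part rather than something from my original setup: the script path /path/to/log-event.sh, the function log_event and the log location are all hypothetical. It relies on incron's wildcards, where $@ expands to the watched path, $# to the name of the file that triggered the event, and $% to the event flags as text.

```shell
#!/bin/sh
# Hypothetical handler for experimenting with incron event masks.
# Assumed incrontab entry (one line):
#   /path/to/directory IN_MODIFY,IN_CREATE,IN_CLOSE_WRITE /path/to/log-event.sh $@ $# $%
# incron expands $@ to the watched path, $# to the file name that triggered
# the event, and $% to the event flags as text.

# Append one line per event to a log file: timestamp, event flags, full path.
# The optional fourth argument overrides the (assumed) default log location.
log_event() {
    watched_dir=$1
    file_name=$2
    event_flags=$3
    log_file=${4:-"$HOME/incron-events.log"}
    printf '%s %s %s/%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "$event_flags" "$watched_dir" "$file_name" >> "$log_file"
}

# When incron runs this file as a script, forward the arguments it passes.
# Guarded so that sourcing the file without arguments logs nothing.
if [ "$#" -ge 3 ]; then
    log_event "$@"
fi
```

Save a file in the watched directory from your editor of choice, then look at the log to see how many events a single save really produces.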

Next, we look at the command that does the remote file sync, which is the last field we need to provide in the entry above.

Part 2: Syncing files

I have updated this section of the post to showcase a nifty tool - rclone - that I've since switched to using to do the heavy lifting for me. Essentially, what we want to do in this step is push changes from the local file to its remote backup as efficiently as possible, which is what rsync was built to do. rclone is often described as rsync for cloud storage: it is a separate tool rather than a wrapper around rsync, and instead of patching diffs it skips files that are unchanged and re-uploads the ones that changed. It has built-in support for a host of different services including Amazon S3, Google Drive, Dropbox etc. and does a lot of the required configuration for each of these services under the hood. This saves me from having to explain the script I wrote to recreate rsync functionality.

To get rclone on your machine, just follow the instructions on the Downloads page. Once you have it on your machine, set it up to access your Google Drive account using rclone config as detailed here, and give the remote a descriptive name, viz. gdrive, instead of remote as suggested in the link. We now have a command to sync files to our backup. So, the command we provide as the last field of the incrontab entry above is -

rclone copy /path/to/directory/you/want/to/sync gdrive:/remote/folder/path

The complete incrontab entry, where the watched directory and the directory being copied are the same, then looks like so -

/path/to/directory/you/want/to/sync IN_MODIFY rclone copy /path/to/directory/you/want/to/sync gdrive:/remote/folder/path

Save the file, and that's it! incron will run diligently in the background doing automatic file backups forever after. And you didn't even have to make an account.
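
One thing to watch out for: IN_MODIFY can fire several times while a single file is being written, and incron launches the command for each event, so several rclone runs may end up overlapping. A minimal sketch of one way around this is a wrapper that skips the sync when another one is already in flight. The wrapper path, the function run_exclusive and the lock location are all assumptions for illustration, not part of my actual setup.

```shell
#!/bin/sh
# Hypothetical wrapper (say, /path/to/sync-once.sh) to use as the incrontab
# command instead of calling rclone directly. The lock directory path and the
# function name are assumptions made for this sketch.

# Run the given command unless another caller already holds the lock.
# mkdir is atomic, so only one caller can create the lock directory.
run_exclusive() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        return 0    # a sync is already in flight; skip this event
    fi
    status=0
    "$@" || status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Intended use inside the wrapper script (commented out in this sketch):
# run_exclusive /tmp/gdrive-sync.lock \
#     rclone copy /path/to/directory/you/want/to/sync gdrive:/remote/folder/path
```

There are trade-offs. A skipped event means a change made mid-sync may sit around until the next modification triggers another run, and if the wrapper is killed before rmdir the stale lock has to be removed by hand. flock from util-linux is a common alternative to the mkdir trick, and the incrontab man page also describes an IN_NO_LOOP mask symbol aimed at a similar problem, so check what your version supports.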

Part 3: Pulling changes from the backup

This completes one part of the story. We also need to account for pulling changes in from the backup, to give ourselves the freedom of editing wherever we please. Since I decided to procrastinate on the simultaneous editing problem (I didn't want to deal with mutexes and synchronization edge cases just yet), I just created a simple cronjob to pull the changes periodically on a conservative interval of 5 minutes, using rclone copy as before but with the source and destination swapped. Inefficient, I know. Sue me.

$ crontab -e

# Add the following line to the file that opens up to schedule a pull every 5 minutes
*/5 * * * * rclone copy gdrive:/remote/folder/path /path/to/local/synced/directory/

For a comprehensive understanding of crontab syntax, check out Crontab Guru.
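
One risk with a blind periodic pull is that it can overwrite a local edit that has not been pushed yet. A hedged sketch of a slightly safer pull uses rclone's --update flag, which skips files that are newer at the destination. The helper name pull_backup and the idea of putting it in a script for cron to call are my assumptions here; --update and --dry-run are standard rclone flags.

```shell
#!/bin/sh
# Hypothetical pull helper for the cron job. The function name and paths are
# placeholders; --update and --dry-run are standard rclone flags.

# Copy remote -> local, skipping files whose local copy is newer, so a
# scheduled pull does not clobber edits that have not been pushed yet.
# Extra arguments (for example --dry-run) are passed through to rclone.
pull_backup() {
    remote=$1
    local_dir=$2
    shift 2
    rclone copy --update "$@" "$remote" "$local_dir"
}

# Intended use (commented out in this sketch):
# pull_backup gdrive:/remote/folder/path /path/to/local/synced/directory --dry-run
```

Running it with --dry-run first shows what would be transferred without touching anything. Also keep in mind that rclone copy never deletes files at the destination, so a file removed on one machine will not disappear from the others with this setup.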

And that's the first Dropbox functionality implemented! In Part 2, we take on the classic problem of enabling simultaneous editing.