This was a problem that needed multiple solutions. First, we needed to know when data were being added. For this, we settled on a once-a-day check of the FTP site; because the check ran on a schedule, its frequency could be increased as needed, up to every minute if the client required it.
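A scheduled check like this is commonly wired up with cron. A minimal sketch of the crontab entry (script path, interpreter path, and log location are all hypothetical):

```
# Run the FTP check once a day at 02:00.
# For a tighter schedule, change the first five fields
# (e.g. "* * * * *" runs every minute).
0 2 * * * /usr/bin/python3 /opt/jobs/ftp_check.py >> /var/log/ftp_check.log 2>&1
```

Redirecting output to a log file also supports the audit trail described later in this workflow.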
The next issue was avoiding duplicate data on our end. Downloading the entire contents of the FTP site every night would be overkill, and downloading everything to compare it locally would waste time and bandwidth. Instead, each download job appended to a file manifest: if a file was already in the manifest (meaning it had already been processed and made available for review), it was skipped; if it was new, it was downloaded.
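The manifest check can be sketched with Python's standard `ftplib`. This is a minimal illustration, not the production job; the manifest filename, download directory, and function names are assumptions for the example:

```python
from ftplib import FTP
from pathlib import Path

MANIFEST = Path("manifest.txt")   # hypothetical manifest location
LOCAL_DIR = Path("incoming")      # hypothetical download directory

def select_new(remote_names, seen):
    """Return only the remote files not already recorded in the manifest."""
    return [name for name in remote_names if name not in seen]

def download_new_files(host, user, password):
    """Nightly job: fetch files not yet in the manifest, then record them."""
    seen = set(MANIFEST.read_text().splitlines()) if MANIFEST.exists() else set()
    LOCAL_DIR.mkdir(exist_ok=True)
    with FTP(host) as ftp:
        ftp.login(user, password)
        new_files = select_new(ftp.nlst(), seen)
        for name in new_files:
            with open(LOCAL_DIR / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
    # Append (not overwrite) so earlier runs stay on record
    with MANIFEST.open("a") as mf:
        mf.writelines(name + "\n" for name in new_files)
    return new_files
```

Appending to the manifest rather than rewriting it means the file doubles as a running history of every download job.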
Each time files were downloaded to our repository, a ticket was created in our ticketing system and stakeholders were notified, including the client and our in-house processing team. The ticket included the manifest of downloaded files, giving everyone an easy reference to what would be available for review next.
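The ticket body itself is straightforward to generate from the night's manifest. A small sketch (the function name and wording are illustrative; the actual call into a ticketing system's API is vendor-specific and omitted):

```python
def format_ticket_body(new_files):
    """Build a ticket description from the night's download manifest."""
    lines = ["Files downloaded and queued for processing:", ""]
    lines.extend(f"  - {name}" for name in sorted(new_files))
    return "\n".join(lines)

# The resulting text would be submitted to the ticketing system and
# included in stakeholder notification emails.
```

Sorting the list makes the nightly tickets easy to scan and compare against earlier runs.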
After download, it was a question of quickly processing, imaging, OCRing, and indexing the new data. Templates and further automation made this nightly process fast and repeatable, and preserving the logs, manifests, and all original source data kept it defensible.