This was a problem that needed multiple solutions. First, we needed to know when data were being added. For this, we settled on a once-a-day check of the FTP site; because the check ran on a schedule, its frequency could be increased as needed, up to every minute if the client required it.
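A scheduled check like this is commonly wired up with cron. A minimal sketch of the crontab entry (script path, interpreter path, and log location are all hypothetical):

```
# Run the FTP check once a day at 02:00.
# For a tighter schedule, change the first five fields
# (e.g. "* * * * *" runs every minute).
0 2 * * * /usr/bin/python3 /opt/jobs/ftp_check.py >> /var/log/ftp_check.log 2>&1
```

Redirecting output to a log file also supports the audit trail described later in this workflow.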
The next issue was avoiding duplicate data on our end. Downloading the entire contents of the FTP site every night would be overkill, and downloading everything to compare it locally would waste time and bandwidth. Instead, each download job appended to a file manifest: if a file was already in the manifest (meaning it had already been processed and made available for review), it was skipped; if it was new, it was downloaded.
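The manifest check can be sketched with Python's standard `ftplib`. This is a minimal illustration, not the production job; the manifest filename, download directory, and function names are assumptions for the example:

```python
from ftplib import FTP
from pathlib import Path

MANIFEST = Path("manifest.txt")   # hypothetical manifest location
LOCAL_DIR = Path("incoming")      # hypothetical download directory

def select_new(remote_names, seen):
    """Return only the remote files not already recorded in the manifest."""
    return [name for name in remote_names if name not in seen]

def download_new_files(host, user, password):
    """Nightly job: fetch files not yet in the manifest, then record them."""
    seen = set(MANIFEST.read_text().splitlines()) if MANIFEST.exists() else set()
    LOCAL_DIR.mkdir(exist_ok=True)
    with FTP(host) as ftp:
        ftp.login(user, password)
        new_files = select_new(ftp.nlst(), seen)
        for name in new_files:
            with open(LOCAL_DIR / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
    # Append (not overwrite) so earlier runs stay on record
    with MANIFEST.open("a") as mf:
        mf.writelines(name + "\n" for name in new_files)
    return new_files
```

Appending to the manifest rather than rewriting it means the file doubles as a running history of every download job.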
Each time files were downloaded to our repository, a ticket was created in our ticketing system and stakeholders were notified, including the client and our in-house processing team. The ticket included the manifest of downloaded files, giving everyone an easy reference to what would be available for review next.
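The ticket body itself is straightforward to generate from the night's manifest. A small sketch (the function name and wording are illustrative; the actual call into a ticketing system's API is vendor-specific and omitted):

```python
def format_ticket_body(new_files):
    """Build a ticket description from the night's download manifest."""
    lines = ["Files downloaded and queued for processing:", ""]
    lines.extend(f"  - {name}" for name in sorted(new_files))
    return "\n".join(lines)

# The resulting text would be submitted to the ticketing system and
# included in stakeholder notification emails.
```

Sorting the list makes the nightly tickets easy to scan and compare against earlier runs.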
After download, it was a question of quickly processing, imaging, OCRing, and indexing the new data. Templates and further automation made this nightly process fast and repeatable, and preserving the logs, manifests, and all original source data kept it defensible.