It isn't clear from the design doc, but it sounds like we're proposing to develop this from scratch as a set of Python scripts? Are there really no existing software tools that we could re-purpose for this? Something based on WebDAV or its successors? Why not Globus Transfer?
XSEDE shouldn't be developing stuff that has to be maintained long-term when there are other options available. It's disappointing to me that we wouldn't simply do some googling and find an off-the-shelf tool for "collecting files from multiple sources".
I think it would be very easy to use Globus Transfer for this. Bring up a GCS server on the host and set up shared endpoints for the various incoming directories. SPs register & authenticate with XSEDE, we add them to groups (either on their request or by invitation) with permission to their shared endpoint, and they use the Globus CLI to upload their files. The CLI can be automated as easily as scp. A significant benefit is that we are using XSEDE's user identity/auth mechanisms rather than falling back to ssh keys.
For the server-side stuff (moving from incoming to repository directories, generating usage reports, etc.), surely there are existing software packages that do this? What about XDMoD? This seems like a very common use case, so there must be off-the-shelf options available.
The scripts we are proposing to develop:
1) A script that identifies the configurable subset of files in a (log) directory that contain usage that needs to be transferred, runs a simple parser to convert those files into a standard format, and then scp's them to the central server. This script is necessary regardless of which transfer software we use because something needs to identify the files and pass them to the transfer tool.
2) Scripts that parse/convert log or other usage records into a standard format. These are relatively easy to write. I thought it would be less effort overall to centralize a standard usage record than to centralize a bunch of different usage records and have to write usage analysis reports for each usage record format. Certainly, if a tool already has usage analysis capabilities, then we can use that analysis as is. Incidentally, we did explore log analysis tools and may want to adopt one. At this point most of them seemed like overkill for what is a relatively simple process of converting usage records to a standard format, copying them to a remote system, and then producing three simple reports. For relatively small projects it's sometimes a tough call whether to invest in a brand new tool that does way more than one needs or to write some simple scripts that only do what we need. I'm not sure we're making the right choice, but I do know the choice we're recommending is easy to implement.
3) We're not developing a transfer tool; we're using scp. This is the least-effort way to move small files between servers because it uses existing software (nothing new to install and maintain) and is reliable enough for this application.
4) Moving files around on the server: again, this is a very simple set of scripts that copy files deposited by remote systems into a central directory that is not accessible to the clients depositing the files. Arguably, this is the part that we could perhaps totally replace with a RESTful PUT (WebDAV) or AMQP publish alternative to transfer usage records. If we control the code that receives the usage, then it's easy to put it where we want it to begin with. If the source system controls the destination (as is the case for Globus Transfer or scp), then something on the destination will have to move the files from the client-accessible directory to a place where the client can no longer touch them. An alternative would be to use a pull model so the client doesn't control the destination. We opted in this case to not give a single central service the ability to pull from a bunch of remote servers, for security reasons.
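To give a sense of scale for items 1, 2, and 4, here is a rough Python sketch of the pieces. It is only an illustration under assumptions, not the proposed implementation: the three-field input line format, the CSV output format, the `.usage.csv` suffix, and all function names are hypothetical and do not come from the design doc.

```python
import re
import shutil
from pathlib import Path

# Hypothetical raw format: "<timestamp> <user> <component>" (an assumption).
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<user>\S+)\s+(?P<component>\S+)$")


def convert_line(line):
    """Return a CSV usage record for one raw line, or None if it is not usage."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return ",".join((m["ts"], m["user"], m["component"]))


def select_files(log_dir, pattern):
    """Item 1: pick the configurable subset of files in a log directory."""
    return sorted(p for p in Path(log_dir).glob(pattern) if p.is_file())


def convert_file(src, out_dir):
    """Item 2: write the standard-format version of one log file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / (Path(src).name + ".usage.csv")
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            record = convert_line(line)
            if record is not None:
                fout.write(record + "\n")
    return dst


def scp_command(local_file, remote):
    """Build the scp argument list (-q quiet, -B batch mode, no prompts).

    The caller would pass this list to subprocess.run(..., check=True).
    """
    return ["scp", "-q", "-B", str(local_file), remote]


def sweep_incoming(incoming_dir, repository_dir):
    """Item 4: move deposited files out of the client-writable directory.

    This sketch does not detect files that are still being uploaded.
    """
    repo = Path(repository_dir)
    repo.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(Path(incoming_dir).iterdir()):
        if src.is_file():
            dst = repo / src.name
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved
```

On an SP host, a cron job would call `select_files`, `convert_file`, and then run the `scp_command` result; on the central server, a second cron job would call `sweep_incoming`. Swapping scp for another transfer tool would only change `scp_command`.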
How about if we include in this design a recommendation to evaluate and replace scp in the future with a RESTful, AMQP, or GCPv5 method of transferring usage files? It's worth doing, but will take more time, considering we already have everything we need for a simple scp implementation.
Really good feedback. Thanks.