Hey Galen,
I hear you’ll have some iron-man capabilities you can use starting tomorrow. I can’t wait to see them!
So, the elevator speech version of what we need to do is to “track the ongoing usage for XSEDE delivered software and services for use in ROI analysis”.
There are three basic pieces:
1) Gathering component specific usage and sending it to a central usage repository
2) Managing the contents of the central usage repository
3) Usage analysis
About 1) Gathering component specific usage:
We are spinning up separate activities to enhance specific components to record usage information. Some tools already track usage in Apache or as syslog entries. Others will need to be enhanced to do so. We don’t need you to worry about this piece of work other than to recognize that we will have Apache, syslog, and other usage logs that need to make it into a central repository. We will use scp or other simple methods to move server usage logs to our central repository.
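Just to make that concrete (a sketch only; the central host name, paths, and ssh account below are placeholders, not decisions), a nightly cron job on each server could be as simple as:

#!/usr/bin/env python3
# Sketch only: nightly copy of one raw usage log to the central repository.
# The central host, destination path, and ssh account are placeholders.
import socket
import subprocess
from datetime import date, timedelta

yesterday = (date.today() - timedelta(days=1)).isoformat()
local_log = "/var/log/httpd/access_log"   # whatever raw usage log this server produces
remote = ("usage@central-repo.example.org:"
          f"/incoming/{socket.gethostname()}.access_log.{yesterday}")

subprocess.run(["scp", "-q", local_log, remote], check=True)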
About 3) Usage analysis:
We would like to replicate our central usage repository in XDMod so that they can build the sophisticated usage analysis interfaces; you don’t need to worry about sophisticated usage analysis. If tools you find have some basic analysis functionality, that’s OK.
Now part 2) central usage repository is the part we need you to focus on:
We need software that can manage a repository of component usage/access information in flat files or a database. Each record should include the following (a rough sketch of one possible record layout follows the list):
Required:
- What component and version was used
- When the usage happened/started (UTC time)
- Where usage happened (client and/or server)
Optional:
- Who initiated the usage (local operating system user and/or XSEDE identity)
- Usage parameters or flags (Apache URLs could be considered a parameter)
- When the usage finished (UTC time)
- Resources consumed
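To make the list above concrete, here is a rough sketch (nothing more) of what one normalized record could look like; the field names and types are illustrative, not a proposal:

# Rough sketch of one normalized usage record; names/types are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UsageRecord:
    component: str                    # required: what component was used
    version: str                      # required: which version
    start_utc: str                    # required: when usage happened/started (ISO 8601, UTC)
    where: str                        # required: client and/or server host
    user: Optional[str] = None        # optional: local OS user and/or XSEDE identity
    parameters: Optional[str] = None  # optional: flags, Apache URL, etc.
    end_utc: Optional[str] = None     # optional: when usage finished (UTC)
    resources: Optional[str] = None   # optional: resources consumed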
Any questions?
Thanks,
JP
From Galen:
JP,
I'm not thinking about it yet, but yes, questions/comments:
-Resources consumed would require job info (catching job usage, but perhaps missing interactive usage or login node usage)
-Have you looked at xalt? This sounds like that. A lot. And it's a hot mess at times. :) It does work most of the time.
,Galen
Galen Arnold
NCSA - sr systems engineer
From JP to Galen,
Resources consumed is optional and not of primary importance. Certainly not important enough to install special software wherever users invoke an XSEDE tool or service. I added this optional field because it might be useful, for example, to record how long the user used a tool, or how fast it performed the task they requested.
JP
Galen,
This is usage collection that Shava, Lee, Jim, and I have prioritized for the next ~6 months. We've already started independent activities to instrument and gather this usage, which will need to be aggregated and managed with the tool(s) you are helping us find.
Thanks,
JP
I can't stand it when people try to remove SP SSH servers. :)
Do you have some sample data? This is sounding more and more like a problem looking for a database to me. There are some Unix and other tools for doing similar things, but not targeted toward the data sources you're thinking of (xalt reporting, Unix System V process accounting, oprofile, which has started to supersede old process accounting). I'll keep thinking on it some next week, but if we reach a point where the conclusion is "we need a database", I'm going to remind you over and over again that I'm an HPC Fortran/C/C++ analyst now. :)
Not to worry. We're not trying to develop a new database, at least not now. We just need you to try to find potential tools that we could use to collect and manage this type of information. It's a secondary concern whether those tools use a database, flat files, or some cloud data service, as long as they easily manage the type of information we need to manage.
If you/we don't find appropriate tools and need to develop something custom, we can have someone else do that.
JP
Is this a new use case for XSEDE's data transfer analysis and logging infrastructure (https://jira.xsede.org/browse/SDIACT-200)? "This will include a service with a backend database and REST API to allow for querying of the database on-demand." Seems to me that usage logs and performance logs have similar storage/query/analysis requirements.
This is an example IPF usage record:
=INFO REPORT==== 17-Nov-2017::12:40:19 ===
connection <0.3551.0> (129.114.62.15:39053 -> 192.249.6.5:5671): user 'xsede-tacc' authenticated and granted access to vhost 'xsede'
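For illustration only, a record like that one could be mapped onto the required fields with a few lines of parsing (a sketch; note the raw timestamp is not UTC, so a real converter would still have to normalize it):

# Sketch: map the IPF/RabbitMQ log entry above onto the required fields.
import re

header = "=INFO REPORT==== 17-Nov-2017::12:40:19 ==="
detail = ("connection <0.3551.0> (129.114.62.15:39053 -> 192.249.6.5:5671): "
          "user 'xsede-tacc' authenticated and granted access to vhost 'xsede'")

when = re.search(r"==== (.+?) ===", header).group(1)          # 17-Nov-2017::12:40:19 (local time)
m = re.search(r"\((\S+) -> (\S+)\): user '([^']+)'", detail)
client, server, user = m.groups()

record = {"component": "IPF/RabbitMQ", "version": "unknown",
          "start": when, "client": client, "server": server, "user": user}
print(record)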
Venkat may be able to provide sample SSO hub and CI login usage records.
I think xdusage records will be custom, but will likely look like web server log entries.
Hi JP,
In regard to XCI-185 (SSO Hub usage metrics), which I have been working on, should I assume the usage scripts will run on the SSO Hub (as I am currently assuming) or on the remote repository? In the latter case, do you intend for a separate set of tools to be developed to analyze usage?
Thanks,
venkat
Yes, I propose the following design elements:
What do you think?
This will all need to make it into a design document that gets reviewed...
JP
Hi JP,
In discussing this with Jim, it looks like we could incorporate the generation of files with relevant/filtered data for copying to the central repo in the future. This would be in addition to the analysis tools that could be run on the SSO Hub itself for now.
Thanks,
venkat
Are these tools already part of the OS or were they developed specifically for use on the SSO hub? The reason this is relevant is that we already have a second SSO-hub-like server called kepler.xsede.org, and we are likely to have more in the future. If it's easy to install and use third-party analysis tools on each server, it's OK to do that. If we're developing these tools, it may be easier to maintain and use them on a single central server.
I recognize there are many possible ways to do this. We want to find the optimal way to manage and process this usage information.
Another reason to aggregate the data in a central server is so that we can better manage the data long term: keep it for as long as we want, have it even if the server is compromised and needs to be rebuilt, ....
Thanks,
JP
Hi JP,
It's a set of scripts developed specifically for the SSO Hub to process wtmp and auditd logs. Since they will be delivered in an RPM, it should be easy to install them on multiple systems. Also, the idea is that they can be modified to run on the central repo once it is up.
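For example (a simplified sketch, not one of the actual delivered scripts, and assuming a util-linux `last` that supports --time-format), the wtmp side boils down to something like:

# Simplified sketch: pull login sessions out of wtmp via `last` and print one
# usage line per session. Column positions in `last` output can vary (e.g.,
# console logins with no host), so a real script would need hardening.
import subprocess

out = subprocess.run(["last", "--time-format", "iso"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    parts = line.split()
    if len(parts) < 4 or parts[0] in ("reboot", "wtmp"):
        continue
    user, tty, host, start = parts[0], parts[1], parts[2], parts[3]
    print(f"component=ssohub-login user={user} tty={tty} from={host} start={start}")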
Thanks,
venkat
Before I answer I want to clarify that I'm contributing design _opinions_ for all of us to discuss and accept/reject/modify.
I think an efficient design for this system is:
Where the usage record is generated, have minimal code to record the usage record and periodically (daily?) forward a filtered subset of raw usage records to a central repository.
In the central repository, convert raw usage records into a more uniform format and store them in a usage analysis repository. Also on the central repository, have the usage analysis tools that display analysis results via a command line and/or web interface.
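To illustrate the "simple usage analysis" end of that (a sketch only, assuming converted records are kept one JSON object per line; the path and field names are placeholders):

# Sketch: count usage by component from a flat file of normalized records
# (one JSON object per line). File path and field names are placeholders.
import json
from collections import Counter

counts = Counter()
with open("/srv/usage-repo/usage.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        counts[rec["component"]] += 1

for component, n in counts.most_common():
    print(f"{component}\t{n}")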
Galen is searching for tool(s) to handle raw usage record conversion, loading into a uniform-format repository, and simple usage analysis. If we can't rely on XDMod for advanced analysis, then we need to find a tool that does all required analysis. If we can't find appropriate tool(s), we'll build whatever we need with the minimal required effort.
Converting, merging, and analyzing usage records isn't rocket science, but it's better to leverage an existing tool if we can find one than to develop something custom ourselves, because we'll get much more functionality for the effort we have to put into it (hopefully).
JP
In my reading of XSEDE's privacy policy (http://hdl.handle.net/2142/73408), giving non-XSEDE staff (e.g., XDMoD) access to this information would be a policy violation.
See also https://jira.xsede.org/browse/REVIEW-11 for security concerns around centralized collection of non-anonymized user activity logs.
I believe the XDMod project is considered part of the same NSF XD program and "in the fence". They currently access all XSEDE's non-anonymized accounting data.
Should we get a ruling from Operations Security on whether this policy affects our ability to share tool usage information with the XDMod folks? If so, do you want to ask them or should I?
JP
I'll ask Adam and JAM to chime in.
Hello JP/Jim, the sharing of user data with XDMod is against our privacy policy. The L2 of Operations is aware of this conflict, and we're considering adding this to the risk registry until it is resolved. I would refrain from increasing the data sharing outside of XSEDE until this gets addressed.
Jim
List of tools reviewed and considered
Observations
Hey Galen,
This is a great list of log analysis tools. Thanks for pulling it together and reviewing them. Looking back at the original base requirement:
We need to manage a central repository of usage information that contains these required fields:
- What component and version was used
- When the usage happened/started (UTC time)
- Where usage happened (client and/or server)
And these optional fields:
- Who initiated the usage (local operating system user and/or XSEDE identity)
- Usage parameters or flags (Apache URLs could be considered a parameter)
- When the usage finished (UTC time)
- Resources consumed
I understand that these tools may in some cases generate log entries, and in most cases understand and analyze existing log entries. Can they manage a repository (files or database) that contains the above required and optional fields?
I'm wondering if we're dealing with a relatively simple and specific set of needs that would be best addressed with some simple custom software/scripts that we write?
Thanks,
JP
JP,
After talking with some others here, I think the baseline to meet or exceed is something along the lines of:
"Can we do better than a well organized directory structure using the appropriate tools that come with the associated data/log source ? " For example the linux log monitoring and watching packages already handle almost anything in a common syslog format. I'm going to meet with Kay in our security team and get a demo of Splunk this week and have a couple sentences and thoughts on it as well. Of the things reviewed so far, I think logalyze is probably the most full-featured and can ingest the greatest variety of information in various formats. Linux logwatch is also good for anything resembling syslog information. I'd advocate for implementing one or a combination of existing tools over writing new. The old admin in me really likes the linux packages because they will probably persist going forward even if they're not software du jour. logalyze and similar have some operational overhead [ java, 1 billion threads :) ] , are very programmable but probably require more development and layout to get what you want. I suspect you're going to end up with a mix of a couple different tools in the end.
Hey Galen,
It makes sense that we might need several tools to process different usage record formats, syslog, http logs, audit logs, etc. When we instrument a new tool to track usage, say an API, we need to decide what log format to use so that we don't have to introduce yet another log analysis tool. So, it seems as though we need to identify a preferred set of formats and the smallest possible set of tools that processes those formats, to minimize our learning curve and support costs.
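For instance (a sketch only; the ident string and key=value layout are placeholders, not a decided format), a newly instrumented tool could write its usage record straight to syslog so that whatever syslog-oriented tool we pick can ingest it unchanged:

# Sketch: a newly instrumented component emitting one usage record to syslog.
# The ident string and key=value layout are placeholders, not a decided format.
import syslog
from datetime import datetime, timezone

syslog.openlog(ident="xsede-usage", facility=syslog.LOG_LOCAL0)
syslog.syslog(syslog.LOG_INFO,
              "component=mytool version=1.2 "
              f"start={datetime.now(timezone.utc).isoformat()} "
              "server=myhost.example.org user=jdoe")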
I'll take a look at logalyze to see what it can do.
Thanks,
JP
I met with Kay Avila in our security group and she showed me a bit about Splunk. It's a very flexible business informatics tool, but it's commercial. There's a free version that will work with up to 500MB/day. Some of the standout features: a GUI query will also generate the unix-like search string for the same query. It will monitor files and streams, run your customized scripts on a system, and knows about a wide variety of data formats. It's also simple enough to define your own format for custom input. Anything in the data can become a field that Splunk can then associate with other data sources. For example, you can define MYUSER and have it be the user name in 2 or 3 different data streams and formats, then do searches, queries, and reports on MYUSER.
It's been the most impressive thing I've seen so far, like logalyze turbo-boosted to 1.5 bar.
It looks like we have access to training if you want a deeper dive : https://www.internet2.edu/blogs/detail/10079
Galen,
This sounds very good. When you're done evaluating the tools, please produce a brief, formal summary and an ordered list of recommendations.
Thanks,
JP
The decision tree is yours to walk. I see it something like this in outline form, with the major (leftmost) bullets being choices and the minor ones being sub-choices within that choice. I'm still leaning toward logalyze after going back through my notes and looking at their site again. They're open source and support a lot of data formats out of the box. If it were my call, I'd see what is a good fit for logwatch and do some of it with that, saving the more complex reporting for logalyze. There's no penalty for doing some of it both ways. The >50 admin in me likes logwatch because it's more Linux/Unix. You're going to get prettier reports and more options, though, with logalyze.
If you wanted an interesting student project for an intern, I think you'd get motivated people on the machine learning/big data angle, and they could get hands-on experience with a hot, marketable skill. I attended the PSC Big Data workshop a few days ago, so that made the final cut (Apache Spark). Since the log data will persist, I would not consider any of these to be mutually exclusive. If you can get 90% of what you want with logalyze and Linux logwatch, that's a win. Hire a student or intern to get the Big Data experience and go after the query/report that's elusive.
Galen,
Thanks for looking at several tools and recommending both software options and potential student projects.
Based on your input, I'm inclined to start with logwatch and see how far that takes us.
JP
Also, keep asking these questions, and I'll keep responding with: You need a database!
[xdcdb seems like a great fit and collaboration here, given they already collect job data]
"What component and version was used
- When the usage happened/started (UTC time)
- Where usage happened (client and/or server)"
A database might be a great solution. XCI has the expertise to design, implement, and support databases. Whether to use XDCDB or a different one should be decided based on multiple factors like who owns the information, the type of information, who is responsible for designing and supporting it, reliability and scalability requirements, etc.
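For what it's worth, the required and optional fields are small enough to fit in a single table; purely as a sketch (table and column names invented here, not tied to XDCDB or anything else):

# Sketch only: the required/optional fields as a single SQLite table.
# Table and column names are invented for illustration.
import sqlite3

con = sqlite3.connect("usage.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS usage (
        component   TEXT NOT NULL,   -- what component was used
        version     TEXT NOT NULL,   -- which version
        start_utc   TEXT NOT NULL,   -- when usage happened/started (UTC)
        location    TEXT NOT NULL,   -- where (client and/or server)
        username    TEXT,            -- optional: OS user / XSEDE identity
        parameters  TEXT,            -- optional: flags, URL, etc.
        end_utc     TEXT,            -- optional: when usage finished (UTC)
        resources   TEXT             -- optional: resources consumed
    )""")
con.commit()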
Are you recommending a database? Can the usage analysis tools you are exploring work with database information?
JP