What should XSEDE require/recommend SPs automatically publish?

XSEDE just released a new Information Publishing Framework (IPF) V1.4 used by SPs to publish:

  1. software module information (both SP and CSA)
  2. network service information (GridFTP, SSH/login)
  3. batch queue configuration, contents, and utilization
  4. batch job events

IPF is one of the few tools XSEDE requires that all Level 1 and 2 service providers install. This new release expands XSEDE’s recommendation to Level 3 (un-allocated) campus resources. Users view IPF published information in the XUP. Science gateways, monitoring systems, and other XSEDE tools access and use this information thru APIs.

SPs can configure IPF to publish any combination of the above 4 types of information. The goal of this e-mail is to gain internal XSEDE consensus on what we want to require/recommend SPs publish using IPF. A preliminary recommendation is:

Software module information

  • Required for Level 1, 2, and 3 SPs that offer login, command line accessible software, and batch computing

Network services information

  • Required for Level 1, 2, and 3 SPs that offer XSEDE GridFTP or login services

Batch configuration information

  • Required for Level 1, 2, and 3 SPs that offer batch computing

Batch job events

  • Required for Level 1 and 2 SPs that offer allocated batch computing
  • Optional for all other SPs that offer unallocated batch computing

NOTE: if we can get allocated SP resources to publish job events, science gateways will have a reliable way to monitor allocated job status
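
To make the preliminary recommendation above easier to check against a specific resource, here is a minimal Python sketch that encodes it as a lookup. It is not part of IPF: the `Resource` fields and the `ipf_policy` function are hypothetical names made up for illustration. It also assumes that "login, command line accessible software, and batch computing" means all three must be offered, which the thread does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of an SP resource (not an IPF or RDR type)."""
    level: int                  # SP level: 1, 2, or 3
    login: bool = False         # offers login service
    cli_software: bool = False  # offers command-line accessible software
    batch: bool = False         # offers batch computing
    gridftp: bool = False       # offers XSEDE GridFTP
    allocated: bool = False     # batch computing is allocated through XSEDE


def ipf_policy(r: Resource) -> dict:
    """Return the preliminary publishing status for each IPF information type."""
    policy = {}
    # Assumption: all three offerings are needed for the software requirement.
    if r.login and r.cli_software and r.batch:
        policy["software_modules"] = "required"
    if r.gridftp or r.login:
        policy["network_services"] = "required"
    if r.batch:
        policy["batch_configuration"] = "required"
        if r.level in (1, 2) and r.allocated:
            policy["batch_job_events"] = "required"
        elif not r.allocated:
            policy["batch_job_events"] = "optional"
    return policy


# Example: an unallocated Level 3 campus cluster with login and batch.
campus = Resource(level=3, login=True, cli_software=True, batch=True)
for info_type, status in ipf_policy(campus).items():
    print(f"{info_type}: {status}")
```

A resource with login but no batch queue or command-line software would, under these assumptions, only be asked for network services information.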

After internal XSEDE consensus we should present this to the SP Forum and SP Admins for input before it becomes an official part of Victor's SP integration process.

I don't have strong feelings about this, but I would feel better if all of this were optional for Level 3 resources/SPs. I think we should keep the L3 bar low. For someone to "upgrade" to L2 status, I can see requiring the software module and network services info. (I don't know what's in the batch config info, so not sure if that should be req'd or optional.)

IPF has been one of those low bar items since the start of XSEDE (along with RDR) because the combination of both enables XSEDE users and staff to discover basic information about XSEDE integrated resources. RDR and IPF play a similar role, with RDR focusing on information entered by hand by the SP and IPF focusing on information that can be gathered and refreshed automatically by a package (IPF) installed by the SP.

While IPF is currently required for all SPs, until now we have not been clear about which of the four types of information IPF can publish are required, recommended, or optional for different SP levels. We need to be more precise about that, especially as we have more diverse resources, some of which don't have specific types of information (JetStream has neither a batch queue nor command line software).

The batch information that is published includes the names and resource constraints of all the batch queues, the jobs currently in the batch queues, and how busy a resource currently is. The XUP resource monitor is based on this IPF batch config/status information.

I would recommend we don't lower this low bar further, and that we treat RDR and IPF similarly from a requirements perspective for L3 SPs.

We can use UT NICS as a pilot. Tabitha already has this new version of IPF installed. I'll check and see if Tabitha ended up publishing the 3 required items.

Since there are no XSEDE users on Level 3 SPs, I would suggest removing the "optional" status for the Batch job events. I think having local usernames for SPs that could be viewed somehow in the XSEDE portal or APIs is not a good idea from a security standpoint, even if they are only viewable by users who log in or authenticate. This is something that should probably be discussed further with the SP Forum and SP Software groups.

Maybe there is something new for this that I am not familiar with, but is the list of GridFTP endpoint servers (input from "network" services information concerning GridFTP) available at info.xsede.org?  I see them in the Services API link but you have to click on the view of each one to get at the URL info.  Is there a page for the URL info for GridFTP or are we getting away from that with GCS?

Yes, we should discuss this further with the SP Forum and SP Software groups.

Whether Batch job events remain "optional" or we downgrade them to something like "available", I think it would be good to let L3 SPs know that they can publish job events if they want to. If they choose to publish job events, they will gain features that they might really like, such as the ability to build local portals and gateways that access XSEDE APIs for the latest job status information, or the ability to subscribe to local job events.

If local account security is an issue we could build better security mechanisms around the information giving the SPs more control over who can view their job information, including the ability to control whether their job information is visible to XSEDE services like the XUP. Maybe there are parallels with how XRAS works in the sense that an SP can choose to use XRAS for local allocations and maintain full control/confidentiality of their allocations without giving XSEDE users access to their allocation details.

Re: Maybe there is something new for this that I am not familiar with, but is the list of GridFTP endpoint servers (input from "network" services information concerning GridFTP) available at info.xsede.org?

We have these 2 machine API interfaces:

This XUP table is the user interface for information from info.xsede.org:

JP

If NICS has or puts in Software info is it going to show up in the XSEDE Software search?  I don't see the ACF resource in the Software search currently.  I haven't checked with Tabitha if she has published that yet, but I will.

Is there going to be some different way to view allocated SP software versus unallocated SP software?  A checkbox or something on the software search page?

Looks like Tabitha is publishing about 266 software packages on acf.utk.edu:

Hi JP,

Apologies in advance if any of my comments or questions are off the mark. This is really out of my domain.

Like Dave, I don't have strong feelings about this topic but do agree that "Optional for all other SPs that offer unallocated batch computing" isn't needed. I'm not sure what the benefit is of even making this an option for resources that aren't allocated through XSEDE. That said, this all sounds like a good starting point.

OK, I have a few questions:

(1) Can you clarify what you mean by Batch Job Events? For example, is this job submissions, starts, completions and deletions?

(2) You wrote "if we can get allocated SP resources to publish job events, science gateways will have a reliable way to monitor allocated job status". Can I interpret this to mean that it will solve or help to solve the problem of tracking the gateway attributes that Amit Chourasia is working on?

 

Hey Robert,

Not to worry. We appreciate you looking at this and taking the time to comment. Input is good and reviewing previous decisions is also good since it helps us confirm whether previous decisions are still reasonable.

Since the start of the XSEDE 2 program the thinking has been that in order to integrate an XSEDE 2 resource at any level, the minimum requirement is that the SP needs to describe the resource that has been integrated so that users and staff can discover those resources. What sense does it make for an SP to integrate with XSEDE and not provide us information about the resource they've integrated? At the lowest integration level, what we call Level 3, the resource doesn't have to be allocated, and they don't need to install any specific user facing XSEDE software or service, they just need to describe the resource.

XSEDE 2 uses two complementing tools to describe resources:

  1. RDR used for SPs to manually enter resource descriptive information
  2. IPF used for SPs to automatically publish resource (software, batch queues, etc.) information

So, we have expected SPs to at a minimum enter information in RDR and install IPF to publish resource information. Information from both sources gets merged in information services which is accessible thru APIs by the XUP and other XSEDE tools.

(1) Can you clarify what you mean by Batch Job Events? For example, is this job submissions, starts, completions and deletions?

Yes, job events are job state transitions, such as the ones you mention.

(2) You wrote "if we can get allocated SP resources to publish job events, science gateways will have a reliable way to monitor allocated job status". Can I interpret this to mean that it will solve or help to solve the problem of tracking the gateway attributes that Amit Chourasia is working on?

No, these things are unrelated. What Amit is working on is the mechanisms that science gateways use to record in XSEDE's accounting system which gateway user a community account job is for. In effect the gateway is adding another attribute about a job to XSEDE's accounting job record. This happens once per job immediately after the job is submitted.

What we are doing with IPF is capturing job events (or job state changes) recorded by the scheduler (SLURM, PBS, etc.) and publishing them as they happen to XSEDE's publish/subscribe service. This makes it possible for anyone who is authorized to subscribe and reliably receive these state changes and in effect monitor job status. This is a much more efficient and scalable method to monitor job status than, for example, every gateway, the XUP, meta-schedulers, workflow engines, etc. having to independently query job status on a resource. We are effectively mirroring distributed job status in XSEDE's central information services (info.xsede.org) and offering two methods to receive job status information: 1) subscribe to job state changes (events) with sub-second latency, or 2) RESTful API query of job status.
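
To illustrate the two access methods described above, here is a small in-memory Python sketch of the pattern. It is not XSEDE's publish/subscribe service or the info.xsede.org API: the `JobStatusMirror` class, its method names, and the job ID and state strings are all hypothetical. It only shows that a central mirror can push each state change to subscribers and also answer queries for the latest state.

```python
from typing import Callable, Dict, List, Optional


class JobStatusMirror:
    """Toy central mirror of job state (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self._latest: Dict[str, str] = {}
        self._subscribers: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        # Method 1: register to be pushed every job state change (event).
        self._subscribers.append(callback)

    def publish(self, job_id: str, state: str) -> None:
        # Called once per state transition recorded by the scheduler.
        self._latest[job_id] = state
        for callback in self._subscribers:
            callback(job_id, state)

    def query(self, job_id: str) -> Optional[str]:
        # Method 2: ask for the current state on demand (stand-in for a REST query).
        return self._latest.get(job_id)


mirror = JobStatusMirror()
received = []
mirror.subscribe(lambda job_id, state: received.append((job_id, state)))

# Example transitions for one job; the state names are illustrative.
for state in ("submitted", "started", "completed"):
    mirror.publish("job-42", state)

print(received)
print(mirror.query("job-42"))
```

The point of the design is that the resource publishes each transition once, instead of every gateway or portal polling the scheduler separately.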

JP

Thanks, JP, for the detailed reply. I'm already familiar with the RDR (I've updated the RDR for Comet and ECSS), but am quite new to IPF. And your response to my second question makes perfect sense. I'm sure some of the more heavily used gateways (e.g. Cipres, I-TASSER) will find this useful.

Victor, Maytal, Marlon, Shava, Dave, John, and Robert,

Thanks Dave, Robert, and Victor for your feedback and questions on the forum.

We need internal consensus to present to the SP Forum and SP Administrators as an XSEDE proposal. With SP input and support Victor will then be able to update XSEDE resource integration documentation with requirements/recommendations for current and future SPs.

The main questions raised by Dave, Robert, and Victor were:

  1. Make everything optional for L3 SPs. SPs would not have to describe their resources, so XSEDE would have no formal details on L3 integrated resources and could not show resource details for L3 SPs. This would save Level 3 SPs 1-2 days of integration effort entering information into RDR and deploying IPF.
  2. Remove “optional” for Level 3 SPs to publish Batch Job Events. This essentially tells them that we are not offering this service even if the SP would like to use it.

To answer one of Robert's questions: Batch Job Event publishing provides Level 3 SPs access to XSEDE’s job status interfaces. These interfaces could be used by local SP portals and software to access local job status information. With appropriate security mechanisms, we could limit XSEDE’s access to this information.

Two use cases that motivated making this capability OPTIONAL:

  • A science gateway that accesses both XSEDE and non-XSEDE resources would like a common interface to access all job status information. If the non-XSEDE resource joins XSEDE as a Level 3 resource (remaining unallocated by XSEDE), it can then CHOOSE to publish job status information to XSEDE, giving science gateways a single job status interface.
  • A campus wants a reliable and efficient interface to look up job status/state information without giving XSEDE or others access to their job information. If the SP joins XSEDE as a Level 3 resource (remaining unallocated by XSEDE), it can then CHOOSE to publish job status information to XSEDE for local access.

To decide on the above two questions, would each of you please reply with a vote on:

Question 1) Should XSEDE require Level 3 SPs to describe the resources they are integrating in RDR and IPF?

Question 2) Should XSEDE allow Level 3 SPs that want to publish Batch Job Events to XSEDE to do so?

Thanks,

JP

My votes: (1) Yes — A Service Provider should be required to identify the existence of at least one service that they are providing to the ecosystem writ large (even if it's to a closed subset of local users). If an organization wants to be a forum member without showing that they do in fact provide at least one service, then perhaps we need something like an "SP Forum Affiliate Member" status.

(2) Yes — as long as this won't impact the operations of the IPF and/or Batch Job Event publishing service for XSEDE-allocated resources.

I agree with Dave.

All,

I will ask for time on the September SP Forum and SP Software/Administrators meetings to present our recommendations to these groups. In light of the quarterly meeting discussion about a possible review and refresh of the Level 1-3 SP categories, I think our recommendation should describe our requirements and recommendations for describing SP resources in only two categories: allocated and unallocated.

If anyone has strong opinions about the above questions please vote by early next week.

I plan to share the draft slides with you for feedback so that we can list all our names as contributors to the recommendation.

Thanks,

JP

Everyone,

Draft slides to be presented to the SP Forum and SP Admins/Software meetings are here.

Comments welcome.

Thanks,

JP
