⚠ CMS SST — Daily Site Report

Generated:  |  Source: siteStatus/summary.html + GGUS
Tier:
Shown: 50  |  🔴 With errors: 50  |  Total selected sites: 90

Metric columns = last 16 days (oldest → most recent).
Click a cell for the SAM/HC/FTS log. Click a site name for the full readiness report. ↻ refreshes data locally or opens GitHub Actions on the public page.

Tier-0
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM          100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%   99%  100%   97%
HammerCloud     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

No open GGUS tickets

Tier-1
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM          100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%
HammerCloud  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #1002256 (id:1002256) Failed FTS transfers from NCBJ to GRIDKA
State: in progress  |  Priority: very urgent  |  Opened: 2026-04-02 13:18 (2d ago)  |  Updated: 2026-04-02 13:56
Conversation (2 messages)
NCBJ recently deployed IPv6-only pool nodes at their SE. After the deployment we have started to observe failed FTS transfers from NCBJ to GRIDKA.
In the logs we see the following error:
failure: Connect to sgate025.cis.gov.pl:23454 [sgate025.cis.gov.pl/2001:67c:2870:1:0:0:2001:25] failed: Connect timed out; redirects [https://sgate025.cis.gov.pl:23454/grid/lhcb/LHCb/Collision26/SL.DST/00369465/0009/00369465_00098423_1.sl.dst?<redacted>]
INFO Thu, 02 Apr 2026 14:15:36 +0200; Gfal2: Copy failed with mode 3rd pull: Transfer failure: Connect to sgate025.cis.gov.pl:23454 [sgate025.cis.gov.pl/2001:67c:2870:1:0:0:2001:25] failed: Connect timed out; redirects [https://sgate025.cis.gov.pl:23454/grid/lhcb/LHCb/Collision26/SL.DST/00369465/0009/00369465_00098423_1.sl.dst?<redacted>]
INFO Thu, 02 Apr 2026 14:15:36 +0200; Gfal2: Gfal http copy clean-up disabled

The current status of active transfers (both tape and disk) is available here:
https://fts3-lhcb.cern.ch:8449/fts3/ftsmon/#/jobs?source_se=https:%2F%2Fse.cis.gov.pl&dest_se=https:%2F%2Flhcbdcache-kit-tape.gridka.de&vo=lhcb&with_file=FAILED&time_window=1

https://fts3-lhcb.cern.ch:8449/fts3/ftsmon/#/jobs?source_se=https:%2F%2Fse.cis.gov.pl&dest_se=https:%2F%2Flhcbdcache-kit.gridka.de&vo=lhcb&with_file=FAILED&time_window=1

Could you please check?

Thanks,
Henryk
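The FTS monitoring links above embed the percent-encoded source and destination SEs as query filters. A minimal sketch of assembling such a link (the parameter set is copied from the links above; the helper itself is illustrative, not an official FTS client API):

```python
from urllib.parse import quote

# Filters taken from the ftsmon links above
base = "https://fts3-lhcb.cern.ch:8449/fts3/ftsmon/#/jobs"
filters = {
    "source_se": "https://se.cis.gov.pl",
    "dest_se": "https://lhcbdcache-kit-tape.gridka.de",
    "vo": "lhcb",
    "with_file": "FAILED",
    "time_window": "1",
}
# Percent-encode each value; safe="" also encodes ":" and "/"
query = "&".join(f"{k}={quote(v, safe='')}" for k, v in filters.items())
print(f"{base}?{query}")
```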
Hi Henryk,

I can confirm that our servers are unable to reach NCBJ via tracepath. Even from lxplus.cern.ch I do not get through. Nevertheless, I've informed our network experts. Sadly, considering the current time of day and the imminent Easter holiday period, I have to dampen your expectations for a quick resolution.

Kind regards,
Xavier.
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM          100%   98%  100%  100%   99%  100%  100%   97%   93%  100%   96%  100%  100%  100%   95%  100%
HammerCloud  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

Open GGUS tickets (3)

WLCG tickets (3)
WLCG #1001954 (id:1001954) Testing new tape library at PIC (LHCb)
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-03 10:19 (32d ago)  |  Updated: 2026-04-03 00:00
Conversation (18 messages)
Dear site admins,
This ticket is to coordinate new tape library tests @PIC.
As I understand it (correct me if I am wrong), this week LHCb is expected to write some data to `https://dcdoor01-test.pic.es:8446/pnfs/pic.es/data/lhcb/CTA`.
A few questions:
1. Would that be OK to copy the data from production PIC SE webdav-lhcbt1.pic.es (via https)?
2. How much data should we copy?
3. When should we start?
Best Regards,
Alex
Hi Alex,

Sorry for the delay. We were doing some late tests and the system was not ready till now.
Answering your questions:
1. Yes, Production is perfect as source, and the endpoint provided is https.
2. There is a buffer of 127TB to write data in front of the tape, but you could copy more data.
3. Now the system is ready to accept data.

Best Regards,
Elena
Hi Elena,
Thanks! I've just (re)started the tests, transfers should start soonish.
Best Regards,
Alex
Hi Elena,
The transfers started, but they are all failing with, e.g. [1]:
```
INFO Wed, 04 Mar 2026 11:29:18 +0100; Davix: PerformanceMarker:
failure: Failed to select pool: diskCacheV111.services.space.SpaceException: Access latency conflicts with access latency defined by default space reservation.

```
Could you please have a look?
Best Regards,
Alex
[1] https://fts-lhcb-005.cern.ch:8449/var/log/fts3/transfers/2026-03-04/webdav-lhcbt1.pic.es__dcdoor01-test.pic.es/2026-03-04-1029__webdav-lhcbt1.pic.es__dcdoor01-test.pic.es__253686662__0c5ae3ba-17b4-11f1-a9e4-e24dd15c0b06
Hi Alex,

A write token, inherited from production, was defined when we recreated the path. Now we've removed it, and the transfers should work.

Sorry for the inconvenience.
Cheers,
Elena
Hi Elena,
Thanks! After the fix there were some successful uploads to the test SE. However, all of them failed during the archival stage (i.e. after the file has already been transferred and FTS is just polling until it is copied to tape) due to issues with the tape REST API query. E.g. from [1] we got
``` ARCHIVING [5] [Tape REST API] Failed to query /.well-known/wlcg-tape-rest-api: curl error (18): Transferred a partial file
```
Could you please have a look?
Best Regards,
Alex
[1] https://fts3-lhcb.cern.ch:8449/fts3/ftsmon/#/job/6792dd34-187f-11f1-93cb-fa163e8e0048
Hi Alex,

The JSON file which provides the information for the query of /.well-known/wlcg-tape-rest-api was not defined correctly, and that's why you received this error.

I've already created it and checked that https://dcdoor01-test.pic.es:8446/.well-known/wlcg-tape-rest-api correctly responds with the endpoint of the tape REST API.

Sorry for the delay in correcting it.
Cheers
Elena
Last week the tests were stopped to drain the requests and allow a needed firmware update of the drives.

@Alex: now the firmware is updated and we can restart the tests, thank you!
Hi,
Thanks for the info!
I've just restarted the tests, transfers should appear soon.
Best Regards,
Alex
Hi,
Most of the transfers are failing. There seems to be a problem with PIC's production SE, e.g.
bash-5.1$ gfal-copy https://webdav-lhcbt1.pic.es:8446/pnfs/pic.es/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision25/RDLOW.DST/00294787/0002/00294787_00026628_1.rdlow.dst .
Copying https://webdav-lhcbt1.pic.es:8446/pnfs/pic.es/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision25/RDLOW.DST/00294787/0002/00294787_00026628_1.rdlow.dst [FAILED] after 0s
gfal-copy error: 5 (Input/output error) - Result HTTP 500 : Unexpected server error: 500 , while readding after 1 attempts
bash-5.1$
Could you please have a look?
Best Regards,
Alex
The writing tests have concluded with some discoveries, thank you all.
We have been asked by the storage team to start the reading test phase now; we are discussing the details.
Hi Alvaro,
During the write test we wrote ~8500 files to your test storage (list attached). These files can be used for the staging test, I think. However, as far as I can see, they are still present in the buffer:
bash-5.1$ shuf -n 100 pic_urls.txt | while read line; do gfal-xattr "$line" user.status ; done
ONLINE_AND_NEARLINE
ONLINE_AND_NEARLINE
ONLINE_AND_NEARLINE
ONLINE_AND_NEARLINE
ONLINE_AND_NEARLINE
ONLINE_AND_NEARLINE
ONLINE_AND_NEARLINE
[..]
ONLINE_AND_NEARLINE

Could you please evict them from the buffer? Then I can start the staging test.
Best Regards,
Alex
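The shuf/gfal-xattr loop above prints one locality string per sampled file. A small sketch of tallying such output to see how many files still occupy the disk buffer (the sample list below is made up; ONLINE_AND_NEARLINE means copies exist both on disk and on tape):

```python
from collections import Counter

# Hypothetical sample of localities as printed by the gfal-xattr loop above
localities = [
    "ONLINE_AND_NEARLINE", "ONLINE_AND_NEARLINE", "ONLINE_AND_NEARLINE",
    "NEARLINE",            # already evicted: only the tape copy remains
    "ONLINE",              # on disk only, not yet migrated to tape
]
counts = Counter(localities)
# Files still holding buffer space are those with an ONLINE* locality
on_disk = sum(n for state, n in counts.items() if state.startswith("ONLINE"))
print(dict(counts), on_disk)
```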
Hi Alex,

We've already removed all cached files from disk.
You could start the staging test whenever you want.

Best Regards,
Elena
Hi Elena,
Thanks! I've just submitted the first staging request for 10 files:
bash-5.1$ curl --cert "$X509_USER_PROXY" --key "$X509_USER_PROXY" --capath /etc/grid-security/certificates "https://dcdoor01-test.pic.es:8476/api/v1/tape/stage/96d0f5d2-586a-47d5-a8d0-8b5e1c6c2b20"
{
"id" : "96d0f5d2-586a-47d5-a8d0-8b5e1c6c2b20",
"createdAt" : 1774285907974,
"startedAt" : 1774285907995,
"files" : [ {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012911_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012912_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012913_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012914_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012915_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012916_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012917_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012918_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012919_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
}, {
"path" : "/pnfs/pic.es/data/lhcb/CTA/lhcb/LHCb/Collision25/TURCAL_RAWBANKS.DST/00293651/0001/00293651_00012923_1.turcal_rawbanks.dst",
"startedAt" : 1774285907984,
"state" : "STARTED"
} ]
}
bash-5.1$
Let's see how it works!
Best Regards,
Alex
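The stage response above is plain JSON, so the per-file states are easy to summarise. A sketch using a trimmed-down, hypothetical response (only the fields used here are kept; the paths are fabricated):

```python
import json
from collections import Counter

# Trimmed, hypothetical stage response in the shape shown above
response = json.loads("""
{
  "id": "96d0f5d2-586a-47d5-a8d0-8b5e1c6c2b20",
  "files": [
    {"path": "/pnfs/example.org/data/a.dst", "state": "STARTED"},
    {"path": "/pnfs/example.org/data/b.dst", "state": "COMPLETED"},
    {"path": "/pnfs/example.org/data/c.dst", "state": "STARTED"}
  ]
}
""")

states = Counter(f["state"] for f in response["files"])
pending = [f["path"] for f in response["files"] if f["state"] != "COMPLETED"]
print(dict(states), len(pending))
```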
Hi Alex,

It worked perfectly. A few files were not restored until this morning because the tapes hosting them were DISABLED due to a previous error, but once the tapes were ONLINE again, the stages finished rapidly.
Could you send more requests to stress the environment?
Thanks!
Elena
Hi Elena,
Recall requests for all test files have been submitted, see here:
https://fts3-lhcb.cern.ch:8449/fts3/ftsmon/#/?vo=lhcb&source_se=https:%2F%2Fdcdoor01-test.pic.es&dest_se=https:%2F%2Fwebdav.echo.stfc.ac.uk&time_window=1
The destination is RAL since, as we discussed with Alvaro, PIC's disk SE is not very healthy (but that is not important, I guess, since we are mostly interested in the staging part).
Best Regards,
Alex
We have finished the reading tests correctly. We will soon start mixed workload tests.
Hi,
FYI: we started the combined test this week. The reading part is still ongoing, and the writing part has already finished. During the writing part there were many archival timeouts, e.g. [1].
Best Regards,
Alex
[1] https://fts3-lhcb.cern.ch:8449/fts3/ftsmon/#/job/6c9eeefc-2d2a-11f1-af16-fa163e8e0048
WLCG #1002236 (id:1002236) PIC RSE basepath prefix
State: in progress  |  Priority: less urgent  |  Opened: 2026-04-01 11:07 (3d ago)  |  Updated: 2026-04-01 14:23
Conversation (2 messages)
Your site ATLAS config is currently using different basepath prefix per protocol.

webdav: /
xroot: /pnfs/pic.es/data/atlas

1. what is the motivation for this difference?
2. would it be possible to set the same path for both protocols?

Thanks!
Hello,
I am the person on duty. We will discuss and I will make sure I give you the right answer, but I think this has been like this historically.

Let us get back to you, probably after easter, with the information you request.

Jordi
WLCG #1002182 (id:1002182) Disk and Tape resources at PIC
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-24 12:21 (11d ago)  |  Updated: 2026-03-25 07:52
Conversation (3 messages)
Hello,

Now that the 2026 data taking has started, we are reviewing the state of the storage
pledges on the LHCb T1 and T2 sites.

Earlier this year, we were informed that (with a few exceptions) the sites would have
the pledged disk and tape resources available to the experiment, and we hope you
can confirm that.

1 - Status of disk pledges

PIC pledges 5855 TB of disk space for 2026, which includes a recent addition of 500 TB.

However, the Storage Resource Record (SRR) advertises 5393.6 TB [1].

Do you know what explains that discrepancy?

2 - Status of tape pledges

The site pledges 11685 TB of tape space for 2026. Are these resources available
to LHCb already?

Thanks in advance for your efforts to make these crucial resources available
to us!

best regards, Jan van Eldik / LHCb Compute project lead
[1]
+------+-------------+-----------+-----------+----------+
| Site | Share       | Size (TB) | Used (TB) | Fraction |
+------+-------------+-----------+-----------+----------+
| PIC  | LHCb_BUFFER |     113.6 |     113.6 |   100.0% |
| PIC  | LHCb-Disk   |    4900.0 |    4274.1 |    87.2% |
| PIC  | LHCb-Tape   |     300.0 |     130.7 |    43.6% |
| PIC  | LHCb_USER   |      80.0 |      62.1 |    77.7% |
+------+-------------+-----------+-----------+----------+
Hi Jan,

The discrepancy in the disk numbers is due to the fact that the last increment has not yet been assigned.
We'll do it as soon as possible.
Regarding the tape assignment, we are running a bit behind schedule because the installation of the new library was only recently completed, and we are currently conducting validation tests.
Once these tests are finished, we plan to migrate to CTA first (from Enstore), and then we will proceed with the assignments in the new robot.

Don't hesitate to contact us for any question.

Regards,
Elena
Thanks for the quick reply Elena, a few quick questions:
> The discrepancy in the disk numbers is due to the fact that the last increment has not yet been assigned.
> We'll do it as soon as possible.
Thanks. Would that be a matter of days? Or of weeks?
Similar questions for tape:
> we plan to migrate to CTA first (from Enstore), and then we will proceed with the assignments in the new robot.
Can you give us an idea of the timescale? Also, will we be able to access the existing tape resources (i.e. the 2025 pledge of 8766 TB, or maybe a bit more?) until the migration happens?

Thanks for the information, this will help us to plan the data distribution of the fresh 2026 data.

best, Jan
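For reference, the SRR shares in the table in [1] do add up to the advertised figure, so the gap to the pledge is exactly the unassigned increment (a quick arithmetic check, all values in TB):

```python
# SRR shares from the table in [1], in TB
shares = {
    "LHCb_BUFFER": 113.6,
    "LHCb-Disk": 4900.0,
    "LHCb-Tape": 300.0,
    "LHCb_USER": 80.0,
}
srr_total = sum(shares.values())     # what the SRR advertises
pledge_2026 = 5855.0                 # 2026 disk pledge, incl. the recent 500 TB
shortfall = pledge_2026 - srr_total  # the increment not yet assigned
print(round(srr_total, 1), round(shortfall, 1))
```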
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM          100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%   56%   66%  100%  100%
HammerCloud  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

Open GGUS tickets (2)

WLCG tickets (2)
WLCG #1002240 (id:1002240) IN2P3-CC RSE basepath prefix
State: in progress  |  Priority: less urgent  |  Opened: 2026-04-01 11:57 (3d ago)  |  Updated: 2026-04-01 12:13
Conversation (2 messages)
Your site ATLAS config is currently using different basepath prefix per protocol.

webdav: /
xroot: /pnfs/in2p3.fr/data/atlas

1. what is the motivation for this difference?
2. would it be possible to set the same path for both protocols?

Thanks!
Hi Peter,

I guess it is for "historical" reasons, but it is totally possible to configure the XRootd endpoint like the WebDAV one in order to have only "/" as basepath prefix.
But if I change the XRootd doors configuration to xrootd.root=/pnfs/in2p3.fr/data/atlas, it will break all current transfers/connections to the door.
So I don't know how we can proceed without a site DT (downtime) to avoid multiple failures.

Cheers,
Adrien
WLCG #1002126 (id:1002126) Upgrade your HTCondorCE endpoints to 24.0.x series (IN2P3-CC)
State: in progress  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 15:43
Conversation (5 messages)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and the endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which are the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before upgrading: all changes must be applied manually, in particular the change to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
Hi Alessandro,
We already have the version 24:
htcondor-ce-24.0.2-1.el9.noarch
condor-24.0.11-1.el9.x86_64
Where did you find we have version 23?
Thank you,
Vanessa
Hi Vanessa,
thanks for the reply. When I looked for the version you run, I wasn't able to find it because the endpoint is not published in the BDII, so I had to open the ticket anyway. If you manage to configure the info provider and publish the CE in the BDII, that would be helpful.

Cheers,
Alessandro
Hi Vanessa,
actually I found the following CE with an unsupported version:

- cccondorcm01.in2p3.fr: 9.0.20
- cccondorcm02.in2p3.fr: 23.10.1

They are registered in GOCDB and the version information is published in the BDII.

Cheers,
Alessandro
Hi Alessandro,

Those are Central Managers (CM), and we used to publish from the CM a long time ago (only at one point). It looks like our BDIIs are not in good shape. I will check.

Thank you,
Vanessa
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM          100%  100%  100%  100%  100%  100%   99%  100%   97%  100%  100%   98%  100%  100%  100%  100%
HammerCloud   99%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%   99%   99%   99%
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

Open GGUS tickets (7)

CMS tickets (1)
CMS #1000981 (id:1000981) Locality UNAVAILABLE CNAF Tape
State: on hold  |  Priority: less urgent  |  Opened: 2025-10-27 15:05 (159d ago)  |  Updated: 2025-11-19 07:54
Conversation (7 messages)
In recent days we have noticed an increasing number of issues with transfers targeting CNAF tape.

Files seem to be making it into the buffer, but FTS is not able to get the locality. As a result, the locality is marked as unavailable and FTS fails the transfer with the following message:
```
ARCHIVING [11] [Tape REST API] File locality reported as UNAVAILABLE
```

Here is an example: https://fts3-cms.cern.ch:8449/fts3/ftsmon/#/job/a5c49af4-b333-11f0-8322-fa163e1db714?file=5220387371

FTS logs here: https://fts-cms-009.cern.ch:8449/var/log/fts3/transfers/2025-10-27/eoscms.cern.ch__xfer-tape-cms.cr.cnaf.infn.it/2025-10-27-1255__eoscms.cern.ch__xfer-tape-cms.cr.cnaf.infn.it__5220387371__a5c49af4-b333-11f0-8322-fa163e1db714

Trying manually, I was able to retrieve the locality (using both my personal and rucio certificate):
```
$ gfal-xattr davs://xfer-tape-cms.cr.cnaf.infn.it:8443/cmstape/store/data/Run2025G/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/398/419/00000/83b8a4f5-9e01-4c55-b184-1854b45f87d8.root user.status
ONLINE
```

Could you have a look in case there is an issue on your side?
Thanks,
Christos for DM
Hi Christos,
regarding this particular example you mentioned, we believe that what happens is the following: one of our StoRM WebDAV servers successfully handles the TPC from EOS at 13:55:32:

2025-10-27T13:55:32+01:00 [f9f4a29e5958aff86a1e3d512ddcca1f;job-id=a5c49af4-b333-11f0-8322-fa163e1db714;file-id=5220387371;retry=0] job-id:a5c49af4-b333-11f0-8322-fa163e1db714,file-id:5220387371,retry:0 INFO 2023053 --- [jetty-5207954] o.i.storm.webdav.tpc.TransferFilter : Pull third-party transfer completed: DONE. Source: https://eoscms-ns-ip700.cern.ch:443/eos/cms/tier0/store/data/Run2025G/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/398/419/00000/83b8a4f5-9e01-4c55-b184-1854b45f87d8.root, Destination: /cmstape/store/data/Run2025G/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/398/419/00000/83b8a4f5-9e01-4c55-b184-1854b45f87d8.root, Bytes transferred: 2383492487, Duration (msec): 106597, Throughput: 21 MB/sec, id: f9f4a29e5958aff86a1e3d512ddcca1f;job-id=a5c49af4-b333-11f0-8322-fa163e1db714;file-id=5220387371;retry=0
The TPC is successful, which you can also see in the FTS log: the checksum is correct.
Nine seconds later, FTS queries the file locality, and that request goes to the StoRM Tape REST API endpoint, which is a different one with respect to the StoRM WebDAV server above. Apparently, the information on the attributes of that file had not yet been propagated to all the filesystem nodes, so we get UNAVAILABLE as locality, which corresponds, in the log file of the service, to the following line:

Oct 27 13:55:41 tape-cms storm-tape[230526]: (2025-10-27 12:55:41) [ERROR] The file /storage/gpfs_tsm_cms/cms/store/data/Run2025G/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/398/419/00000/83b8a4f5-9e01-4c55-b184-1854b45f87d8.root appears lost, check stubbification and presence of user.storm.migrated xattr

When you repeat the request manually, a few seconds after, the locality is correctly reported.

Don't you think it would be best to adjust the workflow client-side in order to avoid such false problems?
It doesn't seem a great idea to declare the TPC as failed based only on the file locality checked by a different process a few seconds after the file is correctly transferred.
Let us know what you think about it,
lucia
Hi Lucia,

Thanks for the explanation. That was really helpful.

After looking into several cases from the Rucio side, I noticed that transfers were marked as completed after the first attempt [1]. It seems that the delay between StoRM and the Tape API isn’t constant, so this can likely be considered a non-critical issue.

However, this state still adds unnecessary load to the site and FTS due to redundant retries. In some cases, this could cause problems. As far as I know, if a file doesn't make it to tape before Rucio retries, the file will be deleted from the buffer and FTS will restart the process from scratch. If the delay persists, this could result in an increasing number of unnecessary retries and potentially reduce overall transfer efficiency.

Could we check whether this issue appears when the site is under heavy load? I can also try reducing the number of concurrent transfers that FTS submits to the site. In any case, I’ll contact the FTS team. Perhaps accepting UNAVAILABLE during archive polling could help in such cases.

Thanks,
Christos

[1] https://monit-opensearch.cern.ch/dashboards/goto/76fa1bfcade6ffc4ba0c57edf1bdec46
Hi Christos,
thanks for the investigation.
We attach a plot of traffic OUT of the servers handling CMS filesystem, and you can see that on Oct 27th around 1-2pm, there actually was a spike of traffic.
However, nothing particularly worrying.
Given you'll contact the FTS team, we stress again that it would be best to avoid such unnecessary retries, and also avoid locality checking to assess the status of a TPC.
Thanks for understanding,
Lucia
Just placing the relevant JIRA ticket:
https://its.cern.ch/jira/browse/CMSDM-362
Following the FTS response: we should investigate the reason UNAVAILABLE is reported.

As FTS reports, this is a state that should be reported in case of issues:
https://its.cern.ch/jira/browse/CMSDM-362

I also managed to find a reference document that clearly describes the behavior (pages: 23 - 25):
https://cernbox.cern.ch/pdf-viewer/public/vLhBpHDdaXJSqwW/WLCG%20Tape%20REST%20API%20reference%20document.pdf

From what I understand, the delay in propagating information to the nodes cannot fully explain the UNAVAILABLE locality.

From the point of view of conformance to the tape REST API, StoRM Tape should respond LOST, because at the moment of the request the file appears as a stub, meaning that the disk file has no contents and it doesn't have the extended attribute that says it is on tape. This is what is reported in the log pasted above by Lucia. In practice, however, it returns UNAVAILABLE, not to scare the client too much :-) I don't see other options here, and in fact UNAVAILABLE seems the perfect value for a transitory situation which, for a distributed filesystem like GPFS, can certainly happen, since information needs some time to propagate, especially when the load is high.
WLCG tickets (6)
WLCG #1002202 (id:1002202) Token authentication fails with xrootd on xrootd-archive.cr.cnaf.infn.it
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-26 16:03 (9d ago)  |  Updated: 2026-04-03 15:09
Conversation (13 messages)
For the last 2 weeks we have seen xrootd token access fail on xrootd-archive.cr.cnaf.infn.it for DUNE. The root port advertises the ZTN protocol as it should, but we get this error:
xrdcp root://xrootd-archive.cr.cnaf.infn.it:1096//dune/testpro/bb/7f/awt-download-2023-03-07-01.txt ./infnroot2.txt

sec_PM: Loaded ztn protocol object from libXrdSecztn.so

sec_PM: Using ztn protocol, args='0:4096:'
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Unable to open /dune/testpro/bb/7f/awt-download-2023-03-07-01.txt; permission denied (source)

It is the same error with root:// and roots:// protocol. davs:// protocol is not affected.

Please investigate. This used to work up until a couple of weeks ago. The test above was made on Thursday, 26 Mar at 11:01 US/Central time = 17:01 Central Europe time.
{
"sub": "dunepro@fnal.gov",
"iss": "https://cilogon.org/dune",
"wlcg.ver": "1.0",
"aud": "https://wlcg.cern.ch/jwt/v1/any",
"acr": "https://refeds.org/profile/sfa",
"nbf": 1774531982,
"auth_time": 1774531981,
"scope": "storage.read:/resilient/jobsub_stage storage.create:/ compute.cancel compute.create compute.read storage.read:/ storage.create:/resilient/jobsub_stage storage.create:/persistent/jobsub/jobs storage.stage:/ compute.modify",
"exp": 1774542787,
"iat": 1774531987,
"wlcg.groups": [
"/dune/production",
"/dune"
],
"jti": "<redacted>"
}
Actually my earlier statement was not correct: the davs:// protocol is not working either.

davs://xfer-archive.cr.cnaf.infn.it:8443/dune-grid/dune/testpro/bb/7f/awt-download-2023-03-07-01.txt

gfal-copy error: 13 (Permission denied) - Could not stat the source: Result (Neon): SSL handshake failed: Connection timed out during SSL handshake after 1 attempts
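The token payload shown earlier is just the base64url-encoded middle segment of the JWT, so its claims can be inspected without any grid tooling. A sketch (this decodes only and does not verify the signature; the toy token below is fabricated from a subset of the claims in the ticket):

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def _b64url(obj) -> str:
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).decode().rstrip("=")

# Toy token (header.payload.signature) with claims from the ticket
claims = {
    "sub": "dunepro@fnal.gov",
    "scope": "storage.read:/ storage.create:/ storage.stage:/",
    "wlcg.groups": ["/dune/production", "/dune"],
}
toy = ".".join([_b64url({"alg": "none"}), _b64url(claims), ""])

decoded = decode_jwt_payload(toy)
print(decoded["sub"], decoded["wlcg.groups"])
```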
Hi,
is it possible that the DN of the FNAL VOMS servers changed without us being notified about it?
We currently have the following LSC files for DUNE vo:
# cat voms1.fnal.gov.lsc
/DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=voms1.fnal.gov
/C=US/O=Internet2/CN=InCommon RSA IGTF Server CA 3

# cat voms2.fnal.gov.lsc
/DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=voms2.fnal.gov
/C=US/O=Internet2/CN=InCommon RSA IGTF Server CA 3

Could you please check?
Thanks,
lucia
That is the right DN for the VOMS servers.
X.509 proxy authentication is fine.
It is only token authentication that changed.
There was an update of the scitokens lib that went out in EPEL the week before last, maybe that changed something? (upgrading to 1.4).
Steve Timm

Ok, thanks.
If this is the right DN for the VOMS servers, then the VO card should be updated accordingly:
https://operations-portal.egi.eu/vo/view/voname/dune#VOV_section
The fact that you see problems with both protocols would suggest that something is wrong with the tokens.
We did not change anything lately on our servers, neither xfer-archive.cr.cnaf.infn.it nor xrootd-archive.cr.cnaf.infn.it.
In xrootd-archive.cr.cnaf.infn.it we have the following scitokens RPMs:
xrootd-scitokens-5.7.3-1.el9.x86_64
scitokens-cpp-1.3.0-1.el9.x86_64
In xfer-archive.cr.cnaf.infn.it the following validator is used by StoRM WebDAV:
https://github.com/italiangrid/storm-webdav/blob/develop/src/main/java/org/italiangrid/storm/webdav/oauth/validator/WlcgProfileValidator.java
Hope this helps,
lucia
It is not straightforward for me to run the above code, but I looked at it by inspection: my tokens have all the fields that are being requested, and they are in legal ranges. I'll add some debugging to the client and see if it gives a clue.
Hi Steven,
sorry for the late reply.

I would like to start with the WebDAV endpoint.
Could you please try to run the following commands and share the output with us?

If you use oidc-agent to store the token:
$ export BEARER_TOKEN=$(oidc-token <dune>)
$ voms-proxy-destroy
$ unset X509_USER_PROXY
$ gfal-ls -vv davs://xfer-archive.cr.cnaf.infn.it:8443/dune-grid
$ rpm -qa | grep ca-policy
$ rpm -qa | grep gfal

Thanks,
Andrea
OK here you go
As you can see from the output, this test was actually successful.

------------------------------------------------------------------------------
<> htgettoken -i dune -r production -a htvaultprod.fnal.gov
Attempting to get token from https://htvaultprod.fnal.gov:8200 ... succeeded
Storing bearer token in /run/user/2904/bt_u2904
<> export BEARER_TOKEN=`cat /run/user/2904/bt_u2904`
<> voms-proxy-destroy

Proxy file doesn't exist or has bad permissions

<> id
uid=2904(timm) gid=9010(dune) groups=9010(dune),2038(duneatmexot),2042(duneadmintest),8080(dunecal),8085(dunehe),8086(dunele),8087(dunendsim),8089(dunefdsim),8090(dunenuint),8092(dunelbl),8094(dunebsm),8095(dunepd),8271(duneadmin),8660(ifadmin-mgr),9960(lbnf)
<> ls -l /tmp/x509up_u2904
ls: cannot access '/tmp/x509up_u2904': No such file or directory
<> unset X509_USER_PROXY
<> gfal-ls -vv davs://xfer-archive.cr.cnaf.infn.it:8443/dune-grid
INFO [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
INFO [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
INFO [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
INFO [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
INFO Davix: Unable to find a proxy or cert/key pair using either X509_USER_* variables or /tmp/x509up_u2904
INFO Davix: Unable to find a proxy or cert/key pair using either X509_USER_* variables or /tmp/x509up_u2904
INFO [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
INFO Using bearer token for HTTPS request authorization
INFO Davix: > PROPFIND /dune-grid HTTP/1.1
> User-Agent: gfal2-util/1.9.1 gfal2/2.23.5 neon/0.0.29
> Keep-Alive:
> Connection: Keep-Alive
> TE: trailers
> Host: xfer-archive.cr.cnaf.infn.it:8443
> Depth: 0
> Authorization: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>

INFO Davix: < HTTP/1.1 207 Multi-Status
INFO Davix: < Date: Fri, 03 Apr 2026 14:05:24 GMT
INFO Davix: < Content-Type: application/xml; charset=utf-8
INFO Davix: < Content-Length: 722
INFO Davix: < Connection: keep-alive
INFO Davix: < Server: StoRM-WebDAV/1.12.0 (instance=tTFk)
INFO Davix: < DAV: 1
INFO Davix: < X-Content-Type-Options: nosniff
INFO Davix: < X-XSS-Protection: 0
INFO Davix: < Cache-Control: no-cache, no-store, max-age=0, must-revalidate
INFO Davix: < Pragma: no-cache
INFO Davix: < Expires: 0
INFO Davix: < X-Frame-Options: DENY
INFO Davix: <
INFO Using bearer token for HTTPS request authorization
INFO Davix: > PROPFIND /dune-grid HTTP/1.1
> User-Agent: gfal2-util/1.9.1 gfal2/2.23.5 neon/0.0.29
> TE: trailers
> Host: xfer-archive.cr.cnaf.infn.it:8443
> Depth: 1
> Authorization: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I double-checked, and the normal test I have been running myself with the davs URI (a gfal-copy) works too:
davs://xfer-archive.cr.cnaf.infn.it:8443/dune-grid/dune/testpro/bb/7f/awt-download-2023-03-07-01.txt

Copying davs://xfer-archive.cr.cnaf.infn.it:8443/dune-grid/dune/testpro/bb/7f/awt-download-2023-03-07-01.txt [DONE] after 0s

<> httokendecode
{
"sub": "dunepro@fnal.gov",
"iss": "https://cilogon.org/dune",
"wlcg.ver": "1.0",
"aud": "https://wlcg.cern.ch/jwt/v1/any",
"acr": "https://refeds.org/profile/sfa",
"nbf": 1775225020,
"auth_time": 1775057711,
"scope": "storage.read:/resilient/jobsub_stage storage.create:/ compute.cancel compute.create compute.read storage.read:/ storage.create:/resilient/jobsub_stage storage.create:/persistent/jobsub/jobs storage.stage:/ compute.modify",
"exp": 1775235825,
"iat": 1775225025,
"wlcg.groups": [
"/dune/production",
"/dune"
],
"jti": "<redacted>"
}
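The claims above can also be inspected without httokendecode by base64url-decoding the JWT payload locally. A minimal sketch, assuming python3 is available (the function name `jwt_payload` is just an illustration, not an existing tool):

```shell
# Decode a JWT's payload locally, similar to what httokendecode prints
# (a sketch; assumes python3 for base64url and JSON handling).
jwt_payload() {
  printf '%s' "$1" | cut -d. -f2 | python3 -c '
import base64, json, sys
p = sys.stdin.read().strip()
p += "=" * (-len(p) % 4)          # restore stripped base64url padding
print(json.dumps(json.loads(base64.urlsafe_b64decode(p)), indent=2))
'
}
```

Handy for quickly checking `exp`, `scope`, and `wlcg.groups` of whatever token is in `$BEARER_TOKEN`.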
Great,

gfal always tries to use the VOMS proxy before the token.
This was probably the cause of the problem.
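Given that precedence, a small helper can make sure neither gfal nor xrootd falls back to X.509. This is a sketch based on the commands in this thread (the token-file argument follows the htgettoken location shown earlier; `token_only_env` itself is hypothetical):

```shell
# Force token-only authentication for subsequent gfal/xrootd commands
# (sketch; the ztn restriction mirrors the test requested in this ticket).
token_only_env() {
  # $1: path to the bearer-token file, e.g. /run/user/$(id -u)/bt_u$(id -u)
  export BEARER_TOKEN=$(cat "$1")
  unset X509_USER_PROXY X509_USER_CERT X509_USER_KEY
  export XrdSecPROTOCOL=ztn   # xrootd: offer only the token (ztn) protocol
}
```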

So I think we can now move on to XRootD.

Could you also run:

xrdfs root://xrootd-archive.cr.cnaf.infn.it:1096/ ls /
xrdcp root://xrootd-archive.cr.cnaf.infn.it:1096//dune/testpro/bb/7f/awt-download-2023-03-07-01.txt ./test260403

Thanks,
Andrea
OK, here are the results, both with and without XrdSecDEBUG=1.
It is trying the ztn protocol as it should, but it is failing.

<> xrdfs root://xrootd-archive.cr.cnaf.infn.it:1096/ ls /
xrdcp root://xrootd-archive.cr.cnaf.infn.it:1096//dune/testpro/bb/7f/awt-download-2023-03-07-01.txt ./test260403
[ERROR] Server responded with an error: [3010] Unable to locate /; permission denied

[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Unable to open /dune/testpro/bb/7f/awt-download-2023-03-07-01.txt; permission denied (source)

<> xrdfs root://xrootd-archive.cr.cnaf.infn.it:1096/ ls /
[ERROR] Server responded with an error: [3010] Unable to locate /; permission denied

<> xrdcp root://xrootd-archive.cr.cnaf.infn.it:1096//dune/testpro/bb/7f/awt-download-2023-03-07-01.txt ./test260403
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Unable to open /dune/testpro/bb/7f/awt-download-2023-03-07-01.txt; permission denied (source)

<> export XrdSecDEBUG=1
<> xrdfs root://xrootd-archive.cr.cnaf.infn.it:1096/ ls /dune
sec_Client: protocol request for host xrootd-archive.cr.cnaf.infn.it token='&P=gsi,v:10600,c:ssl,ca:82977a7e.0|5a04724c.0&P=ztn,0:4096:'
sec_PM: Loaded gsi protocol object from libXrdSecgsi.so
Secgsi -------------------------------------------------------------------
Secgsi Mode: client
Secgsi Debug: 1
Secgsi CA dir: /etc/grid-security/certificates/
Secgsi CA verification level: verifyss
Secgsi CRL dir: /etc/grid-security/certificates/
Secgsi CRL extension: .r0
Secgsi CRL check level: try
Secgsi CRL refresh time: 86400
Secgsi Certificate: /nashome/t/timm/.globus/usercert.pem
Secgsi Key: /nashome/t/timm/.globus/userkey.pem
Secgsi Proxy file: /tmp/x509up_u2904
Secgsi Proxy validity: 12:00
Secgsi Proxy dep length: 0
Secgsi Proxy bits: 2048
Secgsi Proxy sign option: 1
Secgsi Proxy delegation option: 0
Secgsi Pure Cert/Key authentication allowed
Secgsi Allowed server names: [*/]<target host name>[/*]
Secgsi Crypto modules: ssl
Secgsi Ciphers: aes-128-cbc:bf-cbc:des-ede3-cbc
Secgsi MDigests: sha256
Secgsi Trusting DNS for hostname checking
Secgsi -------------------------------------------------------------------
sec_PM: Using gsi protocol, args='v:10600,c:ssl,ca:82977a7e.0|5a04724c.0'
260403 09:23:12 1426839 cryptossl_X509::CertType: certificate has 8 extensions
260403 09:23:12 1426839 secgsi_VerifyCA: Warning: CA certificate not self-signed and integrity not checked: assuming OK (82977a7e.0)
260403 09:23:12 1426839 cryptossl_X509::CertType: certificate has 8 extensions
260403 09:23:12 1426839 secgsi_QueryProxy: problems initializing proxy via external shell
sec_Client: protocol request for host xrootd-archive.cr.cnaf.infn.it token='&P=ztn,0:4096:'
sec_PM: Loaded ztn protocol object from libXrdSecztn.so
sec_PM: Using ztn protocol, args='0:4096:'
[ERROR] Server responded with an error: [3010] Unable to locate /dune; permission denied

<> xrdcp root://xrootd-archive.cr.cnaf.infn.it:1096//dune/testpro/bb/7f/awt-download-2023-03-07-01.txt ./test260403
sec_Client: protocol request for host xrootd-archive.cr.cnaf.infn.it token='&P=gsi,v:10600,c:ssl,ca:82977a7e.0|5a04724c.0&P=ztn,0:4096:'
sec_PM: Loaded gsi protocol object from libXrdSecgsi.so
Secgsi -------------------------------------------------------------------
Secgsi Mode: client
Secgsi Debug: 1
Secgsi CA dir: /etc/grid-security/certificates/
Secgsi CA verification level: verifyss
Secgsi CRL dir: /etc/grid-security/certificates/
Secgsi CRL extension: .r0
Secgsi CRL check level: try
Secgsi CRL refresh time: 86400
Secgsi Certificate: /nashome/t/timm/.globus/usercert.pem
Secgsi Key: /nashome/t/timm/.globus/userkey.pem
Secgsi Proxy file: /tmp/x509up_u2904
Secgsi Proxy validity: 12:00
Secgsi Proxy dep length: 0
Secgsi Proxy bits: 2048
Secgsi Proxy sign option: 1
Secgsi Proxy delegation option: 0
Secgsi Pure Cert/Key authentication allowed
Secgsi Allowed server names: [*/]<target host name>[/*]
Secgsi Crypto modules: ssl
Secgsi Ciphers: aes-128-cbc:bf-cbc:des-ede3-cbc
Secgsi MDigests: sha256
Secgsi Trusting DNS for hostname checking
Secgsi -------------------------------------------------------------------
sec_PM: Using gsi protocol, args='v:10600,c:ssl,ca:82977a7e.0|5a04724c.0'
260403 09:23:52 1427478 cryptossl_X509::CertType: certificate has 8 extensions
260403 09:23:52 1427478 secgsi_VerifyCA: Warning: CA certificate not self-signed and integrity not checked: assuming OK (82977a7e.0)
260403 09:23:52 1427478 cryptossl_X509::CertType: certificate has 8 extensions
260403 09:23:52 1427478 secgsi_QueryProxy: problems initializing proxy via external shell
sec_Client: protocol request for host xrootd-archive.cr.cnaf.infn.it token='&P=ztn,0:4096:'
sec_PM: Loaded ztn protocol object from libXrdSecztn.so
sec_PM: Using ztn protocol, args='0:4096:'
[0B/0B][100%][==================================================][0B/s]
Run:
Hi,

thanks a lot for the feedback.
Could you please try the following test, forcing the XRootD client to use the token only (without X.509)?

$ export BEARER_TOKEN=$(oidc-token <dune>)
$ export XrdSecPROTOCOL=ztn
$ unset X509_USER_PROXY
$
$ export XrdSecDEBUG=1
$ xrdfs root://xrootd-archive.cr.cnaf.infn.it:1096/ ls /dune
$ xrdcp root://xrootd-archive.cr.cnaf.infn.it:1096//dune/testpro/bb/7f/awt-download-2023-03-07-01.txt ./test_token

Please send us the full output.

Can you also retry with a token that has only the "storage.read:/" scope?

In particular, we would like to confirm that only the ztn protocol is used (and no gsi appears in the debug output).

Thanks,
Andrea
WLCG #1002246 (id:1002246) LHCb disk storage as published in the SRR
State: in progress  |  Priority: urgent  |  Opened: 2026-04-01 12:29 (3d ago)  |  Updated: 2026-04-03 10:16
Conversation (4 messages)
Hello,
Following a discussion with Lucio Anderlini, and his comment that there should be more disk storage available to LHCb than currently advertised, here's the information we would like to see in the SRR [1].
We'd expect to see the same 3 shares we see today:

```
+---------+--------------+-----------+-----------+----------+
| Site    | Share        | Size      | Used      | Fraction |
+---------+--------------+-----------+-----------+----------+
| CNAF    | LHCb-Disk    |   14594.8 |   12699.0 |    87.0% |
| CNAF    | LHCb_USER    |     300.0 |     190.0 |    63.3% |
| CNAF    | LHCb-Tape    |      85.8 |       0.0 |     0.0% |
+---------+--------------+-----------+-----------+----------+

```
We don't need any re-organization of the information, as long as those
numbers are reliable 🙂
It would be great if you could look at this, as we rely on this information for our data distribution.

thanks in advance, cheers, Chris & Jan
[1] https://xfer-lhcb.cr.cnaf.infn.it:8443/info/report.json
Hi Jan,
we should add the missing amount of available storage to the LHCb-Disk SE.
Could you please check the report.json file?
Thanks,
Andrea
Hi Andrea,
We now see this:
```
+--------------+-----------+-----------+----------+
| Share        | Size      | Used      | Fraction |
+--------------+-----------+-----------+----------+
| LHCb-Disk    |   17460.9 |   12536.6 |    71.8% |
| LHCb_USER    |     300.0 |     190.1 |    63.4% |
| LHCb-Tape    |       8.2 |       2.6 |    32.1% |
+--------------+-----------+-----------+----------+

```
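The Fraction column can be sanity-checked directly from Size and Used. A quick awk sketch over the pipe-separated rows above (field numbers assume that layout):

```shell
# Recompute Used/Size as a percentage for each share, to cross-check
# the Fraction column of the table above.
awk -F'|' 'NF > 4 && $3 + 0 > 0 {
  printf "%s %.1f%%\n", $2, 100 * ($4 + 0) / ($3 + 0)
}' <<'EOF'
| LHCb-Disk | 17460.9 | 12536.6 | 71.8% |
| LHCb_USER | 300.0 | 190.1 | 63.4% |
EOF
```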
Is that indeed what we should work with for now?
Thanks, Jan
Hi Jan,

sorry for the late reply and for the confusion.
The current report.json is almost correct, but we are still working on our report.json generator.
We’ll let you know as soon as we have something new.

Thanks for your understanding,
Andrea
WLCG #1002062 (id:1002062) INFN-T1 transfer, deletion and staging failures
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-12 01:40 (23d ago)  |  Updated: 2026-04-03 10:12
Conversation (19 messages)
Dear site admins, I would like to report that the site INFN-T1, especially INFN-T1_DATATAPE, has been failing as a source for transfers and staging from this endpoint, with ~4.13K transfer failures over the past 6h.
The most common errors are:

STAGING [22] [Tape REST API] Stage call failed: HTTP 502 : Unexpected server error: 502 : <html> <head><title>502 Bad Gateway</title></head> <body> <center><h1>502 Bad Gateway</h1></center> <hr><center>nginx/1.28.2</center> </body> </html>

TRANSFER ERROR: Copy failed (3rd pull). Last attempt: copy HTTP 500 : Unexpected server error: 500

The site has also shown a high rate of deletion failures (only ~50% deletion efficiency over the past 6h).

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=src_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=INFN-T1&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1773257283289&to=1773278883289

Example:

03/12/2026, 12:20:16 AM

Data Carousel Production

data17_13TeV

AOD.28874557._000032.pool.root.1

STAGING [22] [Tape REST API] Stage call failed: HTTP 502 : Unexpected server error: 502 : <html> <head><title>502 Bad Gateway</title></head> <body> <center><h1>502 Bad Gateway</h1></center> <hr><center>nginx/1.28.2</center> </body> </html> )

transfer-failed

INFN-T1_DATATAPE

CERN-PROD_DATADISK

12.9 hours

1.23 GB

https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/01885592-1d3d-11f1-bd7f-fa163ea7ee69

davs://xfer-tape-atlas.cr.cnaf.infn.it:8443/atlas-tape/atlasdatatape/data17_13TeV/AOD/r13575_p5087/data17_13TeV.00338377.physics_BphysLS.merge.AOD.r13575_p5087_tid28874557_00/AOD.28874557._000032.pool.root.1?copy_mode=pull

davs://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data17_13TeV/ea/3a/AOD.28874557._000032.pool.root.1

davs

1228974757

-1475439524-1773274816000

1773276618

1773228166000

data17_13TeV.00338377.physics_BphysLS.merge.AOD.r13575_p5087_tid28874557_00

data17_13TeV

AOD

UNKNOWN

true

UNKNOWN

447465609

1773276611000

Please have a look.

Khanh (ADCoS shift).
Thank you for your patience. Due to a temporary reduction in help
desk staffing over the next two months, you may experience longer than
usual response and resolution times. Please be assured that we are
working diligently to address your request as quickly as possible.
Hi Khanh,
could you share an FTS log file?
Unfortunately, I cannot open the link https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/01885592-1d3d-11f1-bd7f-fa163ea7ee69
and I find it pretty hard to debug anything using Monit.
Thanks for your help,
lucia
Hi again,
concerning the specific transfer you mentioned, I checked the log files of our services.

The recall process succeeded at 03:17:34 today:

03/12/26 03:17:34 tsm-hsm-7 yamssProcessRecall[2540341]: Recall of file /storage/gpfs_tsm_atlas/atlas/atlasdatatape/data17_13TeV/AOD/r13575_p5087/data17_13TeV.00338377.physics_BphysLS.merge.AOD.r13575_p5087_tid28874557_00/AOD.28874557._000032.pool.root.1 succeded
the file is in the buffer (it still is), and xfer-atlas correctly performed a GET at 04:09:18 today:

2026-03-12T04:09:18+01:00 [a4f382aa3d998cb9da7448f727ddb6a5] 2001:1458:301:27::100:75 - "-" "GET /atlas-tape/atlasdatatape/data17_13TeV/AOD/r13575_p5087/data17_13TeV.00338377.physics_BphysLS.merge.AOD.r13575_p5087_tid28874557_00/AOD.28874557._000032.pool.root.1?copy_mode=pull HTTP/1.1" "xrootd-tpc/5.9.1" 200 1228974757 6.849
where you can see that all 1228974757 bytes appear to have been correctly transferred...

I don't see a problem, but please help me,
lucia
Hi,

Here is an example of a transfer failure as source:

12/03/2026, 20:24:44

Data Carousel Production

data18_13TeV

AOD.28681953._000369.pool.root.1

TRANSFER
ERROR: Copy failed (3rd pull). Last attempt: Transfer failure: socket
timeout on GET (received 491135 B (480 KiB) of data; 0 B pending): Read
timed out

transfer-failed

INFN-T1_DATATAPE

PIC_DATADISK

14.3 mins

3.76 GB

https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/f2bb19a6-1e49-11f1-a4fe-fa163e2cda95

davs://xfer-tape-atlas.cr.cnaf.infn.it:8443/atlas-tape/atlasdatatape/data18_13TeV/AOD/r13313_p4910/data18_13TeV.00359823.physics_BphysLS.merge.AOD.r13313_p4910_tid28681953_00/AOD.28681953._000369.pool.root.1?copy_mode=pull

davs://webdav-at1.pic.es:8446/atlasdatadisk/rucio/data18_13TeV/a5/e5/AOD.28681953._000369.pool.root.1

davs

3759350859

2058218065-1773347084000

1773347101

1773343767000

data18_13TeV.00359823.physics_BphysLS.merge.AOD.r13313_p4910_tid28681953_00

data18_13TeV

AOD

UNKNOWN

true

UNKNOWN

445520297

177334708

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=src_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=INFN-T1&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&var-enr_filters=data.purged_reason%7C%3D%7CTRANSFER+ERROR%3A+Copy+failed+%283rd+pull%29.+Last+attempt%3A+Transfer+failure%3A+socket+timeout+on+GET+%3A+Read+timed+out&from=1773336368508&to=1773347168508

Best regards,

Gabriela (ADCOS shifter)
Hi Luca,
TLS communication hangs for some time, and sometimes even times out; e.g. the following transfer got stuck three times for quite some time (40s, 22s, 226s), and there is no packet loss. Are (all?) your servers somehow overloaded? Still, it is weird that transfers get stuck even during the TLS handshake:
$ curl -s -v --capath /etc/grid-security/certificates -L -I -H "Authorization: Bearer $TSRC" "$SRC" 2>&1 | timestamp
Mar 13 01:16:42.011 0.007 0.007 * Trying 2001:760:4205:128::129:204:8443...
Mar 13 01:16:42.039 0.034 0.027 * Connected to xfer-atlas.cr.cnaf.infn.it (2001:760:4205:128::129:204) port 8443 (#0)
Mar 13 01:16:42.040 0.036 0.002 * ALPN, offering h2
Mar 13 01:16:42.040 0.036 0.000 * ALPN, offering http/1.1
Mar 13 01:16:42.048 0.043 0.008 * CAfile: /etc/pki/tls/certs/ca-bundle.crt
Mar 13 01:16:42.048 0.043 0.000 * CApath: /etc/grid-security/certificates
Mar 13 01:16:42.048 0.043 0.000 * TLSv1.0 (OUT), TLS header, Certificate Status (22):
Mar 13 01:16:42.048 0.043 0.000 } [5 bytes data]
Mar 13 01:16:42.048 0.043 0.000 * TLSv1.3 (OUT), TLS handshake, Client hello (1):
Mar 13 01:16:42.048 0.043 0.000 } [512 bytes data]
Mar 13 01:17:22.295 40.290 40.247 * TLSv1.2 (IN), TLS header, Certificate Status (22):
Mar 13 01:17:22.295 40.291 0.001 { [5 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.3 (IN), TLS handshake, Server hello (2):
Mar 13 01:17:22.295 40.291 0.000 { [122 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.2 (IN), TLS header, Finished (20):
Mar 13 01:17:22.295 40.291 0.000 { [5 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.3 (IN), TLS change cipher, Change cipher spec (1):
Mar 13 01:17:22.295 40.291 0.000 { [1 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.2 (IN), TLS header, Unknown (23):
Mar 13 01:17:22.295 40.291 0.000 { [5 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
Mar 13 01:17:22.295 40.291 0.000 { [25 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.2 (IN), TLS header, Unknown (23):
Mar 13 01:17:22.295 40.291 0.000 { [5 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.2 (IN), TLS header, Unknown (23):
Mar 13 01:17:22.295 40.291 0.000 { [5 bytes data]
Mar 13 01:17:22.295 40.291 0.000 * TLSv1.3 (IN), TLS handshake, Request CERT (13):
Mar 13 01:17:22.295 40.291 0.000 { [20389 bytes data]
Mar 13 01:17:22.297 40.292 0.001 * TLSv1.2 (IN), TLS header, Unknown (23):
Mar 13 01:17:22.297 40.292 0.000 { [5 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.3 (IN), TLS handshake, Certificate (11):
Mar 13 01:17:22.297 40.292 0.000 { [2104 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.2 (IN), TLS header, Unknown (23):
Mar 13 01:17:22.297 40.292 0.000 { [5 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.3 (IN), TLS handshake, CERT verify (15):
Mar 13 01:17:22.297 40.292 0.000 { [392 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.2 (IN), TLS header, Unknown (23):
Mar 13 01:17:22.297 40.292 0.000 { [5 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.3 (IN), TLS handshake, Finished (20):
Mar 13 01:17:22.297 40.292 0.000 { [52 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.2 (OUT), TLS header, Finished (20):
Mar 13 01:17:22.297 40.292 0.000 } [5 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
Mar 13 01:17:22.297 40.292 0.000 } [1 bytes data]
Mar 13 01:17:22.297 40.292 0.000 * TLSv1.2 (OUT), TLS header, Unknown (23):
Mar 13 01:17:22.297 40.293 0.000 } [5 bytes data]
Mar 13 01:17:22.297 40.293 0.000 * TLSv1.3 (OUT), TLS handshake, Certificate (11):
Mar 13 01:17:22.297 40.293 0.000 } [8 bytes data]
Mar 13 01:17:22.297 40.293 0.000 * TLSv1.2 (OUT), TLS header, Unknown (23):
Mar 13 01:17:22.297 40.293 0.000 } [5 bytes data]
Mar 13 01:17:22.297 40.293 0.000 * TLSv1.3 (OUT), TLS handshake, Finished (20):
Mar 13 01:17:22.297 40.293 0.000 } [52 bytes data]
Mar 13 01:17:22.297 40.293 0.000 * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
Mar 13 01:17:22.297 40.293 0.000 * ALPN, server accepted to use http/1.1
Mar 13 01:17:22.297 40.293 0.000 * Server certificate:
Mar 13 01:17:22.297 40.293 0.000 * subject: DC=org; DC=terena; DC=tcs; C=IT; L=Roma; O=Istituto Nazionale di Fisica Nucleare; CN=dm-12-14-07.cr.cnaf.infn.it
Mar 13 01:17:22.297 40.293 0.000 * start date: Nov 27 10:03:20 2025 GMT
Mar 13 01:17:22.297 40.293 0.000 * expire date: Nov 27 10:03:20 2026 GMT
Mar 13 01:17:22.297 40.293 0.000 * subjectAltName: host "xfer-atlas.cr.cnaf.infn.it" matched cert's "xfer-atlas.cr.cnaf.infn.it"
Mar 13 01:17:22.297 40.293 0.000 * issuer: C=GR; O=Hellenic Academic and Research Institutions CA; CN=GEANT TLS RSA 1
Mar 13 01:17:22.297 40.293 0.000 * SSL certificate verify ok.
Mar 13 01:17:22.297 40.293 0.000 * TLSv1.2 (OUT), TLS header, Unknown (23):
Mar 13 01:17:22.297 40.293 0.000 } [5 bytes data]
Mar 13 01:17:22.297 40.293 0.000 > HEAD /webdav/atlas/atlasdatadisk/rucio/mc16_13TeV/af/05/HITS.48953936._022474.pool.root.1 HTTP/1.1
Mar 13 01:17:22.297 40.29
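The three leading columns in the trace above (wall-clock time, seconds since start, seconds since the previous line) come from the `timestamp` filter the curl output is piped through. That helper is not shown in the ticket; a minimal reimplementation could look like this (a sketch with whole-second resolution, unlike the millisecond trace above):

```shell
# timestamp: prefix each input line with time-of-day, total elapsed
# seconds, and seconds since the previous line (whole-second sketch).
timestamp() {
  start=$(date +%s); prev=$start
  while IFS= read -r line; do
    now=$(date +%s)
    printf '%s %5d %5d %s\n' "$(date '+%b %d %H:%M:%S')" \
      $((now - start)) $((now - prev)) "$line"
    prev=$now
  done
}
```

Usage, as in the trace above: `curl -s -v ... 2>&1 | timestamp`.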
Hi Petr,
thanks for debugging the issue.
Servers are not overloaded, nor is the underlying filesystem. We have evidence that some nginx workers get stuck on certain requests, which was not expected.
The StoRM developers are looking into this.
Cheers,
lucia
Hi Petr and everybody,
we've been looking into this issue together with StoRM developers, who identified a possible limitation in the current Nginx configuration.
Hence, we modified the configuration on 4 out of 5 endpoints in the xfer-atlas.cr.cnaf.infn.it alias, leaving only one unchanged for further debugging.
"Unfortunately", traffic is now very low, so we do not see any issue at the moment, but we are waiting for you to "push the button" :-)
Thanks for understanding,
lucia
Hi,
from what I can see, the traffic has been rather stable since this ticket was submitted:

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&theme=dark&var-binning=1h&var-groupby=src_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Functional+Test&var-activity=Express&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=SFO+to+EOS+export&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Recall&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=INFN-T1&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_cloud&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Express&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1773231189372&to=1773835989372&viewPanel=121

and these failures seem to be gone for several days:

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&theme=dark&var-binning=1h&var-groupby=src_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Functional+Test&var-activity=Express&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=SFO+to+EOS+export&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Recall&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=INFN-T1&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_cloud&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Express&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1773231234394&to=1773836034394&var-enr_filters=data.purged_reason%7C%3D%7E%7C.*Unexpected.*&viewPanel=122

So, the problem might be fixed. Is there any investigation ongoing or can this ticket be closed?
Hi,

the investigation is still ongoing on our side and unfortunately we do not believe that the new configuration solved the issues reported here.

Are you saying that ATLAS is not observing low efficiency even with a traffic similar to the one ATLAS generated last week, when this ticket was opened?
Thanks for checking,
lucia
Yes, according to the plot I linked, the failures stopped on 13 March.
Hi,
we enabled asynchronous I/O on all the endpoints, since our analysis suggests this improves the handling of GET requests.
It would be great if you could increase the timeout for DELETE requests to 30 seconds, in order to avoid false negatives.
Then it is OK for us to close this ticket.
Thanks for your help,
lucia
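For reference, asynchronous file I/O in Nginx is typically enabled with the `aio` directive; a hedged sketch of the kind of change described (the actual StoRM/Nginx configuration used at the site is not shown in this ticket):

```
# Inside the relevant http/server/location context:
aio threads;       # serve blocking disk reads from a thread pool
                   # (requires an nginx built with thread-pool support)
```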
Transfers and deletions continue to be fine. Let me add the ddmops for your last question/request.
So it seems the problems have been fixed, but I wonder whether something related to the addition of ddmops is still pending before the ticket can be closed.
Hi,
yes it is... "It would be great if you could increase the timeout for DELETE requests to 30 seconds, in order to avoid false negatives."
Cheers,
lucia
Ciao Lucia, all,

False negatives are less of an issue; true positives taking more than 10 seconds are. When successful, deletions typically complete in less than half a second [1]. In light of this, I am not sure that increasing the timeout would turn the failures into successes: we had 30 seconds at some point in the past, and, with changes effective and communicated on 27.01.2026 (reduce timeout from 20s to 10s), the consensus was that if a namespace operation is doomed to fail anyway, we would rather fail early and reattempt later, possibly with a backoff strategy.

That said, I'll raise the subject again within ADC/DDM, but for the purpose of this ticket we can close it.

Cheers,
Fabio for DDM Ops

[1] https://monit-grafana.cern.ch/goto/4ntRrycDR?orgId=17
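The fail-early-and-reattempt approach described above can be sketched as a generic retry wrapper. This is illustrative only, not the actual DDM implementation, and the `gfal-rm` usage line is just an example:

```shell
# Retry a command up to $1 times, doubling the delay between attempts
# (sketch of the "fail early, reattempt later with backoff" policy).
retry_with_backoff() {
  max=$1; shift
  wait=1
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    "$@" && return 0              # success: stop retrying
    [ "$attempt" -lt "$max" ] && sleep "$wait"
    wait=$((wait * 2))            # exponential backoff after each failure
    attempt=$((attempt + 1))
  done
  return 1                        # all attempts failed
}

# e.g.: retry_with_backoff 4 gfal-rm "$surl"
```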
The point is that deletion requests go into a queue, like all requests. It really doesn't matter if the request then takes 1 ms to complete. This is independent of success or failure.
The deletions issue appears to be resolved. The efficiency for millions of files over the last days is very close to 100%.
Transfers from INFN-T1 are also very close to 100% efficient over the last 7 days.

There are periods of high staging error rates, but these are dominated by timeouts rather than the "unexpected error" first reported above.

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=src_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=INFN-T1&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1774859061433&to=1775031861434&viewPanel=155

An FTS log with a bunch of these failures (STAGING [110] [Tape REST API] Stage pooling call failed: timeout of 300s):
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/cab3d0a2-2d02-11f1-9000-fa163e2cda95

Is there anything to do about these timeouts?
Hi Alexander,

we are looking into the problem with the StoRM developers to improve the situation as much as possible.
We’ll let you know as soon as we have something new.

Andrea
WLCG #1002238 (id:1002238) INFN-T1 RSE basepath prefix
State: in progress  |  Priority: less urgent  |  Opened: 2026-04-01 11:53 (3d ago)  |  Updated: 2026-04-01 12:48
Conversation (2 messages)
Your site's ATLAS config is currently using a different basepath prefix per protocol.

webdav: /webdav/atlas
xroot: /atlas

1. what is the motivation for this difference?
2. would it be possible to set the same path for both protocols?

Thanks!
Hi,
feel free to use /atlas instead of /webdav/atlas for the webdav protocol.
The latter is there for backward compatibility; they are equivalent.
Let us know if there are problems,
Cheers,
lucia
WLCG #1002105 (id:1002105) INFN-T1: failures on tape archive (archive monitor)
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-18 15:30 (17d ago)  |  Updated: 2026-03-19 12:42
Conversation (2 messages)
Dear site admins,

Following up from https://its.cern.ch/jira/browse/ATLDDMOPS-5797 (Enable archive monitoring across all tape storages), we notice a high number of failures as destination for the INFN-T1_MCTAPE endpoint (where the archive monitoring feature was enabled, while it wasn't on DATATAPE so far).

For convenience, you can find the related plots and reports in the DDM Transfer dashboard at https://monit-grafana.cern.ch/goto/bfmhAZ5vR?orgId=17 , specifically in the Transfer Plots, Errors, and Details panels.

The most common error (40k+) reads like "ARCHIVING [11] [Tape REST API] File locality reported as UNAVAILABLE".
This leads to repeated transfer attempts and, while investigating a few replicas, I noticed that we can end up with unnecessary duplicate replicas of the same file. For example (note the _1773763209 timestamp suffix):

```
# rucio list-file-replicas --rses INFN-T1_MCTAPE --pfns mc23_13p6TeV:AOD.46297565._002741.pool.root.1
davs://xfer-tape-atlas.cr.cnaf.infn.it:8443/atlas-tape/atlasmctape/mc23_13p6TeV/AOD/e8592_e8586_s4369_s4370_r16083_r15970/mc23_13p6TeV.700781.Sh_2214_Wmunu_maxHTpTV2_CFilterBVeto.merge.AOD.e8592_e8586_s4369_s4370_r16083_r15970_tid46297565_00/AOD.46297565._002741.pool.root.1_1773763209

# curl -X POST --cert $X509_USER_PROXY --capath /etc/grid-security/certificates -H "Content-Type: application/json" https://tape-atlas.cr.cnaf.infn.it:8443/api/v1/archiveinfo --data '{"paths":["/atlas-tape/atlasmctape/mc23_13p6TeV/AOD/e8592_e8586_s4369_s4370_r16083_r15970/mc23_13p6TeV.700781.Sh_2214_Wmunu_maxHTpTV2_CFilterBVeto.merge.AOD.e8592_e8586_s4369_s4370_r16083_r15970_tid46297565_00/AOD.46297565._002741.pool.root.1"]}' | jq
[
  {
    "path": "/atlas-tape/atlasmctape/mc23_13p6TeV/AOD/e8592_e8586_s4369_s4370_r16083_r15970/mc23_13p6TeV.700781.Sh_2214_Wmunu_maxHTpTV2_CFilterBVeto.merge.AOD.e8592_e8586_s4369_s4370_r16083_r15970_tid46297565_00/AOD.46297565._002741.pool.root.1",
    "locality": "TAPE"
  }
]

# curl -X POST --cert $X509_USER_PROXY --capath /etc/grid-security/certificates -H "Content-Type: application/json" https://tape-atlas.cr.cnaf.infn.it:8443/api/v1/archiveinfo --data '{"paths":["/atlas-tape/atlasmctape/mc23_13p6TeV/AOD/e8592_e8586_s4369_s4370_r16083_r15970/mc23_13p6TeV.700781.Sh_2214_Wmunu_maxHTpTV2_CFilterBVeto.merge.AOD.e8592_e8586_s4369_s4370_r16083_r15970_tid46297565_00/AOD.46297565._002741.pool.root.1_1773749267"]}' | jq
[
  {
    "path": "/atlas-tape/atlasmctape/mc23_13p6TeV/AOD/e8592_e8586_s4369_s4370_r16083_r15970/mc23_13p6TeV.700781.Sh_2214_Wmunu_maxHTpTV2_CFilterBVeto.merge.AOD.e8592_e8586_s4369_s4370_r16083_r15970_tid46297565_00/AOD.46297565._002741.pool.root.1_1773749267",
    "locality": "TAPE"
  }
]
```
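As a cross-check of the /archiveinfo responses shown above, the returned JSON bodies can be inspected programmatically. A minimal sketch (the helper name and the treatment of a DISK_AND_TAPE value are our assumptions based on WLCG Tape REST API locality values, not part of any site tooling):

```python
import json

def unavailable_paths(archiveinfo_body):
    """Parse the JSON list returned by a Tape REST API /archiveinfo call
    and return the paths whose locality does not yet include TAPE."""
    entries = json.loads(archiveinfo_body)
    return [e["path"] for e in entries
            if e.get("locality") not in ("TAPE", "DISK_AND_TAPE")]

# A response shaped like the ticket's example, but with the locality
# reported during the migration window:
body = '[{"path": "/atlas-tape/some/file.root.1", "locality": "UNAVAILABLE"}]'
print(unavailable_paths(body))  # → ['/atlas-tape/some/file.root.1']
```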

For the time being, I have disabled the archive monitoring feature for INFN-T1, but we would like the issue to be addressed so that the feature can be re-enabled.
Could you please have a look?

Cheers,
Fabio for DDM Ops
Hi Fabio,

thanks a lot for the notification.

We figured out that the problem is strictly related to the proper management of the extended attributes of the files on the tape buffer: there is a brief window of time during which the files can be in the "UNAVAILABLE" status. This is quite normal in our opinion because there is always a migration period during which the files are in an in-transit state.

We are evaluating some possible improvements and fixes, and we'll let you know about them.

However, with respect to the file you mentioned, we tried to look at its history.

The file was written on March 17th at 6:40:

pull third-party transfer completed: DONE. Source: https://se1.farm.particle.cz:443/atlas/atlasdatadisk/rucio/mc23_13p6TeV/b4/bd/AOD.46297565._002741.pool.root.1?copy_mode=pull, Destination: /atlas-tape/atlasmctape/mc23_13p6TeV/AOD/e8592_e8586_s4369_s4370_r16083_r15970/mc23_13p6TeV.700781.Sh_2214_Wmunu_maxHTpTV2_CFilterBVeto.merge.AOD.e8592_e8586_s4369_s4370_r16083_r15970_tid46297565_00/AOD.46297565._002741.pool.root.1, Bytes transferred: 7243335889, Duration (msec): 207755, Throughput: 33 MB/sec

but then you looked at the locality about 6 hours later, 4 times, with an interval of 10 minutes:

Mar 17 12:36:37
Mar 17 12:46:37
Mar 17 12:56:37
Mar 17 13:06:37

So, some questions naturally arise.

Can you confirm the time between the completed transfer and the first /archiveinfo attempt? Is it possible to configure or extend it up to 24 hours? Or were the observed 6 hours due to a certain load, or queue, on the FTS servers?

How many times can the Tape REST API report an error or the UNAVAILABLE status before FTS declares the transfer as failed and retries?

Can you extend the time between two /archiveinfo requests from 10 minutes to 1 hour?

Have you considered avoiding the retry behaviour where a timestamp is appended to the file name, and instead simply retrying the transfer of the same file?

Thanks a lot,
Andrea
WLCG #681615 (id:1716) Request to deploy IPv6 on CEs and WNs at WLCG sites (INFN-T1)
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 09:59 (430d ago)  |  Updated: 2026-01-22 13:51
Conversation (13 messages)
GGUS ID: 164372
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:37:47
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (INFN-T1)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
1. your estimate of the timescale for the deployment;
2. a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what the obstacles are and still give an estimate of the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
GGUS ID: 164372
Last modifier: Carmelo Pellegrino
Date: 2023-11-28 15:42:49

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
We are unable to perform this at the moment.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Carmelo Pellegrino
Date: 2023-11-29 06:43:43

Public Diary:
We are unable to perform this at the moment.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Andrea Sciaba
Date: 2024-06-06 08:34:17

Public Diary:
Dear Daniele,
yes, it is acceptable.
Ciao,
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Daniele Cesini
Date: 2024-06-03 07:27:13

Public Diary:
Dear Andrea,
we will probably not manage to fulfill the request by the 30th of June. We are in the middle of the live transition to the new datacenter and we would prefer to configure IPv6 on the WNs when the transition will be completed in September. All other services (CEs, DM) are already in dual stack.
Please let us know if this is acceptable.
Kind Regards,
Daniele.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Daniele Cesini
Date: 2024-10-03 09:29:00

Public Diary:
quick update on this:
we completed the transition of the farm to the new datacenter; this also involved the inclusion of WNs instantiated on supercomputing resources at CINECA (200 WNs). We "see" these resources through InfiniBand-Ethernet bridges, the SKYWAY by NVIDIA-Mellanox. Unfortunately, it turned out that these devices do not support IPv6 on the data interfaces (only on the management interfaces). So it is impossible for us to configure the whole farm in dual stack, and we prefer not to enable it on just a fraction of the nodes (GPFS would be affected).
We are pushing NVIDIA-Mellanox to release a firmware update to support IPv6 on the data plane, but it seems that this will not happen soon, so it is difficult to provide a new deadline for completing the task.
Next iteration with NVIDIA on this topic will happen at SC25 in November.
Kind Regards,
Daniele.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Daniele Cesini
Date: 2024-10-03 09:33:34

Public Diary:
sorry, I mean SC24 in November
Daniele.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Chan-anun Rungphitakchai
Date: 2024-11-21 20:55:45
Changed CC to cms-comp-ops-site-support-team@cern.ch;

Public Diary:
sorry, I mean SC24 in November
Daniele.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Andrea Sciaba
Date: 2024-12-03 13:50:27

Public Diary:
Hi Daniele, any news from Nvidia and from CNAF on this issue?
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164372
Last modifier: Andrea Chierici
Date: 2024-12-23 15:15:52

Status: on hold
Responsible Unit: NGI_IT
Public Diary:
We are unable to perform this at the moment.
Internal Diary:
Escalated this ticket to NGI_IT
Ciao Andrea,
are there any news?
Andrea
Dear Andrea,

copying the same message in #1715, unfortunately, nVidia has decided to cancel the plan they were having for the in-firmware implementation of IPv6 on their network apparatuses.

We are still unable to enable IPv6 on > 50% of our worker nodes.

Kind regards,
Carmelo
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM97%100%97%100%100%100%100%100%96%100%99%100%99%100%100%100%
HammerCloud100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #1001605 (id:1001605) Errors on transfer from JINR tape to JINR disk
State: in progress  |  Priority: urgent  |  Opened: 2026-01-21 11:44 (73d ago)  |  Updated: 2026-01-26 09:05
Conversation (1 message)
Hi, We are noticing a large proportion of failing transfers from JINR tape to the disk endpoint.

The main error message is:
TRANSFER [5] TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Transfer failure: rejected GET: 500 Server Error

We wonder if this is masking the real problem. Could you please check?

In case it is helpful, we also see a smaller number of errors with this message:
The peer's certificate with subject's DN CN=se-wbdv-mss.jinr-t1.ru,OU=jinr.ru,OU=hosts,O=RDIG,C=RU was rejected. The peer's certificate status is: FAILED The following validation errors were found:;error affecting the whole chain (category: X509_CHAIN): No trusted CA certificate was found for the certificate chain;error at position 0 in chain, problematic certificate subject: CN=se-wbdv-mss.jinr-t1.ru,OU=jinr.ru,OU=hosts,O=RDIG,C=RU (category: X509_CHAIN): Trusted issuer of this certificate was not established

Could you please investigate?

Thanks,
Katy
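The X509_CHAIN validation error quoted above can be explored locally with openssl. A sketch (a throwaway self-signed certificate stands in for the SE host certificate, since verifying the real se-wbdv-mss.jinr-t1.ru chain needs network access and the site's CA files):

```shell
# Generate a throwaway self-signed certificate; verifying it against the
# default trust store fails for the same reason as an untrusted CA chain:
# no trusted issuer can be established.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=se-test.example" \
        -keyout key.pem -out cert.pem -days 1 2>/dev/null
openssl verify cert.pem || true   # reports the trust failure
# A real check would point at the grid trust anchors, e.g.:
#   openssl verify -CApath /etc/grid-security/certificates hostcert.pem
```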
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%100%100%85%100%99%94%100%100%100%100%100%100%100%100%
HammerCloud100%100%100%100%96%99%100%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (5)

CMS tickets (1)
CMS #682704 (id:2837) File Read Fails through XRootD Fallback
State: on hold  |  Priority: less urgent  |  Opened: 2025-03-18 09:49 (382d ago)  |  Updated: 2026-03-23 11:36
Conversation (35 messages)
Hi, There are XRootD fallback file read errors at RAL when production jobs at non-RAL sites try to read files that are only at RAL Disk.
This is an example XRootD fallback file read error breakdown (job site vs input file site) during last hour of this ticket opening:
JobWN Site Input SE Site
T1_DE_KIT (888) T1_UK_RAL (878)
T1_FR_CCIN2P3 (796) T1_UK_RAL (788)
T2_US_Purdue (379) T1_UK_RAL (367), T1_US_FNAL (1)
T2_US_Vanderbilt (340) T1_UK_RAL (339)
T2_US_Wisconsin (270) T1_UK_RAL (270), T1_US_FNAL (1)
T2_CH_CERN (176) T1_UK_RAL (173)
T2_US_Nebraska (141) T1_UK_RAL (138)
T2_DE_RWTH (114) T1_UK_RAL (114)
T2_US_Caltech (81) T1_UK_RAL (81)
T1_ES_PIC (17) T1_ES_PIC (17)
T1_IT_CNAF (15) T1_IT_CNAF (11), T1_UK_RAL (4)
T2_IT_Bari (2) T1_UK_RAL (2)

As you can see, jobs at many sites are failing when they read files at RAL.
It appears the input files that are used are only at RAL, e.g., /store/data/Run2024I/ParkingDoubleMuonLowMass6/RAW/v1/000/386/604/00000/6de4141d-b4c5-4988-93ae-400a25d0ffb2.root
Also, I think this is due to the RAL xrootd servers that are exposed to AAA. They serve files from Ceph, but the Ceph xrootd servers used locally at RAL and those subscribed to AAA are of two different qualities, I am afraid.
ral_xrootd=$( xrdmapc --list all rdr.echo.stfc.ac.uk:1094 | grep Srv | awk '{print $NF}' | sort -u | head -1)
ral_aaa_xrootd=$( xrdmapc --list all cms-aaa-manager01.gridpp.rl.ac.uk:1094 | grep Srv | awk '{print $NF}' | sort -u |grep -v :0 | head -1)
xrdcp -d 1 -f root://${ral_xrootd}/$lfn /dev/null
[2025-03-18 04:50:20.919501 -0400][Info ][AsyncSock ] [ceph-gw14.gridpp.rl.ac.uk:1094.0] TLS hand-shake done.
[3.956GB/3.956GB][100%][==================================================][68.65MB/s]
xrdcp -d 1 -f root://${ral_aaa_xrootd}/$lfn /dev/null
[2025-03-18 04:51:58.019870 -0400][Info ][AsyncSock ] [ceph-gw11.gridpp.rl.ac.uk:1094.0] TLS hand-shake done.
[3.956GB/3.956GB][100%][==================================================][26.65MB/s]
While the AAA xrootd servers are fine when I run xrdcp, they fail to read files when the files are read through cmsRun. To reproduce the issue, see:
https://github.com/cms-sw/cmssw/issues/43162#issuecomment-2724627647
/afs/cern.ch/user/b/bockjoo/public/PR47593_Run/cmsRun.log
/afs/cern.ch/user/b/bockjoo/public/PR47593_Run/PSet.py

It might help if some better xrootd servers are added to cms-aaa-manager01.gridpp.rl.ac.uk.
Could somebody look into this?
Thanks,
Bockjoo

Kind regards,

Bockjoo Kim

Ticket Link: https://helpdesk.ggus.eu/#ticket/zoom/2837
added our xrootd SME to investigate
Hi Bockjoo, The CMS-AAA service at RAL started seeing unusually high traffic at 5 AM today and hit certain system limits (configured maximum number of connections and memory use), causing instability of the service.
I'll try relaxing the limits, but unless the current level of traffic is to be sustained, this can be considered a case where the provided resources were fully saturated under heavy load and is expected to return to normal once the load returns to the previous values.
Regards,
Jyothish
Hi Jyothish, Yes, I realize there is a lot of load on the AAA service at RAL at the moment, due to CMS jobs requesting files at RAL.
Hopefully it will subside, but I am afraid the problem might persist even with low load.
When the load goes down, can somebody run `cmsRun` with
/afs/cern.ch/user/b/bockjoo/public/PR47593_Run/PSet.py ( `CMSSW_15_1_0_pre1`)
from a machine not at RAL, e.g., lxplus, and see whether the issue of reading files through AAA is gone too,
to confirm there is no problem with the 3 xrootd servers under the AAA redirector at RAL?
Thanks,
Bockjoo
Hi Bockjoo, The load on cms-aaa is down currently. Could you check if the failures still persist?
Regards,
Jyothish
Hi Jyothish, I ran cmsRun with the PSet mentioned above. It went through, but the server quality does not look good, and it fails to open a server from time to time.
Could one of the rdr echo servers be moved to AAA? It could help, I think.
Thanks,
Bockjoo
Hi Bockjoo, It's the same hardware, but AAA hosts are configured with a throttle limit on connections. I'll raise that limit next week after our storage upgrades and that should improve things.
Jyothish
Hi Bockjoo,
the limit has been increased now. Could you check if the quality has improved?Thanks
Jyothish
Hi Jyothish, At the beginning of the cmsRun the server quality was good, but I am still getting this error from time to time:
```
[2025-03-24 06:14:59.302056 -0400][Error ][AsyncSock ] [ceph-svc20.gridpp.rl.ac.uk:1094.0] Socket error encountered: [ERROR] TLS error
[2025-03-24 06:14:59.302189 -0400][Error ][XRootD ] [ceph-svc20.gridpp.rl.ac.uk:1094] Unable to get the response to request kXR_readv (handle: 0x00000000, chunks: [(offset: 37599859, size: 524288); (offset: 38124147, size: 524288); (offset: 38648435, size: 524288); (offset: 39172723, size: 524288); (offset: 39697011, size: 524288); (offset: 40221299, size: 524288); (offset: 40745587, size: 256788); (offset: 41003110, size: 524288); (offset: 41527398, size: 524288); (offset: 42051686, size: 524288); (offset: 42575974, size: 344556); (offset: 42922239, size: 524288); (offset: 43446527, size: 524288); (offset: 43970815, size: 524288); (offset: 44495103, size: 524288); (offset: 45019391, size: 524288); (offset: 45543679, size: 524288); (offset: 46067967, size: 524288); (offset: 46592255, size: 524288); (offset: 47116543, size: 524288); (offset: 47640831, size: 524288); (offset: 48165119, size: 524288); (offset: 48689407, size: 524288); (offset: 49213695, size: 524288); (offset: 49737983, size: 524288); (offset: 50262271, size: 524288); (offset: 50786559, size: 524288); (offset: 51310847, size: 524288); (offset: 51835135, size: 524288); (offset: 52359423, size: 524288); (offset: 52883711, size: 524288); (offset: 53407999, size: 524288); (offset: 53932287, size: 524288); (offset: 54456575, size: 524288); (offset: 54980863, size: 524288); (offset: 55505151, size: 300014); (offset: 58816793, size: 2038); (offset: 65474772, size: 401); (offset: 65482018, size: 731); (offset: 65490423, size: 147); (offset: 65502209, size: 145); (offset: 65502889, size: 1773); ], total size: 18208097)
Got failure when trying to open a new source

```
cmsRun is running, though.
Thanks,
Bockjoo
Thanks Bockjoo, so it looks like the limit increase has worked. I'll observe the situation on the storage cluster over today to ensure it still has sufficient IOps capacity to accommodate further increases and maintain headroom for bursts.
If all goes well, I'll increase the limits further over this week.
Regards
Jyothish
Sounds good! Thanks Jyothish! Bockjoo
Hi Jyothish, A large number of jobs are failing to read files from RAL:
```
JobWN Site Input SE Site
T2_CH_CERN (2527) T1_UK_RAL (2480), EU_Redir (6), T1_IT_CNAF (5), T2_BR_SPRACE (1), Global_Redir (1)
T1_FR_CCIN2P3 (571) T1_UK_RAL (566), EU_Redir (1), T1_IT_CNAF (1)
T2_DE_RWTH (274) T1_UK_RAL (271), EU_Redir (1)
T2_US_Caltech (228) T1_UK_RAL (223), T1_IT_CNAF (3), T1_US_FNAL (1)
T2_ES_CIEMAT (198) T1_ES_PIC (146), T1_UK_RAL (50), EU_Redir (2)
T2_US_Purdue (105) T1_UK_RAL (103), T1_IT_CNAF (2)
T2_US_Wisconsin (103) T1_UK_RAL (101), T1_IT_CNAF (2)
T2_US_Vanderbilt (94) T1_UK_RAL (83), T1_IT_CNAF (11)
T1_DE_KIT (88) T1_UK_RAL (87), T1_IT_CNAF (1)
T2_IT_Bari (13) EU_Redir (9), T1_UK_RAL (2), T1_RU_JINR (1), T1_IT_CNAF (1)
T2_DE_DESY (12) T1_UK_RAL (4), EU_Redir (2), T2_US_MIT (2), T1_RU_JINR (1), T2_US_Vanderbilt (1), T2_US_Wisconsin (1), T1_IT_CNAF (1)
T2_US_UCSD (8) T1_IT_CNAF (4), T2_CH_CERN (2), T2_BR_SPRACE (1), T1_UK_RAL (1)
T2_UK_London_IC (6) EU_Redir (6)
T2_US_MIT (5) T1_IT_CNAF (4)
T2_FR_GRIF (5) EU_Redir (2), T2_US_Vanderbilt (2), T1_IT_CNAF (1)
T2_IT_Legnaro (3) T1_IT_CNAF (2), EU_Redir (1)
T1_RU_JINR (3) T1_RU_JINR (3)
T3_US_NERSC (2) T1_US_FNAL (1)
T2_IT_Rome (2) T1_UK_RAL (1), T1_IT_CNAF (1)
T1_UK_RAL (2) T2_BR_SPRACE (1), T1_IT_CNAF (1)
```
So, it seems the change was barely enough.

Thanks,
Bockjoo
Hi Bockjoo, One of the servers in the cluster (svc20) is currently having issues, so the failures might be from that.
Unfortunately it's unlikely I'll be able to increase the limits more than what they currently are as the IOps hitting the storage cluster are close to our physical limits.
I'll update the ticket once that server is back to being stable and if the failures still occur, we'll see on how to proceed from there.
Thanks,
Jyothish
Hi Bockjoo, That server has now been stable for the past hour. How is the quality looking for this period?
Regards,
Jyothish
Hi Jyothish, The number of AAA file read errors is reduced. I think it's simply because the jobs that were accessing files at RAL are no longer running.
Thanks,
Bockjoo
Hi Katy, You can see the files being accessed by looking at the lsof output against the xrootd process.
Also, I would like to see the number of connections when the load is high, via the ss -nrp command.
Thanks,
Bockjoo
Hi Bockjoo,
on the ss -nrp:
[root@ceph-svc20 ~]# ss -nrp | wc -l
30335

lsof of the systemd service accessing files (xrootd@ceph):
[root@ceph-svc20 ~]# lsof -p 2336767 | wc -l
25430
Filenames are not shown: since the files are read from remote storage, we only get file handle ids.

Regards,
Jyothish
Hi Jyothish, If you filter ss -nrp | grep :1094, you should be able to see which hosts are connecting to the server.
If you break them down by TLD, we will better understand where the connections are coming from.
Thanks,
Bockjoo
with some sorting and filtering, I've got the following sources for open connections:
[root@ceph-svc20 ~]# sort tst2 | uniq -c
32
52 accre.vanderbilt.edu
1172 cern.ch
1 chtc.wisc.edu
21 ciemat.es
20 cmsaf.mit.edu
2 datagrid.cea.fr
34 desy.de
1 dice.priv
319 fnal.gov
67 gridka.de
68 hep.wisc.edu
4 ijclab.in2p3.fr
187 in2p3.fr
1 kfki.hu
18 lnl.infn.it
4 physics.ox.ac.uk
138 physik.rwth-aachen.de
1 pic.es
35 rcac.purdue.edu
71 recas.ba.infn.it
1 roma1.infn.it
13 t2.ucsd.edu
51 ultralight.org
4 unl.edu
2317 open connections on port 1094
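The per-domain count above can be produced with a short pipeline. A sketch, assuming the peer hostnames from `ss -nrp | grep :1094` have already been extracted one per line into peers.txt (only the last two DNS labels are kept here, so e.g. hep.wisc.edu and chtc.wisc.edu would both count under wisc.edu, unlike the finer breakdown shown above):

```shell
# Count open connections per (approximate) source domain:
# keep the last two DNS labels of each peer name, then tally.
awk -F. 'NF>=2 {print $(NF-1)"."$NF}' peers.txt | sort | uniq -c | sort -rn
```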
Hi CMS,

Does this help in any way ?

Daniela
Hi Daniela, If you are asking whether the information Jyothish provided is useful: yes, knowing the connections gives us some insight into whether any specific site contributes to the load. The information gives us some general idea.
Otherwise, can you be more specific?
Thanks,
Bockjoo
Hi Bockjoo,

I'm just doing the weekly ticket follow-up, trying to work out which tickets are stuck, and from the last update, which was a week ago, it wasn't clear (to me) whether there was anything actionable in this ticket. But I gather from your reply that this is still being worked on.

Thanks,
Daniela
Hi Daniela, There were further updates on a separate thread that I forgot to copy here:
The current status from RAL is that the limit for CMS-AAA cannot be raised further, because the type of traffic results in a much higher IOps count hitting the storage, pushing it near its theoretical limits.
As a result, a long-term project has started at RAL to replace the standard XRootD gateways with XCaches, which would protect our storage while providing a better service for AAA; but this will need the appropriate hardware to become available in a few months.
Regards,
Jyothish
----
Alastair Dewhurst reopened this request.

Sent on 02 April 2025 12:49:29 BST
As an update - an antares buffer node is to be used for testing of AAA XCache when it's retired from production use in antares. I'll update this ticket once the host is available.
Jyothish Thomas Do you have a time line for this ?
No concrete timeline on its availability yet. I'm planning some smaller scale proof of concept tests using our available hardware around November/December this year.
Jyothish Thomas Do you have an update on this ticket, please ?
Proof-of-concept tests were not run because the available server specs were too low. The Antares node is still in production use and not available yet. One node that fits this use case will be included in the next round of procurement.
Hi

We shouldn't rely on the next round of procurement. The Antares nodes are still the best bet for this and I feel that we should have at least one available before data taking restarts.

Alastair
Alastair Dewhurst, Jyothish Thomas:
Have you agreed on a way forward? Mainly I am looking for a timeline at this point.
Thanks,
Daniela
Hi Daniela, Timeline-wise we're looking at mid-March for the proof of concept to be ready for testing, with a possible full transition around late September if the validation works.
The test nodes are now being migrated and re-racked for this use.
Hardware migration is delayed to prioritise adding new production hardware. I'll post further updates after Easter.
WLCG tickets (4)
WLCG #1001721 (id:1001721) Intermittent checksum errors writing dune test files to webdav.echo.stfc.ac.uk
State: in progress  |  Priority: urgent  |  Opened: 2026-02-03 21:32 (59d ago)  |  Updated: 2026-04-02 15:00
Conversation (45 messages)
In the past few days we have seen about half of our DUNE AWT test jobs fail when writing files to davs://webdav.echo.stfc.ac.uk

DEBUG:root:gfal.NoRename: uploading file from awt-1770047259-ZdrNp7GUXd to davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/5d/5e/awt-1770047259-ZdrNp7GUXd
INFO:root:Successful upload of temporary file. davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/5d/5e/awt-1770047259-ZdrNp7GUXd
DEBUG:root:skip_upload_stat=False
DEBUG:root:stat: pfn=davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/5d/5e/awt-1770047259-ZdrNp7GUXd
DEBUG:root:gfal.NoRename: getting stats of file davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/5d/5e/awt-1770047259-ZdrNp7GUXd
WARNING:root:Upload attempt failed
INFO:root:Exception: RSE checksum unavailable.
Details:
Error while processing gfal checksum call (adler32). Error: HTTP 403 : Permission refused
Error while processing gfal checksum call (md5). Error: checksum calculation for MD5 not supported for davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/5d/5e/awt-1770047259-ZdrNp7GUXd

So although the file is transferred to the storage, the checksum calculation is not being done correctly, and thus Rucio registers the file upload as a failure.
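For reference, the adler32 the client asks the RSE for is the standard zlib Adler-32, rendered as a zero-padded 8-digit hex string. A local computation for comparison with the catalogue value can be sketched as (the helper name is ours, not part of Rucio or gfal):

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Stream a file through zlib.adler32 and return the checksum as the
    zero-padded 8-digit hex string used in catalogue comparisons."""
    value = 1  # Adler-32 seed value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)
```

When the storage-side checksum call fails with HTTP 403 as in the log above, gfal cannot obtain the remote value to compare against this, so the upload is declared failed even though the bytes arrived.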

In some other cases we see the upload fail because the xrootd call that is meant to do the file-existence pre-check fails. This can be either a "permission denied" or a "service is not available" error:

DEBUG:root:gfal.NoRename: checking if file exists root://xrootd.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/d0/d2/awt-1770152619-4SmcKAlvDC
--- Upload try 1/1
--- Rucio upload 1/1 fails: The requested service is not available at the moment.
Details: An unknown exception occurred.
Details: Failed to stat file (Permission denied)

and in some cases the upload proceeds without incident.

Since there have been recent changes to the voms1.fnal.gov.lsc and voms2.fnal.gov.lsc files, please first verify that these are in order on all hosts.
I note that actually reading our test file from these servers is fine, the errors happen only on inbound copy.

Steven Timm
Hi Steven,

I have assigned this ticket to the appropriate team and would hope to get a status update back to you within the next 4 working hours.
Hi Steven, we recently replaced the lsc files in our servers. The current files look like this:

/DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=voms1.fnal.gov
/C=US/O=Internet2/CN=InCommon RSA IGTF Server CA 3

/DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=voms2.fnal.gov
/C=US/O=Internet2/CN=InCommon RSA IGTF Server CA 3

From our server logs,

260204 04:34:16 1612657 XrdVomsFun: adding cert: /C=UK/O=eScience/OU=Manchester/L=HEP/CN=justin-jobs-no-roles.dune.hep.ac.uk/CN=3460465510
260204 04:34:16 1612657 XrdVomsFun: adding cert: /C=UK/O=eScience/OU=Manchester/L=HEP/CN=justin-jobs-no-roles.dune.hep.ac.uk
260204 04:34:16 1612657 XrdVomsFun: retrieval FAILED: Cannot verify AC signature! Underlying error: Unable to match certificate chain against file: /etc/grid-security/vomsdir/dune/voms1.fnal.gov.lsc

I'll check if there's a missing update somewhere else in the chain and update if anything pops up.

Jyothish
Well, by eye the content of the voms*.lsc files looks OK, and reads are working across the board. We only get the issue during writes, which is strange.

While you continue to check, I will look at the proxy being used in AWT jobs from JustIN and make sure it has indeed been updated. It may be that an old cached version is being used, although I doubt it since we rebooted and reset the whole system yesterday.
Hi Steven, things look better on our side. Could you confirm if the issue is resolved?
No, the errors are the same as before. Will try to get a better reproducer of it today and add more to the ticket.
The errors still persist with the same error messages as initially reported.
Hi Steve, The justin certs now correctly map onto dune, from our server logs. Is the issue still ongoing?
could you provide some client logs for a failed transfer? I can try to find more info
This is difficult because I can't reproduce it interactively yet; it's only happening in the automated AWT testing. I will try to get logs from there.
GFAL_CONFIG_DIR: GFAL_PLUGIN_DIR:
justin-rucio-upload attempt 1
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x151e9414a970> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL_ECHO HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL_ECHO HTTP/1.1" 200 1222
DEBUG:root:Input validation done.
INFO:root:Preparing upload for file awt-1772031697-7xsxJfTpD1
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL_ECHO/attr/ HTTP/1.1" 200 351
DEBUG:root:wan domain is used for the upload
DEBUG:root:Registering file
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /accounts/dunepro/scopes/ HTTP/1.1" 200 870
DEBUG:root:Trying to create dataset: testpro:awt-uploads-202608
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/testpro/awt-uploads-202608 HTTP/1.1" 409 104
INFO:root:Dataset testpro:awt-uploads-202608 already exists - no rule will be created
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/testpro/awt-1772031697-7xsxJfTpD1/meta?plugin=DID_COLUMN HTTP/1.1" 404 129
DEBUG:root:File DID does not exist
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 201 7
INFO:root:Successfully added replica in Rucio catalogue at RAL_ECHO
DEBUG:root:gfal.NoRename: connecting to storage
DEBUG:root:Checking if root://xrootd.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/7c/f0/awt-1772031697-7xsxJfTpD1 exists
DEBUG:root:gfal.NoRename: checking if file exists root://xrootd.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/7c/f0/awt-1772031697-7xsxJfTpD1
--- Upload try 1/1
--- Rucio upload 1/1 fails: The requested service is not available at the moment.
Details: An unknown exception occurred.
Details: Failed to stat file (Permission denied)
--- Exit with 99
'justin-rucio-upload --rse RAL_ECHO --protocol davs --scope testpro --dataset aw
t-uploads-202608 awt-1772031697-7xsxJfTpD1 --timeout 1200' returns 99
So the log above is the same as what I initially posted. There are two ways it can fail / is failing.
The above is part of a rucio upload; it is failing when it tries to stat, via xrootd, whether the file exists.
I will see if I can find a copy of another one.
here is a different error
[dunepro@fermicloud808 ~]$ rucio -v upload --rse RAL_ECHO --lifetime 172000 --scope test --register-after-upload --protocol davs /tmp/1gbtestfile.20260225.cccc
2026-02-25 09:31:12,311 DEBUG baseclient.py No trace_host passed. Using rucio_host instead
2026-02-25 09:31:12,311 DEBUG baseclient.py No creds passed. Trying to get it from the config file.
2026-02-25 09:31:12,311 DEBUG baseclient.py HTTPS is required, but no ca_cert was passed. Trying to get it from X509_CERT_DIR.
2026-02-25 09:31:12,312 DEBUG baseclient.py HTTPS is required, but no ca_cert was passed and X509_CERT_DIR is not defined. Trying to get it from the config file.
2026-02-25 09:31:12,312 DEBUG baseclient.py No account passed. Trying to get it from the RUCIO_ACCOUNT environment variable or the config file.
2026-02-25 09:31:12,312 DEBUG baseclient.py No VO passed. Trying to get it from environment variable RUCIO_VO.
2026-02-25 09:31:12,312 DEBUG baseclient.py No VO found. Trying to get it from the config file.
2026-02-25 09:31:12,312 DEBUG baseclient.py No VO found. Using default VO.
2026-02-25 09:31:12,312 DEBUG baseclient.py got token from file
2026-02-25 09:31:16,529 DEBUG uploadclient.py Num. of files that upload client is processing: 1
2026-02-25 09:31:16,530 DEBUG baseclient.py HTTP request: GET https://dune-rucio.fnal.gov/rses/?expression=RAL_ECHO

2026-02-25 09:31:16,530 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden]
2026-02-25 09:31:16,530 DEBUG baseclient.py HTTP header: X-Rucio-VO: def
2026-02-25 09:31:16,531 DEBUG baseclient.py HTTP header: Connection: Keep-Alive
2026-02-25 09:31:16,531 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/38.3.0
2026-02-25 09:31:16,531 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::-v
2026-02-25 09:31:16,531 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro
2026-02-25 09:31:16,554 DEBUG baseclient.py HTTP Response: 401 UNAUTHORIZED
2026-02-25 09:31:16,555 DEBUG baseclient.py Response text (length=106): [{"ExceptionClass": "CannotAuthenticate", "ExceptionMessage": "Cannot authenticate with given credentials"}]
2026-02-25 09:31:16,555 DEBUG baseclient.py get a new token
2026-02-25 09:31:16,555 DEBUG baseclient.py HTTP request: GET https://dune-rucio.fnal.gov/auth/x509_proxy

2026-02-25 09:31:16,555 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden]
2026-02-25 09:31:16,555 DEBUG baseclient.py HTTP header: X-Rucio-VO: def
2026-02-25 09:31:16,555 DEBUG baseclient.py HTTP header: Connection: Keep-Alive
2026-02-25 09:31:16,555 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/38.3.0
2026-02-25 09:31:16,555 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::-v
2026-02-25 09:31:16,556 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro
2026-02-25 09:31:16,584 DEBUG baseclient.py HTTP Response: 200 OK
2026-02-25 09:31:16,609 DEBUG baseclient.py HTTP Response: 200 OK
2026-02-25 09:31:16,636 DEBUG uploadclient.py Input validation done.
2026-02-25 09:31:16,637 INFO Preparing upload for file 1gbtestfile.20260225.cccc
2026-02-25 09:31:16,637 DEBUG baseclient.py HTTP request: GET https://dune-rucio.fnal.gov/rses/RAL_ECHO/attr/

2026-02-25 09:31:16,637 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden]
2026-02-25 09:31:16,637 DEBUG baseclient.py HTTP header: X-Rucio-VO: def
2026-02-25 09:31:16,637 DEBUG baseclient.py HTTP header: Connection: Keep-Alive
2026-02-25 09:31:16,637 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/38.3.0
2026-02-25 09:31:16,638 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::-v
2026-02-25 09:31:16,638 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro
2026-02-25 09:31:16,648 DEBUG baseclient.py HTTP Response: 200 OK
2026-02-25 09:31:16,649 DEBUG uploadclient.py wan domain is used for the upload
2026-02-25 09:31:16,808 DEBUG logging.py gfal.NoRename: connecting to storage
2026-02-25 09:31:17,164 DEBUG rsemanager.py Checking if davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/test/de/f8/1gbtestfile.20260225.cccc exists
2026-02-25 09:31:17,164 DEBUG logging.py gfal.NoRename: checking if file exists davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/test/de/f8/1gbtestfile.20260225.cccc

2026-02-25 09:31:17,799 ERROR The requested service is not available at the moment.
Details: An unknown exception occurred.
Details: HTTP 403 : Permission refused
Completed in 5.4897 sec.
[dunepro@fermicloud808 ~]$ voms-proxy-info -all
subject : /DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=dunepro-dunegpvm01.fnal.gov/CN=2543900623
issuer : /DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=dunepro-dunegpvm01.fnal.gov
identity : /DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=dunepro-dunegpvm01.fnal.gov
type : RFC compliant proxy
strength : 2048 bits
path : /opt/dunepro/dunepro.Production.proxy
timeleft : 165:27:32
key usage : Digital Signature, Key Encipherment
=== VO dune extension information ===
VO : dune
subject : /DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=dunepro-dunegpvm01.fnal.gov
issuer : /DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Forward Discovery Group, LLC/CN=voms1.fnal.gov
attribute : /dune/Role=Production/Capability=NULL
attribute : /dune/Role=NULL/Capability=NULL
timeleft : 165:27:32
uri : voms1.fnal.gov:15042
As I said above, the failures are intermittent. Is there more than one physical xrootd server behind the xrootd gateway, for instance? Or more than one httpd door/server behind the gateway? It's got to be something like that.
On closer look, I see these failures are also showing up in FTS3:

INFO Wed, 25 Feb 2026 16:18:08 +0100; Setting Gfal2 configuration: RETRIEVE_BEARER_TOKEN=false
INFO Wed, 25 Feb 2026 16:18:08 +0100; Will attempt retrieval of "SE-issued token (macaroon)" for source
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix:
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: > POST /dune:/protodune/RSE/testpro/59/c9/awt-1772031788-SR8kkqxH6e HTTP/1.1
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: > Host: webdav.echo.stfc.ac.uk:1094

INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: > Accept: */*
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: > Content-Type: application/macaroon-request
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: > User-Agent: libdavix/0.8.10 libcurl/7.76.1
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: > Content-Length: 60
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix:
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: < HTTP/1.1 200 OK
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: < Connection: Keep-Alive
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: < Server: XrootD/v5.7.3
INFO Wed, 25 Feb 2026 16:18:08 +0100; Davix: < Content-Length: 463
INFO Wed, 25 Feb 2026 16:18:08 +0100; Will attempt retrieval of "SE-issued token (macaroon)" for destination
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix:
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: > POST /dune/persistent/staging/testpro/59/c9/awt-1772031788-SR8kkqxH6e HTTP/1.1
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: > Host: fndcadoor.fnal.gov:2880

INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: > Accept: */*
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: > Content-Type: application/macaroon-request
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: > User-Agent: libdavix/0.8.10 libcurl/7.76.1
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: > Content-Length: 72
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix:
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: < HTTP/1.1 200 OK
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: < Date: Wed, 25 Feb 2026 15:18:09 GMT
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: < Server: dCache/10.2.17
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: < Strict-Transport-Security: max-age=31536000
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: < Content-Type: application/json
INFO Wed, 25 Feb 2026 16:18:09 +0100; Davix: < Content-Length: 1600
INFO Wed, 25 Feb 2026 16:18:09 +0100; Setting Gfal2 configuration: DEFAULT_COPY_MODE=3rd pull
INFO Wed, 25 Feb 2026 16:18:09 +0100; Setting Gfal2 configuration: ENABLE_FALLBACK_TPC_COPY=false
INFO Wed, 25 Feb 2026 16:18:09 +0100; Transfer accepted
INFO Wed, 25 Feb 2026 16:18:09 +0100; Proxy: /tmp/x509up_h15882726718118876761_1e114c457739fe0c
INFO Wed, 25 Feb 2026 16:18:09 +0100; VO: dune
INFO Wed, 25 Feb 2026 16:18:09 +0100; Job id: 307d82ea-125d-11f1-b711-fa163e5dcbe0
INFO Wed, 25 Feb 2026 16:18:09 +0100; File id: 324542250
INFO Wed, 25 Feb 2026 16:18:09 +0100; Source url: davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/59/c9/awt-1772031788-SR8kkqxH6e

INFO Wed, 25 Feb 2026 16:18:09 +0100; Dest url: davs://fndcadoor.fnal.gov:2880/dune/persistent/staging/testpro/59/c9/awt-1772031788-SR8kkqxH6e

INFO Wed, 25 Feb 2026 16:18:09 +0100; Overwrite enabled: 1
INFO Wed, 25 Feb 2026 16:18:09 +0100; Overwrite when only on disk: 0
INFO Wed, 25 Feb 2026 16:18:09 +0100; Disable delegation: 0
INFO Wed, 25 Feb 2026 16:18:09 +0100; Disable local streaming: 1
INFO Wed, 25 Feb 2026 16:18:09 +0100; Skip eviction of source file: 0
INFO Wed, 25 Feb 2026 16:18:09 +0100; Disable cleanup: 0
INFO Wed, 25 Feb 2026 16:18:09 +0100; Source space token:
INFO Wed, 25 Feb 2026 16:18:09 +0100; Dest space token:
INFO Wed, 25 Feb 2026 16:18:09 +0100; Checksum: 60df073c
INFO Wed, 25 Feb 2026 16:18:09 +0100; Checksum mode: both
INFO Wed, 25 Feb 2026 16:18:09 +0100; User filesize: 26
INFO Wed, 25 Feb 2026 16:18:09 +0100; Scitag: 0
INFO Wed, 25 Feb 2026 16:18:09 +0100; File metadata: {"request_id": "51655d400538466fb84954a85a5e5f90", "scope": "testpro", "name": "awt-1772031788-SR8kkqxH6e", "activity": "User Subscriptions", "request_type": "TRANSFER", "src_type": "DISK", "dst_type": "DISK", "src_rse": "RAL_ECHO", "dst_rse": "DUNE_US_FNAL_DISK_STAGE", "src_rse_id": "74b1c7c3d91f4afd9df68e76b6b798ca", "dest_rse_id": "eefb7729e020491d99e1a081a1b6afe6", "filesize": 26, "md5": "a3508cf3d43c935cd80c985077c50bf4", "adler32": "60df073c"}
INFO Wed, 25 Feb 2026 16:18:09 +0100; Archive metadata:
INFO Wed, 25 Feb 2026 16:18:09 +0100; Job metadata: {"issuer": "rucio", "multi_sources": false, "overwrite_when_only_on_disk": false, "auth_method": "certificate"}
INFO Wed, 25 Feb 2026 16:18:09 +0100; Bringonline token:
INFO Wed, 25 Feb 2026 16:18:09 +0100; UDT: 0
INFO Wed, 25 Feb 2026 16:18:09 +0100; Report on the destination tape file: 0
INFO Wed, 25 Feb 2026 16:18:09 +0100; Third Party TURL protocol list: https;gsiftp;root
INFO Wed, 25 Feb 2026 16:18:09 +0100; Getting source fi
Hi Steven, apologies for the wait. There are multiple servers behind the redirector, all with the same configuration. From the logs, the servers that deny some requests from dune also grant access multiple times, so I find it unlikely to be the cause.

unknown.5441200:35687@dunestor2503.fnal.gov login as nobody

260309 03:42:43 2682211 macarons_AuthzCheck: running verify path /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn
260309 03:42:43 2682211 macarons_AuthzCheck: path request verified for /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn
260309 03:42:43 2682211 acc_Audit: unknown.5441200:35687@dunestor2503.fnal.gov deny https *@[::ffff:131.225.238.18] read /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn
260309 03:42:43 2682211 ofs_open: unknown.5441200:35687@dunestor2503.fnal.gov Unable to open /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn; permission denied
260309 03:42:43 2682211 XrootdXeq: unknown.5441200:35687@dunestor2503.fnal.gov disc 0:00:00 (send failure)
260309 03

One thing I noticed is that all the denies come from the same clients, which appear to be worker nodes at Fermilab:

260309 03:34:15 2697868 acc_Audit: unknown.5398997:36070@stkendca2214.fnal.gov deny https *@[::ffff:131.225.69.179] read /dune:/protodune/RSE/testpro/f5/c6/awt-1773026848-V57GLqR2Ij
260309 03:28:43 2670175 acc_Audit: unknown.5475288:35808@stkendca2205.fnal.gov deny https *@[::ffff:131.225.69.22] read /dune:/protodune/RSE/testpro/ac/ad/awt-1773026229-6Y6N34yIR6
260309 03:38:44 2666851 acc_Audit: unknown.5448397:35936@dunestor2501.fnal.gov deny https *@[::ffff:131.225.238.16] read /dune:/protodune/RSE/testpro/ac/ad/awt-1773026229-6Y6N34yIR6
260309 03:26:14 2723049 acc_Audit: unknown.5452323:36137@dunestor2502.fnal.gov deny https *@[::ffff:131.225.238.17] read /dune:/protodune/RSE/testpro/c3/00/awt-1773026185-zv6NPQq3XG
260309 03:26:14 2686480 acc_Audit: unknown.5425716:36106@dunestor2505.fnal.gov deny https *@[::ffff:131.225.238.20] read /dune:/protodune/RSE/testpro/90/d0/awt-1773026262-YS6aYYdlaY
260309 03:42:43 2682211 acc_Audit: unknown.5441200:35687@dunestor2503.fnal.gov deny https *@[::ffff:131.225.238.18] read /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn
260309 03:35:15 2779234 acc_Audit: unknown.6401528:36017@stkendca2220.fnal.gov deny https *@[2620:6a:0:4812:f0:0:69:185] read /dune:/protodune/RSE/testpro/f4/ec/awt-1773026747-mB0Fufdntj
260309 03:44:44 2738464 acc_Audit: unknown.6403016:35853@stkendca2226.fnal.gov deny https *@[2620:6a:0:4812:f0:0:69:191] read /dune:/protodune/RSE/testpro/f5/c6/awt-1773026848-V57GLqR2Ij

Could you check if anything changed on their end? Or what is the cert bundle used by their worker nodes?
dunestor2503 is a new node that was recently added to our dCache pool. It (and 12 other newly-added nodes) is on a different subnet than the previous Fermilab dCache nodes, namely 131.225.238.0/23. Could we be dealing with a situation where either the LHCONE or LHCOPN configurations are not up to date? Fermilab changed them on our end approximately 3 weeks ago.

Steve Timm

timm@MAC-141177 ~ % host dunestor2503.fnal.gov

dunestor2503.fnal.gov has address 131.225.238.18

dunestor2503.fnal.gov has IPv6 address 2620:6a:0:8470:f0:0:238:18
timm@MAC-141177 ~ % host stkendca2220.fnal.gov

stkendca2220.fnal.gov has address 131.225.69.185

stkendca2220.fnal.gov has IPv6 address 2620:6a:0:4812:f0:0:69:185

However, that in itself cannot be the whole problem, because we have rucio uploads to Echo failing from all around the grid; it is not just third-party transfers to Fermilab that are failing.

Steve Timm
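The denied clients in the acc_Audit lines are logged as IPv4-mapped IPv6 literals (`::ffff:...`). A quick stdlib sanity check (a sketch, using addresses quoted in the deny lines and the subnet from the message above) shows which of them unmap into the new 131.225.238.0/23 range:

```python
import ipaddress

# Addresses as they appear in the xrootd acc_Audit deny lines above.
denied = ["::ffff:131.225.238.18", "::ffff:131.225.238.16", "::ffff:131.225.69.179"]
new_subnet = ipaddress.ip_network("131.225.238.0/23")  # new dCache pool subnet

for a in denied:
    v6 = ipaddress.ip_address(a)
    v4 = v6.ipv4_mapped  # unmapped IPv4 address, or None if not ::ffff:-mapped
    print(a, "->", v4, "in new subnet:", v4 in new_subnet)
```

Note that the stkendca addresses (131.225.69.x) fall outside the new subnet, which is consistent with the observation that denies are not limited to the newly-added nodes.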
Are the other failures also permission denied, or are they timing out? Network connectivity should be fine, as the transfers are getting to the server; this looks more likely to be some CA chain/auth issue.
These are permission errors, and according to the FTS logs they are on the RAL end:

INFO Thu, 12 Mar 2026 10:32:30 +0100; Davix: PerformanceMarker:
failure: rejected GET: 403 Forbidden; redirects [https://ceph-svc39.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/d8/e8/awt-1773307296-YHTqAUVPYv?<redacted>]

INFO Thu, 12 Mar 2026 10:32:30 +0100; Gfal2: Copy failed with mode 3rd pull: Transfer failure: rejected GET: 403 Forbidden; redirects [https://ceph-svc39.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/d8/e8/awt-1773307296-YHTqAUVPYv?<redacted>

INFO Thu, 12 Mar 2026 10:32:30 +0100; Gfal2: Gfal http copy clean-up disabled

INFO Thu, 12 Mar 2026 10:32:30 +0100; [1773307950221] BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd pull). Last attempt: Transfer failure: rejected GET: 403 Forbidden; redirects [https://ceph-svc39.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/d8/e8/awt-1773307296-YHTqAUVPYv?<redacted>

INFO Thu, 12 Mar 2026 10:32:30 +0100; Gfal2: Event triggered: BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd pull). Last attempt: Transfer failure: rejected GET: 403 Forbidden; redirects [https://ceph-svc39.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/d8/e8/awt-1773307296-YHTqAUVPYv?<redacted>

INFO Thu, 12 Mar 2026 10:32:30 +0100; Gfal2: Using bearer token for HTTPS request authorization

You should be able to see these logs on fts3-public.cern.ch if you have any kind of grid cert.

https://fts-public-004.cern.ch:8449/var/log/fts3/transfers/2026-03-12/webdav.echo.stfc.ac.uk__fndcadoor.fnal.gov/2026-03-12-0932__webdav.echo.stfc.ac.uk__fndcadoor.fnal.gov__326531582__3d6ebd1c-1df6-11f1-9a26-fa163ea627c0

for instance

https://fts3-public.cern.ch:8449/fts3/ftsmon/#/errors/list?source_se=davs:%2F%2Fwebdav.echo.stfc.ac.uk&dest_se=davs:%2F%2Ffndcadoor.fnal.gov&time_window=6

The above one mentions ceph-svc39, but we see it with a large number of different machines on your end.
From Dmitry Litvinsev, lead dev at Fermilab:


Can you extract the same kind of bit for a successful transfer, for comparison?

We would like to understand why xrootd says this:

260309 03:42:43 2682211 acc_Audit: unknown.5441200:35687@dunestor2503.fnal.gov deny https *@[::ffff:131.225.238.18] read /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn
260309 03:42:43 2682211 ofs_open: unknown.5441200:35687@dunestor2503.fnal.gov Unable to open /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn; permission denied

(I won’t be surprised if it’s some IPv4 vs IPv6 issue... and what is the significance of the “unknown” above?)

Actually... this seems to be crucial:

unknown.5441200:35687@dunestor2503.fnal.gov login as nobody

why nobody?

How does this string look for a successful transfer?
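For the comparison Dmitry asks for, the decision and auth-mechanism fields can be pulled straight out of the acc_Audit lines. A small sketch over two lines quoted in this ticket (field layout assumed from those examples: identity@host, decision, mechanism, mapped identity, operation, path):

```python
import re

# Two acc_Audit lines quoted in this ticket: a failing DUNE read and a
# successful CMS operation (path truncated for brevity).
lines = [
    "260309 03:42:43 2682211 acc_Audit: unknown.5441200:35687@dunestor2503.fnal.gov "
    "deny https *@[::ffff:131.225.238.18] read /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn",
    "260313 05:20:51 3239824 acc_Audit: 7dc42260.6305771:35972@cmsftssrv1.fnal.gov "
    "grant gsi 7dc42260.0@[2620:6a:0:8420:f0:0:204:23] mkdir /store/mc/...",
]

pat = re.compile(r"acc_Audit: (?P<client>\S+) (?P<decision>grant|deny) "
                 r"(?P<mech>\S+) (?P<ident>\S+) (?P<op>\S+) (?P<path>\S+)")
for ln in lines:
    m = pat.search(ln)
    print(m["decision"], m["mech"], m["ident"], m["op"])
```

In these two samples the deny pairs with mechanism `https` and an unmapped identity (`*@...`), while the grant pairs with `gsi` and a mapped user, consistent with the "login as nobody" observation: the HTTPS path is failing to establish an identity at all.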
Just another interesting example here:
This upload was successful, but only on the second try and only because it fell back to root://.
This is the debug output of our upload. After first downloading the file with xrdcp, it then tries to upload it again
with a different name. The xrdcp download is successful, and what you see after that is the debug output of a rucio upload.

UK_RAL-Tier1 RAL_ECHO davs root://xrootd.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/bb/7f/awt-download-2023-03-07-01.txt

'xrdcp --force --nopbar --verbose root://xrootd.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/bb/7f/awt-download-2023-03-07-01.txt downloaded.txt' returns 0
{
"created_timestamp": null,
"creator": "dunepro",
"fid": "gqkSljHTSSGyv8dz",
"metadata": {},
"name": "awt-1773328913-2q979DYpFQ",
"namespace": "testpro",
"retired": false,
"retired_by": null,
"retired_timestamp": null,
"size": 0,
"updated_by": null,
"updated_timestamp": null
}
metacat file declare returns 0
GFAL_CONFIG_DIR: GFAL_PLUGIN_DIR:
justin-rucio-upload attempt 1
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock

DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x7fe00430a970> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443

DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL_ECHO HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443

DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL_ECHO HTTP/1.1" 200 1222
DEBUG:root:Input validation done.
INFO:root:Preparing upload for file awt-1773328913-2q979DYpFQ
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL_ECHO/attr/ HTTP/1.1" 200 351
DEBUG:root:wan domain is used for the upload
DEBUG:root:Registering file
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /accounts/dunepro/scopes/ HTTP/1.1" 200 870
DEBUG:root:Trying to create dataset: testpro:awt-uploads-202610
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/testpro/awt-uploads-202610 HTTP/1.1" 409 104
INFO:root:Dataset testpro:awt-uploads-202610 already exists - no rule will be created
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/testpro/awt-1773328913-2q979DYpFQ/meta?plugin=DID_COLUMN HTTP/1.1" 404 129
DEBUG:root:File DID does not exist
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 201 7
INFO:root:Successfully added replica in Rucio catalogue at RAL_ECHO

DEBUG:root:gfal.NoRename: connecting to storage
DEBUG:root:Checking if root://xrootd.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/3b/6a/awt-1773328913-2q979DYpFQ exists
DEBUG:root:gfal.NoRename: checking if file exists root://xrootd.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/3b/6a/awt-1773328913-2q979DYpFQ

DEBUG:root:gfal.NoRename: closing protocol connection
DEBUG:root:[{'hostname': 'xrootd.echo.stfc.ac.uk', 'scheme': 'root', 'port': 1094, 'prefix': '/dune:/protodune/RSE', 'impl': 'rucio.rse.protocols.gfal.NoRename', 'domains': {'lan': {'read': 1, 'write': 2, 'delete': 2}, 'wan': {'read': 1, 'write': 2, 'delete': 2, 'third_party_copy_read': 10, 'third_party_copy_write': 10}}, 'extended_attributes': None}, {'hostname': 'webdav.echo.stfc.ac.uk', 'scheme': 'davs', 'port': 1094, 'prefix': '/dune:/protodune/RSE', 'impl': 'rucio.rse.protocols.gfal.NoRename', 'domains': {'lan': {'read': 2, 'write': 1, 'delete': 1}, 'wan': {'read': 2, 'write': 1, 'delete': 1, 'third_party_copy_read': 1, 'third_party_copy_write': 1}}, 'extended_attributes': None}]
INFO:root:Trying upload with davs to RAL_ECHO
DEBUG:root:Processing upload with the domain: wan
DEBUG:root:gfal.NoRename: connecting to storage
DEBUG:root:The PFN created from the LFN: davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/3b/6a/awt-1773328913-2q979DYpFQ

DEBUG:root:gfal.NoRename: checking if file exists davs://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/3b/6a/awt-1773328913-2q979DYpFQ

WARNING:root:Upload attempt failed
INFO:root:Exception: The requested service is not available at the moment.
Details: An unknown exception occurred.
Details: HTTP 403 : Permission refused
Traceback (most recent call last):
Fil
Thanks Steven.

0) on the "login as nobody": I agree, this looks like a failure to authenticate. I can see that all successful operations have a mapping to a known cert, e.g.

260313 05:20:51 3239824 XrootdBridge: 7dc42260.6305771:35972@cmsftssrv1.fnal.gov login as 7dc42260.0
260313 05:20:51 3239824 acc_Audit: 7dc42260.6305771:35972@cmsftssrv1.fnal.gov grant gsi 7dc42260.0@[2620:6a:0:8420:f0:0:204:23] mkdir /store/mc/RunIII2024Summer24MiniAODv6/ZH-Zto2Q-Hto2Wto2L2Nu_Par-M-125_TuneCP5_13p6TeV_powhegMINLO-jhugen-pythia8/MINIAODSIM/150X_mcRun3_2024_realistic_v2-v2/2560000
260313 05:21:53 3239818 XrootdXeq: 7dc42260.6305771:35972@cmsftssrv1.fnal.gov disc 0:01:02
260313 05:37:52 3246172 XrootdBridge: unknown.6307220:36086@cmsftssrv1.fnal.gov login as nobody
260313 05:38:02 3223232 XrootdXeq: unknown.6307220:36086@cmsftssrv1.fnal.gov disc 0:00:10
260313 05:38:12 3239826 XrootdBridge: 7dc42260.6307251:35931@cmsftssrv1.fnal.gov login as 7dc42260.0
260313 05:38:22 3246152 XrootdXeq: 7dc42260.6307251:35931@cmsftssrv1.fnal.gov disc 0:00:11
260313 05:55:56 3246164 XrootdBridge: 7dc42260.6308706:36019@cmsftssrv1.fnal.gov login as 7dc42260.0
260313 05:56:06 3248118 XrootdXeq: 7dc42260.6308706:36019@cmsftssrv1.fnal.gov disc 0:00:10
260313 06:03:53 3231438 XrootdBridge: 7dc42260.6309234:36115@cmsftssrv1.fnal.gov login as 7dc42260.0
260313 06:04:02 3237775 XrootdXeq: 7dc42260.6309234:36115@cmsftssrv1.fnal.gov disc 0:00:09
260313 07:16:33 3235672 XrootdBridge: 7dc42260.6315376:35901@cmsftssrv1.fnal.gov login as 7dc42260.0
260313 07:16:33 3235672 acc_Audit:

and all unknown logins are denied unless there's a subsequent successful mapping for that connection.

1) ls would fail, but it should just return a file not found response, e.g.:
[fur43467@lcgui05 ~]$ gfal-ls https://webdav.echo.stfc.ac.uk:1094//dteam:/test/test2
gfal-ls error: 2 (No such file or directory) - Result HTTP 404 : File not found after 1 attempts

2) possibly, expired macaroons could explain the auth failures. the stat parts of the transfers seem to succeed from what I've seen

3) the routing seems to switch a lot, even while live - here's the mtr output:

ceph-svc14.gridpp.rl.ac.uk (130.246.179.18) -> 131.225.69.0 2026-03-13T09:32:03+0000
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 10.246.179.1 0.0% 39 0.3 0.4 0.3 0.4 0.0
2. 10.5.0.216 0.0% 39 0.5 0.3 0.2 0.5 0.1
10.5.0.215
10.5.0.218
10.5.0.217
3. 10.0.10.210 0.0% 39 1.0 0.3 0.3 1.0 0.1
4. 10.0.4.25 0.0% 39 0.4 0.4 0.2 0.9 0.1
5. ae3.erdiss-sbr2.ja.net 0.0% 39 3.8 4.5 3.6 20.4 2.7
6. ae31.londpg-sbr2.ja.net 0.0% 39 8.9 10.0 7.8 47.3 7.7
7. (waiting for reply)
8. eex-esnet-lhcone-gw-rt0.lon.nl.geant.net 0.0% 39 8.9 9.0 8.9 9.8 0.2
9. (waiting for reply)
10. bost-cr6--eqxch2-bb-c.igp.es.net 0.0% 39 100.2 98.4 97.8 102.1 0.9
eqxdc4-cr6--eqxch2-bb-e.igp.es.net
newy32aoa-cr6--wash-bb-e.igp.es.net
eqxam3-cr6--bost-bb-a.igp.es.net
11. bost-cr6--eqxch2-bb-c.igp.es.net 0.0% 39 105.2 101.7 94.4 107.9 5.1
eqxch2-cr6--anl541b-bb-c.igp.es.net
wash-cr6--eqxdc4-bb-d.igp.es.net
newy32aoa-cr6--wash-bb-e.igp.es.net
12. wash-cr6--eqxdc4-bb-d.igp.es.net 0.0% 39 119.7 102.2 97.8 119.7 7.0
eqxch2-cr6--anl541b-bb-c.igp.es.net
anl541b-cr6--anl221-bb-a.igp.es.net
eqxdc4-cr6--eqxch2-bb-e.igp.es.net
chic-cr6--fnalgcc-bb-c.igp.es.net
13. anl541b-cr6--anl221-bb-a.igp.es.net 0.0% 39 101.8 99.8 94.5 107.8 3.0
anl221-cr6--fnalfcc-bb-b.igp.es.net
fnalgcc-cr6--fnalfcc-bb-a.igp.es.net
eqxdc4-cr6--eqxch2-bb-e.igp.es.net
eqxch2-cr6--anl541b-bb-c.igp.es.net
14. anl221-cr6--fnalfcc-bb-b.igp.es.net 60.5% 39 101.5 101.6 97.9 107.8 3.4
chic-cr6--fnalgcc-bb-c.igp.es.net
eqxch2-cr6--anl541b-bb-c.igp.es.net
anl541b-cr6--anl221-bb-a.igp.es.net
fnalgcc-cr6--fnalfcc-bb-a.igp.es.net
15. 198.124.81.86 5.1% 39 98.3 98.5 94.4 111.6 3.0
fnalgcc-cr6--fnalfcc-bb-a.igp.es.net
anl221-cr6--fnalfcc-bb-b.igp.es.net
chic-cr6--fnalgcc-bb-c.igp.es.net
16. 198.124.81.86 92.1% 39 101.5 107.0 101.5 111.6 5.1
anl221-cr6--fnalfcc-bb-b.igp.es.net
fnalgcc-cr6--fnalfcc-bb-a.igp.es.net
17. 198.124.81.86 61.5% 39 98.1 98.0 97.9 98.1 0.1
18. 198.124.81.86 97.4% 39 107.9 107.9 107.9 107.9 0.0
19. (waiting for

ceph-svc14.gridpp.rl.ac.uk (130.246.179.18) -> 131.225.238.0 2026-03-13T09:34:02+0000
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. 10.246.179.1
Our dCache admins have requested the following: Do we have any logs on the RAL server side for a transfer that is successful? It would be helpful to compare.
Here's a snippet of a successful write:

260317 04:05:40 3778590 XrootdBridge: 8ed924d1.6859762:35712@lcg2672 login as 8ed924d1.0
260317 04:05:40 3778590 acc_Audit: 8ed924d1.6859762:35712@lcg2672 grant gsi 8ed924d1.0@[2001:630:54:12:82f6:db5f::] create /dune:/protodune/RSE/justin-logs/e8/9e/315305.29-dunegpschedd02.fnal.gov.logs.tgz
260317 04:05:40 ceph_namelib : translated /dune:/protodune/RSE/justin-logs/e8/9e/315305.29-dunegpschedd02.fnal.gov.logs.tgz to dune:/protodune/RSE/justin-logs/e8/9e/315305.29-dunegpschedd02.fnal.gov.logs.tgz
260317 04:05:40 dumpClusterInfo : Counts: 5 5 3 454962 5 48 48 CountsbyCluster: [0:90142172, 1:90142172, 2:90142171, 3:90142171, 4:90142171, ],
260317 04:05:40 Access Mode: /dune:/protodune/RSE/justin-logs/e8/9e/315305.29-dunegpschedd02.fnal.gov.logs.tgz flags&O_ACCMODE 578
260317 04:05:40 File descriptor 2717212 associated to file /dune:/protodune/RSE/justin-logs/e8/9e/315305.29-dunegpschedd02.fnal.gov.logs.tgz opened in write mode
XrdCephOssBufferedFile::Open got fd: 2717212 /dune:/protodune/RSE/justin-logs/e8/9e/315305.29-dunegpschedd02.fnal.gov.logs.tgz
260317 04:05:40 ceph_posix_fstat
XrdCephOssBufferedFile: buffer: got 0 buffers already
XrdCephBufferDataSimple: Global: 1120 12074352640
XrdCephOssBufferedFile::Open: fd: 2717212 Buffer created: 16777216
XrdCephOssBufferedFile: buffer: got 1 buffers already
XrdCephBufferDataSimple: Global: 1121 12091129856
XrdCephOssBufferedFile::Open: fd: 2717166 Buffer created: 16777216
260317 04:05:41 3792837 XrootdXeq: u15551.143:36103@[::ffff:134.219.225.234] disc 0:05:20
260317 04:05:41 3812904 XrdLinkXeq: anon.0:35524@t2xcache01.physics.ox.ac.uk connection upgraded to TLSv1.3
260317 04:05:41 3794805 XrootdBridge: unknown.6859764:36021@fts-atlas-011.cern.ch login as nobody
260317 04:05:41 3794805 scitokens_Access: Trying token-based access control
260317 04:05:41 3794805 scitokens_Access: Token not found in recent cache; parsing.
260317 04:05:41 3794805 scitokens_Access: New valid token mapped_username=xrootd, subject=f92b9b93-1e65-4402-ab21-178155d8b284, issuer=https://atlas-auth.cern.ch/, authorizations=/atlas:datadisk/rucio/data25_13p6TeV/db/e7/AOD.49193645._004179.pool.root.1:create,mkdir,mv,insert,update,chmod,stat,del;/atlas:datadisk/rucio:read,dir,stat
260317 04:05:41 3794805 scitokens_Access: Grant authorization based on scopes for operation=stat, path=/atlas:datadisk/rucio/data25_13p6TeV/db/e7/AOD.49193645._004179.pool.root.1
260317 04:05:41 3794805 scitokens_Access: Request username xrootd
Stat path = /atlas:datadisk/rucio/data25_13p6TeV/db/e7/AOD.49193645._004179.pool.root.1
m_translateFileName - translated '/atlas:datadisk/rucio/data25_13p6TeV/db/e7/AOD.49193645._004179.pool.root.1' to 'atlas:datadisk/rucio/data25_13p6TeV/db/e7/AOD.49193645._004179.pool.root.1'
260317 04:05:41 ceph_posix_stat
260317 04:05:41 ceph_namelib : translated /atlas:datadisk/rucio/data25_13p6TeV/db/e7/AOD.49193645._004179.pool.root.1 to atlas:datadisk/rucio/data25_13p6TeV/db/e7/AOD.49193645._004179.pool.root.1
260317 04:05:41 3794805 scitokens_Access: Trying token-based access control
--
XrdCephOssBufferedFile::Open: fd: 2717213 Buffer created: 16777216
260317 04:05:41 3812904 XrdVomsFun: proxy: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=xcache/CN=614260/CN=Robot: XCache Service Account/CN=264884072
260317 04:05:41 3812904 XrdVomsFun: adding cert: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=xcache/CN=614260/CN=Robot: XCache Service Account
260317 04:05:41 3812904 XrdVomsFun: retrieval successful
260317 04:05:41 3812904 XrdVomsFun: found VO: atlas
260317 04:05:41 3812904 XrdVomsFun: ---> group: '/atlas', role: 'NULL', cap: 'NULL'
260317 04:05:41 3812904 XrdVomsFun: ---> fqan: '/atlas/Role=NULL/Capability=NULL'
260317 04:05:41 3812904 XrdVomsFun: ---> group: '/atlas/team', role: 'NULL', cap: 'NULL'
260317 04:05:41 3812904 XrdVomsFun: ---> fqan: '/atlas/team/Role=NULL/Capability=NULL'
260317 04:05:41 3812904 XrdVomsFun: ---> group: '/atlas/usatlas', role: 'NULL', cap: 'NULL'
260317 04:05:41 3812904 XrdVomsFun: ---> fqan: '/atlas/usatlas/Role=NULL/Capability=NULL'
260317 04:05:41 3812904 XrootdXeq: u3147.942544:35524@t2xcache01.physics.ox.ac.uk pub IP46 TLSv1.3 login as 94657429.0
260317 04:05:41 3812904 acc_Audit: u3147.942544:35524@t2xcache01.physics.ox.ac.uk grant gsi 94657429.0@t2xcache01.physics.ox.ac.uk read /atlas:datadisk/rucio/data15_13TeV/11/e7/DAOD_PHYS.43585940._000036.pool.root.1
260317 04:05:41 ceph_namelib : translated /atlas:datadisk/rucio/data15_13TeV/11/e7/DAOD_PHYS.43585940._000036.pool.root.1 to atlas:datadisk/rucio/data15_13TeV/11/e7/DAOD_PHYS.43585940._000036.pool.root.1
260317 04:05:41 dumpClusterInfo : Counts: 5 5 4 454964 5 48 48 CountsbyCluster: [0:90142199, 1:90142198, 2:90142198, 3:90142198, 4:90142198, ],
260317 04:05:41 Access Mode: /atlas:datadisk/rucio/data15_13TeV/11/e7/DAOD_PHYS.43585940._000036.pool.root.1 flags&O_ACCMODE 0
260317 04:05:41 File descriptor 2717214 associa
Would RAL be available to have some of your experts on a Zoom call one week from today, Thursday 26 March at 14:00 UTC = 09:00 Fermilab time?

We are not converging on this ticket.

In the meantime, could you please provide the successful logs
for this transfer:

https://fts-public-001.cern.ch:8449/var/log/fts3/transfers/2026-03-19/dtn14.nersc.gov__fndcadoor.fnal.gov/2026-03-19-1559__dtn14.nersc.gov__fndcadoor.fnal.gov__327267820__94c31214-23ac-11f1-8e97-fa163ea3e633

which was successful from RAL to Fermilab at 16:59 UTC today 19 Mar.

Steven Timm
Hi Steven, yes, that time is fine.
The link you posted doesn't seem to pass through RAL; it's a NERSC-to-FNAL transfer when I click on it.
I found a different transfer with RAL as source:
https://fts-public-005.cern.ch:8449/var/log/fts3/transfers/2026-03-19/webdav.echo.stfc.ac.uk__fndcadoor.fnal.gov/2026-03-19-1113__webdav.echo.stfc.ac.uk__fndcadoor.fnal.gov__327214225__a079e286-2384-11f1-ae63-769325077eee

logs are as follows for that transfer:

260319 11:13:20 3824109 XrootdBridge: unknown.7384846:35971@dunestor2517.fnal.gov login as nobody
260319 11:13:20 3824109 macarons_AuthzCheck: Verifying macaroon with name:78de50c3.0
260319 11:13:20 3824109 macarons_AuthzCheck: running verify activity activity:READ_METADATA,DOWNLOAD,UPLOAD,DELETE,MANAGE,UPDATE_METADATA,LIST
260319 11:13:20 3824109 macarons_AuthzCheck: macaroon has desired activity DOWNLOAD
260319 11:13:20 3824109 macarons_AuthzCheck: running verify activity activity:DOWNLOAD,LIST
260319 11:13:20 3824109 macarons_AuthzCheck: macaroon has desired activity DOWNLOAD
260319 11:13:20 3824109 macarons_AuthzCheck: running verify path /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW
260319 11:13:20 3824109 macarons_AuthzCheck: path request verified for /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW
260319 11:13:20 3824109 macarons_AuthzCheck: Checking macaroon for expiration; caveat: before:2026-03-19T11:33:17Z
260319 11:13:20 3824109 macarons_AuthzCheck: Macaroon has not expired.
260319 11:13:20 3824109 macarons_Access: Setting the request name to 78de50c3.0
260319 11:13:20 ceph_namelib : translated /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW to dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW
260319 11:13:20 dumpClusterInfo : Counts: 5 5 2 504985 5 45 45 CountsbyCluster: [0:116224037, 1:116224036, 2:116224036, 3:116224036, 4:116224036, ],
260319 11:13:20 Access Mode: /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW flags&O_ACCMODE 0
260319 11:13:20 File descriptor 2917469 associated to file /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW opened in read mode
XrdCephOssBufferedFile::Open got fd: 2917469 /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW
260319 11:13:20 ceph_posix_fstat
260319 11:13:20 3824109 macarons_AuthzCheck: Verifying macaroon with name:78de50c3.0
260319 11:13:20 3824109 macarons_AuthzCheck: running verify activity activity:READ_METADATA,DOWNLOAD,UPLOAD,DELETE,MANAGE,UPDATE_METADATA,LIST
260319 11:13:20 3824109 macarons_AuthzCheck: running verify activity activity:DOWNLOAD,LIST
260319 11:13:20 3824109 macarons_AuthzCheck: running verify path /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW
260319 11:13:20 3824109 macarons_AuthzCheck: path request verified for /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW
260319 11:13:20 3824109 macarons_AuthzCheck: Checking macaroon for expiration; caveat: before:2026-03-19T11:33:17Z
260319 11:13:20 3824109 macarons_AuthzCheck: Macaroon has not expired.
260319 11:13:20 3824109 macarons_Access: Setting the request name to 78de50c3.0
260319 11:13:20 ceph_getxattr: path /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW name=XrdCks.adler32
260319 11:13:20 ceph_namelib : translated /dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW to dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW
XrdCephOssBufferedFile: buffer: got 0 buffers already
XrdCephOssBufferedFile::Close: Flushed data on close fd: 2917463 rc:95955
XrdCephOssBufferedFile::Summary: {"fd":2917463, "Elapsed_time_ms":1039, "path":"/atlas:scratchdisk/rucio/user/xinzhe/c5/73/user.xinzhe.49266839._047912.output-tree.root", read_B:0, readV_B:0, readAIO_B:0, writeB:95955, writeAIO_B:0, startTime:"2026-03-19 11:13:19", endTime:"2026-03-19 11:13:20", nBuffersRead:0}
260319 11:13:20 ceph_fremovexattr: fd 2917463 name=XrdCks.adler32
XrdCephBufferDataSimple: Global: 988 12046041088
XrdCephOssBufferedFile::Open: fd: 2917469 Buffer created: 16777216
XrdCephOssBufferedFile::Summary: {"fd":2917469, "Elapsed_time_ms":22, "path":"/dune:/protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW", read_B:26, readV_B:0, readAIO_B:0, writeB:0, writeAIO_B:0, startTime:"2026-03-19 11:13:20", endTime:"2026-03-19 11:13:20", nBuffersRead:1}
260319 11:13:20 ceph_close: closed fd 2917469 for file /protodune/RSE/testpro/7a/40/awt-1773918642-tXQGU0afLW, read ops count 1, write ops count 0, async write ops 0/0, async pending write bytes 0, async read ops 0/0, bytes written/max offset 0/0, longest async write 0.000000, longest callback invocation 0.000000, last async op age 0.000000
XrdCephBufferAlgSimple::Destructor, fd=2917469, retrieved_bytes=26, bypassed_bytes=0, delivered_bytes=26, cache_hit_frac=1
CephIOAdapterRaw::Summary fd:2917469 nwrite:0 byteswritten:0 write_s:0 writemax_s0 write_MBs:0 nread:1 bytesread:26 read_s:17.8154 readmax_s:17.8154 read_MBs:0 striperlessRead: 1
XrdCephBufferDataSimple~
Thanks. The following question is from our expert, Dmitry Litvintsev:

260309 03:42:43 2682211 acc_Audit: unknown.5441200:35687@dunestor2503.fnal.gov deny https *@[::ffff:131.225.238.18] read /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn

The question is very straightforward: if the macaroon is OK, why do we get "deny https *@[::ffff:131.225.238.18] read /dune:/protodune/RSE/testpro/82/19/awt-1773027596-k48ElanXHn" on their end?

It looks like more debugging has been added to the output between the failure you showed us from the logs of 9 March and the successful output that you sent us today.

Can you check your logs for this failure, which happened less than an hour ago:

INFO Thu, 19 Mar 2026 19:46:25 +0100; Davix: PerformanceMarker:
failure: rejected GET: 403 Forbidden; redirects [https://ceph-svc23.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH?<redacted>]

INFO Thu, 19 Mar 2026 19:46:25 +0100; Gfal2: Copy failed with mode 3rd pull: Transfer failure: rejected GET: 403 Forbidden; redirects [https://ceph-svc23.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH?<redacted>
260319 18:46:25 4002975 XrootdBridge: unknown.7447677:36155@dunestor2501.fnal.gov login as nobody
260319 18:46:25 4002975 macarons_AuthzCheck: Verifying macaroon with name:78de50c3.0
260319 18:46:25 4002975 macarons_AuthzCheck: running verify activity activity:READ_METADATA
260319 18:46:25 4002975 macarons_AuthzCheck: running verify activity activity:DOWNLOAD,LIST
260319 18:46:25 4002975 macarons_AuthzCheck: macaroon has desired activity DOWNLOAD
260319 18:46:25 4002975 macarons_AuthzCheck: running verify path /dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH
260319 18:46:25 4002975 macarons_AuthzCheck: path request verified for /dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH
260319 18:46:25 4002975 macarons_AuthzCheck: Checking macaroon for expiration; caveat: before:2026-03-19T19:06:23Z
260319 18:46:25 4002975 macarons_AuthzCheck: Macaroon has not expired.
260319 18:46:25 4002975 macarons_Access: Macaroon verification failed
260319 18:46:25 4002975 scitokens_Access: Trying token-based access control
260319 18:46:25 ceph_posix_unlink : /atlas:scratchdisk/rucio/user/leet/0f/0c/user.leet.700779.Sh.DAOD_PHYS.e8514_s4162_r15540_p6697.TtNT_2L_2Lsys_260ceed.log.48705478.000197.log.tgz unlink successful: 78 ms
260319 18:46:25 4019280 XrootdXeq: unknown.7447676:35987@atlas-rucio-prod-03-2gf6usp72ugk-node-10.ipv6.cern.ch disc 0:00:00 (link SSL read error)
260319 18:46:25 4002975 scitokens_Reconfig: Configuring issuer https://atlas-auth.cern.ch/
260319 18:46:25 4002975 scitokens_Reconfig: Configuring issuer https://atlas-auth.web.cern.ch/
260319 18:46:25 4002975 scitokens_Reconfig: Configuring issuer https://cms-auth.cern.ch/
260319 18:46:25 4002975 scitokens_Reconfig: Configuring issuer https://cms-auth.web.cern.ch/
260319 18:46:25 4002975 scitokens_Reconfig: Configuring issuer https://cilogon.org/dune
260319 18:46:25 4002975 scitokens_Reconfig: Configuring issuer https://lhcb-auth.cern.ch/
260319 18:46:25 4002975 scitokens_Reconfig: Configuring issuer https://lhcb-auth.web.cern.ch/
260319 18:46:25 4002975 scitokens_Access: Token not found in recent cache; parsing.
260319 18:46:25 4002975 scitokens_Parse: Token does not appear to be a valid JWT; skipping.
260319 18:46:25 4002975 acc_Audit: unknown.7447677:36155@dunestor2501.fnal.gov deny https *@[::ffff:131.225.238.16] read /dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH
260319 18:46:25 4002975 ofs_open: unknown.7447677:36155@dunestor2501.fnal.gov Unable to open /dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH; permission denied
260319 18:46:25 4002975 XrootdXeq: unknown.7447677:36155@dunestor2501.fnal.gov disc 0:00:00 (send failure)

Looks like it failed to verify the macaroon.
OK, so the problem is that all of these transfers thus far are using X.509 proxies, not tokens. So why would it be looking at scitokens_Reconfig at all for these?
Thanks Jyothish

Great! We are making progress.

Do you think it is reasonable to say the following:

- all jobs use X.509 proxies
- gfal-copy surreptitiously makes a macaroon request (I know that it does)
- the macaroon is passed in the Authorization header to RAL when making the GET request from the FNAL data server
- for some reason RAL expects a SciToken and complains that the macaroon is not a JWT

(I wonder what makes a successful transfer different. Are there multiple endpoints at RAL being proxied to, and could some of them be configured differently? Then, depending on which endpoint we connect to, we would get a different result.)

Dmitry
Here is the record from our end regarding the quoted error from RAL:

```
level=WARN ts=2026-03-19T13:46:25.870-0500 event=org.dcache.webdav.request request.method=COPY request.url=https://fndcadoor.fnal.gov:2880/dune/persistent/staging/testpro/e4/55/awt-1773945408-D38pKIDzPH response.code=202 socket.remote=[2001:1458:301:cd::100:11c]:42572 user-agent="libdavix/0.8.10 libcurl/7.76.1" user.dn="CN=1773792726,CN=2241908512,CN=dune-rucio.fnal.gov,O=Fermi Forward Discovery Group\\, LLC,ST=Illinois,C=US,DC=incommon,DC=org" user.mapped=50762:9010,9281,9010,9767,8008,9960,8623[d1RmzfTR:tPVb3zAk] tpc.credential=none tpc.error="rejected GET: 403 Forbidden; redirects [https://ceph-svc23.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH?authz=Bearer%20MDA...]" tpc.require-checksum=false tpc.source=https://webdav.echo.stfc.ac.uk:1094/dune:/protodune/RSE/testpro/e4/55/awt-1773945408-D38pKIDzPH duration=880
```

That `Bearer%20MDA...` is a macaroon, not a SciToken.
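This also matches the earlier log line `scitokens_Parse: Token does not appear to be a valid JWT; skipping.`: a SciToken is a JWT, i.e. three dot-separated base64url segments, while a serialized macaroon is a single base64 blob with no dots. A minimal sketch of that shape check (the function name is ours, not part of xrootd):

```python
def looks_like_jwt(token: str) -> bool:
    # A JWT is header.payload.signature: exactly three non-empty,
    # dot-separated base64url segments. A serialized macaroon is a
    # single base64 blob, so it has no dots and fails this check.
    parts = token.split(".")
    return len(parts) == 3 and all(parts)

print(looks_like_jwt("MDAxNmxvY2F0aW9uIFJBTC1MQ0cy"))  # macaroon-style blob -> False
```

This is only the cheap structural test a plugin can do before signature verification; it explains why the scitokens code gives up immediately when handed a macaroon.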
I did not look too carefully:

260319 18:46:25 4002975 macarons_AuthzCheck: Checking macaroon for expiration; caveat: before:2026-03-19T19:06:23Z
260319 18:46:25 4002975 macarons_AuthzCheck: Macaroon has not expired.
260319 18:46:25 4002975 macarons_Access: Macaroon verification failed
260319 18:46:25 4002975 scitokens_Access: Trying token-based access control

It looks like there is a "fallback" to SciToken if macaroon verification failed. Right?

So why did this happen:

260319 18:46:25 4002975 macarons_Access: Macaroon verification failed
Hi Dmitry, yes, SciTokens is used as a fallback option when X.509 fails. The reason for the failure is not specified in the logs. We could try to get more info by increasing the debug level further, but it'd have to be on a separate endpoint.
Yes, it would be great to get to the bottom of it. I am on holiday next week, back the week after.
If there is an endpoint with debug logging, we can give it a try. Thank you.
I've set ceph-svc16.gridpp.rl.ac.uk to run in debug mode. Could you try using that to replicate the issue?
I had mentioned above in this ticket a proposed meeting tomorrow, 26 March at 14:00 UTC, but it appears that I and one of my key colleagues (Dmitry Litvintsev) will not be available at that time. I would rather push that back one week to 2 April.
Steve Timm

By the way, given that the transfers in question are being executed via FTS3, it may be tricky to single out the one host ceph-svc16, but we will do our best to try.
fts-rest-transfer-submit -s https://fts3-public.cern.ch:8446 davs://ceph-svc16.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/bb/7f/awt-download-2023-03-07-01.txt davs://fndcadoor.fnal.gov:2880/dune/scratch/users/dunepro/awt-download-2023-03-07-01.txt
Job successfully submitted.
Job id: 018ae486-2933-11f1-98d9-fa163ea3e633

Log of transfer is here:

https://fts-public-001.cern.ch:8449/var/log/fts3/transfers/2026-03-26/ceph-svc16.gridpp.rl.ac.uk__fndcadoor.fnal.gov/2026-03-26-1644__ceph-svc16.gridpp.rl.ac.uk__fndcadoor.fnal.gov__328110740__018ae486-2933-11f1-98d9-fa163ea3e633

(it was successful)

today, 17:44 UTC.

Now we will do a bunch of them in a row until we see one that fails.

OK, so we did 23 in rapid succession, all successful, every last one.
Nevertheless, simultaneously with this, other transfers also generated by FTS3 were failing intermittently, as they always have been.
I propose we meet at 08:00 Fermilab time = 14:00 British time on Thursday 2 April. I will have the Fermilab dCache expert and the Fermilab network expert on the call; please try to have the same available from your side.

Steve Timm
https://ceph-svc16.gridpp.rl.ac.uk:1094/dune:/protodune/RSE/testpro/bb/7f/awt-download-2023-03-07-01.txt?authz=Bearer%20MDAxNmxvY2F0aW9uIFJBTC1MQ0cyCjAwMzRpZGVudGlmaWVyIDA4M2YwZmM1LTE4ZjgtNDdhMy05ZjRjLTA0MGQ3YmI1NjMxOQowMDE4Y2lkIG5hbWU6NWRkYjJmMTYuMAowMDFmY2lkIGFjdGl2aXR5OlJFQURfTUVUQURBVEEKMDAxZmNpZCBhY3Rpdml0eTpET1dOTE9BRCxMSVNUCjAwNGZjaWQgcGF0aDovZHVuZTovcHJvdG9kdW5lL1JTRS90ZXN0cHJvL2JiLzdmL2F3dC1kb3dubG9hZC0yMDIzLTAzLTA3LTAxLnR4dAowMDI0Y2lkIGJlZm9yZToyMDI2LTA0LTAyVDE2OjEyOjI1WgowMDJmc2lnbmF0dXJlIFbekIlp4OaBgw1lGfRd_L6DAhiqLgJmZWRdaTdCki3uCg
echo -n "MDAxNmxvY2F0aW9uIFJBTC1MQ0cyCjAwMzRpZGVudGlmaWVyIDA4M2YwZmM1LTE4ZjgtNDdhMy05ZjRjLTA0MGQ3YmI1NjMxOQowMDE4Y2lkIG5hbWU6NWRkYjJmMTYuMAowMDFmY2lkIGFjdGl2aXR5OlJFQURfTUVUQURBVEEKMDAxZmNpZCBhY3Rpdml0eTpET1dOTE9BRCxMSVNUCjAwNGZjaWQgcGF0aDovZHVuZTovcHJvdG9kdW5lL1JTRS90ZXN0cHJvL2JiLzdmL2F3dC1kb3dubG9hZC0yMDIzLTAzLTA3LTAxLnR4dAowMDI0Y2lkIGJlZm9yZToyMDI2LTA0LTAyVDE2OjEyOjI1WgowMDJmc2lnbmF0dXJlIFbekIlp4OaBgw1lGfRd_L6DAhiqLgJmZWRdaTdCki3uCg" | base64 --decode
0016location RAL-LCG2
0034identifier 083f0fc5-18f8-47a3-9f4c-040d7bb56319
0018cid name:5ddb2f16.0
001fcid activity:READ_METADATA
001fcid activity:DOWNLOAD,LIST
004fcid path:/dune:/protodune/RSE/testpro/bb/7f/awt-download-2023-03-07-01.txt
0024cid before:2026-04-02T16:12:25Z
e?]base64: invalid input
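The `invalid input` on the final packet is expected from plain `base64 --decode`: the signature bytes are emitted in the URL-safe base64 alphabet (note the `_` characters near the end of the token), which the standard decoder rejects. A hedged Python sketch that tolerates both alphabets and then splits the blob into packets, assuming the v1 libmacaroons serialization seen above (each packet begins with a 4-hex-digit length that covers the length field, the body, and a trailing newline; both helper names are ours):

```python
import base64

def decode_macaroon_b64(tok: str) -> bytes:
    """Decode a serialized macaroon, tolerating the URL-safe base64
    alphabet and stripped '=' padding (helper name is ours)."""
    tok = tok.replace("-", "+").replace("_", "/")  # normalize to the standard alphabet
    tok += "=" * (-len(tok) % 4)                   # restore any stripped padding
    return base64.b64decode(tok)

def parse_packets(decoded: bytes) -> list:
    """Split a v1 libmacaroons serialization into its packets; each packet
    starts with a 4-hex-digit length covering prefix, body and newline."""
    packets, i = [], 0
    while i < len(decoded):
        plen = int(decoded[i:i + 4], 16)
        packets.append(decoded[i + 4:i + plen].rstrip(b"\n").decode("latin-1"))
        i += plen
    return packets
```

As a sanity check against the output above: `0016` is 22 = 4 length digits + `location RAL-LCG2` (17 bytes) + the newline, which is why the length prefixes line up with the decoded lines.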
Servers for future testing/debugging:
echo-internal-manager01.gridpp.rl.ac.uk (test redirector)
ceph-svc16.gridpp.rl.ac.uk (test gateway)
Meeting summary:
Tested against the test cluster at RAL with increased debug logs. The error could be replicated consistently and occurred after the macaroon expiry verification, during the signature verification. Decoding the macaroon used during the transfer threw an invalid-input error on the packet containing the signature information.
The error was not isolated to individual nodes at RAL or FNAL.
Next steps:
- Dmitry to test generating macaroons against the above endpoint to check the error frequency,
- an xrootd ticket to be opened,
- Jyothish to generate a debug version of xrootd to print out the signature verification info.
My colleague James Walder also pointed out that for the activity list:
001fcid activity:READ_METADATA
001fcid activity:DOWNLOAD,LIST
usually the first line is what's available, while the second one is the request. It might be worth checking with a working macaroon.
WLCG #1002248 (id:1002248) Noticed sharp spike in job failures and job pressure drop
State: in progress  |  Priority: less urgent  |  Opened: 2026-04-01 17:42 (3d ago)  |  Updated: 2026-04-02 08:51
Conversation (3 messages)
Hi
Noticed a sharp spike in failures and a drop in job pressure at RAL-LCG2; below are some useful links.
The majority of the error messages are:
- lost heartbeat
- Failed to execute
payload:PyJobTransforms.transform.execute CRITICAL Transform executor
raised TransformValidationException: athena got a SIGKILL signal (exit
code 137); Long ERROR message at line 203 (see jobReport for further
details)

- payload execution failed with 139
- Invalid configuration of a reduction job

And it's not specific to any particular user or workflow, so this looks like a site-end issue.

[1]
https://monit-grafana.cern.ch/d/D26vElGGz/site-oriented-dashboard?orgId=17&var-bin=1h&var-groupby_jobs=dst_experiment_site&var-groupby_ddm=experiment_site&var-cloud=UK&var-country=All&var-federation=All&var-site=RAL-LCG2&var-computingsite=All&var-es_division_factor=1&var-measurement=ddm_transfer_1h&var-jobtype=All&from=1774978581769&to=1775064981769
[2]
https://monit-grafana.cern.ch/d/MVrMkcP7k/site-errors-view?orgId=17&var-bin=1h&var-country=All&var-federation=All&var-tier=All&var-cloud=All&var-site=RAL-LCG2&var-computingsite=All&from=1774978595352&to=1775064995352

[3]
https://bigpanda.cern.ch/errors/?computingsite=RAL&date_from=2026-04-01T11:28&date_to=2026-04-01T17:28
#Update from Tom

A high number of workers are reporting SIGKILLs due to memory overcommits of athena.py. Are any of the ATLAS jobs massively going over their memory?
In the past this has been user jobs; it looks like all of these jobs hit the 3x cap of 12 GB when they’ve requested 4 GB. Either way, something in the job is requesting way too much RAM.
It's clear that the job-pressure reduction is due to the scheduled CephFS migration of Harvester, and not due to this spike in failures.
The lost-heartbeat errors also seem to be due to this Harvester outage.
Since last night the failures have not been seen, so this does not seem to be a site-end issue.
Keeping this ticket open for today until the job pressure returns to normal.
WLCG #1002186 (id:1002186) LHCb Disk and Tape resources at RAL
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-24 12:52 (11d ago)  |  Updated: 2026-04-01 13:09
Conversation (5 messages)
Dear colleagues,

Now that the 2026 data taking has started, we are reviewing the state of the storage
pledges of the LHCb T1 and T2 sites.

Earlier this year, we were informed that (with a few exceptions) the sites would have
the pledged disk and tape resources available to the experiment, and we hope you
can confirm that.

1 - Status of disk pledges

RAL pledges 25003 TB of disk space for 2026. However, the Storage Resource Record (SRR)
advertises only 22092 TB [1]. Do you know what explains that discrepancy?

2 - Status of tape pledges

The site pledges 56281 TB of tape space for 2026. Are these resources available
to LHCb already?

Thanks in advance for your efforts to make these crucial resources available
to us!

best regards, Jan van Eldik / LHCb Compute project lead

[1]
+---------+--------------+-----------+-----------+----------+
| Site | Share | Size | Used | Fraction |
+---------+--------------+-----------+-----------+----------+
| RAL | LHCb-Disk | 22092.0 | 20650.2 | 93.5% |
+---------+--------------+-----------+-----------+----------+
We are in the process of confirming capabilities and the timescale for updating the pledges.
Thanks Brian
Hi Brian, do you have an update for us?
Many thanks, Jan
An initial increase in pledges, based on the currently installed capacities of current hardware, has been calculated and will be deployed. A further increase may come shortly.
WLCG #1002197 (id:1002197) Corrupted files at RAL disk storage
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-26 11:01 (9d ago)  |  Updated: 2026-03-31 08:29
Conversation (3 messages)
Dear site admins,
During the recent file integrity check we found 310 corrupted files on ECHO. The list of files is attached.
Could you help us to investigate the reasons behind the corruption?
Best Regards,
Alex
Will pass onto our storage experts to investigate
Hi,
The files from the list can be grouped as follows:
1. Files uploaded during storage instability periods
a. The most recent one is `buffer/lhcb/MC/2017/SIM/00340649/0009/00340649_00091464_1.sim` -- it was uploaded on the 20th of January this year. First there was an attempt to upload it by a job, which failed, then the file was uploaded to PIC and, after 8(!) unsuccessful attempts it was finally successfully copied to RAL. Job log, as well as FTS logs, are attached. Some FTS transfers reported stuck deletions, and they are probably the reason behind the corruption. On the 20th of January there was indeed an incident with ECHO [1].
b. Other 24 files from this category seem to correspond to 2023 incidents:
bash-5.1$ cat other_incidents.txt
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2011/ALLSTREAMS.DST/00188279/0000/00188279_00000118_5.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2012/B2DHH.STRIP.MDST/00179789/0000/00179789_00000139_1.b2dhh.strip.mdst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2016/ALLSTREAMS.DST/00185904/0000/00185904_00000197_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2016/ALLSTREAMS.DST/00185904/0000/00185904_00000625_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2016/ALLSTREAMS.DST/00185932/0000/00185932_00000034_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00185750/0000/00185750_00000394_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00185752/0000/00185752_00000294_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00185752/0000/00185752_00000421_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00185902/0000/00185902_00000608_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00185902/0000/00185902_00000622_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00185902/0000/00185902_00000630_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00185906/0000/00185906_00000596_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2017/ALLSTREAMS.DST/00186056/0000/00186056_00000103_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185898/0000/00185898_00000543_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185900/0000/00185900_00000217_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185900/0000/00185900_00000223_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185900/0000/00185900_00000568_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185900/0000/00185900_00000702_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185916/0000/00185916_00000040_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185918/0000/00185918_00000017_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185918/0000/00185918_00000045_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185920/0000/00185920_00000006_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185924/0000/00185924_00000053_7.AllStreams.dst
root://xrootd.echo.stfc.ac.uk//lhcb:prod/lhcb/MC/2018/ALLSTREAMS.DST/00185998/0000/00185998_00000625_7.AllStreams.dst
bash-5.1$
Most of them were created between the 13th and 16th of May 2023, when there were again some ECHO problems [2].
bash-5.1$ dirac-dms-lfn-metadata other_incidents.txt | grep 'CreationDate'
CreationDate : 2023-05-30 01:58:39
CreationDate : 2023-05-13 08:50:18
CreationDate : 2023-05-13 02:55:46
CreationDate : 2023-05-13 07:42:49
CreationDate : 2023-05-13 10:00:25
CreationDate : 2023-05-13 06:50:29
CreationDate : 2023-05-13 11:28:39
CreationDate : 2023-05-13 07:40:07
CreationDate : 2023-05-13 08:01:21
CreationDate : 2023-05-13 08:37:11
CreationDate : 2023-05-13 06:47:48
CreationDate : 2023-05-13 08:35:24
CreationDate : 2023-05-16 07:11:06
CreationDate : 2023-05-13 06:54:46
CreationDate : 2023-05-13 02:24:12
CreationDate : 2023-05-13 02:26:08
CreationDate : 2023-05-13 06:15:50
CreationDate : 2023-05-13 07:12:13
CreationDate : 2023-05-13 09:52:33
CreationDate : 2023-05-13 09:51:24
CreationDate : 2023-05-13 11:28:52
CreationDate : 2023-05-13 07:32:42
CreationDate : 2023-05-13 11:32:47
CreationDate : 2023-05-15 14:25:00
bash-5.1$
The file `/lhcb/MC/2011/ALLSTREAMS.DST/00188279/0000/00188279_00000118_5.AllSt
Tier-2
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%63%96%97%100%100%100%97%100%93%87%97%
HammerCloud100%100%100%100%99%100%100%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (5)

WLCG tickets (5)
WLCG #681636 (id:1737) Request to deploy IPv6 on CEs and WNs at WLCG sites (Hephy-Vienna)
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 10:01 (430d ago)  |  Updated: 2026-03-31 08:00
Conversation (9 messages)
GGUS ID: 164348
Last modifier: Andrea Sciaba
Date: 2023-11-29 06:44:09

Public Diary:
Ciao Alessandra,
I know it's not yet the 25th... but is there any news?
Thanks,
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164348
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:36:46
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (Hephy-Vienna)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
- your estimate of the timescale for the deployment;
- a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what the obstacles are and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
GGUS ID: 164348
Last modifier: Erich Birngruber
Date: 2023-12-12 13:57:36

Public Diary:
Dear Andrea,
Since enabling IPv6 for the Storage Element we have not extended our IPv6 roll-out plans.
Implementation for storage was limited in scope as we enabled IPv6 connectivity for the campus gateway and use NAT64 towards our IPv4 connected storage servers, and added limited IPv6 deployment in the DMZ.

Rolling out IPv6 for the compute nodes will mean to enable IPv6 throughout the campus network, involving 2 major steps:
IPv6 + OSPF3 on the core router
IPv6 + OSPF3 on the datacenter network

We can probably start the planning for this next year, but we cannot ensure that we will have end-to-end IPv6 connectivity in our datacenter by the end of 2024. We'll update here as soon as there is more information.

Best,
Erich

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164348
Last modifier: Andrea Sciaba
Date: 2024-12-03 13:51:18

Public Diary:
Hi Erich,
just to check if there is any new information.
Cheers,
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164348
Last modifier: Andrea Sciaba
Date: 2023-12-12 17:10:44

Status: on hold
Responsible Unit: NGI_IT
Public Diary:
Hi Erich,
understood, thanks for the info. I'll put the ticket "on hold", feel free to move it to "in progress" when you feel it's appropriate to do so.
Cheers,
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164348
Last modifier: Erich Birngruber
Date: 2025-01-15 10:54:32

Public Diary:
Hi Andrea,
We plan major renewals on our compute infrastructure, including IPv6 rollout for this year.
I'll update here once we have a refined timeline.

Best,
Erich
Internal Diary:
Escalated this ticket to NGI_IT
Hi Erich,
do you have more details on the timeline?
Cheers,
Andrea
Hi Andrea,
We expect the hardware renewal to happen towards the end of the year. When deploying the new hardware, we will do so with IPv6, so I expect to have v6 connectivity by early 2026.

Best,
Erich
Hi Erich,
as we are past early 2026, I was wondering if there is any news.
Cheers,
Andrea
WLCG #1001859 (id:1001859) RHEL10 support for worker nodes
State: assigned  |  Priority: less urgent  |  Opened: 2026-02-19 13:01 (44d ago)  |  Updated: 2026-03-27 12:09
Conversation (19 messages)
We (Hephy-Vienna) are in the process of deploying a new HPC system. At some point the workloads should be migrated to this new system. The system should go into production around May/June. We want to deploy the HPC system with RHEL10 as the OS.
Currently a blocker is that the WLCG RHEL10 repo (https://linuxsoft.cern.ch/wlcg/el10/x86_64/) is missing some important packages (wn, ui, etc.).
What is the ETA/roadmap for RHEL10 support ?

Best
Let me add Maarten, he might know the timescale for the WLCG EL10 repo.

For CMS, only EL packages and Singularity should be required, i.e. no dependency on the WLCG repo exists.

- Stephan
Hi all, I will start following up with the relevant product teams.
Some packages may only appear after the summer, but we will see...
Hi all, things actually look fairly promising; I will keep you updated.

That said, is CMS still using some MW provided directly by the WN?
Or does only HEP_OSlibs need to be installed?
That would be sufficient for ALICE...
CMS only needs a handful of RPMs (all in the EL repos) and Singularity in addition to the base OS.
- Stephan
Currently we have the following packages installed from the WLCG repos:
HEP_OSlibs.x86_64 7.3.6-1.el7.cern @wlcg
condor-classads.x86_64 9.0.20-1.el7 @wlcg
voms-api-java.noarch 3.3.3-1.el7 @wlcg
voms-clients-java.noarch 3.3.4-1.el7 @wlcg
wlcg-iam-lsc-alice.noarch 3.0.0-1.el7 @wlcg
wlcg-iam-lsc-cms.noarch 3.0.0-1.el7 @wlcg
wlcg-iam-vomses-alice.noarch 3.0.0-1.el7 @wlcg
wlcg-iam-vomses-atlas.noarch 3.0.0-1.el7 @wlcg
wlcg-iam-vomses-cms.noarch 3.0.0-1.el7 @wlcg
wlcg-iam-vomses-dteam.noarch 1.0.0-1.el7 @wlcg
wlcg-iam-vomses-lhcb.noarch 3.0.0-1.el7 @wlcg
wlcg-iam-vomses-ops.noarch 2.0.0-1.el7 @wlcg
We also install the UI/WN meta package from the UMD4 repo. There is a UMD5 repo, but that is only for AlmaLinux 9 (https://repository.egi.eu/sw/production/umd/5/).
We would probably also need UMD5 repos for RHEL10.
Is there maybe a guide/documentation of what packages are required for a UI/WN?
Hi again, UMD-5 or -6 for EL10 will probably be there sometime this spring.

What is needed on the WN depends on the VOs you need to support.

The WN meta rpm for EL9 pulls in these packages:

c-ares
cvmfs
dcap
dcap-devel
dcap-libs
dcap-tunnel-gsi
dcap-tunnel-krb
dcap-tunnel-ssl
dcap-tunnel-telnet
fetch-crl
gfal2-all
gfal2-devel
gfal2-doc
gfal2-python3
globus-gass-copy-progs
globus-proxy-utils
gridsite-libs
openldap-clients
python3-ldap
uberftp
voms-clients-java
voms-devel
xrootd-client

For CMS (and ALICE) the WN needs CVMFS + HEP_OSlibs and maybe nothing else!

I will add CVMFS to the WLCG repo for EL10.
Our T2 site supports CMS, ALICE and Belle.
I added Hiroaki. Maybe he can provide some input regarding the required software on the worker nodes.
For CMS, you could download the wn_basic.sh SAM test and execute it manually to check all required commands/utilities are there. Instructions are on the CMS SAM twiki page, which is linked from the F&S page.
Thanks,
cheers, Stephan

https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSAMTests
Ah good point, I think we do have the wn_basic.sh SAM tests in our reFrame testing framework so we can check this.
As for Belle II, the release is not certified for RHEL10, so we will run the payload in a container.
For the WN, CVMFS + HEP_OSlibs should be enough; however, we have to test that the pilot runs on RHEL10.
Hi all,
a first version of the WN meta package for EL10 is available now from the WLCG repo:

https://linuxsoft.cern.ch/wlcg/el10/x86_64/
So we did some tests.
In general it looks good.
However there is an issue with htcondor-ce.
When we try to install htcondor-ce-apel-25.0.1-1.el10.noarch package we get some missing dependencies:
- nothing provides apel-ssm needed by htcondor-ce-apel-25.0.1-1.el10.noarch from htcondor
- nothing provides apel-client >= 1.8.0 needed by htcondor-ce-apel-25.0.1-1.el10.noarch from htcondor
- nothing provides apel-parsers >= 1.8.0 needed by htcondor-ce-apel-25.0.1-1.el10.noarch from htcondor
They are missing in the https://linuxsoft.cern.ch/wlcg/el10/x86_64/ repo
Hi all, the APEL rpms remain unavailable for the time being. Current ETA: sometime in summer.

I would hope the CE does not have to be bound to the same constraints as the WNs?
Yes that's true. The CE does not have the same constraints as the WNs.
One thing I noticed is that the desy-voms-all-1.0.0-1.noarch.rpm is missing from the WLCG RHEL10 repo
Hi again, DESY no longer runs any VOMS service. For which VO(s) do you need the VOMS details?
Ah good to know. We need the VOMS config for belle II
We were looking at this document https://gitlab.desy.de/grid/voms-cert-transition but I guess that's outdated.
Looking at our current system we have two VOMS services for belle II:
belle-grid-voms.desy.de and belle-voms.cc.kek.jp
We might go the route and just set the X509_VOMS_DIR and X509_VOMSES but would still be good if there is also an RPM package.
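For reference, the manual route mentioned above would look roughly like this. A hedged sketch only: the file paths are conventional defaults, the port is the one shown elsewhere in this report for the KEK service, and the server certificate DN is a placeholder that must be taken from the VO ID card, not from this ticket.

```shell
# Hypothetical sketch of hand-maintained VOMS client config for belle,
# bypassing the RPM. The vomses line format is:
#   "alias" "host" "port" "server certificate DN" "vo name"
mkdir -p /etc/vomses /etc/grid-security/vomsdir/belle
cat > /etc/vomses/belle <<'EOF'
"belle" "voms.cc.kek.jp" 15020 "<server DN from the VO ID card>" "belle"
EOF
# .lsc files (one per VOMS server) go under the vomsdir; each contains
# the server DN followed by its CA DN, again taken from the VO ID card.
export X509_VOMSES=/etc/vomses
export X509_VOMS_DIR=/etc/grid-security/vomsdir
```

With these variables exported, voms-proxy-init should pick up the hand-maintained config; an RPM in the WLCG repo (as later provided in this ticket) is still the cleaner solution.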
The current VOMS server details for Belle II can be found here:
https://operations-portal.egi.eu/vo/view/voname/belle

They currently have these servers:

voms.cc.kek.jp
belle-auth.cc.kek.jp

I will check if Belle II would like to have an rpm for them in the WLCG repository.
Hi all, rpms with Belle-II LSC files and "vomses" are available now for EL8, EL9 and EL10:

https://linuxsoft.cern.ch/wlcg/

For example:

wlcg-lsc-belle-1.0.0-1.el10.noarch.rpm
wlcg-vomses-belle-1.0.0-1.el10.noarch.rpm
Mind: the EL10 repo uses a SHA-2 signing key, the others use the SHA-1 legacy key.
WLCG #1001559 (id:1001559) LCG.HEPHY.at Pilot job Submission failure
State: assigned  |  Priority: less urgent  |  Opened: 2026-01-16 09:27 (78d ago)  |  Updated: 2026-01-28 10:15
Conversation (9 messages)
We are experiencing Pilot job submission failure vs LCG.HEPHY.at
ce-1.grid.vbc.ac.at

In site Director SiteDirectorHTcondorCE2 we see the following errors
2026-01-15 20:12:42 UTC WorkloadManagement/SiteDirectorHTCondorCE2/ce-1.grid.vbc.ac.at ERROR: Failed to submit jobs to htcondor Command ['condor_submit', '-terse', '-pool', 'ce-1.grid.vbc.ac.at:9619', '-remote', 'ce-1.grid.vbc.ac.at', '/opt/dirac/pro/runit/WorkloadManagement/SiteDirectorHT/HTCondorCE_lp8htdbt.sub'] failed with: 1 - ERROR: Failed to connect to queue manager ce-1.grid.vbc.ac.at
2026-01-15 20:12:42 UTC WorkloadManagement/SiteDirectorHTCondorCE2/WorkloadManagement/SiteDirectorHTCondorCE2 ERROR: Failed submission to queue Queue ce-1.grid.vbc.ac.at_htcondorce-condor:
Command ['condor_submit', '-terse', '-pool', 'ce-1.grid.vbc.ac.at:9619', '-remote', 'ce-1.grid.vbc.ac.at', '/opt/dirac/pro/runit/WorkloadManagement/SiteDirectorHT/HTCondorCE_lp8htdbt.sub'] failed with: 1 - ERROR: Failed to connect to queue manager ce-1.grid.vbc.ac.at
Even the condor ping fails

(base) [spardi@ui-tier1 TOKEN]$ voms-proxy-info --all
subject : /C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo/CN=3748414437/CN=430235544/CN=3605940026/CN=9479455334
issuer : /C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo/CN=3748414437/CN=430235544/CN=3605940026
identity : /C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo/CN=3748414437/CN=430235544/CN=3605940026
type : RFC compliant proxy
strength : 2048 bits
path : /tmp/x509up_u21037
timeleft : 41:56:19
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO belle extension information ===
VO : belle
subject : /C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo
issuer : /C=JP/O=KEK/OU=CRC/CN=belle-auth.cc.kek.jp
attribute : /belle/Role=production/Capability=NULL
attribute : /belle/Role=NULL/Capability=NULL
attribute : nickname = dirac (belle)
timeleft : 128:26:30
uri : belle-auth.cc.kek.jp:15020
(base) [spardi@ui-tier1 TOKEN]$ condor_ce_ping -pool ce-1.grid.vbc.ac.at:9619 -name ce-1.grid.vbc.ac.at -verbose
WARNING: Missing <authz-level | command-name | command-int> argument, defaulting to WRITE.
WARNING: Missing daemon argument, defaulting to SCHEDD.
ERROR: couldn't locate ce-1.grid.vbc.ac.at!
We are using SSL authentication with DN /C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo/

Could you please have a look?
Thnx
Silvio
Hi Silvio!
Our compute element appears to be fine; we're receiving job submissions from other experiments.
As far as I can see, we have an SSL identity mapping for "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo"

Looking at the error message "ERROR: couldn't locate ce-1.grid.vbc.ac.at" makes me think that the host lookup fails, and the IP can't be resolved. Is the submitting host an IPv6 only system (or the nameserver only resolving over IPv6)?

Currently, ce-1.grid.vbc.ac.at does not have an IPv6 address.

Best,
Erich
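Erich's hypothesis (a v6-only submit host or resolver failing to find the v4-only CE) can be checked from the submitting side with a lookup per address family. A minimal sketch, assuming glibc's getent is available:

```shell
#!/bin/sh
# Check which address families a hostname resolves to, to test the
# "IPv6-only submit host vs IPv4-only CE" hypothesis (hedged sketch).
resolves() {
    # $1 = name or address, $2 = getent database (ahostsv4 / ahostsv6)
    if getent "$2" "$1" >/dev/null 2>&1; then echo yes; else echo no; fi
}
host=${1:-ce-1.grid.vbc.ac.at}
echo "IPv4: $(resolves "$host" ahostsv4)"
echo "IPv6: $(resolves "$host" ahostsv6)"
```

If IPv4 prints yes while IPv6 prints no and the submitting host prefers v6, that would match the symptom; the same check with dig A / dig AAAA against the submit host's resolver would confirm it.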
>>Our compute element appears to be fine, we're receiving job submissions from other experiments.
Using SSL?

The machine is properly reachable on the port:

(base) [spardi@ui-tier1 TOKEN]$ telnet ce-1.grid.vbc.ac.at 9619
Trying 193.171.188.212...
Connected to ce-1.grid.vbc.ac.at.
Escape character is '^]'.

The error is a generic HTCondor error that could stem from many different causes, including authentication and mapping:

(base) [spardi@ui-tier1 TOKEN]$ condor_ce_ping -pool ce-1.grid.vbc.ac.at:9619 -name ce-1.grid.vbc.ac.at -verbose
WARNING: Missing <authz-level | command-name | command-int> argument, defaulting to WRITE.
WARNING: Missing daemon argument, defaulting to SCHEDD.
ERROR: couldn't locate ce-1.grid.vbc.ac.at!

We should go through some checks to try to identify the issue:
1) Check that the Certification Authorities have been updated and that fetch-crl is up to date
2) Could you give me the output of the following commands on the CE:
rpm -qa |grep KEK
rpm -qa |grep condor
rpm -qa |grep openssl

3) The output of
condor_ce_config_val -d |grep SSL

4) Could I see the exact content of the mapping file on the CE?

In my case it is something like:
[root@htc-belle-ce02 ~]# cat /etc/condor-ce/mapfiles.d/20-ssl.conf
SSL "/DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Silvio Pardi spardi@infn.it" belle003
SSL "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot1 - UEDA Ikuo" prdbelle002
SSL "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo" prdbelle001
SSL "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot 2 - UEDA Ikuo" prdbelle004

Thnx
Silvio
Could you please run the checks reported in the previous message? Also, which Certification Authority is used for the host? There is an issue with some CAs that provide SHA1-only CRLs.
Let us know, together with the results of the previous tests.
Thank you for your cooperation
Silvio
Hi Silvio,
1.) The Certification Authorities should be up to date and fetch-crl is being run by a systemd timer
2.)

[root@ce-1 apel]# rpm -qa |grep KEK
ca_KEK2024-1.138-1.noarch
[root@ce-1 apel]# rpm -qa |grep condor
htcondor-release-23.x-1.el9.noarch
htcondor-ce-client-23.10.1-1.el9.noarch
htcondor-ce-slurm-23.10.1-1.el9.noarch
htcondor-ce-view-23.10.1-1.el9.noarch
condor-placeholder-0.0.0-0.el9.noarch
htcondor-ce-apel-23.10.1-1.el9.noarch
htcondor-ce-23.10.1-1.el9.noarch
condor-upgrade-checks-23.10.29-1.el9.x86_64
python3-condor-23.10.29-1.el9.x86_64
condor-23.10.29-1.el9.x86_64
[root@ce-1 apel]# rpm -qa |grep openssl
openssl-fips-provider-so-3.0.7-6.el9_5.x86_64
openssl-fips-provider-3.0.7-6.el9_5.x86_64
openssl-libs-3.2.2-6.el9_5.x86_64
openssl-3.2.2-6.el9_5.x86_64
xmlsec1-openssl-1.2.29-13.el9.x86_64
openssl-devel-3.2.2-6.el9_5.x86_64

3.)
[root@ce-1 apel]# condor_ce_config_val -d |grep SSL
AUTH_SSL_ALLOW_CLIENT_PROXY = True
AUTH_SSL_AUTOGENERATE_CERTFILE = $(ETC)/hostcert.pem
AUTH_SSL_AUTOGENERATE_KEYFILE = $(ETC)/hostkey.pem
AUTH_SSL_CLIENT_CADIR = /etc/grid-security/certificates
AUTH_SSL_CLIENT_CAFILE =
AUTH_SSL_CLIENT_CERTFILE = /etc/grid-security/hostcert.pem
AUTH_SSL_CLIENT_KEYFILE = /etc/grid-security/hostkey.pem
AUTH_SSL_CLIENT_USE_DEFAULT_CAS = true
AUTH_SSL_REQUIRE_CLIENT_CERTIFICATE = false
AUTH_SSL_REQUIRE_CLIENT_MAPPING = True
AUTH_SSL_SERVER_CADIR = /etc/grid-security/certificates
AUTH_SSL_SERVER_CAFILE =
AUTH_SSL_SERVER_CERTFILE = /etc/grid-security/hostcert.pem
AUTH_SSL_SERVER_KEYFILE = /etc/grid-security/hostkey.pem
AUTH_SSL_SERVER_USE_DEFAULT_CAS = true
AUTH_SSL_USE_CLIENT_PROXY_ENV_VAR = false
AUTH_SSL_USE_VOMS_IDENTITY = true
BOOTSTRAP_SSL_SERVER_TRUST = false
BOOTSTRAP_SSL_SERVER_TRUST_PROMPT_USER = true
COLLECTOR.SEC_ADVERTISE_STARTD_AUTHENTICATION_METHODS = FS,TOKEN,SCITOKENS,SSL
COLLECTOR.SEC_READ_AUTHENTICATION_METHODS = FS,TOKEN,SCITOKENS,SSL
COLLECTOR.SEC_WRITE_AUTHENTICATION_METHODS = FS,TOKEN,SCITOKENS,SSL
COLLECTOR_BOOTSTRAP_SSL_CERTIFICATE = false
GAHP_SSL_CADIR =
GAHP_SSL_CAFILE =
SCHEDD.SEC_READ_AUTHENTICATION_METHODS = FS,SCITOKENS,SSL
SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,SCITOKENS,SSL
SEC_CLIENT_AUTHENTICATION_METHODS = FS, TOKEN, SCITOKENS, SSL
SSL_SKIP_HOST_CHECK = false
4.)
[root@ce-1 apel]# cat /etc/condor-ce/mapfiles.d/99-ssl.conf
# EG Monitoring Service
SSL "/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr" grid.ops.mon
SSL "/DC=EU/DC=EGI/C=HR/O=Robots/O=SRCE/CN=Robot:argo-egi@cro-ngi.hr" grid.ops.mon
#SSL "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=litmaath/CN=410032/CN=Maarten Litmaath" grid.ops.mon
SSL /\/DC=ch\/DC=cern\/OU=Organic Units\/OU=Users\/CN=litmaath\/CN=410032\/CN=Maarten Litmaath,.*/ grid.ops.mon
SSL /\/DC=EU\/DC=EGI\/C=GR\/O=Robots\/O=Greek Research and Technology Network\/CN=Robot:argo-egi@grnet.gr,.*/ grid.ops.mon
SSL /\/DC=EU\/DC=EGI\/C=HR\/O=Robots\/O=SRCE\/CN=Robot:argo-egi@cro-ngi.hr,.*/ grid.ops.mon
SSL /\/DC=EU\/DC=EGI\/C=GR\/O=Robots\/O=Greek Research and Technology Network\/CN=Robot:argo-secmon@grnet.gr,.*/ grid.ops.mon

# EGI Security monitoring
SSL "/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-secmon@grnet.gr" grid.ops.sec

# BELLE
SSL "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo" grid.belle.pilot
SSL "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot1 - UEDA Ikuo" grid.belle.pilot
#SSL "/C=JP/O=KEK/OU=CRC/CN=Hideki Miyake" grid.belle.prod

SSL "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Production - UEDA Ikuo" grid.belle.prod

#SSL "/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo,/belle/Role=production/Capability=NULL,/belle/Role=NULL/Capability=NULL" grid.belle.prod

# ALICE
#SSL "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=maarten/CN=410032/CN=Maarten Litmaath" grid.alice.pool003
#SSL "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=maarten/CN=410032/CN=Maarten Litmaath,/alice/Role=NULL/Capability=NULL,/alice/alarm/Role=NULL/Capability=NULL,/alice/lcg1/Role=NULL/Capability=NULL,/alice/team/Role=NULL/Capability=NUL" grid.alice.pool003
Hi Eric,
you have set AUTH_SSL_USE_VOMS_IDENTITY = true on your condor CE. You can either set it to false and leave the mapping as it is, or, if you want to keep it, you should add the VOMS extension to the mapping file.
Following the discussion here (https://helpdesk.ggus.eu/#ticket/zoom/1001454) you may use something like:

SSL "/\/C=JP\/O=KEK\/OU=CRC\/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo,\/belle\/.*/"
grid.belle.pilot

SSL "/\/C=JP\/O=KEK\/OU=CRC\/CN=Robot: BelleDIRAC Pilot1 - UEDA Ikuo,\/belle\/.*/"
grid.belle.pilot

SSL "/\/C=JP\/O=KEK\/OU=CRC\/CN=Robot: BelleDIRAC Pilot 2 - UEDA Ikuo,\/belle\/.*/"
grid.belle.pilot

Let me know.
Thnx
Silvio
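The entries above are regular expressions matched against the full authenticated identity — the DN followed by the comma-separated VOMS FQANs when AUTH_SSL_USE_VOMS_IDENTITY is true — so they can be sanity-checked locally before restarting the CE. A minimal sketch using grep (note that the \/ escaping in the mapfile comes from its /…/ delimiters and is not needed here):

```shell
#!/bin/sh
# Sanity-check a CE mapfile regex against the identity string HTCondor
# would see for the Belle pilot robot (DN + FQANs taken from this ticket).
id='/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo,/belle/Role=production/Capability=NULL,/belle/Role=NULL/Capability=NULL'
re='^/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo,/belle/.*'
if printf '%s' "$id" | grep -Eq "$re"; then
    echo "match -> would map to grid.belle.pilot"
else
    echo "no match"
fi
```

A no-match here for an identity you expect to accept usually means the FQAN part of the pattern does not line up with what the proxy actually carries.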
Hi Silvio,
thanks for the pointers. We added the 3 SSL mappings and restarted the HTCondor-CE service.
Could you check if it works?
Thanks
Best
Ümit
Hi Ümit,
I still cannot access the cluster:

(base) [spardi@ui-tier1 ~]$ condor_ce_ping -pool ce-1.grid.vbc.ac.at:9619 -name ce-1.grid.vbc.ac.at -verbose
WARNING: Missing <authz-level | command-name | command-int> argument, defaulting to WRITE.
WARNING: Missing daemon argument, defaulting to SCHEDD.
ERROR: couldn't locate ce-1.grid.vbc.ac.at!

I think that you should add also the following parameter to your CE configuration
USE_VOMS_ATTRIBUTES = True
You may also follow this page for the full configuration
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToUseProxiesWithSsl

Thnx
Regards
Silvio
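Put together, the CE-side change Silvio suggests amounts to a small config drop-in plus a reconfig. A hedged sketch: the drop-in file name is arbitrary, and the exact knobs should be verified against the HTCondor-CE documentation for the deployed version.

```shell
# Hypothetical drop-in enabling VOMS-aware SSL mapping on the CE.
cat > /etc/condor-ce/config.d/99-voms-ssl.conf <<'EOF'
# Read the VOMS attributes from the client's proxy ...
USE_VOMS_ATTRIBUTES = True
# ... and append the FQANs to the SSL identity used for mapfile matching.
AUTH_SSL_USE_VOMS_IDENTITY = True
EOF
condor_ce_reconfig
```

After the reconfig, the D_AUDIT lines in the CE log should show the FQANs appended to the authenticated DN, which is what the regex mappings match against.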
Let me report that it now seems to work:
(base) [spardi@ui-tier1 ~]$ condor_ce_ping -pool ce-1.grid.vbc.ac.at:9619 -name ce-1.grid.vbc.ac.at -verbose
WARNING: Missing <authz-level | command-name | command-int> argument, defaulting to WRITE.
WARNING: Missing daemon argument, defaulting to SCHEDD.
Destination: schedd ce-1.grid.vbc.ac.at

Remote Version: $CondorVersion: 23.10.29 2025-09-22 BuildID: 834959 PackageID: 23.10.29-1 GitSHA: ded6225d $
Local Version: $CondorVersion: 9.0.20 Nov 16 2023 $
Session ID: ce-1:2794806:1769595247:3184
Instruction: WRITE
Command: 60021
Encryption: AES
Integrity: AES
Authenticated using: SSL
All authentication methods: SSL
Remote Mapping: grid.belle.pilot@users.htcondor.org

Authorized: TRUE
and also several pilot jobs have been processed
WLCG #1000455 (id:1000455) Belle VOMS server connection timeout
State: assigned  |  Priority: less urgent  |  Opened: 2025-09-02 13:52 (214d ago)  |  Updated: 2025-09-02 13:52
Conversation (1 message)
During the VOMS Cert Transition (see https://gitlab.desy.de/grid/voms-cert-transition ) we installed the desy-voms-all-1.0.0 RPM, which should contain all necessary files. However, when we try to create a user grid cert with the belle VO extension, we get the following error message:
```
Contacting grid-voms.desy.de:15020 [/DC=org/DC=terena/DC=tcs/C=DE/ST=Hamburg/O=Deutsches Elektronen-Synchrotron DESY/CN=grid-voms.desy.de] "belle"...
Error contacting grid-voms.desy.de:15020 for VO belle: Connect timed out
None of the contacted servers for belle were capable of returning a valid AC for the user.
User's request for VOMS attributes could not be fulfilled.
```
Running an openssl cert check also results in a timeout:
```
openssl s_client -connect grid-voms.desy.de:15020 | grep -A2 "END CERTIFICATE"

```
We checked our firewall and the request is not blocked.
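Beyond the firewall check, a bounded connectivity probe from the affected host helps distinguish a refused connection from a silent drop. A minimal sketch using bash's /dev/tcp (assumptions: bash and coreutils timeout are available):

```shell
#!/bin/bash
# Probe a TCP port with a 5 s cap: "open" means the handshake completed,
# "closed/filtered" means a refusal or a silent drop along the path.
probe() {
    # $1 = host, $2 = port
    if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo open
    else
        echo "closed/filtered"
    fi
}
probe grid-voms.desy.de 15020
```

A result of closed/filtered only after the full 5 s, with no immediate refusal, typically points at a drop on the path or on the server side, which is consistent with the "Connect timed out" error above.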
WLCG #681618 (id:1719) Upgrade to a supported HTCondor version and enable SSL authentication (Hephy-Vienna)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:59 (430d ago)  |  Updated: 2025-02-11 09:07
Conversation (22 messages)
GGUS ID: 163980
Last modifier: Alessandro Paolini
Date: 2023-11-03 11:25:55
Subject: Upgrade to a supported HTCondor version and enable SSL authentication (Hephy-Vienna)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Middleware
Description:
Dear site admins,

with this ticket we would like to follow-up the upgrade to a supported version of HTCondorCE and the migration from voms-based authentication with X509 certificates to AAI tokens for accessing the HTCondorCE endpoints.

The HTCondor team set-up an upgrade procedure to help sites and VOs with the migration from X509 personal certificates to tokens.
Essentially, an intermediate step was created where plain SSL authentication can be used to authenticate a client's proxy, in addition to the GSI or token one:
- https://confluence.egi.eu/x/EYAtDQ

In summary, the steps are:

- update to HTCondor 9.0.19
- enable the SSL authz (with priority over GSI)
- map the users' DNs
- test the SSL authz successfully
- update to latest HTCondor 10.x

You can find the HTCondor 9.0.19 version in WLCG repository for the time being, as explained in the instructions.

Please also note the usage in the last step of the HTCondor Feature channel (https://htcondor.org/htcondor/release-highlights/index.html#feature-channel), since this is the one supporting the EGI Check-in plugin from 10.4.0.
In this way sites can accept clients' proxies and tokens at the same time while waiting for the supported VOs to move completely to tokens.
You can find the latest HTCondor 10.x version in the HTCondor Feature Channel repository.

Please note that after the upgrade to HTCondor 10 version, you need to install and configure the EGI Check-in plugin in order to be compliant with the EGI tokens:
https://github.com/EGI-Federation/check-in-validator-plugin

Please get in contact with your supported communities to properly map the users' DNs to local accounts, to also ensure access via X509 personal certificates.

Concerning the ops VO, please map at least the following certificates:
- EGI Monitoring Service:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr"
"/DC=EU/DC=EGI/C=HR/O=Robots/O=SRCE/CN=Robot:argo-egi@cro-ngi.hr"

- EGI Security monitoring:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-secmon@grnet.gr"

Please also properly configure the Accounting settings on the HTCondor 10 installation, as explained in the instructions.

Thanks for your collaboration,
EGI Operations
GGUS ID: 163980
Last modifier: Umit Seren
Date: 2023-11-14 10:42:02

Public Diary:
Hi Alessandro,

Our CE is running jobs for 3 experiments (CMS, Belle, ALICE), and to reduce operational complexity we configured our CE to use LCMAPS to map the various GSI certificates to local users. Do you by chance know if we can also use the LCMAPS GSI callout with the SSL proxy certificates?

Thanks in advance
Best
Ümit

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Alessandro Paolini
Date: 2023-11-17 08:47:45

Public Diary:
Hi Umit,

a fix for the fallback from SSL to GSI has been released with HTCondor 9.0.20 so you can update to it.
Please check carefully the instructions since there are a few new variables to set.

Cheers,
Alessandro
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Alessandro Paolini
Date: 2023-11-14 11:28:22

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
Hi Umit,
I really don't know if it is possible: considering that GSI was dropped with HTCondor 10, your implementation might not work.

You may want to check directly with the developers.

Cheers,
Alessandro
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Umit Seren
Date: 2023-11-17 08:53:22

Public Diary:
Hi Alessandro,
I was about to ask about the GSI fallback because yesterday we enabled SSL with a fallback to GSI and we haven't seen any GSI authentication messages in the htcondor-ce log.
Will update to the new version and observe it again
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Umit Seren
Date: 2023-11-17 10:26:46

Public Diary:
Hi Alessandro,

after upgrading to condor 9.0.20 and adding the additional configuration, the fallback to GSI seems to work.
It seems that Belle is still authenticating via GSI (see [1]) instead of SSL. Is there anything to do on our side, or do we have to contact Belle to enable SSL proxy support on their side for our site?




[1]
11/17/23 11:17:20 (cid:293) (D_AUDIT) AuthMethod=GSI, AuthId=/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo,/belle/Role=production/Capability=NULL,/belle/Role=NULL/Capability=NULL, CondorId=grid.belle.prod02@users.htcondor.org
11/17/23 11:17:24 (cid:296) (D_AUDIT) AuthMethod=GSI, AuthId=/C=JP/O=KEK/OU=CRC/CN=Robot: BelleDIRAC Pilot - UEDA Ikuo,/belle/Role=production/Capability=NULL,/belle/Role=NULL/Capability=NULL, CondorId=grid.belle.prod02@users.htcondor.org

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Alessandro Paolini
Date: 2023-11-17 10:33:33

Public Diary:
Hi Umit,

probably Belle II still has your CE in its list of CEs requiring only GSI, so they have to keep GSI as the first option for the time being.
Anyway, you may want to inform them that your CE is ready for SSL as the primary authentication method.

Cheers,
Alessandro
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Maarten Litmaath
Date: 2023-11-18 21:22:31

Public Diary:
Hallo Umit,
as the callout to LCMAPS is _not_ supported in condor >= 10.x,
your SSL mappings will have to be done by HTCondor itself.
For tokens, more flexibility is envisaged in the future.

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Umit Seren
Date: 2023-11-20 09:34:13

Public Diary:
Hi Maarten,

thanks for the info. We actually looked into it and it seems quite straightforward to move from LCMAPS to regular mapping via HTCondor.
When I checked the logs, I noticed that ALICE workloads mostly come in via SCITOKENS; however, every half hour we see a workload with your proxy cert that is still mapped to GSI (see [1]). Do you have to change this on your side, or do we have to change some mapping so that these workloads use SSL instead of GSI?



[1]

11/20/23 10:28:17 (cid:419759) (D_AUDIT) AuthMethod=GSI, AuthId=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=maarten/CN=410032/CN=Maarten Litmaath,/alice/Role=NULL/Capability=NULL,/alice/alarm/Role=NULL/Capability=NULL,/alice/lcg1/Role=NULL/Capability=NULL,/alice/team/Role=NULL/Capability=NULL, CondorId=grid.alice.pool003@users.htcondor.org

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Maarten Litmaath
Date: 2023-11-20 23:10:11

Public Diary:
Hallo Umit,
my certificate is still used by the ALICE SAM tests.
A token version of those tests is in the work plan,
but remains low priority for the time being.
If you want to upgrade your CE to HTCondor v23 eventually
and the ALICE SAM tests still have not been adjusted yet,
you would need to map my DN via SSL instead, thanks!
In the meantime I will look into the HTCondor version
and configuration used by those SAM tests...

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Umit Seren
Date: 2023-11-21 13:42:07

Public Diary:
Hi Maarten,
thanks for the info. I tried to map your proxy via SSL with the following mapping, but it still falls back to GSI:

SSL "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=maarten/CN=410032/CN=Maarten Litmaath" grid.alice.pool003

SSL "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=maarten/CN=410032/CN=Maarten Litmaath,/alice/Role=NULL/Capability=NULL,/alice/alarm/Role=NULL/Capability=NULL,/alice/lcg1/Role=NULL/Capability=NULL,/alice/team/Role=NULL/Capability=NUL" grid.alice.pool003



Do you see anything wrong with the mapping?

Thanks in advance

Best
Ümit

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Maarten Litmaath
Date: 2023-11-21 13:51:16

Public Diary:
Hallo Umit,
that is due to the client side not having the right version and/or
configuration yet. I will follow up with the service manager.

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Maarten Litmaath
Date: 2024-05-01 17:24:14

Public Diary:
Hallo Vienna Team,
in the meantime, the ALICE SAM tests have been switched to tokens
and in any case do not matter, as you are a _T3_ site for ALICE
and hence are not included in the A/R reports for ALICE.
I have adjusted the configuration for those tests to be stopped...
For the record, the SSL mapping of my subject DN still works OK.
Do you already have all "belle" and "ops" subject DNs mapped via SSL?

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Umit Seren
Date: 2024-05-02 08:48:39

Public Diary:
Hi Maarten,

thanks for the update.

We can see in the HTCondor logs that the Alice jobs are authenticated with SCITOKENS.

Regarding Belle: we have the mapping, but they still fall back to GSI. I can't remember whether we opened a GGUS ticket with them about this; I will have to check. If not, I will open one.

Best

Ümit

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Erich Birngruber
Date: 2024-05-02 11:39:35

Public Diary:

Hi Maarten,
Can you clarify our site status? T3 is causing some confusion. We were under the impression we're a T2 site (at least it was our intention to become a T2 site). Storage monitoring is also showing us as T2, see http://alimonitor.cern.ch/?3222

What are the implications of T2 vs T3, and what would we need to change to become T2? We can continue this by email, if that is more appropriate ( erich.birngruber@gmi.oeaw.ac.at )
Best,
Erich


Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Maarten Litmaath
Date: 2024-05-03 21:59:55

Public Diary:
Hallo Erich,
we will follow up on that aspect via e-mail.

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Erich Birngruber
Date: 2024-09-17 12:03:25

Status: solved
Responsible Unit: NGI_IT
Solution:
Hi all,
since we've clarified our site status to be T2, I think this is good to close.
Best,
Erich
GGUS ID: 163980
Last modifier: Maarten Litmaath
Date: 2024-09-18 10:14:04

Public Diary:
Hi all,
I will check the SSL mapping and update the ticket later today.

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Maarten Litmaath
Date: 2024-09-18 13:25:42

Public Diary:
Hi all,
the SSL mapping works OK, but the CE is still running 9.0.20 on CentOS 7!
That means the ticket has to stay open for now.

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Umit Seren
Date: 2025-01-22 09:24:18

Public Diary:
Short update:

We will update the operating system of our CE to RHEL9, and in the process we will also update HTCondor-CE to 10.x.

This will probably happen in the next week or so.

Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163980
Last modifier: Alessandro Paolini
Date: 2025-01-22 09:39:56

Public Diary:
Hi Umit,

thanks for providing the migration plan.
I'd suggest you install v23.0.x and you can also follow this guide written by Maarten:
https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv23EL9

Cheers,
Alessandro
Internal Diary:
Escalated this ticket to NGI_IT
Hi Maarten, Alessandro,
thanks for the upgrade guide.
We upgraded our Compute Element (ce-1.grid.vbc.ac.at) to RHEL9 & HTCondor-CE 23.x
We can see that Belle, ALICE and CMS successfully authenticate and can submit jobs.
@Maarten: We would also like to upgrade our alien-1.grid.vbc.ac.at vobox to RHEL9.
We would set up a new VM with the vobox on RHEL9: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGvoboxDeployment#EL9_instructions . We have some ansible playbooks that we used to set up the remaining alien services based on this: https://alien.web.cern.ch/content/alice-vo-box-setup-and-configuration. Is this guide still valid?
Thanks in advance
Best
Ümit
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM42%100%95%92%100%95%76%79%83%100%100%100%100%100%100%100%
HammerCloud100%100%100%100%100%100%99%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #681824 (id:1925) Request to implement BGP tagging of LHCONE prefixes. (UERJ)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:47 (430d ago)  |  Updated: 2025-04-03 13:34
Conversation (13 messages)
GGUS ID: 168411
Last modifier: Renan Bernardo
Date: 2024-09-26 12:42:34

Public Diary:
Hello,

We will contact our provider about this. Thank you.

Best,
Renan
GGUS ID: 168411
Last modifier: Julia Andreeva
Date: 2024-09-23 15:45:28
Subject: Request to implement BGP tagging of LHCONE prefixes. (UERJ)
Ticket Type: USER
CC: ;edoardo.martelli@cern.ch
Status: assigned
Responsible Unit: USCMS
Issue type: Network problem
Description:
This ticket concerns all the sites connected to LHCONE.

In agreement with the WLCG Management Board, it has been decided to
implement the tagging of the IP prefixes announced to LHCONE.
The task consists of tagging the IP prefixes that your site announces to
LHCONE with all the BGP communities that identify the experiments and
collaborations supported by your site. The initial goal is to document
the use of the network. In the longer term the tags may be used to
reduce the exposure on the LHCONE connection, by filtering unnecessary
prefixes.

You will find the values of the BGP communities to use and other
information in these pages:
- https://twiki.cern.ch/twiki/bin/view/LHCONE/MultiOneBGPcommunities
-
https://indico.cern.ch/event/1356138/contributions/6123461/attachments/2925447/5147273/WLCG-20240911-GDB-MultiONE-implementation.pdf

If you need any support on this task, please don't hesitate to ask your
NREN or LHCONE provider.
Or just reply to this ticket asking your questions; experts will guide
you in the implementation.

Please take this opportunity also to review the network information
related to your site in CRIC :
https://wlcg-cric.cern.ch/core/networkroute/list/

We ask you to perform the required action by the end of March 2025.
GGUS ID: 168411
Last modifier: Julia Andreeva
Date: 2025-01-28 14:10:15

Public Diary:
Hello Julia,

We have contacted the network's technical staff and are still waiting for a response.

Best,
Eduardo
Internal Diary:
Escalated this ticket to USCMS
GGUS ID: 168411
Last modifier: Julia Andreeva
Date: 2025-01-28 14:10:32

Public Diary:
Any progress on this ticket?
Internal Diary:
Escalated this ticket to USCMS
GGUS ID: 168411
Last modifier: Eduardo Azevedo Revoredo
Date: 2024-12-02 14:11:14

Public Diary:
Hello Julia,

We have contacted the network's technical staff and are still waiting for a response.

Best,
Eduardo
GGUS ID: 168411
Last modifier: Renan Bernardo
Date: 2025-01-28 14:15:53

Public Diary:
Hi Julia,

We're pressing our provider about this, but so far no progress.

Best,
Internal Diary:
Escalated this ticket to USCMS
Hello,

We're working with our provider to implement these configurations. Should be applied by Wednesday.
We'll keep you updated.

Best regards,
Davi Jardim
Hello,
The changes were applied by our provider.

Best,
Renan
Hi
What are your IP prefixes?
I've checked 152.92.255.0/24 and 2804:1f10:8010:a032::/64, but I don't see any 61339: BGP communities.
Is this the record for your site?
https://wlcg-cric.cern.ch/core/networkroute/detail/UERJ-LHCONE/

cheers
Edoardo
Assigning missing CMS site name error during import to new GGUS.
Jakrapop
Hello Edoardo.
Yes, you are right, these are our IP prefixes (152.92.255.0/24 and 2804:1f10:8010:a032::/64).

We are in contact with Rederio/RNP to find out more details about this.
Do you have any particular questions that need to be answered?
We'll let them know that the 61339: BGP community is not being identified by you.

Attached are two images sent by them with the test results. Could you take a look?

Thank you.
Best,
Eduardo
Hi
Now I see the tags on the IPv4 prefix:

152.92.255.0/24 61339:3 61339:60001 61339:60002

But 2804:1f10:8010:a032::/64 doesn't have any.

cheers
Edoardo
Hello Edoardo.
Thank you for your comments.

I took the liberty of emailing you after speaking to the technicians at our ISP in Brazil.
Cheers,
Eduardo
            -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM         100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%  97%
HammerCloud 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%  99%  99%  99%
FTS          50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #681775 (id:1876) HTCondor CE configuration for dteam tokens (Dirac(X) tests and certification)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:20 (430d ago)  |  Updated: 2025-02-03 11:30
Conversation (14 messages)
GGUS ID: 168944
Last modifier: federico.stagni
Date: 2024-11-07 14:59:16

Public Diary:
One typo: should be dteam-auth.cern.ch not dteam-auth.web.cern.ch
Internal Diary:
Sent 1st reminder to ticket submitter (vaiva.zokaite@cern.ch) requesting input.
GGUS ID: 168944
Last modifier: federico.stagni
Date: 2024-11-07 14:40:19
Subject: HTCondor CE configuration for dteam tokens (Dirac(X) tests and certification)
Ticket Type: USER
CC:
Responsible Unit: TPM
Issue type: Middleware
Description:
Dear site admins,

could you please add support for WLCG token usage in dteam pilot/job submissions? The context for this request is that the dteam VO is used for Dirac(X) testing and certification purposes.

For example, in a file /etc/condor-ce/mapfiles.d/11-scitokens.conf:

SCITOKENS /^https:\/\/dteam-auth\.web\.cern\.ch\/,15d20b77-81be-41a4-a3d3-d4b3c0dd32f9$/ dteam-dirac

The token subject will be used to submit pilots/jobs from the Dirac(X) certification setup. The account name is a placeholder; your account name may differ.

To make the HTCondor CE aware of the additional mappings: condor_ce_reconfig

Please let us know when the CE configuration has been adjusted, such that we then may test it, thanks!

For the purpose of these tests we only configured 3 CEs (503, 504, 514).

Affected Site: CERN-PROD
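The mapping line in the request above can be sanity-checked offline before running condor_ce_reconfig. A minimal sketch, assuming (verify against your HTCondor-CE version's documentation) that the mapfile regex is matched against the string "<issuer>,<subject>":

```python
# Offline check of a SCITOKENS mapfile regex. Assumption: the CE matches the
# pattern against "<issuer>,<subject>"; confirm with your HTCondor-CE docs.
import re

# Pattern exactly as requested in the ticket description above:
PATTERN = r"^https:\/\/dteam-auth\.web\.cern\.ch\/,15d20b77-81be-41a4-a3d3-d4b3c0dd32f9$"

def matches_mapfile(pattern: str, issuer: str, subject: str) -> bool:
    """True if the issuer/subject pair would be mapped by this pattern."""
    return re.search(pattern, f"{issuer},{subject}") is not None
```

Such a check quickly shows, for example, that a token issued by dteam-auth.cern.ch (without ".web") would not match this pattern.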
GGUS ID: 168944
Last modifier: Josef Borik
Date: 2024-11-07 15:14:29

Status: assigned
Responsible Unit: ROC_CERN
Public Diary:
One typo: should be dteam-auth.cern.ch not dteam-auth.web.cern.ch
GGUS ID: 168944
Last modifier: SYSTEM
Date: 2024-11-07 15:14:55

Public Diary:
CERN ticket reference: https://cern.service-now.com/nav_to.do?uri=u_request_fulfillment.do?sys_id=RQF2908220
GGUS ID: 168944
Last modifier: valerio.buono@cern.ch
Date: 2024-11-07 15:17:12

Public Diary:
Assignment group: CERN GRID 2nd Line Support 3rd Line Support

GGUS ID: 168944
Last modifier: Giacomo.Tenaglia@cern.ch
Date: 2024-11-08 15:10:46

Status: in progress
Responsible Unit: ROC_CERN
Public Diary:
Assignment group: CERN GRID 2nd Line Support 3rd Line Support

Internal Diary:
08-11-2024 16:10:46 - Giacomo Tenaglia (Work notes (Internal View))
As discussed.
Thanks!
Giacomo
GGUS ID: 168944
Last modifier: Antonio Delgado Peris
Date: 2024-11-08 15:31:01

Public Diary:
Hi, just to be sure. Should we accept tokens from "dteam-auth.web.cern.ch" or "dteam-auth.cern.ch", or both?

Cheers,
Antonio
GGUS ID: 168944
Last modifier: Maarten Litmaath
Date: 2024-11-08 16:46:37

Public Diary:
Hi all,
good point, the suggested line actually names a non-existent host!
The ".web" subdomain needs to be removed:
SCITOKENS /^https:\/\/dteam-auth\.cern\.ch\/,15d20b77-81be-41a4-a3d3-d4b3c0dd32f9$/ dteam-dirac

GGUS ID: 168944
Last modifier: Antonio Delgado Peris
Date: 2024-11-11 08:59:04

Public Diary:
Hi,

I have reconfigured ce514. Could you try that one? If the test is OK we can reconfigure all the other CEs.

Cheers,
Antonio
GGUS ID: 168944
Last modifier: federico.stagni
Date: 2024-11-14 16:18:50

Public Diary:
Hello
we have a new client ID to add:


# dteam pilots (DIRAC certification instance)
SCITOKENS /^https\:\/\/dteam\-auth\.cern\.ch\/,8f610b59\-d2ab\-4a6f\-aacb\-8052ead48bfc$/ yourdteamaccountnamehere


Can you please add this one?

Cheers,
Federico
GGUS ID: 168944
Last modifier: Antonio Delgado Peris
Date: 2024-11-15 08:33:51

Public Diary:
Hi, are these meant for different purposes? I.e. should they be mapped to different users, or can they be mapped to the same one?

Cheers,
Antonio
GGUS ID: 168944
Last modifier: GGUS SYSTEM
Date: 2024-12-23 09:19:09

Public Diary:
09-12-2024 08:15:54 - Antonio Delgado Peris (Additional comments (Customer View))
Hi, could you please reply if the two subjects can be mapped to the same account, so that we can deploy this?

Internal Diary:
Sent 2nd reminder to ticket submitter (federico.stagni@cern.ch) requesting input.
GGUS ID: 168944
Last modifier: GGUS SYSTEM
Date: 2024-12-16 08:21:19

Public Diary:
09-12-2024 08:15:54 - Antonio Delgado Peris (Additional comments (Customer View))
Hi, could you please reply if the two subjects can be mapped to the same account, so that we can deploy this?

Internal Diary:
Sent 1st reminder to ticket submitter (federico.stagni@cern.ch) requesting input.
GGUS ID: 168944
Last modifier: GGUS SYSTEM
Date: 2024-12-30 09:19:20

Public Diary:
09-12-2024 08:15:54 - Antonio Delgado Peris (Additional comments (Customer View))
Hi, could you please reply if the two subjects can be mapped to the same account, so that we can deploy this?

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
            -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM         100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%  85%
HammerCloud 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS          50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (3)

CMS tickets (1)
CMS #1002251 (id:1002251) Unknown ARC SAM test results at T2_CH_CSCS
State: assigned  |  Priority: less urgent  |  Opened: 2026-04-02 09:50 (2d ago)  |  Updated: 2026-04-02 09:53
Conversation (1 message)
Good morning, CSCS admins.
Since 22:00 UTC yesterday (1 Apr), your compute endpoints have had several unknown results from all CE SAM tests [1]. The Glidein page and Grafana dashboard show that there are jobs running on these nodes [2], but the ETF log indicates the previous test was "found in status Queuing" [3]. Could you please take a look and check for any obstructions that prevent the SAM test jobs from running on your servers? I also CC the SI team for further investigation from their end.
Cheers,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_CH_CSCS
[2]
https://monit-grafana.cern.ch/d/requested-cpu/requested-cpu?orgId=11&var-site=T2_CH_CSCS&var-binning=1h&from=1774828800000&to=1775120399000

http://vocms0206.cern.ch/factory/monitor/factoryEntryStatusNow.html?entry=CMSHTPC_T2_CH_CSCS_arc-fort-3
[3]
https://etf-cms-prod.cern.ch/etf/check_mk/view.py?host=arc-fort-3.lcg.cscs.ch&service=org.sam.CONDOR-JobState-%2Fcms-ce-token&site=etf&view_name=service
https://etf-cms-prod.cern.ch/etf/check_mk/view.py?host=arc-fort-3.lcg.cscs.ch&service=org.sam.ARC-JobState-%2Fcms%2FRole%3Dlcgadmin&site=etf&view_name=service
WLCG tickets (2)
WLCG #681472 (id:1573) Request to deploy IPv6 on CEs and WNs at WLCG sites (CSCS-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:45 (430d ago)  |  Updated: 2026-02-26 10:39
Conversation (16 messages)
GGUS ID: 164335
Last modifier: Andrea Sciaba
Date: 2023-11-28 16:59:10

Public Diary:
Concerned VO has been changed from cms to none.

Internal Diary:

----------- e-mail with large body ------
added in total as mailbody.2024-12-16_16.24.35.txt

------------ e-mail with large body ------
GGUS ID: 164335
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:36:13
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (CSCS-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_CH
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
- your estimate of the timescale for the deployment;
- a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what the obstacles are and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
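A site can get a quick first read on its own endpoints from DNS alone before answering such a ticket. A minimal sketch using only the Python standard library (the hostnames to check are up to the site; this only tests name resolution, not actual connectivity):

```python
# Classify a host as ipv4-only / ipv6-only / dual-stack from the address
# families its name resolves to. Standard library only; results depend on
# the local resolver, and DNS dual-stack is necessary but not sufficient.
import socket

def classify(families: set) -> str:
    """Map a set of socket address families to a connectivity label."""
    has4 = socket.AF_INET in families
    has6 = socket.AF_INET6 in families
    if has4 and has6:
        return "dual-stack"
    if has4:
        return "ipv4-only"
    if has6:
        return "ipv6-only"
    return "unresolvable"

def stack_of(host: str) -> str:
    """Resolve a hostname and classify it; 'unresolvable' on DNS failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return "unresolvable"
    return classify({info[0] for info in infos})
```

Note this only proves the DNS side; as the ticket states, closure still requires the experiment's functional tests on a dedicated IPv6-only ETF instance.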
GGUS ID: 164335
Last modifier: Riccardo Di Maria
Date: 2024-01-15 08:59:12

Public Diary:
Dear Andrea,
just a quick reply to confirm this is acknowledged.
Updates will be provided in the next few months.
Best,
Riccardo
GGUS ID: 164335
Last modifier: Andrea Sciaba
Date: 2024-01-15 09:29:18

Status: on hold
Responsible Unit: NGI_CH
Public Diary:
Ciao Riccardo!
great, I'll just change the status to "on hold", if it's OK for you.
Andrea
GGUS ID: 164335
Last modifier: Andrea Sciaba
Date: 2024-04-08 14:52:44

Public Diary:
Ciao Riccardo,
any news, by chance?
Andrea
GGUS ID: 164335
Last modifier: Riccardo Di Maria
Date: 2024-04-09 11:22:17

Public Diary:
Ciao Andrea,
unfortunately not.
I have pinged CSCS network again today.
Will let you know.
Apologies, bests,
Riccardo
GGUS ID: 164335
Last modifier: Andrea Sciaba
Date: 2024-07-01 14:30:22

Public Diary:
Dear Riccardo,
the deadline has passed... Do you have any update?
Thanks,
Andrea
GGUS ID: 164335
Last modifier: Riccardo Di Maria
Date: 2024-07-05 10:02:21

Public Diary:
Dear Andrea,
apologies for the delayed reply.

I can confirm the work started and is progressing, but CSCS cannot provide you with an ETA on the matter.

May you please let me know whether this is a showstopper for you or WLCG in general?

Best regards,
Riccardo
GGUS ID: 164335
Last modifier: Andrea Sciaba
Date: 2024-07-05 12:00:50

Status: in progress
Responsible Unit: NGI_CH
Public Diary:
Ciao Riccardo,
fair enough. It is not a showstopper; WLCG has not yet discussed the consequences of not meeting the deadline. The most concrete effect is an increasing probability of some experiment services becoming IPv6-only, but this is not imminent, AFAIK. Still, having a deadline is necessary to put a timescale on the process, and it is highly desirable to complete it as fast as realistically possible.
Andrea
GGUS ID: 164335
Last modifier: Riccardo Di Maria
Date: 2024-10-27 21:32:45

Public Diary:
Unfortunately, still no news.
Ciao Riccardo,
any news?
Andrea
Ciao Riccardo,
any news?
Andrea
Hi,
it's been a long time since I've had an update from CSCS...
Cheers,
Andrea
Hello
Apologies for the lack of replies. Riccardo is leaving CSCS and I'm stepping in.
I'll dig and come back to you.
Salvatore
Hello, in the current configuration Alps does not support IPv6. It's a planned feature, but it has not yet arrived.
On the other hand, IPv6 on the ARC-CE is probably doable. I'll start working on that.
Salvatore
Hi Salvatore,
many thanks, let me know when there are news.
Cheers,
Andrea
WLCG #1001421 (id:1001421) CSCS-LCG2 stopped reporting to APEL starting from July
State: assigned  |  Priority: urgent  |  Opened: 2025-12-18 14:08 (107d ago)  |  Updated: 2026-02-16 11:10
Conversation (19 messages)
CSC-LCG2 stopped reporting to APEL starting from July
As subject says , CSC-LCG2 stopped reporting to APEL starting from July.
Hello Julia,
could you kindly provide more details on what kind of reports need to be sent and how? Last year we migrated both the WLCG ARCs and the cluster, and perhaps something was lost. Actually, I was not aware that we had to send any reporting. Is there documentation for this that I can take a look at?
Regards,
Salvatore Di Nardo
Hi Salvatore, in case you run ARC CEs, then ARC is reporting accounting data to APEL.
I am not an expert; below is some info I found. In case things do not work as described, you might need to ask ARC support.
Cheers
Julia
1. ARC Accounting Subsystem (ARC6 & ARC7 Admin Guide)
This section of the ARC documentation explains how ARC’s accounting subsystem works and how to configure reporting to APEL:
🔗 Accounting Subsystem — NorduGrid ARC (ARC6)
https://www.nordugrid.org/arc/arc6/admins/details/accounting-ng.html
Provides instructions on how to enable and configure accounting and how to publish records (including to APEL).
🔗 Accounting Subsystem — NorduGrid ARC (ARC7)
https://www.nordugrid.org/arc/arc7/admins/details/accounting-ng.html
Same as above, updated for ARC7.
What you’ll find here:

Overview of accounting in ARC CE (local DB, record creation).

How to enable reporting using jura-ng within arc.conf.

Example configuration block for APEL.

2. ARC Configuration Reference — APEL Block
Detailed reference for the ARC configuration options related to accounting and APEL.
🔗 ARC Configuration Reference — NorduGrid ARC7
https://www.nordugrid.org/arc/arc7/admins/reference.html
Look for the [arex/jura/apel:targetname] block.
Key configuration parameters you’ll use:

[arex/jura/apel:targetname]
targeturl = <APEL broker URL>
apel_messages = summaries # or `urs` for per-job CAR records
vofilter = <VO name> # optional VO filtering
urbatchsize = 500 # size of batch sending
gocdb_name = <your_resource_name> # name as registered in GOCDB

These settings control where and how ARC sends accounting summaries to APEL.
Once configured, you can also manually trigger sending of accounting records using the ARC control tool:

# Republish accounting records to the configured APEL target defined in arc.conf:
arcctl accounting republish -b YYYY-MM-DD -e YYYY-MM-DD -t <targetname>
# Or specify parameters directly:
arcctl accounting republish \
--end-from YYYY-MM-DD --end-till YYYY-MM-DD \
--apel-url https://msg.argo.grnet.gr \
--gocdb-name "YOUR-RESOURCE" \
--apel-messages summaries \
--apel-topic gLite-APEL
OK, thanks for the links. I'll check our ARC.
Hello again, I checked the current documentation and the configuration we had before July (the one that supposedly worked). They seem to match:
[arex/jura/apel:EGI]
apel_messages = summaries
targeturl = https://msg.argo.grnet.gr
gocdb_name = CSCS-LCG2
urdelivery_frequency = 7200
and:
[queue:normal]
...
benchmark = HEPSPEC 22.46
...

but trying to resend the data fails:

[root@atlas-fort-arc-1-0 arc]# arcctl accounting republish -b 2026-01-01 -e 2026-01-31 -t EGI
[2026-02-02 16:47:50,915] [ARC.Accounting.AMS] [ERROR] [316282] [Failed to obtain AMS authentication token. Error code 500 returned: Internal Server Error]
[2026-02-02 16:47:50,918] [ARC.Accounting.AMS] [ERROR] [316282] [Cannot publish data without AMS auth token.]
[2026-02-02 16:47:52,919] [ARC.Accounting.AMS] [ERROR] [316282] [Failed to obtain AMS authentication token. Error code 500 returned: Internal Server Error]
[2026-02-02 16:47:52,921] [ARC.Accounting.AMS] [ERROR] [316282] [Cannot publish data without AMS auth token.]
[2026-02-02 16:47:52,921] [ARC.Accounting.AMS] [ERROR] [316282] [Failed to publish records to APEL AMS msg.argo.grnet.gr. See previous messages for details.]
[2026-02-02 16:47:52,922] [ARC.Accounting.Publisher] [ERROR] [316282] [Failed to publish messages to APEL target]
[2026-02-02 16:47:52,922] [ARCCTL.Accounting] [ERROR] [316282] [Failed to republish accounting data from 2026-01-01 00:00:00 till 2026-01-31 00:00:00 to APEL target EGI]

Also, a simple "curl https://msg.argo.grnet.gr" returns "404 page not found". Perhaps the target server is down, or APEL has been relocated to a different URL? Do you have the correct configuration we need to apply, or do you know whom we should ask?

Apologies for the lack of knowledge on this, but I just stepped in one month ago, replacing Riccardo who left CSCS, so I might not be aware of all the historical details.
I think I found where the issue is. On September 22 we got the mail from Alessandro Paolini (in attachment), which we missed.
At CSCS we are still using ARC6 and tried to implement the suggested changes, but it still doesn't work (the changed lines are in the ARC7 branch, and while I found the corresponding lines, the change seems to have no effect).
It also looks to me like no updated version of ARC6 with the fix built in has been released, so I don't know how to fix this (except by upgrading to ARC7, which requires a lot of testing and time).
Perhaps there are some experts who could help, but I'm not sure how to reach them.

Salvatore
Hi Salvatore, I would try ARC forums:

nordugrid-discuss@nordugrid.org — the generic ARC discussion list for questions about ARC middleware (install, configure, issues, general usage). You can subscribe and post questions there.

wlcg-arc-ce-discuss@cern.ch — a WLCG-specific discussion e-group for ARC CE topics within the WLCG community.

I would try the first one, since your question is not specific for WLCG. Provide all necessary details in your mail.
Hello Julia,
just to let you know that I'm still working on it (no luck so far). The data are in ARC, but I keep getting HTTP error 500 upon upload. Someone on nordugrid-discuss@nordugrid.org is helping me. Hopefully it will get fixed soon and I'll be able to republish everything.
Regards,
Salvatore
Update: I created a ticket, because it now looks like the issue is on the other side (msg.argo.grnet.gr).
Here's the ticket I opened: https://helpdesk.ggus.eu/#ticket/zoom/1001808

But I was unsure where to send it; I could not find anything "APEL admins" related.
Let me know if it's better to move it anywhere else.

Regards,
Salvatore
Hello Julia, so far I have received no reply to my ticket. Perhaps you know who manages the msg.argo.grnet.gr server and could ask them to take a look, or whether there is a more appropriate ticket area to which this ticket should be relocated?
Alessandro and Adrian,
could you please help Salvatore in resolving this issue?
Thank you
Hi Salvatore,
indeed the problems that started at the end of June might be related to the new authentication method implemented by the Messaging service at that time; it was then reverted in July, since most of the ARC-CEs weren't (yet) ready for it. In the meantime a fix was created and circulated, to be applied manually on ARC6 and via an update on ARC7. The new authentication method was then re-applied last month.

Anyway, the fact that your CEs didn't manage to publish the accounting data even after the rollback might indicate issues of a different nature.

What is the error message that you get now after you applied the fix?

Did you change the certificate of the ARC-CE endpoints recently?
Can you please verify that each glite-APEL endpoint registered in https://goc.egi.eu/portal/index.php?Page_Type=Site&id=150 has the right certificate subject?

Best regards,
Alessandro
The ticket https://helpdesk.ggus.eu/#ticket/zoom/1001808 was assigned to the wrong SU; I've reassigned it to Messaging.
It may also be a good idea to fix the site name in the ticket title, so that local SUs aren't misled about who is who and where to intervene.
Hello Alessandro. Let's stick with arc-fort-1.lcg.cscs.ch (the other ARCs have the same problems, but let's debug just this one).
the subject is this:
[root@atlas-fort-arc-1-0 /]# openssl x509 -in /etc/grid-security/hostcert.pem -subject -noout -nameopt compat
subject=/DC=com/DC=emSignGrid/C=CH/ST=Zurich/O=ETH Zurich/CN=arc-fort-1.lcg.cscs.ch
and this is what we put ( check the links):
https://goc.egi.eu/portal/index.php?Page_Type=Service&id=14526
https://goc.egi.eu/portal/index.php?Page_Type=Service&id=15050
https://goc.egi.eu/portal/index.php?Page_Type=Service&id=15043
I was not aware that we had to edit that part, so this is surely the reason we failed to provide the reports for the last months. But I have now fixed it and still see no changes. I already asked for help on the NorduGrid mailing list, and to verify that the certificate works they suggested testing it with curl:

# curl --capath /etc/grid-security/certificates --cert /etc/grid-security/hostcert.pem --key /etc/grid-security/hostkey.pem https://msg.argo.grnet.gr:8443/v1/service-types/ams/hosts/msg.argo.grnet.gr:authx509
{
  "error": {
    "message": "Internal Error: Unable to parse CRL Data",
    "code": 500,
    "status": "INTERNAL SERVER ERROR"
  }
}

Keep in mind that we changed CA, so I'm wondering if by any chance the server does not trust the new CA we used for the new certificate.
Attached is the whole republish run in debug mode (with the fix in AccountingPublishing.py implemented).
Regards,
Salvatore Di Nardo
I've fixed the site name in the ticket title. Let's wait for a feedback in the other ticket from the ARGO Messaging team.
By the way, when you update the information in GOCDB, it takes several hours to propagate to the other tools. If you ran the script right after changing the information, the failure was expected.

By the way, the host DN should be registered for the gLite-APEL service endpoints (for example https://goc.egi.eu/portal/index.php?Page_Type=Service&id=15043 ), which I see is in any case correct.
Just to be sure, I had updated it everywhere and waited for over a day, but it's good to know.
Yeah, let's see what happens with https://helpdesk.ggus.eu/#ticket/zoom/1001808
            -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM          91%  80%  89%  85%  87%  85%  91%  79%  69%  80%  87%  81%  76%  80%  87%  93%
HammerCloud  98%  99% 100%  89% 100%  99%  98%  99%  99% 100% 100%  99%  99% 100%  99% 100%
FTS          50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (10)

CMS tickets (3)
CMS #1000219 (id:1000219) corrupt / binary CRL file for IHEP CA
State: in progress  |  Priority: less urgent  |  Opened: 2025-07-29 20:07 (248d ago)  |  Updated: 2025-08-12 15:43
Conversation (7 messages)
Dear Beijing admins,
the CRL URL for the IHEP CA points to http://cagrid.ihep.ac.cn/cacrl.crl

But this URL returns a binary or corrupted file. It contains the string
"Institute of High Energy Physics Certification Authority" though.
Can you please check with your CA team / get this corrected?
Many Thanks,
cheers, Stephan
Dear Stephan,
Our CA admin reports that 'cacrl.crl' is restored now. I have checked it with the command "openssl crl -inform DER -in cacrl.crl -noout -text" and it seems correct.
Please have a check.
Cheers,
Xuantong
Hallo Xuantong,
thanks for looking into this and your reply.
Ahh, the IHEP CRL is in DER format; I hadn't thought of this. It looks like this is a valid CRL format.
Ok, then the site/OSG needs to check the fetch procedures to add any format conversion as needed.
Thanks for clarifying things! Closing out.
cheers, Stephan
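The DER-vs-PEM confusion above is easy to detect programmatically before handing a fetched CRL to tools that expect PEM. A minimal sketch using only the Python standard library, equivalent in effect to "openssl crl -inform DER -outform PEM" (it only re-wraps bytes; no parsing or signature checking is done):

```python
# Detect whether a fetched CRL blob is PEM or DER, and re-wrap DER as PEM.
# Base64 re-encoding only; the blob is not parsed or validated here.
import base64
import textwrap

PEM_HEADER = b"-----BEGIN X509 CRL-----"

def crl_format(blob: bytes) -> str:
    """'pem' if the blob carries a PEM CRL header, otherwise assume 'der'."""
    return "pem" if blob.lstrip().startswith(PEM_HEADER) else "der"

def der_to_pem(der: bytes) -> bytes:
    """Wrap raw DER bytes in a base64 PEM envelope (64-character lines)."""
    body = "\n".join(textwrap.wrap(base64.b64encode(der).decode("ascii"), 64))
    return b"-----BEGIN X509 CRL-----\n" + body.encode("ascii") + b"\n-----END X509 CRL-----\n"
```

This mirrors the fetch-procedure format conversion mentioned above: probe the downloaded file with crl_format() and convert only when it is DER.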
Hello Xuantong,
The real cause is that the default OpenSSL configuration fetch-crl uses on Linux puts some constraints on applications. Temporarily enabling SHA1 on our local systems can make fetch-crl work, but this can't be a long-term solution. Could IHEP update its root CA to move away from SHA1, as discussed here:

https://github.com/dlgroep/fetch-crl/issues/4
Regards,
Yujun
Should be "Linux 9"
Hello Yujun,
We do indeed have a plan to upgrade the root CA from SHA1 to SHA256, but it needs a new CA PKI system, and the new system needs to be certified by the Asian Grid CA Federation.
I hope this upgrade can be finished by the end of this year, but it actually depends on the progress of our CA teams.
Please keep allowing SHA1 encryption until we finish the SHA256 upgrade. Sorry for the inconvenience.
Regards,
Xuantong
Hello Xuantong,
Thank you for your explanation. That's OK. Stephan has asked us to allow SHA1 on our CMS systems at Fermilab. We are not sure how long our computing security will be OK with this. We hope the IHEP CA will have been upgraded to SHA256 by then.

Regards,
Yujun
CMS #681496 (id:1597) Failed to send accounting data from new CE
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:48 (430d ago)  |  Updated: 2025-03-27 15:09
Conversation (33 messages)
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-06-02 06:56:59
Subject: Failed to send accounting data from new CE
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: APEL client & Accounting Repository
Issue type: Accounting
Description:
Dear administrators,

This is posted from Beijing site.

We launched a new CE (condorce02.ihep.ac.cn) and set its APEL configuration to be the same as on the old CE (condorce01.ihep.ac.cn). These two CEs are both used for WLCG in parallel. But it seems the accounting data on condorce02 are not being sent to the APEL server correctly. Now, the ssmsend log is reporting errors like [1].

Could you please help us check it from the expert side? Thanks!

Best Regards,
Xiaowei

[1] -
2023-06-02 14:03:16,808 - ssmsend - INFO - Starting sending SSM version 3.2.1.
2023-06-02 14:03:16,808 - ssmsend - INFO - Setting up SSM with protocol: AMS
2023-06-02 14:03:16,808 - ssmsend - INFO - No AMS token provided, using cert/key pair instead.
2023-06-02 14:03:16,808 - ssmsend - INFO - No server certificate supplied. Will not encrypt messages.
2023-06-02 14:03:16,808 - ssmsend - INFO - No path type defined, assuming dirq.
2023-06-02 14:03:16,839 - ssmsend - INFO - Messages will be signed using /C=CN/O=HEP/O=IHEP/OU=CC/CN=condorce02.ihep.ac.cn
2023-06-02 14:03:16,864 - urllib3.connectionpool - INFO - Starting new HTTPS connection (1): msg.argo.grnet.gr
2023-06-02 14:03:17,835 - urllib3.connectionpool - DEBUG - "GET /v1/service-types/ams/hosts/msg.argo.grnet.gr:authx509 HTTP/1.1" 404 95
2023-06-02 14:03:17,839 - ssmsend - ERROR - Unexpected exception in SSM: {'status': u'NOT FOUND', 'status_code': 404, 'error': 'While trying the [auth_x509]: Binding was not found'}
2023-06-02 14:03:17,839 - ssmsend - ERROR - Exception type:
2023-06-02 14:03:17,839 - ssmsend - INFO - SSM has shut down.

Affected ROC/NGI: NGI_CHINA
Affected Site: BEIJING-LCG2
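The failing call in the log above is a plain HTTPS GET to the AMS x509-binding endpoint, so it can be probed outside SSM. A minimal sketch (the URL path is taken verbatim from the ssmsend log; no client certificate is presented, so this only checks the path and reachability):

```python
# Probe the ARGO AMS authx509 binding endpoint that ssmsend calls, and report
# the HTTP status. In the thread above, a 404 ("Binding was not found")
# corresponded to a missing glite-APEL registration in GOCDB for the new CE.
from urllib.request import urlopen
from urllib.error import HTTPError

def binding_url(broker: str, host_alias: str) -> str:
    """Build the AMS authx509 binding URL for a broker and host alias."""
    return f"https://{broker}/v1/service-types/ams/hosts/{host_alias}:authx509"

def probe(broker: str = "msg.argo.grnet.gr") -> int:
    """Return the HTTP status of the binding endpoint (no client cert sent)."""
    try:
        with urlopen(binding_url(broker, broker)) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
```

For an authenticated check, the curl invocation with --cert/--key shown later in this report exercises the same endpoint with the host certificate.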
GGUS ID: 162185
Last modifier: Adrian Coveney
Date: 2023-06-02 08:44:55

Status: in progress
Responsible Unit: APEL client & Accounting Repository
Public Diary:
previously I mentioned BEIJING-LCG2, but actually the failures are occurring at BEIJING-T1 (sorry for the confusion):
https://argo.egi.eu/egi/report-status/Critical/SITES/metrics/celhcb01.ihep.ac.cn/argo.certificate.validity-htcondorce/2025-01-06T09:32:28Z/UNKNOWN/org.opensciencegrid.htcondorce

Could you please verify the status of your CE? which condor version is installed?

Best regards,
Alessandro
Internal Diary:
Sent 2nd reminder to ticket submitter (giuseppe.larocca@egi.eu) requesting input.
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-06-02 09:35:24

Status: in progress
Responsible Unit: APEL client & Accounting Repository
Public Diary:
Thanks! I have added the glite-apel service in gocdb for the new CE.
I will keep an eye on the ssmsender until the problem is gone.
GGUS ID: 162185
Last modifier: Adrian Coveney
Date: 2023-06-02 09:40:59

Status: solved
Responsible Unit: APEL client & Accounting Repository
Solution:
Ok. I'll close this now. Feel free to reopen if there's still an issue.
GGUS ID: 162185
Last modifier: JIANG Xiaowei
Date: 2023-06-06 09:57:02

Status: in progress
Responsible Unit: APEL client & Accounting Repository
Public Diary:
Thanks for the reminder! I have set the sending interval to 'latest'. Before, there was a gap of one month: we lack the correct accounting data for March, April and May, and the gap interval was used to complete these data. I will be careful doing that. - Xiaowei
Internal Diary:
Sent 2nd reminder to ticket submitter (giuseppe.larocca@egi.eu) requesting input.
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-06-06 10:14:31

Public Diary:
Hi Adrian, is there a recommended way to complete the accounting data? I tried to resend the March and April data using the gap interval mode, but it seems the missing accounting data failed to send. Thanks! - Xiaowei
Internal Diary:
Sent 2nd reminder to ticket submitter (giuseppe.larocca@egi.eu) requesting input.
GGUS ID: 162185
Last modifier: Adrian Coveney
Date: 2023-06-06 11:59:10

Public Diary:
Ok, thanks for fixing that. Let's give it at least a day for the data to be fully processed and we'll see if there are any remaining gaps. I'll get back to you later in the week.

In future, if you want to republish this much data, please open a ticket to coordinate with us first.
Internal Diary:
Sent 2nd reminder to ticket submitter (giuseppe.larocca@egi.eu) requesting input.
GGUS ID: 162185
Last modifier: GGUS SYSTEM
Date: 2023-06-21 10:12:44

Public Diary:
There is still a huge number of records coming from this CE as well as a lot from condorce01.

Could you pause publishing from condorce02 for the moment?

What version of the APEL client is that CE using? Could you run the following on the database?

show create table JobRecords\G

Internal Diary:
Sent 2nd reminder to ticket submitter (jiangxw@ihep.ac.cn) requesting input.
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-07-21 02:02:33

Public Diary:
Hi Adrian,

The version of apel client is 'APEL client 1.9.0'.

The output of the suggested command on the database is shown in [1].

And the condor-ce-apel.service and the condor-ce-apel.timer are both disabled.

Do you still see new accounting data coming into the server?

Cheers,
Xiaowei



[1] -
MariaDB [apelclient]> show create table JobRecords\G
*************************** 1. row ***************************
Table: JobRecords
Create Table: CREATE TABLE `JobRecords` (
`UpdateTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`SiteID` int(11) NOT NULL,
`SubmitHostID` int(11) NOT NULL,
`MachineNameID` int(11) NOT NULL,
`QueueID` int(11) NOT NULL,
`LocalJobId` varchar(255) NOT NULL,
`LocalUserId` varchar(255) DEFAULT NULL,
`GlobalUserNameID` int(11) NOT NULL,
`FQAN` varchar(255) DEFAULT NULL,
`VOID` int(11) NOT NULL,
`VOGroupID` int(11) NOT NULL,
`VORoleID` int(11) NOT NULL,
`WallDuration` bigint(20) unsigned DEFAULT NULL,
`CpuDuration` bigint(20) unsigned DEFAULT NULL,
`NodeCount` int(10) unsigned NOT NULL DEFAULT '0',
`Processors` int(10) unsigned NOT NULL DEFAULT '0',
`MemoryReal` bigint(20) unsigned DEFAULT NULL,
`MemoryVirtual` bigint(20) unsigned DEFAULT NULL,
`StartTime` datetime NOT NULL,
`EndTime` datetime NOT NULL,
`EndYear` int(11) DEFAULT NULL,
`EndMonth` int(11) DEFAULT NULL,
`InfrastructureDescription` varchar(100) DEFAULT NULL,
`InfrastructureType` varchar(20) DEFAULT NULL,
`ServiceLevelType` varchar(50) NOT NULL,
`ServiceLevel` decimal(10,3) NOT NULL,
`PublisherDNID` int(11) NOT NULL,
PRIMARY KEY (`MachineNameID`,`LocalJobId`,`EndTime`),
KEY `SummaryIdx` (`SiteID`,`VOID`,`GlobalUserNameID`,`VOGroupID`,`VORoleID`,`EndYear`,`EndMonth`,`InfrastructureType`,`SubmitHostID`,`ServiceLevelType`,`ServiceLevel`,`NodeCount`,`Processors`,`EndTime`,`WallDuration`,`CpuDuration`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Internal Diary:
Added attachment sender.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117917
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-07-21 02:02:34

Public Diary:
Also, since June 19th the APEL service has been stopped. We hope to resume publishing the accounting data to the server, but I'm not sure whether we can start the APEL service now. The APEL configurations are attached; please help us check whether there is anything wrong in the configuration! Thanks! - Xiaowei
Internal Diary:
Added attachment sender.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117917
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-07-21 02:02:55

Internal Diary:
Added attachment receiver.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117918
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-07-21 02:03:19

Internal Diary:
Added attachment parser.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117919
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-07-21 02:03:39

Internal Diary:
Added attachment client.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117920
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-08-21 08:49:03

Public Diary:
Hi Adrian,

I found a wrong setting on condorce02 that was triggering publishing every hour. It has been limited to once per day.

I am not sure whether the data volume from our site is still too large. To ensure the ATLAS and CMS accounting data is complete, could I request to republish the historical accounting data since March?

Currently, on condorce02, the client is reporting the following error:

2023-08-21 11:35:08,928 - client - INFO - Starting SSM.
2023-08-21 11:35:08,928 - client - INFO - Setting up SSM with protocol: AMS
2023-08-21 11:35:08,928 - client - INFO - No AMS token provided, using cert/key pair instead.
2023-08-21 11:35:08,928 - client - INFO - No server certificate supplied. Will not encrypt messages.
2023-08-21 11:35:08,928 - client - INFO - No path type defined, assuming dirq.
2023-08-21 11:35:08,962 - client - INFO - Messages will be signed using /C=CN/O=HEP/O=IHEP/OU=CC/CN=condorce02.ihep.ac.cn
2023-08-21 11:35:08,987 - urllib3.connectionpool - INFO - Starting new HTTPS connection (1): msg.argo.grnet.gr
2023-08-21 11:35:40,268 - client - ERROR - Unexpected exception in SSM: While trying the [auth_x509]: ConnectionError(ProtocolError('Connection aborted.', BadStatusLine("''",)),)
2023-08-21 11:35:40,269 - client - ERROR - Exception type:
2023-08-21 11:35:40,270 - client - INFO - SSM has shut down.
2023-08-21 11:35:40,270 - client - INFO - ========================================
2023-08-21 11:35:40,270 - client - INFO - SSM stopped.
2023-08-21 11:35:40,271 - client - INFO - =====================
2023-08-21 11:35:40,271 - client - INFO - Client finished

I have checked the host certificate and it seems to be working fine (other services are using the same host cert).

Do you know how to clean up this error? Any suggestion or clue is appreciated!

Thanks,
Xiaowei
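The once-per-day limit described above can be enforced with a systemd drop-in for the timer unit. A minimal sketch, assuming the condor-ce-apel.timer shipped with htcondor-ce-apel uses an OnCalendar schedule (if it uses OnUnitActiveSec instead, override that setting):

```ini
# /etc/systemd/system/condor-ce-apel.timer.d/override.conf
[Timer]
# Clear any inherited schedule, then run the APEL upload once a day.
OnCalendar=
OnCalendar=daily
```

Run `systemctl daemon-reload` and restart the timer afterwards.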
Internal Diary:
Added attachment client.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117920
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-09-22 01:45:19

Status: in progress
Responsible Unit: APEL client & Accounting Repository
Public Diary:
Hi Adrian,

Thanks for your reply!

The versions of APEL and SSM installed are:
apel-lib-1.9.0-1.el7.noarch
htcondor-ce-apel-5.1.6-1.el7.noarch
apel-ssm-3.2.1-1.el7.noarch
apel-client-1.9.0-1.el7.noarch
apel-parsers-1.9.0-1.el7.noarch

Thanks,
Xiaowei

Internal Diary:
Added attachment client.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117920
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2024-01-11 08:02:09

Public Diary:
Hi Adrian,

It seems the APEL data is still failing to reach the central side. The error log is shown in [1].
Looking forward to getting your help!

Thanks,
Xiaowei

[1] -
2024-01-10 20:49:21,671 - client - INFO - Starting SSM.
2024-01-10 20:49:21,672 - client - INFO - Setting up SSM with protocol: AMS
2024-01-10 20:49:21,672 - client - INFO - No AMS token provided, using cert/key pair instead.
2024-01-10 20:49:21,672 - client - INFO - No server certificate supplied. Will not encrypt messages.
2024-01-10 20:49:21,672 - client - INFO - No path type defined, assuming dirq.
2024-01-10 20:49:21,709 - client - INFO - Messages will be signed using /C=CN/O=HEP/O=IHEP/OU=CC/CN=condorce02.ihep.ac.cn
2024-01-10 20:49:21,737 - urllib3.connectionpool - INFO - Starting new HTTPS connection (1): msg.argo.grnet.gr
2024-01-10 20:49:24,431 - ssm.ssm2 - INFO - Found 969 messages.
2024-01-10 20:49:24,433 - ssm.ssm2 - INFO - Sending message: 659e8d14/659e8d1d303afc
2024-01-10 20:49:24,468 - ssm.crypto - ERROR - unable to write 'random state'

2024-01-10 20:49:27,280 - ssm.ssm2 - INFO - Sent 659e8d14/659e8d1d303afc, Argo ID: 14583713
2024-01-10 20:49:27,282 - ssm.ssm2 - INFO - Sending message: 659e8d14/659e8d1e65780c
2024-01-10 20:49:27,323 - ssm.crypto - ERROR - unable to write 'random state'

2024-01-10 20:49:30,055 - ssm.ssm2 - INFO - Sent 659e8d14/659e8d1e65780c, Argo ID: 14583714
2024-01-10 20:49:30,057 - ssm.ssm2 - INFO - Sending message: 659e8d14/659e8d208f392c
2024-01-10 20:49:30,096 - ssm.crypto - ERROR - unable to write 'random state'

2024-01-10 20:49:32,840 - ssm.ssm2 - INFO - Sent 659e8d14/659e8d208f392c, Argo ID: 14583715
2024-01-10 20:49:32,842 - ssm.ssm2 - INFO - Sending message: 659e8d14/659e8d23ba33dc
2024-01-10 20:49:32,883 - ssm.crypto - ERROR - unable to write 'random state'


Internal Diary:
Added attachment client.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117920
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2023-09-22 08:28:29

Status: in progress
Responsible Unit: APEL client & Accounting Repository
Public Diary:
Thanks Adrian!

argo-ams version is python-argo-ams-library-0.5.4-1.el7.noarch.

Internal Diary:
Added attachment client.cfg
https://ggus.eu/index.php?mode=download&attid=ATT117920
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2024-01-15 01:51:13

Status: in progress
Responsible Unit: APEL client & Accounting Repository
Public Diary:
These lines can be ignored: "ssm.crypto - ERROR - unable to write 'random state'"

The lines like this confirm that the message at least made it into the Argo Messaging Service: "Sent 659e8d14/659e8d1d303afc, Argo ID: 14583713"

Also I can see the records successfully reaching us and being loaded.

You do seem to be sending a lot of messages. Could I see the client log and unloader config please?
Internal Diary:
Added attachment client.log
https://ggus.eu/index.php?mode=download&attid=ATT118583
GGUS ID: 162185
Last modifier: Xiaowei JIANG
Date: 2024-01-15 02:02:35

Public Diary:
Hi Adrian,

Thanks for taking care of the issue.
The client.log was uploaded and the unloader config is appended in [1].
We have accumulated a lot of accounting data on the CE host that has not been sent to the central side.
I'm not sure whether changing the interval to 'latest' is what caused so many messages to be sent.
According to the accounting monitor (http://goc-accounting.grid-support.ac.uk/rss/BEIJING-LCG2_Sync.html), it seems our APEL accounting data has not arrived at the central side for several months.

Cheers,
Xiaowei

[1] -
[unloader]
enabled = true
dir_location = /var/spool/apel/

# You may send only summaries of your data to the APEL server,
# rather than individual job records.
# This reduces the network load.
send_summaries = false

# You may send 'withheld' instead of the user's DN in the
# GlobalUserName field. This is only valid for individual
# job records.
withhold_dns = false

# Optional: send ONLY these VOs to the APEL server.
# This overrides exclude_vos.
#include_vos = atlas,cms,lhcb

# Optional: do not send these VOs to the APEL server.
# This does not take effect if include_vos is set.
#exclude_vos = atlas,cms

# Which records to send:
# latest - just send the new records to the server
# gap - send records from between the specified dates (inclusive)
# this is only for individual job records
# all - send all records to the server. Don't do this for individual
# job records without talking to the apel team!
interval = latest
#interval = gap
## only used if interval = gap
#gap_start = 2023-04-01
#gap_end = 2023-04-02

# Send CAR-format records - only for job records
send_ur = false
Internal Diary:
Added attachment client.log
https://ggus.eu/index.php?mode=download&attid=ATT118583
GGUS ID: 162185
Last modifier: GGUS SYSTEM
Date: 2024-06-12 11:13:12

Public Diary:
Thanks to another Condor site, we now have a possible solution from this ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=165483#update#15

To modify those steps for your site I believe you need to do the following:

1) Make a note of the current LastUpdated.UpdateTime value in your APEL database.
2) Update the UpdateTime to NOW().
3) Clear out the outgoing message directories of any messages.
4) Restart your publishing.
5) Look out for any message size errors in the ssmsend log.
6) Once that's working ok, ask us to check the sync records for the period from the original UpdateTime to now to see if there's any gaps.
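Concretely, steps 1-4 might look like the sketch below. The apelclient database and LastUpdated table names come from the ticket text, and /var/spool/apel matches the dir_location shown earlier in this ticket; verify both against your installation before running anything:

```shell
# 1) Note the current checkpoint value.
mysql apelclient -e "SELECT UpdateTime FROM LastUpdated;"
# 2) Move the checkpoint to now so old records are not re-unloaded.
mysql apelclient -e "UPDATE LastUpdated SET UpdateTime = NOW();"
# 3) Clear out any queued outgoing messages.
rm -rf /var/spool/apel/outgoing/*
# 4) Restart publishing and (step 5) watch the ssmsend log for size errors.
systemctl start condor-ce-apel.service
```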

Internal Diary:
Sent 1st reminder to ticket submitter (jiangxw@ihep.ac.cn) requesting input.
GGUS ID: 162185
Last modifier: Adrian Coveney
Date: 2024-06-26 13:17:10

Status: in progress
Responsible Unit: APEL client & Accounting Repository
Public Diary:
Thanks to another Condor site, we now have a possible solution from this ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=165483#update#15

To modify those steps for your site I believe you need to do the following:

1) Make a note of the current LastUpdated.UpdateTime value in your APEL database.
2) Update the UpdateTime to NOW().
3) Clear out the outgoing message directories of any messages.
4) Restart your publishing.
5) Look out for any message size errors in the ssmsend log.
6) Once that's working ok, ask us to check the sync records for the period from the original UpdateTime to now to see if there's any gaps.

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 162185
Last modifier: Adrian Coveney
Date: 2024-06-26 13:21:27

Public Diary:
Can you please have a look at update#29 and see if this helps reduce the amount of data you're sending to us.
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
Added attachment parser.cfg
Added attachment sender.cfg
Added attachment receiver.cfg
Added attachment client.cfg
Added attachment client.log
Is this ticket still relevant? -- Thank you, Noy
CMS #681487 (id:1588) Beijing Disk IAM configuration
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:46 (430d ago)  |  Updated: 2025-01-30 13:44
Conversation (3 messages)
GGUS ID: 168002
Last modifier: Rahul Chauhan
Date: 2024-09-03 08:18:17
Subject: Beijing Disk IAM configuration
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_CHINA
Issue type: File Transfer
Description:
Hello Beijing site admins

Rucio (and FTS) is unable to authenticate/write to Beijing (T2_CN_Beijing) Disk during our testing. Could you please look at your configuration for audience and user mappings? Please let me know if I can provide more information. FTS Request for the file: The token fields are available in the transfer logs.

https://fts-cms-002.cern.ch:8449/var/log/fts3/transfers/2024-09-03/eoscms.cern.chcceos.ihep.ac.cn/2024-09-03-0758eoscms.cern.chcceos.ihep.ac.cn4396929948__31e2d0c8-69ca-11ef-a67e-fa163e9e00e6

Here is an individual davix command test for the same file.

$ export BEIJING_DEST_URL=davs://cceos.ihep.ac.cn:9000/eos/ihep/cms/store/test/rucio/store/mc
$ echo $BEIJING_MODIFY_TOKEN | jq -R 'split(".") | .[1] | @base64d | fromjson' | grep -E 'scope|aud|client_id'
"aud": "cceos.ihep.ac.cn",
"scope": "offline_access storage.modify:/store/test/rucio/store/mc/RunIISummer20UL17MiniAODv2/DoubleMuonGun_Pt3To150/MINIAODSIM/NoPU_106X_mc2017_realistic_v9-v2/ storage.read:/store/test/rucio/",
"client_id": "ab9316f0-e0a0-4a8c-bd72-dbb17087b1da"
$ davix-http -X MKCOL -H "Authorization: Bearer ${BEIJING_MODIFY_TOKEN}" --trace header ${BEIJING_DEST_URL} --insecure
> MKCOL /eos/ihep/cms/store/test/rucio/store/mc HTTP/1.1
> User-Agent: libdavix/0.8.7 neon/0.0.29
> Keep-Alive:
> Connection: Keep-Alive
> TE: trailers
> Host: cceos.ihep.ac.cn:9000
> Authorization: xxxxxxxx (redacted)
> Content-Length: 0
>

Cheers Rahul
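The jq pipeline above can be reproduced without jq; a minimal Python sketch (the inline token here is a dummy built for illustration, a real one would come from e.g. the BEIJING_MODIFY_TOKEN environment variable):

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (second dot-separated segment) of a JWT, no verification."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Dummy unsigned token for illustration only.
header = base64.urlsafe_b64encode(b'{"alg":"none"}').rstrip(b"=").decode()
payload = base64.urlsafe_b64encode(
    b'{"aud":"cceos.ihep.ac.cn","scope":"storage.read:/store/test/rucio/"}'
).rstrip(b"=").decode()
claims = decode_jwt_payload(header + "." + payload + ".")
print(claims["aud"])  # the audience must match the SE hostname the storage expects
```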
GGUS ID: 168002
Last modifier: Rahul Chauhan
Date: 2024-09-04 07:16:50

Public Diary:
Priority has been changed from very urgent to less urgent.
Hello

This https://twiki.cern.ch/twiki/bin/viewauth/CMS/IAMTokens page shows the IAM webdav token status for the site as OK, which is why I assumed T2_CN_Beijing was ready. You can refer to the documentation linked on that page. For EOS, for example, see https://eos-docs.web.cern.ch/configuration/http_tpc.html

If you need further clarifications for configurations, I would say Stephan would be better able to help you with that once Fermilab is re-opened.

The acceptable audience in the token should match the hostname in siteconf.

The rucio user should be mapped correctly to the existing user on the storage.



I am lowering the priority for this since it requires more than a config change and is not needed immediately.



Cheers
Rahul



Internal Diary:
Added attachment EGI VO WeNMR SLA report 2024-01 - 2024-06.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119395
GGUS ID: 168002
Last modifier: ZHANG Xuantong
Date: 2024-09-04 01:57:02

Public Diary:
Hello Rahul,

We still don't support IAM tokens on the Beijing site SE, as it was not mandatory when we migrated from DPM to EOS in May of last year. Is IAM token support mandatory now? If so, could you please point us to any documentation or configuration examples so that we can support IAM tokens soon?

Cheers,
Xuantong Zhang
Tel: +86 18910379101
Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences
Internal Diary:
Added attachment EGI VO WeNMR SLA report 2024-01 - 2024-06.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119395
WLCG tickets (7)
WLCG #1002115 (id:1002115) Upgrade your HTCondorCE endpoints to 24.0.x series (BEIJING-LCG2)
State: in progress  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-20 09:09
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and the endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint that is out of support, or because you provide HTCondorCE endpoint(s) whose version we could not determine by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published into the BDII (which will make it easier to carry on activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which are the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all changes must be applied manually, in particular the change to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
WLCG #681492 (id:1593) Request to implement BGP tagging of LHCONE prefixes. (BEIJING-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:48 (430d ago)  |  Updated: 2025-02-03 14:16
Conversation (4 messages)
GGUS ID: 168330
Last modifier: Julia Andreeva
Date: 2024-09-23 15:42:25
Subject: Request to implement BGP tagging of LHCONE prefixes. (BEIJING-LCG2)
Ticket Type: USER
CC: ;edoardo.martelli@cern.ch
Status: assigned
Responsible Unit: NGI_CHINA
Issue type: Network problem
Description:
This ticket concerns all the sites connected to LHCONE.

In agreement with the WLCG Management Board, it has been decided to
implement the tagging of the IP prefixes announced to LHCONE.
The task consists of tagging the IP prefixes that your site announces to
LHCONE with all the BGP communities that identify the experiments and
collaborations supported by your site. The initial goal is to document
the use of the network. In the longer term the tags may be used to
reduce the exposure on the LHCONE connection, by filtering unnecessary
prefixes.

You will find the values of the BGP communities to use and other
information in these pages:
- https://twiki.cern.ch/twiki/bin/view/LHCONE/MultiOneBGPcommunities
-
https://indico.cern.ch/event/1356138/contributions/6123461/attachments/2925447/5147273/WLCG-20240911-GDB-MultiONE-implementation.pdf

If you need any support on this task, please don't hesitate to ask your
NREN or LHCONE provider.
Or just reply to this ticket asking your questions; experts will guide
you in the implementation.

Please take this opportunity also to review the network information
related to your site in CRIC :
https://wlcg-cric.cern.ch/core/networkroute/list/

We ask you to perform the required action by the end of March 2025
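On a Cisco-style router the tagging is typically done with a route-map on the LHCONE eBGP session. A sketch with placeholder values only: the prefix, neighbor, ASN and community numbers below are illustrative, and the real community values for each experiment are listed on the MultiOneBGPcommunities twiki page linked above:

```
! Match the prefixes this site announces to LHCONE (placeholder prefix)
ip prefix-list LHCONE-PFX permit 192.0.2.0/24
!
route-map LHCONE-OUT permit 10
 match ip address prefix-list LHCONE-PFX
 ! Attach one community per supported experiment/collaboration (placeholders)
 set community 65000:100 65000:200 additive
!
router bgp 65000
 neighbor 198.51.100.1 route-map LHCONE-OUT out
```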
GGUS ID: 168330
Last modifier: Julia Andreeva
Date: 2025-01-28 14:35:13

Public Diary:
Any progress on this ticket?
Internal Diary:
Sent 2nd reminder to ticket submitter (giuseppe.larocca@egi.eu) requesting input.
GGUS ID: 168330
Last modifier: Tao Cui
Date: 2024-10-24 08:23:01

Public Diary:
I will finish it as soon as possible.

Tao Cui

-----Original Message-----
From: "Yan Xiaofei"
Sent: 2024-10-24 14:55:51 (Thursday)
To: CT
Subject: Fwd: GGUS-Ticket-ID: #168330 Ticket for site "BEIJING-LCG2" "none" "Request to implement BGP tagging of LHCONE prefixes. (BEIJING-LCG2)"

-------- Forwarded Message --------
| Subject: | GGUS-Ticket-ID: #168330 Ticket for site "BEIJING-LCG2" "none" "Request to implement BGP tagging of LHCONE prefixes. (BEIJING-LCG2)" |
| Resent-Date: | Mon, 23 Sep 2024 23:45:01 +0800 (CST) |
| Resent-From: | lcg-admin@ihep.ac.cn |
| Date: | Mon, 23 Sep 2024 15:44:54 +0000 (UTC) |
| From: | helpdesk@ggus.org |
| Reply-To: | helpdesk@ggus.org |
| To: | lcg-admin@ihep.ac.cn |

Dear support staff,
Ticket #168330 for site "BEIJING-LCG2" is ASSIGNED to you.
REFERENCE LINK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168330
SUBJECT: Request to implement BGP tagging of LHCONE prefixes. (BEIJING-LCG2)
TICKET INFORMATION:
NOTIFIED SITE: BEIJING-LCG2
CONCERNED VO: none
PRIORITY: urgent
ISSUE TYPE: Network problem
SUBMITTER: Julia Andreeva
*********************************************************************
This is an automated mail. When replying don't change the subject line!
S T R I P P R E V I O U S M A I L S please!!
*********************************************************************
Internal Diary:
Sent 2nd reminder to ticket submitter (giuseppe.larocca@egi.eu) requesting input.
GGUS ID: 168330
Last modifier: Julia Andreeva
Date: 2025-01-28 14:35:31

Status: in progress
Responsible Unit: NGI_CHINA
Public Diary:
Any progress on this ticket?
Internal Diary:
Sent 2nd reminder to ticket submitter (giuseppe.larocca@egi.eu) requesting input.
WLCG #681493 (id:1594) URGENT - CVMFS stale on alicevobox.alice.ihep.ac.cn
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:48 (430d ago)  |  Updated: 2025-01-30 14:42
Conversation (1 message)
GGUS ID: 169680
Last modifier: Maarten Litmaath
Date: 2025-01-27 16:42:55
Subject: URGENT - CVMFS stale on alicevobox.alice.ihep.ac.cn
Ticket Type: TEAM
CC:
Status: assigned
Responsible Unit: NGI_CHINA
Issue type: Middleware
Description:
Dear IHEP Team,
CVMFS on your ALICE VObox is very stale:
[alicesgm@alicevobox ~]$ attr -g revision /cvmfs/alice.cern.ch/
Attribute "revision" had a 5 byte value for /cvmfs/alice.cern.ch/:
20965
As I write this, that value needs to be at least 20985.
This issue may well be due to your Stratum-1 service being dysfunctional,
as there have been complaints on the Stratum-1 monitoring list:
{"msg": "last successful snapshot 48 hours ago", "repo": "alice.cern.ch"},.....
While it is not working correctly, please ensure that it _refuses_ any
connections from clients that will otherwise think all is fine...
For example, temporarily block the service in its host firewall.
That is a CVMFS client misfeature / bug which has been reported.
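The staleness check quoted above can be scripted. A minimal sketch, assuming the standard `.cvmfspublished` manifest format (where the line starting with `S` carries the revision number) and using the CERN Stratum-1 URL purely as an illustrative upstream:

```shell
# Compare the locally mounted CVMFS revision against the upstream manifest.
# Assumption: in .cvmfspublished the line starting with "S" is the revision.
repo=alice.cern.ch
local_rev=$(attr -qg revision "/cvmfs/${repo}" 2>/dev/null || echo 0)
remote_rev=$(curl -sf "http://cvmfs-stratum-one.cern.ch/cvmfs/${repo}/.cvmfspublished" \
             | awk '/^S/ { print substr($0, 2); exit }')
if [ "${local_rev:-0}" -lt "${remote_rev:-0}" ]; then
  echo "STALE: local revision ${local_rev} < upstream ${remote_rev}"
fi
```

Run on the VObox, this prints a STALE line exactly when the mounted revision lags the Stratum-1, matching the 20965 vs. 20985 comparison above.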
WLCG #681481 (id:1582) Upgrade your VOMS endpoint(s) to EL9 (BEIJING-LCG2)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:46 (430d ago)  |  Updated: 2025-01-30 13:45
Conversation (1 message)
GGUS ID: 167790
Last modifier: Alessandro Paolini
Date: 2024-08-05 11:30:11
Subject: Upgrade your VOMS endpoint(s) to EL9 (BEIJING-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_CHINA
Issue type: Middleware
Description:
Dear Site administrators,

with this ticket we are going to track the migration of your VOMS endpoint(s) to EL9.

The release notes of the new version can be found in https://italiangrid.github.io/voms/releases.html

VOMS documentation: https://italiangrid.github.io/voms/documentation.html

Clean installation guide: https://italiangrid.github.io/voms/documentation/sysadmin-guide/3.0.14/clean-installation.html

Packages available on the product team repository: https://italiangrid.github.io/voms-repo/

EL9 stable repository: https://repo.cloud.cnaf.infn.it/service/rest/repository/browse/voms-rpm-stable/redhat9/

repo file: https://italiangrid.github.io/voms-repo/repofiles/rhel/voms-stable-el9.repo

You should do a dump of each VO database on the current server and then restore them once the new server is up and running.

Please try to complete the migration before the end of the month.

Best regards,
Alessandro
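The dump/restore step described above can be sketched as follows, assuming a MySQL-backed VOMS Admin instance with the usual one-database-per-VO layout; the VO names are placeholders, and the actual dump/load invocations are left commented so the loop can be dry-run first:

```shell
# Dry-run sketch of migrating VOMS databases to the new EL9 server.
# VO names below are placeholders; replace with the VOs hosted on this endpoint.
vos="atlas cms lhcb"
for vo in ${vos}; do
  db="voms_${vo}"   # one database per VO (typical VOMS Admin layout)
  echo "would dump ${db} to ${db}.sql and restore it on the new server"
  # mysqldump --single-transaction "${db}" > "${db}.sql"   # on the old server
  # mysql "${db}" < "${db}.sql"                            # on the new EL9 server
done
```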
WLCG #681494 (id:1595) Benchmark results for AMD EPYC 9654
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:48 (430d ago)  |  Updated: 2025-01-30 13:44
Conversation (9 messages)
GGUS ID: 167681
Last modifier: alastair.dewhurst
Date: 2024-07-26 09:42:29
Subject: Benchmark results for AMD EPYC 9654
Ticket Type: USER
CC:
Responsible Unit: TPM
Issue type: Benchmarking
Description:
Hi IHEP

At RAL-LCG2 we have just finished benchmarking our new procurement of CPU. We have 36 Lenovo SR645 V3 servers with the following specification:
• 2 x AMD EPYC 9654 96C 360W 2.4GHz Processor
• 24 x 64GB TruDDR5 4800MHz (2Rx4) 10x4 RDIMM-A, giving total 1.5TB memory (8GB per core)
• 2 x 2.5" U.2 P5620 6.4TB Mixed Use NVMe PCIe 4.0 x4 HS SSD, giving total 12.8TB (68GB per core > 50GB per core)
• 1 x M.2 ER3 480GB Read Intensive SATA 6Gb NHS SSD (for OS)
• 1 x Mellanox ConnectX-6 Lx 10/25GbE SFP28 2-Port OCP Ethernet Adapter

For the benchmarking we ran the HS23 test 10 times on each server; the servers were configured with Rocky 8, inventory install, stock kernel (4.18.0-553.8.1.el8_10.x86_64). The stats are as follows:
SMT off:
HS23 mean: 5870.36 per system, 30.57 per core (192 cores per system).

SMT on:
HS23 mean: 6989.31 per system, 18.20 per core (384 cores per system)

From looking at the HEPscore table:
https://w3.hepix.org/benchmarking/scores_HS23.html
It appears you have also benchmarked these CPUs; however, you have significantly different results to us (yours are almost 1000 HS23 lower). I was wondering if we could compare notes on how we did the benchmarking, to understand where the discrepancy is coming from?

Did you use any particular settings when you benchmarked the hardware? What OS were you on? Did you have any power throttling settings or anything like that?

Thanks

Alastair
Affected Site: BEIJING-T1
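As a quick cross-check, the quoted per-core figures follow directly from the per-system scores and core counts:

```shell
# Recompute the quoted per-core HS23 scores (per-system score / core count).
percore() { awk -v t="$1" -v c="$2" 'BEGIN { printf "%.2f\n", t / c }'; }
percore 5870.36 192   # SMT off: prints 30.57
percore 6989.31 384   # SMT on:  prints 18.20
```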
GGUS ID: 167681
Last modifier: Lukas Pacher
Date: 2024-07-26 16:56:49

Status: assigned
Responsible Unit: NGI_CHINA
Public Diary:
Any progress on this ticket?
GGUS ID: 167681
Last modifier: Lukas Pacher
Date: 2024-07-26 16:57:55

Status: assigned
Responsible Unit: Benchmarking
Public Diary:
Any progress on this ticket?
GGUS ID: 167681
Last modifier: SYSTEM
Date: 2024-07-26 16:58:20

Public Diary:
CERN ticket reference: https://cern.service-now.com/nav_to.do?uri=incident.do?sysparm_query=number=INC3995611
GGUS ID: 167681
Last modifier: alastair.dewhurst
Date: 2024-07-30 09:42:11

Status: assigned
Responsible Unit: NGI_CHINA
Public Diary:
Hi

I can't see the Service Now ticket so I don't know if there is any response there but I am slightly worried this ticket has gone astray. I just want to ask the "IHEP site" (which I assume is the Beijing-T1?) how they did their benchmarking and have no other way of contacting them (other than GGUS). I have changed the notify site to include Beijing-T1.

Alastair
GGUS ID: 167681
Last modifier: Yan Xiaofei
Date: 2024-08-07 15:06:02

Public Diary:
Hi Alastair,
We do not have any particular setting. The OS is CentOS 7.9.
You have more powerful hardware than ours.
Here is the detailed hardware:
FusionServer 1258H V7
• 2 x AMD EPYC 9654 96C 360W 2.4GHz Processor
• 12 x 64GB DDR5 4800MHz (4GB per core)
• 1 x U.2 Intel SSDPF2KX038T1 3.84 TB (only one disk)
• 1 x Mellanox MCX4121A-ACAT 10/25GbE SFP28 2-Port
Regards
Xiaofei
GGUS ID: 167681
Last modifier: Yan Xiaofei
Date: 2024-08-29 09:29:02

Public Diary:
Hi,
I'll check how to change this setting. But all the servers are in
production. It will take some time to drain the servers.
Best Regards
Xiaofei
GGUS ID: 167681
Last modifier: alastair.dewhurst
Date: 2024-08-29 09:05:50

Public Diary:
Hi

I am pasting a message from Domenico:
"I was on vacation when you discussed it, and I discover it now by chance investigating
a similar discrepancy with recent results of JP-KEK-CRC-02 for AMD EPYC 9654 in
https://w3.hepix.org/benchmarking/scores_HS23.html

Their score for SMT OFF is 5996.997 similar to the one of RAL-LCG2

I notice that at IHEP the governor was set to the conservative configuration,
instead of being performance. It could be that the frequency of the processor
was lower than the max available.

@Xiaofei: would it be possible to check this configuration and run again on a few servers?"

From a general ticket point of view, I don't need this to be kept open. You provided the information I requested so thank you.

Alastair
GGUS ID: 167681
Last modifier: Yan Xiaofei
Date: 2024-10-21 02:10:02

Public Diary:
Hi,
I ran HEP-SCORE on a new machine which is the same brand as yours:
SR645 V3 servers with the following specification:
• 2 x AMD EPYC 9654 96C 360W 2.4GHz Processor
• 12 x 64GB TruDDR5 4800MHz (2Rx4) 10x4 RDIMM-A, giving total 768GB memory
• 1 x NVMe 3.84TB
• 1 x Mellanox ConnectX-6 Lx 10/25GbE SFP28 2-Port OCP Ethernet Adapter

The OS is AlmaLinux 9.4, kernel 5.14.0-427.26.1.el9_4.x86_64. The Lenovo
machine has different BIOS settings:
Maximum Efficiency enabled and SMT off: the result is 4279.1671
Maximum Performance enabled and SMT off: the result is 5735.119
I'm contacting the other brand's manufacturer to optimize the BIOS settings
and rerun HEPSCORE.
Best Regards
Xiaofei
WLCG #681482 (id:1583) Upgrade to a supported HTCondor version and enable SSL authentication (BEIJING-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:46 (430d ago)  |  Updated: 2025-01-30 13:43
Conversation (3 messages)
GGUS ID: 163969
Last modifier: Alessandro Paolini
Date: 2023-11-03 11:25:26
Subject: Upgrade to a supported HTCondor version and enable SSL authentication (BEIJING-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_CHINA
Issue type: Middleware
Description:
Dear site admins,

with this ticket we would like to follow up on the upgrade to a supported version of HTCondorCE and the migration from VOMS-based authentication with X509 certificates to AAI tokens for accessing the HTCondorCE endpoints.

The HTCondor team set-up an upgrade procedure to help sites and VOs with the migration from X509 personal certificates to tokens.
Essentially, an intermediate step was created where plain SSL authentication can be used to authenticate a client's proxy, in addition to the GSI one or the token one:
- https://confluence.egi.eu/x/EYAtDQ

In summary, the steps are:

- update to HTCondor 9.0.19
- enable the SSL authz (with priority over GSI)
- map the users' DNs
- test the SSL authz successfully
- update to latest HTCondor 10.x

You can find the HTCondor 9.0.19 version in WLCG repository for the time being, as explained in the instructions.

Please also note the usage in the last step of the HTCondor Feature channel (https://htcondor.org/htcondor/release-highlights/index.html#feature-channel), since it is the one supporting the EGI Check-in plugin from 10.4.0.
In this way the sites can accept clients' proxies and tokens at the same time, while waiting for the supported VOs to move completely to tokens.
You can find the latest HTCondor 10.x version in the HTCondor Feature Channel repository.

Please note that after the upgrade to HTCondor 10 version, you need to install and configure the EGI Check-in plugin in order to be compliant with the EGI tokens:
https://github.com/EGI-Federation/check-in-validator-plugin

Please get in contact with your supported communities to properly map the users' DNs to local accounts to ensure also the access via X509 personal certificates.

Concerning the ops VO, please map at least the following certificates:
- EGI Monitoring Service:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr"
"/DC=EU/DC=EGI/C=HR/O=Robots/O=SRCE/CN=Robot:argo-egi@cro-ngi.hr"

- EGI Security monitoring:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-secmon@grnet.gr"

Please also properly configure the Accounting settings on the HTCondor 10 installation, as explained in the instructions.

Thanks for your collaboration,
EGI Operations
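The SSL step in the list above boils down to a few configuration knobs on the CE. An illustrative fragment (option names from the HTCondor manual; the file path and the mapped local account are assumptions for this sketch):

```
# /etc/condor-ce/config.d/99-ssl-auth.conf (illustrative path)
AUTH_SSL_SERVER_CERTFILE = /etc/grid-security/hostcert.pem
AUTH_SSL_SERVER_KEYFILE  = /etc/grid-security/hostkey.pem
AUTH_SSL_SERVER_CADIR    = /etc/grid-security/certificates
# SSL listed before GSI so it takes priority, as the instructions request
SEC_DEFAULT_AUTHENTICATION_METHODS = SCITOKENS, SSL, GSI

# In the file pointed to by CERTIFICATE_MAPFILE, map the ops-monitoring DNs
# to a local account ("ops001" is a placeholder):
# SSL "/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr" ops001
```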
GGUS ID: 163969
Last modifier: Xiaowei JIANG
Date: 2023-11-07 12:37:32

Status: in progress
Responsible Unit: NGI_CHINA
Public Diary:
I am holding this ticket to keep track of the progress and avoid creating duplicate tickets.

Regards,
Jakrapop
Internal Diary:
Sent 2nd reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 163969
Last modifier: Alessandro Paolini
Date: 2025-01-23 12:31:04

Status: in progress
Responsible Unit: NGI_CHINA
Public Diary:
yes, the migration hasn't been done yet.
please upgrade to 23.0.x version:
https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv23EL9
WLCG #681484 (id:1585) Replicate CVMFS repository omnibenchmark.egi.eu
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:46 (430d ago)  |  Updated: 2025-01-30 13:39
Conversation (3 messages)
GGUS ID: 168311
Last modifier: Jose Caballero
Date: 2024-09-23 08:52:19
Subject: Replicate CVMFS repository omnibenchmark.egi.eu
Ticket Type: USER
CC:
Responsible Unit: TPM
Issue type: Operations
Description:
Hello,

please add the CVMFS repository omnibenchmark.egi.eu at RAL's Stratum-0 to your Stratum-1.

Thanks,
Jose
Affected ROC/NGI: NGI_CHINA
Affected Site: BEIJING-LCG2
GGUS ID: 168311
Last modifier: Petr Prochazka
Date: 2024-09-23 14:45:01

Status: assigned
Responsible Unit: NGI_CHINA
Public Diary:
Internal Diary:
Added attachment EGI VO WeNMR SLA report 2024-01 - 2024-06.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119395
GGUS ID: 168311
Last modifier: Yan Xiaofei
Date: 2024-10-18 02:20:02

Public Diary:
Hi
It has been done.
Cheers
Xiaofei
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM96%97%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
HammerCloud100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (6)

WLCG tickets (6)
WLCG #1001795 (id:1001795) DESY conditions data read speed low
State: assigned  |  Priority: less urgent  |  Opened: 2026-02-09 10:58 (54d ago)  |  Updated: 2026-04-01 10:10
Conversation (2 messages)
I have looked at the data rates at which jobs at DESY-HH and DESY-ZN receive conditions data, and they are not good. Conditions data come through Varnish servers. The closest one to your sites is at FZK-LCG2, and I see nodes like batchXXXX.desy.de and muxXX.zeuthen.desy.de connecting to it.
The question is which way your connection to http://v4f.hl-lhc.net:6082 goes. Can you please check the trace?
If easier and if you want the best possible performance, you can simply install your own Varnish proxy (<30min to do). Let me know if you want to do that.

On a similar topic: the CRIC settings for DESY-HH list 6 squids for conditions. Is there a reason for so many?

Best,
Ilija
Any news about this issue?
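Checking "which way the connection goes", as asked above, can be done from a worker node with standard tools (the endpoint is taken from the ticket; both commands are guarded so they are safe to paste):

```shell
# Route taken towards the Varnish server named in the ticket.
traceroute -m 20 v4f.hl-lhc.net 2>/dev/null | tail -n 5 || true
# Does the Varnish port answer at all? Prints the HTTP status code (000 on failure).
curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "http://v4f.hl-lhc.net:6082/" || true
```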
WLCG #1002189 (id:1002189) LHCb Disk resources at DESY-HH
State: assigned  |  Priority: less urgent  |  Opened: 2026-03-24 14:53 (11d ago)  |  Updated: 2026-03-27 09:04
Conversation (3 messages)
Dear colleagues,

Now that the 2026 data taking has started, we are reviewing the state of the storage
pledges of the LHCb T1 and T2 sites.

Earlier this year, we were informed that (with a few exceptions) the sites would have
the pledged disk and tape resources available to the experiment, and we hope you
can confirm that.

DESY-HH pledges 500.0 TB of disk space for 2026. However, the Storage Resource Record (SRR)
advertises only 445.2 TB [1]. Do you know what explains that discrepancy?

Thanks in advance for your efforts to make these crucial resources available
to us!

best regards, Jan van Eldik / LHCb Compute project lead

[1]
+--------+-----------+-------+-------+----------+
| Site   | Share     | Size  | Used  | Fraction |
+--------+-----------+-------+-------+----------+
| DESYHH | LHCb-Disk | 445.2 | 428.9 | 96.4%    |
+--------+-----------+-------+-------+----------+
Hi Jan,
apologies, we were, and to a degree still are, juggling the replacement hardware as well as decommissioning our old hardware. I've just allocated additional space for you; it's now:

Used: 429.59TiB
Free: 171.83TiB
Total: 601.42TiB

Apologies for the inconvenience. We will likely reduce it again a bit, but we will keep it above 500TiB.

Thanks a lot,
Christian
Hi Christian, thanks for the quick action!
Would it be possible to already publish the final (reduced) number in the SRR, to avoid a situation where we would need to remove data? We have been bitten by this on other storage elements...
Many thanks for your support!

Cheers, Jan
WLCG #681504 (id:1605) Request to implement BGP tagging of LHCONE prefixes. (DESY-HH)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:49 (430d ago)  |  Updated: 2025-02-28 15:29
Conversation (5 messages)
GGUS ID: 168340
Last modifier: Julia Andreeva
Date: 2024-09-23 15:42:48
Subject: Request to implement BGP tagging of LHCONE prefixes. (DESY-HH)
Ticket Type: USER
CC: ;edoardo.martelli@cern.ch
Status: assigned
Responsible Unit: NGI_DE
Issue type: Network problem
Description:
This ticket concerns all the sites connected to LHCONE.

In agreement with the WLCG Management Board, it has been decided to
implement the tagging of the IP prefixes announced to LHCONE.
The task consists of tagging the IP prefixes that your site announces to
LHCONE with all the BGP communities that identify the experiments and
collaborations supported by your site. The initial goal is to document
the use of the network. In the longer term the tags may be used to
reduce the exposure on the LHCONE connection, by filtering unnecessary
prefixes.

You will find the values of the BGP communities to use and other
information in these pages:
- https://twiki.cern.ch/twiki/bin/view/LHCONE/MultiOneBGPcommunities
-
https://indico.cern.ch/event/1356138/contributions/6123461/attachments/2925447/5147273/WLCG-20240911-GDB-MultiONE-implementation.pdf

If you need any support on this task, please don't hesitate to ask your
NREN or LHCONE provider.
Or just reply to this ticket asking your questions; experts will guide
you in the implementation.

Please take this opportunity also to review the network information
related to your site in CRIC :
https://wlcg-cric.cern.ch/core/networkroute/list/

We ask you to perform the required action by the end of March 2025
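What the tagging looks like in practice depends on the site's router. As an illustration only, a BIRD 2 style export filter (the prefix is an RFC 5737 example, and the 61339:x community values are the ones quoted later in this ticket; take the authoritative values from the MultiOneBGPcommunities twiki):

```
# Illustrative BIRD 2 export filter towards the LHCONE peering.
filter lhcone_export {
    if net ~ [ 192.0.2.0/24 ] then {      # placeholder for the site's LHCONE prefixes
        bgp_community.add((61339, 2));    # ATLAS
        bgp_community.add((61339, 3));    # CMS
        accept;
    }
    reject;
}
```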
GGUS ID: 168340
Last modifier: Thomas Hartmann
Date: 2024-09-24 13:20:27

Status: in progress
Responsible Unit: NGI_DE
Public Diary:
Are there any news about this?
Internal Diary:
Ticket is marked as slave of GGUS ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168756
Description of master:
SWT2_CPB: transfer failures as a destination
GGUS ID: 168340
Last modifier: Thomas Hartmann
Date: 2025-01-23 09:38:37

Public Diary:
Set remind-on-date: 2025-02-03

The BGP community IDs have been deployed on our end at the end of last year.

Can you check if the IDs are propagated properly?
Hi

All the prefixes declared here https://wlcg-cric.cern.ch/core/networkroute/detail/DESY-HH-LHCONE/
are marked with 61339:2 61339:3 61339:6 (ATLAS, CMS, BelleII).

It looks good to me.
Thanks!
Edoardo
WLCG #681362 (id:1460) ILC VO Test from CERN
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-24 09:09 (435d ago)  |  Updated: 2025-02-14 09:51
Conversation (13 messages)
Hi there Thomas!
Your request (GGUS-Ticket-ID: #681362) has been received and will be reviewed by our support staff.

To provide additional information, please reply to this email or click on the following link (for initial login, please request a new password):
https://helpdesk.ggus.eu/#ticket/zoom/1460

Your GGUS Helpdesk Team
Hi Thomas,

Why can't I select ILC as "affected VO"?

Cheers,
Andre
Hi Andre,
I am a bit lost tbh - where are you selecting ILC?
On the worker nodes and workgroup servers the VOMS would need to be pointed at the IAM.

Cheers,
Thomas
or do you mean in Zammad here?
I can confirm that I am also not seeing ILC as an available VO in the 'Affected VO' drop-down for this ticket. I can also not assign the ticket to an owner, as the drop-down list is empty for me.

I am adding GGUS in the notified groups section - maybe they can help us and clarify why we cannot assign ILC to a ticket, or vice versa.
Probably we are missing some roles?
Cheers,
Thomas
Dear Andre, dear Thomas,

I'm sorry for the late reply! And I'm sorry that we didn't include ilc in the new configuration, for a reason which I can't understand now. Anyway, I added the ilc VO.

Please let us know if you need any workflow related to the ILC VO. E.g. if ILC is chosen in the field "Affected VO", what should happen? E.g. the notification should be sent to lists a@ b@ c@.

In our old list we see that only Andre has the ilc VO role.

Be aware that there is also a Second Level->VO Support->ilc support unit with mailing list ilc-vo-support@desy.de, which you can search for in the Group field:

and assign the ticket to it. That is independent of "Affected VO". So the ticket can be assigned to Second Level->VO Support->ILC; in this case the notification mail will be sent to ilc-vo-support@desy.de.

If you mark "Affected VO" ilc, for now nothing will happen, but you could propose any mailing list which should be notified, or any other workflow, if that happens.

@Thomas: NGI_DE is a group which has no owners, so in general, if the list of owners is empty, the group members are notified via the mailing list attached to the group. If you expect something else, let me know. Do we still want to have a meeting? Sorry if I missed your answer on that.
Hi Pavel,

many thanks for the help :)

@Andre
do we want to try another test ticket run, i.e., you/ILC assigning us as site a ticket and vice versa?

@Pavel

regarding NGI_DE: it unfortunately got buried under a lot of other stuff as well - I will ping you next week
I have just created a ticket with ILC as affected VO

https://helpdesk.ggus.eu/#ticket/zoom/2316

@Andre
can you check, if you have received a notification mail for #2316?

Cheers,
Thomas
Hi Thomas,

I created https://helpdesk.ggus.eu/#ticket/zoom/2317

I did not get an email for 2316

Cheers,
Andre
I got a notification for https://helpdesk.ggus.eu/#ticket/zoom/2317 - so the route ILC-->Site looks good

but https://helpdesk.ggus.eu/#ticket/zoom/2316 for Site --> ILC seems still to be broken. Or maybe the role assignment is not complete and Andre and maybe other ILC admins have not yet received the ILC hat in Zammad?
Hi Thomas,

Could you explain a few things?

1. What do you mean by ILC hat?

2. What do you mean by "but https://helpdesk.ggus.eu/#ticket/zoom/2316 for Site --> ILC seems still to be broken."?

Just to clarify, we didn't implement any underlying link or "hat" by default. So please explain what should happen when "affected VO" is chosen in the ticket?

I assume the ILC people should get notified, right? If it was implemented in GGUS, let me know how. We can add Andre or any mailing list to be notified in case the Affected VO is ILC. Let me know.
Hi Pavel,

ah, sorry for the confusion.

I have been suspecting, that Andre is missing some role assignments for the ILC VO (aka a ILC hat ;) ) as he was not receiving notifications with respect to the ticket I had opened for ILC.

If I get you right, the ILC VO contact information needs to be updated - i.e., you will need from Andre the contact mailing list or address for the ILC VO, right?

Cheers,
Thomas

Since I am a CMS mini-admin, I can see what roles users have; however, I am not sure if I see all of them. In Thomas's case I see only the role 'common'. To receive notifications for a VO, another role is required. In my case I have (among other roles) 'cms' and I get to see CMS tickets. Who is the mini-admin for ILC? The mini-admin can/should equip the relevant folks with sufficient roles for the VO in question.
Cheers, Christoph
WLCG #682189 (id:2316) Test Ticket from Site DESY-HH to VO ILC
State: in progress  |  Priority: less urgent  |  Opened: 2025-02-13 09:50 (415d ago)  |  Updated: 2025-02-13 09:50
Conversation (1 message)
Hi,

just a test ticket as follow up to https://helpdesk.ggus.eu/#ticket/zoom/1460/8185

Cheers,
Thomas
WLCG #681506 (id:1607) Upgrade your VOMS endpoint(s) to EL9 (DESY-HH)
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 09:49 (430d ago)  |  Updated: 2025-01-30 13:51
Conversation (7 messages)
GGUS ID: 167791
Last modifier: Alessandro Paolini
Date: 2024-08-05 11:30:13
Subject: Upgrade your VOMS endpoint(s) to EL9 (DESY-HH)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_DE
Issue type: Middleware
Description:
Dear Site administrators,

with this ticket we are going to track the migration of your VOMS endpoint(s) to EL9.

The release notes of the new version can be found in https://italiangrid.github.io/voms/releases.html

VOMS documentation: https://italiangrid.github.io/voms/documentation.html

Clean installation guide: https://italiangrid.github.io/voms/documentation/sysadmin-guide/3.0.14/clean-installation.html

Packages available on the product team repository: https://italiangrid.github.io/voms-repo/

EL9 stable repository: https://repo.cloud.cnaf.infn.it/service/rest/repository/browse/voms-rpm-stable/redhat9/

repo file: https://italiangrid.github.io/voms-repo/repofiles/rhel/voms-stable-el9.repo

You should do a dump of each VO database on the current server and then restore them once the new server is up and running.

Please try to complete the migration before the end of the month.

Best regards,
Alessandro
GGUS ID: 167791
Last modifier: Thomas Hartmann
Date: 2024-08-19 09:00:10

Status: on hold
Responsible Unit: NGI_DE
Public Diary:
setting to on hold.

Due to limited manpower and vacation season an ad hoc deployment is not feasible within this month.

Whether to continue the service is still an open question.
GGUS ID: 167791
Last modifier: Alessandro Paolini
Date: 2024-10-29 15:28:06

Public Diary:
Hi Thomas,

in the past weeks we notified the VOMS team about this issue and they are going to fix it soon.

Cheers,
Alessandro
GGUS ID: 167791
Last modifier: Thomas Hartmann
Date: 2024-10-29 14:50:40

Public Diary:
there seem to be dependencies on python 2.7 packages, which cannot be resolved since python 3.* are the only supported releases
GGUS ID: 167791
Last modifier: Thomas Hartmann
Date: 2024-12-02 14:15:47

Public Diary:
what is the status here?
GGUS ID: 167791
Last modifier: Alessandro Paolini
Date: 2024-12-02 16:09:22

Public Diary:
we are waiting for the official release of a newer voms-admin version.
GGUS ID: 167791
Last modifier: Thomas Hartmann
Date: 2025-01-23 09:38:16

Public Diary:
Set remind-on-date: 2025-02-03

-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM87%80%100%100%100%100%100%93%99%100%100%100%98%100%100%100%
HammerCloud97%97%96%100%98%100%98%99%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (2)

WLCG tickets (2)
WLCG #681509 (id:1610) Request to deploy IPv6 on CEs and WNs at WLCG sites (RWTH-Aachen)
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 09:49 (430d ago)  |  Updated: 2026-04-01 13:44
Conversation (15 messages)
GGUS ID: 164397
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:38:51
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (RWTH-Aachen)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_DE
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
- your estimate of the timescale for the deployment;
- a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what the obstacles are and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
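A quick way to verify that a CE or WN is actually dual-stack, sketched with standard iproute2/glibc tools (the end-to-end check is commented out since it needs outbound connectivity):

```shell
# A global-scope IPv6 address should be configured on the node:
ip -6 addr show scope global || true
# Forward DNS for the node should publish an AAAA record:
getent ahostsv6 "$(hostname -f 2>/dev/null || hostname)" || true
# Optional end-to-end check towards a dual-stack host:
# ping -6 -c 3 ipv6.cern.ch
```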
GGUS ID: 164397
Last modifier: Andrea Sciaba
Date: 2023-11-29 06:43:11

Public Diary:
Hi Andrea,
Unfortunately slow progress in the last two months: we still need to have the topology structure validated in terms of security. I hope to write back with some news at the end of January.
Cheers,
Raffaele
GGUS ID: 164397
Last modifier: Dr. Andreas Nowack
Date: 2024-01-11 15:23:48

Status: in progress
Responsible Unit: NGI_DE
Public Diary:
Hi Andrea, hi Alexander,

all our regular grid nodes run a dual-stack configuration since September 2017.

I assume that the opportunistic resources at Claix also support dual stack.
@Alexander: Could you please verify this?

Best regards,
Andreas
GGUS ID: 164397
Last modifier: Andrea Sciaba
Date: 2024-01-12 15:22:06

Status: solved
Responsible Unit: NGI_DE
Solution:
Waiting for your verification, after which I'll close the ticket if possible.
Cheers,
Andrea
GGUS ID: 164397
Last modifier: Andrea Sciaba
Date: 2024-01-12 15:23:26

Status: in progress
Responsible Unit: NGI_DE
Public Diary:
Sorry, I closed the ticket by mistake.
GGUS ID: 164397
Last modifier: Thomas Kress
Date: 2024-01-22 15:21:21

Public Diary:
Involve person(s) has been changed to alexander.jung@rwth-aachen.de; alexander.schmidt@physik.rwth-aachen.de;terboven@itc.rwth-aachen.de; manuel.giffels@kit.edu.
Hello Andrea,
on our dedicated CMS Tier-2 T2_DE_RWTH we have had dual-stack IPv4+IPv6 configured on all WNs for a long time. Our HTCondor CEs are remotely operated at our German Tier-1 at KIT, and also have dual stack.

However, we have now transparently integrated nodes from the local Aachen HPC cluster (CLAIX), and prospectively, during the next years, the HPC center will take over the WLCG pledges. So for the next few years we will run on both sub-clusters. CLAIX is working on an IPv6 setup for their worker nodes, but this is likely not going to be in place this year (2024). At the moment we have only O(100) job slots running there, which is irrelevant for the resources but important for us, to monitor whether CMS jobs are always running there and to be able to adapt the settings if necessary.

Any idea how to deal with this situation ?

Greetings, Thomas.

GGUS ID: 164397
Last modifier: Andrea Sciaba
Date: 2024-01-29 09:21:29

Public Diary:
Hi Thomas,
very good question. These are my thoughts:
- in general the ticket should be closed only when there is a 100% probability for a job to land on a WN with IPv6.
- if we close the ticket now, we risk "forgetting" about the special situation at Aachen (not you of course, but myself and CMS/WLCG :-) ).
I propose to keep the ticket open (also to be fair to other sites which could be, or were, in a similar situation), but to put a note in the deployment summary table (linked in the ticket description) clarifying that you are practically done but waiting for the HPC resources.
I hope it's a reasonable compromise for you.
Cheers,
Andrea
GGUS ID: 164397
Last modifier: Andrea Sciaba
Date: 2024-06-06 08:07:27

Public Diary:
Hi Thomas,
thanks for the explanation. Funny to learn that Nvidia doesn't consider IPv6 important!
Cheers,
Andrea
GGUS ID: 164397
Last modifier: Thomas Kress
Date: 2024-05-23 07:28:06

Public Diary:
Hi Andrea,

update:

the problem is that the new Aachen HPC cluster (CLAIX) is using Nvidia Skyway InfiniBand gateways. At least directly, this gateway hardware does not support IPv6. Our colleagues have discussed this with Nvidia recently, and it seems that Nvidia has realized that IPv6 is necessary. No idea how long it will take before this makes it into their software.

Since these Nvidia gateways are very popular at HPC sites, I expect that more sites could face this problem in the future. Workarounds might be possible, but those are even harder to get deployed.

Cheers, Thomas.
GGUS ID: 164397
Last modifier: Dr. Andreas Nowack
Date: 2025-01-23 14:57:43

Status: on hold
Responsible Unit: NGI_DE
Public Diary:
Hi Guenter,

the situation has not yet changed.
In this sense, the request is still open for the opportunistic resources.

Best regards,
Andreas
Hi,
was there any new development on this topic?
Cheers,
Andrea
Hi Andrea, the newest RWTH CLAIX HPC WN hardware now has InfiniBand/Ethernet gateways that should make it possible to use IPv6, and we could restrict our jobs to run only on that node partition. However, the deployment is not far advanced; I expect it will take another few months, and we will try to push.

Cheers, Thomas.
Hi Thomas,
this is really good news, thanks. Given that a few other sites have the same problem with the Nvidia InfiniBand gateways, would it be possible to know the exact model, to see if this can help them? Also by email, if you don't want to write it in the ticket.
Cheers,
Andrea
Hi Andrea, our HPC center says there is basically only one model: https://www.nvidia.com/en-in/networking/infiniband/skyway/

We thought that for the new hardware, which no longer uses these NVIDIA Skyway gateways, they would be able to provide IPv6 for the new worker nodes. But it seems that even with the new hardware from the company Cornelis they do not manage to set up IPv6; the company claims that stable operation with IPv6 is still not possible.
Greetings, Thomas.
WLCG #681516 (id:1617) Enable site network monitoring (RWTH-Aachen)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:51 (430d ago)  |  Updated: 2025-01-30 13:56
Conversation (7 messages)
GGUS ID: 162970
Last modifier: Julia Andreeva
Date: 2023-08-02 09:03:24
Subject: Enable site network monitoring (RWTH-Aachen)
Ticket Type: USER
CC: ;smckee@umich.edu
Status: assigned
Responsible Unit: NGI_DE
Issue type: Monitoring
Description:
WLCG Sites / Site Administrators / Networking Support: as presented at the WLCG Ops Coordination meeting on April 6, 2023, our WLCG Monitoring Task Force is initiating a campaign to enable site network monitoring and to gather associated network information in preparation for Data Challenge 2024 (DC24).
Our primary targets are the Tier-1s and larger Tier-2s, but we would like to see as many sites participating as possible.
You can find detailed instructions in Gitlab at CERN: https://gitlab.cern.ch/wlcg-doma/site-network-information
In case you do not have access to the project Gitlab at CERN, 3 PDF files are attached to the twiki page:
https://twiki.cern.ch/twiki/bin/view/LCG/SiteNetworkMonitoring
They capture the information from the project Gitlab at CERN: an overview of the project, an example site network information template to be filled out, and details on how to provide your site's network metrics.
We would like sites to complete this by the end of September 2023, to give us time to verify the data and provide any fixes well in advance of DC24.
Interacting with Gitlab requires a CERN account. If you have issues adding your site files to Gitlab or WLCG-CRIC, please contact us.
Best regards,
The WLCG Monitoring Task Force
GGUS ID: 162970
Last modifier: Dr. Andreas Nowack
Date: 2023-08-09 15:03:37

Status: in progress
Responsible Unit: NGI_DE
Public Diary:
Priority has been changed from urgent to less urgent.

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 162970
Last modifier: Dr. Andreas Nowack
Date: 2023-09-26 10:16:09

Public Diary:
Set remind-on-date: 2023-10-09
Current status:
- The transmission of throughput values works since last Friday.
- The description of the network structure is not yet ready and will be published in a few weeks.
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 162970
Last modifier: Julia Andreeva
Date: 2024-01-15 14:47:18

Public Diary:
Any progress on this ticket?
Thank you
Julia
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 162970
Last modifier: GGUS SYSTEM
Date: 2024-11-11 15:00:27

Public Diary:
Dear Julia,
Is this ticket still relevant or can it be closed?
Thanks.

GGUS ticket monitoring
Internal Diary:
Sent 2nd reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 162970
Last modifier: GGUS SYSTEM
Date: 2024-11-04 15:00:06

Public Diary:
Dear Julia,
Is this ticket still relevant or can it be closed?
Thanks.

GGUS ticket monitoring
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 162970
Last modifier: GGUS SYSTEM
Date: 2024-11-18 15:01:47

Public Diary:
Dear Julia,
Is this ticket still relevant or can it be closed?
Thanks.

GGUS ticket monitoring
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 95% 100% 99% 100% 100% 100%
HammerCloud: 99% 99% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (4)

CMS tickets (2)
CMS #682293 (id:2420) Please check AAA Config at T2_ES_CIEMAT
State: in progress  |  Priority: very urgent  |  Opened: 2025-02-21 10:42 (407d ago)  |  Updated: 2025-10-30 11:42
Conversation (12 messages)
Hi CIEMAT colleagues, I was checking AAA failures with CRAB jobs and found a failing job that seems to be due to a possible AAA issue at CIEMAT.
For example, please check this CRAB job output: https://cmsweb.cern.ch:8443/scheddmon/0194/nbinnorj/250220_230122:nbinnorj_crab_WPMJJWPMJJjj_EWK_LO_4f_mmjj150_mg_py8_RunIII2024Summer2DR_AODSIM_v3p0/job_out.1034.2.txt
The job was running at CIEMAT but fails because of
== CMSSW: ----- Begin Fatal Exception 20-Feb-2025 23:35:22 UTC-----------------------
== CMSSW: An exception of category 'FallbackFileOpenError' occurred while
== CMSSW: [0] Constructing the EventProcessor
== CMSSW: [1] Constructing input source of type PoolSource
== CMSSW: [2] Calling RootInputFileSequence::initTheFile()
== CMSSW: [3] Calling StorageFactory::open()
== CMSSW: [4] Calling XrdFile::open()
== CMSSW: Exception Message:
== CMSSW: Failed to open the file 'root://xcachecms.pic.es//store/user/nbinnorj/VBSSignalProdv3p0/CRABOUTPUT/WPMJJWPMJJjj_EWK_LO_4f_mmjj150_madgraph_pythia8/RunIII2024Summer2DR_RAWSIM_v3p0/250218_185157/0004/GENSIMRAW_4133.root'
== CMSSW: Additional Info:
== CMSSW: [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://gaexrdoor.ciemat.es:2094//store/user/nbinnorj/VBSSignalProdv3p0/CRABOUTPUT/WPMJJWPMJJjj_EWK_LO_4f_mmjj150_madgraph_pythia8/RunIII2024Summer2DR_RAWSIM_v3p0/250218_185157/0004/GENSIMRAW_4133.root
== CMSSW: [b] Input file root://xcachecms.pic.es:1094//store/user/nbinnorj/VBSSignalProdv3p0/CRABOUTPUT/WPMJJWPMJJjj_EWK_LO_4f_mmjj150_madgraph_pythia8/RunIII2024Summer2DR_RAWSIM_v3p0/250218_185157/0004/GENSIMRAW_4133.root could not be opened.
== CMSSW: [c] XrdCl::File::Open(name='root://xcachecms.pic.es//store/user/nbinnorj/VBSSignalProdv3p0/CRABOUTPUT/WPMJJWPMJJjj_EWK_LO_4f_mmjj150_madgraph_pythia8/RunIII2024Summer2DR_RAWSIM_v3p0/250218_185157/0004/GENSIMRAW_4133.root', flags=0x10, permissions=0660) => error '[FATAL] Socket timeout' (errno=0, code=103)
== CMSSW: [d] Remote server already encountered a fatal error; no redirections were performed.
== CMSSW: ----- End Fatal Exception -------------------------------------------------
The file is in fact at T2_FI_HIP:
xrdcp -d 1 -f root://hip-cms1.csc.fi:1094//store/user/nbinnorj/VBSSignalProdv3p0/CRABOUTPUT/WPMJJWPMJJjj_EWK_LO_4f_mmjj150_madgraph_pythia8/RunIII2024Summer2DR_RAWSIM_v3p0/250218_185157/0004/GENSIMRAW_4133.root /dev/null

As you can see, it never tries to reach files available at T2_FI_HIP through AAA.
Can you check?

Kind regards,

Bockjoo Kim

************************************************************************************
This is an automated mail. When replying don't change the subject line!

************************************************************************************
Ticket Link: https://helpdesk.ggus.eu/#ticket/zoom/2420
I forgot to ask: according to
/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/JobConfig/site-local-config.xml and
/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/PhEDEx/storage.xml, jobs are supposed to fall back to

<lfn-to-pfn protocol="xrdspain"
path-match="/+store/(.*)" result="root://xrootd-es.pic.es:1096//store/$1"/>

Was the job above running at a subsite at "xcache", "xcachepic", or "acme"?

Perhaps, do you need to add
<catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrdspain"/>

to all these files:
/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/xcache/JobConfig/site-local-config.xml
/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/xcachepic/JobConfig/site-local-config.xml
/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/acme/JobConfig/site-local-config.xml
?
Thanks,
Bockjoo
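The lfn-to-pfn rule quoted above is a regular-expression rewrite. As an illustration only (the rule text is taken from this ticket; the helper function is hypothetical, not part of any site tooling), applying it looks like this:

```python
import re

# Fallback rule quoted in the ticket (storage.xml, protocol "xrdspain"):
#   path-match="/+store/(.*)"
#   result="root://xrootd-es.pic.es:1096//store/$1"
PATH_MATCH = re.compile(r"/+store/(.*)")
RESULT = "root://xrootd-es.pic.es:1096//store/$1"

def lfn_to_pfn(lfn):
    """Hypothetical helper: apply the rule, returning None when it does not match."""
    m = PATH_MATCH.fullmatch(lfn)
    if m is None:
        return None
    # $1 is substituted with the first captured group, as in the TFC convention
    return RESULT.replace("$1", m.group(1))

print(lfn_to_pfn("/store/user/nbinnorj/example.root"))
# root://xrootd-es.pic.es:1096//store/user/nbinnorj/example.root
```

A job only benefits from this rule if the site-local-config.xml of the subsite it runs on actually references the `xrdspain` catalog, which is exactly what the message above asks about.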
Hi Bockjoo

In this case we are using a fallback to PIC xcache instance.

We contacted PIC and the service was not running properly. They restarted the xcache service and now files are downloading successfully.

Cheers,

Javier
Thanks Javier! I will try to close this one next week.
Hi Javier, it happened again on the worker node gaew0231.ciemat.es: https://cmsweb.cern.ch:8443/scheddmon/0122/flombard/250222_213029:flombard_crab_LbyL_UPC_5p36TeV_UPCgen_pythia8_Spring23MiniAOD_NoPU_HiForest_time/job_out.1094.0.txt
Perhaps, it would help for jobs running at your worker nodes to add the 3rd fallback

<lfn-to-pfn protocol="xrdspain"
path-match="/+store/(.*)" result="root://xrootd-es.pic.es:1096//store/$1"/>

to all these files:

/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/xcache/JobConfig/site-local-config.xml
/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/xcachepic/JobConfig/site-local-config.xml
/cvmfs/cms.cern.ch/SITECONF/T2_ES_CIEMAT/acme/JobConfig/site-local-config.xml

or make the xcache service more resilient?
Thanks,
Bockjoo
Hi, Bockjoo

I have asked people at PIC to look for the origin of these problems.

Certainly, if we are going to make heavy use of this type of service, it must be resilient enough.

In our cluster not all the workers are using xcache (we were testing that service and collecting data to compare scenarios with/without xcache), so we can switch off this option without problems.

We are discussing this issue with PIC people and will take a decision as soon as possible.

Cheers, Javier
Hi Javier, thanks for the info.
Do you have any update?
Thanks,
Bockjoo
Hi Bockjoo

People at PIC have made several changes to the configuration of the xcache service (all related to CAs, VOMS, and the certificates used to access remote storage or redirectors).
They have also implemented a monitor to detect future failures and restart the service as soon as it fails.

Hope this is enough to stabilize the service.

Cheers, Javier
Thanks Javier! Indeed, I no longer see xcache-related XRootD open errors today.
I will close this one.

Bockjoo
Hi Javier, I am seeing the xcache error again when jobs run at CIEMAT or PIC:
inputroot=root://xcachecms.pic.es//store/data/Run2024C/ParkingDoubleMuonLowMass1/RAW/v1/000/380/001/00000/d0caacf2-4189-4a01-9d4d-2d2421f1fc8f.root

GLIDEIN_CMSSite="T2_ES_CIEMAT"
GLIDEIN_SiteWMS_Slot="slot1_3@gaew0226.ciemat.es"
CMS_CampaignType="UNKNOWN"
TaskType="DataProcessing"
WMAgent_TaskType="DataProcessing"
WMAgent_RequestName="cmsunified_ACDC0_Run2024C_ParkingDoubleMuonLowMass1_RawSkim_250305_071355_5408"
Args="cmsunified_ACDC0_Run2024C_ParkingDoubleMuonLowMass1_RawSkim_250305_071355_5408-Sandbox.tar.bz2 3071749 0"
x509userproxysubject="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues"
LastRemoteHost="slot1_1@glidein_1922519_379541618@gaew0226.ciemat.es"
CMSSW_Versions="CMSSW_14_0_19_patch2"
Chirp_WMCore_cmsRun_Exception_Message=Chirp_WMCore_cmsRun_Exception_Message": "b"An exception of category "FileReadError" occurred while\n [0] Constructing the EventProcessor\n [1] Constructing input source of type PoolSource\n [2] Reading branch EventAuxiliary\n [3] Calling XrdFile::readv()\n [4] XrdAdaptor::ClientRequest::HandleResponse() failure while running connection recovery\n [5] In XrdAdaptor::RequestManager::requestFailure()\nException Message:\nXrdAdaptor::RequestManager::requestFailure Open(name="root://xcachecms.pic.es//store/data/Run2024C/ParkingDoubleMuonLowMass1/RAW/v1/000/380/001/00000/d0caacf2-4189-4a01-9d4d-2d2421f1fc8f.root", flags=0x10, permissions=0660, old source=xcachecms.pic.es:1094 (site pic), new source=xcachecms.pic.es:1094 (site pic)) => Xrootd server returned an excluded source\n Additional Info:\n [a] Original error: "[ERROR] Operation expired" (errno=0, code=206, source=xcachecms.pic.es:1094 (site pic)).\n [b] Disabled source: xcachecms.pic.es:1094\n [c] Active source: xcachecms.pic.es:1094 (site pic)

Just FYI.
Thanks,
Bockjoo
Hello Bockjoo and Javier. Is this ticket still valid?
Thank you,
Noy
Hello Noy

As far as CIEMAT is concerned, we have not detected any other errors affecting our fallback to PIC.
So I think this ticket could be closed.
Cheers,
Javier
CMS #682712 (id:2845) Request for XRootD Upgrade and Network Packet Labeling Configuration (CIEMAT-LCG2)
State: in progress  |  Priority: urgent  |  Opened: 2025-03-18 14:46 (382d ago)  |  Updated: 2025-08-20 12:10
Conversation (6 messages)
Dear Site Administrators,

CMS will resume data taking next month and expand the use of tokens. Significant improvements and bug fixes for tokens have been made in XRootD. We kindly request all CMS sites using native XRootD to upgrade their XRootD services to the latest version (5.7.3), including site redirectors and storage services.

We would also like to take this opportunity to encourage sites to enable network packet labelling by adding the following four configuration lines to both XRootD servers and redirectors:

xrootd.pmark ffdest eu.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path-to-store>/store cms

If your site supports multiple VOs on the same XRootD service or requires additional details, please refer to the SciTag Network Packet Labeling Twiki page: https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesXrootdScitagPacketLabeling.

Our target date for completing the upgrade and configuration update is April 5th, in preparation for the LHC commissioning for 2025.

Thank you for your cooperation. Please let us know if you have any questions or concerns.

Best regards,

Jakrapop and Noy

CMS Site Support
Ticket Link: https://helpdesk.ggus.eu/#ticket/zoom/2842
Hello CIEMAT admins, I see your XRootD version is 5.7.3 [1]. Could you please put the new Scitag configuration in place?
Thanks,
Noy
[1]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-3version&var-dst_hostname=gaexroot01.ciemat.es&var-timestamp=1748018317000
Hi, Noy

Indeed, we will upgrade our server to deploy the new Scitag configuration.

We will do it as soon as possible.

Cheers, Javier
Good afternoon, Javier. Could you please provide update about Scitag configuration on your site?
Thank you,
Noy
Hi, Noy

Since we have XRootD as redirectors but we are using dCache-native xrootd doors to serve data, I fear that the scitag configuration in our XRootD redirectors won't work as expected (if I have understood the documentation correctly).

To activate dCache network markers and Scitag support, we must upgrade our current dCache to at least version 10.2.

That is a major upgrade we have planned before the end of the year, because our current dCache version will stop being supported at that date.

Anyway, I can reconfigure our XRootD redirectors to include the CMS configuration and restart the service tomorrow morning.

Best regards,

Javier
Hello Javier, You're right, thank you so much.
Cheers,
Noy
WLCG tickets (2)
WLCG #1002119 (id:1002119) Upgrade your HTCondorCE endpoints to 24.0.x series (CIEMAT-LCG2)
State: in progress  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-20 09:07
Conversation (3 messages)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which is the minimum version that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read carefully the documentation before the upgrade: all the changes with the upgrade must be applied manually, in particular the changes to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
Dear EGI Operations Team,

We will update our version as soon as possible, although we do not yet have an exact date for the upgrade.
Thanks a lot for the upgrade guide.

Best regards,

Javier
Hi Javier,
thanks for the reply. Please keep us posted.

Best regards,
Alessandro
WLCG #681564 (id:1665) Request to implement BGP tagging of LHCONE prefixes. (CIEMAT-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:55 (430d ago)  |  Updated: 2025-03-27 13:35
Conversation (5 messages)
GGUS ID: 168338
Last modifier: Julia Andreeva
Date: 2024-09-23 15:42:44
Subject: Request to implement BGP tagging of LHCONE prefixes. (CIEMAT-LCG2)
Ticket Type: USER
CC: ;edoardo.martelli@cern.ch
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Network problem
Description:
This ticket concerns all the sites connected to LHCONE.

In agreement with the WLCG Management Board, it has been decided to
implement the tagging of the IP prefixes announced to LHCONE.
The task consists of tagging the IP prefixes that your site announces to
LHCONE with all the BGP communities that identify the experiments and
collaborations supported by your site. The initial goal is to document
the use of the network. In the longer term the tags may be used to
reduce the exposure on the LHCONE connection, by filtering unnecessary
prefixes.

You will find the values of the BGP communities to use and other
information in these pages:
- https://twiki.cern.ch/twiki/bin/view/LHCONE/MultiOneBGPcommunities
- https://indico.cern.ch/event/1356138/contributions/6123461/attachments/2925447/5147273/WLCG-20240911-GDB-MultiONE-implementation.pdf

If you need any support on this task, please don't hesitate to ask your
NREN or LHCONE provider.
Or just reply to this ticket asking your questions; experts will guide
you in the implementation.

Please take this opportunity also to review the network information
related to your site in CRIC:
https://wlcg-cric.cern.ch/core/networkroute/list/

We ask you to perform the required action by the end of March 2025.
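For sites wondering what the requested change can look like on the router side, here is a minimal sketch in Cisco IOS-style syntax. Everything in it is illustrative: the prefix, AS number, and neighbor address are hypothetical documentation values, and 61339:3 is the CMS community value later reported in this ticket; consult your NREN/LHCONE provider for the correct communities and platform syntax.

```
! Hypothetical IOS-style sketch: tag prefixes announced to the LHCONE
! peer with the CMS BGP community. All names and numbers are illustrative.
ip prefix-list LHCONE-OUT seq 10 permit 192.0.2.0/24

route-map LHCONE-EXPORT permit 10
 match ip address prefix-list LHCONE-OUT
 set community 61339:3 additive

router bgp 64500
 neighbor 198.51.100.1 route-map LHCONE-EXPORT out
 neighbor 198.51.100.1 send-community
```

Note that `send-community` (or its platform equivalent) is needed for the tags to actually be propagated to the peer.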
GGUS ID: 168338
Last modifier: Javier Calonge
Date: 2024-09-23 16:19:40

Status: in progress
Responsible Unit: NGI_IBERGRID
Public Diary:
Hi Maarten,

This should be actionable by the Site or NGI admins.

Site/NGI admins, as per https://confluence.egi.eu/display/EGIPP/PROC11+Resource+Centre+Decommissioning can you mark your site as closed (the 90-day period of log retention has passed)?

Thanks,
Greg
Internal Diary:
Added attachment EGI VO CLARIN SLA report 2024-05 - 2024-10.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119686
GGUS ID: 168338
Last modifier: Julia Andreeva
Date: 2025-01-28 14:47:33

Public Diary:
Any progress on this ticket?
Hi, Julia

We have contacted our network infrastructure staff to implement BGP tagging.
Right now they are studying what changes are needed in our external router.

We hope that the needed changes can be implemented during February.

Best regards,

Javier
Dear all

We have already implemented BGP tagging (currently only the CMS tag, 61339:3) on our LHCONE link.

Best regards,

Javier
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 97% 100% 100% 100% 100% 97% 100% 96% 95% 100% 100% 97% 99% 100% 100% 99%
HammerCloud: 98% 97% 100% 100% 99% 50% 98% 100% 100% 100% 95% 91% 98% 98% 100% 99%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (13)

CMS tickets (5)
CMS #1000586 (id:1000586) File access errors at T2_ES_IFCA
State: in progress  |  Priority: urgent  |  Opened: 2025-09-18 15:38 (198d ago)  |  Updated: 2026-03-19 09:50
Conversation (3 messages)
Hi,

the 25dataaccess test has been returning a WARNING for a very long time; see for example:
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.WN-25dataaccess-/cms-ce-token&var-dst_hostname=arcce01.ifca.es&var-timestamp=1758141987000
It produces errors when trying to access a local file using CMSSW 12 or 14.
In the case of CMSSW 12, it uses storage.xml?protocol=file and fails with 'Permission denied'. In the case of CMSSW 14, it uses storage.json and fails with 'No such file or directory'. I suspect that there is something wrong with storage.json, and probably also with storage.xml. Could you check whether they match the versions in SITECONF and, if so, whether those are correct for local file access?

Thanks,
Andrea
Hello IFCA admins. This issue is still going on with CMSSW 14. The log file shows "Failed to open the file '///cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/AE237916-5D76-E711-A48C-FA163EEEBFED.root'" [1]. I can access the file using XRootD and WebDAV [2]. This could be related to the PhEDEx/storage.xml issue that Stephan mentions in the other ticket [3]. Could you please take a look?
Thank you,
Noy
[1]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.WN-25dataaccess-/cms-ce-token&var-dst_hostname=arcce01.ifca.es&var-timestamp=1761813122000
[2]
[crungphi@lxplus800 ~]$ xrdcp -f root://gridftp.ifca.es:1094//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/AE237916-5D76-E711-A48C-FA163EEEBFED.root /dev/null
[721.7MB/721.7MB][100%][==================================================][90.21MB/s]
[crungphi@lxplus800 ~]$ gfal-ls -l davs://webdav.ifca.es:1094/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/
-rwxrwxrwx 0 0 0 1293057986 Oct 14 2017 CE860B10-5D76-E711-BCA8-FA163EAA761A.root
-rwxrwxrwx 0 0 0 756769798 Oct 14 2017 AE237916-5D76-E711-A48C-FA163EEEBFED.root
-rwxrwxrwx 0 0 0 240461986 Oct 14 2017 A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root

[3]
https://helpdesk.ggus.eu/#ticket/zoom/3473
Hello IFCA admins. The issue persists. Could you please take a look and check the configuration?
Thank you,
Noy
CMS #1001525 (id:1001525) 50% failure rate at T2_ES_IFCA
State: assigned  |  Priority: urgent  |  Opened: 2026-01-13 16:53 (81d ago)  |  Updated: 2026-01-13 16:53
Conversation (1 message)
I am seeing a 50% failure rate at T2_ES_IFCA since 1/10 with the following error message:
Job Name: ae94610c-f625-4cf7-bd28-d435ea6a4040-455

WMBS job id: 2649303

Workflow: cmsunified_task_GEN-Run3Summer22EEwmLHEGS-00402__v1_T_251112_134101_6157

Task: /cmsunified_task_GEN-Run3Summer22EEwmLHEGS-00402__v1_T_251112_134101_6157/GEN-Run3Summer22EEwmLHEGS-00402_0

Status: jobfailed
Input dataset:

Site: T2_ES_IFCA

Agent: cmsgwms-submit12.fnal.gov

ACDC URL: https://cmsweb.cern.ch/couchdb/acdcserver

State Transition
jobfailed: 2026/1/11 (Sun) 02:11:04 UTC, T2_ES_IFCA

Exit code: 8001

Retry count: 0

Errors:
cmsRun1
cmsRun2
cmsRun3

cmsRun4

Fatal Exception (Exit Code: 8001)
An exception of category 'PluginLibraryLoadError' occurred while
[0] Constructing the EventProcessor
[1] While attempting to load plugin EcalElectronicsMappingBuilder
Exception Message:
unable to load /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_13/lib/el8_amd64_gcc11/pluginGeometryEcalMappingPlugins.so because libnvidia-ml.so.1: cannot open shared object file: No such file or directory

cmsRun5

Skipping this step due to a failure in a previous one. (Exit Code: 99108)
Error in CMSSW: 99108
<@========== WMException Start ==========@>
Exception Class: CmsRunFailure
Message: Skipping this step due to a failure in a previous one.
ClassName : None
ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
LineNumber : 18
ErrorNr : 99108
Traceback:
<@---------- WMException End ----------@>

stageOut1
logArch1
CMS #683338 (id:3473) bad protocol entry "file" at T2_ES_IFCA
State: assigned  |  Priority: urgent  |  Opened: 2025-05-20 22:00 (318d ago)  |  Updated: 2025-10-30 10:17
Conversation (1 message)
Dear IFCA admins,
it looks like you have a bad protocol specification for "file" in PhEDEx/storage.xml
at lines 27-28, where you prefix /cms and chain the protocol to "direct", which adds
another /cms prefix at lines 10-11.
SAM WN-25dataaccess detects and flags this.

https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Darcce01.ifca.es%26service%3Dorg.cms.WN-25dataaccess-%252Fcms-ce-token%26site%3Detf%26view_name%3Dservice

Can you please take a look and correct?
Thanks,
cheers, Stephan
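As an illustration of the double-prefix effect the ticket describes (the rule logic here is hypothetical, following the ticket's wording rather than the site's actual storage.xml):

```python
# Hypothetical sketch of the chaining bug described above: the "file"
# rule adds its own /cms prefix and then chains to the "direct" rule,
# which adds /cms again. Prefixes follow the ticket's description, not
# the real storage.xml contents.

def direct_rule(path):
    # "direct" protocol: map the path onto the local mount under /cms
    return "/cms" + path

def file_rule(lfn):
    # buggy "file" protocol: adds a /cms prefix, then chains to "direct"
    return direct_rule("/cms" + lfn)

print(file_rule("/store/mc/SAM/example.root"))
# /cms/cms/store/mc/SAM/example.root  (doubled prefix, so the open fails)
```

The fix suggested by the ticket is to make the "file" rule stop duplicating the prefix that the chained "direct" rule already adds.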
CMS #681579 (id:1680) CVMFS quota for /cvmfs/cms.cern.ch at T2_ES_IFCA
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:56 (430d ago)  |  Updated: 2025-10-09 09:44
Conversation (3 messages)
GGUS ID: 167555
Last modifier: Stephan Lammel
Date: 2024-07-10 23:09:47
Subject: CVMFS quota for /cvmfs/cms.cern.ch at T2_ES_IFCA
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Other
Description:
Dear IFCA admins,
we added the new SAM worker node test to ETF production today. The CVMFS test
shows the cms.cern.ch area having a quota of only 15 GB assigned. (Other CVMFS
areas have 32.2 GB.) The probe ran on worker node wngrid033.prv.cloud.ifca.es.
For 48-core nodes we recommend about a 100 GB cache.
https://twiki.cern.ch/twiki/bin/view/CMSPublic/FacilitiesServicesDocumentation
Maybe this is something you can adjust when upgrading/re-installing the machines with EL9.
Could you please take a look / increase the quota for cms.cern.ch.
Thanks,
cheers, Stephan

https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_ES_IFCA
GGUS ID: 167555
Last modifier: Miguel Angel Nuñez Vega
Date: 2024-07-22 07:14:49

Status: in progress
Responsible Unit: NGI_IBERGRID
Good morning, Miguel. Could you please provide any updates on the progress of adjusting the quota as per Stephan's request?
Thank you
Noy
CMS #681967 (id:2068) High failure rate with stage out at IFCA
State: in progress  |  Priority: less urgent  |  Opened: 2025-02-03 09:45 (425d ago)  |  Updated: 2025-02-03 09:47
Conversation (3 messages)
GGUS ID: 162647
Last modifier: Jennifer Adelman-McCarthy
Date: 2023-07-04 13:57:00
Subject: High failure rate with stage out at IFCA
Ticket Type: USER
CC: cms-comp-ops-workflow-team@cern.ch;cms-comp-ops-transfer-team@cern.ch;cms-comp-ops-workflow-team@cern.ch;
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: CMS_Central Workflows
Description:
T2_ES_IFCA is having a 98% failure rate:
https://monit-grafana.cern.ch/d/u_qOeVqZk/wmarchive-monit?orgId=11&var-campaign=All&var-jobtype=LogCollect&var-jobtype=Merge&var-jobtype=Processing&var-jobtype=Production&var-host=All&var-site=T2_ES_IFCA&var-jobstate=All&var-exitCodes=All&from=now-12h&to=now

With the following error message:
Job Name: b62a8f5f-92d2-4a58-9bea-a6e4950b42b5-131
WMBS job id: 6007366
Workflow: cmsunified_task_GEN-Run3Summer22EEwmLHEGS-00130__v1_T_230630_202212_5623
Task: /cmsunified_task_GEN-Run3Summer22EEwmLHEGS-00130__v1_T_230630_202212_5623/GEN-Run3Summer22EEwmLHEGS-00130_0
Status: jobcooloff
Input dataset:
Site: T2_ES_IFCA
Agent: cmsgwms-submit7.fnal.gov
ACDC URL: https://cmsweb.cern.ch/couchdb/acdcserver
State Transition
jobcooloff: 2023/7/4 (Tue) 08:07:59 UTC, T2_ES_IFCA


Exit code: 50513
Retry count: 0
Errors:
cmsRun1
cmsRun2
cmsRun3
cmsRun4

SCRAMScriptFailure (Exit Code: 50513)
SCRAM scripts failed to run!
Exception Class: PreScriptFailure
Message: Error running command
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/bin/python3 -m WMCore.WMRuntime.ScriptInvoke WMTaskSpace.cmsRun4 SetupCMSSWPset
1
b''
b'Traceback (most recent call last):\n File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/runpy.py", line 193, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/runpy.py", line 86, in _run_code\n exec(code, run_globals)\n File "/srv/job/WMCore.zip/WMCore/WMRuntime/ScriptInvoke.py", line 23, in \n File "", line 259, in load_module\n File "/srv/job/WMCore.zip/WMCore/WMRuntime/Bootstrap.py", line 18, in \n File "", line 259, in load_module\n File "/srv/job/WMCore.zip/WMCore/FwkJobReport/Report.py", line 19, in \nModuleNotFoundError: No module named \'Utils\'\n'
ClassName : None
ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
LineNumber : 18
ErrorNr : 50513
Traceback:








cmsRun5

Skipping this step due to a failure in a previous one. (Exit Code: 99108)


Error in CMSSW: 99108
Exception Class: CmsRunFailure
Message: Skipping this step due to a failure in a previous one.
ClassName : None
ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
LineNumber : 18
ErrorNr : 99108
Traceback:








stageOut1


logArch1

LogArchiveFailure (Exit Code: 60307)



Exception Class: StageOutFailure
Message: Failure for fallback stage out:
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:20
Output: stdout: Tue Jul 4 09:49:57 CEST 2023
Copying 4076521 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/disk//store/unmerged/logs/prod/2023/7/4/cmsunified_task_GEN-Run3Summer22EEwmLHEGS-00130__v1_T_230630_202212_5623/GEN-Run3Summer22EEwmLHEGS-00130_0/0002/0/b62a8f5f-92d2-4a58-9bea-a6e4950b42b5-131-0-logArchive.tar.gz
gfal-copy exit status: 20
ERROR: gfal-copy exited with 20
Cleaning up failed file:
Tue Jul 4 09:49:58 CEST 2023
srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/disk//store/unmerged/logs/prod/2023/7/4/cmsunified_task_GEN-Run3Summer22EEwmLHEGS-00130__v1_T_230630_202212_5623/GEN-Run3Summer22EEwmLHEGS-00130_0/0002/0/b62a8f5f-92d2-4a58-9bea-a6e4950b42b5-131-0-logArchive.tar.gz MISSING
stderr: /srv/startup_environment.sh: line 8: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 15: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 33: EUID: readonly variable
/srv/startup_environment.sh: line 145: PPID: readonly variable
/srv/startup_environment.sh: line 157: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 209: UID: readonly variable
gfal-copy error: 20 (Not a directory) - DESTINATION MAKE_PARENT srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/disk//store/unmerged/logs/prod/2023/7/4/cmsunified_task_GEN-Run3Summer22EEwmLHEGS-00130__v1_T_230630_202212_5623/GEN-Run3Summer22EEwmLHEGS-00130_0/0002/0 it is a file
/srv/startup_environment.sh: line 8: BASHOPTS: readonly variable
/srv/s
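The gfal-copy error 20 above ("Not a directory", DESTINATION MAKE_PARENT) means a regular file was found where a directory component of the unmerged-logs path was expected, so the parent directory could not be created. A minimal local sketch of the same failure mode (the paths here are made up; the real failure happened on the SRM endpoint at PIC, not locally):

```python
# Reproduce the MAKE_PARENT failure mode locally: creating the parent
# directory of a destination fails when one path component already exists
# as a regular file. This is an analogy only; gfal2 reported the same
# condition as error 20 (Not a directory) on the remote storage.
import os
import tempfile

def make_parent(dest_path):
    """Try to create the parent directory of dest_path, like MAKE_PARENT does."""
    try:
        os.makedirs(os.path.dirname(dest_path), exist_ok=True)
        return "OK"
    except NotADirectoryError:
        return "Not a directory"

with tempfile.TemporaryDirectory() as tmp:
    blocker = os.path.join(tmp, "0002")
    open(blocker, "w").close()  # a *file* sits where a directory is expected
    dest = os.path.join(blocker, "0", "logArchive.tar.gz")
    print(make_parent(dest))    # -> Not a directory
```

The fix on the storage side is to remove (or rename) the stray file so the directory hierarchy can be created.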
GGUS ID: 162647
Last modifier: Iban Cabrillo
Date: 2024-01-29 11:56:46

Status: in progress
Responsible Unit: NGI_IBERGRID
GGUS ID: 162647
Last modifier: Jennifer Adelman-McCarthy
Date: 2025-01-23 15:48:05

Status: in progress
Responsible Unit: NGI_IBERGRID
Public Diary:
It's a little hard to tell, since we are not running a lot of jobs at the site.
I am still seeing a high failure rate, but with a different error. However, the same error can be seen for this workflow at other sites, even though we have had 180000 successful jobs run with this step, so I don't think it's a workflow problem. Can we keep the ticket open a bit longer as a reminder to keep an eye on the site?
Job Name: 2f200529-7498-430a-a6f0-019855a0cf01-326
WMBS job id: 5114234
Workflow: cmsunified_task_BPH-RunIII2024Summer24GS-00035__v1_T_250122_104203_4006
Task: /cmsunified_task_BPH-RunIII2024Summer24GS-00035__v1_T_250122_104203_4006/BPH-RunIII2024Summer24GS-00035_0
Status: jobcooloff
Input dataset:
Site: T2_ES_IFCA
Agent: cmsgwms-submit11.fnal.gov
ACDC URL: https://cmsweb.cern.ch/couchdb/acdcserver
State Transition
jobcooloff: 2025/1/23 (Thu) 11:36:02 UTC, T2_ES_IFCA


Exit code: 50513
Retry count: 0
Errors:
cmsRun1


cmsRun2


cmsRun3


cmsRun4

SCRAMScriptFailure (Exit Code: 50513)


SCRAM scripts failed to run!
Exception Class: PreScriptFailure
Message: Error running command
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/bin/python3 -m WMCore.WMRuntime.ScriptInvoke WMTaskSpace.cmsRun4 SetupCMSSWPset
1
Traceback (most recent call last):
File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/srv/job/WMCore.zip/WMCore/WMRuntime/ScriptInvoke.py", line 23, in
File "", line 259, in load_module
File "/srv/job/WMCore.zip/WMCore/WMRuntime/Bootstrap.py", line 19, in
File "", line 259, in load_module
File "/srv/job/WMCore.zip/WMCore/FwkJobReport/Report.py", line 19, in
ModuleNotFoundError: No module named 'Utils'

ClassName : None
ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
LineNumber : 18
ErrorNr : 50513
Traceback:
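The ModuleNotFoundError above is raised while importing WMCore code out of WMCore.zip: Python can import packages from a zip archive on sys.path, and a module absent from the archive fails exactly this way. A minimal sketch under that assumption (pkgdemo and its contents are hypothetical names for illustration, not part of WMCore):

```python
# Build a zip archive containing a small package, import from it via
# sys.path, then show that a module missing from the archive raises
# ModuleNotFoundError, as 'Utils' does in the job log above.
import importlib
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "pkgdemo.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("pkgdemo/__init__.py", "")
    zf.writestr("pkgdemo/report.py", "VALUE = 42\n")
    # note: no pkgdemo/Utils.py is placed in the archive

sys.path.insert(0, archive)
report = importlib.import_module("pkgdemo.report")
print(report.VALUE)  # imported straight from the zip -> 42

try:
    importlib.import_module("pkgdemo.Utils")
except ModuleNotFoundError as exc:
    print(exc)  # No module named 'pkgdemo.Utils'
```

In the failing jobs the 'Utils' package was evidently not bundled into (or not importable from) the WMCore.zip shipped with the job, which is a packaging problem rather than a site problem.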
WLCG tickets (8)
WLCG #681559 (id:1660) Request to deploy IPv6 on CEs and WNs at WLCG sites (IFCA-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:55 (430d ago)  |  Updated: 2026-03-31 08:28
Conversation (10 messages)
GGUS ID: 164352
Last modifier: Andrea Sciaba
Date: 2023-11-29 06:44:04

Public Diary:
Hola Francisco,
could you please answer the ticket?
Thanks,
Andrea
Internal Diary:
Escalated this ticket to NGI_IBERGRID
GGUS ID: 164352
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:36:56
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (IFCA-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
your estimate of the timescale for the deployment;
a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what are the obstacles and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
GGUS ID: 164352
Last modifier: Andrea Sciaba
Date: 2024-04-08 14:32:23

Public Diary:
Hello,
would it be possible to have some information?
Thanks!
Andrea
Internal Diary:
Escalated this ticket to NGI_IBERGRID
GGUS ID: 164352
Last modifier: Miguel Angel Nuñez Vega
Date: 2024-07-22 07:10:28

Status: in progress
Responsible Unit: NGI_IBERGRID
Public Diary:
Hello,
would it be possible to have some information?
Thanks!
Andrea
Internal Diary:
Escalated this ticket to NGI_IBERGRID
GGUS ID: 164352
Last modifier: Andrea Sciaba
Date: 2024-09-18 09:39:06

Public Diary:
Hi,
once again, could you please provide some information about the required IPv6 deployment on CEs and WNs?
Thanks, Andrea
Internal Diary:
Escalated this ticket to NGI_IBERGRID
GGUS ID: 164352
Last modifier: Andrea Sciaba
Date: 2025-01-24 16:02:59

Public Diary:
Hola Miguel Ángel,
are there any news?
Andrea
Internal Diary:
Escalated this ticket to NGI_IBERGRID
Dear Andrea,

We currently have IPv6 enabled on all our main services, including the CEs, xrootd, and Squids.

Regarding the Worker Nodes, they do not have public network access and, by design, they should not have it. Therefore, we are not entirely sure if enabling IPv6 directly on the WNs is strictly necessary, as they are located within our private network.

We would like to confirm whether the request refers to enabling IPv6 directly on the WNs or simply ensuring IPv6 connectivity through a gateway, which we could implement if that is the actual requirement. Could you please provide us with more details about the specific issue you are experiencing with our network? This will help us better understand the situation and determine the most appropriate solution.

Kind regards,
Pablo
Hola Pablo,
the use case for IPv6 on the WNs is to be able to open connections with off-site IPv6-only servers (like storage or central services), so any solution that allows that would be fine. If the WNs are in a private network there is no need to have public IPv6 addresses for them.
Cheers,
Andrea
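Andrea's requirement above boils down to: the WNs must be able to open outbound connections to off-site IPv6-only endpoints, regardless of whether they hold public addresses themselves. A small sketch of how one might check, from a WN behind a gateway, whether a remote service publishes IPv6 addresses at all (the hostname handling and the documentation-prefix address 2001:db8::1 are illustrative only):

```python
# Check what IPv6 (AAAA) addresses DNS publishes for a host; a WN that
# must reach an IPv6-only server needs both a AAAA answer here and a
# working outbound IPv6 route (gateway/NAT64 is fine per the ticket).
import ipaddress
import socket

def ipv6_addresses(host):
    """Return the set of IPv6 addresses resolvable for host (empty if none)."""
    try:
        infos = socket.getaddrinfo(host, None, family=socket.AF_INET6)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}

# Classifying a literal address (documentation prefix, hypothetical):
addr = ipaddress.ip_address("2001:db8::1")
print(addr.version)  # 6 -> this endpoint requires IPv6 reachability
```

An actual end-to-end test would additionally attempt a connect with a timeout, which needs real network access from the WN.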
Hola Pablo,
did you find a solution that allows your WNs to access external services via IPv6?
Cheers,
Andrea
Hi Pablo,
are there any news?
Saludos,
Andrea
WLCG #681580 (id:1681) CE arcce01.ifca.es is not working for Biomed users
State: waiting for submitter's reply  |  Priority: less urgent  |  Opened: 2025-01-29 09:56 (430d ago)  |  Updated: 2026-03-31 07:19
Conversation (5 messages)
GGUS ID: 166180
Last modifier: Sorina Pop
Date: 2024-04-09 15:35:07
Subject: CE arcce01.ifca.es is not working for Biomed users
Ticket Type: TEAM
CC:
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Other
Description:
Dear sites admins,

CE arcce01.ifca.es is not working for Biomed users. The incident was detected from the Biomed ARGO box that you may want to check to see the status: https://biomed.ui.argo.grnet.gr/biomed/report-status/CORE/SITES/IFCA-LCG2/ARC-CE/arcce01.ifca.es

According to the announcement in https://operations-portal.egi.eu/broadcast/archive/3021, the biomed VOMS server host certificate changed on Friday, 29th of March, requiring client updates, and it's likely that the errors are due to this change. Could you please have a look?

Thank you in advance for your support,
Sorina for the Biomed VO.
GGUS ID: 166180
Last modifier: Miguel Angel Nuñez Vega
Date: 2024-07-22 07:12:23

Status: in progress
Responsible Unit: NGI_IBERGRID
GGUS ID: 166180
Last modifier: Akos Szlavecz
Date: 2024-05-01 10:28:53

Public Diary:
Dear site admins,

Could you check this issue, please?

Regards
Ákos
GGUS ID: 166180
Last modifier: Sorina Pop
Date: 2024-08-21 12:15:54

Public Diary:
Dear site admins,

Any news on this issue?

Best regards,
Sorina
Dear Sorina

While reviewing older open tickets, we came across this one regarding the Biomed users' certificate issue at CE arcce01.ifca.es. We would like to follow up to ensure the matter is fully resolved.

Currently, our system uses the following VO for Biomed users: voms-biomed.ijclab.in2p3.fr

Could you please confirm whether this VO is still the correct one for your users? If it is not, kindly let us know which VO should be used so we can update our configuration accordingly.

If the VO is correct, and the issue has already been resolved, we will proceed to close this ticket.

We look forward to your confirmation.

Best regards,
Pablo Izquierdo
WLCG #681573 (id:1674) Enable new monitoring flow for xrootd remote access (IFCA-LCG2)
State: in progress  |  Priority: urgent  |  Opened: 2025-01-29 09:56 (430d ago)  |  Updated: 2026-03-02 11:16
Conversation (3 messages)
GGUS ID: 164117
Last modifier: Julia Andreeva
Date: 2023-11-09 10:40:18
Subject: Enable new monitoring flow for xrootd remote access (IFCA-LCG2)
Ticket Type: USER
CC: ;borja.garrido.bear@cern.ch
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Monitoring
Description:
According to WLCG CRIC your site is running xrootd storage. The WLCG Monitoring Task Force implemented a new monitoring flow which should monitor remote data access more reliably. We request the sites to set up and configure the new component 'shoveler', which has to be deployed at the site. Please follow the instructions which can be found on the twiki:
https://twiki.cern.ch/twiki/bin/view/LCG/MonitoringTaskForce#Shoveler
Please, accomplish this task before the end of 2023.
GGUS ID: 164117
Last modifier: Miguel Angel Nuñez Vega
Date: 2024-07-22 07:10:11

Status: in progress
Responsible Unit: NGI_IBERGRID
Public Diary:
Hi,

Those VOs are missing from the Caso service. We will also migrate the service to a new one and will include the missing VO's.

Cheers
Joao Pina
Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=167003
Good morning,

Thank you for the notification and for the instructions regarding the deployment of the shoveler component for the new monitoring flow.

While reviewing the installation and configuration documentation, we noticed that a valid certificate is required in order to deploy and operate the monitoring service at our site. Could you please provide guidance on how to request or obtain the appropriate certificate for this purpose?

If there are specific requirements (certificate type, DN information, host certificate vs. service certificate, or the responsible CA), please let us know so we can proceed accordingly.

Thank you for your support.

Best regards,

Pablo Izquierdo
WLCG #1000593 (id:1000593) Lack of cloud accounting - IFCA-LCG2
State: in progress  |  Priority: urgent  |  Opened: 2025-09-19 12:36 (197d ago)  |  Updated: 2025-09-22 06:46
Conversation (1 message)
Dear IFCA-LCG2 cloud admins,
There are no cloud records published in the Accounting Portal since the beginning of July 2025. A few attempts to contact you by email were ignored, therefore I am raising this ticket so you can investigate and report your findings here. Many thanks!
Catalin
WLCG #681577 (id:1678) Missing CPU accounting data in the EGI portal for May. Monthly accounting validation has not been performed either. (IFCA-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:56 (430d ago)  |  Updated: 2025-01-30 14:42
Conversation (3 messages)
GGUS ID: 167564
Last modifier: Julia Andreeva
Date: 2024-07-12 09:20:33
Subject: Missing CPU accounting data in the EGI portal for May. Monthly accounting validation has not been performed either. (IFCA-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Other
Description:
Hello,
You get this ticket because accounting data for your site for May do not show up in the EGI accounting portal. In case you have troubles with the accounting reporting (which should be followed up with the APEL support team), you should use monthly accounting validation in order to provide accounting metrics from your local accounting for the WLCG monthly accounting report. It has not been done for your site for May. In case you were not notified for monthly accounting validation, please contact me (julia.andreeva@cern.ch), so that we update the mailing list for notification. Please make sure that for June your numbers are provided and that your reporting to APEL is fixed.
GGUS ID: 167564
Last modifier: Miguel Angel Nuñez Vega
Date: 2024-07-22 07:15:06

Status: in progress
Responsible Unit: NGI_IBERGRID
GGUS ID: 167564
Last modifier: Pablo Izquierdo Gonzalez
Date: 2025-01-28 08:24:32

Public Diary:
Hello Julia,

We were waiting for the APEL support team to fix some errors with the accounting portal. Our data is correctly published now. Can you confirm that it is resolved, so we can close this ticket?

Cheers
Pablo
WLCG #681575 (id:1676) Set VO property in the OpenStack projects used for EGI
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:56 (430d ago)  |  Updated: 2025-01-30 14:40
Conversation (2 messages)
GGUS ID: 163724
Last modifier: Enol Fernandez
Date: 2023-10-19 11:48:30
Subject: Set VO property in the OpenStack projects used for EGI
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Configuration
Description:
Dear IFCA-LCG2 admins,

We are moving to a better discovery of VOs in the cloud by using the project properties in OpenStack so you can use any kind of naming convention you may have internally for your OpenStack projects, but we can still automatically discover which VOs these projects are supporting. For that we request that you set a property (named VO) in those projects used for EGI so the cloud-info-provider service account can publish the correct information about them. The OpenStack command to use would look like this:

openstack project set --property VO=

The cloud-info-provider is already configured to leverage this property. Once enabled we can get rid of the site information at the fedcloud-catchall-operations repo.

Thanks Enol

Affected Site: IFCA-LCG2
GGUS ID: 163724
Last modifier: Miguel Angel Nuñez Vega
Date: 2024-07-22 07:09:14

Status: in progress
Responsible Unit: NGI_IBERGRID
Public Diary:
I think that it can be closed as the decommissioning has been completed.
Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=167003
WLCG #681570 (id:1671) Swift service checks failing at IFCA-LCG2
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:56 (430d ago)  |  Updated: 2025-01-30 14:38
Conversation (2 messages)
GGUS ID: 166874
Last modifier: Aida Palacio
Date: 2024-05-27 09:33:53
Subject: Swift service checks failing at IFCA-LCG2
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Monitoring
Description:
Dear,

Since one week ago, the checks of our Swift service have been failing and we can't find the reason for the problem. The error from argo-mon is the following, which may be clearer to you than to us:

```
Critical: Could not create new OpenStack Swift Container: container-9b146512-6931-4155-b211-ff1f6d9e75ff: 400 Client Error: Bad Request 400 Client Error: Bad Request
```

In our internal logs, problems are only related to the POST call; the GETs are OK. Can you let us know if there is a problem on our site? Otherwise, what is the issue related to?

Thanks in advance!
Cheers,
A.

Affected ROC/NGI: NGI_IBERGRID
Affected Site: IFCA-LCG2
GGUS ID: 166874
Last modifier: Miguel Angel Nuñez Vega
Date: 2024-07-22 07:14:14

Status: in progress
Responsible Unit: NGI_IBERGRID
Public Diary:
Hi,

We have added voms to list of servers to be upgraded.

Cheers
Joao Pina
Internal Diary:
Added attachment EGI VO CLARIN SLA report 2024-05 - 2024-10.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119686
WLCG #681565 (id:1666) Application credentials at IFCA Openstack for enmr.eu VO
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:55 (430d ago)  |  Updated: 2025-01-30 14:36
Conversation (1 message)
GGUS ID: 167702
Last modifier: Andrei Tsaregorodtsev
Date: 2024-07-29 12:47:54
Subject: Application credentials at IFCA Openstack for enmr.eu VO
Ticket Type: USER
CC: A.M.J.J.Bonvin@uu.nl;
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Computing Services
Description:
Hello,
I would like to use the IFCA Cloud for the enmr.eu VO calculations using automated VM creation by the EGI Workload Manager (DIRAC). To do that I need to create Application Credentials associated with the VO:enmr.eu project. Unfortunately, I fail with the error:

Error: Unable to create application credential. Details
Invalid application credential: Could not find role assignment with role: 404b3eb5254f49d3bf2e68456ae7671c, user or group: e8a1877d5cb835a6e62fde5fbfff85d8f303098067b6d8069853668d6957b5e3, project, domain, or system: b6b1c395b7da4e8aa1d0895bd695b0ba. (HTTP 400) (Request-ID: req-838e3bc1-b385-4416-8d6c-2afbe40dd255)

This is usually because my account is not granted the "reader" and the "member" roles. Can you please add these roles to my account? I use EGI SSO credentials with the following ID:

916ace3e7347dae2e0bb16e96444cafaea7706e7b233a93ad76881b6ff23ab80@egi.eu

Regards,
Andrei
Affected ROC/NGI: NGI_IBERGRID
Affected Site: IFCA-LCG2
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%100%100%100%100%100%100%100%100%100%97%100%100%100%99%
HammerCloud97%98%97%98%99%97%98%99%98%95%98%95%99%98%99%97%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (6)

CMS tickets (5)
CMS #1002095 (id:1002095) Enabling the ARC CE arc2lumi.hip.fi
State: assigned  |  Priority: less urgent  |  Opened: 2026-03-17 11:44 (18d ago)  |  Updated: 2026-04-01 16:19
Conversation (16 messages)
Hello!

The ARC CE arc2lumi.hip.fi has been upgraded and reconfigured to submit jobs again to the LUMI HPC.
Could you enable it first for commissioning and, if things look OK, then later enable it for production?
Please use these parameters for a start. The 4 cores / job is for studying the scheduling issue for fapptainer jobs running with HTCondor.
ARC CE: arc2lumi.hip.fi

Subsite: T2_FI_HIP/HIP-LUMI
SLURM queue name: small (queue=small)
Request for ARC RTE: ENV/CMS-LUMI (runtimeenvironment=ENV/CMS-LUMI)
4 core jobs, to start with
RAM: 2 GB/core
Max runtime 48 hours

Best regards, Tomas
Hello Tomas,

Thanks for the parameters. I'll test this configuration tomorrow and update you on the results.

Best regards,
Luís.
Hello Tomas,

I've enabled the entry for arc2lumi.hip.fi in our ITB environment but wasn't able to get any pilots running there. I'll describe what I've tried and observed. Could you check if any change is needed in the site?

At first I enabled the entry with sckitokens authentication, but all pilots remained pending in the factory reporting "Detected Down Grid Resource" in the Condor activity log.

I then reverted back to using x509 proxy authentication for this entry, and then all new pilots went into a held state reporting "ARC job failed: Job submission to LRMS failed" [1]. I've also attached a section of the Condor gridmanager log for the GAHP in this message [2], as doing arc_gahp tests in the factory didn't return any job IDs to query.

I also found an ETF condor token test that has been failing for this CE [3].

Can you check it on your end and let me know if any changes are needed?

Best regards,
Luís.

SI Factory Ops team.

---

[1]
...
028 (318271.002.000) 2026-03-18 16:09:51 Job ad information event triggered.
Cluster = 318271
EventTime = "2026-03-18T16:09:51.534"
EventTypeNumber = 28
GlideinOverloadEnabled = true
HoldReason = "ARC job failed: Job submission to LRMS failed"
HoldReasonCode = 0
HoldReasonSubCode = 0
MyType = "JobHeldEvent"
Proc = 2
Subproc = 0
TriggerEventTypeName = "ULOG_JOB_HELD"
TriggerEventTypeNumber = 12
...
012 (318271.003.000) 2026-03-18 16:09:51 Job was held.
ARC job failed: Job submission to LRMS failed
Code 0 Subcode 0
...

[2]
03/18/26 16:09:49 [92362] GAHP[92538] <- 'ARC_JOB_INFO 21 arc2lumi.hip.fi bfc753cf6561'
03/18/26 16:09:49 [92362] GAHP[92538] -> 'S'
03/18/26 16:09:49 [92362] GAHP[92538] <- 'RESULTS'
03/18/26 16:09:49 [92362] GAHP[92538] -> 'R'
03/18/26 16:09:49 [92362] GAHP[92538] -> 'S' '1'
03/18/26 16:09:49 [92362] GAHP[92538] -> '18' '200' 'OK' '{
"ID": "urn:caid:arc2lumi.hip.fi:org.nordugrid.arcrest:bfc554591421",
"Type": "single",
"Error": "Job submission to LRMS failed",
"Owner": "/DC=ch/DC=cern/OU=computers/CN=cmspilot04\\/vocms0802.cern.ch",
"Queue": "small",
"State": [
"nordugrid:FAILED::SUBMIT",
"file:submit",
"arcrest:FAILED"
],
"StdIn": "/dev/null",
"StdErr": "_arc_stderr",
"StdOut": "_arc_stdout",
"OtherInfo": "SubmittedVia=org.nordugrid.arcrest",
"LocalOwner": "grid",
"_attributes": {
"BaseType": "Activity",
"Validity": "10800",
"CreationTime": "2026-03-18T15:09:01Z"
},
"Associations": {
"ComputingShareID": "urn:ogf:ComputingShare:arc2lumi.hip.fi:small"
},
"RestartState": [
"nordugrid:SUBMIT",
"file:submit",
"arcrest:SUBMITTING"
],
"IDFromEndpoint": "urn:idfe:bfc554591421",
"JobDescription": "emies:adl",
"RequestedSlots": "4",
"SubmissionHost": "137.138.53.124",
"SubmissionTime": "2026-03-18T15:07:49Z",
"ProxyExpirationTime": "2026-03-21T15:00:02Z",
"WorkingAreaEraseTime": "2026-03-20T15:07:57Z",
"RequestedTotalWallTime": "691200",
"RequestedApplicationEnvironment": "ENV/CMS-LUMI"
}'
03/18/26 16:09:50 [92362] (318271.3) gm state change: GM_EXIT_INFO -> GM_CANCEL_CLEAN
03/18/26 16:09:50 [92362] GAHP[92538] <- 'ARC_JOB_CLEAN 25 arc2lumi.hip.fi bfc753cf6561'
03/18/26 16:09:50 [92362] GAHP[92538] -> 'S'
03/18/26 16:09:50 [92362] GAHP[92538] <- 'RESULTS'
03/18/26 16:09:50 [92362] GAHP[92538] -> 'R'
03/18/26 16:09:50 [92362] GAHP[92538] -> 'S' '1'
03/18/26 16:09:50 [92362] GAHP[92538] -> '22' '202' 'Queued for cleaning'

[3]
https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Darc2lumi.hip.fi%26service%3Dorg.sam.CONDOR-JobSubmit-%252Fcms-ce-token%26site%3Detf%26view_name%3Dservice
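The payload in the GAHP log [2] is the ARC REST job-info document as JSON, and the relevant diagnostics (Error, State, Queue) can be pulled out of it directly. A sketch using a trimmed copy of the record quoted above:

```python
# Parse the ARC REST job-info JSON and extract the failure diagnostics.
# The payload below is a trimmed version of the one in the GAHP log above.
import json

payload = """{
  "ID": "urn:caid:arc2lumi.hip.fi:org.nordugrid.arcrest:bfc554591421",
  "Error": "Job submission to LRMS failed",
  "Queue": "small",
  "State": ["nordugrid:FAILED::SUBMIT", "file:submit", "arcrest:FAILED"]
}"""

job = json.loads(payload)
# A job that never reached SLURM shows FAILED in its state list and has
# the LRMS error string set:
failed = any(s.endswith("FAILED") for s in job["State"])
print(job["Error"], failed)  # Job submission to LRMS failed True
```

Here the "nordugrid:FAILED::SUBMIT" state confirms the job died at CE-to-SLURM submission, consistent with a CE-side configuration problem rather than a factory one.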
Hello Luis!

Thanks for the report. There was a missing configuration file that was added. Hopefully that will improve things.

Best regards, Tomas
Hello Tomas,

Thanks for the update. I see from our monitoring that since approximately 01:30 AM today the pilots have been able to make it to the CE; you should be able to see the pilot jobs there now, although they are still idle and have not yet started.
I also noted that since 01:30 AM there were some periods in which the pilots went into a held state again. I'll keep monitoring things on our side and will keep you updated on any progress.

Best regards,
Luís.
Hello Luis!

There were some more configuration issues to correct, but now there is a glidein job running on LUMI. I suspect the job is running in TMPDIR, which points to /tmp and is in fact in RAM. If that is the case, it still has to be corrected.

Best regards, Tomas
Hello Tomas,

The jobs seem to have run successfully. Our test script prints out some information about the environment [1]. The entry is indeed set to TMPDIR; let me know if any changes are needed in this regard. I'll change the authentication method to scitokens and see if we can get the pilots to run on the entry as well.

Best regards,
Luís.

---

[1]

[lsimasde@vocms0811 lsimasde]$ condor_history 34175.0
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
34175.0 lsimasde 3/18 10:22 0+00:04:25 C 3/20 16:18 sleep.sh 60
[lsimasde@vocms0811 lsimasde]$ cat out/sleep.out.34175.0
I slept for 60 seconds on:
nid002441
Linux
nid002441 6.4.0-150600.23.73_15.0.14-cray_shasta_c #1 SMP Tue Oct 21
20:32:25 UTC 2025 (89d3c98) x86_64 x86_64 x86_64 GNU/Linux
Fri Mar 20 17:18:06 EET 2026
OSG_WN_TMP: /tmp
CONDOR_SCRATCH_DIR: /srv
HOME: /srv
PATH: /srv/.gwms.d/bin:/tmp/glide_CdnHIf/main/condor/libexec:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD: /srv
Hello Luis!

I think we can fix the TMPDIR on our end.

I realized just today that SLURM on LUMI has been set up so that the 2 GB/core nodes only have 1.75 GB/core available. So for 4-core jobs the corresponding limit would be 7168 MB. Could you please change that in the glidein factory entry from the current assumption of 2 GB/core?

Best regards, Tomas
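For reference, the arithmetic behind the requested change (assuming 1.75 GB/core is meant in binary units, i.e. 1792 MB):

```python
# Sanity check of the quoted limit: 1.75 GB/core on a 4-core slot gives
# 7168 MB, versus 8192 MB under the old 2 GB/core assumption.
mb_per_core = int(1.75 * 1024)   # 1792 MB available per core
slot_cores = 4
print(slot_cores * mb_per_core)  # 7168
print(slot_cores * 2 * 1024)     # 8192 (old assumption, over-requests)
```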
Hello Tomas,

I've adjusted the memory limit for the entry. If you need a re-test after
changing the TMPDIR just let me know. Do you plan on using 8 core slots
when we move this to production?
Also, I had success with using scitokens authentication for this entry in my latest test.

Best regards,
Luís.
Hello Luis!

The TMPDIR should point to the right place now, but it would be good to confirm that with some testjobs.

There is a scheduling problem with fapptainer/HTCondor/CMSSW that needs to be debugged. I hope to get some insight by initially running 4-core jobs on arc2lumi and comparing with the 8-core jobs on snowarc.

For some reason the SAM tests are worse today on arc2lumi, and the reason for the failed token tests is not yet understood.

Best regards, Tomas
Hello Tomas,

I sent another batch of test jobs this morning and they all ran fine [1]. I don't have much context about the SAM tests, but let me know if you need anything. So as I understand the plan is to fix the SAM tests, move this entry to production and then later on increase the slot size to 8 cores. Is that correct?

Best regards,
Luís.

---

[1]

[lsimasde@vocms0811 lsimasde]$ condor_history 37132
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
37132.11 lsimasde 3/25 08:55 0+00:04:03 C 3/25 09:26 sleep.sh 60
37132.13 lsimasde 3/25 08:55 0+00:03:04 C 3/25 09:25 sleep.sh 60
37132.14 lsimasde 3/25 08:55 0+00:03:03 C 3/25 09:25 sleep.sh 60
37132.10 lsimasde 3/25 08:55 0+00:04:01 C 3/25 09:25 sleep.sh 60
37132.12 lsimasde 3/25 08:55 0+00:01:30 C 3/25 09:23 sleep.sh 60
37132.9 lsimasde 3/25 08:55 0+00:04:13 C 3/25 09:23 sleep.sh 60
37132.15 lsimasde 3/25 08:55 0+00:01:23 C 3/25 09:23 sleep.sh 60
37132.8 lsimasde 3/25 08:55 0+00:03:57 C 3/25 09:22 sleep.sh 60
37132.7 lsimasde 3/25 08:55 0+00:04:00 C 3/25 09:21 sleep.sh 60
37132.6 lsimasde 3/25 08:55 0+00:03:38 C 3/25 09:19 sleep.sh 60
37132.5 lsimasde 3/25 08:55 0+00:03:54 C 3/25 09:18 sleep.sh 60
37132.4 lsimasde 3/25 08:55 0+00:03:53 C 3/25 09:17 sleep.sh 60
37132.3 lsimasde 3/25 08:55 0+00:04:09 C 3/25 09:16 sleep.sh 60
37132.2 lsimasde 3/25 08:55 0+00:03:51 C 3/25 09:14 sleep.sh 60
37132.1 lsimasde 3/25 08:55 0+00:02:36 C 3/25 09:13 sleep.sh 60
37132.0 lsimasde 3/25 08:55 0+00:01:18 C 3/25 09:11 sleep.sh 60
[lsimasde@vocms0811 lsimasde]$ cat out/sleep.out.37132.0
I slept for 60 seconds on:
nid002450
Linux
nid002450 6.4.0-150600.23.73_15.0.14-cray_shasta_c #1 SMP Tue Oct 21
20:32:25 UTC 2025 (89d3c98) x86_64 x86_64 x86_64 GNU/Linux
Wed Mar 25 10:11:50 EET 2026
OSG_WN_TMP: /tmp
CONDOR_SCRATCH_DIR: /srv
HOME: /srv
PATH: /srv/.gwms.d/bin:/tmp/glide_aEOFql/main/condor/libexec:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD: /srv
Hello Luis!

Thanks for running the glidein tests!

Yes, you are right about the plan.

Best regards, Tomas
Hello Tomas,

Ok, just tell me whenever you're ready to move this entry to production. If you need anything else in the meantime just let me know.

Best regards,
Luís.
Hello Luis!

The test org.sam.CONDOR-JobSubmit-/cms-ce-token fails and the tests dependent on that test are not run. But I think we can enable arc2lumi in production while trying to fix that test.

Best regards, Tomas
Hi Tomas!

I'll enable it in production after the Easter holidays and will keep you updated. I discussed a bit about the SAM test with other colleagues and they gave me a lead that you might find worth investigating:

The CE maintains an allowlist of UUIDs that are allowed to access it. It might be that your CE is authorizing our factory, but not the SAM tests infrastructure, as they use different UUIDs. The UUID used by SAM tests is 08ca855e-d715-410e-a6ff-ad77306e1763.

Let me know if you need anything. I'll get back to you once I move the entry to production.

Cheers,
Luís.
Hello Luis!

Thanks!

The UUID used by SAM tests is not the problem, as it is already in the configuration.

Best regards, Tomas
CMS #1002193 (id:1002193) CMS ETF_WN-xrootd-access and ETF_WN-analysis tests fail on some ARC sites
State: assigned  |  Priority: urgent  |  Opened: 2026-03-25 09:11 (10d ago)  |  Updated: 2026-03-25 09:11
Conversation (1 message)
Since the 24th of March there has been a problem with the CMS ETF_WN-xrootd-access and ETF_WN-analysis tests at T2_FI_HIP on our three independent ARC CEs, but apparently the same problem has appeared at T2_FR_IPHC at the same time, so it seems not to be a site-specific problem but a problem with the monitoring itself.
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_FI_HIP
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_FR_IPHC
Best regards, Tomas
CMS #1001938 (id:1001938) Mostly single cores used on eight core jobslots on Mahti HPC
State: assigned  |  Priority: less urgent  |  Opened: 2026-03-02 15:28 (33d ago)  |  Updated: 2026-03-24 16:45
Conversation (3 messages)
Hello!

The snowarc.hip.fi ARC CE has been submitting production pilots to the CSC Mahti HPC for a few weeks now. The pilots request 8-core job slots with 1.875 GB/core, i.e. 15 360 MB in total for each 8-core job slot. Currently there are on average about 20 jobs running, with a total of 160 cores available. The problem is that the 8 cores are not fully utilized: typically only one core out of 8 is used. Sometimes there are cmsRun processes using 2 or 4 cores, but mostly only a single core is used.

This example shows top output for a node running four jobs that have requested 8 cores each, so in total 32 cores are expected to be used, but only about five cores are actually busy.
185914 rb_2006+ 20 0 22.4g 3.1g 704760 R 198.0 1.2 8:01.61 cmsRun
156116 rb_2006+ 20 0 21.3g 1.6g 655712 R 99.7 0.6 38:35.89 cmsRun
171883 rb_2006+ 20 0 21.3g 1.6g 655724 R 99.7 0.6 26:15.81 cmsRun
141607 rb_2006+ 20 0 21.3g 1.6g 654800 R 99.3 0.6 50:44.42 cmsRun
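For reference, the under-utilization can be quantified directly from the %CPU column of the top lines above; a minimal sketch (busy_cores is an illustrative helper, not part of any monitoring tool):

```python
# Estimate how many cores are actually busy from top output.
# Field positions follow the lines quoted above (%CPU is the 9th column).

def busy_cores(top_lines):
    """Sum the %CPU column and convert it to an approximate core count."""
    total_pct = 0.0
    for line in top_lines:
        fields = line.split()
        total_pct += float(fields[8])  # %CPU column
    return total_pct / 100.0

lines = [
    "185914 rb_2006+ 20 0 22.4g 3.1g 704760 R 198.0 1.2 8:01.61 cmsRun",
    "156116 rb_2006+ 20 0 21.3g 1.6g 655712 R 99.7 0.6 38:35.89 cmsRun",
    "171883 rb_2006+ 20 0 21.3g 1.6g 655724 R 99.7 0.6 26:15.81 cmsRun",
    "141607 rb_2006+ 20 0 21.3g 1.6g 654800 R 99.3 0.6 50:44.42 cmsRun",
]
print(f"~{busy_cores(lines):.1f} busy cores out of 32 requested")
# -> ~5.0 busy cores out of 32 requested
```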

Could someone who has access to HTCondor try to see why the cores are not scheduled for a full load with the 8 cores by running something like this:
condor_q -better-analyze -machine CMSHTPC_T2_FI_HIP_snowarc

Best regards, Tomas
To study the scheduling issue, one should also pass as a parameter the job ID of an idle job that should match the given Execution Point. The EP name can be prefixed with 'slotXXX@' if desired.
So the command could look like this:
condor_q -better-analyze -machine slotXXX@CMSHTPC_T2_FI_HIP_snowarc idle_job_id_matching_Execution_Point
CMS #683468 (id:3603) APEL data not updated since end of May and June data missing
State: assigned  |  Priority: urgent  |  Opened: 2025-06-09 08:37 (299d ago)  |  Updated: 2025-06-09 08:37
Conversation (1 message)
Hello!
APEL data has not been updated since end of May 2025 and the June data is missing for at least T2_FI_HIP.
https://accounting.egi.eu/tier2/country/Finland/normcpu/VO/DATE/2025/1/2025/6/lhc/onlyinfrajobs/undefined/
This problem has happened previously: https://helpdesk.ggus.eu/#ticket/zoom/3062

Best regards, Tomas
CMS #682928 (id:3062) APEL March 2025 accounting data not up to date yet for FI_HIP_T2
State: assigned  |  Priority: urgent  |  Opened: 2025-04-03 09:28 (366d ago)  |  Updated: 2025-04-03 09:28
Conversation (1 message)
There is again a problem with the APEL accounting database: the APEL March 2025 accounting data is not yet up to date for FI_HIP_T2. From SGAS it can be seen that about 8.5 MHS23 was produced in March 2025, https://accounting.ndgf.org:6143/sgas/view/wlcg/tiersplit?startdate=2025-03&enddate=2025-03

As of today, APEL still lists only 4.7 MHS06 produced at HIP, https://accounting.egi.eu/tier2/country/Finland/normcpu/VO/DATE/2025/3/2025/3/

Can you please look into this problem and correct it?

Best regards, Tomas Lindén
WLCG tickets (1)
WLCG #681646 (id:1747) Request to deploy IPv6 on CEs and WNs at WLCG sites (FI_HIP_T2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:02 (430d ago)  |  Updated: 2025-11-26 14:33
Conversation (18 messages)
GGUS ID: 164340
Last modifier: Andrea Sciaba
Date: 2023-11-29 06:44:17

Public Diary:
Any progress on this ticket?
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:36:26
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (FI_HIP_T2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_NDGF
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
your estimate of the timescale for the deployment;
a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what are the obstacles and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
GGUS ID: 164340
Last modifier: Ville Salmela
Date: 2023-11-29 11:03:11

Public Diary:
FI_HIP_T2 internal note: IPv6 routing must be enabled between compute and storage for this to work.
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Andrea Sciaba
Date: 2023-11-29 13:51:43

Status: in progress
Responsible Unit: NGI_NDGF
Public Diary:
FI_HIP_T2 internal note: IPv6 routing must be enabled between compute and storage for this to work.
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Ville Salmela
Date: 2024-02-09 13:21:49

Public Diary:
Reminder sent
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Mattias Wadenstein
Date: 2024-05-31 12:32:12

Status: on hold
Responsible Unit: NGI_NDGF
Public Diary:
Assigning to the manager of the Finnish CMS Tier-2 compute cluster.
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Andrea Sciaba
Date: 2024-09-18 09:53:47

Public Diary:
Hi Tomas,
did you make any progress on IPv6?
Cheers,
Andrea
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Tomas Linden
Date: 2024-05-31 13:12:22

Status: in progress
Responsible Unit: NGI_NDGF
Public Diary:
We plan to install Alma Linux 9.4 on another cluster and to have IPv6 and the latest ARC version on that. We will try to meet the end-of-June deadline for this, but it is not guaranteed that it can be met. The current ARC CE kale-cms needs to be totally reinstalled for IPv6 to make sense, so that will be done later.
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Tomas Linden
Date: 2024-09-19 06:53:56

Public Diary:
Hello Andrea!

Unfortunately not yet. We have problems with job submission to our new ARC 7 Release Candidate 1 CE, both with X.509 proxies and with tokens, even though standard ARC test jobs run fine. The submission problems need to be fixed first, and then the cluster needs to be scaled up to production size, before we have time for IPv6.

Best regards, Tomas
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Andrea Sciaba
Date: 2025-01-24 16:06:16

Public Diary:
Hi Tomas,
just checking if there was any progress...
Andrea
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Tomas Linden
Date: 2025-01-27 14:14:18

Public Diary:
Hello Andrea!

Unfortunately the cluster work has progressed much more slowly than expected, but there has been some progress. We are working on the internal network configuration of the cluster and preparing to scale up the number of nodes. We have only a single node with a faster external link (10 Gb/s), so if we configure IPv6 on that one, route all traffic from the worker nodes' InfiniBand network through it, and don't expose the worker nodes to the public network, then I guess that would solve the IPv6 requirement?

Best regards, Tomas
Internal Diary:
Escalated this ticket to NGI_NDGF
GGUS ID: 164340
Last modifier: Andrea Sciaba
Date: 2025-01-28 18:47:04

Public Diary:
Hi Tomas,
I'd say so, after all the real requirement is that WNs can open a connection with an external host via IPv6. Of course, this node should not risk becoming a bottleneck...
Let me know when it's done, so I'll check ETF and close the ticket.
Thanks,
Andrea
Internal Diary:
Escalated this ticket to NGI_NDGF
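The requirement stated above (worker nodes able to open outbound IPv6 connections to an external host) can be smoke-tested from a WN with a short probe; a minimal sketch, where the example target host is just an assumption:

```python
# Probe outbound IPv6 connectivity from a worker node.
# Sketch: attempts a TCP connection restricted to AF_INET6.
import socket

def can_connect_ipv6(host, port, timeout=5.0):
    """Return True if an IPv6 TCP connection to host:port succeeds."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6,
                                   socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record or no IPv6 resolution
    for family, socktype, proto, _, addr in infos:
        try:
            with socket.create_connection((addr[0], port), timeout=timeout):
                return True
        except OSError:
            continue
    return False

# Example (run from a WN): can_connect_ipv6("cmssst.web.cern.ch", 443)
```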
Hi Tomas,
do you have any news on the IPV6 deployment?
Thanks!
Andrea
Hello Andrea!

We are still scaling up the new cluster, but the IPv6 deployment has unfortunately not yet progressed. There is still quite a lot of work remaining before we can get to the IPv6 deployment, but I will keep you informed about this.

Best regards, Tomas
Hi Tomas,
are there any news?
Cheers,
Andrea
Hello Andrea!

Thanks for the reminder. Our new cluster is at 72% of the intended capacity, so there is still work remaining, especially on networking, before we get to the IPv6 deployment.

Best regards, Tomas
Hi Tomas, any recent progress by chance?
Thanks,
Andrea
Hello Andrea!
Unfortunately not really as there are many other tasks that have to be
done at our site. I have discussed the need for IPv6 with the
IT-department of the University of Helsinki, but not more than that so
far.
Best regards, Tomas
On Wed, 26 Nov 2025, helpdesk@ggus.org wrote:
> GGUS Helpdesk Notification
> Ticket [1] #1747 "Request to deploy IPv6 on CEs and WNs at WLCG sites (FI_HIP_T2)"
> was updated by Andrea Sciaba on 2025-11-26 13:52 (UTC).
>
> ___
>
> Hi Tomas, any recent progress by chance?
> Thanks,
> Andrea
>
> ___
>
> Ticket is assigned to NGIs → NGI_NDGF
>
> Affected VO is none.
>
> [2] https://helpdesk.ggus.eu/#ticket/zoom/1747
> You are receiving this because you were subscribed via Mailing List (Site Contact Email) in this ticket. |[3] Manage your notification settings| EGI/WLCG
>
> [1] https://helpdesk.ggus.eu/#ticket/zoom/1747
> [2] https://helpdesk.ggus.eu/#ticket/zoom/1747
> [3] https://helpdesk.ggus.eu/#profile/notifications
___________________________________________
Dr Tomas Lindén, Tomas.Linden@Helsinki.FI
Helsinki Institute of Physics (HIP)
P.O. Box 64 (Gustaf Hällströmin katu 2)
FI-00014 UNIVERSITY OF HELSINKI, Finland
deskphone: +358-2-941 505 63
http://www.helsinki.fi/~tlinden/eindex.html
             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM           81%  99%  97% 100%  94%  94% 100%  97% 100% 100% 100% 100%  99% 100% 100% 100%
HammerCloud  100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%  99%
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (4)

WLCG tickets (4)
WLCG #1002262 (id:1002262) GRIF: Transfer failures as src and dst, both deletion failures
State: assigned  |  Priority: less urgent  |  Opened: 2026-04-04 06:14 (0d ago)  |  Updated: 2026-04-04 06:14
Conversation (1 message)
Dear experts,
In the last 5 hours, site GRIF has had many transfer failures as both source and destination site, as well as many deletion errors.
Main error:
Transfer: Result curl error (60): SSL peer certificate or SSH remote key was not OK after 1 attempts ~40k
Deletion: The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink HTTP 403 : Permission refused ~17k

Detail:
2026/04/04 06:10:42
Data Consolidation
data15_13TeV
DAOD_BPHY14.49525875._000246.pool.root.1
Result curl error (60): SSL peer certificate or SSH remote key was not OK after 1 attempts
transfer-failed
INFN-NAPOLI-ATLAS_DATADISK
GRIF_DATADISK
0 s
5.03 MB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/742b7b58-2fec-11f1-a96d-fa163e5ce271
davs://t2-dcache-01.na.infn.it:443/dpm/na.infn.it/home/atlas/atlasdatadisk/rucio/data15_13TeV/7e/ee/DAOD_BPHY14.49525875._000246.pool.root.1?copy_mode=pull
davs://eos.grif.fr:11000/eos/grif/atlas/atlasdatadisk/rucio/data15_13TeV/7e/ee/DAOD_BPHY14.49525875._000246.pool.root.1
davs
5031922
1149966683-1775283042000
1775283075
1775282713000
data15_13TeV.00267638.physics_Main.deriv.DAOD_BPHY14.r13313_p4910_p7234_tid49525875_00
data15_13TeV
DAOD_BPHY14
UNKNOWN
true
UNKNOWN
458380115
1775283051000
Link:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&var-enr_filters=data.src_experiment_site%7C%21%3D%7CCERN-PROD&var-enr_filters=data.dst_experiment_site%7C%21%3D%7CCERN-PROD&var-enr_filters=data.dst_experiment_site%7C%21%3D%7CNET2&var-enr_filters=data.src_experiment_site%7C%21%3D%7CUAM-LCG2&var-enr_filters=data.purged_reason%7C%3D%7CResult+curl+error+%2860%29%3A+SSL+peer+certificate+or+SSH+remote+key+was+not+OK+after+1+attempts&from=1775239842618&to=1775283042618
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&var-enr_filters=data.src_experiment_site%7C%21%3D%7CCERN-PROD&var-enr_filters=data.dst_experiment_site%7C%21%3D%7CCERN-PROD&var-enr_filters=data.dst_experiment_site%7C%21%3D%7CNET2&var-enr_filt
WLCG #1002059 (id:1002059) changes proposed for ALICE job properties (GRIF)
State: assigned  |  Priority: less urgent  |  Opened: 2026-03-11 21:33 (23d ago)  |  Updated: 2026-03-31 16:16
Conversation (3 messages)
Dear colleagues,
to help improve job efficiencies, ALICE would be interested in
changing the core count and/or lifetime of ALICE production jobs.

Please let us know which of these changes can be tried at your site:

1. Increasing the core count from 8 to 16?
2. Increasing the lifetime to 48 hours?
3. Increasing the lifetime to 72 hours?

If a given change is found to cause issues, it will be reverted.

If you prefer ALICE jobs to keep running as they are, that also works.

Thanks for your consideration!
Hello,

After discussion within GRIF sites, I can say:
- the lifetime is (supposed to be?) already 72hours
- we are not against 16-core jobs, but we need to study how to implement them in HTCondor-CE and ARC-CE and optimize the defrag parameters; maybe you have information from sites that already have a mix of 1-8 and 16-core jobs?
Do you know what the ratio of 16-core to 8-core jobs would be?

Regards,
Sophie for the GRIF team
Hello Sophie, for now the next jobs at your site will only ask for a TTL of 3 days: let's see...
WLCG #1002190 (id:1002190) LHCb Disk resources at GRIF
State: assigned  |  Priority: less urgent  |  Opened: 2026-03-24 15:00 (11d ago)  |  Updated: 2026-03-30 08:07
Conversation (2 messages)
Dear colleagues,

Now that the 2026 data taking has started, we are reviewing the state of the storage
pledges of the LHCb T1 and T2 sites.

Earlier this year, we were informed that (with a few exceptions) the sites would have
the pledged disk and tape resources available to the experiment, and we hope you
can confirm that.

GRIF pledges 1513 TB of disk space for 2026. However, the Storage Resource Record (SRR)
advertises only 1272 TB [1]. Do you know what explains that discrepancy?

Thanks in advance for your efforts to make these crucial resources available
to us!

best regards, Jan van Eldik / LHCb Compute project lead

[1]
+---------+--------------+-----------+-----------+----------+
| Site | Share | Size | Used | Fraction |
+---------+--------------+-----------+-----------+----------+
| GRIF | LHCb-Disk | 1272.0 | 1210.8 | 95.2% |
+---------+--------------+-----------+-----------+----------+
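The pledge-vs-SRR comparison above can be automated against a site's SRR JSON; a minimal sketch assuming the standard WLCG SRR layout (storageservice → storageshares, sizes in bytes), with the share name and figures taken from the ticket and the helper purely illustrative:

```python
# Compare a WLCG SRR storage share against the pledged capacity.
# Assumed SRR layout: {"storageservice":
#   {"storageshares": [{"name": ..., "totalsize": <bytes>, "usedsize": <bytes>}]}}

def share_terabytes(srr, share_name):
    """Return (total_TB, used_TB) for the named storage share."""
    for share in srr["storageservice"]["storageshares"]:
        if share["name"] == share_name:
            return share["totalsize"] / 1e12, share["usedsize"] / 1e12
    raise KeyError(share_name)

# Figures from the ticket: 1513 TB pledged, 1272 TB advertised.
srr = {"storageservice": {"storageshares": [
    {"name": "LHCb-Disk", "totalsize": 1272.0e12, "usedsize": 1210.8e12},
]}}
total, used = share_terabytes(srr, "LHCb-Disk")
print(f"advertised {total:.0f} TB, pledge shortfall {1513 - total:.0f} TB")
# -> advertised 1272 TB, pledge shortfall 241 TB
```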
Hello, do you have some feedback for us?
Thanks in advance, Jan
WLCG #1002121 (id:1002121) Upgrade your HTCondorCE endpoints to 24.0.x series (GRIF)
State: assigned  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 15:19
Conversation (3 messages)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and the endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry on activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which are the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all the changes must be applied manually, in particular the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
Hello, GRIF-LLR has the condor version :
$CondorVersion: 23.10.29 2025-09-22 BuildID: 834959 PackageID: 23.10.29-1 GitSHA: ded6225d $
$CondorPlatform: x86_64_AlmaLinux9 $
Hi Anne,
thanks for the reply. Please let me know about the upgrading plans.

Best regards,
Alessandro
             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM          100% 100% 100% 100% 100%  90%  93% 100% 100% 100%  95% 100% 100% 100% 100% 100%
HammerCloud  100%  99% 100% 100% 100%  99% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (2)

CMS tickets (1)
CMS #682717 (id:2850) Request for XRootD Upgrade and Network Packet Labeling Configuration (IN2P3-IRES)
State: in progress  |  Priority: less urgent  |  Opened: 2025-03-18 14:46 (382d ago)  |  Updated: 2025-10-20 13:19
Conversation (15 messages)
Dear Site Administrators,

CMS will resume data taking next month and expand the use of tokens. Significant improvements and bug fixes for tokens have been made in XRootD. We kindly request all CMS sites using native XRootD to upgrade their XRootD services to the latest version (5.7.3), including site redirectors and storage services.

We would also like to take this opportunity to encourage sites to enable network packet labelling by adding the following four configuration lines to both XRootD and redirectors:

xrootd.pmark ffdest eu.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path-to-store>/store cms

If your site supports multiple VOs on the same XRootD service or requires additional details, please refer to the SciTag Network Packet Labeling Twiki page: https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesXrootdScitagPacketLabeling.

Our target date for completing the upgrade and configuration update is April 5th, in preparation for the LHC commissioning for 2025.

Thank you for your cooperation. Please let us know if you have any questions or concerns.

Best regards,
Jakrapop and Noy
CMS Site Support
************************************************************************************
This is an automated mail. When replying don't change the subject line!

************************************************************************************
Ticket Link: https://helpdesk.ggus.eu/#ticket/zoom/2842
Hi,
We have a mix of xrootd (for CMS AAA) and dCache. What should we do?
Hi Jerome,
For your site, you only need to upgrade the XRootD redirector to version 5.7.3. We will run a packet labelling campaign for EOS and dCache later.
Best,
Jakrapop
Please both upgrade and add the network packet labeling to the re-director xrootd config.
Thanks,
- Stephan
Dears,
We will upgrade our dCache / xrootd tomorrow (a downtime has been declared). As EPEL provides only 5.8.1-1.el8 and UMD-5 provides 5.7.1, we will upgrade to version 5.8.1.
Best,
Jerome
The XRootD RPMs have been upgraded to v5.8.1 and the configuration lines have been added to the xrootd configuration. CMS is the only user of the xrootd instance (we use it only for joining the CMS federation). Should we add anything to the dCache configuration, which also has an XRootD door?
Hello Jerome, according to the SAM test, your XRootD version is still 5.6.7 [1]. Could you please take a look?
Thank you,
Noy
[1]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-3version&var-dst_hostname=sbgse1.in2p3.fr&var-timestamp=1748021082000
Hello Jerome, the SAM test shows your site still uses XRootD version 5.6.7, which is considered outdated. We recommend you update to 5.7.3 or newer. Could you please provide an upgrade plan and add the new packet labeling configuration?
Cheers,
Noy
[1]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-3version&var-dst_hostname=sbgse1.in2p3.fr&var-timestamp=1755163650000
The current version I recommend is 5.8.3 or newer.
Any update? Thank you --- Noy
Hello Jerome. All XRootD endpoints on your site were already upgraded to 5.8.3. Could you please confirm the packet labeling configuration on both endpoints?
Cheers,
Noy
The GGUS infrastructure had a problem sending notifications between 11:00 and 12:00 today. I'm adding this note to trigger the notification again, and I want to apologize for any inconvenience.

Kind regards,
Pavel Weber
Hi,
On sbgse1 (xrootd redirector), the following package is installed: xrootd-server-5.8.3-1.el9.x86_64
On sbgdcache, xrootd is provided through dCache (v9.2.37).

Kind regards,
Jerome
Hello Jerome. Thank you for your confirmation. According to this document [1], could you please verify that the following configuration lines are implemented on your XRootD server?
xrootd.pmark ffdest eu.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path>/store cms
xrootd.pmark map2act cms default default

Thank you,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesXrootdScitagPacketLabeling
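A quick way to verify such settings is to check that the required directives appear in the xrootd configuration; a minimal sketch (the helper and sample config are illustrative, not a CMS tool):

```python
# Check that an xrootd config contains the required pmark directives.

REQUIRED = [
    "xrootd.pmark ffdest eu.scitags.org:10514",
    "xrootd.pmark domain any",
    "xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json",
]

def missing_directives(config_text, required=REQUIRED):
    """Return the required directives not present in the config text."""
    # Normalize whitespace so spacing differences don't cause false alarms.
    lines = {" ".join(l.split()) for l in config_text.splitlines()}
    return [d for d in required if d not in lines]

cfg = """\
xrootd.pmark ffdest eu.scitags.org:10514
xrootd.pmark domain any
"""
print(missing_directives(cfg))
# -> ['xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json']
```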
The dCache configuration is missing on:
https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesXrootdScitagPacketLabeling
WLCG tickets (1)
WLCG #1002204 (id:1002204) Space used inconsistency in IPHC SE for Belle II
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-26 16:35 (9d ago)  |  Updated: 2026-04-03 07:29
Conversation (4 messages)
Hi,

We have some inconsistency between the space used by Belle II reported in https://scigne.fr/resources/srr/sbgdcache_srr.json and what our data management system expects. We would like to run a consistency check to understand the situation. Can you please provide a dCache dump of all the files under https://sbgdcache.in2p3.fr/belle ?

Regards,
Cedric
Any news ?
Hi,

Sorry for the late reply.
I used: ls -lR /pnfs/sbgdcache.in2p3.fr/belle/ ; there might be a better way, but the output is attached.
Cheers,
Yannick
Hi,

Thanks.
I'm running the consistency check.

Cheers,
Cedric
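A consistency check of this kind typically diffs the SE dump against the catalogue: files present on disk but unknown to the catalogue ("dark data"), and catalogued files missing from disk. A minimal sketch over two sets of paths (the helper and sample paths are illustrative, not Belle II tooling):

```python
# Diff a storage-element dump against the file catalogue.
# Sketch: both inputs are iterables of logical file paths.

def consistency_check(se_dump, catalogue):
    """Return (dark_data, lost_files) as sorted lists of paths."""
    se, cat = set(se_dump), set(catalogue)
    dark = sorted(se - cat)   # on disk, not in the catalogue
    lost = sorted(cat - se)   # in the catalogue, not on disk
    return dark, lost

se_dump = ["/belle/raw/a.root", "/belle/raw/b.root", "/belle/tmp/junk.root"]
catalogue = ["/belle/raw/a.root", "/belle/raw/b.root", "/belle/raw/c.root"]
dark, lost = consistency_check(se_dump, catalogue)
print("dark data:", dark)   # ['/belle/tmp/junk.root']
print("lost files:", lost)  # ['/belle/raw/c.root']
```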
             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM           96% 100% 100% 100% 100% 100% 100% 100%  96% 100% 100% 100% 100% 100%  97% 100%
HammerCloud   99% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (2)

WLCG tickets (2)
WLCG #1002117 (id:1002117) Upgrade your HTCondorCE endpoints to 24.0.x series (BUDAPEST)
State: in progress  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 15:04
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and the endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry on activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which are the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all the changes must be applied manually, in particular the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
WLCG #1001941 (id:1001941) changes proposed for ALICE job properties (BUDAPEST)
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-02 16:17 (33d ago)  |  Updated: 2026-03-06 15:15
Conversation (7 messages)
Dear colleagues,
to help improve job efficiencies, ALICE would be interested in
changing the core count and/or lifetime of ALICE production jobs.

Please let us know which of these changes can be tried at your site:

1. Increasing the core count from 8 to 16?
2. Increasing the lifetime to 48 hours?
3. Increasing the lifetime to 72 hours?

If a given change is found to cause issues, it will be reverted.

If you prefer ALICE jobs to keep running as they are, that also works.

Thanks for your consideration!
Dear Maarten,

> 1. Increasing the core count from 8 to 16?

We can try this and check if we need any configuration changes. Our WNs have two 24 vcore CPUs that are set up at the moment as one slot with auto partitioning.

> 2. Increasing the lifetime to 48 hours?

Our limit is 48 hours (wall clock) right now.

> 3. Increasing the lifetime to 72 hours?

I don't see any problems with going up to 72 hours if needed.

Cheers:
Csaba
Hi Csaba, would you prefer these changes to be tried out in several steps?
Hi Maarten,

You can start sending 16-core, 72-hour jobs right away; they will start gradually as the 8-core ones finish.

I found that our wall clock limit is already set to 72 hours, so the jobs (at least for CMS) limited themselves to 48 h.
Hi again, as of 12:00 today, new jobs are submitted with those parameters: let's see...
Hi Maarten, it looks ok from our side; all Alice jobs use 16 cores and the first ones have been running for over 48 hours.
Hi again, let's check again after the weekend and then hopefully close the ticket, thanks!
             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM          100% 100%  10%   0%  57%  91%  40%  47%  33%  53%  33%   4%   6%  17%   2%   3%
HammerCloud   99% 100% 100% 100%  46%  70% 100% 100%  99% 100% 100% 100% 100% 100%  (only 14 values in source)
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (3)

WLCG tickets (3)
WLCG #681756 (id:1857) Request to implement BGP tagging of LHCONE prefixes. (INDIACMS-TIFR)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:18 (430d ago)  |  Updated: 2025-02-03 14:09
Conversation (3 messages)
GGUS ID: 168355
Last modifier: Julia Andreeva
Date: 2025-01-28 14:28:00

Public Diary:
Any progress on this ticket?
Internal Diary:
Escalated this ticket to ROC_Asia/Pacific
GGUS ID: 168355
Last modifier: Julia Andreeva
Date: 2024-09-23 15:43:22
Subject: Request to implement BGP tagging of LHCONE prefixes. (INDIACMS-TIFR)
Ticket Type: USER
CC: ;edoardo.martelli@cern.ch
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Network problem
Description:
This ticket concerns all the sites connected to LHCONE.

In agreement with the WLCG Management Board, it has been decided to
implement the tagging of the IP prefixes announced to LHCONE.
The task consists of tagging the IP prefixes that your site announces to
LHCONE with all the BGP communities that identify the experiments and
collaborations supported by your site. The initial goal is to document
the use of the network. In the longer term the tags may be used to
reduce the exposure on the LHCONE connection, by filtering unnecessary
prefixes.

You will find the values of the BGP communities to use and other
information in these pages:
- https://twiki.cern.ch/twiki/bin/view/LHCONE/MultiOneBGPcommunities
-
https://indico.cern.ch/event/1356138/contributions/6123461/attachments/2925447/5147273/WLCG-20240911-GDB-MultiONE-implementation.pdf

If you need any support on this task, please don't hesitate to ask your
NREN or LHCONE provider.
Or just reply to this ticket asking your questions; experts will guide
you in the implementation.

Please take this opportunity also to review the network information
related to your site in CRIC :
https://wlcg-cric.cern.ch/core/networkroute/list/

We ask you to perform the required action by the end of March 2025.
GGUS ID: 168355
Last modifier: Julia Andreeva
Date: 2025-01-28 14:28:19

Public Diary:
Any progress on this ticket?
Internal Diary:
Escalated this ticket to ROC_Asia/Pacific
WLCG #681740 (id:1841) No accounting data for June in the EGI portal, WLCG accounting validation for the site has not been performed (INDIACMS-TIFR)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:15 (430d ago)  |  Updated: 2025-02-03 11:17
Conversation (1 message)
GGUS ID: 167729
Last modifier: Julia Andreeva
Date: 2024-07-31 08:51:49
Subject: No accounting data for June in the EGI portal, WLCG accounting validation for the site has not been performed (INDIACMS-TIFR)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Other
Description:
You are getting this ticket because CPU consumption for June in the EGI portal for your site is 0; moreover, you did not perform validation for June using the CRIC UI, where you can inject numbers from your local accounting. As a result, your CPU metrics in the WLCG June accounting report will be 0. Please make sure that the APEL accounting data flow is fixed for your site for the next month. If it will take a longer time to investigate and fix the problem, please use the CRIC monthly accounting validation procedure.
WLCG #681737 (id:1838) Missing CPU accounting data in the EGI portal for March. Monthly accounting validation has not been performed either. (INDIACMS-TIFR)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:15 (430d ago)  |  Updated: 2025-02-03 11:16
Conversation (10 messages)
GGUS ID: 166649
Last modifier: Julia Andreeva
Date: 2024-04-29 13:50:11
Subject: Missing CPU accounting data in the EGI portal for March. Monthly accounting validation has not been performed either. (INDIACMS-TIFR)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Other
Description:
CPU accounting data is missing for your site in the EGI accounting portal for March, and validation has not been performed either. Please make sure that your site is properly reporting to APEL. While solving the problem, please provide proper accounting metrics during monthly validation.
GGUS ID: 166649
Last modifier: Puneet Kumar Patel
Date: 2024-07-23 10:10:58
Changed CC to puneet.kumar.patel@cern.ch

Public Diary:
Hi Julia,
Thank you for the ticket.
It has been forwarded to me recently; unfortunately, I did not receive a direct alert in this regard. I will work on this and update here as soon as possible.

kind regards,
Puneet

Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 166649
Last modifier: Puneet Kumar Patel
Date: 2024-08-07 09:57:22

Public Diary:
Hi Julia,
The APEL parser is not able to parse the blah records or push them to the database; it is throwing the following two errors [1]. I'm not sure if this is related to the updated database and apelparser packages in AlmaLinux 9.
Kindly point me to the latest documentation (if any).

[1]
parser - WARNING - Failed to parse file. Is BlahParser correct?
&
ERROR - Error loading records: (1305, 'PROCEDURE apelclient.ReplaceProcessedFile does not exist')
GGUS ID: 166649
Last modifier: Puneet Kumar Patel
Date: 2024-08-07 10:17:03

Public Diary:
I think the database schema is old or not correct. I'm trying to fix this one first.
thanks,
Puneet
GGUS ID: 166649
Last modifier: Puneet Kumar Patel
Date: 2024-08-08 04:26:48

Public Diary:
Hi,
The client database schema has been fixed now; job records are getting stored in the local database.
I am currently working on sending these records to the accounting portal.
GGUS ID: 166649
Last modifier: Puneet Kumar Patel
Date: 2024-08-13 11:06:47

Public Diary:
Hi Julia,
ssmsend is not able to find any record [1]. If I'm correct, the JobRecord table should have some records in the database, right?
Also, there is a "Failed to parse file" warning [2] for blah records, and 'userFQAN' is missing in the parsed blah file [3].
Any suggestion would be very helpful.

thanks,
Puneet

[1]
INFO:ssmsend:Starting sending SSM version 3.4.0.
2024-08-13 16:30:03,663 - ssmsend - INFO - Starting sending SSM version 3.4.0.
INFO:ssmsend:Setting up SSM with protocol: AMS
2024-08-13 16:30:03,663 - ssmsend - INFO - Setting up SSM with protocol: AMS
INFO:ssmsend:No AMS token provided, using cert/key pair instead.
2024-08-13 16:30:03,663 - ssmsend - INFO - No AMS token provided, using cert/key pair instead.
INFO:ssmsend:No server certificate supplied. Will not encrypt messages.
2024-08-13 16:30:03,663 - ssmsend - INFO - No server certificate supplied. Will not encrypt messages.
INFO:ssmsend:Messages will be signed using /C=TW/O=AP/OU=GRID/CN=condor-ce01.indiacms.res.in
2024-08-13 16:30:03,664 - ssmsend - INFO - Messages will be signed using /C=TW/O=AP/OU=GRID/CN=condor-ce01.indiacms.res.in
INFO:ssmsend:No messages found to send.
2024-08-13 16:30:06,429 - ssmsend - INFO - No messages found to send.
INFO:ssmsend:SSM has shut down.
2024-08-13 16:30:06,429 - ssmsend - INFO - SSM has shut down.

[2]
parser log:
2024-08-13 15:04:20,346 - parser - INFO - Setting up parser for blah files
2024-08-13 15:04:20,402 - apel.parsers.parser - INFO - Site: INDIACMS-TIFR; batch system: condor-ce01.indiacms.res.in
2024-08-13 15:04:20,402 - parser - INFO - Scanning directory: /var/lib/condor/accounting/
2024-08-13 15:04:20,408 - parser - INFO - Files skipped: rerun at DEBUG log level to see details.
2024-08-13 15:04:20,946 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240813-1504-condor-ce01
2024-08-13 15:04:20,947 - parser - WARNING - Failed to parse file. Is BlahParser correct?
2024-08-13 15:04:21,015 - parser - INFO - Finished parsing blah log files.
2024-08-13 15:04:21,018 - parser - INFO - ========================================
2024-08-13 15:04:21,018 - parser - INFO - Setting up parser for HTCondor files
2024-08-13 15:04:21,068 - apel.parsers.parser - INFO - Site: INDIACMS-TIFR; batch system: condor-ce01.indiacms.res.in
2024-08-13 15:04:21,068 - apel.parsers.parser - INFO - Parser will retrieve per-processor accounting information.
2024-08-13 15:04:21,069 - apel.parsers.htcondor - INFO - Site: INDIACMS-TIFR; batch system: condor-ce01.indiacms.res.in
2024-08-13 15:04:21,069 - parser - INFO - Scanning directory: /var/lib/condor/accounting/
2024-08-13 15:04:21,070 - parser - INFO - Files skipped: rerun at DEBUG log level to see details.
2024-08-13 15:04:21,567 - parser - INFO - Parsing file: /var/lib/condor/accounting/batch-20240813-1504-condor-ce01
2024-08-13 15:04:21,615 - parser - INFO - Parsed 112 lines
2024-08-13 15:04:21,615 - parser - INFO - Ignored 0 lines (incomplete jobs)
2024-08-13 15:04:21,615 - parser - INFO - Failed to parse 0 lines
2024-08-13 15:04:21,689 - parser - INFO - Finished parsing HTCondor log files.

[3]
blah record:
[root@condor-ce01 apel]# tail /var/lib/condor/accounting/blah-20240813-1504-condor-ce01
"timestamp=2024-08-13 08:55:15" "userDN=/DC=ch/DC=cern/OU=computers/CN=cmspilot02\/vocms0804.cern.ch" "ceID=condor-ce01.indiacms.res.in:9619/condor-ce01.indiacms.res.in-condor" "jobID=41644.0_condor-ce01.indiacms.res.in" "lrmsID=condor-ce01.indiacms.res.in#47199.0#1723538093" "localUser=cmsjob"
"timestamp=2024-08-13 08:57:00" "userDN=/DC=ch/DC=cern/OU=computers/CN=cmspilot02\/vocms0804.cern.ch" "ceID=condor-ce01.indiacms.res.in:9619/condor-ce01.indiacms.res.in-condor" "jobID=41645.0_condor-ce01.indiacms.res.in" "lrmsID=condor-ce01.indiacms.res.in#47200.0#1723538216" "localUser=cmsjob"


GGUS ID: 166649
Last modifier: Julia Andreeva
Date: 2024-08-13 12:24:10

Public Diary:
Hi Puneet,
I've added Adrian who is APEL expert and might be able to help you.
Cheers
Julia
GGUS ID: 166649
Last modifier: Adrian Coveney
Date: 2024-08-20 15:50:11

Public Diary:
Involve person(s) has been changed to apel-admins@stfc.ac.uk.
Hi. I've been on leave, so have only just seen this. apel-admins is the better email address to add as it goes to more than just me. I'm wrapping up for the day but will try and have a look later in the week.
GGUS ID: 166649
Last modifier: Puneet Kumar Patel
Date: 2024-09-12 09:33:22

Public Diary:
Hi APEL team,
Any suggestion would be helpful.
thanks,
Puneet
GGUS ID: 166649
Last modifier: Puneet Kumar Patel
Date: 2024-09-25 04:27:47

Public Diary:
Hi,
Current updates are:
1. The parser is running; however, it looks like the blah records are not correct. Looking at the parser log [1] and comparing an old blah record file [2] with a recent one [3], it can be observed that extra characters (shown in bold [3]) are included in the recent blah files.

2. apelclient is throwing an 'Out of range value for column MemoryVirtual' error [4]. I am not sure which table I should look at or alter.
[1]
2024-09-24 16:06:38,585 - parser - ERROR - 'userFQAN' raised 99 times
2024-09-24 16:06:38,585 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240120-condor-ce01
2024-09-24 16:06:38,624 - parser - INFO - Parsed 59 lines
2024-09-24 16:06:38,624 - parser - INFO - Ignored 0 lines (incomplete jobs)
2024-09-24 16:06:38,624 - parser - INFO - Failed to parse 333 lines
2024-09-24 16:06:38,624 - parser - ERROR - 'userFQAN' raised 333 times
2024-09-24 16:06:38,625 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240121-condor-ce01
2024-09-24 16:06:38,628 - parser - WARNING - Failed to parse file. Is BlahParser correct?
2024-09-24 16:06:38,628 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240122-condor-ce01
2024-09-24 16:06:38,631 - parser - WARNING - Failed to parse file. Is BlahParser correct?
2024-09-24 16:06:38,632 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240123-condor-ce01
2024-09-24 16:06:38,635 - parser - WARNING - Failed to parse file. Is BlahParser correct?
2024-09-24 16:06:38,636 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240124-condor-ce01
2024-09-24 16:06:38,648 - parser - WARNING - Failed to parse file. Is BlahParser correct?
2024-09-24 16:06:38,648 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240125-condor-ce01

[2]
"timestamp=2024-01-31 06:13:48" "userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba" "ceID=condor-ce01.indiacms.res.in:9619/condor-ce01.indiacms.res.in-condor" "jobID=500422.0_condor-ce01.indiacms.res.in" "lrmsID=949382_condor-ce01.indiacms.res.in" "localUser=samjob"

[3]
"timestamp=2024-09-24 09:54:58" "userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba" "ceID=condor-ce01.indiacms.res.in:9619/condor-ce01.indiacms.res.in-condor" "jobID=50751.0_condor-ce01.indiacms.res.in" "lrmsID=condor-ce01.indiacms.res.in#56219.0#1727171617" "localUser=samjob"

[4]
2024-09-24 16:06:11,050 - parser - INFO - ========================================
2024-09-24 16:06:11,050 - parser - INFO - Starting apel parser version 2.1.0
2024-09-24 16:06:11,056 - apel.db.backends.mysql - INFO - Connected to localhost:3306
2024-09-24 16:06:11,056 - apel.db.backends.mysql - INFO - Database: apelclient; username: apel
2024-09-24 16:06:11,056 - parser - INFO - Connection to DB established
2024-09-24 16:06:11,056 - parser - INFO - ========================================
2024-09-24 16:06:11,056 - parser - INFO - Setting up parser for blah files
2024-09-24 16:06:11,153 - apel.parsers.parser - INFO - Site: INDIACMS-TIFR; batch system: condor-ce01.indiacms.res.in:9618/condor-ce01.indiacms.res.in-condor
2024-09-24 16:06:11,153 - parser - INFO - Scanning directory: /var/lib/condor/accounting/
2024-09-24 16:06:11,166 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240101-0033-condor-ce01
2024-09-24 16:06:11,183 - parser - INFO - Parsed 24 lines
2024-09-24 16:06:11,183 - parser - INFO - Ignored 0 lines (incomplete jobs)
2024-09-24 16:06:11,183 - parser - INFO - Failed to parse 0 lines
2024-09-24 16:06:11,183 - parser - INFO - Parsing file: /var/lib/condor/accounting/blah-20240101-0133-condor-ce01
2024-09-24 16:06:11,189 - parser - INFO - Parsed 9 lines
2024-09-24 16:06:11,189 - parser - INFO - Ignored 0 lines (incomplete jobs)
2024-09-24 16:06:11,189 - parser - INFO - Failed to parse 0 lines

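For illustration, the blah records quoted in [2] and [3] above are lines of quoted "key=value" tokens. A minimal sketch of such a parser (this is not APEL's actual BlahParser, and the sample line is shortened from the real records) shows how a record missing userFQAN would trigger the errors reported above:

```python
import shlex

def parse_blah_line(line: str) -> dict:
    """Split a blah accounting line of the form "k1=v1" "k2=v2" ... into a dict."""
    record = {}
    for token in shlex.split(line):       # shlex strips the surrounding quotes
        key, _, value = token.partition("=")
        record[key] = value
    return record

# Shortened sample in the style of the recent record [3]: note the
# '#'-separated lrmsID and the absence of a userFQAN field.
sample = ('"timestamp=2024-09-24 09:54:58" "userDN=/DC=ch/DC=cern/CN=example" '
          '"ceID=condor-ce01.example:9619/condor" "jobID=50751.0_ce01" '
          '"lrmsID=ce01#56219.0#1727171617" "localUser=samjob"')

rec = parse_blah_line(sample)
print("userFQAN" in rec)   # False: the key whose absence the parser reports
```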
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 99% 100% 57% 82% 100% 100% 92% 100% 93% 100% 100% 100% 100%
HammerCloud: 100% 99% 100% 100% 99% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (4)

CMS tickets (3)
CMS #1001718 (id:1001718) Deletions failing at Bari
State: in progress  |  Priority: less urgent  |  Opened: 2026-02-03 16:46 (60d ago)  |  Updated: 2026-03-27 11:39
Conversation (8 messages)
Dear colleagues,
It seems that deletions are failing at Bari for a while now.

The error is the following
The requested service is not available at the moment.
Details: An unknown exception occurred.
Details: DavPosix::unlink Could not connect to server

an example file would be

davs://webdav.recas.ba.infn.it:8443/cms/store/mc/Run3Summer22EEMiniAODv4/NMSSM_XtoYHto2W2Gto2L2Nu2G_MX-3000_MY-500_TuneCP5_13p6TeV_madgraph-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_postEE_v6-v2/2560000/8a4e0a71-a0d0-42a3-a343-2cd653e9bbf8.root

This has resulted in the site getting full, since our dynamic space cannot be released.
Could you please have a look and let us know, if you see any errors on your side?

Best,
Panos
Hello,
Did you maybe have a chance to have a look into this?

Thank you.

Best,
Panos
Hello,
That's a friendly ping. The site is unusable in this state.

Best,
Panos
Hi

1. SAM tests are running smoothly.
2. In the log I found DELETE requests correctly handled:
[2001:760:4227:0:0:0:0:121] 8443 "CN=Alan Malta Rodrigues,CN=718748,CN=amaltaro,OU=Users,OU=Organic Units,DC=cern,DC=ch" 2026-02-16T11:51:28.605Z "2873ed76-6bcd-4b28-9eba-856187a61bf4" "DELETE /cms/store/unmerged/RunIII2024Summer24MiniAODv6/GluGluToWW-OS-TT_TuneCP5_13p6TeV_madgraph-pythia8/MINIAODSIM/150X_mcRun3_2024_realistic_v2-v2/2820001/ab329641-553e-458b-9dba-2162b1fd7617.root HTTP/1.1" 204 0 9

ls: cannot access '/lustre/cms/store/unmerged/RunIII2024Summer24MiniAODv6/GluGluToWW-OS-TT_TuneCP5_13p6TeV_madgraph-pythia8/MINIAODSIM/150X_mcRun3_2024_realistic_v2-v2/2820001/ab329641-553e-458b-9dba-2162b1fd7617.root': No such file or directory

[root@webdav-3-10-18-b ~]# ll /lustre/cms/store/unmerged/RunIII2024Summer24MiniAODv6/GluGluToWW-OS-TT_TuneCP5_13p6TeV_madgraph-pythia8/MINIAODSIM/150X_mcRun3_2024_realistic_v2-v2/2820001/
total 13888800
-rw-rw-r--+ 1 storm storm 97563673 Feb 16 12:50 01d1f93e-524b-4468-b1a1-216f2f86cb0a.root
-rw-rw-r--+ 1 storm storm 96257333 Feb 16 12:23 03b533ce-5878-462d-a029-b3ab390d63dc.root

Is the issue still there?

Ale
Hi Ale,
Thanks for having a look in this. The deletions you see seem to be coming from Alan's private certificate (which is probably used in the CMS WM system). Rucio should be using either tokens or this DN: DC=ch, DC=cern, OU=Organic Units, OU=Users, CN=cmsrucio, CN=779320, CN=Robot: CMS Rucio Data Transfer

Rucio is still unable to delete any file at the site.
I attach a dump of deletion-failed events from the past 15 minutes; I hope these point to some files that cannot be deleted and help with further debugging.

Cheers,
Panos
Hi Ale, any news on this? I see that on the 25th of February deletions did eventually work. Did you change anything?
Hi. I haven't made any changes.

Let us know how it is going now.

Ale
Anyway, I haven't received the notifications about your last messages.
Please CC alessandro.italiano@ba.infn.it for a smooth interaction.
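The failing unlink above can be reproduced by hand with a bare WebDAV DELETE. A stdlib sketch follows (the certificate paths are placeholders, and real CMS deletions go through Rucio/gfal2 with the cmsrucio robot DN mentioned above; this is only a manual cross-check):

```python
import http.client
import ssl
from urllib.parse import urlparse

def parse_davs(url: str):
    """Split a davs:// URL into (host, port, path) for a raw WebDAV request."""
    u = urlparse(url)
    return u.hostname, u.port or 443, u.path

def webdav_delete(url: str, certfile: str, keyfile: str) -> int:
    """Issue an HTTP DELETE with an X.509 client certificate; return the status.
    204, as in the access log excerpt above, means the file was deleted."""
    host, port, path = parse_davs(url)
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    conn = http.client.HTTPSConnection(host, port, context=ctx, timeout=30)
    conn.request("DELETE", path)
    status = conn.getresponse().status
    conn.close()
    return status

print(parse_davs("davs://webdav.recas.ba.infn.it:8443/cms/store/example.root"))
# webdav_delete(..., "/tmp/usercert.pem", "/tmp/userkey.pem")  # placeholder paths
```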
CMS #1001022 (id:1001022) Request for GPU resource verification – HEPSCORE benchmarking tests
State: waiting for submitter's reply  |  Priority: less urgent  |  Opened: 2025-10-31 16:34 (155d ago)  |  Updated: 2026-01-08 10:53
Conversation (4 messages)
Dear Site Admin,

I’m contacting you because I’m trying to run some GPU benchmarking tests using HEPSCORE, but the pilot jobs submitted to your site appear to be idle.

Bari used to provide GPU resources, so I’d like to check whether GPU pilots are still accepted and if there have been any configuration changes on your side.

Below are some example jobs currently affected:

[mmascher@vocms0206 ~]$ entry_q CMSHTPC_T2_IT_Bari_recas_ce03_gpu -all -af gridjobid
condor ce-03.recas.ba.infn.it ce-03.recas.ba.infn.it:9619 632719.0
condor ce-03.recas.ba.infn.it ce-03.recas.ba.infn.it:9619 633037.0

Could you please:

Confirm whether GPU resources are still available for CMS pilot jobs;

Check if there are any local issues preventing pilot job execution.

Thank you for your help!

Best regards,
Marco Mascheroni
CMS Submission Infrastructure Team
The queued jobs have been "edited" so that they could run on the GPU nodes. Many are running or already finished.
We have also restored the jobRoute for CMS GPU jobs.
[root@wn-gpu-7-9-28 ~]# nvidia-smi
Fri Dec 19 10:03:06 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:21:00.0 Off | 0 |
| N/A 28C P0 36W / 250W | 657MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:81:00.0 Off | 0 |
| N/A 24C P0 35W / 250W | 763MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3790263 C cmsRun 648MiB |
| 1 N/A N/A 3800333 C cmsRun 630MiB |
+-----------------------------------------------------------------------------------------+
[root@wn-gpu-7-9-28 ~]#
I noticed, however, that even though the jobs are assigned a GPU, they do not use it [GPU-util at 0%].
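A scripted version of the check above could flag GPU slots whose jobs never touch the device. This sketch parses nvidia-smi's standard CSV query mode; the sample output is fabricated to match the two idle A100s shown above:

```python
import csv
import io

def idle_gpus(csv_text: str) -> list[int]:
    """Indices of GPUs reporting 0% utilization, from the output of
    `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader`."""
    idle = []
    for row in csv.reader(io.StringIO(csv_text)):
        index, util = int(row[0]), row[1].strip().rstrip(" %")
        if int(util) == 0:
            idle.append(index)
    return idle

# Fabricated sample matching the situation above: both A100s assigned but idle.
sample = "0, 0 %\n1, 0 %\n"
print(idle_gpus(sample))   # [0, 1]
```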
CMS #683073 (id:3207) CMS Frontier activity from T2_IT_Bari
State: waiting for submitter's reply  |  Priority: very urgent  |  Opened: 2025-04-17 12:30 (352d ago)  |  Updated: 2025-12-19 09:25
Conversation (5 messages)
Hello, Although your CMS Frontier squids are working, there has been substantial fail-over to the Backup Proxies at CERN.
For example:

wn-7-6-26.recas.ba.infn.it
3,388
3,388
828.59 MB
17 Apr 2025 - 04:11

wn-3-5-17.recas.ba.infn.it
3,111
3,111
432.10 MB
17 Apr 2025 - 12:38

wn-3-4-11.recas.ba.infn.it
3,066
3,066
324.21 MB
17 Apr 2025 - 04:08

wn-8-9-9.recas.ba.infn.it
2,654
2,654
442.31 MB
17 Apr 2025 - 04:09

wn-7-6-6.recas.ba.infn.it

It looks like the reason is that the Bari squids are running with just one worker each, so they are often driven into saturation (60 CPU seconds).
For example, see attachment. The fix is to run multiple workers for each squid instance. You might want 2 or 3 workers per squid, not more.

InstallSquid < Frontier < TWiki

In addition, the version of squid you are using is extremely old and really needs to be updated.
The current release is frontier-squid-5.10-1.1

Please update your squids and use multiple workers, if possible.

Best Regards,
Barry
Hello, You continue to fail-over to CERN because your two squids are saturated.
You need multiple workers and an updated squid version.

Best Regards,
Barry
Hello, One of your squids is down (ss-03.recas.ba.infn.it) since yesterday and there is massive fail-over to the CERN Backup proxies.
For example today:

wn-4-3-13.recas.ba.infn.it
1,435,251
1,435,251
1.92 GB
27 Aug 2025 - 15:59

wn-3-4-15.recas.ba.infn.it
1,267,478
1,267,478
1.79 GB
27 Aug 2025 - 15:58

wn-4-3-7.recas.ba.infn.it
1,265,824
1,265,824
1.65 GB
27 Aug 2025 - 15:59

wn-3-5-3.recas.ba.infn.it
1,263,017
1,263,017
1.62 GB
27 Aug 2025 - 15:59

wn-3-4-17.recas.ba.infn.it
1,197,467
1,197,467
1.70 GB
27 Aug 2025 - 16:01

The total so far today is 90 million hits for 121 GB. Please address this ticket.

Best Regards,
Barry
We are changing the configuration very soon; there are new squid servers ready for production.
New configuration committed.
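As a rough sanity check of the sizing advice in this thread (a single squid process saturates at about 60 CPU-seconds per wall-clock minute, i.e. one core at 100%), the arithmetic can be sketched as follows. The headroom factor is an assumption for illustration, not a Frontier recommendation:

```python
import math

def workers_needed(cpu_seconds_per_minute: float, headroom: float = 0.7) -> int:
    """Workers needed so each stays below `headroom` of one core per minute."""
    return max(1, math.ceil(cpu_seconds_per_minute / (60.0 * headroom)))

print(workers_needed(60))    # a fully saturated single worker -> 2
print(workers_needed(110))   # -> 3, in line with the "2 or 3 workers" advice
```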
WLCG tickets (1)
WLCG #681628 (id:1729) Enable site network monitoring (INFN-CLOUD-BARI)
State: waiting for submitter's reply  |  Priority: less urgent  |  Opened: 2025-01-29 10:01 (430d ago)  |  Updated: 2026-01-09 10:32
Conversation (4 messages)
GGUS ID: 162964
Last modifier: Julia Andreeva
Date: 2023-08-02 09:03:09
Subject: Enable site network monitoring (INFN-CLOUD-BARI)
Ticket Type: USER
CC: ;smckee@umich.edu
Status: assigned
Responsible Unit: NGI_IT
Issue type: Monitoring
Description:
WLCG Sites / Site Administrators / Networking Support,

As presented at the WLCG Ops Coordination meeting on April 6, 2023, our WLCG Monitoring Task Force is initiating a campaign to enable site network monitoring and gather the associated network information in preparation for Data Challenge 2024 (DC24).
Our primary targets are the Tier-1s and larger Tier-2s, but we would like to see as many sites participating as possible.
You can find detailed instructions in Gitlab at CERN: https://gitlab.cern.ch/wlcg-doma/site-network-information
In case you do not have access to the project Gitlab at CERN, 3 PDF files are attached to the twiki page:
https://twiki.cern.ch/twiki/bin/view/LCG/SiteNetworkMonitoring
They capture information from the project Gitlab at CERN. These three PDFs provide an overview of the project, an example site network information template to be filled out, and information detailing how to provide your site's network metrics.
We would like sites to complete this by the end of September 2023 to give us time to verify the data and provide any fixes well in advance of DC24.
Interacting with Gitlab does require a CERN account. If you have issues adding your site files to Gitlab or WLCG-CRIC, please contact us.

Best regards,
The WLCG Monitoring Task Force
GGUS ID: 162964
Last modifier: Marica Antonacci
Date: 2023-09-27 14:52:17
GGUS ID: 162964
Last modifier: Julia Andreeva
Date: 2023-10-05 09:43:31

Public Diary:
Any progress with this ticket?
Dear MTF,
We are catching up with tickets. Is this request still valid? If so, we will work on it in the coming days.

Sorry for the very late reply.
Vincenzo
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
HammerCloud: 99% 99% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #1002132 (id:1002132) Upgrade your HTCondorCE endpoints to 24.0.x series (INFN-LNL-2)
State: assigned  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 14:13
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which are the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all the changes must be applied manually, in particular the change to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 89% 83% 94% 100% 98% 21% 52% 100% 44% 48% 93% 100% 65%
HammerCloud: 98% 98% 98% 98% 100% 99% 100% 100% 100% 100% 100% 99% 99% 100% 100% 99%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (9)

CMS tickets (1)
CMS #1002207 (id:1002207) Intermittent WebDAV SAM test failure at T2_IT_Pisa
State: assigned  |  Priority: less urgent  |  Opened: 2026-03-27 09:03 (8d ago)  |  Updated: 2026-03-27 09:03
Conversation (1 message)
Good morning, Pisa admins.
Since midnight UTC today (27 Mar), your WebDAV endpoint has been failing the SAM "4crt-read" test [1]. The log file shows "HTTP 500" and "Result Result curl error (35): SSL connect error" error messages [2]. Could you please take a look and check this server's status/certificate file?
Cheers,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_IT_Pisa
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-4crt-read&var-dst_hostname=stwebdav.pi.infn.it&var-timestamp=1774601197334
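A quick stdlib probe of the endpoint's TLS handshake and certificate lifetime can narrow down the "curl error (35): SSL connect error" reported above (the host comes from the SAM link; the port in the commented call is a placeholder, and the probe needs network access to the door):

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_days_left(notafter: str, now=None) -> float:
    """Days until a getpeercert()-style 'notAfter' timestamp
    (e.g. 'Jun  1 12:00:00 2030 GMT') expires."""
    expires = datetime.strptime(notafter, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400.0

def probe(host: str, port: int) -> float:
    """TLS-connect to the WebDAV door; raises ssl.SSLError if the handshake fails."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return cert_days_left(tls.getpeercert()["notAfter"])

# probe("stwebdav.pi.infn.it", 443)   # placeholder port; use the real door port
print(round(cert_days_left("Jan  1 00:00:00 2030 GMT",
                           datetime(2029, 12, 31, tzinfo=timezone.utc))))  # 1
```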
WLCG tickets (8)
WLCG #1002135 (id:1002135) Upgrade your HTCondorCE endpoints to 24.0.x series (INFN-PISA)
State: assigned  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 14:13
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which are the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all the changes must be applied manually, in particular the change to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
WLCG #681609 (id:1710) NGI_IT - January 2024 - RP/RC OLA performance
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:59 (430d ago)  |  Updated: 2025-09-30 13:09
Conversation (20 messages)
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-02-02 12:24:58
Subject: NGI_IT - January 2024 - RP/RC OLA performance
Ticket Type: USER
CC: grid-prod@lists.pi.infn.it; grid-prod@lists.lnf.infn.it;
Status: assigned
Responsible Unit: NGI_IT
Issue type: Operations
Description:
Dear NGI/ROC,

the EGI RC OLA and RP OLA Report for January 2024 has been produced and is available at the following links:
- NGIs reports: http://argo.egi.eu/egi/report-ar/Critical/NGI?month=2024-01 (clicking on the NGI name will display the resource centres' A/R figures)
- RCs reports: http://argo.egi.eu/egi/report-ar/Critical/SITES?month=2024-01

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre[2] OLAs, the following problems occurred:

============= RC Availability Reliability [2]==========

According to the recent availability/reliability report, the following sites have performed below the Availability target threshold for 3 consecutive months (November, December, and January):

INFN-FRASCATI
https://argo.egi.eu/egi/report-status/Critical/SITES/INFN-FRASCATI/SRM/atlasse.lnf.infn.it
please fix in GOCDB the SURL information for the srm endpoint:
https://docs.egi.eu/internal/configuration-database/adding-service-endpoint/#surl-value-for-srm

INFN-PISA
https://argo.egi.eu/egi/report-status/Critical/SITES/INFN-PISA/webdav/stwebdav.pi.infn.it
please fix in GOCDB the information of the storage area of the webdav endpoint
https://docs.egi.eu/internal/configuration-database/adding-service-endpoint/#webdav

* During the 10 working days after receiving this ticket, the NGI can suspend the site or ask for the site not to be suspended by providing an adequate explanation. If no answer is provided to this ticket, the NGI will be contacted by email; if no reply is provided to the email, the site will be suspended [6].

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended, please provide justification in this ticket within 10 working days. If the site performance rises above targets within 3 days of providing the explanation, the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.

**********************

Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"

[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"

[3] https://confluence.egi.eu/x/SiAmBg "EGI Infrastructure Oversight escalation"

[4] https://confluence.egi.eu/x/0h4mBg "Recomputation of SAM results or availability reliability statistics"

[5] https://docs.egi.eu/providers/operations-manuals/man05_top_and_site_bdii_high_availability/ "top-BDII and site-BDII High Availability"

[6] https://confluence.egi.eu/x/xx4mBg "Quality verification of monthly availability and reliability statistics"

Best Regards,
EGI Operations
GGUS ID: 165200
Last modifier: IGOR ABRITTA COSTA
Date: 2024-02-02 16:52:25
Changed CC to grid-prod@lists.pi.infn.it; grid-prod@lists.lnf.infn.it;

Public Diary:
Dear EGI Operation,

As agreed in the DPM migration and decommissioning ticket (https://ggus.eu/index.php?mode=ticket_info&ticket_id=158808), we have disabled production and monitoring of the SRM service in GOCDB (as ATLAS no longer needs SRM), so the performance should start to rise.

Let us know if you agree with this solution,

Cheers,
Igor
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-02-05 08:39:10

Public Diary:
Hi Igor,
thanks for fixing the information in GOCDB.
Cheers,
Alessandro
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Elisabetta Vilucchi
Date: 2024-02-05 10:13:15

Status: solved
Responsible Unit: NGI_IT
Solution:
Solved for INFN-FRASCATI site.
Cheers,
Elisabetta
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-05-30 14:46:29

Public Diary:
Hi Enrico,

could you please fix the failures?
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-02-29 16:09:24

Public Diary:
Hi Enrico,

could you please fix the GOCDB information?

let us know,
Alessandro
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-04-05 14:35:31

Public Diary:
Hi Enrico,

please fix the information on GOCDB
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-09-02 08:34:44

Public Diary:
Hi,
Could you please fix the information of the webdav endpoint on GOCDB?
It is also failing the host certificate validity check on the CE.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-08-05 13:21:35

Public Diary:
Hi,

the failures were not solved yet: could you please fix them?
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-10-03 09:37:44

Public Diary:
Dear all,

please fix the CE and webdav failures.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-11-01 15:12:08

Public Diary:
could you please fix the issues?
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 165200
Last modifier: Alessandro Paolini
Date: 2024-11-29 09:14:10

Public Diary:
Hi,
please check if the new lsc file for ops VO are installed on all the machines:
https://twiki.cern.ch/twiki/bin/view/LCG/VOMSLSCfileConfiguration

Please also fix the webdav information in GOCDB as suggested in the ticket description
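For reference, an .lsc file lives under /etc/grid-security/vomsdir/ops/<voms-host>.lsc and holds exactly two lines: the VOMS server certificate DN, then its CA DN. The DNs below are placeholders; the actual values for the new ops VO servers are on the TWiki page linked above.

```
/DC=org/DC=example/OU=computers/CN=<voms-host>
/DC=org/DC=example/CN=Example Certification Authority
```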
Internal Diary:
Escalated this ticket to NGI_IT
please fix the ops VO setting on the CE and fix the information of the webdav endpoint on GOCDB
I see that the webdav endpoint has been passing the tests since April 2025, good!
https://argo.egi.eu/egi/report-ar-group-details/Critical/SITES/INFN-PISA/details

Anyway, the SRM tests started to fail in March 2025:
https://argo.egi.eu/egi/report-ar-group-details/Critical/SITES/INFN-PISA/services/SRM/endpoints

Is this protocol still supported on your endpoints? If not, please remove the SRM service type from GOCDB.
If yes, please check the settings of the ops VO, in particular if you have the new lsc files https://twiki.cern.ch/twiki/bin/view/LCG/VOMSLSCfileConfiguration

Please do the same check on the CE endpoints: the metric that checks the host certificate validity uses voms authz.

Besides, ensure that the CE endpoints are published in the BDII. Here is how to configure the info provider on HTCondorCE: https://htcondor.com/htcondor-ce/v24/configuration/optional-configuration/#enabling-bdii-integration
I have fixed the information in GOCDB.
Very well! The SRM metrics will stop soon. Please let me know about the CE endpoints.
I'm working on the CE; we have HTCondor-CE without the HTCondor batch system.
There is currently a downtime (until Jul 5th) for issues with the cooling system, which in turn caused issues with the storage elements.
Hi Enrico, do you have news?
Please note that in HTCondor24 the syntax for the ssl mapping is different. See https://jira.egi.eu/browse/EGIKEDB-22 and https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9#The_Submit_Node_CE

Let me know,
Alessandro
Hi Enrico,
what is the Condor version installed on the CEs?
Could you please check if you are using the correct syntax for the SSL mapping related to the condor version?
WLCG #681626 (id:1727) New configuration for monitoring webdav and XrootD (and EOS) endpoints (INFN-PISA)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:01 (430d ago)  |  Updated: 2025-04-24 15:52
Conversation (8 messages)
GGUS ID: 160470
Last modifier: Alessandro Paolini
Date: 2023-02-14 15:36:19

Public Diary:
Dear Site-administrators,

you are receiving this ticket because you provide either a webdav or an xrootd endpoint.
Currently in the GOCDB page of your endpoints the Url field is used to pass the storage area information to the ARGO Monitoring Service.
We would need you to make a change by defining specific Extension Properties to be used for monitoring purposes, and leave the Url field for other usages.

Please have a look at this documentation as a reference:
https://docs.egi.eu/internal/configuration-database/adding-service-endpoint/#webdav

and set the following Extension Properties:

- webdav:
- Name: ARGO_WEBDAV_OPS_URL
- Value: webdav URL containing also the VO ops folder, for example: https://darkstorm.cnaf.infn.it:8443/webdav/ops or https://hepgrid11.ph.liv.ac.uk/dpm/ph.liv.ac.uk/home/ops/

- xrootd:
- Name: ARGO_XROOTD_OPS_URL
- Value: XRootD base SURL to test (the path where ops VO has write access, for example: root://eosatlas.cern.ch//eos/atlas/opstest/egi/, root://recas-se-01.cs.infn.it:1094/dpm/cs.infn.it/home/ops/, root://dcache-atlas-xrootd-ops.desy.de:2811/pnfs/desy.de/ops or similar)

This information was also circulated with a broadcast some months ago:
https://operations-portal.egi.eu/broadcast/archive/2939

The tests at the moment can retrieve the information both from the Url field and from the Extension properties, but in the near future only the Extension Properties will be used.
After you set the Extension properties, you can wait for a few hours or for the next execution of the tests, and then you can unset the Url field, verifying that the tests run successfully.
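A trivial local check of the value shape before saving it in GOCDB could look like the sketch below (check_ops_url is a hypothetical helper, and the URL is one of the examples quoted above; real verification is a gfal-ls against the endpoint):

```shell
# Minimal sketch: check that an extension-property value uses the scheme
# the monitoring expects (webdav -> https://, xrootd -> root://).
check_ops_url() {
  case $1 in
    ARGO_WEBDAV_OPS_URL) case $2 in https://*) return 0 ;; esac ;;
    ARGO_XROOTD_OPS_URL) case $2 in root://*)  return 0 ;; esac ;;
  esac
  return 1
}

check_ops_url ARGO_WEBDAV_OPS_URL https://darkstorm.cnaf.infn.it:8443/webdav/ops && echo OK
```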

Thanks for your collaboration,
EGI Operations
GGUS ID: 160470
Last modifier: Alessandro Paolini
Date: 2023-02-14 15:08:16
Subject: New configuration for monitoring webdav and XrootD (and EOS) endpoints (INFN-PISA)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Other
Description: (same text as the Public Diary entry above)
GGUS ID: 160470
Last modifier: Renato Santana
Date: 2023-05-04 08:07:24

Public Diary:
Dears,

Would you please update this ticket?

Many thanks in advance!

Cheers,
Renato
EGI-Operations
GGUS ID: 160470
Last modifier: Alessandro Paolini
Date: 2023-08-25 13:46:24

Public Diary:
Dear all,

please set the information in the Extension Properties variables as soon as possible because for September we would like to ask the Monitoring service to use only the Extension Properties to retrieve information about the storage path.

Let us know,
Alessandro
GGUS ID: 160470
Last modifier: Alessandro Paolini
Date: 2023-11-02 14:01:55

Public Diary:
Dear all,
just a reminder that from Nov 6th the monitoring metrics will use only the extension properties to retrieve the information about the storage path.
Please update the information as soon as possible, or the A/R figures will be affected.
GGUS ID: 160470
Last modifier: Alessandro Paolini
Date: 2023-12-12 09:09:11
Changed CC to igi-ops-support@lists.italiangrid.it; igi-noc@lists.italiangrid.it; diego.michelotto@cnaf.infn.it;

Public Diary:
Hi Enrico,
could you please fix the information in GOCDB? The webdav and xrootd tests are failing and the A/R figures are affected.

Let me know,
Alessandro
GGUS ID: 160470
Last modifier: Alessandro Paolini
Date: 2024-10-28 13:05:57

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
The configuration has not been fixed yet, and the tests are still failing.

@Enrico please update the information in GOCDB as suggested and let us know.
I have just updated the information in GOCDB, sorry for the delay. Let me know if it is now correct.
WLCG #681586 (id:1687) Request to implement BGP tagging of LHCONE prefixes. (INFN-PISA)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:57 (430d ago)  |  Updated: 2025-04-24 09:02
Conversation (4 messages)
GGUS ID: 168362
Last modifier: Julia Andreeva
Date: 2025-01-28 14:25:53

Public Diary:
Hi Igor

we checked our logs and we are indeed able to submit pilot jobs to Frascati after the invalid user account was fixed. However, these pilots remain in waiting status in your queue and after a while they are deleted. We have been observing this cycle (two pilot-job submissions, they stay in the queue for a couple of days, they are removed, we resubmit, etc.) for a week or so.

Do you see waiting LHCb jobs in your queue?
Concezio
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 168362
Last modifier: Julia Andreeva
Date: 2024-09-23 15:43:38
Subject: Request to implement BGP tagging of LHCONE prefixes. (INFN-PISA)
Ticket Type: USER
CC: ;edoardo.martelli@cern.ch
Status: assigned
Responsible Unit: NGI_IT
Issue type: Network problem
Description:
This ticket concerns all the sites connected to LHCONE.

In agreement with the WLCG Management Board, it has been decided to
implement the tagging of the IP prefixes announced to LHCONE.
The task consists of tagging the IP prefixes that your site announces to
LHCONE with all the BGP communities that identify the experiments and
collaborations supported by your site. The initial goal is to document
the use of the network. In the longer term the tags may be used to
reduce the exposure on the LHCONE connection, by filtering unnecessary
prefixes.

You will find the values of the BGP communities to use and other
information in these pages:
- https://twiki.cern.ch/twiki/bin/view/LHCONE/MultiOneBGPcommunities
- https://indico.cern.ch/event/1356138/contributions/6123461/attachments/2925447/5147273/WLCG-20240911-GDB-MultiONE-implementation.pdf

If you need any support on this task, please don't hesitate to ask your
NREN or LHCONE provider.
Or just reply to this ticket asking your questions; experts will guide
you in the implementation.

Please take this opportunity also to review the network information
related to your site in CRIC :
https://wlcg-cric.cern.ch/core/networkroute/list/

We ask you to perform the required action by the end of March 2025.
GGUS ID: 168362
Last modifier: Julia Andreeva
Date: 2025-01-28 14:26:10

Public Diary:
Any progress on this ticket?
Internal Diary:
Escalated this ticket to NGI_IT
Our site does not have BGP turned on yet. We are in the process of acquiring a new router and will turn on BGP when the new router goes into production. We hope to be finished in 2-3 months.
WLCG #681974 (id:2075) Lost files and wrong permissions on Pisa SE
State: assigned  |  Priority: less urgent  |  Opened: 2025-02-03 10:24 (425d ago)  |  Updated: 2025-02-11 14:41
Conversation (7 messages)
GGUS ID: 168577
Last modifier: cedric.serfon
Date: 2024-10-14 12:43:56
Subject: Lost files and wrong permissions on Pisa SE
Ticket Type: USER
CC:
Responsible Unit: TPM
Issue type: Storage Systems
Description:
Hi,

We identified a few files that were supposed to be located on the Pisa SE, but actually are not:
srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/belle/TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047277/s00/e0016/4S/r00000/mixed/10601300/mdst/sub02/mdst_000233_prod00047277_task151110002213.root
srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/belle/TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047278/s00/e0016/4S/r00000/mumu/mdst/sub00/mdst_000104_prod00047278_task15670000104.root
srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/belle/TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047280/s00/e0016/4S/r00000/taupair/10601400/mdst/sub00/mdst_000977_prod00047280_task151032000977.root
srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/belle/TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047281/s00/e0016/4S/r00000/uubar/mdst/sub00/mdst_000058_prod00047281_task15592000058.root
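The SURLs above can be spot-checked on the storage side; a minimal sketch (gfal-stat ships with gfal2-util, and the STAT indirection is only a hypothetical hook so the snippet can be exercised without grid credentials — the SURL shown is the first from this list):

```shell
# Sketch: classify a reported SURL as present/missing via gfal-stat.
# STAT defaults to gfal-stat; overriding it is purely for offline testing.
STAT=${STAT:-gfal-stat}
check_surl() {
  if "$STAT" "$1" >/dev/null 2>&1; then
    echo "present: $1"
  else
    echo "missing: $1"
  fi
}

check_surl 'srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/belle/TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047277/s00/e0016/4S/r00000/mixed/10601300/mdst/sub02/mdst_000233_prod00047277_task151110002213.root'
```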

Can you check if you find something in your logs that can explain what happened to them?
BTW, while checking for these missing files, I noticed something strange. In one directory, files have different owners or permissions. Some of them are read-only, others are writable by anyone!

gfal-ls -l srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/belle/TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047277/s00/e0016/4S/r00000/mixed/10601300/mdst/sub02
-r-------- 1 52 53 44711357 Sep 29 03:38 mdst_000002_prod00047277_task151061001982.root
-r-------- 1 52 53 43213448 Sep 27 16:58 mdst_000256_prod00047277_task151112002236.root
-r-------- 1 52 53 20828014 Sep 28 10:47 mdst_000063_prod00047277_task151084002043.root
-r-------- 1 52 53 43497259 Sep 28 19:59 mdst_000092_prod00047277_task151095002072.root
-rw-rw-rw- 1 50 50 42976700 Sep 29 01:20 mdst_000248_prod00047277_task151110002228.root
-rw-rw-rw- 1 50 50 43016363 Sep 28 23:31 mdst_000227_prod00047277_task151110002207.root
-r-------- 1 52 53 41516109 Sep 28 05:31 mdst_000037_prod00047277_task151084002017.root

Do you understand why?

Regards,
Cedric

Affected ROC/NGI: NGI_IT
Affected Site: INFN-PISA
GGUS ID: 168577
Last modifier: Tomas Holub
Date: 2024-10-14 16:19:09

Status: assigned
Responsible Unit: DMSU
Public Diary:
The CISO has requested that the IT Services team do a thorough check on all the network side, rebuilding controllers etc (Its a CISCO Fabric Network), the firewalls have been rebuilt, and building the new VRF's is under way down in main campus.

The Walton Institute is not on main campus and have dual 10Gbps connections to it as well as their own separate dual 10Gbps connections to HEAnet. A meeting is expected next week about how to migrate the Walton Institute connections onto the 'new' infrastructure. At that point the default routes should start to work again.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 168577
Last modifier: cedric.serfon
Date: 2024-10-17 07:20:46

Public Diary:
Not sure why the site assignment was removed...
Any news?
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 168577
Last modifier: Alessandro Paolini
Date: 2024-10-17 07:38:33

Status: assigned
Responsible Unit: NGI_IT
Public Diary:
Not sure why the site assignment was removed...
Any news?
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 168577
Last modifier: Enrico Mazzoni
Date: 2024-10-21 14:53:36

Public Diary:
Dear Cedric, I checked the filesystem and:
- I can confirm that the files are missing also from the filesystem point of view
- I can't find any info in the SE log file useful to understand the problem's origin
- about the permissions issue: what you report is not what I can see from the filesystem; here is the ls of that folder
ls -l TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047277/s00/e0016/4S/r00000/mixed/10601300/mdst/sub02
total 212928
-rw-r--r-- 1 500 501 43170963 6 ott 18.58 mdst_000071_prod00047277_task151091002051.root
-rw-r--r-- 1 500 501 43897705 28 set 18.34 mdst_000103_prod00047277_task151095002083.root
-rw-r--r-- 1 500 501 44195816 28 set 22.36 mdst_000210_prod00047277_task151110002190.root
-rw-rwxr--+ 1 500 501 43674589 28 set 22.39 mdst_000213_prod00047277_task151110002193.root
-rw-rwxr--+ 1 500 501 42976700 29 set 00.20 mdst_000248_prod00047277_task151110002228.root

could you please check again with gfal-ls to see if you now get the right filesystem information?

Sorry for the delay, Enrico
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 168577
Last modifier: cedric.serfon
Date: 2024-10-22 07:56:54

Public Diary:
Dear Enrico,

About the missing files, does it mean that they were never written to the SE?

About the permissions, I can still see some differences using gfal:

[09:49:50 serfon@lxplus998:~/gbasf2 ]> which gfal-ls
/usr/bin/gfal-ls
[09:49:53 serfon@lxplus998:~/gbasf2 ]> hostname
lxplus998.cern.ch
[09:49:55 serfon@lxplus998:~/gbasf2 ]> gfal-ls -l srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/belle/TMP/belle/MC/fab/step1/MC16rd_proc16/prod00047277/s00/e0016/4S/r00000/mixed/10601300/mdst/sub02
-rw-rw-rw- 1 50 50 42976700 Sep 29 01:20 mdst_000248_prod00047277_task151110002228.root
-r-------- 1 52 53 44195816 Sep 28 23:36 mdst_000210_prod00047277_task151110002190.root
-rw-rw-rw- 1 50 50 43674589 Sep 28 23:39 mdst_000213_prod00047277_task151110002193.root
-r-------- 1 52 53 43897705 Sep 28 19:34 mdst_000103_prod00047277_task151095002083.root
-r-------- 1 52 53 43170963 Oct 6 19:58 mdst_000071_prod00047277_task151091002051.root
[09:49:58 serfon@lxplus998:~/gbasf2 ]> gfal-ls --version
gfal2-util version 1.9.0 (gfal2 2.23.0)
dcap-2.23.0
file-2.23.0
gridftp-2.23.0
http-2.23.0
sftp-2.23.0
srm-2.23.0
xrootd-2.23.0

Regards,
Cedric

Internal Diary:
Escalated this ticket to NGI_IT
Hi,

We still have a significant number of errors coming from Pisa: https://monitoring.sdcc.bnl.gov/pub/grafana/d/belle2xfers/belle-ii-transfers-and-deletions?orgId=1&viewPanel=39&var-src_rse=All&var-dst_rse=All&var-activity=All&var-binning=10m&var-Filters=payload.src-rse%7C%3D%7CPisa-TMP-SE&var-Filters=payload.dst-rse%7C%3D%7CPisa-DATA-SE

with a typical error:

INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: > User-Agent: libdavix/0.8.4 libcurl/7.69.0-DEV
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix:
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < HTTP/1.1 404 Not Found
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < Content-Type: text/html
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < X-Content-Type-Options: nosniff
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < X-XSS-Protection: 1; mode=block
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < Cache-Control: no-cache, no-store, max-age=0, must-revalidate
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < Pragma: no-cache
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < Expires: 0
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < Strict-Transport-Security: max-age=31536000 ; includeSubDomains
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < X-Frame-Options: DENY
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: < Transfer-Encoding: chunked
INFO Mon, 10 Feb 2025 21:20:23 -0500; Davix: Negative result for operation: HTTP 404 : File not found . After 1 retry
INFO Mon, 10 Feb 2025 21:20:23 -0500; [1739240423888] DEST http_plugin CLEANUP 0
INFO Mon, 10 Feb 2025 21:20:23 -0500; Gfal2: Event triggered: DESTINATION http_plugin CLEANUP 0
INFO Mon, 10 Feb 2025 21:20:23 -0500; [1739240423888] BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: UnknownHostException while pushing https://stwebdav.pi.infn.it:8443/belle/DATA/belle/Data/build-light-2411a/DB00003158/proc16/prod00048484/e0016/4S/r00000/all/78000100/mdst/sub00/mdst_000320_prod00048484_task150000321.root: stwebdav.pi.infn.it
INFO Mon, 10 Feb 2025 21:20:23 -0500; Gfal2: Event triggered: BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: UnknownHostException while pushing https://stwebdav.pi.infn.it:8443/belle/DATA/belle/Data/build-light-2411a/DB00003158/proc16/prod00048484/e0016/4S/r00000/all/78000100/mdst/sub00/mdst_000320_prod00048484_task150000321.root: stwebdav.pi.infn.it
ERR Mon, 10 Feb 2025 21:20:23 -0500; Recoverable error: [5] TRANSFER ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: UnknownHostException while pushing https://stwebdav.pi.infn.it:8443/belle/DATA/belle/Data/build-light-2411a/DB00003158/proc16/prod00048484/e0016/4S/r00000/all/78000100/mdst/sub00/mdst_000320_prod00048484_task150000321.root:

The file mentioned above is supposed to have been uploaded to the Pisa SE on 2025-02-11 02:18 UTC. Can you find anything in the SE logs?

Regards,
Cedric
WLCG #681621 (id:1722) SE gridsrm.pi.infn.it is not working for Biomed users.
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:00 (430d ago)  |  Updated: 2025-02-03 10:07
Conversation (5 messages)
GGUS ID: 166437
Last modifier: Sorina Pop
Date: 2024-04-19 14:44:10
Subject: SE gridsrm.pi.infn.it is not working for Biomed users.
Ticket Type: TEAM
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Storage Systems
Description:
Dear site admins,

SE gridsrm.pi.infn.it is not working for Biomed users. The incident was detected from the Biomed ARGO box that you may want to check to see the status: https://biomed.ui.argo.grnet.gr/biomed/report-status/CORE/SITES/INFN-PISA/SRM/gridsrm.pi.infn.it

According to the announcement in https://operations-portal.egi.eu/broadcast/archive/3021, the biomed VOMS server host certificate changed on Friday, 29th of March, requiring client updates and it's likely that errors are due to this change.

Could you please have a look?

Thank you in advance for your support,
Sorina for the Biomed VO
GGUS ID: 166437
Last modifier: Akos Szlavecz
Date: 2024-05-01 10:26:51

Public Diary:
Dear site admins,

Could you check this issue, please?

Regards
Ákos
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 166437
Last modifier: Akos Szlavecz
Date: 2024-06-12 14:45:20

Public Diary:
Dear site admins,

The site is still not working for the biomed VO.
Could you check this issue, please?

Regards
Ákos
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 166437
Last modifier: Pansanel Jerome
Date: 2024-05-17 15:10:11

Public Diary:
Hi,
Is there any news?
Best,
Jerome
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 166437
Last modifier: Sorina Pop
Date: 2024-08-21 12:19:23

Public Diary:
Dear site admins,

Any news on this issue?

Best regards,
Sorina
Internal Diary:
Escalated this ticket to NGI_IT
WLCG #681619 (id:1720) Missing CPU accounting data in the EGI portal for May. Monthly accounting validation has not been performed either. (INFN-PISA)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:00 (430d ago)  |  Updated: 2025-02-03 10:06
Conversation (1 message)
GGUS ID: 167568
Last modifier: Julia Andreeva
Date: 2024-07-12 09:20:42
Subject: Missing CPU accounting data in the EGI portal for May. Monthly accounting validation has not been performed either. (INFN-PISA)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Other
Description:
Hello,
You are receiving this ticket because accounting data for your site for May do not show up in the EGI accounting portal. If you have trouble with accounting reporting (which should be followed up with the APEL support team), you should use the monthly accounting validation to provide accounting metrics from your local accounting for the WLCG monthly accounting report. This has not been done for your site for May. If you were not notified for monthly accounting validation, please contact me (julia.andreeva@cern.ch) so that we can update the notification mailing list. Please make sure that your numbers are provided for June and that your reporting to APEL is fixed.
WLCG #681611 (id:1712) Upgrade to a supported HTCondor version and enable SSL authentication (INFN-PISA)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:59 (430d ago)  |  Updated: 2025-02-03 10:02
Conversation (5 messages)
GGUS ID: 163998
Last modifier: Alessandro Paolini
Date: 2023-11-03 11:26:42
Subject: Upgrade to a supported HTCondor version and enable SSL authentication (INFN-PISA)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Middleware
Description:
Dear site admins,

with this ticket we would like to follow up on the upgrade to a supported version of HTCondorCE and the migration from VOMS-based authentication with X509 certificates to AAI tokens for accessing the HTCondorCE endpoints.

The HTCondor team set up an upgrade procedure to help sites and VOs with the migration from X509 personal certificates to tokens.
Essentially, an intermediate step was created where plain SSL authentication can be used to authenticate a client's proxy, in addition to the GSI one or the token one:
- https://confluence.egi.eu/x/EYAtDQ

In summary, the steps are:

- update to HTCondor 9.0.19
- enable the SSL authz (with priority over GSI)
- map the users' DNs
- test the SSL authz successfully
- update to latest HTCondor 10.x

You can find the HTCondor 9.0.19 version in WLCG repository for the time being, as explained in the instructions.

Please also note the usage in the last step of the HTCondor Feature channel (https://htcondor.org/htcondor/release-highlights/index.html#feature-channel), since this is the channel supporting the EGI Check-in plugin from 10.4.0.
In this way sites can accept clients' proxies and tokens at the same time while waiting for the supported VOs to move completely to tokens.
You can find the latest HTCondor 10.x version in the HTCondor Feature Channel repository.

Please note that after the upgrade to HTCondor 10 version, you need to install and configure the EGI Check-in plugin in order to be compliant with the EGI tokens:
https://github.com/EGI-Federation/check-in-validator-plugin

Please get in contact with your supported communities to properly map the users' DNs to local accounts to ensure also the access via X509 personal certificates.

Concerning the ops VO, please map at least the following certificates:
- EGI Monitoring Service:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr"
"/DC=EU/DC=EGI/C=HR/O=Robots/O=SRCE/CN=Robot:argo-egi@cro-ngi.hr"

- EGI Security monitoring:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-secmon@grnet.gr"

Please also configure the Accounting settings properly on the HTCondor 10 installation, as explained in the instructions.

Thanks for your collaboration,
EGI Operations
GGUS ID: 163998
Last modifier: Alessandro Paolini
Date: 2025-01-23 12:07:22

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
yes, the migration hasn't been done yet.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163998
Last modifier: Alessandro Paolini
Date: 2025-01-23 12:35:58

Public Diary:
thanks for the news: which version are you going to install?

the suggestion is 23.0.x version:
https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv23EL9
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163998
Last modifier: Enrico Mazzoni
Date: 2025-01-23 12:12:57

Public Diary:
The work is in progress; we are taking care of gridce0.pi.infn.it just now.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 163998
Last modifier: Enrico Mazzoni
Date: 2025-01-23 13:46:43

Public Diary:
We are actually updating the old CEs with CentOS 7 to keep the site working. After that we will start updating to EL9.
Internal Diary:
Escalated this ticket to NGI_IT
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM99%100%100%100%100%100%100%100%99%97%97%96%100%95%100%100%
HammerCloud98%100%99%100%100%100%99%99%100%100%100%99%99%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #681639 (id:1740) Request to deploy IPv6 on CEs and WNs at WLCG sites (INFN-ROMA1-CMS)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:01 (430d ago)  |  Updated: 2026-03-31 08:09
Conversation (17 messages)
GGUS ID: 164371
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:37:44
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (INFN-ROMA1-CMS)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
- your estimate of the timescale for the deployment;
- a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what the obstacles are and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
GGUS ID: 164371
Last modifier: Andrea Sciaba
Date: 2023-11-29 06:43:44

Public Diary:
yes, the migration hasn't been done yet.
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164371
Last modifier: Andrea Sciaba
Date: 2024-07-01 13:37:51

Public Diary:
Hi,
would it be possible to have some information about your plans?
Thanks,
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164371
Last modifier: francesco failla
Date: 2024-07-02 06:12:28

Public Diary:
Hi Andrea,
our situation is like that of INFN-ROMA.
So we're currently testing the EL9 WNs, and we'll add the IPv6 addresses to the new nodes once they are upgraded. This won't realistically change before September, given the holiday period.

Cheers,
Francesco
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164371
Last modifier: Andrea Sciaba
Date: 2024-07-05 09:48:34

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
Ciao Francesco,
grazie dell'update!
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164371
Last modifier: Andrea Sciaba
Date: 2024-12-03 13:53:13

Public Diary:
Ciao Francesco,
what's the situation, did you make any progress after the holidays?
Andrea
Internal Diary:
Escalated this ticket to NGI_IT
Ciao Andrea,
I talked to Alessandro De Salvo, since the operation is being planned together with ATLAS. The plan is to deploy IPv6 for the WNs when reinstalling them with EL9. This will happen gradually over a couple of months, since we will first proceed with the CEs, which must also be upgraded to EL9.

Shahram
Ciao Shahram!
I was wondering if the CE and WN reinstallation has happened and if so, how far are you from deploying IPv6.
Grazie,
Andrea
Ciao Andrea,
we have a CE and a batch of WNs ready. However, moving to EL9 also requires upgrading the batch system from LSF9 to LSF10. The new nodes with the new LSF master will be tested in production starting the last week of August, before upgrading all ATLAS and CMS nodes.

Shahram
Ciao Sha,
thanks for the update! We are getting closer and closer then...
Andrea
Ciao Shahram,
did you progress with the new nodes deployment?
Andrea
Ciao Andrea,
I see Alessandro already updated you. As soon as the ATLAS nodes are done we'll start with the CMS nodes.

Shahram
Ciao Shahram,

was there any progress, by chance?

Andrea
Ciao Andrea,
I have set up a new CE with Alma9 and a few test WNs with Alma9. I am currently waiting for the CE to be added to the SAM tests for a full end-to-end validation before starting the mass migration of all nodes.

Shahram
Ciao Shahram!
I can help with that, if you are still waiting. Whom did you ask for the CEs to be added?
Andrea
Ciao Andrea,
the test pilots have been successful (a set just finished a few minutes ago) and the SAM tests have been successful on the new CE (cmsrm-ce-02.roma1.infn.it) and the new WNs since Saturday. I will be migrating groups of 12 nodes this week. They should all be migrated by the end of next week.

Shahram
Great! Then after the Easter holidays we might be able to close the ticket.
Andrea
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM           12%   42%   62%   49%   53%   26%   23%   54%   45%   47%   62%   63%   83%  100%  100%   91%
HammerCloud   99%  100%   98%  100%  100%  100%  100%  100%  100%   99%  100%  100%  100%  100%  100%  100%
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

Open GGUS tickets (2)

CMS tickets (2)
CMS #682716 (id:2849) Request for XRootD Upgrade and Network Packet Labeling Configuration (HPC4L)
State: in progress  |  Priority: less urgent  |  Opened: 2025-03-18 14:46 (382d ago)  |  Updated: 2025-08-14 13:26
Conversation (3 messages)
Dear Site Administrators,

CMS will resume data taking next month and expand the use of tokens. Significant improvements and bug fixes for tokens have been made in XRootD. We kindly request all CMS sites using native XRootD to upgrade their XRootD services to the latest version (5.7.3), including site redirectors and storage services.

We would also like to take this opportunity to encourage sites to enable network packet labelling by adding the following four configuration lines to both XRootD and redirectors.

xrootd.pmark ffdest eu.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path-to-store>/store cms
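As a quick sanity check of the lines above, a minimal sketch (hypothetical helper, not part of any CMS or XRootD tooling) that verifies a config text contains all four `xrootd.pmark` directives:

```python
# Minimal check that an XRootD config contains the four scitags
# pmark directives quoted in this ticket. The directive list mirrors
# the ticket; anything beyond that is not covered here.
REQUIRED_PMARK_KEYS = {"ffdest", "domain", "defsfile", "map2exp"}

def missing_pmark_keys(config_text: str) -> set:
    """Return the pmark sub-directives absent from the config text."""
    present = set()
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "xrootd.pmark":
            present.add(parts[1])
    return REQUIRED_PMARK_KEYS - present

example = """\
xrootd.pmark ffdest eu.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path-to-store>/store cms
"""
print(missing_pmark_keys(example))  # set() -> nothing missing
```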



If your site supports multiple VOs on the same XRootD service or requires additional details, please refer to the SciTag Network Packet Labeling Twiki page: https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesXrootdScitagPacketLabeling.

Our target date for completing the upgrade and configuration update is April 5th, in preparation for the LHC commissioning for 2025.

Thank you for your cooperation. Please let us know if you have any questions or concerns.

Best regards,
Jakrapop and Noy
CMS Site Support
************************************************************************************
This is an automated mail. When replying don't change the subject line!
************************************************************************************
Ticket Link: https://helpdesk.ggus.eu/#ticket/zoom/2842
Any update? -- Thank you, Noy
Good afternoon, HPC4L admins. The SAM test shows your site still uses XRootD version 5.6.9 [1]. We recommend you update to 5.8.3 or newer. Could you please provide an XRootD upgrade plan and implement the packet labeling configuration?
Cheers,
Noy
[1]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-3version&var-dst_hostname=mgm.hpc4l.org&var-timestamp=1755137882000
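The upgrade decision above reduces to comparing dotted version strings. A minimal sketch, with hypothetical helper names:

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version like '5.6.9' into (5, 6, 9)."""
    return tuple(int(part) for part in version.split("."))

def needs_upgrade(installed: str, minimum: str) -> bool:
    """Tuple comparison, so '5.10.0' correctly sorts above '5.8.3'."""
    return parse_version(installed) < parse_version(minimum)

# Versions quoted in the ticket: the site runs 5.6.9,
# the recommended minimum is 5.8.3.
print(needs_upgrade("5.6.9", "5.8.3"))  # True
```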
CMS #681821 (id:1922) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T2_LB_HPC4L
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 10:47 (430d ago)  |  Updated: 2025-02-13 14:57
Conversation (4 messages)
GGUS ID: 168900
Last modifier: Jakrapop Akaranee
Date: 2024-11-05 12:16:52
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T2_LB_HPC4L
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: CMS_SAM tests
Description:
Dear HPC4L Site Administrators,
We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack, specifically for the following endpoint:

mgm.hpc4l.org (XrootD [1] )

Could you please review dual stack support on your storage element?
Thank you for your assistance.
Best Regards, Jakrapop
-----------
[1] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dmgm.hpc4l.org%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
[2] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dmgm.hpc4l.org%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
GGUS ID: 168900
Last modifier: Saadallah Itani
Date: 2024-11-05 12:54:02

Public Diary:
Dear Jakrapop,

We haven't yet deployed IPv6, but it's on our to-do list; due to the current dire situation in Lebanon, we have postponed it until things settle down.


Regards,
Saad
ext 2229

Internal Diary:
Escalated this ticket to USCMS
GGUS ID: 168900
Last modifier: Stephan Lammel
Date: 2024-11-05 14:29:23

Status: on hold
Responsible Unit: USCMS
Public Diary:
Internal Diary:
Escalated this ticket to USCMS
Assigning missing CMS site name error during import to new GGUS.

Jakrapop
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM           83%  100%  100%  100%   95%  100%   94%   93%   97%  100%  100%  100%  100%   95%  100%   97%
HammerCloud  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%  100%
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #1001599 (id:1001599) Pilot validation failures at T2_LV_HPCNET
State: assigned  |  Priority: less urgent  |  Opened: 2026-01-20 17:36 (74d ago)  |  Updated: 2026-03-24 10:24
Conversation (3 messages)
Dear T2_LV_HPCNET admins,

A large number of pilots sent to ce-arc.tier2.hpc-net.lv from the CERN factory have been failing validation. This is occurring on 2 worker nodes:

215 <env name="hostname">tr2sl-n0.tier2.hpc-net.lv</env>
408 <env name="hostname">tr2sl-n15</env>

According to the pilot error logs, the failure occurs in the test_squid.sh test:
Executing (flags:) /tmp/glide_RCNQap/client/test_squid.sh
starting SQUID VALIDATION
/tmp/glide_RCNQap/client/test_squid/sam_squid.sh: line 43: /cvmfs/cms.cern.ch/cmsset_default.sh: Transport endpoint is not connected
ERROR: CMS software initialisation script cmsset_default.sh failed
summary: NO_SETUP_SCRIPT
exit status squid frontier validation 1

Just before the squid test, the log also shows that the cmssw image on cvmfs is not found:

Tue Jan 20 15:56:54 EET 2026 ERROR: /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel8-x86_64 file not found
Tue Jan 20 15:56:54 EET 2026 No valid singularity image found (Selected singularity image, /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel8-x86_64, does not exist), but image is not required via SINGULARITY_IMAGE_REQUIRED. Continuing to test the binary.
Tue Jan 20 15:56:54 EET 2026 A later setup must set GWMS_SINGULARITY_IMAGE or jobs must set their image. Otherwise singularity/apptainer will not work.
Tue Jan 20 15:56:54 EET 2026 The Singularity image () is not a readable file/directory.
INFO Testing the Singularity/Apptainer binary with test image: oras://ghcr.io/apptainer/alpine:latest

INFO Checking for singularity...
INFO GWMS Singularity wrapper: PATH is set to /usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/tr2home/cms001/.local/bin:/tr2home/cms001/bin:/tr2home/cms001/.local/bin:/tr2home/cms001/bin outside Singularity. This will not be propagated to inside the container instance.

Checking on lxplus, the cms:rhel8-x86_64 image does exist.
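"Transport endpoint is not connected" is errno ENOTCONN, the classic signature of a stale CVMFS FUSE mount on the worker node. A simplified sketch (hypothetical helper; a real pilot wrapper does more) of classifying such a probe failure:

```python
import errno
import os

def classify_cvmfs_error(path: str) -> str:
    """Probe a CVMFS path and name the failure mode (hypothetical helper).

    ENOTCONN ("Transport endpoint is not connected") usually means the
    FUSE mount died on the worker node and needs a CVMFS remount;
    ENOENT means the file or repository is simply not there.
    """
    try:
        os.stat(path)
        return "ok"
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            return "stale-mount"
        if exc.errno == errno.ENOENT:
            return "missing"
        return "other"

# On a machine without CVMFS mounted, the path simply does not exist.
print(classify_cvmfs_error("/cvmfs/cms.cern.ch/cmsset_default.sh"))
```

On the failing WNs a "stale-mount" result would point at remounting CVMFS rather than at a missing image.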

Can you please have a look into this on your end?

All the best,
Vaiva
Thanks, I will check what the issue is.
Hello,

We're still observing a large number of pilot validation failures in client/test_squid.sh.
Can you investigate the issue?

Thanks,
Vaiva
             -16d  -15d  -14d  -13d  -12d  -11d  -10d   -9d   -8d   -7d   -6d   -5d   -4d   -3d   -2d   -1d
SAM            0%    0%    0%    0%    0%    0%    0%    0%    0%    0%    0%    0%    0%    0%    0%    0%
HammerCloud     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?
FTS           50%  100%  100%  100%    0%  100%  100%   98%    0%    0%  100%  100%  100%   87%  100%    0%

Open GGUS tickets (5)

CMS tickets (1)
CMS #681727 (id:1828) Rucio Consistency Check failed (T2_PK_NCP)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:14 (430d ago)  |  Updated: 2025-10-09 11:42
Conversation (9 messages)
GGUS ID: 159115
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-10-03 15:12:39
Subject: Rucio Consistency Check failed (T2_PK_NCP)
Ticket Type: USER
CC: cms-comp-ops-transfer-team@cern.ch;
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: CMS_Data Transfers
Description:
Dear site admin,
The last consistency check failed at your site.
Please check the logs[1].

Best
Felipe Gómez
CMS Data Management Operator

[1] https://cmsweb.cern.ch/rucioconmon/ce/show_run?rse=T2_PK_NCP&run=2022_09_27_03_18
GGUS ID: 159115
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-10-04 11:22:09

Public Diary:
Hi Fawad,
1) These are the current settings for your site [1]: they do not allow writing files.
How much is your currently available bandwidth?
I have seen transfer errors from/to your site in Rucio, and I am not sure how to properly configure your site in Rucio to adapt it to this temporary situation.

2) Your RSE Rucio protocols [2] are davs, gsiftp and root. I have seen that your storage.json [3] file has SRMv2 instead of gsiftp. I think Rucio is updating from the storage.json and assuming you have gsiftp.
If this is a true SRM endpoint, can you update the storage.json [3] to use "srm://pcncp22.ncp.edu.pk..." instead of "gsiftp://pcncp22.ncp.edu.pk..."?

Cheers,
Felipe

[1]
(venv) [fgomezco@lxplus7111 ~]$ rucio-admin rse info T2_PK_NCP
Settings:
=========
availability_delete: False
availability_read: True
availability_write: False
...
[2]
Protocols:
==========
davs
domains: '{"lan": {"read": 0, "write": 0, "delete": 0}, "wan": {"read": 1, "write": 1, "delete": 1, "third_party_copy_read": 1, "third_party_copy_write": 1}}'
extended_attributes: None
hostname: pcncp22.ncp.edu.pk
impl: rucio.rse.protocols.gfal.Default
port: 443
prefix: /dpm/ncp.edu.pk/home/cms/
scheme: davs
gsiftp
domains: '{"lan": {"read": 0, "write": 0, "delete": 0}, "wan": {"read": 2, "write": 2, "delete": 2, "third_party_copy_read": 2, "third_party_copy_write": 2}}'
extended_attributes: None
hostname: pcncp22.ncp.edu.pk
impl: rucio.rse.protocols.gfal.Default
port: 2811
prefix: /dpm/ncp.edu.pk/home/cms/
scheme: gsiftp
root
domains: '{"lan": {"read": 0, "write": 0, "delete": 0}, "wan": {"read": 3, "write": 3, "delete": 3, "third_party_copy_read": 3, "third_party_copy_write": 3}}'
extended_attributes: None
hostname: pcncp22.ncp.edu.pk
impl: rucio.rse.protocols.gfal.Default
port: 1094
prefix: //dpm/ncp.edu.pk/home/cms/
scheme: root
[3] gsiftp://pcncp22.ncp.edu.pk:2811/dpm/ncp.edu.pk/home/cms
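The `domains` entries above encode per-operation priorities (0 = disabled, lower positive value = preferred). A minimal sketch (hypothetical helper, using the values quoted above) of picking the preferred WAN protocol:

```python
import json

# The "domains" strings as printed by `rucio-admin rse info` above.
# 0 means the operation is disabled; lower positive values mean
# higher priority.
protocols = {
    "davs": '{"lan": {"read": 0, "write": 0, "delete": 0}, "wan": {"read": 1, "write": 1, "delete": 1, "third_party_copy_read": 1, "third_party_copy_write": 1}}',
    "gsiftp": '{"lan": {"read": 0, "write": 0, "delete": 0}, "wan": {"read": 2, "write": 2, "delete": 2, "third_party_copy_read": 2, "third_party_copy_write": 2}}',
    "root": '{"lan": {"read": 0, "write": 0, "delete": 0}, "wan": {"read": 3, "write": 3, "delete": 3, "third_party_copy_read": 3, "third_party_copy_write": 3}}',
}

def preferred_wan(op: str) -> str:
    """Return the scheme with the best (lowest non-zero) WAN priority."""
    ranked = []
    for scheme, domains_json in protocols.items():
        prio = json.loads(domains_json)["wan"][op]
        if prio > 0:  # 0 means the operation is disabled
            ranked.append((prio, scheme))
    return min(ranked)[1]

print(preferred_wan("read"))  # davs
```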
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 159115
Last modifier: Fawad Saeed
Date: 2022-10-04 05:27:03

Public Diary:
Hi Felipe,
The required datasets (relval) for the consistency check are not resident at (nor transferred to) our site T2_PK_NCP. Our site has been in the waiting room for a long time and is unable to transfer the required datasets due to the unavailability of high-speed network connectivity; that is why the consistency check failed with a timeout error.
Just for your information: although we are trying to restore our high-speed network connectivity, it will not be possible in the near future due to budget constraints.
Regards
Fawad Saeed
T2_PK_NCP

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 159115
Last modifier: Fawad Saeed
Date: 2022-10-05 09:54:02

Public Diary:
Hi Felipe,
I have changed the SRM endpoints in storage.json; could you please check?
Regarding bandwidth, at the moment we are connected via a 75 Mbps internet link, which is quite insufficient for transfers.
Regards
Fawad

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 159115
Last modifier: Stephan Lammel
Date: 2024-02-22 14:18:59

Public Diary:
from Fawad on GGUS# 158813:
===========================
Regarding the migration of pcncp22.ncp.edu.pk (from DPM to dCache): it is still underway due to certain configuration-related issues. That's why XRootD and WebDAV are not working at the moment. Hopefully we will sort out these issues as soon as possible.
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 159115
Last modifier: Fawad Saeed
Date: 2022-10-27 06:25:02

Public Diary:
Hi Felipe,
Just a soft reminder regarding the current status/configuration.
If the pending issue has been resolved, could you please close this ticket?
Regards,
Fawad.

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 159115
Last modifier: Stephan Lammel
Date: 2024-02-22 14:23:53

Public Diary:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=158813
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 159115
Last modifier: Chan-anun Rungphitakchai
Date: 2024-12-03 00:08:36

Status: in progress
Responsible Unit: ROC_Asia/Pacific
Public Diary:
Hello NCP admin,
Could you please provide an update on the status of the DPM migration to dCache? Once the migration is complete, please update the configurations for the HTCondor CE and storage endpoints to support IAM tokens. The attached documents show instructions and an example [1].
Thank you,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup
https://twiki.cern.ch/twiki/bin/view/LCG/HTCondorCEtokenConfigTips
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
Hello Fawad. Could you please provide any update about the DPM migration? The consistency check is already disabled for your site.
Thank you,
Noy
WLCG tickets (4)
WLCG #681731 (id:1832) Request to deploy IPv6 on CEs and WNs at WLCG sites (NCP-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:14 (430d ago)  |  Updated: 2026-03-31 09:20
Conversation (12 messages)
GGUS ID: 164381
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:38:10
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (NCP-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
your estimate of the timescale for the deployment;
a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what the obstacles are and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
GGUS ID: 164381
Last modifier: Andrea Sciaba
Date: 2023-11-29 06:43:34

Public Diary:
Hi Chironat,
from what I can see from the ALICE monitoring, your WNs do not pass the IPv6 connectivity test:

http://alimonitor.cern.ch/display?interval.max=0&interval.min=31536000000&page=IPv6%2Fsite_readiness

If you expand the statistics at the end, the entry for your site shows zero, which means that it never passes the test.
I'm adding Costin, in case he can help understand why.

Cheers,
Andrea
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164381
Last modifier: Andrea Sciaba
Date: 2024-07-01 14:46:15

Public Diary:
Hi,
would it be possible to have some information about your plans?
Thanks,
Andrea
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164381
Last modifier: Fawad Saeed
Date: 2024-09-27 06:40:02

Public Diary:
Hi Andrea,
We have a limited number of IPv6 addresses, and most are already in use. Moreover, at the moment the majority of our resources are down for a UPS repair; as soon as this intervention is over, we will incorporate the other resources with a new IPv6 pool. Hopefully it will be completed by mid-November 2024.
Regards
Fawad
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164381
Last modifier: Andrea Sciaba
Date: 2024-09-18 12:10:46

Public Diary:
Hi,
once again, could you please provide some information about the required IPv6 deployment on CEs and WNs?
Thanks, Andrea
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164381
Last modifier: Andrea Sciaba
Date: 2024-10-15 15:52:39

Status: in progress
Responsible Unit: ROC_Asia/Pacific
Public Diary:
Hi Fawad,
thanks for the update, I'm changing the status of the ticket to reflect the progress.
Andrea
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164381
Last modifier: Andrea Sciaba
Date: 2025-01-24 16:31:43

Public Diary:
Hi Fawad,
as some time has passed since your last update, I was wondering whether you have been able to complete the deployment.
Thanks,
Andrea
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
Hi Fawad,
do you have any news?
Cheers,
Andrea
Hi,
do you have any news?
Cheers,
Andrea
Hello Andrea,
We have two HTCondor-based CEs (htcondor-ce-2.ncp.edu.pk and pcncp04.ncp.edu.pk) at NCP. At the moment we have enabled IPv6 (on both the CE and the worker nodes) only on pcncp04.ncp.edu.pk. After shifting it to IPv6-only, we have observed that the CMS ETF SAM jobs are not landing on pcncp04.ncp.edu.pk.
Is anything required to change in the submission infrastructure, or is something still missing or misconfigured at our end? Could you please help us debug it?
Thanks in advance, and sorry for the very long delay.
Regards
Fawad & Adeel
T2_PK_NCP
Hi Fawad and Adeel!
In principle, the change should be transparent... I see that ETF does send jobs to pcncp04, but they stay idle forever, see for example:
https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dpcncp04.ncp.edu.pk%26site%3Detf%26view_name%3Dhost
Could you check the HTCondor-CE or batch system logs for error messages that might explain what's happening?
Cheers,
Andrea
Hi,
I was wondering if there are any news...
Best regards,
Andrea
WLCG #1002145 (id:1002145) Upgrade your HTCondorCE endpoints to 24.0.x series (NCP-LCG2)
State: assigned  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 14:13
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which is the minimum version that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all changes must be applied manually, in particular the move to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
WLCG #681724 (id:1825) No accounting information for a site in the EGI accounting portal (NCP-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:14 (430d ago)  |  Updated: 2025-09-08 08:54
Conversation (16 messages)
GGUS ID: 151484
Last modifier: Fawad Saeed
Date: 2021-05-07 06:34:03

Public Diary:
Hi ,
The problem is due to the fact that our site currently has no CE in production. Previously we were using CREAM-CE, but it has been decommissioned. We are in the process of installing its replacements (ARC-CE and HTCondor-CE), but deployment will take some time, hopefully completing in June 2021. Once the new CEs are in production, we will enable their APEL accounting.
Thanks and Best Regards
Fawad Saeed
NCP-LCG2
Internal Diary:
Added attachment f2eac07933a25841e6b6f7459660015f
https://ggus.eu/index.php?mode=download&attid=ATT119915
GGUS ID: 151484
Last modifier: Chien-De Li
Date: 2021-10-07 08:09:07

Public Diary:
Any updates on this ticket?
Internal Diary:
Added attachment f2eac07933a25841e6b6f7459660015f
https://ggus.eu/index.php?mode=download&attid=ATT119915
GGUS ID: 151484
Last modifier: Julia Andreeva
Date: 2021-04-22 09:21:31
Subject: No accounting information for a site in the EGI accounting portal (NCP-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Other
Description:
You are receiving this ticket because your site is not properly reporting accounting information to APEL and is shown with 0 consumption in the EGI accounting portal, and correspondingly in the WLCG monthly accounting reports. Please investigate the problem and fix it as soon as possible. Meanwhile, WLCG operations would like to remind you that you can provide your CPU and disk consumption during the monthly accounting validation of the WLCG sites; you receive a mail each month with instructions on how to proceed. Thank you. Julia.
GGUS ID: 151484
Last modifier: GGUS SYSTEM
Date: 2023-04-13 12:11:39

Public Diary:
Julia,

There has been no activity on this ticket for more than 2 years.
I assume this ticket is not relevant any longer?

GGUS ticket monitoring
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: Julia Andreeva
Date: 2023-04-13 13:38:20

Status: in progress
Responsible Unit: ROC_Asia/Pacific
Public Diary:
Hi Fawad,
The accounting problem is still not fixed. I see two HTCondor CEs used by CMS. Could you please make sure that the APEL client works properly and that information is sent from your site?

Thank you
Julia
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: Fawad Saeed
Date: 2023-04-20 05:36:02

Public Diary:
Hi Julia,
We have two HTCondor-CEs at our site. The first (htcondor-ce.ncp.edu.pk) is for test purposes and the other (htcondor-ce-2.ncp.edu.pk) is for production. Please note that our site is operating on very minimal computing resources; at the moment our CEs are not accepting any jobs except SAM jobs, just to maintain site availability/reliability. That is why APEL accounting was not configured at our site. Hopefully this year we will incorporate more computing resources, and then we will enable accounting for our site.
So my suggestion is to either hold or close (as unsolved) this ticket until further action.
Sorry for the delay and inconvenience.
Regards
Fawad Saeed
NCP-LCG2

Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: Renato Santana
Date: 2024-05-08 09:14:24

Public Diary:
Dear Fawad,
Any news concerning your site's APEL configuration?

Thanks in advance for updating this ticket information.

Cheers,
Renato
On behalf of EGI Operations Team

Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: Alessandro Paolini
Date: 2024-05-10 12:58:20

Public Diary:
Ticket category has been changed from Service Request to Incident.
changing ticket category to "incident".
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: Fawad Saeed
Date: 2024-08-29 11:03:02

Public Diary:
Hi,
First of all, sorry for replying after such a long time. We have configured APEL accounting for one of our newly installed HTCondor-CEs (pcncp04.ncp.edu.pk) on AlmaLinux 8. At the moment only jobs from the ops VO are allowed on this CE. Could you please check and confirm whether our site NCP-LCG2 is publishing accounting information correctly?
Thanks for your patience and support.
Regards
Fawad Saeed
NCP-LCG2
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: Alessandro Paolini
Date: 2024-09-17 14:56:18

Public Diary:
Hi Fawad,
at the moment there are no test jobs submitted with the ops VO since we are waiting for the new version of the probe for HTcondorCE.

Could you perhaps try to submit some jobs with the dteam VO, or alternatively enable a production VO, so that we can see in the coming days whether the accounting data are properly published?

Cheers,
Alessandro
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: Alessandro Paolini
Date: 2024-09-17 14:57:30
Changed CC to operations@egi.eu

Public Diary:
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: GGUS SYSTEM
Date: 2025-01-13 10:19:23

Public Diary:
Dear all,
Any news concerning this very old ticket?
Can we close it?

Cheers,
Renato

Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: GGUS SYSTEM
Date: 2025-01-27 10:19:42

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 151484
Last modifier: Fawad Saeed
Date: 2025-01-20 09:54:02

Public Diary:
Hi,
Just noticed this ticket, sorry for the unusual delay. I will come back to you within a few days. Please hold this ticket.
Regards
Fawad
Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 151484
Last modifier: GGUS SYSTEM
Date: 2025-01-20 10:19:30

Internal Diary:
Sent 2nd reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
Dear Fawad, this old 'service request' ticket was migrated to the new GGUS (Zammad). It is still in 'in progress' status.
Can we close this ticket?
Cheers,
Renato
EGI Operations
WLCG #681741 (id:1842) Missing CPU accounting data in the EGI portal for May. Monthly accounting validation has not been performed either. (NCP-LCG2)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:18 (430d ago)  |  Updated: 2025-02-03 11:18
Conversation (1 message)
GGUS ID: 167569
Last modifier: Julia Andreeva
Date: 2024-07-12 09:20:44
Subject: Missing CPU accounting data in the EGI portal for May. Monthly accounting validation has not been performed either. (NCP-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Other
Description:
Hello,
You get this ticket because accounting data for your site for May do not show up in the EGI accounting portal. In case you have trouble with accounting reporting (which should be followed up with the APEL support team), you should use monthly accounting validation in order to provide accounting metrics from your local accounting for the WLCG monthly accounting report. It has not been done for your site for May. In case you were not notified about monthly accounting validation, please contact me (julia.andreeva@cern.ch) so that we update the mailing list for notification. Please make sure that for June your numbers are provided and that your reporting to APEL is fixed.
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 96% 100% 100% 95% 100% 97% 100% 100% 100% 100% 95% 95% 100% 93%
HammerCloud: 98% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (8)

CMS tickets (1)
CMS #1001498 (id:1001498) CVMFS errors at T2_PL_CYFRONET
State: assigned  |  Priority: urgent  |  Opened: 2026-01-08 13:34 (86d ago)  |  Updated: 2026-01-08 13:34
Conversation (1 message)
Dear admins,
I see frequent CVMFS errors in the CMS SAM tests, which are often enough to make the SAM status critical:
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_PL_Cyfronet
An example of a failed test is
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.WN-02cvmfs-/cms-ce-token&var-dst_hostname=arc02.grid.cyfronet.pl&var-timestamp=1767868014000
which shows a timeout when trying to access /cvmfs/cms-griddata.cern.ch.
Solving this issue might improve the overall situation.
Could you please check what's going on?
Thanks,
Andrea
WLCG tickets (7)
WLCG #681657 (id:1758) Request to deploy IPv6 on CEs and WNs at WLCG sites (CYFRONET-LCG2)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:03 (430d ago)  |  Updated: 2026-03-31 08:30
Conversation (7 messages)
GGUS ID: 164336
Last modifier: Andrea Sciaba
Date: 2023-11-28 15:36:15
Subject: Request to deploy IPv6 on CEs and WNs at WLCG sites (CYFRONET-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_PL
Issue type: Other
Description:
Dear Tier-1/Tier-2 Site Support,

Please deploy dual-stack connectivity (IPv4+IPv6) on your computing services (computing elements and worker nodes) as soon as possible and by 30 June 2024 at the latest.

This is in response to a new deployment plan for IPv6, mandated by the WLCG Management Board and the LHC experiments.

For more details on the goal, the motivations and technical aspects, see https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Comp.
Please note that switching off IPv4 is NOT requested nor recommended at this stage: any step in this direction should first be discussed with the LHC experiments you support and WLCG.

Another purpose of this ticket is to track the status of this IPv6 deployment process at your site.

As a first step we ask you to answer this ticket as soon as possible with this information:
- your estimate of the timescale for the deployment;
- a few details about the steps required to fulfill the request;

and to add comments to this ticket whenever progress has been made.

In the unfortunate case it becomes evident that the deadline cannot be met, we would appreciate it if you could explain what are the obstacles and still give an estimate for the time of completion.

This ticket will only be closed on successful testing conducted by the LHC VO(s) supported by your site and using a dedicated IPv6-only ETF instance running the experiment’s functional tests.

For questions and requests for help you can contact the 'WLCG IPv6' support unit in GGUS.
GGUS ID: 164336
Last modifier: Andrea Sciaba
Date: 2023-11-28 16:59:46

Public Diary:
Concerned VO has been changed from cms to none.

Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
GGUS ID: 164336
Last modifier: Andrea Sciaba
Date: 2024-09-18 09:43:18

Public Diary:
Hi,
once again, could you please provide some information about the required IPv6 deployment on CEs and WNs?
Thanks, Andrea
Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
GGUS ID: 164336
Last modifier: Andrea Sciaba
Date: 2024-04-08 15:03:13

Public Diary:
Hi,
would it be possible to have some information?
Thanks,
Andrea
Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
GGUS ID: 164336
Last modifier: Andrea Sciaba
Date: 2025-01-24 16:05:02

Public Diary:
Hi,
would it be possible to have some information?
Thanks, Andrea
Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
Hi,
are there any news?
Cheers,
Andrea
Dear Cyfronet admins,
do you have any news on the IPv6 deployment?
This is becoming even more important, as NCBJ now has IPv6-only storage, which would require IPv6 on your WNs if CMS jobs need to access data on NCBJ.
Cheers,
Andrea
WLCG #1001209 (id:1001209) LCG.CYFRONET.pl - New LSC VOMS server for Belle II (voms.cc.kek.jp)
State: assigned  |  Priority: very urgent  |  Opened: 2025-11-21 10:18 (134d ago)  |  Updated: 2026-03-17 14:18
Conversation (12 messages)
Hello,
following the previous announcement about the belle-auth.cc.kek.jp
VOMS server, an additional update is now required on Grid services
at the sites for Belle II.

The LSC file for voms.cc.kek.jp has been updated with the new host
certificate subject and the new issuer KEK CA 2024

Please replace it in all relevant services (CE, SEs, UI, etc.)
with the new version available at:

/cvmfs/grid.cern.ch/etc/grid-security/vomsdir/belle/voms.cc.kek.jp.lsc

[~]$ cat voms.cc.kek.jp.lsc
/C=JP/O=KEK/OU=CRC/CN=voms.cc.kek.jp
/C=JP/O=KEK/OU=CRC/CN=KEK GRID Certificate Authority2024

Kindly confirm once the update is completed.
Thank you very much for your cooperation.
Best regards,
Silvio
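The replace-and-verify step Silvio asks for can be sketched as a small shell helper. The reference path is the CVMFS one quoted above; the local vomsdir location and the helper name are illustrative assumptions, not from the ticket:

```shell
# Compare a deployed LSC file against its reference copy and report its state.
# Example arguments (assumptions):
#   ref:      /cvmfs/grid.cern.ch/etc/grid-security/vomsdir/belle/voms.cc.kek.jp.lsc
#   deployed: /etc/grid-security/vomsdir/belle/voms.cc.kek.jp.lsc
lsc_status() {
    ref="$1"
    deployed="$2"
    if cmp -s "$ref" "$deployed"; then
        echo "up-to-date"
    else
        echo "stale"    # missing file or differing content
    fi
}
```

Running this on each CE/SE/UI host and re-copying the reference over the deployed file whenever it reports "stale" is one way to confirm the update actually propagated everywhere.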
Any news on this?
Thank you so much
Silvio
Remind.
Thnx
Silvio
Any news?
Thnx
Silvio
Hello,
the voms file has been updated on all SE nodes.
Cheers
Oskar
Dear Oskar,
I see that I cannot access the storage with a proxy whose VOMS attribute was issued by voms.cc.kek.jp. Could you please double-check that the LSC file is properly propagated?
Thnx
Silvio
[spardi@ui-tier1 TPC2021]$ cat /cvmfs/grid.cern.ch/etc/grid-security/vomsdir/belle/voms.cc.kek.jp.lsc
/C=JP/O=KEK/OU=CRC/CN=voms.cc.kek.jp
/C=JP/O=KEK/OU=CRC/CN=KEK GRID Certificate Authority2024

[spardi@ui-tier1 TPC2021]$ gfal-ls davs://eos01.grid.cyfronet.pl:8443/eos/belle/TMP/belle/test/TPC/
gfal-ls error: 1 (Operation not permitted) - HTTP 403 : Permission refused
[spardi@ui-tier1 TPC2021]$ voms-proxy-info --all
subject : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Silvio Pardi spardi@infn.it/CN=1153329420
issuer : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Silvio Pardi spardi@infn.it
identity : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Silvio Pardi spardi@infn.it
type : RFC3820 compliant impersonation proxy
strength : 2048
path : /home/VIRGO/spardi/proxy.dirac1.belle_pilot_full
timeleft : 144:41:47
key usage : Digital Signature, Key Encipherment
=== VO belle extension information ===
VO : belle
subject : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Silvio Pardi spardi@infn.it
issuer : /C=JP/O=KEK/OU=CRC/CN=voms.cc.kek.jp
attribute : /belle/Role=production/Capability=NULL
attribute : /belle/Role=NULL/Capability=NULL
attribute : /belle/team/Role=NULL/Capability=NULL
attribute : nickname = spardi (belle)
timeleft : 144:41:46
uri : voms.cc.kek.jp:15020
Also the ARC-CE is not accessible; could you please double-check the situation on the CE as well?
Thnx
(base) -bash-5.1$ arcsub -c arc02.grid.cyfronet.pl helloWorld.xrls
ERROR: One or multiple job descriptions was not submitted.
Any news?
Thnx
Hello,
I've made a typo in a VOMS file on our SE, sorry about that. It should be fixed now.
We have some problems with CE, our team is trying to solve it.
Cheers
Oskar
Any news on this?
Also the ARC-CE is not accessible; could you please double-check the situation on the CE as well?
Thnx
(base) -bash-5.1$ arcsub -c arc02.grid.cyfronet.pl helloWorld.xrls
ERROR: One or multiple job descriptions was not submitted.
Please follow this ticket
ARCCE is accessible:
(base) [psgebeli@linuxfarmb ~]$ arcsub -c arc02.grid.cyfronet.pl helloWorld.xrls
Job submitted with jobid: gsiftp://arc02.grid.cyfronet.pl:2811/jobs/YvqLDmpPkM9n3o564oLSK7HpABFKDmABFKDmNFIKDmABFKDmXlvpIo
(base) [psgebeli@linuxfarmb ~]$
WLCG #681663 (id:1764) Request to implement BGP tagging of LHCONE prefixes. (CYFRONET-LCG2)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:04 (430d ago)  |  Updated: 2025-02-03 14:15
Conversation (3 messages)
GGUS ID: 168339
Last modifier: Julia Andreeva
Date: 2024-09-23 15:42:46
Subject: Request to implement BGP tagging of LHCONE prefixes. (CYFRONET-LCG2)
Ticket Type: USER
CC: ;edoardo.martelli@cern.ch
Status: assigned
Responsible Unit: NGI_PL
Issue type: Network problem
Description:
This ticket concerns all the sites connected to LHCONE.

In agreement with the WLCG Management Board, it has been decided to
implement the tagging of the IP prefixes announced to LHCONE.
The task consists of tagging the IP prefixes that your site announces to
LHCONE with all the BGP communities that identify the experiments and
collaborations supported by your site. The initial goal is to document
the use of the network. In the longer term the tags may be used to
reduce the exposure on the LHCONE connection, by filtering unnecessary
prefixes.

You will find the values of the BGP communities to use and other
information in these pages:
- https://twiki.cern.ch/twiki/bin/view/LHCONE/MultiOneBGPcommunities
-
https://indico.cern.ch/event/1356138/contributions/6123461/attachments/2925447/5147273/WLCG-20240911-GDB-MultiONE-implementation.pdf

If you need any support on this task, please don't hesitate to ask your
NREN or LHCONE provider.
Or just reply to this ticket asking your questions; experts will guide
you in the implementation.

Please take this opportunity also to review the network information
related to your site in CRIC :
https://wlcg-cric.cern.ch/core/networkroute/list/

We ask you to perform the required action by the end of March 2025.
GGUS ID: 168339
Last modifier: Julia Andreeva
Date: 2025-01-28 14:32:14

Public Diary:
Any progress on this ticket?
Internal Diary:
Escalated this ticket to NGI_PL
WLCG #681678 (id:1779) Enable site network monitoring (CYFRONET-LCG2)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:05 (430d ago)  |  Updated: 2025-02-03 10:44
Conversation (2 messages)
GGUS ID: 162953
Last modifier: Julia Andreeva
Date: 2023-08-02 09:02:41
Subject: Enable site network monitoring (CYFRONET-LCG2)
Ticket Type: USER
CC: ;smckee@umich.edu
Status: assigned
Responsible Unit: NGI_PL
Issue type: Monitoring
Description:
WLCG Sites / Site Administrators / Networking Support,
As presented at the WLCG Ops Coordination meeting on April 6, 2023, our WLCG Monitoring Task Force is initiating a campaign to enable site network monitoring and to gather associated network information in preparation for Data Challenge 2024 (DC24).
Our primary targets are the Tier-1’s and larger Tier-2s but we would like to see as many sites participating as possible.
You can find detailed instructions in Gitlab at CERN: https://gitlab.cern.ch/wlcg-doma/site-network-information
In case you do not have access to the project Gitlab at CERN, 3 PDF files are attached to the twiki page:
https://twiki.cern.ch/twiki/bin/view/LCG/SiteNetworkMonitoring
They capture information from the project Gitlab at CERN. These three PDFs provide the overview of the project, an example site network information template to be filled out and information detailing how to provide your site’s network metrics.
We would like sites to complete this by the end of September 2023 to give us time to verify the data and provide any fixes well in advance of DC24.
Interacting with GitLab requires a CERN account. If you have issues adding your site files to GitLab or WLCG-CRIC, please contact us.
Best regards,
The WLCG Monitoring Task Force
GGUS ID: 162953
Last modifier: Julia Andreeva
Date: 2023-10-05 09:54:54

Public Diary:
Any progress on this ticket?
Internal Diary:
Sent 1st reminder to ticket submitter (natalia.diana.szczepanek@cern.ch) requesting input.
WLCG #681670 (id:1771) Missing CPU accounting data in the EGI portal for March. Monthly accounting validation has not been performed either. (CYFRONET-LCG2)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:05 (430d ago)  |  Updated: 2025-02-03 10:41
Conversation (5 messages)
GGUS ID: 166646
Last modifier: Julia Andreeva
Date: 2024-04-29 13:50:04
Subject: Missing CPU accounting data in the EGI portal for March. Monthly accounting validation has not been performed either. (CYFRONET-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_PL
Issue type: Other
Description:
CPU accounting data is missing for your site in the EGI accounting portal for March, validation has not been performed either. Please, make sure that your site is properly reporting to APEL. While solving the problem, please, provide proper accounting metrics during monthly validation.
GGUS ID: 166646
Last modifier: Patryk Lason
Date: 2024-04-29 19:49:02

Public Diary:
Hi,
how can we check and republish accounting data? Is there any instruction we should follow? Everything used to work smoothly.
Thank you
Patryk
On 29 April 2024 at 15:52:22 CEST, helpdesk@ggus.org wrote:
>Dear support staff,
> Ticket #166646 for site "CYFRONET-LCG2" is ASSIGNED to you.
>
> REFERENCE LINK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166646
> SUBJECT: Missing CPU accounting data in the EGI portal for March. Monthly accounting validation has not been performed either. (CYFRONET-LCG2)
>
> TICKET INFORMATION:
> DESCRIPTION:
> CPU accounting data is missing for your site in the EGI accounting portal for March, validation has not been performed either. Please, make sure that your site is properly reporting to APEL. While solving the problem, please, provide proper accounting metrics during monthly validation.
> NOTIFIED SITE: CYFRONET-LCG2
> CONCERNED VO: none
> PRIORITY: urgent
> ISSUE TYPE: Other
> SUBMITTER: Julia Andreeva
> *********************************************************************
> This is an automated mail. When replying don't change the subject line!
> S T R I P P R E V I O U S M A I L S please!!
> *********************************************************************
GGUS ID: 166646
Last modifier: GGUS SYSTEM
Date: 2024-12-20 08:19:03

Internal Diary:
Sent 1st reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 166646
Last modifier: GGUS SYSTEM
Date: 2024-12-27 08:19:12

Internal Diary:
Sent 2nd reminder to ticket submitter (julia.andreeva@cern.ch) requesting input.
GGUS ID: 166646
Last modifier: GGUS SYSTEM
Date: 2025-01-03 08:19:23

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
WLCG #681666 (id:1767) CE arc02.grid.cyfronet.pl is not working for Biomed users
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:04 (430d ago)  |  Updated: 2025-02-03 10:40
Conversation (5 messages)
GGUS ID: 166178
Last modifier: Sorina Pop
Date: 2024-04-09 15:28:42
Subject: CE arc02.grid.cyfronet.pl is not working for Biomed users
Ticket Type: TEAM
CC:
Status: assigned
Responsible Unit: NGI_PL
Issue type: Other
Description:
Dear sites admins,

CE arc02.grid.cyfronet.pl is not working for Biomed users. The incident was detected from the Biomed ARGO box that you may want to check to see the status: https://biomed.ui.argo.grnet.gr/biomed/report-status/CORE/SITES/CYFRONET-LCG2/ARC-CE/arc02.grid.cyfronet.pl

According to the announcement in https://operations-portal.egi.eu/broadcast/archive/3021, the biomed VOMS server host certificate changed on Friday, 29th of March, requiring client updates, and it is likely that the errors are due to this change. Could you please have a look?

Thank you in advance for your support,
Sorina for the Biomed VO.
GGUS ID: 166178
Last modifier: Pansanel Jerome
Date: 2024-05-17 15:06:40

Public Diary:
Hi,
Is there any news?
Best,
Jerome
Internal Diary:
Escalated this ticket to NGI_PL
GGUS ID: 166178
Last modifier: Akos Szlavecz
Date: 2024-05-01 10:30:00

Public Diary:
Dear site admins,

Could you check this issue, please?

Regards
Ákos
Internal Diary:
Escalated this ticket to NGI_PL
GGUS ID: 166178
Last modifier: Akos Szlavecz
Date: 2024-06-10 12:23:18

Public Diary:
Dear site admins,

The site is still not working for the biomed VO.
Could you check this issue, please?

Regards
Ákos
Internal Diary:
Escalated this ticket to NGI_PL
GGUS ID: 166178
Last modifier: Sorina Pop
Date: 2024-08-21 12:20:43

Public Diary:
Dear site admins,

Any news on this issue?

Best regards,
Sorina
Internal Diary:
Escalated this ticket to NGI_PL
WLCG #681656 (id:1757) Migration of data from SE se02.grid.cyfronet.pl (CTA)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:03 (430d ago)  |  Updated: 2025-02-03 10:36
Conversation (5 messages)
GGUS ID: 168438
Last modifier: Krzysztof Oziomek
Date: 2024-09-26 09:38:47

Public Diary:
Hi Ernst,
I see that the related pull request is being worked on: https://github.com/italiangrid/voms-admin-server/pull/78
I've just asked the product team when they expect to complete it.

cheers,
Alessandro
Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
GGUS ID: 168438
Last modifier: Krzysztof Oziomek
Date: 2024-09-26 09:37:33
Subject: Migration of data from SE se02.grid.cyfronet.pl (CTA)
Ticket Type: USER
CC:
Responsible Unit: TPM
Issue type: Other
Description:
As per schedule in parent ticket:

https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434

The VO has time until 2024-11-12 (an extension can be requested) to retrieve data from the decommissioned service.

Please communicate using this ticket when the data retrieval is complete.
Affected Site: CYFRONET-LCG2
GGUS ID: 168438
Last modifier: Krzysztof Oziomek
Date: 2024-09-26 09:52:39

Internal Diary:
Reset Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
GGUS ID: 168438
Last modifier: Lukas Pacher
Date: 2024-09-26 21:18:22

Status: assigned
Responsible Unit: NGI_PL
Internal Diary:
Reset Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
GGUS ID: 168438
Last modifier: Krzysztof Oziomek
Date: 2024-10-04 08:15:53

Internal Diary:
Added Parent-Child relation with parent ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168434
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 100% 100% 97% 100% 97% 97% 100% 95% 95% 68% 0% 57% 95%
HammerCloud: 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (3)

CMS tickets (1)
CMS #1002228 (id:1002228) XRootD SAM test failure at T2_PT_NCG_Lisbon
State: in progress  |  Priority: less urgent  |  Opened: 2026-04-01 08:37 (3d ago)  |  Updated: 2026-04-02 10:03
Conversation (2 messages)
Good morning, Lisbon admins.
Since 13:00 UTC yesterday (31 Mar), your XRootD endpoint has been failing the SAM "1connection" test [1]. The log file shows "Connection refused" errors [2], and a manual test gave the same result [3]. Could you please take a look at this server and check its status/connection?
Cheers,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_PT_NCG_Lisbon
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-1connection&var-dst_hostname=xroot01.ncg.ingrid.pt&var-timestamp=1775030007355
[3]
[crungphi@lxplus800 ~]$ nc -zv -4 -w 30 193.136.75.164 1094
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connection refused.
[crungphi@lxplus800 ~]$ nc -zv -6 -w 30 xroot01.ncg.ingrid.pt 1094
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connection refused.
[crungphi@lxplus800 ~]$ ping -c 5 xroot01.ncg.ingrid.pt

PING xroot01.ncg.ingrid.pt(xroot01.ncg.ingrid.pt (2001:690:2150:aa:35::3)) 56 data bytes
--- xroot01.ncg.ingrid.pt ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4133ms

[crungphi@lxplus800 ~]$ ping -c 5 -4 xroot01.ncg.ingrid.pt

PING xroot01.ncg.ingrid.pt (193.136.75.164) 56(84) bytes of data.
--- xroot01.ncg.ingrid.pt ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4127ms

[crungphi@lxplus800 ~]$ tracepath xroot01.ncg.ingrid.pt

1?: [LOCALHOST] 0.009ms pmtu 1500
1: k513-v-rjuxl-v12-ip203.ipv6.cern.ch 14.714ms
1: k513-v-rjuxl-v11-ip203.ipv6.cern.ch 13.889ms
2: k513-b-rjupl-v1-ca41.ipv6.cern.ch 0.788ms
3: b773-b-rjuxl-2-cd10.cern.ch 1.477ms
4: g773-e-rjuxm-20-sy7.cern.ch 2.133ms
5: g773-e-fpa78-2-fn2.cern.ch 8.312ms
6: e773-e-rjuxm-v20-fo2.cern.ch 8.996ms
7: e773-e-rjup1-2-sx7.cern.ch 8.852ms
8: no reply
9: lag-7-0.rt0.par.fr.geant.net 17.012ms asymm 10
10: lag-4-0.rt0.bil.es.geant.net 27.476ms
11: lag-2-0.rt0.por.pt.geant.net 38.757ms
12: no reply
...
30: no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
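The ncat checks above can be bundled into a single helper for re-testing the endpoint. This is a sketch under stated assumptions (bash with network redirections and coreutils `timeout` available); it is not part of the original report:

```shell
# tcp_open HOST PORT - report whether a TCP connect to HOST:PORT succeeds,
# using bash's /dev/tcp pseudo-device instead of ncat.
tcp_open() {
    host="$1"
    port="$2"
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed"    # covers refused, filtered, and timed-out connects alike
    fi
}
```

For example, `tcp_open xroot01.ncg.ingrid.pt 1094` would print "closed" while the server refuses connections, and "open" once the service is restored.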
Hi,
The issue should be solved; one of our scripts had removed some RPMs critical to the XRootD data servers.

Cheers
WLCG tickets (2)
WLCG #681568 (id:1669) SE srm01.ncg.ingrid.pt is not working for Biomed users
State: waiting for submitter's reply  |  Priority: less urgent  |  Opened: 2025-01-29 09:55 (430d ago)  |  Updated: 2025-08-07 13:46
Conversation (5 messages)
GGUS ID: 167118
Last modifier: Akos Szlavecz
Date: 2024-06-12 14:43:27
Subject: SE srm01.ncg.ingrid.pt is not working for Biomed users
Ticket Type: TEAM
CC: biomed-issues-followup@googlegroups.com;
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: File Transfer
Description:
Dear site admins,

SE srm01.ncg.ingrid.pt is not working for Biomed users.
The incident was detected from the Biomed ARGO box that you may want to check to see the status:
https://biomed.ui.argo.grnet.gr/biomed/report-status/CORE/SITES/NCG-INGRID-PT/SRM/srm01.ncg.ingrid.pt

Could you check this issue, please?

Thank you in advance for your support,
Ákos for the Biomed VO
GGUS ID: 167118
Last modifier: Joao Pina
Date: 2024-09-17 14:43:43

Status: in progress
Responsible Unit: NGI_IBERGRID
Internal Diary:
Added attachment EGI VO CLARIN SLA report 2024-05 - 2024-10.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119686
GGUS ID: 167118
Last modifier: Sorina Pop
Date: 2024-08-21 12:20:05
Changed CC to biomed-issues-followup@googlegroups.com;

Public Diary:
Dear site admins,

Any news on this issue?

Best regards,
Sorina
Internal Diary:
Added attachment EGI VO CLARIN SLA report 2024-05 - 2024-10.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119686
GGUS ID: 167118
Last modifier: Joao Pina
Date: 2024-09-17 14:45:00

Public Diary:
Hi,

SRM01 was off for some time since it is not used by WLCG. Can you confirm whether everything is working?

Cheers
Joao Pina
Internal Diary:
Added attachment EGI VO CLARIN SLA report 2024-05 - 2024-10.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119686
srm01 was decommissioned.
Can Biomed use a StoRM WebDAV endpoint for file management?

Cheers
Joao Pina
WLCG #681569 (id:1670) Upgrade your VOMS endpoint(s) to EL9 (NCG-INGRID-PT)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:55 (430d ago)  |  Updated: 2025-01-30 14:37
Conversation (2 messages)
GGUS ID: 167798
Last modifier: Alessandro Paolini
Date: 2024-08-05 11:30:29
Subject: Upgrade your VOMS endpoint(s) to EL9 (NCG-INGRID-PT)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IBERGRID
Issue type: Middleware
Description:
Dear Site administrators,

with this ticket we are going to track the migration of your VOMS endpoint(s) to EL9.

The release notes of the new version can be found at https://italiangrid.github.io/voms/releases.html

VOMS documentation: https://italiangrid.github.io/voms/documentation.html

Clean installation guide: https://italiangrid.github.io/voms/documentation/sysadmin-guide/3.0.14/clean-installation.html

Packages available on the product team repository: https://italiangrid.github.io/voms-repo/

EL9 stable repository: https://repo.cloud.cnaf.infn.it/service/rest/repository/browse/voms-rpm-stable/redhat9/

repo file: https://italiangrid.github.io/voms-repo/repofiles/rhel/voms-stable-el9.repo

You should do a dump of each VO database on the current server and then restore them once the new server is up and running.

Please try to complete the migration before the end of the month.

Best regards,
Alessandro
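The dump-and-restore step Alessandro describes can be sketched as a helper that prints the per-database commands. The database names (e.g. voms_belle) and the mysqldump flags are illustrative assumptions for a MySQL/MariaDB-backed VOMS, not instructions from the ticket:

```shell
# voms_migration_plan DB [DB...] - for each VOMS VO database, print the dump
# command to run on the old server and the restore command for the new EL9 host.
# Database names are hypothetical examples; adjust flags to your DB setup.
voms_migration_plan() {
    for db in "$@"; do
        printf 'old-server$ mysqldump --single-transaction --routines %s > %s.sql\n' "$db" "$db"
        printf 'new-server$ mysql %s < %s.sql\n' "$db" "$db"
    done
}
```

For example, `voms_migration_plan voms_belle voms_dteam` prints the four commands to run, two on each host.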
GGUS ID: 167798
Last modifier: Joao Pina
Date: 2024-08-29 14:04:00

Status: in progress
Responsible Unit: NGI_IBERGRID
Public Diary:
Hi,

We have added voms to list of servers to be upgraded.

Cheers
Joao Pina
Internal Diary:
Added attachment EGI VO CLARIN SLA report 2024-05 - 2024-10.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119686
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%100%100%100%100%100%86%89%100%100%100%97%97%99%100%
HammerCloud100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (2)

WLCG tickets (2)
WLCG #681789 (id:1890) Upgrade your VOMS endpoint(s) to EL9 (JINR-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:21 (430d ago)  |  Updated: 2025-02-03 11:37
Conversation (2 messages)
GGUS ID: 167795
Last modifier: Valery Mitsyn
Date: 2024-08-05 15:39:47

Status: in progress
Responsible Unit: ROC_Russia
Public Diary:
Hi,
did you have the chance to verify the information?
Internal Diary:
Added attachment 2024-10-27-1532__cceos.ihep.ac.cn__se.hpc.utfsm.cl__7020978411__b01d9ea4-9478-11ef-8b11-fa163e5a92fb.txt
https://ggus.eu/index.php?mode=download&attid=ATT119668
GGUS ID: 167795
Last modifier: Alessandro Paolini
Date: 2024-08-05 11:30:22
Subject: Upgrade your VOMS endpoint(s) to EL9 (JINR-LCG2)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: ROC_Russia
Issue type: Middleware
Description:
Dear Site administrators,

with this ticket we are going to track the migration of your VOMS endpoint(s) to EL9.

The release notes of the new version can be found at https://italiangrid.github.io/voms/releases.html

VOMS documentation: https://italiangrid.github.io/voms/documentation.html

Clean installation guide: https://italiangrid.github.io/voms/documentation/sysadmin-guide/3.0.14/clean-installation.html

Packages available on the product team repository: https://italiangrid.github.io/voms-repo/

EL9 stable repository: https://repo.cloud.cnaf.infn.it/service/rest/repository/browse/voms-rpm-stable/redhat9/

repo file: https://italiangrid.github.io/voms-repo/repofiles/rhel/voms-stable-el9.repo

You should do a dump of each VO database on the current server and then restore them once the new server is up and running.

Please try to complete the migration before the end of the month.

Best regards,
Alessandro
WLCG #681785 (id:1886) Enable site network monitoring (JINR-LCG2)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:21 (430d ago)  |  Updated: 2025-02-03 11:36
Conversation (2 messages)
GGUS ID: 162967
Last modifier: Julia Andreeva
Date: 2023-08-02 09:03:17
Subject: Enable site network monitoring (JINR-LCG2)
Ticket Type: USER
CC: ;smckee@umich.edu
Status: assigned
Responsible Unit: ROC_Russia
Issue type: Monitoring
Description:
WLCG Sites / Site Administrators / Networking Support,
As presented at the WLCG Ops Coordination meeting on April 6, 2023, our WLCG Monitoring Task Force is initiating a campaign to enable site network monitoring and to gather associated network information in preparation for Data Challenge 2024 (DC24).
Our primary targets are the Tier-1s and larger Tier-2s, but we would like to see as many sites participating as possible.
You can find detailed instructions in Gitlab at CERN: https://gitlab.cern.ch/wlcg-doma/site-network-information
In case you do not have access to the project Gitlab at CERN, 3 PDF files are attached to the twiki page:
https://twiki.cern.ch/twiki/bin/view/LCG/SiteNetworkMonitoring
They capture information from the project Gitlab at CERN. These three PDFs provide the overview of the project, an example site network information template to be filled out and information detailing how to provide your site’s network metrics.
We would like sites to complete this by the end of September 2023 to give us time to verify the data and provide any fixes well in advance of DC24.
Interacting with GitLab requires a CERN account. If you have issues adding your site files to GitLab or WLCG-CRIC, please contact us.
Best regards,
The WLCG Monitoring Task Force
GGUS ID: 162967
Last modifier: Valery Mitsyn
Date: 2023-08-02 14:05:51

Status: in progress
Responsible Unit: ROC_Russia
Public Diary:
Hi Julia,
is this ticket still relevant?
Thanks.
Guenter
Internal Diary:
Added attachment 2024-10-27-1532__cceos.ihep.ac.cn__se.hpc.utfsm.cl__7020978411__b01d9ea4-9478-11ef-8b11-fa163e5a92fb.txt
https://ggus.eu/index.php?mode=download&attid=ATT119668
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%97%100%100%97%100%97%100%100%100%100%95%91%97%97%
HammerCloud100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (4)

CMS tickets (3)
CMS #1000836 (id:1000836) Transfer errors from METU to CERN
State: assigned  |  Priority: less urgent  |  Opened: 2025-10-13 12:11 (173d ago)  |  Updated: 2026-03-26 12:28
Conversation (8 messages)
Hi, I'm trying to understand why all loadtest transfers from METU to CERN fail:
https://monit-grafana.cern.ch/d/CIjJHKdGk/fts-transfers?orgId=20&from=now-7d&to=now&var-group_by=dst_hostname&var-vo=cms&var-src_country=All&var-dst_country=All&var-src_site=All&var-dst_site=All&var-fts_server=All&var-bin=1h&var-include=&var-filters=data.src_hostname%7C%3D%7Ceymir.grid.metu.edu.tr&var-filters=data.dst_hostname%7C%3D%7Ceoscms.cern.ch&var-protocol=All&var-auth=All&var-staging=All

For example:

https://fts3-cms.cern.ch:8449/fts3/ftsmon/#/job/829186c2-a821-11f0-be45-fa163e65d4d2

From the transfer log, the error is
`Non recoverable error: [1] DavPosix::unlink HTTP 403 : Permission refused` and there is also a `Destination file exists!`
If I do from LXPLUS
gfal-ls -l davs://eoscms.cern.ch:443/eos/cms/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001
I get
-rwxrwxrwx 0 0 0 270000000 Apr 3 2025 davs://eoscms.cern.ch:443/eos/cms/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001
which dates back to April!
Is that normal?
Thanks,
Andrea
Dear Andrea,
Apologies for the delay. I just realised this was opened against the CMS Data management team.

I had a look and it seems that the problem is that the job cannot delete the file in order to overwrite it.
This might be a token related issue.

Mihai from the FTS team had a look at the token itself and it includes the storage.modify:/store/test/loadtest/source/ scope

The file FTS tries to delete is under `/eos/cms/store/test/loadtest/source/`
I think the CERN EOSCMS admins should have a look into this.
I'm reassigning the ticket to them and I'm happy to provide any other details that might be needed.

I'm not sure my reassignment will work as expected so I also manually include Guilherme, just in case :)

Cheers,
Panos
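Panos's scope argument can be sketched as a simple prefix check. This is an illustration of WLCG-profile storage-scope matching under assumed semantics, not FTS or EOS code, and `scope_allows` is a hypothetical helper:

```python
# Illustrative sketch: WLCG storage scopes like "storage.modify:/path/"
# are treated as path prefixes granting an operation under that path.

def scope_allows(scopes, operation, path):
    """True if any space-separated scope grants `operation` on `path`."""
    for scope in scopes.split():
        if ":" not in scope:
            continue
        op, prefix = scope.split(":", 1)
        base = prefix.rstrip("/")
        if op == operation and (path == base or path.startswith(base + "/")):
            return True
    return False

token_scopes = "storage.read:/store storage.modify:/store/test/loadtest/source/"

# The delete target sits under the storage.modify prefix, so the token
# itself permits it; the 403 must come from somewhere else.
print(scope_allows(token_scopes, "storage.modify",
                   "/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001"))  # True
```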
No worries, let me know where it's best to submit this type of ticket in the future.
Cheers,
Andrea
Dear Panos, Andrea,
It's fine to keep the ticket here. I tried to look at the logs of the FTS job, but it's gone. Would you have a more recent example?
There were some recent changes in EOS to support more things with tokens, but this particular case is probably not related.
I will have a look at the logs on EOS to see if there is a bug, since the token should have been able to delete/overwrite the file.

Cheers,
-Guilherme
Dear Guilherme,
Thanks a lot for having a look! Here is a more recent log https://fts-cms-009.cern.ch:8449/var/log/fts3/transfers/2025-10-27/eymir.grid.metu.edu.tr__eoscms.cern.ch/2025-10-27-0811__eymir.grid.metu.edu.tr__eoscms.cern.ch__5219922730__675628e0-b30c-11f0-837f-fa163e9e00e6

I also attach a .txt with the log in case it disappears again.
I hope this helps with debugging.

Cheers,
Panos
Hi Panos,
I see
```
INFO Mon, 27 Oct 2025 09:11:28 +0100; Destination file exists!
```

and also
```
INFO Mon, 27 Oct 2025 09:11:28 +0100; Job metadata: {"issuer": "rucio", "multi_sources": false, "overwrite_when_only_on_disk": false, "auth_method": "oauth2"}
```
that is, overwrite_when_only_on_disk is false, so I guess this is the expected behavior, no?

Best regards,
-Guilherme
Thanks a lot for having a look Guilherme! overwrite_when_only_on_disk is something that FTS should take into consideration only for transfers to tape and I think this one points to a disk: the destination url is `davs://eoscms.cern.ch:443/eos/cms/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001`

In the log I see this line "Overwrite enabled: 1" which I assume means that FTS should try to overwrite the file. I think part of this overwrite is trying to delete the file, which results in the 403 error.
I believe that if overwrite_when_only_on_disk was taken into consideration FTS wouldn't even try to delete the file.

Best,
Panos
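Panos's reading of the flags can be sketched as follows; this is a hedged guess at the decision logic, not actual FTS source code, and the function name is hypothetical:

```python
# Sketch of the overwrite decision described above (an assumption about
# FTS behaviour, not its implementation).

def should_delete_destination(dest_exists, overwrite,
                              overwrite_when_only_on_disk,
                              dest_is_tape, file_only_on_disk=False):
    """Should the transfer unlink the destination before copying?"""
    if not dest_exists:
        return False
    if overwrite:
        return True        # plain overwrite: always delete the old file first
    if overwrite_when_only_on_disk and dest_is_tape and file_only_on_disk:
        return True        # tape-only flag: replace copies not yet on tape
    return False

# The loadtest case: disk destination, "Overwrite enabled: 1",
# overwrite_when_only_on_disk=false -> FTS attempts the delete,
# and EOS answers with HTTP 403.
print(should_delete_destination(dest_exists=True, overwrite=True,
                                overwrite_when_only_on_disk=False,
                                dest_is_tape=False))  # True
```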
Hi, I'd like to revive this discussion to be able to close the ticket.
I checked the last 24h and see only two kinds of failures, lack of quota and permission denied errors. Here are log lines from today showing the problem:

/var/log/eos/mgm/xrdlog.mgm-2026-03-26-1774501201.gz:260326 05:55:48 2001294 scitokens_Access: Grant authorization based on scopes for operation=del, path=/eos/cms/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001
/var/log/eos/mgm/xrdlog.mgm-2026-03-26-1774501201.gz:260326 05:55:48 time=1774500948.732348 func=Delete level=INFO logid=static.............................. unit=mgm@eoscms-ns-ip700.cern.ch:1094 tid=00007fc603255640 source=HttpHandler:709 tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" method=DELETE path=/eos/cms/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001
/var/log/eos/mgm/xrdlog.mgm-2026-03-26-1774501201.gz:260326 05:55:48 time=1774500948.732606 func=_rem level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eoscms-ns-ip700.cern.ch:1094 tid=00007fc603255640 source=Rm:110 tident=<single-exec> sec=https uid=8959 gid=1399 name=cmsprd geo="" xt="" ob="" path=/eos/cms/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001 vid.uid=8959 vid.gid=1399
/var/log/eos/mgm/xrdlog.mgm-2026-03-26-1774501201.gz:260326 05:55:48 time=1774500948.732868 func=MakeResult level=ERROR logid=static.............................. unit=mgm@eoscms-ns-ip700.cern.ch:1094 tid=00007fc603255640 source=ProcCommand:556 tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" error: unable to remove file/directory '/eos/cms/store/test/loadtest/source/T2_TR_METU/urandom.270MB.file0001' (errno=1)

The permission denied errors are due to the ACL that is set in the directory on EOS. The token allows the operation, but EOS doesn't because of the ACLs:

$ eos acl -l /eos/cms/store/test/loadtest/source/T2_TR_METU

# sys.acl
u:22014:rw,u:relval:rw,u:cmsprod:rw,g:zh:rwx!d

So, if you want these transfers to work, we need to ensure that the user attempting the transfers (cmsprd) has permission to delete files in this directory. I'd suggest adjusting it like this (without the cmst0 part):

[root@eoscms-ns-ip700 (mgm:master mq:master) ~]$ eos acl -l /eos/cms/store/mc/SAM
# sys.acl
u:22014:rwx,u:relval:rw,u:cmsprod:rwx,u:cmsprd:rwx,u:cmst0:rwx,g:zh:rwx!d

Best regards,
-Guilherme
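The ACL behaviour Guilherme describes can be illustrated with a small parser. This is a simplification of EOS's actual evaluation rules, and `user_can_delete` is a hypothetical helper for illustration:

```python
# Simplified reading of an EOS sys.acl string such as the one above.

def acl_entries(sysacl):
    """Parse 'u:cmsprod:rw,g:zh:rwx!d' into {(kind, name): perms}."""
    out = {}
    for entry in sysacl.split(","):
        kind, name, perms = entry.split(":")
        out[(kind, name)] = perms
    return out

def user_can_delete(sysacl, user, groups=()):
    """Deletion needs a matching entry granting 'w' without the '!d' deny flag."""
    entries = acl_entries(sysacl)
    for key in [("u", user)] + [("g", g) for g in groups]:
        perms = entries.get(key)
        if perms is None:
            continue
        if "!d" in perms:
            return False                # explicit delete denial
        return "w" in perms
    return False                        # no matching entry

acl = "u:22014:rw,u:relval:rw,u:cmsprod:rw,g:zh:rwx!d"
print(user_can_delete(acl, "cmsprd"))   # False: cmsprd is absent from the ACL
print(user_can_delete(acl, "cmsprod"))  # True
```

This mirrors the conclusion in the ticket: the token authorizes the delete, but the directory ACL does not know the cmsprd account, so EOS refuses.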
CMS #681694 (id:1795) enable IAM token for dCache at T2_TR_METU
State: in progress  |  Priority: urgent  |  Opened: 2025-01-29 10:07 (430d ago)  |  Updated: 2025-08-18 14:45
Conversation (27 messages)
GGUS ID: 165519
Last modifier: Chan-anun Rungphitakchai
Date: 2024-03-07 19:19:35
Subject: enable IAM token for dCache at T2_TR_METU
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_TR
Issue type: Storage Systems
Description:
Hello METU admins,
IAM token support for dCache is ready [1]. You could consider upgrading your dCache door node and configuring IAM token access. First, ensure your dCache version is 9.2 (the recommended version). I attach the documents and wiki pages [2]. Please take a look and let us know when you have a plan.
Thank you and have a nice day,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMS/IAMTokens
[2]
https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheXRootD
https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDAndTokens
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-03-08 11:05:07

Status: in progress
Responsible Unit: NGI_TR
Public Diary:
Hi,
I was waiting for the data center migration to be completed before doing this, but since it will take time, let me deal with it within the next week. We are using dCache 9.2.8, so I hope there will be no problem. I will keep you updated once I have enabled IAM tokens.

Best,
Selcuk
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Stephan Lammel
Date: 2024-04-03 19:18:47

Public Diary:
Hallo Selcuk,
how is the upgrade to 9.2 going?
Thanks,
- Stephan
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-04-04 05:10:59

Public Diary:
Hi Stephan,
Due to the cooling issues in the old data center we had to migrate the servers to the new data center sooner than planned. We are working to complete the migration.

We are already on 9.2. As soon as the migration has been completed, we will enable the token support.

Thanks,
Selcuk
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Stephan Lammel
Date: 2024-04-04 13:13:01

Public Diary:
Sounds good. Thanks for the update Selcuk! - Stephan
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Chan-anun Rungphitakchai
Date: 2024-05-30 19:46:05

Public Diary:
Do you have any update for token support -- Noy
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-06-03 09:52:41

Public Diary:
Hi,
Sorry for the late response. I had issues logging in to GGUS with my certificate for a while.
Our disk servers were down for a long time due to the migration. We completed the migration on Friday and will start working on token support today. I will let you know if we have any problems.
Thanks,
Selcuk
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-06-04 14:31:08

Public Diary:
Hi,
Today I tried to enable token access, unfortunately without success. The new configuration also broke the other tests. Is there a way to copy a file from lxplus with a token so that I can trace the logs more easily?

---------------------
## layout file
# Gplazma section:
[${host.name}_gplazmaDomain]
[${host.name}_gplazmaDomain/gplazma]
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any davs://eymir.grid.metu.edu.tr davs://eymir.grid.metu.edu.tr:443 eymir.grid.metu.edu.tr root://eymir.grid.metu.edu.tr root://eymir.grid.metu.edu.tr:11001 root://eymir.grid.metu.edu.tr:1094
gplazma.oidc.provider!cms=https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="group:cms gid:3000" -prefix=/dpm/grid.metu.edu.tr/home/cms -suppress=audience
-------------------------------------
## gplazma.conf
auth optional x509
auth optional voms
auth optional oidc

map optional multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.group
map optional vogroup vo-group-path=/etc/dcache/vo-group.json
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.user
map optional vogroup vo-group-path=/etc/dcache/vo-user.json
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.vo
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.unmapped
map sufficient multimap gplazma.multimap.file=/etc/dcache/multimap-prod-user.conf

account requisite banfile
session requisite roles
session sufficient omnisession
----------------------------------------------------------------------------------------------
# /etc/dcache/multimap-prod-user.conf
/etc/dcache/multimap-prod-user.conf

---------------------------------------------------------------------------------------------

The error
Read token retrieval check failed: Could not retrieve token for davs://eymir.grid.metu.edu.tr:443//dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root [last failed attempt: Macaroon request failed: HTTP 401 : Authentication Error ]

Is there any example config files from sites?
Thanks,
Selcuk
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Christoph Wissing
Date: 2024-06-04 15:58:43

Public Diary:
Hello,
a less complex configuration for dCache to run with tokens is described here:

https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup

However, the conversion procedure from DPM to dCache creates an unnecessarily complex setup, at least for CMS.

Cheers, Christoph
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Stephan Lammel
Date: 2024-06-04 15:44:14

Public Diary:
Hallo Selcuk,
it looks like you used the DPM converter script. That makes
a very complex config. A few sites run into issues configuring
token access.
Noy, can you please point Selcuk to the ticket that has all
the config files attached? (I don't recall which site it was
and can't find the ticket anymore.)
You can trigger new SAM tests as needed on the ETF page.
The ticket should also have the description for this. Otherwise
Noy can provide you off ticket with a client id/secret to
acquire a token, "Acquiring an IAM-issued Token for Testing" on
https://twiki.cern.ch/twiki/bin/view/CMS/IAMTokens
Thanks,
cheers, Stephan
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Chan-anun Rungphitakchai
Date: 2024-06-04 16:14:26

Public Diary:
Hello Selcuk,
These are tickets from the IPM site [1] and the Budapest site [2], with all the configuration needed to enable tokens on a dCache server.
Thank you,
Noy
[1]
https://ggus.eu/?mode=ticket_info&ticket_id=164275
[2]
https://ggus.eu/index.php?mode=ticket_info&ticket_id=164207
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Stephan Lammel
Date: 2024-07-23 20:23:12

Public Diary:
Hallo Selcuk,
did you have a chance to compare your configuration
with that of IPM or Budapest?
Thanks,
- Stephan
Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-07-30 09:22:56

Public Diary:
Hi,
It seems that the error I get is related to omnisession.conf, but I could not figure out the reason yet. Comparing the configurations did not help on the first try. I will look into the details this week.

--omnisession SUFFICIENT:OK => OK (ends the phase)
Tem 30 12:01:50 eymir.grid.metu.edu.tr dcache@centralDomain[29282]: |
Tem 30 12:01:50 eymir.grid.metu.edu.tr dcache@centralDomain[29282]: +--VALIDATION FAIL (multiple GIDs)

Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-08-02 10:02:13

Public Diary:

Internal Diary:
Added attachment eymir_.log
https://ggus.eu/index.php?mode=download&attid=ATT119459
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-08-02 10:02:14

Public Diary:
Hi,
Yesterday we managed to solve the token issue for the ATLAS experiment; now we are trying to solve the CMS issue, but the configs seem different.

etf-cms-prod.cern.ch seems to be down. How can I send test files so that I can follow the WebDAV logs with tokens?

Below are my config files.

---------------------
layout.conf file contains

[centralDomain/gplazma]
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any https://eymir.grid.metu.edu.tr roots://eymir.grid.metu.edu.tr:1094 eymir.grid.metu.edu.tr
gplazma.oidc.provider!cms=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="group:cms gid:3000 username:cms group:writer" -prefix=/dpm/grid.metu.edu.tr/home/cms/ -suppress=audience

----------------

## gplazma.conf
auth optional x509
auth optional voms
auth optional oidc

map optional multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.group
map optional vogroup vo-group-path=/etc/dcache/vo-group.json
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.user
map optional vogroup vo-group-path=/etc/dcache/vo-user.json
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.vo
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.unmapped
map sufficient multimap gplazma.multimap.file=/etc/dcache/multimap-prod-user.conf

account requisite banfile
session requisite roles
session sufficient omnisession

## /etc/dcache/multi-mapfile.user is empty for me.

## multi-mapfile.group contains
fqan:/cms gid:3000 group:writer



# newly created file multimap-prod-user.conf contains
group:cms username:cms uid:3000

## omnisession.conf

group:writer root:/ home:/
group:cms root:/ home:/

------------
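The multimap files above can be illustrated with a toy lookup. This is an assumption about how dCache's multimap plugin matches principals, not its actual implementation:

```python
# Toy multimap lookup: each non-comment line pairs a predicate principal
# with the principals to add when it matches a login.

def parse_multimap(lines):
    rules = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            parts = line.split()
            rules.append((parts[0], parts[1:]))
    return rules

def map_principals(rules, login_principals):
    """Collect every principal added by a rule whose predicate matched."""
    added = []
    for predicate, adds in rules:
        if predicate in login_principals:
            added.extend(adds)
    return added

# multimap-prod-user.conf from the message above:
rules = parse_multimap(["group:cms username:cms uid:3000"])
print(map_principals(rules, {"group:cms"}))  # ['username:cms', 'uid:3000']
```

Chains like this can also explain the earlier "VALIDATION FAIL (multiple GIDs)": if several map rules each add a gid, the login ends up with more than one without a primary being marked.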

For the error log please see the attached file.

Thanks,

Internal Diary:
Added attachment eymir_.log
https://ggus.eu/index.php?mode=download&attid=ATT119459
GGUS ID: 165519
Last modifier: Stephan Lammel
Date: 2024-08-02 14:28:49

Public Diary:
Hallo Selcuk,
CMS ETF production is up. I don't see any interruption during the
last 24 hours. Do you maybe have a routing issue to the general,
i.e. non-LHCONE, subnets of CERN?
For CMS, you only need the WLCG any and hostname as audience,
i.e.
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any eymir.grid.metu.edu.tr
will do. You do need both IAM issuers, i.e. https://cms-auth.web.cern.ch/
and https://cms-auth.cern.ch/ for the time being.
Thanks,
cheers, Stephan
Internal Diary:
Added attachment eymir_.log
https://ggus.eu/index.php?mode=download&attid=ATT119459
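The audience requirement Stephan describes (gplazma.oidc.audience-targets) amounts to checking the token's aud claim against a configured set; a minimal sketch, assuming standard RFC 7519 aud semantics and not dCache's actual code:

```python
# Sketch of the audience check implied by gplazma.oidc.audience-targets:
# the token's "aud" claim (string or list) must intersect the targets.

ACCEPTED_AUDIENCES = {"https://wlcg.cern.ch/jwt/v1/any", "eymir.grid.metu.edu.tr"}

def audience_ok(token_aud, accepted=ACCEPTED_AUDIENCES):
    auds = [token_aud] if isinstance(token_aud, str) else list(token_aud)
    return any(a in accepted for a in auds)

print(audience_ok("https://wlcg.cern.ch/jwt/v1/any"))  # True
print(audience_ok("https://eymir.grid.metu.edu.tr"))   # False: the scheme-prefixed
                                                       # form is not in the target list
```

This is why the shorter target list Stephan suggests is sufficient: CMS tokens carry either the WLCG "any" audience or the bare hostname.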
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-08-06 14:12:45

Public Diary:
Internal Diary:
Added attachment eymir.log
https://ggus.eu/index.php?mode=download&attid=ATT119467
GGUS ID: 165519
Last modifier: Stephan Lammel
Date: 2024-08-06 14:31:04

Public Diary:
Hallo Selcuk,
Christoph talks about only multimap-prod-user.conf in the twiki.
I don't know if any of the other mapping entries you have are
problematic. CMS asks that all tokens issued by the CMS IAM be
mapped to the local account that can read/write anywhere in /store.
The tokens have no fqan key but the issuer being https://cms-auth.web.cern.ch/
or https://cms-auth.cern.ch/ identifies CMS.
I am not a dCache expert. Let me add Christoph; maybe he can spot or
knows where the problem is.
Thanks,
cheers, Stephan
Internal Diary:
Added attachment eymir.log
https://ggus.eu/index.php?mode=download&attid=ATT119467
GGUS ID: 165519
Last modifier: Selcuk Bilmis
Date: 2024-08-06 14:13:06

Public Diary:
Hi,

There is still an issue related to gplazma. Below are the configs and the error log I get. The user cannot be mapped to cms? Do jobs with a token have fqan:/cms?

layout.conf
gplazma.oidc.audience-targets = https://wlcg.cern.ch/jwt/v1/any https://eymir.grid.metu.edu.tr roots://eymir.grid.metu.edu.tr:1094 eymir.grid.metu.edu.tr
gplazma.oidc.provider!cms = https://cms-auth.cern.ch/ https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="group:cms gid:3000 username:cms" -prefix=/dpm/grid.metu.edu.tr/home/cms/ -suppress=audience

gplazma.conf

auth optional x509
auth optional voms
auth optional oidc

map sufficient multimap gplazma.multimap.file=/etc/dcache/multimap-prod-user.conf
map optional multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.group
map optional vogroup vo-group-path=/etc/dcache/vo-group.json
map optional vogroup vo-group-path=/etc/dcache/vo-user.json
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.vo
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.unmapped

omnisession.conf
group:writer root:/ home:/
group:cms root:/ home:/

multimap-prod-user.conf
group:cms username:cms uid:3000


./multi-mapfile.group:
fqan:/cms gid:3000 group:writer
./multi-mapfile.vo:
username:cms uid:3000

However, in the logs I get the following error for the token. The user cannot be mapped to cms? Do jobs with a token have fqan:/cms?
The error log is attached.

Internal Diary:
Added attachment eymir.log
https://ggus.eu/index.php?mode=download&attid=ATT119467
Added attachment eymir.log
Added attachment eymir_.log
Any update -- Thank you, Noy
Unfortunately not. I tried the other settings but it did not work out. Let me try one more time next week and keep you updated. (This week is a holiday here.)
Hello Selcuk. Your XRootD doesn't support token reads yet, but WebDAV works perfectly. Could you please provide an update on that?
Cheers,
Noy
[1]
[crungphi@lxplus802 ~]$ curl -s -d 'client_id=[...]' -d 'client_secret=[...]' -d 'grant_type=client_credentials' -d 'scope=storage.read:/store' -d 'audience=eymir.grid.metu.edu.tr' https://cms-auth.cern.ch/token | jq -M -r .access_token > rwtest.tkn

[crungphi@lxplus802 ~]$ BEARER_TOKEN=`cat rwtest.tkn`;export BEARER_TOKEN
[crungphi@lxplus802 ~]$ xrdcp -f root://eymir.grid.metu.edu.tr:1094//dpm/grid.metu.edu.tr/home/cms//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root /dev/null
[0B/0B][100%][==================================================][0B/s]
Run: [FATAL] Auth failed: No protocols left to try (source)
Hi,
Sorry, I was overwhelmed with other projects; now I am back to grid operations and was actually looking at this.
Is this the information I need to check? Is there an example site config available?
https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDAndTokens
https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup
Could you also provide a test script that I can run on lxplus? How do I get a token first so that I can try xrdcp -f
root://eymir.grid.metu.edu.tr:1094//dpm/grid.metu.edu.tr/home/cms//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root
/dev/null

Thanks.
Hi,
I managed to run this via token. It seems ok now.
xrdcp -d 2 -f root://eymir.grid.metu.edu.tr:1094//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root ./

https://monit-grafana.cern.ch/goto/Upk8vslHg?orgId=20
The following are still waiting to be resolved:
15-tkn contain
5-crt contain (is this again related to the CA issue we mentioned?)
9-federation
Hi,
I have 2 questions.
1) What is the difference between eymir and eymir-redir?
Previously I had configured xrootd from dCache and a separate xrootd-clustered.cfg.

ss -ltnp | egrep ':1094|:11001'
LISTEN 0 128 [::]:11001 [::]:* users:(("xrootd",pid=13122,fd=20))
LISTEN 0 128 [::]:1094 [::]:* users:(("java",pid=14559,fd=895))
Now I stopped xrootd-clustered.cfg. It seems the eymir tests are not affected.
ss -ltnp | egrep ':1094|:11001'
LISTEN 0 128 [::]:1094 [::]:* users:(("java",pid=37364,fd=895))
Do we need both setups, or should I just follow the tests for eymir?

These commands work from lxplus for instance:
xrdfs eymir.grid.metu.edu.tr:1094 ls /store | head
xrdcp -f root://eymir.grid.metu.edu.tr:1094//dpm/grid.metu.edu.tr/home/cms//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root /dev/null
2) https://monit-grafana.cern.ch/goto/40ldJzuNg?orgId=20
What could be the reason I am getting UNKNOWN?

Thanks,

---------------------------------------
In dcache layout.conf
[xrootd-${host.name}Domain]
[xrootd-${host.name}Domain/xrootd]
xrootd.plugins=gplazma:ztn,gplazma:gsi,authz:cms-tfc,authz:scitokens
xrootd.cms.tfc.path=/etc/xrootd/storage.xml
xrootd.cms.tfc.protocol=directxr
xrootd.security.tls.mode=STRICT
-----------------------
and storage.xml
cat storage.xml | grep -v '^#'
<storage-mapping>

<!-- General direct access for the door (protocol used by dCache) -->
<lfn-to-pfn protocol="directxr" destination-match=".*"
path-match="/+store/test/xrootd/T2_TR_METU/store/(.*)"
result="/dpm/grid.metu.edu.tr/home/cms/store/$1"/>

<lfn-to-pfn protocol="directxr" path-match="/+store/(.*)"
result="/dpm/grid.metu.edu.tr/home/cms/store/$1"/>

<lfn-to-lfn protocol="directxr" path-match="/+(.*)"
result="/dpm/grid.metu.edu.tr/home/cms/$1"/>

<pfn-to-lfn protocol="directxr"
path-match=".*/+dpm/grid.metu.edu.tr/home/cms/store/(.*)"
result="/store/$1"/>

<pfn-to-lfn protocol="directxr"
path-match=".*/+dpm/grid.metu.edu.tr/home/cms/(.*)"
result="/$1"/>

<lfn-to-pfn protocol="srm" chain="directxr" path-match="(.*)"
result="srm://eymir.grid.metu.edu.tr:8443/srm/managerv1?SFN=$1"/>
<pfn-to-lfn protocol="srm" chain="directxr" path-match=".*\?SFN=(.*)" result="$1"/>
<lfn-to-pfn protocol="srmv2" chain="directxr" path-match="(.*)"
result="srm://eymir.grid.metu.edu.tr:8446/srm/managerv2?SFN=$1"/>
<pfn-to-lfn protocol="srmv2" chain="directxr" path-match=".*\?SFN=(.*)" result="$1"/>

<!-- Federation fallback (not used by the door, but harmless if present) -->
<lfn-to-pfn protocol="remote-xrootd" destination-match=".*"
path-match="/+store/(.*)"
result="root://xrootd-cms.infn.it//store/$1"/>

</storage-mapping>
---------------------------
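The lfn-to-pfn rules above can be approximated with ordinary regexes. The real translation is done by libXrdCmsTfc; this sketch assumes path-match anchors the whole path:

```python
import re

# (path-match, result) pairs for protocol "directxr", in the order they
# appear in storage.xml; result uses \1 for the captured group.
RULES = [
    (r"/+store/test/xrootd/T2_TR_METU/store/(.*)",
     r"/dpm/grid.metu.edu.tr/home/cms/store/\1"),
    (r"/+store/(.*)", r"/dpm/grid.metu.edu.tr/home/cms/store/\1"),
    (r"/+(.*)", r"/dpm/grid.metu.edu.tr/home/cms/\1"),
]

def lfn_to_pfn(lfn):
    """Translate a CMS logical file name to the DPM-era physical path."""
    for pattern, result in RULES:
        m = re.fullmatch(pattern, lfn)
        if m:
            return m.expand(result)
    return None

print(lfn_to_pfn("/store/mc/SAM/file.root"))
# -> /dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/file.root
```

The rule order matters: the more specific loadtest-style prefix is tried before the generic /store and catch-all rules, just as in the XML above.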
CMS #683227 (id:3362) XRootd service at T2_TR_METU
State: assigned  |  Priority: less urgent  |  Opened: 2025-05-07 16:40 (332d ago)  |  Updated: 2025-08-12 15:29
Conversation (11 messages)
Good afternoon, METU admins
We want to find out the status of your XRootD federation subscription. Your native XRootD service on port 11001 seems to be running 5.6.9. We kindly request all CMS sites using native XRootD to upgrade their XRootD services to the latest version (5.7.3).

We would also like to take this opportunity to encourage sites to enable network packet labelling by adding the following four configuration lines to both XRootD and redirectors.
xrootd.pmark ffdest eu.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path-to-store>/store cms
xrootd.pmark map2act cms default default

If your site supports multiple VOs on the same XRootD service or requires additional details, please refer to the SciTag Network Packet Labeling Twiki page: https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesXrootdScitagPacketLabeling.

We also notice that access to METU data via the site redirector does not work but fails with an authorization error:
[2025-05-07 13:14:13.165604 +0000][Error ][AsyncSock ] [eymir.grid.metu.edu.tr:11001 #0.0] Socket error while handshaking: [FATAL] Auth failed
This suggests your worker nodes don't have the root CA certificate. Could you please take a look and update? (If you are using OSG 3.6, you probably want to switch to OSG 23, as 3.6 is end of life and no longer updated.)
You are running two XRootD services on eymir.grid.metu.edu.tr, ports 1094 and 11001. That is not supported by SAM. The way around this is to make a DNS alias and refer to one service via the alias, for instance xrootd-redirector.grid.metu.edu.tr, and then use it for the port 11001 service in the config file.
Cheers,
Stephan and Noy
Hi,
We are still on CentOS 7 on the disk nodes. From the XRootD webpage (https://xrootd.org/2024/07/01/announcement_5_7_0.html): "Support for CentOS 7 is now deprecated".
Is it still OK to update XRootD, or should we upgrade the disk nodes to EL9 first?
When I try the following commands, both work:
xrdcp -d 1 root://eymir.grid.metu.edu.tr:1094//dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root ./
gfal-copy -v root://eymir.grid.metu.edu.tr:1094//dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root ./
Can you tell me the way to test this?
Certificates had been renewed on the head and disk nodes. Since jobs had failed, I temporarily reverted the head node to the old certificate. I still need to verify whether this is actually the cause of the problem.

For the redirector, I remember we previously had issues when trying to configure XRootD, and we had defined an alias:
host eymir-redir.grid.metu.edu.tr
eymir-redir.grid.metu.edu.tr is an alias for
eymir.grid.metu.edu.tr.
eymir.grid.metu.edu.tr has address 161.9.255.60
eymir.grid.metu.edu.tr has IPv6 address 2001:a98:1e:ff::60
Also xrootd config
-------
xrd.port 11001
all.role server
all.manager any xrootd-cms.infn.it+ 1213
all.sitename T2_TR_METU
xrootd.redirect eymir.grid.metu.edu.tr:1094 /
all.export / nostage
cms.allow host *
xrootd.trace emsg login stall redirect
ofs.trace none
xrd.trace conn
cms.trace all
cms.space linger 0 recalc 30 min 2% 1g 5% 2g
oss.namelib /usr/lib64/libXrdCmsTfc.so file:/etc/xrootd/storage.xml?protocol=direct
ofs.authorize 1
acc.authdb /etc/xrootd/Authfile
sec.protocol /usr/lib64 gsi -d:1 -crl:0 -vomsfun:default -vomsat:extract -crl:0 -d:1
oss.localroot /dpm/grid.metu.edu.tr/home/cms/
xrootd.seclib /usr/lib64/libXrdSec.so

all.adminpath /var/run/xrootd
all.pidpath /var/run/xrootd
cms.delay startup 10
cms.fxhold 60s
xrd.report xrootd.t2.ucsd.edu:9931 every 60s all sync

xrd.network assumev4
xrootd.limit noerror prepare 20
------
Can you explain more about this issue?
Thanks,
Selcuk
Hallo Selcuk,
thanks for taking a look and your reply.
I would recommend upgrading to EL9 with a recent XRootD version (>= 5.7.3).
But you are intertwined with dCache. I believe you are on dCache 9.2, so you can
probably upgrade the OS on the door node while staying with the same dCache version.
Noy will guide you on reproducing the failure on the command line.
I have updated the site-redirector list with eymir-redir.grid.metu.edu.tr:11001
instead of eymir.grid.metu.edu.tr:1094. The service should get proper testing
after the next ETF reload during the night.
Thanks,
cheers, Stephan
Any update? -- Thanks, Noy
Hi,

The plan is to first upgrade the disk nodes to EL9. If everything works as expected, we will proceed with upgrading the head node. The new servers are already installed and configured, but we are waiting for ongoing hardware operations in the data center racks to be completed. Hopefully, next week we can start by upgrading the disk nodes, and once their operation is verified, we will proceed with upgrading the head node and XRootD.
In the meantime, I am still investigating the following errors:
https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dmanyas2.grid.metu.edu.tr%26service%3Dorg.cms.WN-analysis-%252Fcms%252FRole%253Dlcgadmin%26site%3Detf%26view_name%3Dservice

Failed to open the file 'root://eymir.grid.metu.edu.tr//dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/28B9D1FB-8B31-E711-AA4E-0025905B85B2.root'

Additional Info:
[a] Input file could not be opened.
[b] XrdCl::File::Open(...) => error '[FATAL] Auth failed' (errno=0, code=204)

However, the same file can be accessed without issues:

From LXPLUS:

xrdcp -d 1 root://eymir.grid.metu.edu.tr:1094//dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/28B9D1FB-8B31-E711-AA4E-0025905B85B2.root ./

From the node where the error occurs (mogan)

Could this be related to version differences?
Thanks,
Hallo Selcuk,
I see two issues looking at SAM.
The old worker node test, WN-xrootd-access, shows a certificate issue:

10-Aug-2025 10:26:17 UTC Initiating request to open file
root://cms-xrd-global.cern.ch//store/test/xrootd/T2_TR_METU//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/28B9D1FB-8B31-E711-AA4E-0025905B85B2.root
250810 10:26:20 267 secgsi_VerifyCA: CA certificate self-signed: integrity check failed (3ca6d4a0.0)
250810 10:26:20 267 secgsi_VerifyCA: CA certificate self-signed: integrity check failed (3ca6d4a0.0)
[2025-08-10 10:26:20.266198 +0000][Error ][XRootDTransport ] [eymir.grid.metu.edu.tr:11001 #0.0] No protocols left to try

https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.WN-xrootd-access-/cms/Role=lcgadmin&var-dst_hostname=manyas.grid.metu.edu.tr&var-timestamp=1754822388000

while the new probe, WN-25dataaccess, shows a path issue:

data-access: trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T2_TR_METU/PhEDEx/storage.xml?protocol=direct
[E] Error setting up CMSSW and executing cmsRun:
[e] ['/bin/sh', '-c', 'eval scram runtime -sh; cmsRun swnda_cfg.py']
[e] 10-Aug-2025 12:48:51 +03 Initiating request to open file /dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/AE237916-5D76-E711-A48C-FA163EEEBFED.root
[e] ----- Begin Fatal Exception 10-Aug-2025 12:48:51 +03-----------------------
[e] An exception of category 'FileOpenError' occurred while
[e] [0] Constructing the EventProcessor
[e] [1] Constructing input source of type PoolSource
[e] [2] Calling RootInputFileSequence::initTheFile()
[e] [3] Calling StorageFactory::open()
[e] [4] Calling File::sysopen()
[e] Exception Message:
[e] Failed to open the file '///dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/AE237916-5D76-E711-A48C-FA163EEEBFED.root'
[e] Additional Info:
[e] [a] Input file /dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/AE237916-5D76-E711-A48C-FA163EEEBFED.root could not be opened.
[e] [b] open() failed with system error 'No such file or directory' (error code 2)

https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.WN-25dataaccess-/cms-ce-token&var-dst_hostname=manyas.grid.metu.edu.tr&var-timestamp=1754820127000

Hope this helps. Thanks,
- Stephan
Hi,
Thanks for pointing out the error messages.
I wonder why I can't reproduce the error.
When I run the following both from lxplus and the worker node (mogan), it works fine. Is the federation host using a different certificate?
We have renewed the certificates; however, if the error were due to this, I should not be able to copy the file at all. That makes me suspect the XRootD config.
xrdcp -d 1 root://eymir.grid.metu.edu.tr:1094//dpm/grid.metu.edu.tr/home/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/28B9D1FB-8B31-E711-AA4E-0025905B85B2.root ./

Also for the path issue:
Failed to open the file '///dpm/grid.metu.edu.tr/home/cms/store/...
There is a triple slash; is that related to the config in storage.xml?
Hallo Selcuk,
taking a closer look, I suspect the certificate error is due to the way the old
SAM tests start up Singularity. The root CA of your certs was probably
updated. Since we already switched to the new worker-node probes, let's
ignore this.
For WN-25dataaccess you need to decide whether to make POSIX access
available on the worker nodes; if not, remove the entry from your
JobConfig/site-local-config.xml, i.e.

https://gitlab.cern.ch/SITECONF/T2_TR_METU/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L6

Thanks,
cheers, Stephan
For POSIX, single or triple slashes are both correct. - Stephan
Hi,
I have modified the configuration as follows — disabling POSIX access for now — and am waiting for the configuration changes to take effect:

<catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T2_TR_METU/PhEDEx/storage.xml?protocol=remote-xrootd"/>
<catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T2_TR_METU/PhEDEx/storage.xml?protocol=xrootd"/>
<!-- <catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T2_TR_METU/PhEDEx/storage.xml?protocol=direct"/> -->

I have been occupied with other projects, and during the summer I’ve returned to grid operations. I may have missed some information regarding the new tests and need your guidance.

https://monit-grafana.cern.ch/goto/KlOiX1_Ng?orgId=20

From what I understand, we can ignore the critical ones related to CA certificates — could you please confirm?

CVMFS issue:
While installing the worker nodes, we configured 40 GB partitions for /var/cache/cvmfs2, which is causing a warning. Would changing

CVMFS_QUOTA_LIMIT=35000

be a valid solution to suppress the warning, or should we instead use /root with larger space for CVMFS mounts?

Disk-node question:

https://monit-grafana.cern.ch/goto/M78Gj1_HR?orgId=20

I was informed that hardware operations will be completed by October, and was advised to upgrade the nodes afterward if it is not urgent. Is this approach feasible? Are any of the current tests we are seeing specific to CentOS 7? Based on your answer, we can plan the upgrade schedule accordingly.

I’m also unsure why we are getting the reported errors — could you please guide me on that as well?

Thanks,
Hallo Selcuk,
yes, you can now ignore the old SAM worker-node tests. We have the new
probes documented at
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSAMTests
Regarding CVMFS: the catalogues are not counted against the quota but
go into the cache. For this reason, the recommendation is to have cache
space 20% larger than the quota, plus 1 GB. So for a quota of 35 GB you
would need 35 GB * 1.20 + 1 GB = 43 GB (and a 40 GB cache can allow a
quota of 32.5 GB).
https://twiki.cern.ch/twiki/bin/view/CMSPublic/FacilitiesServicesDocumentation
The amount of CVMFS cache a node should have depends on the core count.
For 56 cores our recommendation is about 100 GB.
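The cache-sizing rule quoted above (cache at least 20% larger than the quota, plus 1 GB) can be sketched as a quick calculation; the numbers are the ones from the message, the script itself is illustrative:

```shell
# Sketch of the CVMFS cache-sizing rule: cache >= quota * 1.20 + 1 GB.
need_cache=$(awk -v q=35 'BEGIN { printf "%.1f", q * 1.20 + 1 }')
echo "quota 35 GB -> cache ${need_cache} GB"      # 43.0

# Inverse: the largest quota a given cache size supports.
max_quota=$(awk -v c=40 'BEGIN { printf "%.1f", (c - 1) / 1.20 }')
echo "cache 40 GB -> max quota ${max_quota} GB"   # 32.5
```

So a 40 GB /var/cache/cvmfs2 partition is consistent with a quota of about 32.5 GB, matching the numbers in the reply.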

Regarding storage: we/CMS don't care that much about OS versions. Newer
XRootD or dCache versions will not be available for CentOS 7, but if you have
a working setup (and apart from IAM token access via xrootd you do) and
EGI and the university are OK with still running CentOS 7, feel free to upgrade
on your own schedule.

With the reported errors, are you talking about SE-XRootD-14tkn-read, FTS, or
the worker-node tests after your update? The reason for the latter is:

Found an "lfn-to-pfn" tag for protocol "davs" in /cvmfs/cms.cern.ch/SITECONF/T2_TR_METU/PhEDEx/storage.xml
[E] No "lfn-to-pfn" tag for referenced protocol "remote-xrootd" in /cvmfs/cms.cern.ch/SITECONF/T2_TR_METU/PhEDEx/storage.xml

I only see "local-xrootd" and "xrootd" defined in your PhEDEx/storage.xml.

Thanks,
cheers, Stephan
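The mismatch described above (site-local-config.xml referencing a protocol that storage.xml does not define) can be cross-checked with a simple grep. The storage.xml fragment below is a hypothetical illustration in the CMS trivial-file-catalog format, not METU's actual file:

```shell
# Hypothetical TFC fragment: only "xrootd" and "local-xrootd" are defined,
# mirroring the situation described in the ticket.
cat > /tmp/storage.xml <<'EOF'
<storage-mapping>
  <lfn-to-pfn protocol="xrootd" path-match="/+store/(.*)"
              result="root://eymir.grid.metu.edu.tr:1094//store/$1"/>
  <lfn-to-pfn protocol="local-xrootd" path-match="/+store/(.*)"
              result="root://eymir.grid.metu.edu.tr:1094//store/$1"/>
</storage-mapping>
EOF

# Check whether the protocol referenced by site-local-config.xml exists:
referenced="remote-xrootd"
if grep -q "protocol=\"$referenced\"" /tmp/storage.xml; then
  msg="protocol $referenced is defined"
else
  msg="protocol $referenced is NOT defined in storage.xml"
fi
echo "$msg"
```

This reproduces the probe's complaint: the referenced "remote-xrootd" protocol has no lfn-to-pfn rule.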
WLCG tickets (1)
WLCG #681691 (id:1792) Access to the TR-03-METU storage element for VO enmr.eu
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:06 (430d ago)  |  Updated: 2025-02-03 10:50
Conversation (3 messages)
GGUS ID: 167699
Last modifier: Andrei Tsaregorodtsev
Date: 2024-07-29 12:12:15
Subject: Access to the TR-03-METU storage element for VO enmr.eu
Ticket Type: USER
CC: A.M.J.J.Bonvin@uu.nl;
Status: assigned
Responsible Unit: NGI_TR
Issue type: Data Management - generic
Description:
Hello,
I would like to access the storage at the TR-03-METU site (eymir.grid.metu.edu.tr) using my enmr.eu X.509 proxy certificate:

$ voms-proxy-info --all
subject : /DC=org/DC=terena/DC=tcs/C=FR/O=Centre national de la recherche scientifique/CN=TSAREGORODTSEV Andrei andrei.tsaregorodtsev@cnrs.fr/CN=7852451638/CN=1068429803
issuer : /DC=org/DC=terena/DC=tcs/C=FR/O=Centre national de la recherche scientifique/CN=TSAREGORODTSEV Andrei andrei.tsaregorodtsev@cnrs.fr/CN=7852451638
identity : /DC=org/DC=terena/DC=tcs/C=FR/O=Centre national de la recherche scientifique/CN=TSAREGORODTSEV Andrei andrei.tsaregorodtsev@cnrs.fr/CN=7852451638
type : RFC compliant proxy
strength : 2048 bits
path : /tmp/x509up_u1885
timeleft : 23:51:27
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO enmr.eu extension information ===
VO : enmr.eu
subject : /DC=org/DC=terena/DC=tcs/C=FR/O=Centre national de la recherche scientifique/CN=TSAREGORODTSEV Andrei andrei.tsaregorodtsev@cnrs.fr
issuer : /DC=org/DC=terena/DC=tcs/C=IT/ST=Roma/O=Istituto Nazionale di Fisica Nucleare/CN=voms2.cnaf.infn.it
attribute : /enmr.eu/Role=NULL/Capability=NULL
attribute : /enmr.eu/amber/Role=NULL/Capability=NULL
attribute : /enmr.eu/csrosetta/Role=NULL/Capability=NULL
attribute : /enmr.eu/dirac/Role=NULL/Capability=NULL
attribute : /enmr.eu/disvis/Role=NULL/Capability=NULL
attribute : /enmr.eu/gromacs/Role=NULL/Capability=NULL
attribute : /enmr.eu/haddock/Role=NULL/Capability=NULL
attribute : /enmr.eu/powerfit/Role=NULL/Capability=NULL
timeleft : 23:51:27
uri : voms2.cnaf.infn.it:15014

I am getting the following error while trying to access with a gfal2 library API:

Failed to copy mandelbrot.jdl to srm://eymir.grid.metu.edu.tr:8446/srm/managerv2?SFN=/dpm/grid.metu.edu.tr/home/ops/enmr.eu/user/a/atsareg/mandelbrot.jdl: GError('[gfalt_copy_file][perform_copy][srm_plugin_filecopy][srm_resolve_turls][srm_resolve_put_turl] DESTINATION OVERWRITE [srm_plugin_prepare_dest_put][srm_plugin_delete_existing_copy][gfal_srm_unlinkG][gfal_srm_report_error] srm-ifce err: Communication error on send, err: [SE][srmRm][] httpg://eymir.grid.metu.edu.tr:8446/srm/managerv2: HTTP Error\n', 70)

Similar call with the gfal-copy :

$ gfal-copy mandelbrot.jdl srm://eymir.grid.metu.edu.tr:8446/srm/managerv2?SFN=/dpm/grid.metu.edu.tr/home/ops/enmr.eu/user/a/atsareg/mandelbrot.jdl
Copying file:///sps/lhcb/atsareg/test/mandelbrot.jdl [FAILED] after 0s
gfal-copy error: 70 (Communication error on send) - DESTINATION SRM_PUT_TURL srm-ifce err: Communication error on send, err: [SE][PrepareToPut][] httpg://eymir.grid.metu.edu.tr:8446/srm/managerv2: HTTP Error

According to the OLA agreements the site should support the VO enmr.eu. Can you please check what the problem is and how I can debug the issue?

Best regards,
Andrei

Affected ROC/NGI: NGI_TR
Affected Site: TR-03-METU
GGUS ID: 167699
Last modifier: Selcuk Bilmis
Date: 2024-10-01 12:07:48
Changed CC to A.M.J.J.Bonvin@uu.nl;

Public Diary:
Hi,
This ticket is actually the same as https://ggus.eu/index.php?mode=ticket_info&ticket_id=167700. We can close this one and continue with the other.

Thanks,
Selcuk
Internal Diary:
Added attachment pilot-output.txt
https://ggus.eu/index.php?mode=download&attid=ATT119862
GGUS ID: 167699
Last modifier: Selcuk Bilmis
Date: 2025-01-28 09:08:02

Public Diary:
Hi,

If there is no request anymore, can we close this ticket?

Best,

Selcuk

Internal Diary:
Added attachment pilot-output.txt
https://ggus.eu/index.php?mode=download&attid=ATT119862
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM9%50%78%87%60%12%8%6%10%66%89%97%89%48%73%56%
HammerCloud99%99%100%100%100%100%100%100%100%100%98%100%100%99%100%99%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (6)

CMS tickets (5)
CMS #682037 (id:2143) File transfer failure to T2_TW_NCHC
State: in progress  |  Priority: urgent  |  Opened: 2025-02-04 20:29 (423d ago)  |  Updated: 2026-03-31 19:38
Conversation (12 messages)
Hello NCHC admins,
Since Saturday (1 Feb) there have been several file-transfer failures from other sites to your storage endpoint [1]. I found an "HTTP: 500" error code when the system tried to pull data from your site [2]. Could you please take a look and check your storage services, server, and availability?
Cheers,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_TW_NCHC
[2]
https://fts-cms-009.cern.ch:8449/var/log/fts3/transfers/2025-02-03/cmsdcache-kit-disk.gridka.de__se01.grid.nchc.org.tw/2025-02-03-0016__cmsdcache-kit-disk.gridka.de__se01.grid.nchc.org.tw__4657280317__c4189196-e1a0-11ef-af11-fa163e1db714
https://fts-cms-009.cern.ch:8449/var/log/fts3/transfers/2025-02-03/cmsio2.rc.ufl.edu__se01.grid.nchc.org.tw/2025-02-03-0109__cmsio2.rc.ufl.edu__se01.grid.nchc.org.tw__4657508401__224ad0e8-e1c5-11ef-bc6e-fa163e4409e3
Hello NCHC admins. Since last week, your site has had a lot of transfer failures. The log file shows an "HTTP: 401" error [1]. Could you please take a look and check the service/configuration of your storage area?
Cheers,
Noy
[1]
https://fts-cms-002.cern.ch:8449/var/log/fts3/transfers/2025-02-24/cmsdcache-kit-disk.gridka.de__se01.grid.nchc.org.tw/2025-02-24-1904__cmsdcache-kit-disk.gridka.de__se01.grid.nchc.org.tw__4691726862__21d27106-f2e2-11ef-abc0-fa163e172462
Hello NCHC admins, your site has had a lot of transfer failures since last week. An "HTTP 401" error was found in the log file [1]. Could you please take a look and check your storage configuration/permissions?
Thank you,
Noy
[1]
https://fts-cms-007.cern.ch:8449/var/log/fts3/transfers/2025-04-14/eoscms.cern.ch__se01.grid.nchc.org.tw/2025-04-14-1700__eoscms.cern.ch__se01.grid.nchc.org.tw__4772341233__c47e68b6-1951-11f0-82cc-fa163e9e00e6
Just synchronized the system clock. Hope that resolves the issue.
Hello NCHC admins, the transfer failures have been persistent since last week. There is an "HTTP 403" error in the log files [1]. Could you please take a look at the storage area and check the configuration, directories, and user permissions?
Thank you,
Noy
[1]
https://fts-cms-004.cern.ch:8449/var/log/fts3/transfers/2025-08-05/eos.grif.fr__se01.grid.nchc.org.tw/2025-08-05-0343__eos.grif.fr__se01.grid.nchc.org.tw__4989695304__3ac1a6dc-71ae-11f0-b53d-fa163e65d4d2
https://fts-cms-003.cern.ch:8449/var/log/fts3/transfers/2025-08-05/dcache-cms-webdav.desy.de__se01.grid.nchc.org.tw/2025-08-05-0329__dcache-cms-webdav.desy.de__se01.grid.nchc.org.tw__4989682057__62e8ae46-71ac-11f0-9dd0-fa163e172462
Hello NCHC admins.
The issue has been going on for three weeks. Could you please take a look and check the storage-area configuration/permissions for the cms user?
Cheers,
Noy
It seems the write permission issues are happening only from sites that use tokens. Sites that do not use tokens are working correctly (please let me know if this is not the case). I'm currently investigating my dCache setup to resolve the token write problem....
Hello Chun-Yu. The failures are still going on. Could you please provide an update?
Cheers,
Noy
Hello NCHC admins. The issue is still there; an "HTTP 403" error appears in the log files [1]. These logs are from transfers between Tier-1 sites and your storage area. Could you please take a look?
Best regards,
Noy
[1]
https://fts-cms-004.cern.ch:8449/var/log/fts3/transfers/2025-11-02/ccdavcms.in2p3.fr__se01.grid.nchc.org.tw/2025-11-02-0433__ccdavcms.in2p3.fr__se01.grid.nchc.org.tw__5234236111__1bccc170-b7a5-11f0-873d-fa163e5dc706
https://fts-cms-009.cern.ch:8449/var/log/fts3/transfers/2025-11-02/xfer-cms.cr.cnaf.infn.it__se01.grid.nchc.org.tw/2025-11-02-0354__xfer-cms.cr.cnaf.infn.it__se01.grid.nchc.org.tw__5234178485__a8722576-b79f-11f0-88c1-fa163e172462
https://fts-cms-003.cern.ch:8449/var/log/fts3/transfers/2025-11-02/se.cis.gov.pl__se01.grid.nchc.org.tw/2025-11-02-0457__se.cis.gov.pl__se01.grid.nchc.org.tw__5234254259__531d31a2-b7a8-11f0-a04f-fa163e4409e3
https://fts-cms-005.cern.ch:8449/var/log/fts3/transfers/2025-11-02/rdr.echo.stfc.ac.uk__se01.grid.nchc.org.tw/2025-11-02-0543__rdr.echo.stfc.ac.uk__se01.grid.nchc.org.tw__5234296188__e3e95f8e-b7ae-11f0-803f-fa163e255442
https://fts-cms-009.cern.ch:8449/var/log/fts3/transfers/2025-11-02/cmsdcache-kit-disk.gridka.de__se01.grid.nchc.org.tw/2025-11-02-0557__cmsdcache-kit-disk.gridka.de__se01.grid.nchc.org.tw__5234311769__d842abe8-b7b0-11f0-938d-fa163e172462
https://fts-cms-008.cern.ch:8449/var/log/fts3/transfers/2025-11-02/webdav-cms.pic.es__se01.grid.nchc.org.tw/2025-11-02-0455__webdav-cms.pic.es__se01.grid.nchc.org.tw__5234253850__2fdc916a-b7a8-11f0-b89a-fa163e255442
Hello, NCHC admins,
The issue is still going on. Could you please take a look and check your storage area/endpoint's load and capacity?
Thank you,
Noy
Hi, I believe the issue is due to the wrong
authorization scope of the token used by loadtest to/from T2_TW_NCHC, which is shown as
storage.modify:/cms/store/test/loadtest/source/
To be compatible with our dCache storage, it should be without the "/cms" prefix. Can the center manager modify the scope to
storage.*:/store/test/loadtest/source/
Thanks.
Thanks Chun-Yu!
All paths in CMS token scopes should start with /store/.

This must be a leftover incorrect Rucio setup. Let me involve the transfer/data-management team.
Thanks,
cheers, Stephan
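The convention stated above can be sketched as a quick scope check; the scope string below is an illustrative example built from the values quoted in this ticket, not an actual token:

```shell
# Check that every storage claim's path starts with /store/ (the CMS
# convention); the first claim below is the faulty one from the ticket.
scope="storage.modify:/cms/store/test/loadtest/source/ storage.read:/store/"
bad=0
for claim in $scope; do
  path=${claim#*:}      # strip the "storage.<op>:" prefix
  case $path in
    /store/*) echo "OK   $claim" ;;
    *)        echo "BAD  $claim (path must start with /store/)"; bad=$((bad+1)) ;;
  esac
done
echo "$bad non-conforming claim(s)"
```

Run against the example scope, this flags the "/cms/store/..." claim and passes the "/store/..." one.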
CMS #1001133 (id:1001133) Pilot validation failures at T2_TW_NCHC
State: assigned  |  Priority: urgent  |  Opened: 2025-11-14 15:39 (141d ago)  |  Updated: 2026-03-23 10:59
Conversation (2 messages)
Dear T2_TW_NCHC admins,

Since yesterday morning (13 Nov), a high number of pilots at T2_TW_NCHC have been failing validation [1]. According to the pilot logs, the error occurs in the test_squid.sh step:
=== Validation error in /home/condor_execute/dir_3996135/c621671d0a3a/glide_AnFepu/client/test_squid.sh ===
Fri Nov 14 20:09:29 CST 2025 Error running '/home/condor_execute/dir_3996135/c621671d0a3a/glide_AnFepu/client/test_squid.sh'
The test script did not produce an XML file. No further information available.
Fri Nov 14 20:09:30 CST 2025 Notifying VO of error

ERROR: file /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml does not exist

The worker nodes are also missing the `host` command:
/home/condor_execute/dir_3996135/c621671d0a3a/glide_AnFepu/client/check_blacklist.sh: line 24: host: command not found
Could you please look into this?

Cheers,
Vaiva

[1] https://monit-grafana.cern.ch/goto/8fBANMivg?orgId=11
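The missing `host` command noted above is typically provided by the bind-utils package on EL-family systems (an assumption about the worker-node OS); a minimal check:

```shell
# Detect whether the `host` DNS lookup tool is installed and, if not,
# suggest the package that ships it on EL8/EL9.
if command -v host >/dev/null 2>&1; then
  msg="host command present"
else
  msg="host command missing; try: dnf install -y bind-utils"
fi
echo "$msg"
```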
I assume this ticket is completely irrelevant now. Can it be closed?
Cheers,
Andrea
CMS #1000299 (id:1000299) CMS Frontier Squid at T2_TW_NCHC
State: assigned  |  Priority: urgent  |  Opened: 2025-08-11 11:36 (236d ago)  |  Updated: 2025-11-28 14:48
Conversation (4 messages)
Hello, Your CMS Frontier squid seems to have intermittent problems and there is a lot of fail-over to the FNAL Backup Proxies.
So far today:
hits hits bandwidth

mon.grid.nchc.org.tw 434,566 434,566 2.34 GB 11 Aug 2025 - 06:08

se.grid.nchc.org.tw 85,146 85,146 437.02 MB 11 Aug 2025 - 06:08

Also, from the SAM tests:

warn [frontier.c:1025]: Request 1 on chan 1 failed at Mon Aug 11 10:21:26 2025: -8 [fn-htclient.c:444]: server error (HTTP/1.1 502 Bad Gateway) proxy=mon.grid.nchc.org.tw[10.200.6.1:3128] server=cmsfrontier.cern.ch

Best Regards,
Barry
Hello, Something is very wrong at your site. Your CMS squids are OK but they are not being used and there is massive
fail-over to FNAL:

mon.grid.nchc.org.tw
9,622,817
9,622,817
36.07 GB
03 Nov 2025 - 15:23

fore.grid.nchc.org.tw
6,181,725
6,181,725
23.13 GB
03 Nov 2025 - 15:19

I can't say more because the SAM tests don't run.

Best Regards,
Barry
Hello, The massive fail-over is happening again.
Please investigate.

Best Regards,
Barry
Hello, Your CMS Frontier squid on mon.grid.nchc.org.tw looks OK in the monitor, but your fail-overs to FNAL are increasing.
It is likely some misconfiguration or local network. So far today:

mon.grid.nchc.org.tw 596,897 596,897 4.11 GB 28 Nov 2025 - 08:23

fore.grid.nchc.org.tw 566,523 566,523 3.86 GB 28 Nov 2025 - 08:23

disk3.grid.nchc.org.tw 146,694 146,694 906.93 MB 28 Nov 2025 - 08:23

Query (proxyurl=http://mon.grid.nchc.org.tw:3128) started: Fri Nov 28 14:12:06 UTC 2025
(proxyurl=http://10.200.6.1:3128) is FAILED:
warn [frontier.c:1025]: Request 1 on chan 1 failed at Fri Nov 28 14:12:16 2025: -6 [fn-socket.c:261]: read from 10.200.6.1:3128 timed out after 10 seconds
warn [frontier.c:1103]: Trying next server cmsfrontier1.cern.ch with same proxy 10.200.6.1[10.200.6.1:3128]
warn [frontier.c:1025]: Request 2 on chan 1 failed at Fri Nov 28 14:12:16 2025: -8 [fn-htclient.c:444]: server error (HTTP/1.1 502 Bad Gateway) proxy=10.200.6.1[10.200.6.1:3128]

Best Regards,
Barry
CMS #681748 (id:1849) xrdcp fails for me with T2_TW_NCHC
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:18 (430d ago)  |  Updated: 2025-09-19 11:21
Conversation (13 messages)
GGUS ID: 168267
Last modifier: Bockjoo Kim
Date: 2024-09-18 10:42:25
Subject: xrdcp fails for me with T2_TW_NCHC
Ticket Type: USER
CC: cms-comp-ops-transfer-team@cern.ch;
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: CMS_AAA WAN Access
Description:
Hi Chun-yu,
Your SAM tests are passing: https://monit-grafana.cern.ch/goto/HObtZngHg?orgId=20
However, the equivalent xrdcp command fails for me (I noticed this because some production jobs were failing with some samples at NCHC):

[bockjoo@cms AAA]$ root://se01.grid.nchc.org.tw:1094//store/test/xrootd/T2_TW_NCHC/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root
-bash: root://se01.grid.nchc.org.tw:1094//store/test/xrootd/T2_TW_NCHC/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root: No such file or directory
[bockjoo@cms AAA]$ xrdcp -d 1 -f root://se01.grid.nchc.org.tw:1094//store/test/xrootd/T2_TW_NCHC/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root /dev/null
[2024-09-18 06:32:41.540877 -0400][Info ][AsyncSock ] [se01.grid.nchc.org.tw:1094.0] TLS hand-shake done.
[2024-09-18 06:32:52.866199 -0400][Error ][XRootD ] [se01.grid.nchc.org.tw:1094] Got invalid redirection URL: 2001:e10:2000:211:0:0:0:3?org.dcache.xrootd.client=bockjoo.4136235@cms.rc.ufl.edu&org.dcache.uuid=206ff7fe-b7bd-4ab9-9a97-3e12f49a627c?org.dcache.door=2001:e10:4000:122:0:0:0:105:1094
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Invalid redirect URL: (source)

Could it be due to a DNS issue? Can you check?
Thanks,
Bockjoo
GGUS ID: 168267
Last modifier: Bockjoo Kim
Date: 2024-09-23 09:18:44

Public Diary:
Priority has been changed from urgent to top priority.
Hi Chun-yu,
[root@vocms036 FedProbeSendAAAMetrics]# xrdmapc --list all se01.grid.nchc.org.tw:1094
0**** se01.grid.nchc.org.tw:1094
Srv [2001:e10:4000:122::105]:1094

2001:e10:4000:122::105 needs a hostname.
Can you check asap?
Thanks,
Bockjoo
Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
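The hostname request above boils down to a missing IPv6 reverse (PTR) record; a minimal check with dig, assuming bind-utils and outbound DNS are available (the address is the one quoted in the ticket):

```shell
# Reverse-lookup check: a data-server IPv6 address should map back to a
# hostname via a PTR record; the xrootd redirect breaks when it does not.
addr=2001:e10:4000:122::105
ptr=$(dig -x "$addr" +short +time=2 +tries=1 2>/dev/null)
if [ -n "$ptr" ]; then
  result="PTR for $addr: $ptr"
else
  result="no PTR record found for $addr (the symptom reported in this ticket)"
fi
echo "$result"
```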
GGUS ID: 168267
Last modifier: Bockjoo Kim
Date: 2024-09-18 22:49:20
Changed CC to cms-comp-ops-transfer-team@cern.ch;stephan.lammel@cern.ch

Public Diary:
Priority has been changed from urgent to top priority.
Hi Chun-yu,
[root@vocms036 FedProbeSendAAAMetrics]# xrdmapc --list all se01.grid.nchc.org.tw:1094
0**** se01.grid.nchc.org.tw:1094
Srv [2001:e10:4000:122::105]:1094

2001:e10:4000:122::105 needs a hostname.
Can you check asap?
Thanks,
Bockjoo
Internal Diary:
Escalated this ticket to ROC_Asia/Pacific and the TPM on shift.
GGUS ID: 168267
Last modifier: Chun-Yu Lin
Date: 2024-09-24 01:23:02

Public Diary:
Hi, Bockjoo,
I recently identified the reverse IPv6 lookup issue (present since ~Aug 31)
and am consulting with the network admins. Meanwhile, I want to ask whether
xrootd is more stringent with IPv6 requirements compared to davs, as the
davs transfer tests appear to be working fine.
Thanks,
Chun-yu
Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168267
Last modifier: Bockjoo Kim
Date: 2024-09-23 09:18:45

Public Diary:
Hi T2_TW_NCHC Colleagues,
Can you respond to this ticket?
I am seeing production job failures due to this issue (See the attachment).
Thanks,
Bockjoo

Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168267
Last modifier: Chun-Yu Lin
Date: 2024-09-24 02:50:02

Public Diary:
Got it. Our network admins are debugging the reverse-lookup issue now.
And yes, we subscribe to the EU redirector in the federation xrootd daemon:
all.manager xrootd-cms.infn.it+:1213
In dCache, I still send the federation monitoring summary to UCSD:
xrootd.monitor.summary=xrootd.t2.ucsd.edu:9931:60
because via our network, TWAREN, it is logically closer to us.
Thanks.
Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168267
Last modifier: Bockjoo Kim
Date: 2024-09-24 02:28:21

Public Diary:
Hi Chun-yu,
It appears so.
But production jobs use the root protocol,
so this needs to be resolved fast.
We can ask the CMSSW developers to fall back to the davs protocol, but that would take much longer than resolving the DNS issue.
By the way, are you subscribed to xrootd-cms.infn.it?
Thanks,
Bockjoo
Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168267
Last modifier: Chun-Yu Lin
Date: 2024-09-24 07:58:01

Public Diary:
Hi Bockjoo, we fixed the reverse lookup issue and now xrootd copy works.
Please verify if everything works as expected. Thanks.
Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168267
Last modifier: Chun-Yu Lin
Date: 2024-09-25 01:46:02

Public Diary:
Hi Bockjoo,
Yes, it is a public IP with an FQHN. Could it just be due to
propagation latency? It looks fine now on lxplus:
[chunyu@lxplus930 ~]$ xrdmapc --list all xrootd-cms.infn.it:1094|
grep nchc
Srv se01.grid.nchc.org.tw:11001
As for CMSSW, I don't know how to run it. Could you check it again, or
give me a test example that I can run locally?
Many thanks,
Chun-yu
Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168267
Last modifier: Bockjoo Kim
Date: 2024-09-24 09:19:57

Public Diary:
Hi Chun-yu,
Thanks for the quick action!
Yes, xrdcp works fine, now.
I still have a concern because
xrdmapc --list all xrootd-cms.infn.it:1094
still shows an IP instead of a hostname:

Srv [2001:e10:4000:122::105]:11001
Is this a public IP that is supposed to be mapped to an FQHN?
Also CMSSW shows the ip instead of hostname:

[1] A Try # = 0 trying to open new_filename root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutri
no_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root
m_name root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRu
n3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root orig_site T2_IT_Pisa
%MSG
%MSG-w XrdAdaptorInternal: file_open 24-Sep-2024 05:14:48 EDT pre-events
[2] handler.WaitForResponse
%MSG
%MSG-w XrdAdaptorInternal: file_open 24-Sep-2024 05:14:48 EDT pre-events
[2]-1 idx_redir 1 stack_ip IPv6 ip [2001:760:422b::53] auth hostname xrootd-redic.pi.infn.it host.loadBalanc
er 1
%MSG
%MSG-w XrdAdaptorInternal: file_open 24-Sep-2024 05:14:58 EDT pre-events
[2]-1 idx_redir 2 stack_ip IPv6 ip [2001:e10:4000:122::105] auth gsi hostname [2001:e10:4000:122::105] host.l
oadBalancer 0
%MSG
%MSG-w XrdAdaptorInternal: file_open 24-Sep-2024 05:15:03 EDT pre-events
[2]-1 idx_redir 3 stack_ip IPv6 ip [2001:e10:4000:122::105] auth gsi hostname [2001:e10:4000:122::105] host.l
oadBalancer 0
%MSG
%MSG-w XrdAdaptorInternal: file_open 24-Sep-2024 05:15:13 EDT pre-events
[2]-1 idx_redir 4 stack_ip IPv6 ip [2001:e10:6040:132::8] auth hostname [2001:e10:6040:132::8] host.loadBala
ncer 0
%MSG
%MSG-w XrdAdaptorInternal: file_open 24-Sep-2024 05:15:13 EDT pre-events
[2]-2 about to call Source::determineHostExcludeString
%MSG
%MSG-w XrdAdaptorInternal: file_open 24-Sep-2024 05:15:13 EDT pre-events
[2]-3 Source::determineHostExcludeString excludeStringse01.grid.NCHC.org.tw

Although CMSSW read the file from another source after the DNS fix, if a file exists only at T2_TW_NCHC (e.g., unmerged files),
we might still have trouble.
Can you look into it?
Thanks,
Bockjoo

Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168267
Last modifier: Bockjoo Kim
Date: 2024-09-25 09:50:37

Public Diary:
Hi Chun-yu,
Yes, it shows both the FQDN and the IPv6 address at CERN:
[11:12][cvcms@lxcvmfs106 (production:cvmfs/lx)~]$ grep -i "nchc\|2001:6d8" xrdmapc.EU.txt
Srv [2001:6d8:0:2000::20]:11001
Srv se01.grid.nchc.org.tw:11001
But at Florida (cms.rc.ufl.edu), I see only the IPv6 address:
[bockjoo@cms AAA]$ grep -i "nchc\|2001:6d8" xrdmapc.EU.txt
Srv [2001:6d8:0:2000::20]:11001
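The asymmetry above can be spotted mechanically: an entry like `Srv [2001:6d8:0:2000::20]:11001` is a bare IPv6 literal, while `Srv se01.grid.nchc.org.tw:11001` carries a proper hostname. A minimal sketch (the `Srv <endpoint>:<port>` line shape is an assumption based only on the two lines quoted above; real xrdmapc output may carry more fields):

```python
import ipaddress

def classify_srv_line(line):
    """Return (endpoint, kind) for an xrdmapc server line.

    kind is 'ip-literal' when the server registered with a bare
    IP address (the problematic case above) and 'hostname' when
    it registered with an FQDN.
    """
    endpoint = line.split()[1]          # "Srv <endpoint>" assumed
    host, _, _port = endpoint.rpartition(":")
    host = host.strip("[]")             # IPv6 literals are bracketed
    try:
        ipaddress.ip_address(host)
        return (host, "ip-literal")
    except ValueError:
        return (host, "hostname")

cern_view = [
    "Srv [2001:6d8:0:2000::20]:11001",
    "Srv se01.grid.nchc.org.tw:11001",
]
for line in cern_view:
    print(classify_srv_line(line))
```

Running this over the CERN and Florida dumps would show directly which redirector view lost the hostname entry.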
I use this CMSSW config:
import FWCore.ParameterSet.Config as cms

process = cms.Process('NoSplit')

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        # NCHC nchc
        'root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root',
    )
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(10))
process.options = cms.untracked.PSet(wantSummary = cms.untracked.bool(True))
process.output = cms.OutputModule("PoolOutputModule",
    #outputCommands = cms.untracked.vstring("drop *", "keep recoTracks_*_*_*"),
    fileName = cms.untracked.string('output.root'),
)
process.out = cms.EndPath(process.output)

and CMSSW_13_3_1_patch1
On the other hand, I haven't seen errors in the monitoring since Monday.
So maybe you are good. Let's keep this ticket open for a while to see if you can find something for the xrdmapc issue.
Thanks,
Bockjoo
Internal Diary:
Is this ticket still relevant? -- Thanks, Noy
CMS #681726 (id:1827) Broken permissions at /store/unmerged for certificates with role: cms:/cms/Role=production; site:T2_TW_NCHC
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:14 (430d ago)  |  Updated: 2025-02-03 09:38
Conversation (7 messages)
GGUS ID: 163084
Last modifier: Todor Ivanov
Date: 2023-08-16 14:41:38
Subject: Broken permissions at /store/unmerged for certificates with role: cms:/cms/Role=production; site:T2_TW_NCHC
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;alan.malta@cern.ch;lammel@fnal.gov;
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: CMS_Facilities
Description:
Dear site Admins,

Following a recent change of one of our service certificates, we have noticed that your site and a few others are throwing `Permission Denied` errors while we try to clean your storage of unnecessary files.

Could you please give write permission on the `/store/unmerged` area to any certificate having the role `cms:/cms/Role=production`, as described in the CMS NameSpace policy [1].

To check the errors your site is returning, follow link [2].
Thank you in advance!

Regards,
Todor Ivanov on behalf of WMCore Team.


[1]
https://twiki.cern.ch/twiki/bin/view/CMS/DMWMPG_Namespace

[2]
https://cmsweb.cern.ch/ms-unmerged/data/info?rse=T2_TW_NCHC&detail=False
GGUS ID: 163084
Last modifier: Stephan Lammel
Date: 2023-09-06 17:57:29

Public Diary:
Hallo Jen,
are you stopping/cancelling them or limiting them to just a site?
Thanks,
- Stephan
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 163084
Last modifier: Jennifer Adelman-McCarthy
Date: 2023-09-06 17:46:51

Public Diary:
Production is having a similar issue; we have 5 workflows that are failing with errors such as:
SUS-RunIISummer20UL18MiniAODv2-00681_0MergeMINIAODSIMoutput:60450
code twikiNoOutput (Exit code: 60450)
No output files present in the report
List of skipped files is:
/store/unmerged/RunIISummer20UL18MiniAODv2/DM_MonoZPrime_V_Mx50_Mv2000_gDM1gSM0p25_Zprime0p2_TuneCP5_13TeV_madgraph-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/2530000/29DACE6C-FCDB-AF4C-ADC0-4127E9785EA4.root
SUS-RunIISummer20UL18NanoAODv9-00673_0MergeNANOEDMAODSIMoutput:60450


code twikiNoOutput (Exit code: 60450)
No output files present in the report
List of skipped files is:
/store/unmerged/RunIISummer20UL18NanoAODv9/DM_MonoZPrime_V_Mx50_Mv2000_gDM1gSM0p25_Zprime0p2_TuneCP5_13TeV_madgraph-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/2530000/18319680-FB11-E040-8433-A5B0216DF1D6.root



Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 163084
Last modifier: Chun-Yu Lin
Date: 2023-11-22 01:26:02

Public Diary:
Dear Todor,
I made some changes in the dCache configuration, and gfal-copy
into /store/unmerged seems to work now via my certificate with the production
role. Could you check whether it also works as expected on your side?
Thanks,
Chun-yu
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 163084
Last modifier: Todor Ivanov
Date: 2023-11-06 13:41:42

Public Diary:
Dear Site Admins,

Did you have a chance to take a look at this issue? I can now confirm the error is present even when I manually try to access the top-level branch of the `/store/unmerged/` area at your site:

```
In [60]: ctx.unlink(rseList['T2_TW_NCHC']['pfnPrefixes']['WebDAV'] + findUnprotectdLfn(ctx, msUnmerged, rseList['T2_TW_NCHC']))
2023-11-06 14:37:52,283:INFO:init:findUnprotectdLfn(): Using PfnPrefix: davs://se01.grid.nchc.org.tw//cms
2023-11-06 14:37:52,284:INFO:init:findUnprotectdLfn(): Stat /store/unmerged/ area at: davs://se01.grid.nchc.org.tw//cms/store/unmerged/
2023-11-06 14:37:54,322:ERROR:init:findUnprotectdLfn(): FAILED to open dirEntry: davs://se01.grid.nchc.org.tw//cms/store/unmerged/: gfalException: HTTP 401 : Authentication Error
```

Regards,
Todor
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 163084
Last modifier: Stephan Lammel
Date: 2024-05-20 12:45:31
Changed CC to cms-comp-ops-site-support-team@cern.ch;alan.malta@cern.ch;lammel@cern.ch;

Public Diary:
Hi, recently our dCache config has been more stable and I am able to
look into the issue more closely.
Todor, may I ask how you manually accessed /store/unmerged ?
The permission issue puzzles me because the unmerged folder seems to me
to belong to the cms production group:
[root@se01 ~]# ls /pnfs/cms/store -al
drwxrwxr-x 199 prdcms001 cmsprd 512 Apr 5 01:33 unmerged
drwxrwxr-x 36 cms cms 512 Mar 28 08:55 user

I also checked the dCache/NFSv4 ACLs, and deletion by user/group has been
granted:
[se01] (PnfsManager@centralDomain) admin > getfacl /cms/store/unmerged
USER:0:+lfsxDd:fd
GROUP:0:+lfsxDd:fdg
[root@se01 ~]# nfs4_getfacl /pnfs/cms/store/unmerged
# file: /pnfs/cms/store/unmerged
A:fd:0:rwaDdx
A:fdg:0:rwaDdx
A::OWNER@:rwaDxtTcC
A::GROUP@:rwaDxtc
A::EVERYONE@:rxtc
Could you let me know more access information from you that I can trace
in the log?
Many Thanks,
Chun-yu
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 163084
Last modifier: Chun-Yu Lin
Date: 2024-05-20 04:39:02

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
WLCG tickets (1)
WLCG #681743 (id:1844) No accounting data for June in the EGI portal, WLCG accounting validation for the site has not been performed (TW-NCHC)
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:18 (430d ago)  |  Updated: 2025-02-03 11:18
Conversation (1 message)
GGUS ID: 167731
Last modifier: Julia Andreeva
Date: 2024-07-31 08:51:53
Subject: No accounting data for June in the EGI portal, WLCG accounting validation for the site has not been performed (TW-NCHC)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Other
Description:
You are receiving this ticket because the CPU consumption reported for June in the EGI portal for your site is 0; moreover, you did not perform the June validation using the CRIC UI, where you can inject numbers from your local accounting. As a result, your CPU metrics in the WLCG June accounting report will be 0. Please make sure that the APEL accounting data flow is fixed for your site for next month. If it will take longer to investigate and fix the problem, please use the CRIC monthly accounting validation procedure.
            -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM         100% 100% 100% 100% 100%  97% 100%  95% 100%  90%  92% 100% 100% 100%  66%  97%
HammerCloud  92% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS          50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #1002140 (id:1002140) Upgrade your HTCondorCE endpoints to 24.0.x series (Kharkov-KIPT-LCG2)
State: in progress  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 14:42
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint that is out of support, or because you provide HTCondorCE endpoint(s) whose version we could not determine by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all the changes that come with the upgrade must be applied manually, in particular the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
            -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM         100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
HammerCloud 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS          50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (2)

WLCG tickets (2)
WLCG #1001528 (id:1001528) Please add WLCG token access for DUNE to your compute element
State: in progress  |  Priority: urgent  |  Opened: 2026-01-13 20:13 (80d ago)  |  Updated: 2026-03-26 15:37
Conversation (13 messages)
DUNE is soon going to be in a situation where we are no longer allowed to use our SSL host and service certificates as SSL clients, due to changes that will be implemented in our InCommon IGTF CA. This means we will need to access all of our compute sites using WLCG tokens from 31 March 2026 onwards. Four ARC CE sites have already made this transition, and we are asking all of our other ARC CE sites to make the transition as soon as possible. In the UK, RAL-PP and Glasgow have successfully made the transition.
Hi Steven,

Can you please provide a link to the necessary configuration details that are accessible to non-DUNE members?
Thanks,
Daniela
https://wlcg-authz-wg.github.io/wlcg-authz-docs/token-based-authorization/doma-testbed/#arc-ce
Hi Daniela, please start with the above instructions.

The subject of the token is dunepilot@fnal.gov; the issuer is https://cilogon.org/dune
The user should be mapped to whatever user DUNE pilot jobs are mapped to at your site now.
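While debugging the mapping, the `iss` and `sub` claims can be read straight out of a pilot token without verifying the signature (verification remains the CE's job; this is for eyeballing only). A minimal sketch, assuming the token is a standard JWT; the helper names are ours, not part of any DUNE tooling:

```python
import base64
import json

# Expected claims for DUNE pilot tokens, taken from the message above.
EXPECTED_ISS = "https://cilogon.org/dune"
EXPECTED_SUB = "dunepilot@fnal.gov"

def token_claims(jwt):
    """Decode the payload of a JWT *without* signature verification."""
    payload_b64 = jwt.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def looks_like_dune_pilot(jwt):
    """True when the token carries the issuer/subject quoted above."""
    claims = token_claims(jwt)
    return (claims.get("iss") == EXPECTED_ISS
            and claims.get("sub") == EXPECTED_SUB)
```

If a rejected pilot's token shows the expected claims here, the failure is on the CE's mapping or trust side rather than in the token itself.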
Hi Brunel,
It might be worth looking at the QMUL ticket:
https://helpdesk.ggus.eu/#ticket/zoom/1001531

Daniela
This came from another grid-pp site that already has it working:
[authgroup: dune-iam]
authtokens = dunepilot@fnal.gov https://cilogon.org/dune * * *
(plus other ARC user mapping settings, like)
[mapping]
map_to_pool = pltdune /etc/grid-security/pool/pltdune

Also may have to add the amazon cert to your /etc/grid-security/certificates:

$ ls -lrt /etc/grid-security/certificates

-rw-r--r-- 1 root root 1188 Feb 24 13:18 Amazon-Root.pem
-rw-r--r-- 1 root root 1574 Feb 24 13:19 Amazon-RSA-M02.pem
lrwxrwxrwx 1 root root 15 Feb 24 13:20 ce5e74ef.0 -> Amazon-Root.pem
lrwxrwxrwx 1 root root 18 Feb 24 13:20 0a371e14.0 -> Amazon-RSA-M02.pem
-rw-r--r-- 1 root root 47 Feb 24 13:21 Amazon-RSA-M02.crl.url
...

$ cat /etc/grid-security/certificates/Amazon-RSA-M02.crl.url
http://crl.rootca1.amazontrust.com/rootca1.crl

https://www.amazontrust.com/repository/AmazonRootCA1.pem
https://www.amazontrust.com/repository/Amazon-RSA-2048-M02.pem
Hi Steven,
Apologies this has taken a while to get to, we've been quite occupied with works in our DC for the past couple weeks. Just acknowledging we've picked this up and should be able to get the tokens added soon.
Kind regards,James
So we had the factory configuration changed to send scitokens to Durham at 02:00 Fermilab local time, which would be 08:00 UTC. Pilot jobs were failing to be submitted correctly to Durham before we made the change, and they are still failing now.
The latest failures would be in your logs around that time, from gfactory-1.osg-htc.org and/or gfactory-2.osg-htc.org.
I'll try to get an interactive test to give more information because the arc-ce is famous for giving no meaningful errors back
when one tries to submit.

Please check your logs and see if you can figure out why the auth might be quietly failing, we will do the same.
Steven Timm
Hi Steve, did you accidentally post this in the wrong ticket ? (This is the Brunel ticket and your message mentions Durham.)
Indeed I did post the durham message into the Brunel ticket by mistake. My apologies.
The good news is that we also changed Brunel's dc_22 OSG entry last night, and that one is actually working.
Please ignore the message about Durham above, I will get it into the right ticket.

GGUS browser playing games with me this morning.. too many open tickets at once.

As far as the Brunel ticket is concerned we will hold in the current configuration for a couple days, with one of three arc-ce's doing scitokens and then will change the next one and then the one following..

Please hold the ticket open until all three arc-ce at Durham are changed.. but this is very good progress. thank you for your help thus far.
and, again, in the last line above I mean.. "please hold the ticket open until all three arc-ce at Brunel are changed", not Durham.
We have now instructed the OSG factory to shift the second Brunel entry (dc26) to use tokens.
update--they have not done it yet.. I have pinged them to change again.
OSG made the change on dc2_26 about a week ago.. we see a glidein pending on dc2_26. It does not have a job ID in your system which would indicate that there's likely some kind of an authentication failure. It was submitted today 6:43 our local time which would translate to 11:43 UTC. Can you please check the logs of dc2_26 to see what, if anything, happened to cause the auth failure?
WLCG #1001129 (id:1001129) DIRAC pilot submission fails for VO biomed into the site UKI-LT2-Brunel
State: in progress  |  Priority: urgent  |  Opened: 2025-11-14 14:41 (141d ago)  |  Updated: 2026-02-16 15:17
Conversation (16 messages)
Hello,

Pilot submission from DIRAC (VO biomed) fails with the following:

WARN: Issue while interacting with the delegations. Response: 403 - User can't be assigned configuration
ERROR: Could not get delegation IDs. Response: 403 - User can't be assigned configuration

The issue likely originates from missing or misconfigured biomed VO
setup in /etc/arc.conf, or missing entries in
/etc/grid-security/vomsdir/biomed.
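For context, accepting a VO on an ARC CE typically needs an authgroup with a `voms` rule, a mapping for that authgroup, and an `.lsc` file under `/etc/grid-security/vomsdir/biomed/` named after the VOMS server (here `voms-biomed.ijclab.in2p3.fr.lsc`) containing the server DN and its CA DN on two lines. A hypothetical sketch only; the pool account `biomed001` is a placeholder, not taken from this ticket:

```ini
# /etc/arc.conf -- hypothetical fragment; account name is a placeholder
[authgroup: biomed]
voms = biomed * * *

[mapping]
map_to_user = biomed biomed001:biomed
```

The VOMS server DN can be read from the proxy listing in this ticket; the CA DN should be taken from the biomed VO ID card rather than guessed.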

subject   : /DC=org/DC=terena/DC=tcs/C=FR/O=INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON/CN=Sorina Pop/CN=1658748987/CN=4945427788/CN=2474101952/CN=2854670634
issuer    : /DC=org/DC=terena/DC=tcs/C=FR/O=INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON/CN=Sorina Pop/CN=1658748987/CN=4945427788/CN=2474101952
identity  : /DC=org/DC=terena/DC=tcs/C=FR/O=INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON/CN=Sorina Pop/CN=1658748987/CN=4945427788/CN=2474101952
type      : RFC compliant proxy
strength  : 2048 bits
path      : proxy.scamarasu.biomed_pilot
timeleft  : 11:58:02
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO biomed extension information ===
VO        : biomed
subject   : /DC=org/DC=terena/DC=tcs/C=FR/O=INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON/CN=Sorina Pop
issuer    : /DC=org/DC=terena/DC=tcs/C=FR/L=Paris/O=Centre national de la recherche scientifique/CN=voms-biomed.ijclab.in2p3.fr
attribute : /biomed/Role=NULL/Capability=NULL
attribute : /biomed/lcg1/Role=NULL/Capability=NULL
attribute : /biomed/team/Role=NULL/Capability=NULL
timeleft  : 11:58:03
uri       : voms-biomed.ijclab.in2p3.fr:443

Please check the local configuration to accept the above credentials.
Currently we are not set up for biomed. We can set up the CEs for it if you can send us the details of what we need to do.
Thanks for your reply,
Yes, please set up your site to support the biomed VO, which will send jobs with the credentials above.
We have set up one of our CEs (dc2-grid-21.brunel.ac.uk) for the biomed VO. If jobs start going through it without issue, then we will enable it on more.
We still get errors when submitting to your site:
dc2-grid-22.brunel.ac.uk ERROR: Could not get delegation IDs. Response: 403 - User can't be assigned configuration
ERROR: Failed submission to queue Queue dc2-grid-22.brunel.ac.uk_nordugrid-Condor-default:
Could not get delegation IDs

It looks like the CE is failing to match the identity of the X509 certificate posted above.
Actually, I was not quite attentive: you spoke about the dc2-grid-21 CE. That one also fails, but differently:
WARN: Issue while interacting with the delegations. Connection failed, consider checking the state of the CE: HTTPSConnectionPool(host='dc2-grid-21.brunel.ac.uk', port=443): Max retries exceeded with url: /arex/rest/1.0/delegations (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9bafa78d10>: Failed to establish a new connection: [Errno 111] Connection refused'))

Can you please check that the CE is actually online ?
Dear Andrei,
Sorry I haven't got back to you sooner. We had put dc2-grid-21 in downtime yesterday as we were updating ARC on the node from v6 to v7. We also upgraded HTCondor to v25. Would you be able to resubmit the jobs to see if they are now going through, so that if there is an issue we can try to see why they are failing.

Kind regards,

Nalin
Dear Nalin,

Pilot submission to dc2-grid-28.brunel.ac.uk (queue nordugrid-Condor-default) is still failing with:

User can't be assigned configuration
Could not get delegation IDs

Can you please double-check the ARC delegation and VO mapping on the CE for the biomed VO (credentials above)? If you'd like to see more logs from our side, please let me know.
Dear Andrei,
Sorry, I forgot to add the mapping to the ARC conf file on dc2-grid-21.brunel.ac.uk. Could you please try again on dc2-grid-21.brunel.ac.uk?

Kind regards,

Nalin
Thanks Nalin,
However, the issue remains persistent on dc2-grid-21.brunel.ac.uk. Below are the logs from DIRAC:

DIRAC project will be installed by pilots
2025-12-17 06:30:06 UTC WorkloadManagement/SiteDirectorBiomed/dc2-grid-25.brunel.ac.uk WARN: Issue while interacting with the delegations. Response: 500 - User can't be assigned configuration
2025-12-17 06:30:06 UTC WorkloadManagement/SiteDirectorBiomed/dc2-grid-25.brunel.ac.uk ERROR: Could not get delegation IDs. Response: 500 - User can't be assigned configuration
2025-12-17 06:30:06 UTC WorkloadManagement/SiteDirectorBiomed/WorkloadManagement/SiteDirectorBiomed ERROR: Failed submission to queue Queue dc2-grid-25.brunel.ac.uk_nordugrid-Condor-default:
Could not get delegation IDs
2025-12-17 06:30:06 UTC WorkloadManagement/SiteDirectorBiomed/WorkloadManagement/SiteDirectorBiomed INFO: Failed pilot submission Queue: dc2-grid-25.brunel.ac.uk_nordugrid-Condor-default

Thanks so much.
Dear Mazen,
We have set biomed up on dc2-grid-21.brunel.ac.uk. If everything works on there then we will follow the same process on the other CE's at Brunel. If you have sent jobs to dc2-grid-21 and they fail can you please send us the errors so we can try to fix the issue. The ones above are for dc2-grid-25.

Kind regards,

Nalin
Thanks so much Nalin. Yes the logs above were for dc2-grid-25.brunel.ac.uk. Sorry about that
Nevertheless I am seeing the same issues/logs on dc2-grid-21.brunel.ac.uk; below are some recent logs from DIRAC. Note that I have sent jobs to the site and they are still in waiting state.

2025-12-18 09:38:58 UTC WorkloadManagement/SiteDirectorBiomed/WorkloadManagement/SiteDirectorBiomed INFO: dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default: Slots=25, TQ jobs(pilotsWeMayWantToSubmit)=151, Pilots: waiting 0, to submit=25
2025-12-18 09:39:00 UTC WorkloadManagement/SiteDirectorBiomed/WorkloadManagement/SiteDirectorBiomed INFO: Going to submit pilots (a maximum of 25 pilots to dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default queue)
2025-12-18 09:39:00 UTC WorkloadManagement/SiteDirectorBiomed/dc2-grid-21.brunel.ac.uk WARN: Issue while interacting with the delegations. Response: 403 - Operation is not allowed
2025-12-18 09:39:00 UTC WorkloadManagement/SiteDirectorBiomed/dc2-grid-21.brunel.ac.uk ERROR: Could not get delegation IDs. Response: 403 - Operation is not allowed
2025-12-18 09:39:00 UTC WorkloadManagement/SiteDirectorBiomed/WorkloadManagement/SiteDirectorBiomed ERROR: Failed submission to queue Queue dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default:
2025-12-18 09:39:00 UTC WorkloadManagement/SiteDirectorBiomed/WorkloadManagement/SiteDirectorBiomed INFO: Failed pilot submission Queue: dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default

Thank you.
Dear Mazen,
Thank you for the error information. We shall look into this and see if we can fix the issue for you.

Kind regards,

Nalin
Dear Mazen,
I have made some more changes to the Arc config files. Could you please try again and check if the jobs are going through.

Kind regards,

Nalin
Dear Nalin,
Unfortunately, the submission to dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default is still failing with the following issue : HTTP 500 – User can't be assigned configuration

This is a snippet of the logs :
SiteDirector: Submitting pilot to queue dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default
ARC submit command issued for dc2-grid-21.brunel.ac.uk

Submission failed for dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default
Response: 500 - User can't be assigned configuration

Thank you for double checking.
Dear Nalin,

We are still observing the same pilot submission failure on
dc2-grid-21.brunel.ac.uk_nordugrid-Condor-default.
The issue consistently occurs at the delegation step, returning:

HTTP 500 – “User can't be assigned configuration”

Any update on this ticket? Please let me know if you need more logs/details from our side.

Thank you.
            -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM           0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%
HammerCloud 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS          50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (6)

CMS tickets (1)
CMS #1001308 (id:1001308) Intermittent WebDAV SAM test failure at T2_UK_SGrid_Bristol
State: in progress  |  Priority: urgent  |  Opened: 2025-12-04 12:28 (121d ago)  |  Updated: 2026-02-06 13:57
Conversation (4 messages)
Good afternoon, Bristol admins.
Since yesterday (3 Dec), your WebDAV endpoint has failed several SAM "4crt-read" tests [1]. The log file shows a "...failed: timeout of 300s" error message [2]. Could you please take a look and check the WebDAV services?
Best regards,
Noy
[1]
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_UK_SGrid_Bristol
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-4crt-read&var-dst_hostname=xrootd.phy.bris.ac.uk&var-timestamp=1764845764000
Hi Noy,
Was just looking at it - cannot quite make sense of it.
It looks like the login works at first, but after the redirect it gets dropped to `nobody` [1]

It certainly does not help that we are still being bombarded by a user requesting non-existent data:
251204 12:50:00 74656 ofs_fsctl: mariadlf.3913524:41@submit82.mit.edu Unable to locate /store/mc/RunIII2024Summer24NanoAODv15/TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/; no such file or directory
-> 14 million entries in xrootd.log

I am really tempted to either block submit82.mit.edu or blacklist the user. This has been going on for months now

Cheers,
Luke

[1]
251204 12:42:52 26776 XrdVomsFun: found VO: cms
<snip>
251204 12:42:52 26776 XrootdBridge: 307b5c70.362784:49@etf-04.cern.ch login as 307b5c70.0
251204 12:43:01 26776 TPC_PullRequest: event=PULL_START, local=/xrootd/cms/store/temp/user/cmssam/se_webdav_20251204-1242_etf-04_2034721_mcrn.cpy, remote=https://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/temp/user/cmssam/se_webdav_20251204-1242_etf-04_2034721_mcrn.txt, user=(anonymous); Starting a pull request
251204 12:43:01 26776 TPC_PullRequest: event=REDIRECT, local=/xrootd/cms/store/temp/user/cmssam/se_webdav_20251204-1242_etf-04_2034721_mcrn.cpy, remote=https://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/temp/user/cmssam/se_webdav_20251204-1242_etf-04_2034721_mcrn.txt, user=(anonymous), status=307; Location: https://io02.phy.bris.ac.uk:1094//xrootd/cms/store/temp/user/cmssam/se_webdav_20251204-1242_etf-04_2034721_mcrn.cpy?authz=Bearer%20MDAyNGxvY2F0aW9uIFVLSS1TT1VUSEdSSUQtQlJJUy1IRVAKMDAzNGlkZW50aWZpZXIgNmIyZDI0NzQtODcwMC00NGY0LWEwOWYtOTJiZjk5MGUzYjlmCjAwMThjaWQgbmFtZTozMDdiNWM3MC4wCjAwNTJjaWQgYWN0aXZpdHk6UkVBRF9NRVRBREFUQSxVUExPQUQsRE9XTkxPQUQsREVMRVRFLE1BTkFHRSxVUERBVEVfTUVUQURBVEEsTElTVAowMDJiY2lkIGFjdGl2aXR5OkxJU1QsTUFOQUdFLFVQTE9BRCxERUxFVEUKMDA2MGNpZCBwYXRoOi94cm9vdGQvY21zL3N0b3JlL3RlbXAvdXNlci9jbXNzYW0vc2Vfd2ViZGF2XzIwMjUxMjA0LTEyNDJfZXRmLTA0XzIwMzQ3MjFfbWNybi5jcHkKMDAyNGNpZCBiZWZvcmU6MjAyNS0xMi0wNFQxMjo1ODowMFoKMDAyZnNpZ25hdHVyZSDeh3z4pNngoDPcb-akpKYijBfkPd_A_D6Qd0jufxaEDAo

251204 12:43:06 26776 XrootdXeq: unknown.362795:49@etf-04.cern.ch disc 0:00:00 (link SSL read error)
251204 12:43:07 26776 XrootdBridge: unknown.362800:49@etf-04.cern.ch login as nobody
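Before blocking anyone, it may help to rank the offenders: the 14 million "Unable to locate" entries can be tallied per client with a few lines of stdlib Python. A sketch only; the client-field shape (`user.pid:fd@host`) is an assumption based on the single xrootd.log line quoted above:

```python
import re
from collections import Counter

# Client field assumed to look like "mariadlf.3913524:41@submit82.mit.edu"
CLIENT_RE = re.compile(r"(\S+?)\.\d+:\d+@(\S+) Unable to locate")

def tally_missing_file_clients(lines):
    """Count (user, host) pairs producing 'Unable to locate' noise."""
    counts = Counter()
    for line in lines:
        m = CLIENT_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

sample = [
    "251204 12:50:00 74656 ofs_fsctl: mariadlf.3913524:41@submit82.mit.edu "
    "Unable to locate /store/mc/RunIII2024Summer24NanoAODv15/...; "
    "no such file or directory",
]
print(tally_missing_file_clients(sample).most_common(5))
```

Fed the real xrootd.log, the `most_common` output shows whether the flood really comes from a single user/host pair before any blacklisting decision.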
Hello Luke. Your WebDAV endpoint is still unstable, which makes your site transition from a "waiting room" to a "morgue". The various errors are turning the SAM test results red more often than the acceptable threshold allows. Could you please take a look and check this endpoint's status/services?
Thank you,
Noy
Good afternoon, Luke and Bristol admins. Since 16:00 UTC yesterday (5 Feb), your endpoint has been failing SAM "4crt-read" tests on both XRootD and WebDAV. The log files show "Couldn't connect to server. After 1 retry" and "Operation expired" messages [1]. Could you please take a look and check this server's status/services?

Cheers,
Noy
[1]

https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-4crt-read&var-dst_hostname=xrootd.phy.bris.ac.uk&var-timestamp=1770385969000
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-4crt-read&var-dst_hostname=xrootd.phy.bris.ac.uk&var-timestamp=1770385991000
WLCG tickets (5)
WLCG #681712 (id:1813) NGI_UK - April 2024 - RP/RC OLA performance
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:09 (430d ago)  |  Updated: 2026-04-02 09:05
Conversation (47 messages)
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-05-02 15:27:58
Subject: NGI_UK - April 2024 - RP/RC OLA performance
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_UK
Issue type: Operations
Description:
Dear NGI/ROC,

the EGI RC OLA and RP OLA Report for April 2024 has been produced and is available at the following links:
- NGIs reports: http://argo.egi.eu/egi/report-ar/Critical/NGI?month=2024-04 (clicking on the NGI name will display the resource centres' A/R figures)
- RCs reports: http://argo.egi.eu/egi/report-ar/Critical/SITES?month=2024-04

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre[2] OLAs, the following problems occurred:

============= RC Availability Reliability [2]==========

According to the recent availability/reliability report, the following sites have performed below the Availability target threshold in 3 consecutive months (February, March, and April):

UKI-SOUTHGRID-BRIS-HEP
continuation of the ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=166173
site currently in downtime until June: Upgrade to EL9, migration of storage system, Batch system revamp
https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=35244

* During the 10 working days after receiving this ticket, the NGI can suspend the site or ask for it not to be suspended by providing an adequate explanation. If no answer is provided to this ticket, the NGI will be contacted by email; if no reply is provided to the email, the site will be suspended [6].

If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.

If you think that the site should not be suspended, please provide justification in this ticket within 10 working days. If the site performance rises above targets within 3 days of providing the explanation, the site will not be suspended. Otherwise, EGI Operations may decide on suspension of the site.

**********************

Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"

[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"

[3] https://confluence.egi.eu/x/SiAmBg "EGI Infrastructure Oversight escalation"

[4] https://confluence.egi.eu/x/0h4mBg "Recomputation of SAM results or availability reliability statistics"

[5] https://docs.egi.eu/providers/operations-manuals/man05_top_and_site_bdii_high_availability/ "top-BDII and site-BDII High Availability"

[6] https://confluence.egi.eu/x/xx4mBg "Quality verification of monthly availability and reliability statistics"

Best Regards,
EGI Operations
GGUS ID: 166699
Last modifier: Matt Doidge
Date: 2024-05-07 09:36:36

Status: on hold
Responsible Unit: NGI_UK
Public Diary:
Hello,
As described in the downtime the site is offline for a major infrastructure overhaul. This is understood by the UK NGI, and preferable to an equal or longer period of unreliable operations whilst commissioning new services, so we fully support the site and ask for it not to be suspended.

Cheers,

Matt

Internal Diary:
Added attachment uboone-manchester.png
https://ggus.eu/index.php?mode=download&attid=ATT119757
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-07-01 12:36:15

Public Diary:
downtime for migration to EL9: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=35627
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-05-07 10:24:02

Public Diary:
Hi Matt,

thanks for providing further information. Let's keep this ticket on hold.

cheers,
Alessandro
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-11-01 15:19:48

Public Diary:
dear all, any news about the migration?
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-09-02 10:37:21

Public Diary:
Hi,

I see a new downtime event until Oct 31st: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=35823

I would suggest suspending the site until it is ready for production again.
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-11-01 15:50:57

Public Diary:
Hi Lukasz,

thanks for the news. I see indeed that the xrootd tests became green today: https://argo.egi.eu/egi/report-status/OPERATORS/SITES/UKI-SOUTHGRID-BRIS-HEP

What about the webdav protocol?

Cheers,
Alessandro
GGUS ID: 166699
Last modifier: Lukasz Kreczko
Date: 2024-11-01 15:37:28

Public Diary:
Ciao Alessandro,

The migration to EL9 has been completed and new storage and batch systems commissioned.
As of last week we have the xrootd service working again and I am currently working on the HTCondor-CEs and related tickets.
We have first jobs working, but the SSL side (ticket 164018) and per-VO job routing are still not working.

Cheers,
Luke
GGUS ID: 166699
Last modifier: Lukasz Kreczko
Date: 2024-11-01 16:03:57

Public Diary:
Webdav is working for CMS, just not for ops [1].
Still trying to find documentation for how to use EGI tokens in xrootd (found HTCondor-CE).
BTW: Do the EGI test only use token method?

Cheers,
Luke

[1]
241101 15:32:45 56026 XrdVomsFun: retrieval successful
241101 15:32:45 56026 XrdVomsFun: found VO: ops
241101 15:32:45 56026 XrdVomsFun: ---> group: '/ops', role: 'NULL', cap: 'NULL'
241101 15:32:45 56026 XrdVomsFun: ---> fqan: '/ops/Role=NULL/Capability=NULL'
241101 15:32:45 56026 XrootdBridge: 041e6d98.206638:121@sensu-agent-egi-devel-el9.cro-ngi.hr login as 041e6d98.0
241101 15:32:45 56026 041e6d98.206638:121@sensu-agent-egi-devel-el9.cro-ngi.hr ofs_stat: fn=/xrootd/ops
241101 15:32:45 56026 041e6d98.206638:121@sensu-agent-egi-devel-el9.cro-ngi.hr Xrootd_Protocol: rc=0 stat /xrootd/ops
241101 15:32:45 56026 041e6d98.206638:121@sensu-agent-egi-devel-el9.cro-ngi.hr ofs_opendir: fn=/xrootd/ops
241101 15:32:45 56026 scitokens_Access: Trying token-based access control
241101 15:32:45 56026 scitokens_Access: Token not found in recent cache; parsing.
241101 15:32:45 56026 scitokens_Parse: Token does not appear to be a valid JWT; skipping.
241101 15:32:45 56026 scitokens_Access: Failed to generate ACLs for token
241101 15:32:45 56026 acc_Audit: 041e6d98.206638:121@sensu-agent-egi-devel-el9.cro-ngi.hr deny gsi 041e6d98.0@[::ffff:161.53.0.244] readdir /xrootd/ops
241101 15:32:45 56026 ofs_opendir: 041e6d98.206638:121@sensu-agent-egi-devel-el9.cro-ngi.hr Unable to open directory /xrootd/ops; permission denied

GGUS ID: 166699
Last modifier: Matt Doidge
Date: 2024-11-01 16:12:44

Public Diary:
Hi Luke,
That looks to me like it's a classic voms access failing over to a token method (which then naturally fails). Ops tests don't use tokens yet AFAIK (and when they do, I presume they will be JWTs).

My first thought would be check your .lsc for Ops[1], but then I realised that xrootd tests are passing which confuses me...

Cheers,
Matt


[1] # cat /etc/grid-security/vomsdir/ops/voms-ops-auth.cern.ch.lsc
/DC=ch/DC=cern/OU=computers/CN=ops-auth.cern.ch
/DC=ch/DC=cern/CN=CERN Grid Certification Authority

GGUS ID: 166699
Last modifier: Lukasz Kreczko
Date: 2024-11-01 16:19:07

Public Diary:
Hi Matt,

Thanks for your message. We are taking vomsdir from CVMFS:
cat /cvmfs/grid.cern.ch/etc/grid-security/vomsdir/ops/voms-ops-auth.cern.ch.lsc
/DC=ch/DC=cern/OU=computers/CN=ops-auth.cern.ch
/DC=ch/DC=cern/CN=CERN Grid Certification Authority


> but then I realised that xrootd tests are passing which confuses me...
Yup, really easy to debug ;).

GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-11-29 09:59:42

Public Diary:
Hi Lukasz,

both the webdav and xrootd tests are failing with the "permission denied" error in the readwrite-LsDir metric, for example:

https://argo.egi.eu/egi/report-status/OPERATORS/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-LsDir/2024-11-29T08:33:03Z/CRITICAL/webdav

Is this expected due to the extended downtime of the site?

Cheers,
Alessandro
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2024-11-04 09:01:18

Public Diary:
Hi Lukasz,

the only thing I've noticed is that for the xroot metrics, the url root://xrootd.phy.bris.ac.uk:1094/ is used, and the metrics are successful.

For webdav, instead, the endpoint https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/ is used, and here the lsdir metric fails

You could check whether the permissions of that folder need a fix, or you could try modifying the URL and ARGO_WEBDAV_OPS_URL on the page https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13689

Cheers,
Alessandro
GGUS ID: 166699
Last modifier: Lukasz Kreczko
Date: 2024-11-29 11:05:08

Public Diary:
Ciao Alessandro,

No, this is not expected. CMS seems happy with the endpoint [1], so it should not be failing for OPS tests.
I will have a look.

Cheers,
Luke

[1]
https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dxrootd.phy.bris.ac.uk%26site%3Detf%26view_name%3Dhost
GGUS ID: 166699
Last modifier: Matt Doidge
Date: 2025-01-27 17:48:47

Public Diary:
Any news on this ticket?
Cheers,
Matt
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2025-01-28 09:15:31

Public Diary:
Hi,

unfortunately the tests are still failing:
- OPERATORS profile: https://argo.egi.eu/egi/report-status/OPERATORS/SITES/UKI-SOUTHGRID-BRIS-HEP

For webdav failures: https://argo.egi.eu/egi/report-status/OPERATORS/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-LsDir/2025-01-28T07:32:59Z/CRITICAL/webdav

can you confirm that the webdav url set here https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13689 is correct (ARGO_WEBDAV_OPS_URL) ?
Does the ops VO have both read/write rights?

For Xrootd failures: https://argo.egi.eu/egi/report-status/OPERATORS/SITES/metrics/xrootd.phy.bris.ac.uk/egi.xrootd.readwrite-LsDir/2025-01-28T06:43:35Z/UNKNOWN/XRootD

there is a timeout: is the used url correct?

Concerning HTCondorCE: which version do you have?

Cheers,
Alessandro
GGUS ID: 166699
Last modifier: Lukasz Kreczko
Date: 2025-01-28 09:29:47

Status: in progress
Responsible Unit: NGI_UK
Public Diary:
Hi Alessandro,

Thank you for the links.
> Concerning HTCondorCE: HTCondorCEVersion: 24.2.0

> there is a timeout: is the used url correct?
root://xrootd.phy.bris.ac.uk:1094/ should be root://xrootd.phy.bris.ac.uk:1094/xrootd/ops

>Does the ops VO have both read/write rights?
Yes.

> can you confirm that the webdav url set here https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13689 is correct (ARGO_WEBDAV_OPS_URL) ?
These are the entries in GOCDB
ARGO_WEBDAV_OPS_URL = https://xrootd.phy.bris.ac.uk:1094/xrootd/ops
ARGO_XROOTD_OPS_URL = root://xrootd.phy.bris.ac.uk:1094//xrootd/ops

Copying and pasting my previous response, as you submitted yours while I was writing ;).

Cannot figure out why ops is broken for storage, as I get an OPS proxy [1].

Compute has been fixed recently: https://ggus.eu/index.php?mode=ticket_info&ticket_id=164018
Our permission setup is similar to other VOs [2]

Cheers,
Luke



[1]
Contacting voms2.cern.ch:15009 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "ops" Failed
[2]
# Topics to summarize permissions
t writeopsdata /xrootd/ops/ a
t readopsdata /xrootd/ops/ lr
# "/ops/Role=production" has full access to managed OPS data and read for OPS
= opsprod o: ops g: /ops r: production
x opsprod writeopsdata readopsdata

# "/ops/Role=lcgadmin"
= opssgm o: ops g: /ops r: lcgadmin
x opssgm writeopsdata readopsdata

g /ops writeopsdata readopsdata

#g /ops /xrootd/ops/user/ a \
# /xrootd/ops/temp/ a \
# readopsdata


GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2025-01-28 09:58:05

Public Diary:
Hi Lukasz,

concerning the CE, in the last ticket indeed you fixed the ops VO settings but today I was still seeing 0% Av/Re:

https://argo.egi.eu/egi/report-ar-group-details/Critical/SITES/UKI-SOUTHGRID-BRIS-HEP/details
https://argo.egi.eu/egi/report-status/Critical/SITES/UKI-SOUTHGRID-BRIS-HEP/org.opensciencegrid.htcondorce?start_date=2025-01-01&end_date=2025-02-01

Then I've just noticed that the CE endpoints are still in downtime until Jan 31st:
https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=36101

XROOTD: maybe you should simply remove a "/" after the port from the Url set here https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13660 ?
ARGO_XROOTD_OPS_URL = root://xrootd.phy.bris.ac.uk:1094//xrootd/ops

Concerning the ops VO settings in general: note that voms2.cern.ch was decommissioned.
Maybe you need to update the lsc files on the storage endpoints as described here: https://twiki.cern.ch/twiki/bin/view/LCG/VOMSLSCfileConfiguration

Cheers,
Alessandro
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2025-01-28 10:18:11

Public Diary:
- CE: at the moment there is only the metric that checks the host certificate validity, so no test jobs submitted.
- XROOTD: let's see if anything changes after the new information is retrieved.

- ops VO: indeed, there is now only the IAM server ops-auth.cern.ch, but even though the old lsc files are still in the folder, this is not a problem since the new one is also there.
Can you try a test against the webdav and xrootd endpoints using a proxy issued by the new server?

Cheers,
Alessandro
GGUS ID: 166699
Last modifier: Lukasz Kreczko
Date: 2025-01-28 10:08:34

Public Diary:
Hi Alessandro,

>concerning the CE, in the last ticket indeed you fixed the ops VO settings but today I was still seeing 0% Av/Re:
And unfortunately, while I see the jobs on the CE side, I cannot see them on the batch system side -> looking into it.

>XROOTD: maybe you should simply remove a "/" after the port from the Url
Sure, let's try that again.

>Concerning the ops VO settings in general: note that voms2.cern.ch was decommissioned. Maybe you need to update the lsc files on the storage endpoints as described here: https://twiki.cern.ch/twiki/bin/view/LCG/VOMSLSCfileConfiguration
Probably. However, I get my info from /cvmfs/grid.cern.ch/etc/grid-security/ -> someone else needs to update that.
There is a mention of another one:
/cvmfs/grid.cern.ch/etc/grid-security/vomsdir/ops/
├── lcg-voms2.cern.ch.lsc
├── voms2.cern.ch.lsc
└── voms-ops-auth.cern.ch.lsc
but only one in vomses: /cvmfs/grid.cern.ch/etc/grid-security/vomses/ops-voms2.cern.ch.

Cheers,
Luke
GGUS ID: 166699
Last modifier: Alessandro Paolini
Date: 2025-01-28 11:26:38

Public Diary:
ok, try to add that final slash, please
GGUS ID: 166699
Last modifier: Lukasz Kreczko
Date: 2025-01-28 11:08:10

Public Diary:
> - ops VO: indeed now there is only the IAM server ops-auth.cern.ch but even if the old lsc files are still in the folder, this is not a problem since there is also the new one.
Yes, there is still a problem.
To fix it, I had to create
~/.voms/vomses/voms-ops-auth.cern.ch
with content
"ops" "voms-ops-auth.cern.ch" "443" "/DC=ch/DC=cern/OU=computers/CN=ops-auth" "ops"
that information is not available on CVMFS.

OK, now that I have that:
gfal-ls davs://xrootd.phy.bris.ac.uk:1094/xrootd/ops/
temp
testfile-put-1729593330-5c2b9a50-9061-11ef-a159-aa00005fe8f1.txt
user
gfal-ls davs://xrootd.phy.bris.ac.uk:1094/xrootd/ops
gfal-ls error: 1 (Operation not permitted) - HTTP 403 : Permission refused

That looks OK to me.
Oh, the test checks for /xrootd/ops, not /xrootd/ops/
deny gsi 041e6d98.0@egi.sensu.argo.grnet.gr stat /xrootd/ops

I guess I need to add a `/` in the URLs?
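For reference, the workaround described in the message above (creating a local vomses entry so a proxy for the ops VO can be obtained from the IAM-based VOMS endpoint) could be reproduced along these lines. This is only a sketch: the file path and DN string are copied verbatim from the message, and whether your VOMS client picks up `~/.voms/vomses` automatically depends on the client version, so verify against your setup:

```shell
# Sketch of the workaround from the message above: create a local vomses entry
# for the ops VO pointing at voms-ops-auth.cern.ch (content copied verbatim).
mkdir -p "$HOME/.voms/vomses"
printf '%s\n' \
  '"ops" "voms-ops-auth.cern.ch" "443" "/DC=ch/DC=cern/OU=computers/CN=ops-auth" "ops"' \
  > "$HOME/.voms/vomses/voms-ops-auth.cern.ch"
# Afterwards a proxy can be requested as usual, e.g.:
#   voms-proxy-init --voms ops
```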
Same issue. I tested it with an OPS proxy (with the work-around), so that should not be an issue.
VO mapping works:
250130 08:34:29 134510 XrdVomsFun: adding cert: /DC=EU/DC=EGI/C=HR/O=Robots/O=SRCE/CN=Robot:argo-egi@cro-ngi.hr

250130 08:34:29 134510 XrdVomsFun: retrieval successful
250130 08:34:29 134510 XrdVomsFun: found VO: ops

Access seems to work (tests):
250130 08:34:31 13067 sensu.716162:56@egi.sensu.argo.grnet.gr ofs_remove: f fn=/xrootd/ops/testfile-put-1738226069-0593caaa-dee5-11ef-8521-aa00005fe8f1.txt
250130 08:34:31 13067 acc_Audit: sensu.716162:56@egi.sensu.argo.grnet.gr grant gsi 041e6d98.0@egi.sensu.argo.grnet.gr delete /xrootd/ops/testfile-put-1738226069-0593caaa-dee5-11ef-8521-aa00005fe8f1.txt
250130 08:34:31 13067 sensu.716162:56@egi.sensu.argo.grnet.gr Xrootd_Protocol: rc=0 rm /xrootd/ops/testfile-put-1738226069-0593caaa-dee5-11ef-8521-aa00005fe8f1.txt
250130 08:34:31 13100 XrdTLS: sensu.716162:60@egi.sensu.argo.grnet.gr TLS error rc=0 ec=6 (zero_return) errno=0.
250130 08:34:31 13067 XrdTLS: sensu.716162:56@egi.sensu.argo.grnet.gr TLS error rc=0 ec=6 (zero_return) errno=0

but tests are still red.

Any ideas?
I see that the CE tests became green starting from Jan 31st, and XRootD became green from Jan 30th: https://argo.egi.eu/egi/report-status/OPERATORS/SITES/UKI-SOUTHGRID-BRIS-HEP/XRootD/xrootd.phy.bris.ac.uk

while for webdav there is still the permission error: https://argo.egi.eu/egi/report-status/Critical/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-LsDir/2025-02-03T13:33:02Z/CRITICAL/webdav

what did you change for the xrootd protocol?
The only thing that changed was GOCDB:
The xrootd endpoint has XROOTD_URL=root://xrootd.phy.bris.ac.uk:1094/xrootd/ops/
ARGO_XROOTD_OPS_URL=root://xrootd.phy.bris.ac.uk:1094/xrootd/ops/
ARGO_WEBDAV_OPS_URL=https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/

webdav has
ARGO_WEBDAV_OPS_URL=https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/
ARGO_XROOTD_OPS_URL=root://xrootd.phy.bris.ac.uk:1094/xrootd/ops/
(all the same except for XROOTD_URL).
I can see ops transfers happening and other VOs are using webdav happily (e.g. CMS). But all other VOs also ignore gocdb ;).
Let me involve a colleague of mine to debug this issue.

For some reason the ticket was reassigned to a CMS support group and I wasn't able to access it any longer. Let me assign it back to NGI_UK (and the site).
There is a permission error with the webdav test: https://argo.egi.eu/egi/report-status/Critical/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-LsDir/2025-05-02T08:32:58Z/CRITICAL/webdav

Endpoint URL :https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/
CRITICAL - HTTP 403 : Permission refused

@Andrea: do you have any hints to understand where the issue might be?
Hi Luke,
What do you have in your vomsdir for Ops? I just have:
# cat /etc/grid-security/vomsdir/ops/voms-ops-auth.cern.ch.lsc
/DC=ch/DC=cern/OU=computers/CN=ops-auth.cern.ch
/DC=ch/DC=cern/CN=CERN Grid Certification Authority

This was a semi-recent change that might have got missed. But my apologies if I'm trying to tell you how to suck eggs.

Cheers,
Matt
Hi all,
sorry for the late investigation of this ticket.

From what I see, the endpoint listing without a trailing slash is failing:

gfal-ls davs://xrootd.phy.bris.ac.uk:1094/xrootd/ops

The probe removes the trailing slash, when present, because it causes problems with dCache sites.

Is the trailing slash needed for webdav over xrootd?

thanks
Andrea
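The probe behaviour Andrea describes (stripping a trailing slash before issuing the request) can be sketched as a one-line shell normalization. `normalize_url` is a hypothetical name for illustration, not the probe's actual code:

```shell
# Hypothetical sketch of the trailing-slash stripping the probe is said to do.
normalize_url() {
  printf '%s\n' "${1%/}"   # drop one trailing "/" if present
}
# The form the probe ends up requesting, which this SE answers with HTTP 403:
normalize_url "https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/"
```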
A trailing slash shouldn't be needed:
$ gfal-ls https://xgate.hec.lancs.ac.uk:1094/cephfs/grid/dteam

marchmatttest.txt

smoke-test-lxplus756.cern.ch-29893

smoke-test-lxplus765.cern.ch-7876
...

Luke, do the Ops users have listing permissions in your authdb?

Cheers,
Matt
Hi Luke, did you make any progress?

Let us know,
Alessandro
Hi Matt, hi Alessandro,
Yes, OPS has listing permissions:
https://github.com/BristolComputing/xrootd-se/blob/main/etc/xrootd/Authfile#L284
and directory listings are enabled:
https://github.com/BristolComputing/xrootd-se/blob/main/etc/xrootd/config.d/20-https-and-security.cfg#L62

Cheers,
Luke
Hi Luke,
as suggested by Andrea, the final trailing slash in the webdav url could be the reason for the failures: could you please remove it from the various "url variables" registered in GOCDB?
https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13689

Let me know,
Alessandro
Hi Luke,
could you please try to remove the final trailing slash from the various "url variables" registered in GOCDB?
https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13689

Cheers,
Alessandro
Hi Alessandro,

Done.
If that is confirmed to work, will it be fixed in the probes?

Cheers,
Luke
Hi Luke,
let's see later if the tests will become green.

cheers,
Alessandro
Hi Luke,
unfortunately nothing changed yet. Could you please try to remove the last slash also from the "Url" field at https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13689 (grid information section)?

Let me know,
Alessandro
Hi Alessandro,
Changed that one too just now.

Cheers,
Luke
unfortunately, even without the final slash, the same error is still occurring: https://argo.egi.eu/egi/report-status/Critical/SITES/UKI-SOUTHGRID-BRIS-HEP/webdav/xrootd.phy.bris.ac.uk
https://argo.egi.eu/egi/report-status/Critical/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-LsDir/2025-10-31T11:32:56Z/CRITICAL/webdav

Host :xrootd.phy.bris.ac.uk
Metric :egi.webdav.readwrite-LsDir
Timestamp :2025-10-31T11:32:56Z
Endpoint URL :https://xrootd.phy.bris.ac.uk:1094/xrootd/ops
CRITICAL - HTTP 403 : Permission refused
'HTTP 403 : Permission refused'

@andrea: do you have any other hints?
251103 13:32:45 28650 acc_Audit: 041e6d98.128032:52@egi.sensu.argo.grnet.gr deny gsi 041e6d98.0@[2001:648:2ffe:11:a800:ff:fe5f:e8f1] stat /xrootd/ops
251103 13:32:45 28650 ofs_stat: 041e6d98.128032:52@egi.sensu.argo.grnet.gr Unable to locate /xrootd/ops; permission denied

Authfile says
t writeopsdata /xrootd/ops/ a
t readopsdata /xrootd/ops/ lr

the only difference is the `/`. And I cannot remove it from the Authfile since the previous admin noted: "### NOTE: all directories must end with a '/' otherwise regex is used? BAD!!!"

In the config list dir is enabled, and tests confirm it:
```
gfal-ls https://xrootd.phy.bris.ac.uk:1094/xrootd/ops

gfal-ls error: 1 (Operation not permitted) - HTTP 403 : Permission refused
[phxlk@hm01 ~]$ gfal-ls https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/

testfile-put-1757669694-bd079f18-8fbb-11f0-a781-0050569d783e.txt
testfile-put-1742693837-5b27e5da-0787-11f0-9ae0-aa00005fe8f1.txt
testfile-put-1744796127-2181e8d0-1aa6-11f0-8732-0050569dda71.txt
testfile-put-1759520203-4a69821a-a090-11f0-a531-0050569dda71.txt
testfile-put-1751171789-a00ccf40-54a2-11f0-b675-0050569d783e.txt
testfile-put-1742870131-d2542380-0921-11f0-86fe-aa00005fe8f1.txt
testfile-put-1749915434-725cb63c-4935-11f0-bb78-aa00005fe8f1.txt
temp
testfile-put-1744745728-c9491df2-1a30-11f0-9270-0050569d783e.txt
testfile-put-1729593330-5c2b9a50-9061-11ef-a159-aa00005fe8f1.txt
testfile-put-1739871525-2598e83e-eddc-11ef-a461-aa00005fe8f1.txt
user
testfile-put-1757669693-bcb51676-8fbb-11f0-b7ff-0050569dda71.txt
testfile-put-1759520203-4a6e8a26-a090-11f0-8c7e-0050569d783e.txt
testfile-put-1751171860-ca800f62-54a2-11f0-b0a8-aa00005fe8f1.txt
testfile-put-1759520204-4b13a3b2-a090-11f0-8a48-aa00005fe8f1.txt

```

Making sure that the test properly adds the `/` at the end would fix it.
Making sure that the test does not double up on the `/` when reading and writing files would also fix the GOCDB entry issue.
No other VO uses listDir.
Creation, stat, read, and delete are working:
251103 13:34:32 27783 acc_Audit: sensu.1088759:52@egi.sensu.argo.grnet.gr grant gsi 041e6d98.0@egi.sensu.argo.grnet.gr create /xrootd/ops/testfile-put-1762176871-d40dd55c-b8b9-11f0-8b96-aa00005fe8f1.txt
251103 13:34:32 28645 acc_Audit: sensu.2664704:45@sensu-agent-egi-el9.cro-ngi.hr grant gsi 041e6d98.0@sensu-agent-egi-el9.cro-ngi.hr delete /xrootd/ops/testfile-put-1762176871-d40bcf00-b8b9-11f0-b85b-0050569dda71.txt
251103 13:34:36 28214 acc_Audit: sensu.1088759:48@egi.sensu.argo.grnet.gr grant gsi 041e6d98.0@egi.sensu.argo.grnet.gr stat /xrootd/ops/testfile-put-1762176871-d40dd55c-b8b9-11f0-8b96-aa00005fe8f1.txt
251103 13:34:36 27783 acc_Audit: sensu.1088759:52@egi.sensu.argo.grnet.gr grant gsi 041e6d98.0@egi.sensu.argo.grnet.gr read /xrootd/ops/testfile-put-1762176871-d40dd55c-b8b9-11f0-8b96-aa00005fe8f1.txt
251103 13:34:36 28214 acc_Audit: sensu.1088759:48@egi.sensu.argo.grnet.gr grant gsi 041e6d98.0@egi.sensu.argo.grnet.gr delete /xrootd/ops/testfile-put-1762176871-d40dd55c-b8b9-11f0-8b96-aa00005fe8f1.txt

Cheers,
Luke
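Luke's two suggested probe fixes (always appending the slash for listings, and not doubling it when composing file URLs) both amount to slash-safe URL handling. A minimal sketch, with a hypothetical `join_url` helper that is not part of any real probe:

```shell
# Hypothetical slash-safe join: strip any trailing "/" from the base and any
# leading "/" from the path component, then insert exactly one separator.
join_url() {
  printf '%s/%s\n' "${1%/}" "${2#/}"
}
# Both spellings of the GOCDB endpoint then yield the same file URL:
join_url "https://xrootd.phy.bris.ac.uk:1094/xrootd/ops"  "testfile-put.txt"
join_url "https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/" "/testfile-put.txt"
```

With this, the listing URL would be built as `join_url "$base" ""` plus the trailing slash the SE requires, regardless of how the GOCDB value was entered.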
hi Luke,
given this situation, it is best to skip the lsdir check: this option was introduced to mitigate issues like this one with webdav and xrootd endpoints.
If you follow the example of the RAL-LCG2 endpoints,
https://goc.egi.eu/portal/index.php?Page_Type=Service&id=13697
https://goc.egi.eu/portal/index.php?Page_Type=Service&id=8855

for your webdav and xrootd endpoints define the following extension properties:
ARGO_WEBDAV_SKIP_DIR_TEST
ARGO_XROOTD_SKIP_LS_DIR

with the value: 0

and let's see the result of the tests after a few hours.

Cheers,
Alessandro
Hi Alessandro,
Done:
https://gocdb.iris.ac.uk/portal/index.php?Page_Type=Service&id=13689
https://gocdb.iris.ac.uk/portal/index.php?Page_Type=Service&id=13660

Cheers,
Luke
Hi Luke,
now the ls-dir test is skipped, and the failure happens with the put test:
https://argo.egi.eu/egi/report-status/Critical/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-Put/2025-11-04T04:33:06Z/CRITICAL/webdav

Host :xrootd.phy.bris.ac.uk
Metric :egi.webdav.readwrite-Put
Timestamp :2025-11-04T04:33:06Z
Endpoint URL :https://xrootd.phy.bris.ac.uk:1094/xrootd/ops
CRITICAL - Error copying to https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/testfile-put-1762230765-4f37cc42-b937-11f0-b0b8-aa00005fe8f1.txt, [Err:DESTINATION MAKE_PARENT HTTP 403 : Permission refused ]
Can you check whether this is still a problem with the slash, or whether there are actual authorisation issues?

cheers,
Alessandro
Dear all,
the webdav test is still failing:

https://argo.egi.eu/egi/report-status/Critical/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-Put/2026-02-04T05:33:06Z/CRITICAL/webdav

Host :xrootd.phy.bris.ac.uk
Metric :egi.webdav.readwrite-Put
Timestamp :2026-02-04T05:33:06Z
Endpoint URL :https://xrootd.phy.bris.ac.uk:1094/xrootd/ops
CRITICAL - Error copying to https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/testfile-put-1770183165-eef7ebe0-018a-11f1-947c-aa00005fe8f1.txt, [Err:DESTINATION MAKE_PARENT HTTP 403 : Permission refused ]
Could you please check what is going wrong?

Best regards,
Alessandro
Hi Luke,
could you please check why the webdav tests are failing?

https://argo.egi.eu/egi/report-status/Critical/SITES/metrics/xrootd.phy.bris.ac.uk/egi.webdav.readwrite-Put/2026-03-02T04:32:59Z/CRITICAL/webdav

Host :xrootd.phy.bris.ac.uk
Metric :egi.webdav.readwrite-Put
Timestamp :2026-03-02T04:32:59Z
Endpoint URL :https://xrootd.phy.bris.ac.uk:1094/xrootd/ops
CRITICAL - Error copying to https://xrootd.phy.bris.ac.uk:1094/xrootd/ops/testfile-put-1772425965-dc138c3a-15f0-11f1-a827-aa00005fe8f1.txt, [Err:DESTINATION OVERWRITE Result Could not connect to server after 1 attempts]
Hi Luke,
could you please check why the webdav tests are failing?

Let me know,
Alessandro
WLCG #1002158 (id:1002158) Upgrade your HTCondorCE endpoints to 24.0.x series (UKI-SOUTHGRID-BRIS-HEP)
State: assigned  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-24 12:04
Conversation (2 messages)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which is the minimum version that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all the changes must be applied manually, in particular the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
Luke;Lukasz Kreczko Did you see this ticket? Note that if you are on HTCondor 25, this will trigger a false positive, as it does not advertise its version via the BDII.
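As a quick local sanity check before answering such a ticket, the running version can be compared against the 24.0.x minimum requested above. The version strings below are examples (on a real CE the value might come from `condor_version` or the BDII), so treat this as a sketch only:

```shell
# Sketch: succeed if the given HTCondor-CE version meets the 24.0 minimum
# requested in the ticket (sort -V performs the version-aware comparison).
version_ok() {
  [ "$(printf '%s\n' "24.0" "$1" | sort -V | head -n1)" = "24.0" ]
}
version_ok "24.0.2"  && echo "supported"     || echo "needs upgrade"
version_ok "23.0.20" && echo "supported"     || echo "needs upgrade"
```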
WLCG #681709 (id:1810) Enable new monitoring flow for xrootd remote access (UKI-SOUTHGRID-BRIS-HEP)
State: waiting for submitter's reply  |  Priority: less urgent  |  Opened: 2025-01-29 10:09 (430d ago)  |  Updated: 2026-01-14 15:55
Conversation (6 messages)
GGUS ID: 164119
Last modifier: Julia Andreeva
Date: 2023-11-09 10:40:25
Subject: Enable new monitoring flow for xrootd remote access (UKI-SOUTHGRID-BRIS-HEP)
Ticket Type: USER
CC: ;borja.garrido.bear@cern.ch
Status: assigned
Responsible Unit: NGI_UK
Issue type: Monitoring
Description:
According to WLCG CRIC your site is running xrootd storage. The WLCG Monitoring Task Force implemented a new monitoring flow which should monitor remote data access more reliably. We request that sites set up and configure the new component 'shoveler', which has to be deployed at the site. Please follow the instructions, which can be found on the twiki:
https://twiki.cern.ch/twiki/bin/view/LCG/MonitoringTaskForce#Shoveler
Please, accomplish this task before the end of 2023.
GGUS ID: 164119
Last modifier: Lukasz Kreczko
Date: 2023-12-08 14:37:41

Status: on hold
Responsible Unit: NGI_UK
Public Diary:
Hi Julia,

We are currently very restricted in terms of person power; this task cannot be accomplished this side of 2023.
Especially when hearing about the complications from other sites.
I will put this ticket on hold and will revisit it in January 2024.

Cheers,
Luke
GGUS ID: 164119
Last modifier: Julia Andreeva
Date: 2024-10-28 14:33:46

Status: in progress
Responsible Unit: NGI_UK
Public Diary:
Yes, the ticket is still relevant.
Luke;Lukasz Kreczko

2025 update?
Hi,
shoveler has now been deployed; I just need to register it with the queue. The monitoring command is
```
xrootd.monitor all fstat 60s lfn ops ssq xfr 5 ident 5m dest fstat info user redir xrootd.t2.ucsd.edu:9330 dest fstat files info user pfc tcpmon ccm io02.phy.bris.ac.uk:9993
```
publishing via `stomp`:
```yaml
stomp:
  user:
  password:
  url: dashb-lb-mb.cern.ch:61123
  topic: /topic/xrootd.shoveler
```

Cheers,
Luke
Hello,
The site has deployed shoveler and given the details (above). Can this ticket be closed?
Thanks,
Matt
WLCG #681703 (id:1804) Missing CPU accounting data in the EGI portal for March. Monthly accounting validation has not been performed either. (UKI-SOUTHGRID-BRIS-HEP)
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 10:08 (430d ago)  |  Updated: 2026-01-14 14:30
Conversation (6 messages)
GGUS ID: 166655
Last modifier: Julia Andreeva
Date: 2024-04-29 13:50:26
Subject: Missing CPU accounting data in the EGI portal for March. Monthly accounting validation has not been performed either. (UKI-SOUTHGRID-BRIS-HEP)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_UK
Issue type: Other
Description:
CPU accounting data is missing for your site in the EGI accounting portal for March, validation has not been performed either. Please, make sure that your site is properly reporting to APEL. While solving the problem, please, provide proper accounting metrics during monthly validation.
GGUS ID: 166655
Last modifier: Matt Doidge
Date: 2024-06-18 08:54:23

Public Diary:
Hi Julia,
The site is in a long downtime which might explain some of the issues with accounting:
https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=35244

Cheers,

Matt

Internal Diary:
Involved CMS Glidein Factory in this ticket.
GGUS ID: 166655
Last modifier: Matt Doidge
Date: 2024-06-25 09:36:30

Status: on hold
Responsible Unit: NGI_UK
Public Diary:
Hi Julia,
The site is in a long downtime which might explain some of the issues with accounting:
https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=35244

Cheers,

Matt

GGUS ID: 166655
Last modifier: Lukasz Kreczko
Date: 2025-01-23 15:42:38

Public Diary:
Yes, still valid. About to restart the accounting as first jobs are landing again.
Luke;Lukasz Kreczko We are approaching the one-year anniversary of the last update; can this be merged with https://helpdesk.ggus.eu/#ticket/zoom/1001399, which I think is the same issue?
Yes, it is the same issue. Never got around to it after EL9 migration since everything kept moving - even today the storage is still not in a stable place.HEPSPEC has now been fully replaced by hepscore, hasn't it?
I've just recently benchmarked the existing nodes with it.
WLCG #1001399 (id:1001399) Site is not reporting accounting information (UKI-SOUTHGRID-BRIS-HEP)
State: in progress  |  Priority: urgent  |  Opened: 2025-12-16 15:04 (109d ago)  |  Updated: 2026-01-14 12:34
Conversation (3 messages)
The site is not reporting accounting information to the EGI/WLCG
accounting system. This needs to be fixed, as WLCG accounting data is
one of the indicators used to measure a site’s contribution to WLCG.
Luke;Lukasz Kreczko Did this ticket reach Bristol?
Yes, it did. Will get to it as soon as possible.
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM99%99%95%97%99%96%100%100%100%100%100%100%100%100%100%100%
HammerCloud100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #1002208 (id:1002208) Problems with file access and removal @RALPP SE
State: in progress  |  Priority: urgent  |  Opened: 2026-03-27 10:42 (8d ago)  |  Updated: 2026-04-02 08:51
Conversation (4 messages)
Dear site admins,
Recently, RALPP storage has started denying access to some of its directories for LHCb production users, e.g.
bash-5.1$ dirac-proxy-info
subject : /C=UK/O=eScience/OU=CLRC/L=RAL/CN=alexander rogovskiy/CN=8915737715/CN=899671709
issuer : /C=UK/O=eScience/OU=CLRC/L=RAL/CN=alexander rogovskiy/CN=8915737715
identity : /C=UK/O=eScience/OU=CLRC/L=RAL/CN=alexander rogovskiy
timeleft : 00:14:41
DIRAC group : lhcb_prmgr
DiracX : True
path : /afs/cern.ch/user/a/arogovsk/private/proxy.pem
username : arogovsk
properties : ProductionManagement, NormalUser, JobSharing, JobAdministrator, SiteManager, Operator, LimitedDelegation
VOMS : True
VOMS fqan : ['/lhcb/Role=production']
bash-5.1$ gfal-copy ./hw https://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/lhcb/lhcb/LHCb/Collision25/QEELOW.DST/00323827/0004/arogovsk_test
Copying file:///tmp/arogovsk/hw [FAILED] after 0s
gfal-copy error: 17 (File exists) - TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 403 : Permission refused (destination)
bash-5.1$
Could you please have a look?
Best Regards,
Alex
Hi Chris,
Following up on our discussion, I can copy a file to the LHCb root directory without any problems:
bash-5.1$ gfal-copy ./hw https://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/lhcb/lhcb/arogovsk_test
Copying file:///tmp/arogovsk/hw [DONE] after 0s
bash-5.1$
However, uploads to /pnfs/pp.rl.ac.uk/data/lhcb/lhcb/LHCb/Collision25/QEELOW.DST/00323827/0004 do not work:
bash-5.1$ gfal-copy ./hw https://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/lhcb/lhcb/LHCb/Collision25/QEELOW.DST/00323827/0004/arogovsk_test
Copying file:///tmp/arogovsk/hw [FAILED] after 0s
gfal-copy error: 17 (File exists) - TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 403 : Permission refused (destination)
bash-5.1$
In both tests I used the following proxy:
bash-5.1$ voms-proxy-info -all
subject : /C=UK/O=eScience/OU=CLRC/L=RAL/CN=alexander rogovskiy/CN=7232908778/CN=1297804902
issuer : /C=UK/O=eScience/OU=CLRC/L=RAL/CN=alexander rogovskiy/CN=7232908778
identity : /C=UK/O=eScience/OU=CLRC/L=RAL/CN=alexander rogovskiy/CN=7232908778
type : RFC compliant proxy
strength : 2048 bits
path : /afs/cern.ch/user/a/arogovsk/private/proxy.pem
timeleft : 0:20:26
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO lhcb extension information ===
VO : lhcb
subject : /C=UK/O=eScience/OU=CLRC/L=RAL/CN=alexander rogovskiy
issuer : /DC=ch/DC=cern/OU=computers/CN=lhcb-auth.cern.ch
attribute : /lhcb/Role=production/Capability=NULL
attribute : /lhcb/Role=NULL/Capability=NULL
attribute : nickname = arogovsk (lhcb)
timeleft : 0:20:26
uri : voms-lhcb-auth.cern.ch:443
bash-5.1$

Best Regards,
Alex
OK, I think I know the problem: it would appear that token transfers get mapped to a production user, but for X509 transfers even the production role is mapped to an ordinary pool account:

grep lhcb /etc/dcache/oidc-tokens.map
op:lhcb group:prdcms username:pdlhcb01 uid:171901

grep lhcb grid-vorolemap
# Added role /lhcb
"*" "/lhcb" lhcb001
# Added role /lhcb/Role=lcgadmin
"*" "/lhcb/Role=lcgadmin" lhcb001
"*" "/lhcb/Role=user" lhcb001
# Added role /lhcb/Role=production
"*" "/lhcb/Role=production" lhcb001

The directory you're failing to upload to is owned by pdlhcb01:

ls -ld /pnfs/pp.rl.ac.uk/data/lhcb/lhcb/LHCb/Collision25/QEELOW.DST/00323827/0004
drwxr-xr-x 2 pdlhcb01 prdlhcb 512 Nov 9 15:23 /pnfs/pp.rl.ac.uk/data/lhcb/lhcb/LHCb/Collision25/QEELOW.DST/00323827/0004

so I'm guessing it was created by a token transfer.

Two possible fixes, change the mapping for the X509 production role or change the mapping for the token and then fix the file ownerships.

The production role mapping to a normal pool user is non-standard, and I'm guessing it was done for a reason at the time which may or may not still be valid.

Are all transfers to our storage expected to use either tokens or X509 with a production role?

In which case it's probably cleaner to map both of those to pdlhcb01, and chown the whole tree.

But if you also expect X509 transfers to come in without the production role, unless they go into a different area of the namespace, we'll need to map everything to lhcb001.

What do you think?

Yours,
Chris.
Hi,
All write requests should be from production users (X509 & tokens).
Reads can come from users of course.
So far we don't have users writing to your site, and if that were to be the case we would indeed put it in a different namespace.
Thanks !
Chris
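The mapping mismatch Chris describes (tokens mapped to pdlhcb01 via oidc-tokens.map, X509 production role mapped to lhcb001 via grid-vorolemap) can be reproduced from the excerpt above. A minimal sketch, assuming the simple quoted `"DN" "FQAN" account` line format shown in the ticket — illustrative only, not dCache's actual gPlazma parser:

```python
import re

def parse_vorolemap(text):
    """Parse grid-vorolemap-style lines of the form: "DN" "FQAN" account."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        m = re.match(r'"([^"]*)"\s+"([^"]*)"\s+(\S+)$', line)
        if m:
            _dn, fqan, account = m.groups()
            mapping[fqan] = account
    return mapping

vorolemap = """
# Added role /lhcb
"*" "/lhcb" lhcb001
# Added role /lhcb/Role=production
"*" "/lhcb/Role=production" lhcb001
"""

mapping = parse_vorolemap(vorolemap)
# X509 production transfers resolve to the ordinary pool account, while the
# token mapping (oidc-tokens.map) uses pdlhcb01 -- hence the ownership clash
# and the HTTP 403 on directories created by token transfers.
print(mapping["/lhcb/Role=production"])  # lhcb001
```

Chris's two fixes then amount to making the columns agree: either map the X509 production FQAN to pdlhcb01 as well, or map tokens to lhcb001 and chown the affected tree.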
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%100%100%100%100%100%100%100%100%100%100%97%100%100%100%
HammerCloud100%100%100%100%100%100%100%99%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (2)

CMS tickets (2)
CMS #1001298 (id:1001298) Two worker nodes at T2_US_Caltech
State: assigned  |  Priority: less urgent  |  Opened: 2025-12-03 13:12 (122d ago)  |  Updated: 2025-12-13 14:41
Conversation (4 messages)
Hello, Your CMS Frontier squids are OK. However, I notice there is fail-over from two (and only two) of your worker nodes.
It's not much but the fail-over from these two nodes is consistent. You might want to check their local network connectivity.
So far today:

compute-11-16.ultralight.org  597  597  404.23 MB  03 Dec 2025 - 06:40
compute-11-15.ultralight.org  151  151  194.25 MB  03 Dec 2025 - 06:21

Best Regards,
Barry
Hi,
We have fixed the network issue on the two nodes mentioned above. Can you please check if the issue is resolved?
Thanks,
Sravya
Hi,
The fail-over from those two workers has stopped.
There is just a tiny amount of fail-over from another source:

gw.ultralight.org  16  16  6.53 MB  10 Dec 2025 - 02:17
There was more from that source yesterday.

Best Regards,
Barry
Sorry, this is the fail-over from ultralight to FNAL so far today:

gw.ultralight.org  3,532  3,532  1.79 GB  13 Dec 2025 - 08:06
compute-11-16.ultralight.org  49  49  22.40 MB  13 Dec 2025 - 07:59
compute-11-15.ultralight.org  21  21  16.82 MB  13 Dec 2025 - 05:33

So, the issue is not solved.

Best Regards,
Barry
CMS #1001023 (id:1001023) Request for GPU resource verification – HEPSCORE benchmarking tests
State: assigned  |  Priority: less urgent  |  Opened: 2025-10-31 16:36 (155d ago)  |  Updated: 2025-10-31 16:36
Conversation (1 message)
Dear Site Admin,

I’m contacting you because I’m trying to run some GPU benchmarking tests using HEPSCORE, but the pilot jobs submitted to your site appear to be idle.

Caltech used to provide GPU resources, so I’d like to check whether GPU
pilots are still accepted and if there have been any configuration
changes on your side.

Below are some example jobs currently affected:

[mmascher@vocms0206 ~]$ entry_q CMSHTPC_T2_US_Caltech_cit_gpu -all -af gridjobid
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7465573.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7466665.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7466907.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7467187.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7467559.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7467853.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7468160.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7468481.0
condor cit-gatekeeper.ultralight.org cit-gatekeeper.ultralight.org:9619 7468560.0

Could you please:

Confirm whether GPU resources are still available for CMS pilot jobs;

Check if there are any local issues preventing pilot job execution.

Thank you for your help!

Best regards,
Marco Mascheroni
CMS Submission Infrastructure Team
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM96%100%100%100%100%100%100%100%99%99%100%100%94%95%100%94%
HammerCloud100%100%100%100%100%100%99%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (2)

CMS tickets (2)
CMS #1001025 (id:1001025) Request for GPU resource verification – HEPSCORE benchmarking tests
State: assigned  |  Priority: less urgent  |  Opened: 2025-10-31 16:38 (155d ago)  |  Updated: 2025-11-05 12:20
Conversation (4 messages)
Dear Site Admin,

I’m contacting you because I’m trying to run some GPU benchmarking tests using HEPSCORE, but the pilot jobs submitted to your site appear to be held.

MIT used to provide GPU resources, so I’d like to check whether GPU
pilots are still accepted and if there have been any configuration
changes on your side.

Below are some example jobs currently affected:

[mmascher@vocms0206 ~]$ entry_q CMSHTPC_T2_US_MIT_ce01_gpu -all -af gridjobid
condor ce01.cmsaf.mit.edu ce01.cmsaf.mit.edu:9619 525886.0
condor ce01.cmsaf.mit.edu ce01.cmsaf.mit.edu:9619 533229.0

Could you please:

Confirm whether GPU resources are still available for CMS pilot jobs;

Check if there are any local issues preventing pilot job execution.

Thank you for your help!

Best regards,
Marco Mascheroni
CMS Submission Infrastructure Team
We do not have any GPU nodes available for grid submission.
--Max
Let's disable the MIT GPU entry then, Vaiva. Also, make sure the list of sites in the probing cronjob is updated!
Hello,

Entries have been disabled and removed from the list.
State: in progress  |  Priority: less urgent  |  Opened: 2025-02-19 15:15 (409d ago)  |  Updated: 2025-03-27 16:36
Conversation (2 messages)
Please let us know when we can start production jobs again
Kind regards,

Jen Adelman-McCarthy

************************************************************************************
This is an automated mail. When replying don't change the subject line!

************************************************************************************
Ticket Link: https://helpdesk.ggus.eu/#ticket/zoom/2391
We are now running production jobs. Max reports that he isn't seeing tickets, so let's see if he sees this update before closing the ticket.
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%100%100%100%98%92%99%100%100%100%100%100%100%100%100%
HammerCloud100%98%99%100%100%100%97%100%99%100%100%99%100%100%99%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #683047 (id:3181) Corrupt files in T2_US_Nebraska
State: assigned  |  Priority: less urgent  |  Opened: 2025-04-15 10:12 (354d ago)  |  Updated: 2025-04-15 17:53
Conversation (2 messages)
Dear site admins,

We have spotted more than 60 corrupt replicas on Nebraska disk. Please find the list in the attachment. All the files I've checked were actually created at Nebraska. You can verify the correct checksum on DAS (look for adler32) [1]. Can you please help us understand why such file corruptions occurred?

Note that we are spotting these corrupt replicas when they are transferred. There could be more that we are not aware of. It would be useful to scan the disk and find more corrupt files if there are any. Would this be possible?

Best,
Hasan

[1] https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=file%3D%2Fstore%2Fmc%2FRunIISummer20UL17NanoAODv9%2FTTZ-ZToLightJets-TTToLplusNu_TuneCP5_13TeV-amcatnlo-pythia8%2FNANOAODSIM%2F106X_mc2017_realistic_v9-v2%2F2820000%2FAD28DAA2-14BA-6E4F-95A5-7241DAA45262.root
We're unclear as to how to walk our filesystem looking for corrupt files. Is that a script we'll have to produce in-house? Does CMS have any tools for that?

As to the files, we'll just have to put in a ticket to have them invalidated. If they were created here, it's likely they're lost. Looking at some file timestamps, the creation times do correspond to power outages at our data center, so that's likely related and it's not an ongoing systemic problem.

Carl Lundstedt
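On Carl's question about walking the filesystem: the ticket names no tool, but an in-house scan is straightforward to sketch. The snippet below is a hedged sketch, not a CMS tool — it assumes you can build a dict of absolute path to expected adler32 (e.g. from the DAS values mentioned above); all names are hypothetical:

```python
import os
import zlib

def adler32_of(path, chunk=1 << 20):
    """Streaming adler32 of a file, as the 8-hex-digit string DAS displays."""
    value = 1  # adler32 seed value
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk)
            if not block:
                break
            value = zlib.adler32(block, value)
    return format(value & 0xFFFFFFFF, "08x")

def scan(root, expected):
    """Yield (path, got, want) for files whose on-disk checksum mismatches."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            want = expected.get(path)
            if want is None:
                continue  # no catalog entry for this file
            got = adler32_of(path)
            if got != want.lower():
                yield path, got, want
```

Run over the storage namespace with `expected` filled from catalog checksums; anything yielded is a corruption candidate worth re-checking before requesting invalidation.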
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%100%100%65%10%100%96%100%100%100%96%100%100%100%96%
HammerCloud99%96%98%99%79%44%99%100%100%100%100%100%100%100%100%100%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #1001159 (id:1001159) UCSD ceph upgrade
State: assigned  |  Priority: less urgent  |  Opened: 2025-11-17 19:10 (138d ago)  |  Updated: 2025-11-19 17:55
Conversation (2 messages)
Hello,

We need to carry out a Ceph upgrade in our cluster and for that we will need to remove as much data as possible.
Our plan is to back up only the unique data, so we were thinking of using the script we use to fetch unique data from sites for the data challenge: https://gitlab.cern.ch/wlcg-doma/dc_inject/-/blob/master/get_unique_cms.sql?ref_type=heads

Could you please confirm we can use the above script to identify the unique data at UCSD?
If you have a better way, please let us know.

Cheers,
Diego Davila
Tracking here: https://its.cern.ch/jira/browse/CMSTRANSF-1271
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM84%77%100%100%99%96%100%100%100%100%100%91%100%100%100%100%
HammerCloud100%78%100%99%98%99%99%100%100%100%100%100%100%100%100%99%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (2)

CMS tickets (2)
CMS #1002045 (id:1002045) JobSubmit SAM test failure at T2_US_Vanderbilt
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-10 13:23 (25d ago)  |  Updated: 2026-03-26 12:43
Conversation (5 messages)
Hello Vanderbilt admins.
Since Sunday (Mar 8), your compute endpoint "ce5-vanderbilt.sites.opensciencegrid.org" has been failing SAM "JobSubmit" tests with X509 and tokens [1]. The log files show a "GridResourceDownEvent" message [2], but the ping test seems normal [3]. Could you please take a look and check this server's status/services?
Best Regards,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T2_US_Vanderbilt
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.sam.CONDOR-JobSubmit-/cms-ce-token&var-dst_hostname=ce5-vanderbilt.sites.opensciencegrid.org&var-timestamp=1773140724013
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.sam.CONDOR-JobSubmit-/cms/Role=lcgadmin&var-dst_hostname=ce5-vanderbilt.sites.opensciencegrid.org&var-timestamp=1773139840952
[3]
[crungphi@lxplus802 ~]$ ping -c 5 ce5-vanderbilt.sites.opensciencegrid.org

PING ce5-vanderbilt.sites.opensciencegrid.org (129.59.197.77) 56(84) bytes of data.
64 bytes from ce5-vanderbilt.sites.opensciencegrid.org (129.59.197.77): icmp_seq=1 ttl=47 time=114 ms
64 bytes from ce5-vanderbilt.sites.opensciencegrid.org (129.59.197.77): icmp_seq=2 ttl=47 time=114 ms
64 bytes from ce5-vanderbilt.sites.opensciencegrid.org (129.59.197.77): icmp_seq=3 ttl=47 time=114 ms
64 bytes from ce5-vanderbilt.sites.opensciencegrid.org (129.59.197.77): icmp_seq=4 ttl=47 time=115 ms
64 bytes from ce5-vanderbilt.sites.opensciencegrid.org (129.59.197.77): icmp_seq=5 ttl=47 time=115 ms

--- ce5-vanderbilt.sites.opensciencegrid.org ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 114.303/114.439/114.593/0.105 ms
Hi Noy,
Thank you for the alert and information. We are currently investigating a general failure of the condor-ce service on the host and are unable to fix this by simply rebooting the host or restarting the service:
[root@ce5-vanderbilt condor-ce]# condor_ce_q

-- Failed to fetch ads from: <129.59.197.77:2059?alias=ce5-vanderbilt.sites.opensciencegrid.org> : ce5-vanderbilt.sites.opensciencegrid.org

SECMAN:2007:Failed to end classad message.
Hello Eric. The issue resurfaced yesterday (Mar 23). Could you please take a look?
Thank you,
Noy
Hi Noy,
For some reason, scitoken authentication is intermittently failing. I'm trying to see why, but nothing is jumping out at me (in particular, why is it only one host?).

More soon
Andrew
Hello Andrew. Current log files show an "Error connecting to schedd ce5-vanderbilt.sites.opensciencegrid.org: SECMAN:2010:Received "DENIED" from server for user lcgadmin@users.htcondor.org using method SCITOKENS." message [1]. Could you please take a look and check the authentication file/configuration?
Thank you,
Noy
[1]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.sam.CONDOR-JobSubmit-/cms/Role=lcgadmin&var-dst_hostname=ce5-vanderbilt.sites.opensciencegrid.org&var-timestamp=1774527898855
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.sam.CONDOR-JobSubmit-/cms-ce-token&var-dst_hostname=ce5-vanderbilt.sites.opensciencegrid.org&var-timestamp=1774527640362
CMS #682562 (id:2695) CMS Frontier Squids at T2_US_Vanderbilt
State: in progress  |  Priority: less urgent  |  Opened: 2025-03-09 20:47 (390d ago)  |  Updated: 2025-03-09 20:48
Conversation (1 message)
Hello, I am re-opening in the new system a ticket that was never responded to under the old system.

You currently use 4 squids for CMS Frontier:
(proxyurl=http://vm-cms-squid1.vampire:3128)
(proxyurl=http://vm-cms-squid2.vampire:3128)
(proxyurl=http://vm-infr-squid4.vampire:3128)
(proxyurl=http://vm-infr-squid5.vampire:3128)

External addresses:
squid2.accre.vanderbilt.edu
squid1.accre.vanderbilt.edu
squid5.accre.vanderbilt.edu
squid4.accre.vanderbilt.edu

Only the first two are registered in the OSG topology.
Could you please register the two newer ones?

In addition, the two that are monitored are running a very old version of squid (frontier-squid-4.15-2.1.osg36.el7)
and should be updated.

Best Regards,
Barry

************************************************************************************
This is an automated mail. When replying don't change the subject line!

************************************************************************************
Ticket Link: https://helpdesk.ggus.eu/#ticket/zoom/2695
Tier-3
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%
HammerCloud????????????????
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (3)

CMS tickets (2)
CMS #681466 (id:1567) SRMv2/GSIftp/gridFTP phase out at T3_BG_UNI_SOFIA
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:36 (430d ago)  |  Updated: 2025-03-27 16:13
Conversation (5 messages)
GGUS ID: 166991
Last modifier: Radoslava Hristova
Date: 2024-05-30 10:03:26

Status: in progress
Responsible Unit: NGI_BG
Public Diary:
Hi Petter,
thanks for the info!

Cheers,
Katarina
Internal Diary:
Added attachment arcce_clean.out
https://ggus.eu/index.php?mode=download&attid=ATT119871
GGUS ID: 166991
Last modifier: Chan-anun Rungphitakchai
Date: 2024-05-29 20:34:09
Subject: SRMv2/GSIftp/gridFTP phase out at T3_BG_UNI_SOFIA
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_BG
Issue type: File Access
Description:
Hello SOFIA admin,
We are phasing out SRMv2/GSIftp/gridFTP in CMS. (FTS will drop support for GSI/gridFTP in early May.) Looking at T3_BG_UNI_SOFIA SITECONF, you have an SRMv2 protocol defined in storage.json for your RSE and use the SRMv2 protocol in the stage-out.
In detail, in JobConfig/site-local-config.xml, line 8 should be changed to

and in storage.json lines 21 to 24 should be removed.
Please check the above is what you want, typos, etc. and if you agree update SITECONF.
Thank you,
Noy
GGUS ID: 166991
Last modifier: Chan-anun Rungphitakchai
Date: 2024-09-26 19:31:43

Public Diary:
Any update? Thanks -- Noy
GGUS ID: 166991
Last modifier: Chan-anun Rungphitakchai
Date: 2025-01-09 17:17:23

Public Diary:
Just reminding -- Thank you, Noy
Any update for this ticket -- Thank you, Noy
CMS #681467 (id:1568) insufficient CVMFS cache quota on worker nodes of T3_BG_UNI_SOFIA
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:45 (430d ago)  |  Updated: 2025-01-30 13:30
Conversation (3 messages)
GGUS ID: 163218
Last modifier: Stephan Lammel
Date: 2023-09-01 20:00:56
Subject: insufficient CVMFS cache quota on worker nodes of T3_BG_UNI_SOFIA
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_BG
Issue type: Middleware
Description:
Dear Site Admins,
CMS is relying more and using more files from CVMFS. There was a significant step up going to Singularity. We are now also using Monte Carlo gridpack files from CVMFS and more CMSSW versions simultaneously due to Run 2 data analysis, Run 3 data processing, and HL-LHC simulations.
We noticed increased cache turnover, slowness, and even I/O errors on worker nodes with very small CVMFS caches. We discussed this in July in the Facilities and Services meeting. CMS is recommending an increased cache (compared to the previous 50 GB) and is asking sites to provide at least 20 GB of cache quota/25 GB of cache space.
Your site has worker nodes with less than 20 GB of cache quota. Could you please plan on increasing the quota/space? We understand that this may require re-sizing disk partitions and that you may want to combine this with an upgrade/re-install. Please let us know in case hardware limits do not allow you to increase the CVMFS cache to the minimum value. Please take a look at the cache sizes we recommend for different worker nodes, in case you can dedicate more than the minimum, and for future planning/acquisitions.
https://twiki.cern.ch/twiki/bin/view/CMSPublic/FacilitiesServicesDocumentation#CVMFS
Thanks,
cheers, Stephan

P.S.: if you are using containers, we may not have the right CVMFS cache quota/space for your site. Please let us know. We would appreciate if you could bind the CVMFS config area into the container. Several sites do this already.
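The 20 GB quota check described above can be automated on a worker node. A hedged sketch: CVMFS_QUOTA_LIMIT in /etc/cvmfs/default.local is expressed in megabytes; the naive KEY=VALUE parsing and the sample proxy value below are assumptions (real nodes may layer several config files, which `cvmfs_config showconfig` would resolve properly):

```python
# Check a CVMFS client config against the 20 GB minimum cache quota.
MIN_QUOTA_MB = 20 * 1024  # 20 GB minimum recommended in the ticket above

def quota_limit_mb(config_text):
    """Extract CVMFS_QUOTA_LIMIT (in MB) from default.local-style text."""
    limit = None
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("CVMFS_QUOTA_LIMIT="):
            value = line.split("=", 1)[1].strip().strip("'\"")
            limit = int(value)  # last assignment wins, as in shell sourcing
    return limit

# Hypothetical sample config; the proxy hostname is a placeholder.
sample = "CVMFS_HTTP_PROXY='http://squid.example.org:3128'\nCVMFS_QUOTA_LIMIT=25000\n"
limit = quota_limit_mb(sample)
print(limit is not None and limit >= MIN_QUOTA_MB)  # True for 25000 MB
```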
GGUS ID: 163218
Last modifier: Radoslava Hristova
Date: 2024-04-21 14:52:00

Status: in progress
Responsible Unit: NGI_BG
GGUS ID: 163218
Last modifier: Stephan Lammel
Date: 2025-01-23 14:30:27

Status: in progress
Responsible Unit: NGI_BG
Public Diary:
Yes, this is still relevant, although the CE is currently down. - Stephan
WLCG tickets (1)
WLCG #681469 (id:1570) Upgrade to a supported HTCondor version and enable SSL authentication (BG05-SUGrid)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 09:45 (430d ago)  |  Updated: 2025-01-30 13:30
Conversation (3 messages)
GGUS ID: 163970
Last modifier: Alessandro Paolini
Date: 2023-11-03 11:25:29
Subject: Upgrade to a supported HTCondor version and enable SSL authentication (BG05-SUGrid)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_BG
Issue type: Middleware
Description:
Dear site admins,

with this ticket we would like to follow-up the upgrade to a supported version of HTCondorCE and the migration from voms-based authentication with X509 certificates to AAI tokens for accessing the HTCondorCE endpoints.

The HTCondor team set-up an upgrade procedure to help sites and VOs with the migration from X509 personal certificates to tokens.
Essentially, an intermediate step was created where plain SSL authentication can be used to authenticate a client's proxy, in addition to the GSI one or the token one:
- https://confluence.egi.eu/x/EYAtDQ

In summary, the steps are:

- update to HTCondor 9.0.19
- enable the SSL authz (with priority over GSI)
- map the users' DNs
- test the SSL authz successfully
- update to latest HTCondor 10.x

You can find the HTCondor 9.0.19 version in WLCG repository for the time being, as explained in the instructions.

Please also note the usage in the last step of the HTCondor Feature Channel (https://htcondor.org/htcondor/release-highlights/index.html#feature-channel), since this is the one supporting the EGI Check-in plugin from 10.4.0.
In this way sites can accept clients' proxies and tokens at the same time while waiting for the supported VOs to move completely to tokens.
You can find the latest HTCondor 10.x version in the HTCondor Feature Channel repository.

Please note that after the upgrade to HTCondor 10 version, you need to install and configure the EGI Check-in plugin in order to be compliant with the EGI tokens:
https://github.com/EGI-Federation/check-in-validator-plugin

Please get in contact with your supported communities to properly map the users' DNs to local accounts to ensure also the access via X509 personal certificates.

Concerning the ops VO, please map at least the following certificates:
- EGI Monitoring Service:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr"
"/DC=EU/DC=EGI/C=HR/O=Robots/O=SRCE/CN=Robot:argo-egi@cro-ngi.hr"

- EGI Security monitoring:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-secmon@grnet.gr"

Please also configure the Accounting settings properly on the HTCondor 10 installation, as explained in the instructions.

Thanks for your collaboration,
EGI Operations
GGUS ID: 163970
Last modifier: Radoslava Hristova
Date: 2023-11-26 15:48:15

Status: in progress
Responsible Unit: NGI_BG
Public Diary:
Dear all,

I solved the problems I had with my grid certificate. In the GOCDB a scheduled downtime has been announced for the BG05-SUGrid site.

Best regards,
Radoslava Hristova
Internal Diary:

----------- e-mail with large body ------
added in total as mailbody.2024-12-16_16.24.35.txt

------------ e-mail with large body ------
GGUS ID: 163970
Last modifier: Alessandro Paolini
Date: 2025-01-23 12:34:26

Status: in progress
Responsible Unit: NGI_BG
Public Diary:
yes, the migration hasn't been done yet.
please upgrade to 23.0.x version:
https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv23EL9
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

No open GGUS tickets

-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM— no data —
HammerCloud— no data —
FTS— no data —

No open GGUS tickets

-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

No open GGUS tickets

-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (3)

CMS tickets (3)
CMS #681474 (id:1575) IAM Token for dCache at T3_CH_PSI
State: in progress  |  Priority: urgent  |  Opened: 2025-01-29 09:45 (430d ago)  |  Updated: 2025-07-31 23:19
Conversation (21 messages)
GGUS ID: 164577
Last modifier: Derek Feichtinger
Date: 2023-12-13 15:19:27

Status: in progress
Responsible Unit: NGI_CH
Public Diary:
Hi Chan-anun,
thanks for this information.

I will need to schedule a dcache upgrade from 7.2 to 8.2. I will try to still schedule that before Christmas.

Best regards,
Derek


Internal Diary:

----------- e-mail with large body ------
added in total as mailbody.2024-12-16_16.24.35.txt

------------ e-mail with large body ------
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2023-12-06 17:59:53
Subject: IAM Token for dCache at T3_CH_PSI
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_CH
Issue type: Storage Systems
Description:
Hello PSI admins,
IAM token support for dCache is ready [1]. You could consider upgrading your dCache door node and configuring IAM token access. First, you need to ensure your dCache version is 8.2.36 (the recommended version). I attach the documents and wiki pages [2]. Please take a look.
Thank you and have a nice day,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMS/IAMTokens
[2]
https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheXRootD
GGUS ID: 164577
Last modifier: Stephan Lammel
Date: 2024-01-11 15:02:29

Public Diary:
Hallo Derek,
we set up an IAM client for site admins to get IAM-issued tokens themselves for
testing, but this uncovered a security issue which is not yet resolved.
You can use the SAM tests and instantly trigger a new check/see the results
on the ETF page. We can add your test setup to SAM/ETF with the non-production
flag so it's not included in site status/availability/reliability. You would
need to put the three files of the SAM tests dataset on the instance. If you let
me know the endpoint and the equivalent /store/mc/SAM and /store/temp/user paths, we
will add it.
Thanks,
cheers, Stephan
GGUS ID: 164577
Last modifier: GGUS SYSTEM
Date: 2024-01-18 15:08:26

Internal Diary:
Sent 1st reminder to ticket submitter (rungphitakch@wisc.edu) requesting input.
GGUS ID: 164577
Last modifier: GGUS SYSTEM
Date: 2024-01-25 15:08:39

Internal Diary:
Sent 2nd reminder to ticket submitter (rungphitakch@wisc.edu) requesting input.
GGUS ID: 164577
Last modifier: Stephan Lammel
Date: 2024-01-25 16:36:53

Public Diary:
Hallo Derek,
sorry, I forgot about this ticket being on hold. We sorted out the security
issue with the tokens for site admin tests last week.
Please take a look at the "Acquiring an IAM-issued Token for Testing" section
on https://twiki.cern.ch/twiki/bin/view/CMS/IAMTokens
I'll send you the two pieces of client information in an email.
Thanks,
cheers, Stephan
Internal Diary:
Sent 2nd reminder to ticket submitter (rungphitakch@wisc.edu) requesting input.
GGUS ID: 164577
Last modifier: Derek Feichtinger
Date: 2024-01-25 15:32:36

Public Diary:
Hi Stephan and colleagues,

Is it ok if I keep this pending for the Tier-3 until CMS can again offer a service for site admins to produce IAM tokens ourselves (once the security issue is fixed)?

My current test instance is within the firewall on a few VMs and I use my own local grid CA... I would have to get full certificates for this temporary arrangement, and we have some issues this year because our CA company went belly-up (so we are getting intermediate ones from the Netherlands for now... a bit of a conundrum).
If we knew how exactly these tokens are composed (their inner structure), we could maybe even generate some test ones ourselves... I was at CERN on Monday and talked to Andi Peters from EOS (an old colleague), and he made a comment about it being not too complicated.

Our T3 and also the T2 are using ACLs, and I would like to do deeper testing than just triggering SAM tests (e.g. testing different test paths first, testing ACL settings, etc.).

Cheers,
Derek


Internal Diary:
Sent 2nd reminder to ticket submitter (rungphitakch@wisc.edu) requesting input.
GGUS ID: 164577
Last modifier: Derek Feichtinger
Date: 2024-02-03 15:32:35

Public Diary:
Hi Stephan

Thanks for sending the ID/secret for token generation. I was able to get WebDAV token authz working with both my test and
production systems. XRootD still needs some looking into; I have had no time yet.

Please note that the info on https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup is too restrictive. Using
the following -authz-id options allows us to continue using the dCache authzdb login that we use with the x509 cert
mapping of individual Swiss users to individual user accounts.

gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any https://t3se01.psi.ch:2880 roots://t3dcachedb03.psi.ch:1094

gplazma.oidc.provider!cms=https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="uid:10001 gid:10001 username:cmsprod" -prefix=/pnfs/psi.ch/cms/trivcat -suppress=audience

Could the official documentation add some lines on how to do simple manual or scripted tests? That is where I lost most of my time. But when I
then went to the SAM code in https://gitlab.cern.ch/etf/cmssam/-/blob/master/SiteTests/SE/se_webdav.py and, from the error messages,
realized that it was as simple as the following, things began to fall into place.

X509_USER_PROXY="" BEARER_TOKEN=$(cat rwtest.tkn) gfal-copy /t3home/feichtinger/testfile-df davs://t3se01.psi.ch:2880/store/temp/sitetest/test1

The gfal man pages regrettably also do not mention anything regarding tokens or the env variable BEARER_TOKEN, but it works with my gfal2-util-scripts-1.8.0-1.el7.noarch.

For dCache admins there is nice debugging information in the dCache "namespace" domain logs, which describes what a token allowed, what was actually tried, and why it failed.

e.g.
03 Feb 2024 15:28:15 (PnfsManager) [door:WebDAV-t3se01@webdav-t3se01Domain:AAYQewoGr0A WebDAV-t3se01 PnfsGetFileAttributes] Error while retrieving file attributes: Restriction CompositeRestriction[Unrestricted, MultiTargetedRestriction[Authorisation{allowing [LIST, DOWNLOAD, READ_METADATA] on /pnfs/psi.ch/cms/trivcat/store/mc/SAM}, Authorisation{allowing [LIST, DOWNLOAD, MANAGE, UPLOAD, DELETE, READ_METADATA, UPDATE_METADATA] on /pnfs/psi.ch/cms/trivcat/store/temp/sitetest}, Authorisation{allowing [MANAGE, UPLOAD, DELETE, READ_METADATA, UPDATE_METADATA] on /upload}]] denied activity READ_METADATA on /pnfs/psi.ch/cms/trivcat/store/temp/user/dftest1

The nicest thing was that as soon as I had built confidence with the test system, configuring the prod system could be done with no downtime. I hope that having only WebDAV is ok for now; I will look after XRootD later on.

Thanks and cheers,
Derek
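Editor's note: the claims a bearer token actually grants can be inspected locally before running gfal tests like the one above, by base64url-decoding the JWT payload. A minimal sketch (no signature verification; the token string and claims below are hypothetical, not taken from this ticket):

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT so claims
    such as 'aud' and 'scope' can be inspected before testing."""
    payload_b64 = token.split(".")[1]
    # JWT segments use base64url without padding; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a hypothetical token carrying only the claims of interest
# (header and signature are placeholders).
header = base64.urlsafe_b64encode(b'{"alg":"RS256"}').rstrip(b"=").decode()
claims = {"aud": "https://wlcg.cern.ch/jwt/v1/any",
          "scope": "storage.read:/ storage.create:/store/temp"}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
token = f"{header}.{payload}.sig"

print(decode_jwt_payload(token)["aud"])  # prints https://wlcg.cern.ch/jwt/v1/any
```

With a real token file one would read the string from e.g. rwtest.tkn and check that 'aud' and the storage scopes match the endpoint before blaming the server side.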

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: GGUS SYSTEM
Date: 2024-02-01 15:12:06

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Stephan Lammel
Date: 2024-02-03 18:36:08

Public Diary:
Hallo Derek,
thanks for looking into this. I cc Christoph regarding the suggested changes
to the recommended dCache config.
Your audience setting should have the WLCG "any" URL and hostnames for the
WebDAV (and the XRootD endpoint if it allows writing), not URIs for the endpoint.
I know this is confusing, but it is what WLCG/FTS/Rucio decided on.
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any t3se01.psi.ch t3dcachedb03.psi.ch
I added the removal of the proxy, the BEARER_TOKEN setting, and the gfal example to the
IAM-issued token twiki "acquiring token for testing" section.
The XRootD part should be straightforward. Let us know when ready.
Thanks,
cheers, Stephan
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2024-03-07 19:36:53

Status: in progress
Responsible Unit: NGI_CH
Public Diary:
Hello Derek,
Thank you for your work on enabling WebDAV tokens. Could you please update the configuration for XRootD [1]?
Cheers,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDAndTokens
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2024-05-30 19:38:03

Public Diary:
Do you have any update on XRootD token support? -- Noy
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Derek Feichtinger
Date: 2024-06-03 08:29:42

Public Diary:
Dear Noy,

I was not yet able to update our dCache to version 9.2. I have started updating my test installation. If that is successful, I may be able to do the production version next week or the week after next (I will also need to get CA certs for all the pools now).

Best regards,

Derek

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Derek Feichtinger
Date: 2024-07-02 05:46:52

Public Diary:
Hello Noy,
sorry, we were not able to carry out the upgrade during June; we're under a lot of additional load in this period. The next slot where I can schedule the update would be the week of July 15 or July 22.
I hope that the webdav based doors will be sufficient until then.

Best regards,
Derek

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2024-07-01 20:53:04

Public Diary:
Hello Derek.
Could you please give an update on the dCache upgrade and XRootD token support progress?
Thank you,
Noy
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2024-08-27 21:38:45

Public Diary:
Hello Derek,
Do you have any update on the upgrade and on implementing IAM token support?
Thank you,
Noy

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2024-11-05 17:16:47

Public Diary:
Hello Derek,
There is a new issuer (https://cms-auth.cern.ch). We are now asking sites to update their SciTokens configuration to support the new CMS issuer. Could you please add a new line in /etc/dcache/layouts/.conf under the gPlazma section:
gplazma.oidc.provider!cmsnew=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="uid:10001 gid:10001 username:cmsprod" -prefix=/pnfs/psi.ch/cms/trivcat -suppress=audience
and add the following lines to /etc/xrootd/scitokens.conf so that the XRootD service supports the new issuer:

[Issuer CMS]
issuer = https://cms-auth.cern.ch/
base_path = /
map_subject = False
default_user = cmsprod
You need to modify default_user to match the setting used for the old issuer. Please verify the new config and deploy it on your server.
Cheers,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup
https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDAndTokens#5_The_scitokens_conf_configurati
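Editor's note: the [Issuer CMS] fragment quoted above happens to parse as INI syntax, so a drafted scitokens.conf can be sanity-checked locally with Python's configparser before deploying. This is only a syntax/value check under that assumption, not a full SciTokens validation:

```python
import configparser

# The scitokens.conf fragment as quoted in the ticket above.
SNIPPET = """\
[Issuer CMS]
issuer = https://cms-auth.cern.ch/
base_path = /
map_subject = False
default_user = cmsprod
"""

cfg = configparser.ConfigParser()
cfg.read_string(SNIPPET)
issuer = cfg["Issuer CMS"]

# Basic checks: issuer URL, boolean flag, and the mapped user.
assert issuer["issuer"] == "https://cms-auth.cern.ch/"
assert cfg.getboolean("Issuer CMS", "map_subject") is False
print(issuer["default_user"])  # prints cmsprod
```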
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2025-01-22 06:22:25

Public Diary:
Hello Derek,
Could you please provide an update on support for the new k8s issuer (https://cms-auth.cern.ch) in XRootD? After checking [1], your XRootD endpoint does not support the new k8s issuer with either X509 or IAM tokens. Could you please update the configuration for k8s support this week?
Cheers,
Noy
[1]
[crungphi@lxplus807 ~]$ xrdcp -f root://t3se01.psi.ch:1095//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root /dev/null
[0B/0B][100%][==================================================][0B/s]
Run: [FATAL] Auth failed: No protocols left to try (source)

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 164577
Last modifier: Chan-anun Rungphitakchai
Date: 2025-01-09 17:19:38

Public Diary:
Any update? Thank you -- Noy
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
Any plan or update? -- Noy
Dear PSI team,
where do things stand with enabling IAM token support also on your site redirector, t3se01.psi.ch?
You are running a very recent XRootD version, so this should be a 10-minute action. Would it be possible to do this next week?
Thanks,
cheers, Stephan
CMS #681478 (id:1579) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_CH_PSI
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 09:46 (430d ago)  |  Updated: 2025-03-11 14:09
Conversation (11 messages)
GGUS ID: 168939
Last modifier: Jakrapop Akaranee
Date: 2024-11-07 09:49:08
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_CH_PSI
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_CH
Issue type: CMS_SAM tests
Description:
Dear PSI Site Administrators,

We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual-stack networking, specifically for the following endpoint:

Both WebDAV [1] and XRootD [2] services at t3se01.psi.ch

Could you please review dual-stack support on your storage element? Thank you for your assistance.

Best regards,
Jakrapop

---------
[1]https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dstorm.mib.infn.it%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
[2]https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dstorm.mib.infn.it%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
GGUS ID: 168939
Last modifier: GGUS SYSTEM
Date: 2024-12-10 08:21:14

Public Diary:
Dear Jakrapop,

Sorry, had been on a longer vacation.
Regrettably PSI does not provide IPv6 routing. How critical is this for a Tier-3 like ours?

Best regards,
Derek

Internal Diary:
Sent 1st reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 168939
Last modifier: Jakrapop Akaranee
Date: 2024-12-16 14:14:27

Status: in progress
Responsible Unit: NGI_CH
Public Diary:
Dear Derek,

Thank you for your reply and for letting me know about the current status of IPv6 routing at ISP (I guess you mean Internet Service Provider). Could you please share the plan or any considerations regarding dual-stack support for your storage element? This will help us better understand the situation and the overall progress of the sites.

Looking forward to your response.

Best regards,
Jakrapop

Internal Diary:
Sent 1st reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 168939
Last modifier: Derek Feichtinger
Date: 2024-12-16 14:38:26

Public Diary:
Dear Jakrapop,

PSI is the research institution where the Tier-3 is hosted. I need to discuss with the network team of central IT whether this can be done sometime next year. I will update the ticket; I expect to give an answer on the timeline before the end of the year.

Best regards,
Derek

Internal Diary:
Sent 1st reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 168939
Last modifier: Stephan Lammel
Date: 2025-01-06 15:02:56

Public Diary:
Hallo Derek,
no problem for right now, as all other storage systems of CMS
still support IPv4. But we have the first IPv6-only worker nodes. So,
T3_CH_PSI will not be reachable from them in the xrootd/AAA
federation. This should not be a big deal. But there are sites
with limited IPv4 addresses (that use NAT right now) and globally
routable IPv6 addresses. At some point they will want to shut down
the NAT part, and then PSI could no longer transfer data from those
sites.
Thanks,
cheers, Stephan
Internal Diary:
Sent 1st reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 168939
Last modifier: GGUS SYSTEM
Date: 2025-01-13 16:19:25

Internal Diary:
Sent 1st reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 168939
Last modifier: GGUS SYSTEM
Date: 2025-01-21 08:19:32

Internal Diary:
Sent 2nd reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 168939
Last modifier: Jakrapop Akaranee
Date: 2025-01-21 10:28:32

Status: in progress
Responsible Unit: NGI_CH
Public Diary:
Dear Derek,

Thank you for your updates and for discussing the IPv6 routing support status with the PSI central IT team. As part of the CMS operations team, we understand that enabling dual-stack (IPv4 and IPv6) support may require coordination and resources, and we appreciate your efforts to investigate the matter.

I would like to follow up on the future plans for IPv6 support at PSI. Specifically, are there any long-term plans to enable IPv6 routing for the storage element?

Understanding your plans or constraints will help us assess the timeline and planning for all CMS sites.
Thank you for your continued support and collaboration.
I look forward to your response.

PS. I have placed this ticket in "on hold" status to keep track of the progress and avoid creating duplicate tickets.

Regards,
Jakrapop
---------------

ETF new instance metric status for t3se01.psi.ch

https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dt3se01.psi.ch%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice

https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dt3se01.psi.ch%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice

Internal Diary:
Sent 2nd reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
GGUS ID: 168939
Last modifier: Jakrapop Akaranee
Date: 2025-01-22 14:56:02

Status: on hold
Responsible Unit: NGI_CH
Public Diary:
I have put this ticket on hold to keep track of the progress and avoid creating duplicate tickets.

Regards,
Jakrapop
Internal Diary:
Sent 2nd reminder to ticket submitter (akaranee.jakrapop@cern.ch) requesting input.
Dear Jakrapop.

I am only now getting my new GGUS account working; we had a severe manpower shortage over the last months, sorry.
The answer from the PSI networkers to my request was:

"IPv4 should always be reachable and will be supported for many years,
even as the industry pushes towards broader adoption of IPv6. There are
mechanisms to translate between IPv4 and IPv6. We do not offer IPv6 (although we tried to convince people internally to use it). We removed all IPv6 configurations from the network."

So, I'm sorry, but I think this will not be possible for us anytime soon. Since we're only a Tier-3, I think the impact will not be too extreme. In the worst case we may have to route data first to our Tier-2 and then on to the PSI Tier-3.

Best regards,
Derek
Dear Derek,

Thank you for your update. As Stephan mentioned earlier, there is no immediate issue since CMS storage still supports IPv4. However, with IPv6-only worker nodes now in use, T3_CH_PSI won’t be reachable via xrootd/AAA. While CMS doesn’t have a definite plan yet, preparing for IPv6 compatibility could help ensure smooth connectivity as more sites transition in the future. We understand that IPv6 support won’t be available for now. Just to keep it on the radar, feel free to reach out if there are any plans or changes in the future regarding IPv6 compatibility. I will keep this ticket on hold.

Best regards,
Jakrapop
CMS #681479 (id:1580) XRootD access with new VOMS service extension failing at T3_CH_PSI
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 09:46 (430d ago)  |  Updated: 2025-03-11 09:07
Conversation (1 message)
GGUS ID: 168776
Last modifier: Stephan Lammel
Date: 2024-10-23 20:46:12
Subject: XRootD access with new VOMS service extension failing at T3_CH_PSI
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_CH
Issue type: Middleware
Description:
Dear PSI admins,
WLCG setup a new VOMS service for CMS at voms-cms-auth.cern.ch.
It looks like your XRootD endpoint at t3se01.psi.ch
does not accept access with a CMS VOMS proxy from that service.
Could you please take a look and add support for the service?
(Most likely adding an LSC file at /etc/grid-security/vomsdir/cms/voms-cms-auth.cern.ch.lsc via the latest VOMS RPM.)
Thanks,
cheers, Stephan

https://etf-cms-preprod.cern.ch/etf/check_mk/view.py?host=t3se01.psi.ch&service=org.cms.SE-XRootD-4crt-read&site=etf&view_name=service
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM99%100%100%100%100%100%100%100%100%100%100%100%97%97%67%99%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

No open GGUS tickets

-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%99%99%100%100%100%100%100%100%100%100%100%85%100%100%100%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #681899 (id:2000) TPC WebDAV protocol deployment T3_FR_IPNL
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 16:08 (430d ago)  |  Updated: 2025-10-21 07:45
Conversation (23 messages)
GGUS ID: 154859
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2021-11-08 18:06:31
Subject: TPC WebDAV protocol deployment T3_FR_IPNL
Ticket Type: USER
CC: cms-comp-ops-transfer-team@cern.ch;
Status: assigned
Responsible Unit: NGI_FRANCE
Issue type: CMS_Data Transfers
Description:
Dear site Admin,
The WLCG-TPC group[1] is trying to get all WLCG sites to enable an endpoint that supports the WebDAV[2] transfer protocol in order to begin the replacement of GridFTP.
The majority of Tier-1s and Tier-2s have already enabled davs as the primary production protocol, and eventually no gsiftp transfers will be done from/to Tier-1s and Tier-2s.
We strongly recommend migrating to WebDAV as soon as possible. We have created a guide [3] to help with the installation and testing process.
Could you tell us if you currently have an endpoint that supports this protocol, if you are working on it, or if you plan to work on it in the near future?

Thanks in advance.
Best regards,
Felipe Gomez
on behalf of CMS Transfer Team
[1] https://twiki.cern.ch/twiki/bin/view/LCG/ThirdPartyCopy
[2] https://twiki.cern.ch/twiki/bin/view/LCG/HttpTpc
[3] https://twiki.cern.ch/twiki/bin/view/LCG/CMSWebDAVProtocolInstallationAndTesting
GGUS ID: 154859
Last modifier: Pau Cutrina Vilalta
Date: 2021-11-17 09:49:40

Public Diary:
Hello Tibor, Denis,
Are there any updates? Have you had a chance to take a look at it?
Feel free to ask any questions you may have.
Thanks,
Pau
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Pau Cutrina Vilalta
Date: 2021-11-17 09:49:52

Public Diary:
Pau Cutrina Vilalta has subscribed to this ticket.

Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-02-03 20:03:37

Public Diary:
Priority has been changed from urgent to very urgent.
Dear site admin,
Any news regarding WebDAV protocol?

Best,
Felipe

Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-02-09 16:45:24

Public Diary:
Priority has been changed from very urgent to top priority.
Dear site admin,
Any news regarding WebDAV protocol?
Best,
Felipe
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-03-09 17:03:01

Public Diary:
Dear site admin,

Any news regarding WebDAV protocol?

Best,
Felipe
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-03-24 16:57:54

Public Diary:
Dear site admin,
Any news regarding WebDAV protocol?
Best,

Felipe
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-03-24 17:31:30

Status: assigned
Responsible Unit: VOSupport
Public Diary:
Dear site admin,
Any news regarding WebDAV protocol?
Best,
Felipe
CMS Transfer Team
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Denis Pugnere
Date: 2023-03-15 16:11:43

Public Diary:
Hi Pau,
I'm late on this process; it has been a long time since I last worked on this.
I'm back to configuring our new EOS endpoint (lyoeos.in2p3.fr) with the WebDAV protocol.
Cheers,
Denis
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Pau Cutrina Vilalta
Date: 2022-05-31 07:43:20
Changed CC to cms-comp-ops-transfer-team@cern.ch;g.baulieu@ipnl.in2p3.fr;d.pugnere@ipnl.in2p3.fr;kurca@in2p3.fr;stephane.perries@cern.ch

Public Diary:
Hello Denis, Guillaume, Tibor, Stephane,
This ticket must be handled as soon as possible.
Can you please take a look at it?
Thanks,
Pau

Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Denis PUGNERE
Date: 2024-05-07 14:56:30

Status: in progress
Responsible Unit: VOSupport
Public Diary:
Hi Stephan,
Sorry for the long offline delay, I'm back.
I want to decommission the lyogrid06.in2p3.fr RSE as soon as possible
and commission lyoeos.in2p3.fr at the same time.
lyoeos.in2p3.fr is at the latest update (diopside 5.2.22).

For the WebDAV support and TPC, I've just done some tests (with an x509 CMS VOMS proxy):
$ davix-put -E /tmp/x509up_u$(id -u) /etc/hosts davs://lyoeos.in2p3.fr:8443/store/user/dpugnere/test-davix-put-$(uuidgen)
$ gfal-copy -f file:///etc/hosts davs://lyoeos.in2p3.fr:8443/store/user/dpugnere/test-gfal-copy-davs-$(uuidgen)

These tests are OK,
and TPC is also working:
$ gfal-copy davs://lyoeos.in2p3.fr:8443/store/user/dpugnere/file-pugnere-gridcms https://lyoeostestmgm.in2p3.fr:8443/store/user/dpugnere/test-davs-TPC-test-$(uuidgen)

But from the guide https://twiki.cern.ch/twiki/bin/view/LCG/CMSWebDAVProtocolInstallationAndTesting
the curl test:
$ curl -L --capath /etc/grid-security/certificates/ -H 'X-No-Delegate:true' -H 'Credential: none' --cacert /tmp/x509up_u$(id -u) -E /tmp/x509up_u$(id -u) -T /etc/hosts https://lyoeos.in2p3.fr:8443/store/user/dpugnere/test-curl-put-$(uuidgen)
curl: (35) Peer does not recognize and trust the CA that issued your certificate.
doesn't work.


Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Stephan Lammel
Date: 2023-03-15 17:12:13

Public Diary:
Hallo Denis,
thanks for your comment. Let us know when the WebDAV endpoint is up and we'll test/help configure the RSE with it.
- Stephan
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Stephan Lammel
Date: 2024-05-07 15:45:36

Public Diary:
Hallo Denis,
Great that you have time to get back to this. If I look at the SAM
test results for lyoeos.in2p3.fr at
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_FR_IPNL
it looks like the XRootD endpoint works fine for x509 (but is not
yet subscribed to the federation which is ok) but the WebDAV
endpoint fails reads with x509. Looking at the log file, for instance,
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-4crt-read&var-dst_hostname=lyoeos.in2p3.fr&var-timestamp=1715091264000
the read is forwarded to lyostorage33.in2p3.fr:9001 and connection
to that node/port fails ("Could not connect to server"). I get the
same on the command line:
% /usr/bin/nc -zv -w 15 lyostorage33.in2p3.fr 9001
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connection to 134.158.83.199 failed: TIMEOUT.
Ncat: Trying next address...
Ncat: Network is unreachable.
for both IPv4 and IPv6. So, maybe a firewall issue or wrong port?
Hope this helps,
cheers, Stephan
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Denis PUGNERE
Date: 2024-05-13 07:07:01

Public Diary:
Hello Stephan,
Yes, it was a misconfiguration on my part, corrected this morning.
Let's see how the coming tests evolve.
Sorry,
Denis
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Denis PUGNERE
Date: 2024-05-13 10:05:50
Changed CC to cms-comp-ops-transfer-team@cern.ch;d.pugnere@ipnl.in2p3.fr

Public Diary:
Hi again,
I would like to understand the SAM test result of SE-WebDAV-4crt-read
in https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-4crt-read&var-dst_hostname=lyoeos.in2p3.fr&var-timestamp=1715587006000

it seems that:
* file A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root block 0 checksum matches
* file AE237916-5D76-E711-A48C-FA163EEEBFED.root block 2281 checksum matches
* but not for CE860B10-5D76-E711-BCA8-FA163EAA761A.root?

but I'm not sure.
I would like to run the SAM tests locally to speed up resolution of the problems; could you send me some instructions?

thanks,
denis
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Stephan Lammel
Date: 2024-05-13 13:10:06

Public Diary:
Hallo Denis,
thanks for taking a look. Looking at the above log,
a stat of file A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root worked
a read of the first block of file A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root worked
a read of block 2281 of file AE237916-5D76-E711-A48C-FA163EEEBFED.root worked
then we hit the ETF log file limit
fetching the ADLER32 checksum of CE860B10-5D76-E711-BCA8-FA163EAA761A.root failed
(The summary shows the error message and the test rotates over the files in the
SAM dataset, which is how I know the last line in the block above. The test also
printed the command-line equivalent, below.)
Thanks,
cheers, Stephan

% gfal-sum davs://lyoeos.in2p3.fr:8443/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root ADLER32
gfal-sum error: 1 (Operation not permitted) - HTTP 403 : Permission refused

Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Denis PUGNERE
Date: 2024-06-11 15:17:29

Public Diary:
Hi Stephan,
I think I understand the problem now.
For WebDAV, using the /store/... path fails with gfal-sum:
$ gfal-sum davs://lyoeos.in2p3.fr:8443/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root ADLER32
gfal-sum error: 38 (Function not implemented) - checksum calculation for ADLER32 not supported for davs://lyoeos.in2p3.fr:8443/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root

But the same command with the full path succeeds:
$ gfal-sum davs://lyoeos.in2p3.fr:8443/eos/lyoeos.in2p3.fr/grid/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root ADLER32
davs://lyoeos.in2p3.fr:8443/eos/lyoeos.in2p3.fr/grid/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root 96fb2295

It is the same problem with the other files.

How can the SAM tests be configured to use the full path?
best regards,
Denis
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Stephan Lammel
Date: 2024-06-11 15:53:21

Public Diary:
Hallo Denis,
thanks for taking a look. So, the /store/... path was the config we
initially added for testing. Since you switched the RSE, there is no
need for the test entry anymore. I have removed it and this should stop
SAM from checking the path (since you have /eos/lyoeos.in2p3.fr/grid/cms/store/...
as path in storage.json). It should take effect with the ETF reload
during the night.
Thanks,
cheers, Stephan
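Editor's note: the mismatch resolved above, a logical /store name versus the site's full namespace path, comes down to prepending the storage.json prefix. A minimal local sketch, using the prefix and file name quoted earlier in this ticket:

```python
# Prefix registered for the lyoeos RSE in storage.json, and one SAM
# dataset file, both as quoted in the diary above.
PREFIX = "/eos/lyoeos.in2p3.fr/grid/cms"
LFN = ("/store/mc/SAM/GenericTTbar/AODSIM/"
       "CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/"
       "CE860B10-5D76-E711-BCA8-FA163EAA761A.root")

def lfn_to_pfn(prefix: str, lfn: str) -> str:
    """Map a CMS logical file name to the site's physical path by
    prepending the storage.json prefix (illustrative helper)."""
    if not lfn.startswith("/store/"):
        raise ValueError("expected a /store/... logical file name")
    return prefix + lfn

# Reproduces the path form that worked with gfal-sum above.
print("davs://lyoeos.in2p3.fr:8443" + lfn_to_pfn(PREFIX, LFN))
```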

Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Denis PUGNERE
Date: 2024-06-17 14:14:38

Public Diary:
Hi folks,
Can this ticket be closed, or is there anything I need to do?
Cheers,
Denis
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Stephan Lammel
Date: 2024-06-17 21:08:16

Public Diary:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/RedirectorsSubscription
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Stephan Lammel
Date: 2024-06-17 21:07:22

Public Diary:
Hallo Denis,
yes, almost ready to close. Can you please subscribe
lyoeos-redir.in2p3.fr to the transitional federation?
Thanks,
cheers, Stephan
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154859
Last modifier: Stephan Lammel
Date: 2025-01-23 15:04:02

Public Diary:
federation subscription remains to be done. - Stephan
Internal Diary:
Notified CMS Site Support team of this ticket.
Hello Denis PUGNERE,

Could you please take action and update this old ticket?

Cheers,
David
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (2)

CMS tickets (2)
CMS #681897 (id:1998) TPC WebDAV protocol deployment T3_HR_IRB
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 16:08 (430d ago)  |  Updated: 2025-02-18 09:01
Conversation (15 messages)
GGUS ID: 154860
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2021-11-08 18:15:24
Subject: TPC WebDAV protocol deployment T3_HR_IRB
Ticket Type: USER
CC: cms-comp-ops-transfer-team@cern.ch;
Status: assigned
Responsible Unit: NGI_HR
Issue type: CMS_Data Transfers
Description:
Dear site Admin,

The WLCG-TPC group [1] is trying to get all WLCG sites to enable an endpoint that supports the WebDAV [2] transfer protocol in order to begin the replacement of GridFTP.
The majority of Tier-1s and Tier-2s have already enabled davs as the primary production protocol, and eventually no gsiftp transfers will be done from/to Tier-1s and Tier-2s.
We strongly recommend migrating to WebDAV as soon as possible. We have created a guide [3] to help with the installation and testing process.
Could you tell us whether you currently have an endpoint that supports this protocol, whether you are working on one, or whether you plan to work on one in the near future?

Thanks in advance.

Best regards,

Felipe Gomez
on behalf of CMS Transfer Team

[1] https://twiki.cern.ch/twiki/bin/view/LCG/ThirdPartyCopy
[2] https://twiki.cern.ch/twiki/bin/view/LCG/HttpTpc
[3] https://twiki.cern.ch/twiki/bin/view/LCG/CMSWebDAVProtocolInstallationAndTesting
GGUS ID: 154860
Last modifier: Emir Imamagic
Date: 2022-01-05 14:08:11

Status: in progress
Responsible Unit: NGI_HR
Public Diary:
Hasan, can you please comment on why T2_FR_IPHC is not assigned any production jobs?
Thanks,
- Stephan
Internal Diary:
Notified CMS Workflows team of this ticket.
GGUS ID: 154860
Last modifier: Oscar Garzon
Date: 2022-01-12 17:46:30

Public Diary:
Any updates here?
Internal Diary:
Notified CMS Workflows team of this ticket.
GGUS ID: 154860
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-02-03 17:46:54

Public Diary:
Hello Emir,
Do you have any update?
Cheers!
Felipe
Internal Diary:
Notified CMS Workflows team of this ticket.
GGUS ID: 154860
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-02-09 16:46:48

Public Diary:
Priority has been changed from urgent to very urgent.
Dear Emir,
Any news regarding WebDAV protocol?
Best,
Felipe
Internal Diary:
Notified CMS Workflows team of this ticket.
GGUS ID: 154860
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-03-14 20:54:40

Public Diary:
Priority has been changed from very urgent to top priority.
Hello Emir,
Do you have news about WebDAV Protocol?
Cheers,
Felipe

Internal Diary:
Notified CMS Workflows team of this ticket.
GGUS ID: 154860
Last modifier: Vuko Brigljevic
Date: 2022-03-30 16:49:02

Public Diary:
Hello,
Sorry for the very late reply!
storm-webdav is running and has actually been running on our site
for several months now. I am, however, having problems testing it
following the instructions at [1].
Working on lxplus: after getting a grid proxy with
voms-proxy-init --voms cms
I execute
curl -v -L --capath /etc/grid-security/certificates/ -H
'X-No-Delegate:true' -H 'Credential: none' --cacert /tmp/x509up_u13456
-E /tmp/x509up_u13456 -T ./wjets-2016.list
https://lorienmaster.irb.hr:8443/cms/store/user/vuko/wjets.list
I get the following error:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:--
--:--:-- 0* About to connect() to lorienmaster.irb.hr port 8443
(#0)
* Trying 161.53.131.101...
* Connected to lorienmaster.irb.hr (161.53.131.101) port 8443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* failed to load
'/etc/grid-security/certificates//edca0fc0.namespaces' from
CURLOPT_CAPATH
* failed to load
'/etc/grid-security/certificates//04f60c28.namespaces' from
CURLOPT_CAPATH
(... and many more such messages...)
* failed to load '/etc/grid-security/certificates//policy-lcg.info'
from CURLOPT_CAPATH
* CAfile: /tmp/x509up_u13456
CApath: /etc/grid-security/certificates/
* NSS: client certificate from file
* subject: CN=68145606,CN=Vuko
Brigljevic,CN=378129,CN=vuko,OU=Users,OU=Organic Units,DC=cern,DC=ch
* start date: Mar 30 14:44:00 2022 GMT
* expire date: Apr 06 18:44:00 2022 GMT
* common name: 68145606
* issuer: CN=Vuko Brigljevic,CN=378129,CN=vuko,OU=Users,OU=Organic
Units,DC=cern,DC=ch
* SSL connection using TLS_DHE_RSA_WITH_AES_256_CBC_SHA256
* Server certificate:
* subject: CN=lorienmaster.irb.hr,OU=irb,O=edu,C=HR
* start date: Oct 20 06:48:51 2021 GMT
* expire date: Nov 19 06:48:51 2022 GMT
* common name: lorienmaster.irb.hr
* issuer: CN=SRCE CA,OU=srce,O=edu,C=HR
> PUT /cms/store/user/vuko/wjets.list HTTP/1.1
> User-Agent: curl/7.29.0
> Host: lorienmaster.irb.hr:8443
> Accept: */*
> X-No-Delegate:true
> Credential: none
> Content-Length: 1346
> Expect: 100-continue
>
< HTTP/1.1 500 Server Error
< Set-Cookie: JSESSIONID=node01g6x89lnicqp01qrlocpqxk4ef4.node0; Path=/; Secure
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< Strict-Transport-Security: max-age=31536000 ; includeSubDomains
< X-Frame-Options: DENY
< Connection: close
<
{ [data not shown]
0 1346 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Closing connection 0
Any idea about what could be wrong?
Cheers,
Vuko
[1] https://twiki.cern.ch/twiki/bin/view/LCG/CMSWebDAVProtocolInstallationAndTesting

--
Vuko Brigljevic
Senior Scientist
Head of Laboratory for Particle Physics
Rudjer Boskovic Institute
Bijenicka 54, HR-10000 Zagreb (Croatia)
Phone: +385-1-4571318 GSM: +385-98-965 8104
Swiss GSM (while @ CERN): +41-76-278 3615

Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154860
Last modifier: Stephan Lammel
Date: 2022-03-31 12:45:49

Public Diary:
Hallo Vuko,
i added an entry of your WebDAV endpoint for testing to SAM
yesterday:

https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_HR_IRB

but i probably guessed the path wrong and SAM gets:

https://lorienmaster.irb.hr:8443/cms/store/mc/SAM/: HTTP Error 404: Not Found

Is this the right path to the SAM dataset area? If so, the area is either
not accessible to the web service or the web service does not support
the WebDAV extensions.
Thanks,
cheers, Stephan
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154860
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-05-04 16:31:18

Public Diary:
Hello Vuko,

I can do gfal-copy using my own certificates.
But for TPC I got the following error in the TPC log [1]:
Copy failed with mode 3rd push, with error: Transfer failed: failure: SocketException while pushing https://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco/TPC/T3_HR_IRB/10M_001: Connection reset
Can you check that? I am also asking Diego.

Best,
Felipe

[1] TPC Transfer log: https://fts3-cms.cern.ch:8449/fts3/ftsmon/#/job/0f6dbc36-cbc7-11ec-a0f1-fa163e255be7
INFO Wed, 04 May 2022 18:27:16 +0200; Configured to skip retrieval of SE-issued tokens
INFO Wed, 04 May 2022 18:27:16 +0200; Transfer accepted
INFO Wed, 04 May 2022 18:27:16 +0200; Proxy: /tmp/x509up_h7468342906483470881XdcXchXdcXcernXouXorganicXunitsXouXusersXcnXfgomezcoXcnX850533XcnXfelipeXgomez
INFO Wed, 04 May 2022 18:27:16 +0200; VO: cms
INFO Wed, 04 May 2022 18:27:16 +0200; Job id: 0f6dbc36-cbc7-11ec-a0f1-fa163e255be7
INFO Wed, 04 May 2022 18:27:16 +0200; File id: 3337358994
INFO Wed, 04 May 2022 18:27:16 +0200; Source url: davs://lorienmaster.irb.hr:8443/STORE/se/cms/store/temp/user/fgomezco/10M_001
INFO Wed, 04 May 2022 18:27:16 +0200; Dest url: davs://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco/TPC/T3_HR_IRB/10M_001
INFO Wed, 04 May 2022 18:27:16 +0200; Overwrite enabled: 1
INFO Wed, 04 May 2022 18:27:16 +0200; Disable delegation: 0
INFO Wed, 04 May 2022 18:27:16 +0200; Disable local streaming: 0
INFO Wed, 04 May 2022 18:27:16 +0200; Dest space token:
INFO Wed, 04 May 2022 18:27:16 +0200; Source space token:
INFO Wed, 04 May 2022 18:27:16 +0200; Checksum:
INFO Wed, 04 May 2022 18:27:16 +0200; Checksum enabled: Both checksum comparison
INFO Wed, 04 May 2022 18:27:16 +0200; User filesize: 0
INFO Wed, 04 May 2022 18:27:16 +0200; File metadata: null
INFO Wed, 04 May 2022 18:27:16 +0200; Job metadata: {\"auth_method\":?\"certificate\"}
INFO Wed, 04 May 2022 18:27:16 +0200; Bringonline token:
INFO Wed, 04 May 2022 18:27:16 +0200; UDT: 0
INFO Wed, 04 May 2022 18:27:16 +0200; BDII:lcg-bdii.cern.ch:2170
INFO Wed, 04 May 2022 18:27:16 +0200; Source token issuer:
INFO Wed, 04 May 2022 18:27:16 +0200; Destination token issuer:
INFO Wed, 04 May 2022 18:27:16 +0200; Report on the destination tape file: 0
INFO Wed, 04 May 2022 18:27:16 +0200; Getting source file size
INFO Wed, 04 May 2022 18:27:17 +0200; File size: 10485760
INFO Wed, 04 May 2022 18:27:17 +0200; IPv6: indeterminate
INFO Wed, 04 May 2022 18:27:17 +0200; TCP streams: 0
INFO Wed, 04 May 2022 18:27:17 +0200; TCP buffer size: 0
INFO Wed, 04 May 2022 18:27:17 +0200; Timeout set to: 620
INFO Wed, 04 May 2022 18:27:17 +0200; [1651681637931] BOTH GFAL2:CORE:COPY LIST:ENTER
INFO Wed, 04 May 2022 18:27:17 +0200; [1651681637932] BOTH GFAL2:CORE:COPY LIST:ITEM davs://lorienmaster.irb.hr:8443/STORE/se/cms/store/temp/user/fgomezco/10M_001 => davs://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco/TPC/T3_HR_IRB/10M_001
INFO Wed, 04 May 2022 18:27:17 +0200; [1651681637932] BOTH GFAL2:CORE:COPY LIST:EXIT
INFO Wed, 04 May 2022 18:27:17 +0200; [1651681637932] BOTH http_plugin PREPARE:ENTER davs://lorienmaster.irb.hr:8443/STORE/se/cms/store/temp/user/fgomezco/10M_001 => davs://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco/TPC/T3_HR_IRB/10M_001
INFO Wed, 04 May 2022 18:27:26 +0200; [1651681646633] BOTH http_plugin PREPARE:EXIT davs://lorienmaster.irb.hr:8443/STORE/se/cms/store/temp/user/fgomezco/10M_001 => davs://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco/TPC/T3_HR_IRB/10M_001
INFO Wed, 04 May 2022 18:27:26 +0200; [1651681646633] BOTH http_plugin TRANSFER:ENTER davs://lorienmaster.irb.hr:8443/STORE/se/cms/store/temp/user/fgomezco/10M_001 => davs://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco/TPC/T3_HR_IRB/10M_001
INFO Wed, 04 May 2022 18:27:26 +0200; [1651681646633] BOTH http_plugin TRANSFER:TYPE 3rd pull
INFO Wed, 04 May 2022 18:27:29 +0200; [1651681649296] BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed with mode 3rd pull, with error: Transfer failed: failure: Remote side failed with status code 500; error message: "{"timestamp":"2022-05-04T16:27:29.064+0000","status":500,"error":"Internal Server Error","message":"java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;","path":"/STORE/se/cms/store/temp/user/fgomezco/10M_001"}"
INFO Wed, 04 May 2022 18:27:30 +0200; [1651681650336] BOTH http_plugin TRANSFER:ENTER davs://lorienmaster.irb.hr:8443/STORE/se/cms/store/temp/user/fgomezco/10M_001 => davs://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco/TPC/T3_HR_IRB/10M_001
INFO Wed, 04 May 2022 18:27:30 +0200; [1651681650336] BOTH http_plugin TRANSFER:TYPE 3rd push
INFO Wed, 04 May 2022 18:27:34 +0200; [1651681654938] BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed with mode 3rd push, with error: Transfer failed: failure: SocketException while pushing https://redirector.t2.ucsd.edu:1095/store/temp/user/fgomezco
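The gfal2/FTS transfer-log lines above follow a fixed "LEVEL date; message" shape, which makes the pull-then-push fallback easy to pick out. A minimal parser sketch (not an official FTS tool) could split them like this:

```python
import re

# Minimal sketch: split a gfal2/FTS transfer-log line of the form
# "INFO Wed, 04 May 2022 18:27:17 +0200; <message>" into level, date, message.
LOG_RE = re.compile(r"^(INFO|WARN|ERROR)\s+([^;]+);\s*(.*)$")

def parse_fts_line(line: str):
    m = LOG_RE.match(line)
    if not m:
        return None
    level, date, message = m.groups()
    return {"level": level, "date": date.strip(), "message": message}

rec = parse_fts_line(
    "INFO Wed, 04 May 2022 18:27:17 +0200; IPv6: indeterminate")
```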
GGUS ID: 154860
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-04-01 15:33:01

Public Diary:
Hello Vuko,

Thanks for your answer.
I have started manual DAVS testing on your site. The endpoint I used is:
davs://lorienmaster.irb.hr:8443/STORE/se/cms

gfal-ls OK
gfal-copy OK
gfal-copy + overwrite OK

The Third Party Copy using FTS fails. I am attaching the log of the test I performed[1] plus the FTS transfer log[2].

I have a few questions.
Is davs://lorienmaster.irb.hr:8443/STORE/se/cms the correct WebDAV endpoint?
Which storage solution have you implemented at your site? (XRootD, dCache, DPM, EOS, StoRM, ECHO)

Let me check with Diego Davila the FTS log errors.

Cheers!

Felipe

[1]
[fgomezco@lxplus732 tpc_transfers]$ RSE=T3_HR_IRB
[fgomezco@lxplus732 tpc_transfers]$ ENDPOINT=davs://lorienmaster.irb.hr:8443/STORE/se/cms
[fgomezco@lxplus732 tpc_transfers]$ bash test_tpc.sh_FILIS $ENDPOINT $RSE 10M_001
Testing: T3_HR_IRB
Endpoint: davs://lorienmaster.irb.hr:8443/STORE/se/cms
Checking I can write to davs://lorienmaster.irb.hr:8443/STORE/se/cms/store/temp/user/fgomezco/10M_001
Copying file:///afs/cern.ch/user/f/fgomezco/github/davila/tpc_transfers/10M_001 [DONE] after 0s
gfal-copy: OK
Testing davs://lorienmaster.irb.hr:8443/STORE/se/cms as source
TRANSFER ID: 12fbc59c-b1cc-11ec-9adc-fa163eac1edd
RESULT: SUBMITTED
RESULT: ACTIVE
RESULT: ACTIVE
RESULT: ACTIVE
RESULT: ACTIVE
fts-transfer-submit: FAILED

[2] FTS transfer log
https://fts-cms-04.cern.ch:8449/var/log/fts3/transfers/2022-04-01/lorienmaster.irb.hr__redirector.t2.ucsd.edu/2022-04-01-1454__lorienmaster.irb.hr__redirector.t2.ucsd.edu__3293339256__a33f543a-b1cb-11ec-8b40-fa163e36d89b

Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154860
Last modifier: Diego Davila
Date: 2022-04-14 15:20:37

Public Diary:
Hello Vuko,

Is there any news on this?

Cheers,
Diego Davila
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154860
Last modifier: Diego Davila
Date: 2022-05-26 15:15:51

Public Diary:
Hello Vuko,

Have you had time to take a look at your configs? It seems to me that the TPC part might not be configured correctly.

Cheers,
Diego Davila
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154860
Last modifier: Felipe Leonardo Gomez Cortes
Date: 2022-06-10 13:40:43

Public Diary:
Hi Vuko,
Any news regarding WebDav deployment?
Cheers,
Felipe
Internal Diary:
Notified CMS Site Support team of this ticket.
GGUS ID: 154860
Last modifier: Stephan Lammel
Date: 2025-01-23 15:00:01

Public Diary:
lorienmaster.irb.hr / SRM entry still exists/needs to be decommissioned. - Stephan
Internal Diary:
Notified CMS Site Support team of this ticket.
Assigned missing CMS site name (error during import to new GGUS).
Jakrapop
CMS #681552 (id:1653) SAM Connection test failure at T3_HR_IRB
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 09:54 (430d ago)  |  Updated: 2025-01-30 14:09
Conversation (7 messages)
GGUS ID: 168138
Last modifier: Chan-anun Rungphitakchai
Date: 2024-09-12 19:05:27
Subject: SAM Connection test failure at T3_HR_IRB
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_HR
Issue type: Network problem
Description:
Hello IRB admin.
Since 15:00 UTC yesterday (11 Sep), your storage endpoint "lorienmaster.irb.hr" has been failing the SAM "1connection" test [1]. The nc command reports "No route to host" [2], but a manual ping looks normal [3]. Could you please take a look at the service/network connection on the server?
Thank you,
Noy
[1]
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_HR_IRB
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-GSIftp-1connection&var-dst_hostname=lorienmaster.irb.hr&var-timestamp=1726161727000
[3]
[chananun@cmslpc236 work]$ ping -c 5 lorienmaster.irb.hr
PING lorienmaster.irb.hr (161.53.131.101) 56(84) bytes of data.
64 bytes from lorienmaster.irb.hr (161.53.131.101): icmp_seq=1 ttl=46 time=133 ms
64 bytes from lorienmaster.irb.hr (161.53.131.101): icmp_seq=2 ttl=46 time=133 ms
64 bytes from lorienmaster.irb.hr (161.53.131.101): icmp_seq=3 ttl=46 time=133 ms
64 bytes from lorienmaster.irb.hr (161.53.131.101): icmp_seq=4 ttl=46 time=133 ms
64 bytes from lorienmaster.irb.hr (161.53.131.101): icmp_seq=5 ttl=46 time=133 ms
--- lorienmaster.irb.hr ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 132.609/132.680/132.774/0.238 ms
GGUS ID: 168138
Last modifier: Stephan Lammel
Date: 2024-10-04 12:52:18

Public Diary:
Thanks for the update Emir! - Stephan
Internal Diary:
Escalated this ticket to NGI_GRNET
GGUS ID: 168138
Last modifier: Vuko Brigljevic
Date: 2024-10-09 10:55:02

Public Diary:
I confirm what Emir said and thank him
for replying while I was traveling.
The site is currently suspended and we
will have to go through the full certification
again when we see that we are able to
bring it up again, but I cannot give a time
estimate for this.
Vuko

--
Vuko Brigljevic
Senior Scientist
Head of Laboratory for Particle Physics
Rudjer Boskovic Institute
Bijenicka 54, HR-10000 Zagreb (Croatia)
Phone: +385-1-4571318 GSM: +385-98-965 8104
Internal Diary:
Escalated this ticket to NGI_GRNET
GGUS ID: 168138
Last modifier: GGUS SYSTEM
Date: 2024-10-11 08:41:17

Public Diary:
(public diary unchanged from the 2024-10-09 entry above)
Internal Diary:
Sent 1st reminder to ticket submitter (rungphitakch@wisc.edu) requesting input.
GGUS ID: 168138
Last modifier: GGUS SYSTEM
Date: 2024-10-18 08:41:24

Public Diary:
(public diary unchanged from the 2024-10-09 entry above)
Internal Diary:
Sent 2nd reminder to ticket submitter (rungphitakch@wisc.edu) requesting input.
GGUS ID: 168138
Last modifier: GGUS SYSTEM
Date: 2024-10-25 08:59:47

Public Diary:
(public diary unchanged from the 2024-10-09 entry above)
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 168138
Last modifier: Chan-anun Rungphitakchai
Date: 2024-11-07 19:14:25

Status: on hold
Responsible Unit: NGI_HR
Public Diary:
(public diary unchanged from the 2024-10-09 entry above)
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%
HammerCloud0%
FTS— no data —

Open GGUS tickets (1)

CMS tickets (1)
CMS #681738 (id:1839) SRMv2 phase out at T3_IN_TIFRCloud
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:15 (430d ago)  |  Updated: 2025-02-03 11:17
Conversation (4 messages)
GGUS ID: 166843
Last modifier: Chan-anun Rungphitakchai
Date: 2024-05-23 18:58:11
Subject: SRMv2 phase out at T3_IN_TIFRCloud
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: File Access
Description:
Hello TIFR admin.
After looking at your SITECONF, you have an SRMv2 protocol defined in storage.json for your RSE and use the SRMv2 protocol in the local-stage-out.
In detail, in JobConfig/site-local-config.xml you should
change line 16 to
and remove line 27

Since there is no production storage endpoint at your site, could you please change line 23 in storage.json to

"rse": "null",
The last file is PhEDEx/storage.xml, which needs a line added under line 23 with

Please check that the above is what you want (typos, etc.) and, if you agree, update SITECONF.
Thank you,
Noy
GGUS ID: 166843
Last modifier: Chan-anun Rungphitakchai
Date: 2024-05-23 19:02:05

Public Diary:
Sorry for the typo. In storage.json, line 23 should be
"rse": null,
I attach another SITECONF that uses "rse": null [1].
Thank you,
Noy
[1]
https://gitlab.cern.ch/SITECONF/T2_FR_IPHC/-/blob/master/storage.json
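Noy's correction above matters because JSON null and the quoted string "null" parse into different values, and tooling that checks for a missing RSE would only recognize the former. A quick illustration with Python's standard json module (not any CMS tooling):

```python
import json

# JSON null becomes Python None, while the quoted string "null" stays a string;
# a consumer testing `cfg["rse"] is None` would mis-handle the quoted form.
good = json.loads('{"rse": null}')
bad = json.loads('{"rse": "null"}')
```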
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 166843
Last modifier: Chan-anun Rungphitakchai
Date: 2024-12-03 20:18:45

Public Diary:
Hello TIFRCloud admin.
Do you have any update?
Cheers
Noy
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 166843
Last modifier: Chan-anun Rungphitakchai
Date: 2024-09-26 19:36:17

Public Diary:
Any update? Thanks -- Noy
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (4)

CMS tickets (2)
CMS #681750 (id:1851) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_IR_IPM
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 10:18 (430d ago)  |  Updated: 2025-02-13 14:59
Conversation (4 messages)
GGUS ID: 168937
Last modifier: Jakrapop Akaranee
Date: 2024-11-07 09:39:58
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_IR_IPM
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: CMS_SAM tests
Description:
Dear IPM Site Administrators,

We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual-stack networking, specifically for the following endpoint:

Both WebDAV [1] and XRootD [2] services at se1.hep.ipm.ir

Could you please review dual-stack support on your storage element? Thank you for your assistance.
Best Regards,
Jakrapop
---------
[1] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dse1.hep.ipm.ir%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
[2] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dse1.hep.ipm.ir%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
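Dual-stack support means the endpoint's hostname resolves to both IPv4 and IPv6 addresses. Classifying resolved address literals by IP version can be sketched with Python's standard ipaddress module (the literals below are examples from this report, used for illustration only):

```python
import ipaddress

def classify(addrs):
    """Group address literals by IP version, as a dual-stack check over
    the results of a DNS lookup would do."""
    versions = {4: [], 6: []}
    for a in addrs:
        versions[ipaddress.ip_address(a).version].append(a)
    return versions

# A dual-stack endpoint should yield at least one address of each family.
v = classify(["94.184.210.155", "2001:67c:2870:1:0:0:2001:25"])
```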
GGUS ID: 168937
Last modifier: Masoud Mosalman Tabar
Date: 2024-11-09 04:08:50

Public Diary:
Hello Jakrapop,
According to the following monitoring pages, token-based WebDAV is set up on the new dCache version along with XRootD:


https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Ffilled_in%3Dfilter%26host%3Dse1.hep.ipm.ir%26view_name%3Dhost

https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_IR_IPM&update=auto

Additionally, in another ticket (linked below), we performed the migrations and have for a long while been chasing an issue related to token-based XRootD/dCache on our SE1 (se1.hep.ipm.ir):

https://ggus.eu/index.php?mode=ticket_info&ticket_id=164275


Please let me know in detail about any further procedures we need to follow (please provide us with configuration files and settings).

Regards,
Masoud

Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168937
Last modifier: Masoud Mosalman Tabar
Date: 2024-11-09 07:38:41

Public Diary:
Hello again,

Regarding the dual-stack network and IPv6 implementation for the country of Iran and our national network infrastructure, it remains neither possible nor feasible at this time.

Over the past year, this issue has faced numerous significant challenges, with little hope for resolution.

Bests,
Masoud



Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
GGUS ID: 168937
Last modifier: Jakrapop Akaranee
Date: 2024-11-27 12:44:26

Status: on hold
Responsible Unit: ROC_Asia/Pacific
Public Diary:
Dear Masoud,

Thank you for your update and information. I will kindly hold this ticket until you have a plan for IPv6 implementation in the country.
I appreciate your ongoing efforts and look forward to hearing about your progress.

Cheers,
Jakrapop

Internal Diary:
Added attachment aaaRFServerCount.pdf
https://ggus.eu/index.php?mode=download&attid=ATT119564
CMS #681734 (id:1835) IAM Token for dCache at T3_IR_IPM
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:15 (430d ago)  |  Updated: 2025-02-03 11:15
Conversation (67 messages)
GGUS ID: 164275
Last modifier: Chan-anun Rungphitakchai
Date: 2023-11-22 17:07:33
Subject: IAM Token for dCache at T3_IR_IPM
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: ROC_Asia/Pacific
Issue type: Storage Systems
Description:
Hello IPM admins,
IAM token support for dCache is ready [1]. You could consider upgrading your dCache door node and configuring IAM token access. First, ensure your dCache version is 8.2.33 (the recommended version). I attach the documents and wiki pages [2]. Please take a look.
Thank you and have a nice day,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMS/IAMTokens
[2]
https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheCMSsetup https://twiki.cern.ch/twiki/bin/view/CMSPublic/DCacheXRootD
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2023-11-25 09:59:37

Public Diary:
Hello,
According to the instructions, the following configuration has been changed. However, it caused the entire service to fail.
Please check the steps I took and let me know what other changes our Storage Element needs.

rpm -qa | grep dcache
#dcache-8.2.33-1.noarch


vi /etc/dcache/layouts/layout-se1.hep.ipm.ir.conf
#ADDED:
[centralDomain/gplazma]
[${host.name}_gplazmaDomain]
[${host.name}_gplazmaDomain/gplazma]

gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any https://se1.hep.ipm.ir

gplazma.oidc.provider!cms=https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001" -prefix=/data/store -suppress=audience

gplazma.oidc.provider!wlcg=https://wlcg.cloud.cnaf.infn.it/ -profile=wlcg -prefix=/data/store -suppress=audience

[doorsDomain/xrootd]
#CHANGED
#xrootd.plugins=gplazma:gsi
xrootd.plugins=gplazma:ztn,gplazma:gsi,authz:cms-tfc,authz:scitokens
xrootd.security.tls.mode=STRICT


vi /etc/dcache/gplazma.conf
auth optional x509
auth optional voms
auth optional oidc
auth optional scitoken

map optional gridmap
map optional vorolemap
map optional vogroup vo-group-path=/etc/dcache/vo-group.json
map optional vogroup vo-group-path=/etc/dcache/vo-user.json

map sufficient multimap gplazma.multimap.file=/etc/dcache/multimap-prod-user.conf
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.vorole
map optional multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.group
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.user
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.vo
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.unmapped

account requisite banfile
session requisite roles
session sufficient omnisession


vi /etc/dcache/multimap-prod-user.conf
group:cms username:cms_production uid:1001


Bests,
Masoud
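The multimap entry quoted above maps a VOMS group to a local account and uid. A small parser sketch for such lines (a hypothetical helper, not dCache code) makes the structure explicit:

```python
# Sketch: parse dCache gplazma multimap lines such as
#   group:cms username:cms_production uid:1001
# into a matching predicate plus the principals it maps to.
def parse_multimap_line(line: str):
    tokens = [t.split(":", 1) for t in line.split()]
    predicate = tuple(tokens[0])   # e.g. ("group", "cms")
    mapped = dict(tokens[1:])      # e.g. {"username": ..., "uid": ...}
    return predicate, mapped

pred, mapped = parse_multimap_line("group:cms username:cms_production uid:1001")
```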

Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2023-11-26 19:32:33

Public Diary:
Hallo Christoph,
do you know/can you guess what went wrong with dCache upgrade/config at
T3_IR_IPM? All services, SRM, WebDAV, and Xrootd fail already the connection
check, i.e. a "/usr/bin/nc -zv -4 -w 45 94.184.210.155 443" from CERN,
so this has to be something fundamental.
Masoud, you didn't change ports etc. and checked your firewalls, right?
Thanks,
cheers, Stephan
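The `nc -zv` probe Stephan quotes simply attempts a TCP connect and reports success or failure. The same check can be sketched in Python with the standard socket module (hosts and ports are whatever you want to probe):

```python
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout,
    mirroring what `nc -zv -w <timeout> host port` reports."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```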
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2023-11-26 21:15:10

Public Diary:
Thanks Christoph! - Stephan
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Christoph Wissing
Date: 2023-11-26 20:07:32

Public Diary:
Hello,

if none of the protocols work, I would guess something went wrong in the gplazma config. If only one door had a bad configuration, only that door would be expected to fail.

Although the log files are not particularly easy to read, there should be hints inside about what is failing. The logs are usually found in /var/log/dcache/.

Cheers, Christoph


Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2023-11-27 06:01:30

Public Diary:
Hello Stephan and Christoph,
For your information, no ports, firewall rules, or any other policies have been changed since the new IAM procedure.
Besides, nothing helpful or detailed can be found in the log directory (/var/log/dcache/).
Its output consists of thoroughly general messages like the following, just showing an authentication failure.
2023-11-27 08:57:04,271 [CURATOR: STARTED] unhandled error "Background retry gave up": KeeperErrorCode = ConnectionLoss
2023-11-27 08:57:04,271 [CURATOR: STARTED] connection state now SUSPENDED

level=INFO ts=2023-11-27T09:01:15.795+0330 event=org.dcache.ftp.response session=door:GFTP-se1-AAYLG5xblLA@doorsDomain command="ENC{USER :globus-mapping:}" reply="ENC{530 Login denied}"
level=INFO ts=2023-11-27T09:09:05.141+0330 event=org.dcache.xrootd.connection.start session=door:Xrootd-se1@doorsDomain:AAYLG7hZMyA socket.remote=94.184.210.139:40892 socket.local=94.184.210.155:1094
level=INFO ts=2023-11-27T09:09:05.144+0330 event=org.dcache.xrootd.request session=door:Xrootd-se1@doorsDomain:AAYLG7hZMyA request=protocol response=ok
level=INFO ts=2023-11-27T09:09:05.171+0330 event=org.dcache.xrootd.request session=door:Xrootd-se1@doorsDomain:AAYLG7hZMyA request=login username=masoodmm capver=5 pid=49783 token=xrd.cc=us&xrd.tz=3&xrd.appname=xrdcp&xrd.info=&xrd.hostname=SuperMicro6&xrd.rn=v5.5.4 response=ok sessionId=505C6D6F4F27F7E333BF0C2C89DC8D21 sec=&P=gsi,v:10400,c:ssl,ca:f8598272
level=INFO ts=2023-11-27T09:09:05.278+0330 event=org.dcache.xrootd.request session=door:Xrootd-se1@doorsDomain:AAYLG7hZMyA request=auth response=authmore
level=ERROR ts=2023-11-27T09:09:05.316+0330 event=org.dcache.xrootd.request session=door:Xrootd-se1@doorsDomain:AAYLG7hZMyA request=auth response=error error.code=NotAuhorized error.msg=
level=INFO ts=2023-11-27T09:09:05.317+0330 event=org.dcache.xrootd.connection.end session=door:Xrootd-se1@doorsDomain:AAYLG7hZMyA


Today, I made some changes that differ from my latest configuration, which brought the ports back into running mode.
However, there are still multiple vague authentication errors in the service due to the new configs, which I encountered while sending files as a client.
I hope the IAM admins find this information useful:

#SERVER side:
[root@se1 dcache]# netstat -ntpul | grep 1094
tcp6 0 0 :::1094 :::* LISTEN 2428/java
[root@se1 dcache]# netstat -ntpul | grep 443
tcp6 0 0 :::443 :::* LISTEN 2428/java


#CLIENT side:
gfal-ls davs://se1.hep.ipm.ir/data/
gfal-ls error: 13 (Permission denied) - Result HTTP 401 : Authentication Error after 1 attempts

gfal-stat davs://se1.hep.ipm.ir:443/data/store/mc/SAM
gfal-stat error: 13 (Permission denied) - Result HTTP 401 : Authentication Error after 1 attempts

xrdcp xroot://se1.hep.ipm.ir:1094//data/store/XMP2-4 File1
[0B/0B][100%][==================================================][0B/s]
Run: [FATAL] Auth failed: No protocols left to try (source)

gfal-copy test1 gsiftp://se1.hep.ipm.ir/dpm/hep.ipm.ir/home/cms/XMP-7
Copying file:///home/masoodmmt/test1 [FAILED] after 0s
gfal-copy error: 70 (Communication error on send) - globus_xio: Unable to connect to se1.hep.ipm.ir:2811 globus_xio:
System error in connect: Connection refused globus_xio: A system call failed: Connection refused
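When the client only reports "[FATAL] Auth failed: No protocols left to try", raising the XRootD client log level usually reveals which authentication protocols the door offered and why each one was rejected. A small diagnostic sketch (the copy command itself is left commented out, since it needs valid credentials against the site's door):

```shell
# XRD_LOGLEVEL is a standard XRootD client environment variable; Debug (or
# Dump) makes the client print the security handshake, showing which auth
# protocols (gsi, ztn, ...) the server offered and why each failed.
export XRD_LOGLEVEL=Debug
echo "client log level set to $XRD_LOGLEVEL"
# The actual transfer against the door (requires valid credentials):
#   xrdcp xroot://se1.hep.ipm.ir:1094//data/store/XMP2-4 File1
```

The same variable works for xrdfs and gfal's xrootd plugin, so one debug run usually narrows an opaque auth failure down to a specific plugin.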



Finally, here are the contents of my new configuration files:

cat gplazma.conf
auth optional x509
auth optional voms
auth optional oidc
map optional vorolemap
map sufficient multimap gplazma.multimap.file=/etc/dcache/multimap-prod-user.conf
# Only needed for special scenarios and xrootd in releases < 8.2.32:
#map optional multimap gplazma.multimap.file=/etc/dcache/multimap-id-to-username.conf



less layouts/layout-se1.hep.ipm.ir.conf
...
..
.
[doorsDomain/xrootd]
xrootd.plugins=gplazma:gsi,gplazma:ztn,authz:scitokens
xrootd.root=/data/store
#xrootd.security.tls.mode=STRICT


[gplazmaDomain_${host.name}_gplazmaDomain]
[gplazmaDomain_${host.name}_gplazmaDomain/gplazma]
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any davs://se1.hep.ipm.ir:443 root://se1.hep.ipm.ir:1094

gplazma.oidc.provider!cms=https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001" -prefix=/data/store -suppress=audience

# Uncomment line below for WLCG testing VO:
gplazma.oidc.provider!wlcg=https://wlcg.cloud.cnaf.infn.it/ -profile=wlcg -prefix=/data/store -suppress=audience



cat /etc/dcache/multimap-prod-user.conf
group:cms username:cms_production uid:1001
group:cms username:cmsusr001 uid:1001
group:cms username:cms uid:1001



Bests,
Masoud
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Christoph Wissing
Date: 2023-11-27 08:53:52

Public Diary:
Dear Masoud,

thanks for sharing more details of the configuration. It looks like there is a mixture of a configuration applied by the conversion tools and our CMS recommendations. Likely the two approaches do not play well together.

To approach the issue systematically, can you please _revert_ the changes that you made for CMS token-based authentication? We should first ensure that the setup works with X509 on the most recent dCache version. Then we try to establish token support on top of it.

Cheers, Christoph


Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2023-11-27 09:23:49

Public Diary:
Hello Christoph,
All configurations have been reverted to their state from before the IAM token changes. The new dCache version seems stable, however, and client connections as well as data transfers appear to be fine, as shown below:

#Server:
[root@se1 ~]# rpm -qa | grep dcache
dcache-8.2.33-1.noarch

#Client:

gfal-copy XMP gsiftp://se1.hep.ipm.ir/dpm/hep.ipm.ir/home/cms/XMP-77
Copying file:///home/masoodmmt/XMP [DONE] after 0s

gfal-stat davs://se1.hep.ipm.ir:443/data/store/mc/SAM

File: 'davs://se1.hep.ipm.ir:443/data/store/mc/SAM'
Size: 0 directory
Access: (0777/drwxrwxrwx) Uid: 0 Gid: 0
Access: 1970-01-01 03:30:00.000000
Modify: 2023-07-12 11:25:29.000000
Change: 2023-07-12 11:25:29.000000



gfal-ls davs://se1.hep.ipm.ir/data/

#file1
#file2


Bests,
Masoud

Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Christoph Wissing
Date: 2023-11-27 16:47:49

Status: in progress
Responsible Unit: ROC_Asia/Pacific
Public Diary:
Since the configuration works without tokens, the problem must come in when tokens are used. Again, my guess is that the configuration from the DPM conversion does not play well with the CMS suggestions, so we have to experiment a bit with the gplazma config.

Do you have 'oidc' or 'scitokens' enabled in the auth part in gplazma.conf? Do you have a file /etc/grid-security/storage-authzdb?

It is also worth taking a look here:

https://wlcg-authz-wg.github.io/wlcg-authz-docs/token-based-authorization/configuration/dcache/

which discusses some token related config options.

Cheers, Christoph
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2023-11-28 05:59:46

Public Diary:
Hello Christoph,

Here is the file content you requested (these are local users):

vi /etc/grid-security/storage-authzdb
authorize cmsusr001 read-write 1001 1001 / / /
authorize cmsprd001 read-write 1002 1001 / / /
authorize cmsana001 read-write 1003 1001 / / /

Following the link you mentioned (below), I changed the configuration, which is pasted here again, but the authorization errors remained and the service failed. So I had to roll the changes back to revive the service:

https://wlcg-authz-wg.github.io/wlcg-authz-docs/token-based-authorization/configuration/dcache/

vi gplazma.conf
auth optional oidc
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.oidc

cat /etc/dcache/multi-mapfile.oidc
username:cms_production uid:1001 gid:1001,true group:writer
username:cms uid:1001 gid:1001,true group:writer



vi layouts/layout-se1.hep.ipm.ir.conf
[centralDomain/gplazma]
gplazma.oidc.provider!cms = https://cms-auth.web.cern.ch/ -profile=wlcg -prefix=/cms -authz-id="uid:3999 gid:3999,true username:cms_oidc"
gplazma.oidc.provider!wlcg = https://wlcg.cloud.cnaf.infn.it/ -profile=wlcg -prefix=/wlcg -authz-id="uid:1999 gid:1999,true username:wlcg_oidc"
gplazma.oidc.audience-targets = https://wlcg.cern.ch/jwt/v1/any https://se1.hep.ipm.ir root://se1.hep.ipm.ir:1094 roots://se1.hep.ipm.ir:1094 davs://se1.hep.ipm.ir:443


Bests,
Masoud



Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Christoph Wissing
Date: 2023-11-28 07:33:39

Public Diary:
Have you _added_ the quoted lines, or did you replace the other multimap entries? I guess there are incompatible settings inside.

Can you try to remove ALL multimap files first and verify that VOMS access remains working? You might need to add these lines to gplazma.conf:

map sufficient authzdb
session sufficient authzdb

If things are failing, have a look at the log files; in /var/log/dcache there is usually a file that logs messages from gplazma.

Cheers, Christoph


Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2023-11-28 07:57:48

Public Diary:
Hello Christoph,
Thanks for your attention.
For your information, this time I tested replacing the gplazma entries as below, and the dCache service was restarted.
Here is the entire file's content now, but it is failing again with OIDC enabled:

cat /etc/dcache/gplazma.conf
auth optional oidc
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.oidc
map sufficient authzdb
session sufficient authzdb
account requisite banfile
session requisite roles
session sufficient omnisession


Regarding the log files, not a single mention of gplazma can be found in them, as the grep commands below show:

cd /var/log/dcache/

[root@se1 dcache]# grep -Ri 'gplazma' .
[root@se1 dcache]# grep -Ri 'oidc' .

[root@se1 dcache]# grep -Ri 'auth' .
./doorsDomain.access:level=INFO ts=2023-11-28T11:10:43.073+0330 event=org.dcache.ftp.response session=door:GFTP-se1-AAYLMYkoNqg@doorsDomain command="AUTH GSSAPI" reply="334 ADAT must follow"


ls -ltr /var/log/dcache/
-rw-r--r-- 1 dcache dcache 504 Nov 28 11:09 srmDomain.zookeeper
-rw-r--r-- 1 dcache dcache 180 Nov 28 11:09 poolsDomain_se1_mypool.zookeeper
-rw-r--r-- 1 dcache dcache 180 Nov 28 11:09 srmmanagerDomain.zookeeper
-rw-r--r-- 1 dcache dcache 504 Nov 28 11:09 informationDomain.zookeeper
-rw-r--r-- 1 dcache dcache 180 Nov 28 11:09 doorsDomain.zookeeper
-rw-r--r-- 1 dcache dcache 504 Nov 28 11:09 adminDoorDomain.zookeeper
-rw-r--r-- 1 dcache dcache 504 Nov 28 11:09 centralDomain.zookeeper
-rw-r--r-- 1 root root 48 Nov 28 11:14 111
-rw-r--r-- 1 dcache dcache 1120 Nov 28 11:19 informationDomain.access
-rw-r--r-- 1 dcache dcache 51381 Nov 28 11:20 doorsDomain.access


Bests,
Masoud
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2023-11-28 08:16:23

Public Diary:
I also removed the mapping lines, as shown below, but the failure continues in this situation too.

vi gplazma.conf
auth optional x509
auth optional voms
auth optional scitoken

map sufficient authzdb
session sufficient authzdb

map optional gridmap
map optional vorolemap
map optional vogroup vo-group-path=/etc/dcache/vo-group.json
map optional vogroup vo-group-path=/etc/dcache/vo-user.json

account requisite banfile
session requisite roles
session sufficient omnisession



And here is the gplazma file that works fine, regardless of using OIDC:

vi gplazma.conf
auth optional x509
auth optional voms
auth optional scitoken

map optional gridmap
map optional vorolemap
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.vorole
map optional multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.group
map optional vogroup vo-group-path=/etc/dcache/vo-group.json
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.user
map optional vogroup vo-group-path=/etc/dcache/vo-user.json
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.vo
map sufficient multimap gplazma.multimap.file=/etc/dcache/multi-mapfile.unmapped


account requisite banfile
session requisite roles
session sufficient omnisession


Bests,
Masoud

Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Christoph Wissing
Date: 2023-11-28 11:26:57

Public Diary:
It is very hard to debug the setup, because your token-based config is super complex (as introduced by the conversion tool). The CMS recipe is a very simple add-on on top of a working VOMS-based system.

Perhaps it is easiest to take out all token-related configuration first. I suggest removing these lines:

auth optional scitoken
ALL lines with "multimap"

Note that the voms auth plugin and the vorolemap plugin rely on /etc/grid-security/grid-vorolemap with proper mappings to existing local accounts.

Your present config uses omnisession, which is very fancy. What is inside /etc/dcache/omnisession.conf?

If you really want to get to the bottom of it, this is the detailed documentation of how gplazma works:

https://www.dcache.org/manuals/Book-8.2/config-gplazma.shtml

Cheers, Christoph

Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2023-12-04 07:09:09

Public Diary:
Hello Christoph,
I really got to the bottom of the gplazma documentation link that you offered and found nothing unclear regarding our configuration.
As you mentioned, the conversion tools made it difficult for us to migrate to IAM tokens; I think only coordination between the conversion-tool programmers and the writers of the IAM token documentation would result in proper instructions for us to follow.

By the way, I tested the method you previously suggested, which also led to an auth failure again. :(
Besides, the new configuration you requested and the other config files are pasted here.
Many thanks for your attention.

cat gplazma.conf

#auth optional scitoken

#removed ALL lines with "multimap"

auth optional x509
auth optional voms
map optional gridmap
map optional vorolemap
map optional vogroup vo-group-path=/etc/dcache/vo-group.json
map optional vogroup vo-group-path=/etc/dcache/vo-user.json


cat /etc/dcache/omnisession.conf
group:writer root:/ home:/
username:cms_production root:/ home:/


cat /etc/grid-security/grid-vorolemap
## CMS ##
# Need mapping for each VOMS Group(!), roles only for special mapping
"*" "/cms" cmsreadonly
"*" "/cms/Role=cmsuser" cmsreadonly
"*" "/cms/Role=cmsphedex" cmsreadwrite
"*" "/cms/Role=production" cmsreadwrite
"*" "/cms/Role=cmsusr001" cmsusr001
"*" "/cms/Role=lcgadmin" cmsusr001
"*" "/cms/Role=production" cmsprd001
"*" "/cms/Role=priorityuser" cmsana001
"*" "/cms/Role=pilot" cmsusr001
"*" "/cms/Role=hiproduction" cmsprd001
"*" "/cms/dcms/Role=cmsphedex" cmsprd001
"*" "/cms/integration" cmsusr001
"*" "/cms/becms" cmsusr001
"*" "/cms/dcms" cmsusr001
"*" "/cms/escms" cmsusr001
"*" "/cms/ptcms" cmsusr001
"*" "/cms/itcms" cmsusr001
"*" "/cms/frcms" cmsusr001
"*" "/cms/production" cmsusr001
"*" "/cms/muon" cmsusr001
"*" "/cms/twcms" cmsusr001
"*" "/cms/uscms" cmsusr001
"*" "/cms/ALARM" cmsusr001
"*" "/cms/TEAM" cmsusr001
"*" "/cms/dbs" cmsusr001
"*" "/cms/uscms/Role=cmsphedex" cmsusr001
"*" "/cms" cmsusr001

#MASOOD:
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mmosalma/CN=847569/CN=Masoud Mosalman Tabar 10005" "/cms" cmsusr001
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mmosalma/CN=847569/CN=Masoud Mosalman Tabar 10005" "/cms/Role=production" cmsusr001
#"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mmosalma/CN=847569/CN=Masoud Mosalman Tabar 10005" "/cms" cms_vorole
#"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mmosalma/CN=847569/CN=Masoud Mosalman Tabar 10005" "/cms/Role=production" cmsprd_vorole


# vo+role mapping example
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba/CN=148436759/CN=462061404/CN=1744101223/CN=1968232471"
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac" "/atlas" atlas_vokac_vorole
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac" "/atlas/Role=pilot" atlasplt_vokac_vorole
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac" "/atlas/Role=production" atlasprd_vokac_vorole
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac" "/atlas/Role=lcgadmin" atlassgm_vokac_vorole
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac" "/atlas/cz" atlascz_vokac_vorole
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac" "/atlas/de" atlasde_vokac_vorole
"/DC=org/DC=terena/DC=tcs/C=CZ/O=Czech Technical University in Prague/CN=Petr Vokac 252509" "/atlas" atlas_vokac_vorole
"/DC=org/DC=terena/DC=tcs/C=CZ/O=Czech Technical University in Prague/CN=Petr Vokac 252509" "/atlas/Role=pilot" atlasplt_vokac_vorole
"/DC=org/DC=terena/DC=tcs/C=CZ/O=Czech Technical University in Prague/CN=Petr Vokac 252509" "/atlas/Role=production" atlasprd_vokac_vorole
"/DC=org/DC=terena/DC=tcs/C=CZ/O=Czech Technical University in Prague/CN=Petr Vokac 252509" "/atlas/Role=lcgadmin" atlassgm_vokac_vorole
"/DC=org/DC=terena/DC=tcs/C=CZ/O=Czech Technical University in Prague/CN=Petr Vokac 252509" "/atlas/cz" atlascz_vokac_vorole
"/DC=org/DC=terena/DC=tcs/C=CZ/O=Czech Technical University in Prague/CN=Petr Vokac 252509" "/atlas/de" atlasde_vokac_vorole


Best Regards,
Masoud
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-06 20:10:35

Public Diary:
Hallo Masoud,
thanks for your reply and sorry we dropped the ball. Since I see
lots of entries for Petr, are you a multi-VO site or are you supporting
only CMS?
Thanks,
cheers, Stephan
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-06 05:02:55

Public Diary:
Hello Admins,

Since then, no one has answered regarding these issues. Reviewing this entire ticket confirms that we went through all the configurations mentioned here;
however, none of these steps brought us to the goal.
How can we help, and what is next?


Best Regards,
Masoud

Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-08 13:14:51

Public Diary:
Ok, thanks Masoud! We'll check with another site that "simplified" the
auto-migration config if they can share their configs with you.
- Stephan
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-08 04:37:22

Public Diary:
Hello Stephan,
I hope this message finds you well.
It can be just CMS, once we are past the misconfiguration.

Best Regards,
Masoud
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Csaba Hajdu
Date: 2024-04-10 19:22:38

Public Diary:
Hello,

the details of what we did are in this ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=164207

If you need the contents of specific files, let me know which ones.

Cheers:
Csaba
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Chan-anun Rungphitakchai
Date: 2024-04-10 18:53:50

Public Diary:
Involve person(s) has been changed to christoph.wissing@desy.de;Lioudmila.Stepanova@cern.ch.
Hello Csaba and Lioudmila,
It looks like T2_HU_Budapest and T2_RU_INR are token-enabled sites that, like IPM, migrated from DPM to dCache. Could you please help by providing instructions/configuration to the IPM admin for completing the dCache migration and properly configuring IAM tokens?
Thank you very much,
Noy
Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-15 06:40:24

Public Diary:
Hello Csaba,

Thank you for your response. I have thoroughly reviewed your ticket text related to our current issue. Unfortunately, I did not find a solution that precisely addresses our problem.

Therefore, I kindly request that you provide us with your most up-to-date files. These files will allow us to compare configurations and, hopefully, identify the root cause of the trouble.

Please attach the following files:

/etc/dcache/layouts/layout-SITE.conf
/etc/dcache/gplazma.conf
/etc/dcache/multimap-prod-user.conf



If there are any additional files or adjustments relevant to this matter, please inform us accordingly.

Thank you in advance for your cooperation.

Best regards,

Masoud

Internal Diary:
Involved T2_HU_Budapest in this ticket.
GGUS ID: 164275
Last modifier: Csaba Hajdu
Date: 2024-04-16 05:35:17

Public Diary:

Internal Diary:
Added attachment grid143_config_files.zip
https://ggus.eu/index.php?mode=download&attid=ATT119000
GGUS ID: 164275
Last modifier: Csaba Hajdu
Date: 2024-04-16 05:35:18

Public Diary:
Hi Masoud,
here are the requested files.
Cheers:
Csaba

Internal Diary:
Added attachment grid143_config_files.zip
https://ggus.eu/index.php?mode=download&attid=ATT119000
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-16 06:24:34

Public Diary:
Dear Csaba,

Thank you for your prompt response. I greatly appreciate it.

Could you also kindly provide me with your omnisession.conf file? It would be immensely helpful for our investigation.

While reviewing your configuration files, I noticed the following line related to your xrootd door configuration:

[doorsDomain/xrootd]
# Further configuration in /etc/dcache/layouts/dcache-my-xrootd-door.layout.conf!

If you have the file /etc/dcache/layouts/dcache-my-xrootd-door.layout.conf, please attach it as well.

Thank you in advance for your cooperation.

Best regards,

Masoud





Internal Diary:
Added attachment grid143_config_files.zip
https://ggus.eu/index.php?mode=download&attid=ATT119000
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-16 15:06:42

Public Diary:
Many Thanks Csaba!
Hallo Masoud,
please take a look at
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_IR_IPM
it shows the token test results. You can click on the time bin of
interest in the "SAM Service Status" line to get more information
about a failed test. You can also schedule additional tests (after
a change without waiting 15 minutes for the next test to run
automatically) on the ETF page.
https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Ffilled_in%3Dfilter%26host%3Dse1.hep.ipm.ir%26view_name%3Dhost
Thanks,
cheers, Stephan
Internal Diary:
Added attachment grid143_config_files.zip
https://ggus.eu/index.php?mode=download&attid=ATT119000
GGUS ID: 164275
Last modifier: Csaba Hajdu
Date: 2024-04-16 06:48:17

Public Diary:
Hi :)

[root@grid143 ~]# cat /etc/dcache/omnisession.conf

# Omnisession plugin (omnisession) configuration
# ==============================================
# documentation: https://www.dcache.org/old/manuals/Book-8.2/config-gplazma.shtml#omnisession
# location: gplazma.omnisession.file=/etc/dcache/omnisession.conf
#
# This file replaces storage-authzdb used in the past by authzdb plugin.
# It is available already in 6.2.26+ and all 7.2.x, 8.2.x dCache releases.
# You should not use deprecated authzdb plugin and this simple configuration
# where "group:writer" have privileges to the whole namespace should be
# sufficient (fine grained permissions comes from user/group/ACLs).

# set home directories for local users that we support on this storage
#username:vokac home:/users/vokac
#username:admin home:/users/admin root:/

# default root path and home directory for supported VOs
#group:atlascz home:/groups/atlas/cz

username:cms root:/ home:/
username:cms_oidc root:/ home:/

# fallback
group:writer root:/ home:/
group:reader read-only root:/ home:/
DEFAULT root:/ home:/


[root@grid143 ~]# cat /etc/dcache/layouts/dcache-my-xrootd-door.layout.conf

[xrootd-${host.name}Domain]
[xrootd-${host.name}Domain/xrootd]

xrootd.root=/dpm/kfki.hu/home/cms/phedex/
xrootd.plugins=gplazma:ztn,gplazma:gsi,authz:scitokens
#xrootd.plugins=gplazma:ztn,gplazma:gsi,authz:cms-tfc,authz:scitokens
xrootd.security.tls.mode=STRICT

Internal Diary:
Added attachment grid143_config_files.zip
https://ggus.eu/index.php?mode=download&attid=ATT119000
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-16 07:12:42

Public Diary:
Dear Admins,

I appreciate the assistance provided by Csaba regarding the configuration files.

I have successfully adjusted all of my configuration files to the new settings, and dCache 8.2.33 is functioning well.

However, I am unsure how to test the IAM Token. Could you please offer new commands for testing, or if possible, test it yourself and share the feedback with me?

Bests,

Masoud
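For reference on the testing question: token-aware clients such as gfal2 and the XRootD ztn plugin follow the WLCG bearer-token discovery convention, looking at BEARER_TOKEN, then BEARER_TOKEN_FILE, then ${XDG_RUNTIME_DIR}/bt_u<uid>. A sketch of a manual test, assuming a real token has already been obtained from the issuer (the string written below is a placeholder, and the two commented commands target this site's doors):

```shell
# Stage a (placeholder) token where WLCG bearer-token discovery will find it.
# In a real test, replace the dummy string with a token issued by
# https://cms-auth.web.cern.ch/ (e.g. via oidc-agent or the IAM web UI).
token_file="${XDG_RUNTIME_DIR:-/tmp}/bt_u$(id -u)"
printf '%s' "dummy-token" > "$token_file"
export BEARER_TOKEN_FILE="$token_file"

# With a real token staged, these commands would exercise token auth:
#   gfal-ls davs://se1.hep.ipm.ir:443/data/store
#   xrdfs se1.hep.ipm.ir:1094 ls /data/store
echo "token staged at $BEARER_TOKEN_FILE"
```

Running the clients without any X509 proxy in the environment makes it unambiguous that a success came from the token path rather than GSI.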

Internal Diary:
Added attachment grid143_config_files.zip
https://ggus.eu/index.php?mode=download&attid=ATT119000
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-20 05:05:41

Public Diary:
Hello Csaba,

I hope this email finds you well.

I would greatly appreciate it if you could also attach your dcache.conf file here.

Perhaps comparing your dcache.conf with our config file could help address the issue.

Thank you in advance for your assistance.

Best regards,

Masoud

Internal Diary:
Added attachment grid143_config_files.zip
https://ggus.eu/index.php?mode=download&attid=ATT119000
GGUS ID: 164275
Last modifier: Csaba Hajdu
Date: 2024-04-20 16:13:33

Public Diary:

Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-23 05:16:49

Public Diary:
Hello Stephan,

I hope this message finds you in good health.

Our current status, as shown below, indicates that IAM Token for Webdav is functioning well:

https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_IR_IPM&update=auto

However, please be aware that IAM tokens for xrootd are still a work in progress and face some unclear issues.

Best regards,

Masoud



Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Csaba Hajdu
Date: 2024-04-20 16:13:34

Public Diary:
Hi Masoud

attached :)

Cheers:
Csaba
Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-23 22:12:08

Public Diary:
Hallo Masoud,
if I run "xrdfs se1.hep.ipm.ir:1094 ls -l //data/store" from my
desktop here at Fermilab I get a:
[FATAL] TLS error: Unable to validate se1.hep.ipm.ir; hostname not in SAN extension.
error. This suggests your server has multiple names. Can you please check
that your machine's name/FQDN is set to "se1.hep.ipm.ir"? If you need additional
hostnames, for local access, etc., you will need to get a host certificate
that has them listed in the subject alternative name.
That could be the token issue, since the connection is promoted to TLS
in this case.
Thanks,
cheers, Stephan
Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-23 14:49:39

Public Diary:
Thanks for the update Masoud!
(I assume you double-checked the -prefix in the gplazma.oidc.provider!cms entry
of the gplazma section in /etc/dcache/layouts/.conf, right? Do you get more information
on the denied access from the dCache logs?)
Thanks,
cheers, Stephan
Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-24 04:36:53

Public Diary:
Hello Stephan,

This command works from my LXPlus account as well as from the local UI at our site, albeit with a VOMS proxy:

[mmosalma@lxplus931 ~]$ xrdfs se1.hep.ipm.ir:1094 ls -l //data/store
-rw- 2023-09-10 08:53:35 41 //data/store/MMT1
-rw- 2023-09-10 09:00:12 41 //data/store/MMT2
-rw- 2023-09-11 04:40:10 41 //data/store/MMT4
...
..
[mmosalma@lxplus931 ~]$ xrdfs root://se1.hep.ipm.ir query config version
dCache 8.2.33
[mmosalma@lxplus931 ~]$


And for your information, my gplazma/CMS lines in the layout directory are as follows:

[centralDomain/gplazma]
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any https://se1.hep.ipm.ir root://se1.hep.ipm.ir:1094 davs://se1.hep.ipm.ir:443
gplazma.oidc.provider!cms-legacy=https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001" -prefix=/data/ -suppress=audience
gplazma.oidc.provider!cms=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001" -prefix=/data/ -suppress=audience



Our xrootd functions well with a VOMS proxy, but the logs show (login failed/denied) when a token is used.

I appreciate your recommendations, and please let me know if you find anything to change.



Best regards,

Masoud

Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-24 04:46:23

Public Diary:
Hello again,

I wonder whether something should be changed or added in our storage.json file to reflect the new token configuration?



https://gitlab.cern.ch/SITECONF/T3_IR_IPM/-/blob/master/storage.json

[
{ "site": "T3_IR_IPM",
"volume": "IPM_dCache",
"protocols": [
{ "protocol": "WebDAV",
"access": "global-rw",
"prefix": "davs://se1.hep.ipm.ir/data"
},
{ "protocol": "XRootD",
"access": "global-rw",
"prefix": "root://se1.hep.ipm.ir//data"
},
{ "protocol": "SRMv2",
"access": "global-rw",
"prefix": "gsiftp://se1.hep.ipm.ir/data"
}
],
"type": "DISK",
"rse": "T3_IR_IPM",
"fts": [ "https://fts3-cms.cern.ch:8446", "https://lcgfts3.gridpp.rl.ac.uk:8446" ]
}
]



Best regards,

Masoud

Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-24 12:48:00

Public Diary:
Hallo Masoud,
the audience-targets line needs to be
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any se1.hep.ipm.ir
since we are not using URIs but hostnames. The xrdfs command working from CERN was
confusing me yesterday too; whether SSL-to-TLS promotion and fallback happen
depends on the xrootd version. Please do check and make sure the entry in
/etc/hostname is the long/fully-qualified host name and, in case you have other
network interfaces, etc., that they are all listed in the host certificate.
We are phasing out SRMv2/GSIftp/gridFTP. So, yes, you could check your
JobConfig/site-local-config.xml, add a WebDAV entry to storage.xml if needed,
and then remove the SRMv2 entry from storage.json.
Thanks,
cheers, Stephan
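What audience a given token actually carries (for comparison against the gplazma.oidc.audience-targets setting) can be checked locally by base64url-decoding its payload segment. The token below is a hand-built, unsigned dummy used only to demonstrate the decoding pattern; a real token string from the issuer would be inspected the same way:

```shell
# Build a dummy unsigned JWT whose payload carries an "aud" claim
# (illustration only); paste a real token string in its place.
payload='{"aud":"se1.hep.ipm.ir","iss":"https://cms-auth.web.cern.ch/"}'
b64=$(printf '%s' "$payload" | base64 | tr -d '\n=' | tr '+/' '-_')
token="eyJhbGciOiJub25lIn0.${b64}."

# Extract the payload segment, undo the base64url alphabet, re-pad, decode.
seg=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
decoded=$(printf '%s' "$seg" | base64 -d)
echo "$decoded"
```

If the decoded "aud" value is a URI while the server is configured with bare hostnames (or vice versa), the audience check fails exactly as described above.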
Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-24 23:33:11

Public Diary:
Hallo Masoud,
we are now phasing out SRMv2/GSIftp/gridFTP in CMS. (FTS will drop support for
GSI/gridFTP in early May.) Looking at the T3_IR_IPM SITECONF, you have an SRMv2
protocol defined in storage.json. It's not used anymore. Could you please delete
lines 12 to 15 in storage.json. (I assume the T3_IR_IPM RSE is set to auto-update
and the change will propagate automatically to Rucio.)
Many Thanks,
cheers, Stephan
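For clarity, Stephan's requested change applied to the storage.json posted earlier (the SRMv2 stanza dropped, and the then-trailing comma after the XRootD entry removed) would read roughly as this sketch:

```json
[
  { "site": "T3_IR_IPM",
    "volume": "IPM_dCache",
    "protocols": [
      { "protocol": "WebDAV",
        "access": "global-rw",
        "prefix": "davs://se1.hep.ipm.ir/data"
      },
      { "protocol": "XRootD",
        "access": "global-rw",
        "prefix": "root://se1.hep.ipm.ir//data"
      }
    ],
    "type": "DISK",
    "rse": "T3_IR_IPM",
    "fts": [ "https://fts3-cms.cern.ch:8446", "https://lcgfts3.gridpp.rl.ac.uk:8446" ]
  }
]
```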
Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-04-25 13:52:52

Public Diary:
Thanks Masoud for your quick action updating SITECONF and for
checking the hostname/cert. I'll look into this some more this
afternoon when I have some more time and try to think what
else may be causing the token access issue.
Thanks,
- Stephan
Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-04-25 04:23:20

Public Diary:
Hello Stephan,

Thank you kindly.
Our Storage.json file has just been adjusted upon your request.

The layout file changed yesterday as follows:

gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any se1.hep.ipm.ir

And here what these files contain:

[root@se1 ~]# openssl x509 -in /etc/grid-security/hostcert.pem -noout -text | grep se1
Subject: C=IR, O=IRAN-GRID, OU=GCG, CN=se1.hep.ipm.ir

[root@se1 ~]# cat /etc/hostname
se1.hep.ipm.ir

[root@se1 ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
94.184.210.155 se1.hep.ipm.ir se1

Bests,

Masoud
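One note on the certificate check above: the grep only shows the Subject, while the earlier TLS error was specifically about the SAN extension, which can be listed directly. The snippet generates a throwaway self-signed certificate just to demonstrate the inspection command; on the server one would point openssl at /etc/grid-security/hostcert.pem instead:

```shell
# Create a throwaway self-signed cert with a SAN (the -addext/-ext options
# need OpenSSL 1.1.1+); this stands in for the real hostcert.pem.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 1 -subj "/CN=se1.hep.ipm.ir" \
  -addext "subjectAltName=DNS:se1.hep.ipm.ir,DNS:se1"

# List exactly the SAN extension, which is what TLS clients validate:
openssl x509 -in /tmp/demo-cert.pem -noout -ext subjectAltName
```

A certificate whose SAN omits "se1.hep.ipm.ir" would reproduce the "hostname not in SAN extension" failure even when the Subject CN is correct.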

Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-05-12 19:51:26

Public Diary:
Hallo Christoph,
do you know what might be wrong?
Thanks,
- Stephan
Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-05-12 09:25:58

Public Diary:
Hello Admins,

I have reviewed the log files for over two weeks, and I’d like to share my findings below:
For me, configuring the xrootd section in the layout file as shown below makes everything work fine, except for xrootd with tokens.
Here’s the relevant configuration snippet:

[doorsDomain/xrootd]
[xrootd-${host.name}Domain]
[xrootd-${host.name}Domain/xrootd]
xrootd.plugins=gplazma:ztn,gplazma:gsi,authz:scitokens
xrootd.root="PreFix"
xrootd.security.tls.mode=STRICT


And here is the log:
xrootd-se1Domain.log:

May 12 12:19:48 se1 dcache@xrootd-se1Domain:
PropertyAccessException 1:
org.springframework.beans.MethodInvocationException:
Property 'rootPath' threw exception; nested exception is java.lang.IllegalArgumentException:
Not an absolute path: "PreFix"



However, changing the line below to our default path (/data/) causes the entire xrootd service to stop working, with no logs generated (zero information in the xrootd-se1Domain.* files):
xrootd.root=/data/


Could the issues mentioned above be the main cause of the token errors observed in the ETF monitoring? And if so, what could the solution be?
The error message appears as follows:
Starting CMS XRootD token read test of se1.hep.ipm.ir:1094 on 2024-May-12 08:47:42
Read target /data//store/mc/SAM/
--------------------------------
Checking stat of SAM dataset test file
root://se1.hep.ipm.ir:1094/data//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/AE237916-5D76-E711-A48C-FA163EEEBFED.root
[E] xrootd file stat error, XRootDStatus.code=400 "[ERROR] Server responded with an error: [3010] Restriction FullyRestricted denied activity READ_METADATA on /data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/AE237916-5D76-E711-A48C-FA163EEEBFED.root"



Kind regards,
Masoud

Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-06-19 05:52:43

Public Diary:
Hello Stephan,
I hope you are doing well.

As you know, we have made the WebDAV/token service consistent with the new configuration. However, it seems that no one has a solution for the xrootd/token issue that we faced while upgrading our site to the recommended configurations. Therefore, I have collected all the config files from the dCache directory into a zip file for any member of the token programming staff who may find it useful to review.

I look forward to checking the results here to make adjustments on our site.

Bests,
Masoud



Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-06-19 05:51:43




Internal Diary:
Added attachment dcache.conf
https://ggus.eu/index.php?mode=download&attid=ATT119007
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-08-06 04:23:31

Public Diary:
Hello Chan-anun,
Not really. As you can see from the previous messages, no one has been able to resolve the issue, and its cause is still unclear: there is no obviously wrong configuration and no outdated packages affecting XRootD token support.

Regards,
Masoud
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Chan-anun Rungphitakchai
Date: 2024-08-05 20:59:58

Public Diary:
Hello Masoud
Do you have any update about XRootd token support?
Thank you,
Noy
[1]
[crungphi@lxplus9110 ~]$ xrdcp -f root://se1.hep.ipm.ir:1094/data//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root /dev/null
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Restriction FullyRestricted denied access for [READ_DATA] on /data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root (source)

Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-10-28 06:22:41

Public Diary:
Hello Stephan,

After one day of monitoring our SE1, which showed more errors on the newly revised line, I changed it back to reduce the warnings:

#gplazma.oidc.provider!cms=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001 uid:1001 username:cmsusr001" -prefix=/data/ -suppress=audience
gplazma.oidc.provider!cms=https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001 uid:1001 username:cmsusr001" -prefix=/data/ -suppress=audience


Bests,

Masoud



Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-10-28 07:38:18

Public Diary:
Ahhh, my bad. You need a different name, only one entry per name is
allowed. If you make it "cmsnew" that should do.
Thanks Masoud!
- Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-10-28 08:22:47

Public Diary:
I would appreciate it if you could clarify this matter.

Is it enough to replace (username:cmsusr001) with (username:cmsnew), as below:

#gplazma.oidc.provider!cms=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001 uid:1001 username:cmsusr001" -prefix=/data/ -suppress=audience

gplazma.oidc.provider!cms=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001 uid:1001 username:cmsnew" -prefix=/data/ -suppress=audience



Or should this user be added alongside the other physical users, with a /home/ directory and entries in the other configuration files that reference cmsusr001, as follows?

ls -ltr /home/
drwx------ 3 cmsusr001 cms 74 Sep 10 2023 cmsusr001



vo-user.json: "mapped_uname": "cmsusr001"

multi-mapfile.vo:username:cmsusr001 uid:1001

omnisession.conf:username:cmsusr001 root:/ home:/

multi-mapfile.oidc:group:cms username:cmsusr001 uid:1001

multi-mapfile.unmapped:fqan:/cms uid:1001 gid:1001,true username:cmsusr001

multi-mapfile.vorole:group:cms_vorole uid:1001 gid:1001,true username:cmsusr001 group:writer

multimap-prod-user.conf:group:cms username:cmsusr001 uid:1001

And if there is an alternative way to add (cmsnew), please let me know.

Many thanks

- Masoud

Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-10-28 08:30:42

Public Diary:
Hallo Masoud,
the local user name should stay the same, but the entry needs a different name, i.e.
gplazma.oidc.provider!cmsnew=https://cms-auth.cern.ch/ ...
Thanks,
cheers, Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-10-28 09:42:41

Public Diary:
Hallo Masoud,
that line looks good and should not impact the current SAM test.
Did you maybe remove/replace the old line? You need that line in
addition to the old line you had with the gplazma.oidc.provider!cms=...
Thanks,
cheers, Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-10-28 09:24:58

Public Diary:
Hello,

The change has been made as you requested. However, WebDAV, which showed green status before, has now also ended up failing on the monitoring website, together with XRootD.

The new version of this line is:

gplazma.oidc.provider!cmsnew=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001 uid:1001 username:cmsusr001" -prefix=/data/ -suppress=audience



Bests,
Masoud

Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-10-28 10:37:03

Public Diary:
Hello Stephan,

Here is the relevant part of the new version of this file, with the two provider lines (the dCache service has also been restarted):

[centralDomain/gplazma]
gplazma.oidc.audience-targets=https://wlcg.cern.ch/jwt/v1/any se1.hep.ipm.ir
gplazma.oidc.provider!cms=https://cms-auth.web.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001 uid:1001 username:cmsusr001" -prefix=/data/ -suppress=audience
gplazma.oidc.provider!cmsnew=https://cms-auth.cern.ch/ -profile=wlcg -authz-id="group:cms gid:1001 uid:1001 username:cmsusr001" -prefix=/data/ -suppress=audience

Bests,

Masoud
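Stephan's rule above — each `gplazma.oidc.provider!NAME` entry must use a unique NAME — can be checked mechanically. A small sketch (a hypothetical helper, not a dCache tool) that flags duplicate provider names in a config file:

```python
import re
from collections import Counter

def duplicate_provider_names(config_text):
    """Return gplazma.oidc.provider names that appear more than once."""
    names = re.findall(r"^gplazma\.oidc\.provider!(\S+?)=", config_text, re.M)
    return [name for name, count in Counter(names).items() if count > 1]

config = """\
gplazma.oidc.provider!cms=https://cms-auth.web.cern.ch/ -profile=wlcg
gplazma.oidc.provider!cmsnew=https://cms-auth.cern.ch/ -profile=wlcg
"""
print(duplicate_provider_names(config))  # [] -> both entries are kept
```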

Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-10-28 10:50:30

Public Diary:
Looks good and SAM seems to be happy too. Thanks Masoud!
- Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-10-28 11:05:32

Public Diary:

Hello Stephan,

I wonder where you are monitoring our services that indicates good status? The following monitoring sites still show errors for the XRootD token tests (both read and write), similar to the previous days:

https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_IR_IPM&update=auto

https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Ffilled_in%3Dfilter%26host%3Dse1.hep.ipm.ir%26view_name%3Dhost

Bests,
Masoud



Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-10-28 13:02:20

Public Diary:
Hallo Masoud,
I looked at the WebDAV probe in production (which uses the old/current
VOMS and IAM service) and in pre-production (which uses the new VOMS and
IAM service).
For the XRootD token I still need to check with a dCache expert.
Thanks,
cheers, Stephan

https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3F_show_filter_form%3D0%26filled_in%3Dfilter%26host%3Dse1.hep.ipm.ir%26view_name%3Dhost
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-11-02 04:40:44

Public Diary:
Hello Stephan,
The xrootd settings (plugins and security entries) were already the same in the layout.conf file.

I have now added these three lines to the dcache.conf file, and the dCache service has been restarted as well.



pool.mover.xrootd.tpc-authn-plugins=gsi,unix

pool.mover.xrootd.security.tls.mode = OPTIONAL

pool.mover.xrootd.security.tls.require-login = true



Thanks
Regards,
Masoud

Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-11-01 06:54:25

Public Diary:
Hallo Masoud,
Christoph will take a look at your token issue when he is back
in the office / finds some time. In the meantime, I checked with
another colleague who says:
For xrootd protocol, the door layout stanza has to have:
xrootd.plugins=gplazma:ztn,gplazma:gsi,authz:scitokens
xrootd.security.tls.mode=STRICT
on pools we also have:
pool.mover.xrootd.tpc-authn-plugins=gsi,unix
pool.mover.xrootd.security.tls.mode = OPTIONAL
pool.mover.xrootd.security.tls.require-login = true
Could you please double-check those? That way we can already eliminate
the more basic issues.
Thanks,
cheers, Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-11-02 05:59:04

Public Diary:
Thanks Masoud! Looks like this made no change and we need Christoph's expertise. - Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-11-04 16:26:34

Public Diary:
Just to make sure: se1.hep.ipm.ir:1094 is the dCache door, right? Or is this
a native xrootd service running on the door node and redirecting to dCache?
Thanks,
cheers, Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2024-11-05 04:18:40

Public Diary:

Hello Stephan,

There is no service named exactly xrootd in my systemctl list. The only active service is dCache and its components, one of which is xrootd.

Does this answer your question?



systemctl list-units --type=service | grep dcache

dcache@adminDoorDomain.service loaded active running dCache adminDoorDomain domain
dcache@centralDomain.service loaded active running dCache centralDomain domain
dcache@doorsDomain.service loaded active running dCache doorsDomain domain
dcache@informationDomain.service loaded active running dCache informationDomain domain
dcache@poolsDomain_se1_mypool.service loaded active running dCache poolsDomain_se1_mypool domain
dcache@srmDomain.service loaded active running dCache srmDomain domain
dcache@srmmanagerDomain.service loaded active running dCache srmmanagerDomain domain
dcache@xrootd-se1Domain.service loaded activating auto-restart dCache xrootd-se1Domain domain



netstat -ntpul | grep 1094
tcp6 0 0 :::1094 :::* LISTEN 1664/java


Thanks

Regards,

Masoud



Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Stephan Lammel
Date: 2024-11-05 04:30:27

Public Diary:
Ok, Thanks Masoud! I just wanted to make sure we are looking at the right config.
- Stephan
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Masoud Mosalman Tabar
Date: 2025-01-22 07:23:43

Public Diary:
Hello Noy,

Here is what I have done so far. Please let me know if there are any other configs.

vim /etc/xrootd/scitokens.conf
[Global]
onmissing = passthrough
audience = https://wlcg.cern.ch/jwt/v1/any,se1.hep.ipm.ir

[Issuer CMS_IAM]
issuer = https://cms-auth.web.cern.ch/
base_path = /
map_subject = False
default_user = cmsprod

[Issuer CMS]
issuer = https://cms-auth.cern.ch/
base_path = /
map_subject = False
default_user = cmsprod



systemctl restart dcache.target
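For a manual check after such a restart, the xrootd and gfal2 clients locate the token via the WLCG bearer-token discovery convention (BEARER_TOKEN, then BEARER_TOKEN_FILE, then bt_u&lt;uid&gt; under XDG_RUNTIME_DIR or /tmp). A sketch of that lookup order (a hypothetical helper, not part of xrootd or gfal2), useful when an xrdcp test unexpectedly runs without a token:

```python
import os

def discover_bearer_token(env=None, read_file=None):
    """WLCG bearer-token discovery order (sketch, not an official client).

    env and read_file are injectable so the lookup can be tested without
    touching the real environment or filesystem.
    """
    env = os.environ if env is None else env
    if read_file is None:
        read_file = lambda path: open(path).read().strip()
    if env.get("BEARER_TOKEN"):          # 1. token passed directly
        return env["BEARER_TOKEN"].strip()
    if env.get("BEARER_TOKEN_FILE"):     # 2. token stored in a named file
        return read_file(env["BEARER_TOKEN_FILE"])
    uid = os.getuid()
    for base in (env.get("XDG_RUNTIME_DIR"), "/tmp"):  # 3./4. bt_u<uid>
        if base:
            path = os.path.join(base, f"bt_u{uid}")
            if os.path.exists(path):
                return read_file(path)
    return None                          # no token found anywhere
```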




Regards,

Masoud

Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
GGUS ID: 164275
Last modifier: Chan-anun Rungphitakchai
Date: 2025-01-22 06:55:08

Public Diary:
Hello Masoud.
After checking, your XRootD endpoint does not support IAM tokens or X509 via the new k8s issuer. Could you please update your endpoint this week? I have attached the documentation for your reference [2].
Thank you,
Noy
[1]
[crungphi@lxplus807 ~]$ xrdcp -f root://se1.hep.ipm.ir:1094/data//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root /dev/null
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Restriction FullyRestricted denied access for [READ_DATA] on /data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root (source)
[2]
https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDAndTokens#5_The_scitokens_conf_configurati
Internal Diary:
Added attachment dcache.zip
https://ggus.eu/index.php?mode=download&attid=ATT119209
Added attachment grid143_config_files.zip
Added attachment dcache.conf
Added attachment dcache.zip
WLCG tickets (2)
WLCG #1002253 (id:1002253) AsiaPacific - March 2026 - RP/RC OLA performance
State: assigned  |  Priority: less urgent  |  Opened: 2026-04-02 12:36 (2d ago)  |  Updated: 2026-04-02 12:36
Conversation (1 message)
Dear NGI/ROC,

the EGI RC OLA and RP OLA Report for March 2026 has been produced and is available at the following links:
- NGIs reports: http://argo.egi.eu/egi/report-ar/Critical/NGI?month=2026-03 (clicking on the NGI name displays the resource centres' A/R figures)
- RCs reports: http://argo.egi.eu/egi/report-ar/Critical/SITES?month=2026-03

According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre[2] OLAs, the following problems occurred:

============= RC Availability Reliability [2]==========

According to the recent availability/reliability reports, the following sites have performed below the availability target threshold for 3 consecutive months (January, February, March):

IR-IPM-HEP

We are aware of the situation you are facing, and we hope that you and your families are safe.

Nevertheless, we have to track your site's under-performance with this ticket.

**********************

Links:

[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"

[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"

[3] https://confluence.egi.eu/x/SiAmBg "EGI Infrastructure Oversight escalation"

[4] https://confluence.egi.eu/x/0h4mBg "Recomputation of SAM results or availability reliability statistics"

[5] https://docs.egi.eu/providers/operations-manuals/man05_top_and_site_bdii_high_availability/ "top-BDII and site-BDII High Availability"

[6] https://confluence.egi.eu/x/xx4mBg "Quality verification of monthly availability and reliability statistics"

Best Regards,
EGI Operations
WLCG #1002139 (id:1002139) Upgrade your HTCondorCE endpoints to 24.0.x series (IR-IPM-HEP)
State: assigned  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 14:13
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and the endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which is the minimum version that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read carefully the documentation before the upgrade: all the changes with the upgrade must be applied manually, in particular the changes to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
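The version gate described in the ticket (anything below the 24.0.x series is out of support; 24.0.2 is the minimum recommended HTCondor-CE release) can be expressed as a simple tuple comparison. A sketch with a hypothetical helper, not an official EGI or HTCondor tool:

```python
def htcondor_supported(version, minimum=(24, 0, 2)):
    """Return True if a dotted HTCondor-CE version string meets the
    minimum recommended release (24.0.2, per the ticket text)."""
    parts = tuple(int(x) for x in version.split("."))
    return parts >= minimum

# e.g. compare the version string reported by the endpoint:
for v in ("23.0.12", "24.0.2", "24.0.14"):
    print(v, "supported" if htcondor_supported(v) else "UPGRADE NEEDED")
```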
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 100% 100% 100% 40% 0% 0% 0% 0% 36% 100% 95% 78% 100%
HammerCloud: — no data —
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #1002219 (id:1002219) Storage SAM test failure at T3_IT_MIB
State: assigned  |  Priority: less urgent  |  Opened: 2026-03-30 11:20 (5d ago)  |  Updated: 2026-03-30 15:49
Conversation (2 messages)
Good afternoon, MIB admins.
Since 17:00 UTC yesterday (29 Mar), your storage endpoint has been failing the SAM "3version" and "3crt-extension" tests [1]. The log files show "[FATAL] Auth failed: No protocols left to try" and "[SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] ssl/tls alert certificate unknown (_ssl.c:2651)" error messages [2]. Could you please take a look and check this server's status/services/certificate file?
Best regards,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_IT_MIB
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-3version&var-dst_hostname=storm.mib.infn.it&var-timestamp=1774868492096
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-3crt_extension&var-dst_hostname=storm.mib.infn.it&var-timestamp=1774869307136
Thank you. Now the service should be fine again.
Best
Paolo
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
HammerCloud: 0% 0%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (4)

CMS tickets (2)
CMS #681630 (id:1731) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_IT_Trieste
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:01 (430d ago)  |  Updated: 2026-01-19 13:12
Conversation (5 messages)
GGUS ID: 168895
Last modifier: Jakrapop Akaranee
Date: 2024-11-05 10:32:03
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_IT_Trieste
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_IT
Issue type: CMS_SAM tests
Description:
Dear Trieste Site Administrators,
We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack, specifically for the following endpoint:

cmsxrd.ts.infn.it (XrootD [1] and WebDAV [2] )

Could you please review dual stack support on your storage element?
Thank you for your assistance.
Best Regards,
Jakrapop
-----------
[1]https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dcmsxrd.ts.infn.it%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
[2]https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dcmsxrd.ts.infn.it%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
GGUS ID: 168895
Last modifier: Tullio Macorini
Date: 2024-12-17 08:18:37

Status: in progress
Responsible Unit: NGI_IT
Dear Trieste site administrators,

Could you provide any progress, update, or plan regarding dual-stack support for your storage element?

Best,
Jakrapop
Any update? -- Thanks, Noy
Good afternoon, Trieste admins.
Your storage endpoint only supports IPv4. Could you please provide an update or plan for the dual-stack implementation?
Cheers,
Noy
[1]
[crungphi@lxplus816 ~]$ nslookup cmsxrd.ts.infn.it

Server: 127.0.0.1
Address: 127.0.0.1#53
Non-authoritative answer:
Name: cmsxrd.ts.infn.it

Address: 140.105.222.59
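The nslookup above returns only an A record. The dual-stack check can also be scripted with the standard resolver; a sketch (assuming normal DNS resolution — a dual-stack endpoint would report both families True, while an IPv4-only one like the endpoint in this ticket reports ipv6: False):

```python
import socket

def resolvable_families(host):
    """Report whether host resolves to IPv4 and/or IPv6 addresses."""
    families = set()
    try:
        for family, *_ in socket.getaddrinfo(host, None):
            families.add(family)
    except socket.gaierror:
        pass  # name does not resolve at all
    return {"ipv4": socket.AF_INET in families,
            "ipv6": socket.AF_INET6 in families}

# e.g. resolvable_families("cmsxrd.ts.infn.it") on a machine with working
# DNS would show whether an AAAA record has been published.
```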
CMS #681978 (id:2079) Update HTCONDOR config for new issuer token support at T3_IT_Trieste
State: in progress  |  Priority: very urgent  |  Opened: 2025-02-03 10:25 (425d ago)  |  Updated: 2025-07-21 13:34
Conversation (6 messages)
GGUS ID: 169417
Last modifier: Chan-anun Rungphitakchai
Date: 2024-12-16 19:47:02
Subject: Update HTCONDOR config for new issuer token support at T3_IT_Trieste
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: NGI_IT
Issue type: Other
Description:
Hello Trieste admin.
There is a new k8s issuer server (https://cms-auth.cern.ch). Your HTCondor endpoint does not support token authentication with the new issuer [1]. Could you please add the configuration needed to support the new CMS issuer?
The updated documentation is attached [2].
Thank you,
Noy
[1]
https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dce2.ts.infn.it%26site%3Detf%26view_name%3Dhost
[2]
https://twiki.cern.ch/twiki/bin/view/LCG/HTCondorCEtokenConfigTips#The_mapfile
GGUS ID: 169417
Last modifier: Tullio Macorini
Date: 2024-12-17 08:13:45

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
Hallo Ale,
i just realized all access to storage at T2_IT_Bari failed starting
yesterday around 10 am your time, old and new VOMS extensions. Token
access works, so something went likely wrong in adding the VOMS service.
Thanks,
cheers, Stephan
Internal Diary:
Escalated this ticket to NGI_IT
Any update on this ticket? -- Thanks, Noy
Corrected the ticket assignment, which seems to have gotten messed up during the import. - Stephan
Can you please add support for the K8s IAM instance today/tomorrow. ETF production will switch on Monday. Your site would then fail SAM tests.
Thanks,
- Stephan
Any update? -- Thanks, Noy
WLCG tickets (2)
WLCG #1002138 (id:1002138) Upgrade your HTCondorCE endpoints to 24.0.x series (INFN-TRIESTE)
State: assigned  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-19 14:13
Conversation (1 message)
Dear site admins,

The HTCondorCE v23 series (and older) became unsupported, and the endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint out of support or because you provide HTCondorCE endpoint(s) but we couldn't determine the version by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

Instead, if you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which is the minimum version that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read carefully the documentation before the upgrade: all the changes with the upgrade must be applied manually, in particular the changes to the new syntax for the SSL mapping.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
WLCG #681638 (id:1739) Upgrade to a supported HTCondor version and enable SSL authentication (INFN-TRIESTE)
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:01 (430d ago)  |  Updated: 2025-02-03 09:58
Conversation (3 messages)
GGUS ID: 164002
Last modifier: Alessandro Paolini
Date: 2023-11-03 11:26:53
Subject: Upgrade to a supported HTCondor version and enable SSL authentication (INFN-TRIESTE)
Ticket Type: USER
CC:
Status: assigned
Responsible Unit: NGI_IT
Issue type: Middleware
Description:
Dear site admins,

with this ticket we would like to follow up on the upgrade to a supported version of HTCondorCE and the migration from VOMS-based authentication with X509 certificates to AAI tokens for accessing the HTCondorCE endpoints.

The HTCondor team set up an upgrade procedure to help sites and VOs with the migration from X509 personal certificates to tokens.
Essentially, an intermediate step was created in which plain SSL authentication can be used to authenticate a client's proxy, in addition to the GSI or token mechanisms:
- https://confluence.egi.eu/x/EYAtDQ

In summary, the steps are:

- update to HTCondor 9.0.19
- enable the SSL authz (with priority over GSI)
- map the users' DNs
- test the SSL authz successfully
- update to latest HTCondor 10.x

You can find the HTCondor 9.0.19 version in WLCG repository for the time being, as explained in the instructions.

Please also note the usage in the last step of the HTCondor Feature channel (https://htcondor.org/htcondor/release-highlights/index.html#feature-channel), since it is the one supporting the EGI Check-in plugin from 10.4.0.
In this way sites can accept clients' proxies and tokens at the same time while waiting for the supported VOs to move completely to tokens.
You can find the latest HTCondor 10.x version in the HTCondor Feature Channel repository.

Please note that after the upgrade to HTCondor 10 version, you need to install and configure the EGI Check-in plugin in order to be compliant with the EGI tokens:
https://github.com/EGI-Federation/check-in-validator-plugin

Please get in contact with your supported communities to properly map the users' DNs to local accounts to ensure also the access via X509 personal certificates.

Concerning the ops VO, please map at least the following certificates:
- EGI Monitoring Service:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-egi@grnet.gr"
"/DC=EU/DC=EGI/C=HR/O=Robots/O=SRCE/CN=Robot:argo-egi@cro-ngi.hr"

- EGI Security monitoring:
"/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:argo-secmon@grnet.gr"

Please also properly configure the Accounting settings on the HTCondor 10 installation, as explained in the instructions.

Thanks for your collaboration,
EGI Operations
GGUS ID: 164002
Last modifier: Tullio Macorini
Date: 2023-11-06 10:19:12

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
Ciao Andrea,
we didn’t enjoy the holidays so much as we were switched off since July due to a fault of the UPSes (2) and we are just getting out of the tunnel. We should switch back online by this week and we plan to resume the activity of both upgrading the nodes to el9 and adding IPv6 in the coming weeks. This holds for both INFN-ROMA1 and INFN-ROMA1-CMS, as we share the computing facility.
Thanks,

Alessandro
Internal Diary:
Escalated this ticket to NGI_IT
GGUS ID: 164002
Last modifier: Alessandro Paolini
Date: 2025-01-23 12:09:26

Status: in progress
Responsible Unit: NGI_IT
Public Diary:
yes, the migration hasn't been done yet.
Internal Diary:
Escalated this ticket to NGI_IT
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 92% 99% 100% 100%
HammerCloud: — no data —
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

No open GGUS tickets

-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 88% 77% 80% 96% 89% 94% 93% 96% 94% 99% 100% 97% 93% 81% 58% 84%
HammerCloud: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #1000922 (id:1000922) Report proper benchmark name in the accounting reports (KR-KNU-T3)
State: in progress  |  Priority: less urgent  |  Opened: 2025-10-21 14:07 (165d ago)  |  Updated: 2025-12-22 06:09
Conversation (2 messages)
Dear colleagues,
The WLCG is currently monitoring the migration of site infrastructure to the HEPSCORE23 benchmark. As part of this effort, we are tracking which benchmarks sites are reporting.
According to APEL data, your site is submitting accounting records without a valid benchmark name. In some cases, the benchmark is reported as Si2K, which is now obsolete; in others, the benchmark name cannot be resolved by APEL and remains undefined.
If your site is sending normalized summary records, please ensure that you are using the new accounting record format as described in the documentation:
https://docs.google.com/document/d/19EerfwmzhUM4gLijOK9XxAM46vg1E2oSJvWVoEltvRA/edit?tab=t.0#heading=h.vlqqhvvfi4e4
To resolve the issue, please configure the correct benchmark definition. This can be set either in BDII or in your local APEL configuration file. Detailed instructions can be found in the documentation:
https://twiki.cern.ch/twiki/bin/view/LCG/AccountingClientHEPscoreDeployment

We kindly ask you to make the necessary changes before the end of 2025.
Thank you for your cooperation.
Hello,

We have upgraded our accounting client at KR-KNU-T3 and are now publishing HEPscore23 benchmarks.
Please let us know if any further changes are needed.

Best regards,
Han.
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 99% 97% 51% 70% 87% 100% 100% 100% 19% 0% 0% 0% 60% 99% 100% 53%
HammerCloud: — no data —
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #683577 (id:3713) how to add user to site admin for T3_KR_UOS
State: in progress  |  Priority: urgent  |  Opened: 2025-06-20 03:00 (288d ago)  |  Updated: 2025-11-04 14:55
Conversation (3 messages)
Dear administrator,
I am an administrator of T3_KR_UOS.
I would like to add a user as an admin for T3_KR_UOS.
The user is Woojin Jang (wjang, woojin.jang@cern.ch).
If there is a way to add a user as an admin, please let me know.
I don't know how to add him.
Thank you.
Best regards,
Youngkwon
```
$ rucio-admin rse info T3_KR_UOS
:
CE_config.server: cms.sscc.uos.ac.kr:1094
T3_KR_UOS: True
cms_type: real
country: KR
fts: https://fts3-cms.cern.ch:8446,https://cmsfts3.fnal.gov:8446
lfn2pfn_algorithm: cmstfc
pnn: T3_KR_UOS
quota_approvers: sahan,yojo,namsoo,jlee
reaper: True
rule_approvers: jlee,sahan,namsoo,yojo
site_admins: jlee,sahan,namsoo,yojo
tier: 3
update_from_json: True
:
$ rucio-admin account info wjang
2025-06-20 11:56:44,429 WARNING This method is being deprecated. Please replace your command with `rucio account show`
suspended_at : None
account_type : USER
created_at : 2020-04-28T23:31:59
email : woojin.jang@cern.ch
account : wjang
status : ACTIVE
deleted_at : None
updated_at : 2020-04-28T23:31:59
```
Hello Youngkwon, could you please check again? The account is already in the cms-KR_UOS-admin group.
Thank you,
Noy
Hello Youngkwon,
CMSRucio syncs the site administrators based on e-groups, in this case cms-KR_UOS-admin. I believe you as a member of cms-KR_UOS-exec should be able to add wjang as a member of said e-group. Once this is done, it may take up to 12 hours for the cronjob to propagate the change into rucio.

Let me know if you have any further questions.

Best,
Juan Pablo Salas
CMS DM Team
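Once the e-group change has propagated, the `site_admins` attribute in the `rucio-admin rse info` output quoted above should include the new user. A sketch of checking that, with the sample line from the ticket embedded in place of a live query:

```shell
# The site_admins line as it appears in `rucio-admin rse info T3_KR_UOS`
# above; in practice this would come from a live query.
rse_info='site_admins: jlee,sahan,namsoo,yojo'
admins=${rse_info#site_admins: }
if echo "$admins" | tr ',' '\n' | grep -qx 'wjang'; then
  echo "wjang is a site admin"
else
  echo "wjang not yet synced from the e-group"
fi
```

With the sample data from the ticket, this reports that wjang has not yet been synced, matching the situation described in the conversation.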
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
HammerCloud: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #681818 (id:1919) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_MX_Cinvestav
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:47 (430d ago)  |  Updated: 2025-03-04 13:32
Conversation (3 messages)
GGUS ID: 169477
Last modifier: Jakrapop Akaranee
Date: 2024-12-19 14:49:12
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_MX_Cinvestav
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: CMS_SAM tests
Description:
Dear Cinvestav Site Administrators,

We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack network, specifically for the following endpoint:

proton.fis.cinvestav.mx (XrootD [1] and WebDAV [2] )

Could you please review dual stack support on your storage element?
Thank you for your assistance.
Best Regards,Jakrapop
-----------
[1] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dproton.fis.cinvestav.mx%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
[2]https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dproton.fis.cinvestav.mx%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
Assigning missing CMS site name error during import to new GGUS.
Jakrapop
Dear Cinvestav site administrators,

Could you provide any progress, update, or plan regarding dual-stack support for your storage element?

Best regards,
Jakrapop
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
HammerCloud: — no data —
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

No open GGUS tickets

-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
HammerCloud: 0% 0% 0% 0% 0%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

No open GGUS tickets

-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 68% 100% 100% 100% 97% 100% 100% 100% 100% n/a n/a n/a n/a 0% 82% 97%
HammerCloud: 99% 99% 100% 100% 99% 100% 100% 100% 100% 100% 100%
FTS: — no data —

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #1001058 (id:1001058) TW-QMUL transfers failing
State: on hold  |  Priority: less urgent  |  Opened: 2025-11-05 10:27 (150d ago)  |  Updated: 2026-03-04 13:46
Conversation (35 messages)
Dear site-admins,
We noticed that transfers from TW to QMUL are failing, typically timing out after several minutes.

It looks like 6 minutes after the HTTP-TPC pull request to QMUL, their storage gives up with a connection timeout. Somebody at QMUL should try to manually download a file from TW, e.g.
voms-proxy-init -voms atlas
curl --verbose --silent --cert /tmp/x509up_u$(id -u) --key /tmp/x509up_u$(id -u) --cacert /tmp/x509up_u$(id -u) --capath /etc/grid-security/certificates -L -o /dev/null https://eos01.grid.sinica.edu.tw:9000/eos/atlas/atlasscratchdisk/SAM/1M
We guess this fails the same way with a connection timeout, and then QMUL should try to understand why they cannot reach the TW site.
Maybe there is asymmetric routing QMUL <-> TW, which can cause trouble if there is a firewall on the path.
Could you please investigate why QMUL cannot reach the TW site?

Best regards,
Yi-Ru
I don't have ATLAS permissions. Is there a dteam file I can try?
Is there a perfsonar node we can test against?

we have
https://perfsonar-latency.esc.qmul.ac.uk/

https://perfsonar-bandwidth.esc.qmul.ac.uk/
adding UK cloud support into the loop (they should have necessary permissions)
From my observations
Transfers from TW to QM have 0% success, with timeout errors.
TW is the only site where we have these errors.
Transfers from QM to TW look OK.

I don't see any other sites with the same error + rates of failure, including other UK sites.

We run StoRM (webdav) as our transfer endpoint; other StoRM sites seem OK.
I'm unable to ping eos01.grid.sinica.edu.tw if I use a packet size larger than 1452 when using IPv6. I think the endpoint is MTU 1500 (we have 9000), so I guess this is an issue with path MTU discovery?

ping -s 1452 eos01.grid.sinica.edu.tw

PING eos01.grid.sinica.edu.tw(eos01.grid.sinica.edu.tw (2001:c08:ffff:ffff:ffff:ffff:fffa:85d)) 1452 data bytes
1460 bytes from eos01.grid.sinica.edu.tw (2001:c08:ffff:ffff:ffff:ffff:fffa:85d): icmp_seq=1 ttl=46 time=241 ms
1460 bytes from eos01.grid.sinica.edu.tw (2001:c08:ffff:ffff:ffff:ffff:fffa:85d): icmp_seq=2 ttl=46 time=241 ms
1460 bytes from eos01.grid.sinica.edu.tw (2001:c08:ffff:ffff:ffff:ffff:fffa:85d): icmp_seq=3 ttl=46 time=241 ms
1460 bytes from eos01.grid.sinica.edu.tw (2001:c08:ffff:ffff:ffff:ffff:fffa:85d): icmp_seq=4 ttl=46 time=241 ms
^C
--- eos01.grid.sinica.edu.tw ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 240.968/241.026/241.063/0.035 ms
[root@se04 ~]# ping -s 1453 eos01.grid.sinica.edu.tw

PING eos01.grid.sinica.edu.tw(eos01.grid.sinica.edu.tw (2001:c08:ffff:ffff:ffff:ffff:fffa:85d)) 1453 data bytes
^C
--- eos01.grid.sinica.edu.tw ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6149ms
works:

tracepath -l 1400 eos01.grid.sinica.edu.tw
1: _gateway 0.285ms
2: 2a01:56c0:a020:1::1 0.268ms
2: 2001:630:0:9001:10::29 1.701ms asymm 3
3: 2001:630:0:9001:10::29 0.453ms
4: ae26.londtw-sbr2.ja.net 0.680ms
5: ae30.londpg-sbr2.ja.net 1.010ms
6: no reply
7: lag-2-0.rt0.lon.uk.geant.net 2.192ms
8: lag-4-0.rt0.ams.nl.geant.net 6.935ms
9: 2001:c08:aabb::1001 6.970ms
10: no reply
11: 2001:c08:aaaa::1:2 242.880ms asymm 14
12: 2001:c08:aaaa::2:100a 247.359ms asymm 15
13: 2001:c08:aaaa::2:101e 244.252ms asymm 16
14: 2001:c08:aaaa::2:1006 241.457ms asymm 17
15: no reply
16: eos01.grid.sinica.edu.tw 241.187ms !A
Resume: pmtu 1400
does not work:

tracepath eos01.grid.sinica.edu.tw

1?: [LOCALHOST] 0.007ms pmtu 9000
1: _gateway 0.270ms
1: _gateway 0.268ms
2: 2a01:56c0:a020:1::1 0.249ms
3: no reply
4: no reply
5: no reply
6: no reply
7: no reply
8: no reply
9: no reply
10: no reply
11: no reply
12: no reply
13: no reply
14: no reply
15: no reply
16: no reply
17: no reply
18: no reply
19: no reply
20: no reply
21: no reply
22: no reply
23: no reply
24: no reply
25: no reply
26: no reply
27: no reply
28: no reply
29: no reply
30: no reply
Too many hops: pmtu 9000
Resume: pmtu 9000
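The manual probing with payload sizes 1452/1453 (and later 8952/8953) above is effectively a binary search for the path-MTU cutoff. A sketch of that search, with `probe_ok` standing in for a real `ping -6 -M do -s SIZE host` check; here it simulates a 1500-byte path MTU:

```shell
# probe_ok simulates `ping -6 -M do -s SIZE host` against a 1500-byte
# path MTU: a payload fits iff payload + 48 bytes of headers <= MTU.
probe_ok() { [ "$1" -le 1452 ]; }

lo=1232   # known-good payload (IPv6 minimum MTU 1280 - 48 header bytes)
hi=8952   # known-bad payload (local MTU 9000 - 48 header bytes)
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  if probe_ok "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "largest working payload: $lo (implied path MTU $((lo + 48)))"
```

Replacing `probe_ok` with a real ping against the affected host would locate the cutoff in about a dozen probes instead of guessing sizes by hand.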
a transfer from the WNs with curl also times out [1]

* Operation timed out after 300822 milliseconds with 0 out of 0 bytes received
* Closing connection 0
the same command works from Manchester UI

[1] https://bigpanda.cern.ch//media/filebrowser/e167c353-ee84-40be-90d8-3058c35691ef/user.aforti/tarball_PandaJob_6873229574_UKI-LT2-QMUL/payload.stdout
Dear Daniel, Ale - thanks!
So what is next here? Can we somehow involve the site network experts / network provider experts on both sides? Any other ideas on how to track path MTU discovery issues?
Cheers,
Ivan
Hi,
I have added WLCG Network Throughput to the notified groups. That could help.
Is Manchester on Jumbo frames?
It is possible to test ping with disabled IPv6 path MTU discovery ... I hope this could give some insight into issues related to PMTU diagnostics. Could somebody with more networking experience explain, e.g., why for packet sizes 8921-8952 there is no response for path MTU discovery?
[ui1.farm.particle.cz]$ mtr -6 -w -b -z -c 20 webdav.esc.qmul.ac.uk
Start: 2025-11-05T16:10:22+0100
HOST: ui1.farm.particle.cz Loss% Snt Last Avg Best Wrst StDev
1. AS2852 router.farm.particle.cz (2001:718:401:6025::1) 0.0% 20 1.8 0.9 0.7 1.8 0.2
2. AS2852 2001:718:409:6::31 0.0% 20 0.8 0.9 0.7 1.5 0.2
3. AS2852 cesnetzikova-bgpovc.ipv6.pasnet.cz (2001:718:1e00:1::2) 0.0% 20 1.0 1.7 0.9 5.6 1.4
4. AS20965 cesnet.rt1.pra.cz.geant.net (2001:798:13:10aa::1) 0.0% 20 1.0 1.0 0.9 1.3 0.1
5. AS20965 cesnet.rt1.pra.cz.geant.net (2001:798:13:10aa::1) 0.0% 20 0.7 0.8 0.6 1.5 0.3
6. AS20965 lag-6-0.rt0.fra.de.geant.net (2001:798:cc:1::c5) 0.0% 20 7.2 7.3 7.2 7.4 0.1
7. AS20965 lag-8-0.rt0.ams.nl.geant.net (2001:798:cc::21) 0.0% 20 13.8 14.0 13.8 14.7 0.2
8. AS20965 lag-4-0.rt0.lon.uk.geant.net (2001:798:cc::26) 0.0% 20 18.6 18.7 18.5 20.4 0.4
9. AS20965 lag-2-0.rt0.lon2.uk.geant.net (2001:798:cc:1::2a) 0.0% 20 19.2 19.4 19.2 19.9 0.2
10. AS20965 janet-bckp-gw.mx1.lon2.uk.geant.net (2001:798:99:1::7e) 0.0% 20 19.8 21.9 19.8 58.5 8.6
11. AS786 ae30.londtw-sbr2.ja.net (2001:630:0:10::1ce) 0.0% 20 20.3 23.9 20.3 62.5 10.0
12. AS786 ae26.londtw-ban1.ja.net (2001:630:0:10::252) 0.0% 20 20.3 20.4 20.3 21.2 0.2
13. AS198864 2a01:56c0:b020:7::1 0.0% 20 20.5 20.6 20.4 22.0 0.4
14. AS198864 2a01:56c0:a020:1::a 0.0% 20 20.4 20.6 20.4 21.8 0.3
15. AS198864 se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c) 0.0% 20 26.7 21.9 20.4 26.7 2.0

[ui1.farm.particle.cz]$ ping -6 -M do -c 3 -s 8953 se03.esc.qmul.ac.uk
PING se03.esc.qmul.ac.uk(se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c)) 8953 data bytes
ping: local error: message too long, mtu: 9000
ping: local error: message too long, mtu: 9000
ping: local error: message too long, mtu: 9000
--- se03.esc.qmul.ac.uk ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2056ms

[ui1.farm.particle.cz]$ ping -6 -M do -c 3 -s 8952 se03.esc.qmul.ac.uk
PING se03.esc.qmul.ac.uk(se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c)) 8952 data bytes
--- se03.esc.qmul.ac.uk ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2034ms

[vokac@ui1 ~]$ ping -6 -M do -c 3 -s 8921 se03.esc.qmul.ac.uk
PING se03.esc.qmul.ac.uk(se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c)) 8921 data bytes
--- se03.esc.qmul.ac.uk ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2037ms

[ui1.farm.particle.cz]$ ping -6 -M do -c 3 -s 8920 se03.esc.qmul.ac.uk
PING se03.esc.qmul.ac.uk(se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c)) 8920 data bytes
8928 bytes from se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c): icmp_seq=1 ttl=50 time=20.5 ms
8928 bytes from se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c): icmp_seq=2 ttl=50 time=20.4 ms
8928 bytes from se03.esc.qmul.ac.uk (2a01:56c1:10:1000::c224:b2c): icmp_seq=3 ttl=50 time=20.4 ms
--- se03.esc.qmul.ac.uk ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.374/20.408/20.460/0.037 ms
This behavior starts with 2a01:56c0:b020:7::1 which is inside AS198864 (Queen Mary and Westfield College, University of London). The earlier hop in JANET AS786 seems to work fine.
> Is Manchester on Jumbo frames?
no
@Peter: this is likely because ethernet, ip and icmp headers are added on top of the length specified (so there will be 46 bytes in addition)
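For the IPv6 pings quoted above, the payload cutoffs line up with the fixed header sizes: the IPv6 header is 40 bytes and the ICMPv6 echo header 8 bytes, so the largest `ping -6` payload that fits a given path MTU is MTU - 48 (Ethernet framing does not count against the IP MTU). A quick sketch:

```shell
# Largest `ping -6 -s` payload that fits a path MTU:
# MTU - 40 (IPv6 header) - 8 (ICMPv6 echo header).
overhead=$((40 + 8))
for mtu in 1500 9000; do
  echo "MTU ${mtu} -> max payload $((mtu - overhead))"
done
```

This reproduces the observed boundaries in the thread: 1452 works while 1453 does not, and 8953 is rejected locally against the 9000-byte interface MTU.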
Path MTU issues are quite difficult to find, as any network equipment along the path could cause the problem. I suggest trying to simplify the path first, as TW -> QMUL goes via L3 networks while QMUL -> TW goes via GEANT.

# pscheduler task trace --dest perfsonar-bandwidth.esc.qmul.ac.uk --ip-version 6 --first-ttl 1 --source lhc-bandwidth.twgrid.org --wait PT1M
...
8 2001:b200:2000:9:0:9264:11:1 (2001:b200:2000:9:0:9264:11:1) 0.9 ms
AS9505 TWGATE-AP Taiwan Internet Gateway, TW
9 2001:b200:100:1c:513:713:1111:1 (2001:b200:100:1c:513:713:1111:1) 1.0 ms
AS9505 TWGATE-AP Taiwan Internet Gateway, TW
10 2001:b200:100:e:513:301:1181:2 (2001:b200:100:e:513:301:1181:2) 124.6 ms
AS9505 TWGATE-AP Taiwan Internet Gateway, TW
11 ae9-1797.edge1.Washington1.Level3.net (2001:1900:2100::2ccd) 124.0 ms
AS3356 LEVEL3, US
12 No Response
13 2001:1900:5:2:2:0:110:1e2 (2001:1900:5:2:2:0:110:1e2) 244.0 ms
AS3356 LEVEL3, US
14 ae24.londtt-sbr1.ja.net (2001:630:0:10::39) 245.6 ms
AS786 JANET Jisc Services Limited, GB
15 ae28.londtw-sbr2.ja.net (2001:630:0:10::2e) 244.8 ms

...

So I would redirect this ticket to Academia Sinica to check with their network provider why L3 is preferred over GEANT for this destination. Once this is resolved, and if the issue persists, we can then check using perfSONARs within GEANT and try to narrow it down. I will also inform Jisc about this issue.
After checking with our network team:
It seems that we don't receive the QMUL routing entry via the ESNET route. So, to QMUL, our only path will go via AS9264.
However, it seems that GEANT doesn't announce the QMUL entry to the AS9264 router, while advertising AS9264 to the QMUL router. Hence, from ASGC to QMUL, our only path goes through the commercial route via L3, then QMUL. From QMUL, on the other hand, traffic goes through GEANT to ASGC.
I think, firstly, we may need to figure out why we don't receive the QMUL routing entry from ESNET.
Hi, I cannot judge if it is related, but in the last few hours all transfers to and from the QMUL site fail, either with "HTTP 409 : Conflict, File not Found with url IP|HOST:PORT/PATH" or
"Result curl error (35): SSL connect error after 1 attempts" errors.
This seems a problem with QMUL storage being down.
Btw, in the meanwhile I have started a mail thread with ESnet support folks with all of you in CC since I am not sure they have a support unit (or access at all) to GGUS/helpdesk. If I have missed anyone interested, please let me know.
OK, the ESnet folks cannot help us since:
"QMUL is not on LHCONE. That's true of all the ATLAS T2 sites in the UK."
So, now, what is the next step? Felix, if not ESnet, who should we ask next?
@Felix Lee a couple of comments back, you write :
"I think, firstly, we may need to figure out why we don't receive QMUL routing entry from ESNET."
Just to confirm: in the above you mean GEANT, not ESNET, don't you?
I note that the issue started in late October; before that, transfers looked OK.
atlas ddm transfers, source TW, destination QMUL.
Let me intersperse my comments in between what Felix Lee writes:

"After checking with our network team.
It seems that we don't receive QMUL routing entry from ESNET route. So, to QMUL, our only path will go via AS9264"
QMUL isn't in LHCONE, so I _think_ that's expected.
"However, it seems that GEANT doesn't announce QMUL entry to AS9264 router, while advising AS9264 to QMUL router. Hence, ASGC to QMUL, our only path will go through the commercial route to L3, then QMUL. From QMUL, on the other hand, goes through GEANT to ASGC."
My colleagues below seem to think QMUL is being advertised by Geant - but I may be misunderstanding their answer.
"I think, firstly, we may need to figure out why we don't receive QMUL routing entry from ESNET."

I'd not expect that - as QMUL isn't in LHCONE.

I've talked to a couple of colleagues around routing:
Rob Evans says:
GEANT is receiving the QMUL routes. Using the GEANT looking glass:

inet.0: 19890 destinations, 44254 routes (19542 active, 0 holddown, 2674 hidden)
+ = Active Route, - = Last Active, * = Both

138.37.0.0/16 *[BGP/170] 1w4d 18:56:59, MED 200, localpref 150, from 62.40.96.25
AS path: 786 198864 I, validation-state: unknown
> to 62.40.98.60 via ae3.0, Push 14025
to 62.40.98.164 via ae8.0, Push 14025, Push 14029(top)
[BGP/170] 1w5d 19:14:35, MED 300, localpref 150
AS path: 786 198864 I, validation-state: unknown
> to 62.40.124.198 via ae10.0
161.23.0.0/16 *[BGP/170] 1w4d 18:56:59, MED 200, localpref 150, from 62.40.96.25
AS path: 786 198864 I, validation-state: unknown
> to 62.40.98.60 via ae3.0, Push 14025
to 62.40.98.164 via ae8.0, Push 14025, Push 14029(top)
[BGP/170] 1w5d 19:14:35, MED 300, localpref 150
AS path: 786 198864 I, validation-state: unknown
> to 62.40.124.198 via ae10.0
192.135.231.0/24 *[BGP/170] 1w4d 18:56:59, MED 200, localpref 150, from 62.40.96.25
AS path: 786 198864 I, validation-state: unknown
> to 62.40.98.60 via ae3.0, Push 14025
to 62.40.98.164 via ae8.0, Push 14025, Push 14029(top)
[BGP/170] 1w5d 19:14:35, MED 300, localpref 150
AS path: 786 198864 I, validation-state: unknown
> to 62.40.124.198 via ae10.0
194.36.8.0/22 *[BGP/170] 1w4d 18:56:59, MED 200, localpref 150, from 62.40.96.25
AS path: 786 198864 I, validation-state: unknown
> to 62.40.98.60 via ae3.0, Push 14025
to 62.40.98.164 via ae8.0, Push 14025, Push 14029(top)
[BGP/170] 2w5d 19:40:10, MED 300, localpref 150
AS path: 786 198864 I, validation-state: unknown
> to 62.40.124.198 via ae10.0

AS9264’s route advertisements appear to be non-trivial, judging by the number of different paths in which they appear in the same looking glass, so I suggest we’ll need to talk to GEANT to figure out what the export policy to 9264 is.

Rob

And a follow up by David Richardson:
From the checks I’m able to issue via GEANT routers, it appears the QMUL routes are being advertised to AS9264 at Amsterdam:

A:nren@rt0.ams.nl# show router bgp neighbor as-number 9264

===============================================================================
BGP Neighbor
===============================================================================
-------------------------------------------------------------------------------
Peer : 202.169.175.205
Description : -- Peering with ASNET-TW --
Group : GEANT_RE
-------------------------------------------------------------------------------
Peer AS : 9264 Peer Port : 58707
Peer Address : 202.169.175.205
Local AS : 20965 Local Port : 179
Local Address : 202.169.175.206
Peer Type : External Dynamic Peer : No
State : Established Last State : Established

...

===============================================================================
BGP Router ID:62.40.96.16 AS:20965 Local AS:20965
===============================================================================
Legend -
Status codes : u - used, s - suppressed, h - history, d - decayed, * - valid
l - leaked, x - stale, > - best, b - backup, p - purge
Origin codes : i - IGP, e - EGP, ? - incomplete

===============================================================================
BGP IPv4 Routes
===============================================================================
Flag Network LocalPref MED
Nexthop (Router) Path-Id IGP Cost
As-Path Label
-------------------------------------------------------------------------------

A:nren@rt0.ams.nl# show router bgp neighbor 202.169.175.205 advertised-routes | match 161.23.0.0/16 post-lines 2
i 161.23.0
After checking with our network team, the asymmetric routing issue for IPv4 has been resolved, but the issue for IPv6 still remains.

# tracepath perfsonar-bandwidth.esc.qmul.ac.uk
1?: [LOCALHOST] 0.002ms pmtu 1500
1: _gateway 9.351ms
1: _gateway 6.779ms
2: fd00::3509 9.311ms
3: 2001:c08:aaaa::2:1005 3.412ms
4: 2001:c08:aaaa::2:101d 25.956ms
5: 2001:c08:aaaa::2:1009 4.750ms
6: 2001:c08:aaaa::1:1 0.559ms
7: 2001:c08:7f:1::9 0.551ms
8: 2001:b200:2000:9:0:9264:11:1 0.954ms
9: 2001:b200:100:1c:513:713:1111:1 1.129ms
10: 2001:b200:100:e:513:301:1181:2 124.828ms
11: ae9-1797.edge1.Washington1.Level3.net 125.501ms
12: no reply
13: 2001:1900:5:2:2:0:110:1e2 247.242ms asymm 14
14: ae24.londtt-sbr1.ja.net 246.856ms asymm 13
15: ae28.londtw-sbr2.ja.net 246.880ms asymm 12
16: ae26.londtw-ban1.ja.net 246.808ms asymm 13
17: 2a01:56c0:b020:7::1 247.541ms asymm 14
18: 2a01:56c0:a020:1::a 247.443ms asymm 15
19: perfsonar-bandwidth.esc.qmul.ac.uk 247.391ms !A
Resume: pmtu 1500
OK - so we believe that routes to perfsonar-bandwidth.esc.qmul.ac.uk should be available via Geant.

My colleague David has checked:
2001:630:11::/48 is from the Jisc /32 allocation.

Looking on the GEANT router, ASNET are being sent the Jisc aggregate route:

Peer : 2001:c08:aabb::1001
Description : -- IPv6 Peering with ASNET-TW --
Group : GEANT_RE6

nren@rt0.ams.nl# show router bgp neighbor 2001:c08:aabb::1001 advertised-routes ipv6 | match 2001:630 post-lines 2
i 2001:630::/32 n/a 0
2001:c08:aabb::1002 None 1608
20965 786

And the additional entries from the QMUL RIPE allocation:
2a01:56c0::/32
2a01:56c1:10::/44

The other two IPv6 subnets received from QMUL by Janet are also passed to GEANT and onward to AS9264:


nren@rt0.ams.nl# show router bgp neighbor 2001:c08:aabb::1001 advertised-routes ipv6 | match 2a01:56c0: post-lines 2
i 2a01:56c0::/32 n/a 0
2001:c08:aabb::1002 None 1608

20965 786 198864 -

nren@rt0.ams.nl# show router bgp neighbor 2001:c08:aabb::1001 advertised-routes ipv6 | match 2a01:56c1:10 post-lines 2
i 2a01:56c1:10::/44 n/a 0
2001:c08:aabb::1002 None 1608
20965 786 198864
Our network team has adjusted the network configuration for AS9264, and the IPv6 asymmetric routing issue has been resolved.
Thanks for the information. But unfortunately, all ATLAS transfers from TW-FTT to UKI-LT2-QMUL are still failing:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&theme=dark&var-binning=%24__auto_interval_binning&var-groupby=activity&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Functional+Test&var-activity=Express&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=SFO+to+EOS+export&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Recall&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=All&var-src_country=All&var-src_cloud=All&var-src_site=TW-FTT&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=All&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-LT2-QMUL&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_cloud&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Express&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1763738185822&to=1764342985824&viewPanel=6
Hi,
are there any news? The transfers are still timing out between these two sites.
Hi,
Due to an MTU issue, transfers from TW-FTT to UKI-LT2-QMUL are still failing.
We are not allowed to change the route settings from AS9264 to QMUL.
We have requested assistance from the DDM ops team to temporarily unset the TW–QMUL distance configuration, and this distance parameter has now been temporarily unset.

Best regards,
Yi-Ru
Hi,
the distance was unset: https://its.cern.ch/jira/browse/ATLDDMOPS-5803
Hi, at Jisc (UK NREN, running Janet) we've been looking at paths and MTU info using the perfSONAR server in TW, which we believe to be:

lhc-bandwidth.twgrid.org and lhc-latency.twgrid.org, reporting tags of ATLAS, USATLAS and WLCG.
That seems to be running 5.2.3, so is up to date.

However, the remote server reports that pscheduler is unavailable. Perhaps remote 3rd party tests are disabled?

Anyway, we don't believe there's anything more we can do to help. Our engineer says the routing looks OK, we support a 9000 MTU should that be in use, and we do not filter PMTUD.

If there is something we can do, or test, though, please let us know.
Tim/Chris
While I was trying to understand DNS issues with another TW GGUS ticket, I found different LHCONE routing for the grid.sinica.edu.tw authoritative DNS servers ns.twgrid.org (2001:c08:ffff:ffff:ffff:ffff:fffa:863) and ns1.twgrid.org (2001:c08:ffff:ffff:ffff:ffff:fffa:502). The CERN looking glass for the LHCONE IPv6 BGP table shows that TW advertises /121, /122 and /123 IPv6 prefixes … is it really expected / supported to use such small prefixes within the LHCONE VRF?
Hi Petr
Since the LHCONE routing table is still small, there's no limit to the size of the prefixes announced to LHCONE.
/121 or longer are usually used for point-to-point links and may get announced into LHCONE by the network providers.
Edoardo
Hi,
what is the status of this issue?
I don't think there has been any action on this ticket. I'm putting it on hold until we join LHCONE, which should be some time in the summer.
-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 93% 100% 100% 94% 100% 84% 96% 100% 97% 100% 97% 93% 96% 100% 100% 100%
HammerCloud: 97% 99% 99% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
FTS: — no data —

Open GGUS tickets (3)

WLCG tickets (3)
WLCG #1002104 (id:1002104) UKI-LT2-RHUL XCache server down
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-18 14:53 (17d ago)  |  Updated: 2026-04-02 10:33
Conversation (14 messages)
Hello,
Over the last 14 hours, the XCache server storage040.ppgrid1.rhul.ac.uk (134.219.225.227:1094) at UKI-LT2-RHUL has been down.

https://monit-grafana.cern.ch/d/efc1p7lzjwb9cc/atlas-xcache-availability?orgId=17&from=1773787633932&to=1773845233932

Could you please have a look?
Thanks, Matthias
Already on it but thanks for the report.
I restarted it and it's gone back to green. Let me know if the problem returns.
Dear experts,
Since ~7 hours ago, the XCache at UKI-LT2-RHUL, storage052.ppgrid1.rhul.ac.uk (134.219.225.239:1094), has been very unstable and down most of the time. You can check the status here:
https://monit-grafana.cern.ch/d/efc1p7lzjwb9cc/atlas-xcache-availability?orgId=17&from=1773716411264&to=1774062011265

Would you please check?
Thanks!

Minghui for the ADCOS team
Thanks, I have restarted it.
I had some errors in the xcache log like this:
{"log":"[2026-03-21 18:28:35.756163 +0000][Error][AsyncSock] [u6031@meitner.tier2.hep.manchester.ac.uk:1094.0] Unable to initiate the connection: [ERROR] Socket error: network is unreachable\n","stream":"stderr","time":"2026-03-21T18:28:35.756284878Z"}

{"log":"[2026-03-21 18:28:37.052885 +0000][Error][AsyncSock] [u6032@xrootd.echo.stfc.ac.uk:1094.0] Unable to initiate the connection: [ERROR] Socket error: network is unreachable\n","stream":"stderr","time":"2026-03-21T18:28:37.053045521Z"}

{"log":"[2026-03-21 18:28:37.053189 +0000][Error][AsyncSock] [u6032@xrootd.echo.stfc.ac.uk:1094.0] Unable to initiate the connection: [ERROR] Socket error: network is unreachable\n","stream":"stderr","time":"2026-03-21T18:28:37.053336208Z"}
But as usual it's hard to tell if something has gone wrong with the xcache or if it's a real issue with the remote site.
The 10Gb/s link between the site and JANET looked fairly full this morning, which could potentially be contributing.
One of the problems we see with RHUL as a source is the number of connections left open. We should check what version you are using, because neither Oxford nor Birmingham causes that.
For example, on a random DTN ~3/4 of the connections are from RHUL; the other DTNs are similar.
[root@dtn01 ~]# ss -tpr | wc -l
4414
[root@dtn01 ~]# ss -tpr | grep 134.219.225 | wc -l
3116
what version of xcache obviously.
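The per-source tally above could be automated with a small filter that counts established peers per remote address from `ss -tn` output. A sketch, with an embedded sample standing in for real output on a DTN (the local addresses, ports, and counts are invented for illustration; only IPv4 peers are handled):

```shell
# Count established TCP connections per remote IPv4 address from
# `ss -tn state established` style output (field 4 is the peer).
count_peers() {
  awk 'NR > 1 { split($4, a, ":"); print a[1] }' | sort | uniq -c | sort -rn
}

# Embedded sample standing in for live `ss` output on a DTN:
count_peers <<'EOF'
Recv-Q Send-Q Local Address:Port Peer Address:Port
0      0      10.0.0.1:1094      134.219.225.227:40001
0      0      10.0.0.1:1094      134.219.225.227:40002
0      0      10.0.0.1:1094      192.168.5.9:51000
EOF
```

Fed from a real `ss -tn state established`, this would show at a glance whether one cache host (here the RHUL address) dominates the connection table.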
Thanks Alessandra.
xrootd-5.9.1-1.2.osg24.el9.x86_64
xcache-4.1.0-1.osg24.el9.x86_64
xrootd-rucioN2N-for-Xcache-1.2-3.4.osg24.el9.x86_64
We use Ilija's container hub.opensciencegrid.org/usatlas/atlas-xcache
I'm adding Ilija maybe he has an explanation.
@Simon, that hub link doesn't work, if that is the correct one. @Ilija, in case you didn't read the previous post, the problem is that the RHUL xcache has thousands of connections open on our storage. We already discussed this at some point, several months ago, and you suggested this wouldn't happen with more recent versions. xrootd 5.9.1 seems recent enough though, so it might be a config problem? Since Manchester has seen an increase in errors in the past 2 months that might point to delayed communication between redirectors and gateways, I'm keen to rule out the RHUL number of connections as a problem. At any rate, maybe we should point RHUL to RAL (though doing it without a DT will cause errors).
Adding also Vip, their xcache doesn't have so many connections, maybe it might help to compare notes.
Just checked: RHUL is indeed pointing to RAL as its primary storage endpoint; the order is as follows [1]. This is the same order as Oxford.
At Oxford, Vip is using the xrootd version from EPEL instead of the OSG version, but Oxford has also deployed the xcache directly on an EL9 node, and they're not using Ilija's container.
@Simon, do you think it's possible to move to a similar setup as Oxford, since they are not seeing this issue?

[1]

Pandaq | Status | Storage Endpoints

UKI-LT2-RHUL (online):
RAL-LCG2-ECHO_DATADISK (PR)
RAL-LCG2-ECHO_DATADISK (PW)
RAL-LCG2-ECHO_SCRATCHDISK (PR)
UKI-LT2-QMUL_LOCALGROUPDISK (PR)
UKI-NORTHGRID-MAN-HEP-CEPH_DATADISK (PW)

UKI-LT2-RHUL_VP (online):
RAL-LCG2-ECHO_DATADISK (PR)
RAL-LCG2-ECHO_SCRATCHDISK (PR)
UKI-LT2-QMUL_DATADISK (PR)
UKI-LT2-QMUL_SCRATCHDISK (PR)
UKI-LT2-QMUL_SCRATCHDISK (PW)
UKI-LT2-RHUL_VP_DISK (PR)
UKI-NORTHGRID-MAN-HEP-CEPH_DATADISK (PR)
UKI-NORTHGRID-MAN-HEP-CEPH_SCRATCHDISK (PR)
UKI-NORTHGRID-MAN-HEP-CEPH_SCRATCHDISK (PW)
Hi Brij,
thanks for checking.
Only the production queues point to RAL. All 3 sites' VP queues have been pointing to Manchester for a while now: RHUL and BHAM both for read and write, and Oxford for read (last modification 24/11/24, when RAL was having network problems). We didn't change them back because, without going into DT, there will be job errors.
WLCG #1002155 (id:1002155) Upgrade your HTCondorCE endpoints to 24.0.x series (UKI-LT2-RHUL)
State: in progress  |  Priority: urgent  |  Opened: 2026-03-19 14:13 (16d ago)  |  Updated: 2026-03-24 13:36
Conversation (4 messages)
Dear site admins,

The HTCondorCE v23 series (and older) has become unsupported, and the endpoints running it should be either decommissioned or upgraded to the 24.0.x series.

You received this ticket either because you provide at least one HTCondorCE endpoint that is out of support, or because you provide HTCondorCE endpoint(s) whose version we couldn't determine by looking into the BDII.

If you are running a supported version of HTCondor, please let us know which one it is, make sure that the endpoints are properly published in the BDII (which will make it easier to carry out activities like this one), and then close the ticket.

If instead you are running an unsupported version, we ask you to upgrade it as soon as possible.
In the UMD repository you can find HTCondor-CE 24.0.2 and HTCondor 24.0.14, which are the minimum versions that we recommend.
Please check the full release notes of the 24.0.x series (https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-24-0.html) and pay attention to the differences between v23.0.x and v24.0.x in terms of settings and features (for example the different syntax used for the SSL mapping).
Please read the documentation carefully before the upgrade: all the changes must be applied manually, in particular the changes for the new SSL mapping syntax.

The quick configuration guide for HTCondor24 created by WLCG can be useful for the upgrade process: https://twiki.cern.ch/twiki/bin/view/LCG/MiniHTCv24EL9

Thanks for your collaboration,
EGI Operations
@Simon George: Hi Simon, did you see this ticket?
Thanks Daniela.
Note that if you are running HTCondor 25, it has dropped the BDII, so it will generate a false positive.
WLCG #1001623 (id:1001623) Failed uploads from RHUL WNs to other sites
State: in progress  |  Priority: less urgent  |  Opened: 2026-01-23 09:18 (71d ago)  |  Updated: 2026-03-18 12:10
Conversation (13 messages)
Dear site admins,
we have recently had a constant low level of upload failures from RHUL WNs to other sites (mostly to RAL). The errors are mostly timeouts, e.g. from the last 12 hours we have:
```
$ grep -o 'GError(.*)' ./Failed\ Operations-data-01_23_2026\,\ 08_47_33\ AM.csv | sort | uniq -c
1 GError('DESTINATION CHECKSUM curl error (28): Timeout was reached', 110)
29 GError('DESTINATION MAKE_PARENT timeout of 10s', 110)
11 GError('DESTINATION MAKE_PARENT timeout of 10s', 110)
72 GError('DESTINATION OVERWRITE timeout of 10s', 110)
8 GError('TRANSFER ERROR: Copy failed (streamed). Last attempt: curl error (28): Timeout was reached (destination)', 9)
```
Could you please have a look?
Best Regards,
Alex
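As a cross-check, the same per-error tally can be done in a few lines of Python; a sketch, assuming the input is the failed-operations CSV from the grep example above (the sample rows below are illustrative):

```python
import re
from collections import Counter

def tally_gerrors(lines):
    """Count each distinct GError(...) message, like the grep | sort | uniq -c above."""
    pattern = re.compile(r"GError\(.*\)")
    return Counter(m.group(0) for line in lines if (m := pattern.search(line)))

# Illustrative rows; the real input is the exported Failed Operations CSV
sample = [
    "...,GError('DESTINATION OVERWRITE timeout of 10s', 110)",
    "...,GError('DESTINATION OVERWRITE timeout of 10s', 110)",
    "...,GError('DESTINATION MAKE_PARENT timeout of 10s', 110)",
]
for count, message in sorted((n, msg) for msg, n in tally_gerrors(sample).items()):
    print(count, message)
```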
Just to note that so far I don't see any network issues, nor any correlation of this issue with ATLAS, our main VO.
Can I ask, how long has this been going on? It sounds like the last 12 hours are given as an example of a much longer-standing problem? What % of jobs are failing in this way?
It started around 14th of January (see plot attached). Around 3% of all uploads are failing (which does not necessarily cause job failures, since on failure we try to upload to other SEs).
Best Regards,
Alex
Hi, is it possible for you to identify a few of the jobs with this problem, as recent as possible, and point me to your log files or the error so I can see the file names that failed to download and timestamps?
Hi,
Here are the 3 most recent jobs that experienced failures:
```
htcondorce://htc01.ppgrid1.rhul.ac.uk/3889917.7
htcondorce://htc01.ppgrid1.rhul.ac.uk/3890831.4
htcondorce://htc01.ppgrid1.rhul.ac.uk/3890397.1

```

The corresponding error messages are below (respectively):
```
Failed to copy /tmp/1262647510/00342434_00086027_1.sim to
https://webdav.echo.stfc.ac.uk:1094/lhcb:buffer/lhcb/MC/2024/SIM/00342434/0008/00342434_00086027_1.sim:
GError('DESTINATION OVERWRITE timeout of 10s', 110)
Failed to copy /tmp/1262705174/00342428_00114403_1.sim to
https://webdav.echo.stfc.ac.uk:1094/lhcb:buffer/lhcb/MC/2024/SIM/00342428/0011/00342428_00114403_1.sim:
GError('DESTINATION MAKE_PARENT timeout of 10s', 110)
Failed to copy /tmp/1262707390/00342434_00102077_1.sim to
https://webdav.echo.stfc.ac.uk:1094/lhcb:buffer/lhcb/MC/2024/SIM/00342434/0010/00342434_00102077_1.sim:
GError('DESTINATION OVERWRITE timeout of 10s', 110)

```
Error messages, problematic URLs and WNs where the job was running can be found here:
https://monit-grafana.cern.ch/d/J_sO5bo4z/dms-errors?from=now-1h&orgId=46&to=now&var-executionSite=LCG.UKI-LT2-RHUL.uk&var-operationType=All&var-target=All&var-user=All&viewPanel=2
Identifying the exact job may be a bit tricky from the WN side; the best option would probably be to try to match the source file (which is usually included in the error log) to the job.
Best Regards,
Alex
Hi,
It looks like the issue is similar to the Manchester one [1] -- sometimes it takes more than 10 seconds to establish an SSL connection to the RAL gateways, e.g.
Connecting to 2001:630:54:e:82f6:b332::
depth=2 C=UK, O=eScienceRoot, OU=Authority, CN=UK e-Science Root
verify error:num=19:self-signed certificate in certificate chain
verify return:1
depth=2 C=UK, O=eScienceRoot, OU=Authority, CN=UK e-Science Root
verify return:1
depth=1 C=UK, O=eScienceCA, OU=Authority, CN=UK e-Science CA 2B
verify return:1
depth=0 C=UK, O=eScience, OU=CLRC, L=RAL, CN=ceph-svc16.gridpp.rl.ac.uk
verify return:1
CONNECTED(00000003)
[..]
Verify return code: 19 (self-signed certificate in certificate chain)
---
DONE
real 0m18.204s
user 0m0.015s
sys 0m0.013s
The above output is from `node217.cm.cluster`.
Best Regards,
Alex
[1] https://helpdesk.ggus.eu/#ticket/zoom/1001702
Hi,
Seems like the issue is very similar to the Manchester one -- for some reason it takes quite some time to (unsuccessfully) attempt to resolve RHUL IPv6 addresses. E.g. for `2001:630:113::b8` (which is your WNs' NAT or something like that, I guess?) I get
$ time nslookup -debug '2001:630:113::b8'
;; Got SERVFAIL reply from 130.246.56.240, trying next server
;; Got SERVFAIL reply from 130.246.181.239, trying next server
------------
QUESTIONS:
8.b.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.1.1.0.0.3.6.0.1.0.0.2.ip6.arpa, type = PTR, class = IN
ANSWERS:
AUTHORITY RECORDS:
ADDITIONAL RECORDS:
------------
** server can't find 8.b.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.1.1.0.0.3.6.0.1.0.0.2.ip6.arpa: SERVFAIL

real 0m11.034s
user 0m0.008s
sys 0m0.017s
Is there a way to speed up these DNS queries? It's OK if the lookup fails as long as it happens quickly.
Best Regards,
Alex
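Incidentally, the ip6.arpa name queried above can be derived from the address with Python's stdlib, which makes it easy to pre-compute the PTR names that need DNS entries (a minimal sketch):

```python
import ipaddress

# The worker-node address from the nslookup example above
addr = ipaddress.IPv6Address("2001:630:113::b8")

# reverse_pointer yields the name a PTR record must exist under
print(addr.reverse_pointer)
# → 8.b.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.1.1.0.0.3.6.0.1.0.0.2.ip6.arpa
```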
Thanks very much. This could be a very useful insight.
2001:630:113::b8 is the address assigned to node184. All worker nodes have a public IPv6 address assigned; we do not use NAT for them. None of them are in the DNS. I was not aware that this was necessary. I've seen similar issues of a strange delay like this affecting ATLAS data fetches via our xcache; that was solved, but never understood, by disabling IPv6 for the xcaches.
I've put in a call to our networks team about why reverse lookup is not working in our DNS even when I put in a new entry explicitly.
Hi,
Indeed, xrootd does reverse DNS lookups for connected clients unless it is explicitly asked not to do so (see e.g. [1] and the `xrd.network` configuration directive). Though it is still a bit strange that it takes a few seconds for the RAL DNS servers to complete the unsuccessful reverse lookup..
Best Regards,
Alex
[1] https://github.com/xrootd/xrootd/blob/master/src/Xrd/XrdInet.cc#L95
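For reference, if fixing the PTR records proves difficult, the lookup can also be switched off on the server side. A sketch of the relevant xrootd configuration line (assumes a recent xrootd; check the `xrd.network` documentation for your version before deploying):

```
# Sketch: tell xrootd not to reverse-resolve client addresses
xrd.network nodnr
```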
Hi,
Any news on the reverse PTR records?
Best Regards,
Alex
Sorry about the delay. I'm waiting for a response to an internal ticket; I will try reminding them.

FYI still waiting for a response locally, I am chasing.
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM88%93%92%66%87%89%86%89%81%97%93%77%89%91%93%82%
HammerCloud100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%
FTS— no data —

Open GGUS tickets (1)

WLCG tickets (1)
WLCG #1002061 (id:1002061) changes proposed for ALICE job properties (UKI-SOUTHGRID-OX-HEP)
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-11 21:33 (23d ago)  |  Updated: 2026-04-02 10:36
Conversation (17 messages)
Dear colleagues,
to help improve job efficiencies, ALICE would be interested in
changing the core count and/or lifetime of ALICE production jobs.

Please let us know which of these changes can be tried at your site:

1. Increasing the core count from 8 to 16?
2. Increasing the lifetime to 48 hours?
3. Increasing the lifetime to 72 hours?

If a given change is found to cause issues, it will be reverted.

If you prefer ALICE jobs to keep running as they are, that also works.

Thanks for your consideration!
Hi,
We used to have a system in place for periodically part-draining worker nodes to make room for 8-core jobs to start. With the advent of high core-count worker nodes we removed the "draining" mechanism, as 8-core jobs seemed to be starting fine on their own. It would be interesting to see if 16-core jobs can get a fair chance to start without intervention. So yes, please test 16-core jobs at our site.

We regularly have to drain and reboot worker nodes to apply kernel security patches. With fewer, larger core-count worker nodes, it can be very inefficient to wait for a few longer jobs to complete before being able to reboot; this can tie up a large number of cores! If you are happy for us to kill longer-running ALICE jobs during a patch/reboot run, then we are happy to run longer jobs. Otherwise we would prefer to run shorter jobs.

Cheers,
Mike
Hi Mike, I propose we test those changes in steps, starting with 16 cores and see how that goes.

We have a way to tell long-running pilots to start draining, but in general we do not mind
if a whole bunch of jobs need to be killed because of occasional maintenance, thanks!
Cheers, I'll keep an eye out for new 16-core jobs.
Hi again, all those jobs are refused by your CE:

Matchmaking, MaxSlotsPerJob problem, ExecutionTarget: 8 (MaxSlotsPerJob) JobDescription: 16 (NumberOfProcesses)

Do you want to look into that or would you rather prefer we stick with 8 cores for now?
Aha. Hadn't considered that. Sorry.
We will be upgrading to ARC 7 and HTCondor 25 in the very near future. Maybe we should just revert to 8 cores for now. I'll make sure the new config allows for 16-core jobs.

I'll let you know when we are ready.

Cheers,
Mike
OK, thanks!
Shall we just close this ticket for now?
We've updated our ARC config to allow 16-core jobs. Would you like to try again?
Cheers.
Hi Mike, we have adjusted our configuration and these are the first few jobs:

gsiftp://t2arc01.physics.ox.ac.uk:2811/jobs/rj2MDmnPGL9n3KTBFm6idEhqfjZhOmABFKDmg6AMDmABFKDmIqdcfn
gsiftp://t2arc01.physics.ox.ac.uk:2811/jobs/ezTNDmoPGL9n3KTBFm6idEhqfjZhOmABFKDmF7AMDmABFKDm6HqbRo
gsiftp://t2arc01.physics.ox.ac.uk:2811/jobs/eitKDmoQGL9n3KTBFm6idEhqfjZhOmABFKDmXeBMDmABFKDmMYuxln
gsiftp://t2arc01.physics.ox.ac.uk:2811/jobs/ljNLDmpQGL9n3KTBFm6idEhqfjZhOmABFKDm6eBMDmABFKDmHAmTWo
Hi again, the first of those jobs has started in the meantime:

Job: gsiftp://t2arc01.physics.ox.ac.uk:2811/jobs/rj2MDmnPGL9n3KTBFm6idEhqfjZhOmABFKDmg6AMDmABFKDmIqdcfn
Name: JAliEn-1773403398072-0
State: Running
I can see 3 running and one in the idle queue

VO : Job id : #CPUs: Running time: %CPU Eff
ALICE_multicore: 8337386.0: 16: 0 day 01:45:09: 103.66%
ALICE_multicore: 8337388.0: 16: 0 day 00:32:24: 94.83%
ALICE_multicore: 8337387.0: 16: 0 day 01:16:21: 100.58%
The 16-core jobs seem to be running fine; if you are happy, please close the ticket.
Thanks
Vip
Hi Vip, might we still try a longer TTL?
Hi Maarten,
Our HTCondor configuration currently periodically removes jobs with more than 16 days of CPU time, so at the moment we can't run 16-core jobs for more than 24 hours of wall time (16 cores × 24 h ≈ 16 CPU-days). I think it would be reasonable to increase this. Let us have a bit more of a think about it and get back to you.

Cheers,
Mike
Hi guys, if it all becomes too tricky, we'll just settle for what is doable in practice.
OK. Let's leave it at this for now. When we upgrade to ARC 7 we can loosen the limits.
Hi Michael, can we try going back to 8 cores with a TTL of 48 hours instead?
Is a wall-clock time of 48 hours supported by your configuration?
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM34%96%82%94%96%100%99%100%100%99%97%98%90%92%92%82%
HammerCloud97%99%100%100%99%100%100%100%100%100%100%100%100%100%100%100%
FTS— no data —

Open GGUS tickets (4)

WLCG tickets (4)
WLCG #1001975 (id:1001975) UKI-SCOTGRID-GLASGOW transfer errors as destination
State: in progress  |  Priority: urgent  |  Opened: 2026-03-04 10:09 (31d ago)  |  Updated: 2026-04-02 14:27
Conversation (21 messages)
Dear experts,
UKI-SCOTGRID-GLASGOW shows low transfer efficiency as destination over the last 11 hours.

Potentially related ticket: https://helpdesk.ggus.eu/#ticket/zoom/3657
Example err message:
HTTP stat: HTTP 403 : Permission refused ~26.69 K
MKDIR: FAILURE - Unable to create the subdirectories for the destination: ~2.74 K

Link:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?from=1772535600000&orgId=17&to=1772622000000&var-activity=Analysis%20Input&var-activity=Analysis%20Output&var-activity=Data%20Carousel%20Analysis&var-activity=Data%20Carousel%20Production&var-activity=Data%20Challenge&var-activity=Data%20Consolidation&var-activity=Data%20Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional%20Test&var-activity=Production%20Input&var-activity=Production%20Output&var-activity=Recovery&var-activity=Staging&var-activity=T0%20Export&var-activity=T0%20Recall&var-activity=T0%20Tape&var-activity=T0%20Tape%20Derived&var-activity=T0%20Tape%20RAW&var-activity=User%20Subscriptions&var-activity=default&var-activity_disabled=Analysis%20Input&var-activity_disabled=Data%20Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional%20Test&var-activity_disabled=Production%20Input&var-activity_disabled=Production%20Output&var-activity_disabled=Staging&var-activity_disabled=User%20Subscriptions&var-binning=$__auto_interval_binning&var-columns=src_cloud&var-dst_cloud=All&var-dst_country=All&var-dst_endpoint=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_token=All&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-groupby=dst_experiment_site&var-include=&var-include_es_dst=All&var-include_es_src=All&var-measurement=ddm_transfer&var-protocol=All&var-remote_access=All&var-retention_policy=raw&var-rows=dst_experiment_site&var-src_cloud=All&var-src_country=All&var-src_endpoint=All&var-src_site=All&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_token=All
Example detail:

2026/03/04 10:04:52 | Functional Test | tests | step14.16903.85431.recon.ESD.77518.21301
Error: HTTP stat: HTTP 403 : Permission refused (transfer-failed)
CA-SFU-T2_DATADISK → UKI-SCOTGRID-GLASGOW-CEPH_DATADISK | 6.0 s | 1.05 MB | davs
FTS job: https://fts4-atlas.cern.ch:8449/fts/webui/#/job/b4fa638c-17a1-11f1-9ec6-d693b16e0cf0
Source: davs://lcg-se1.sfu.computecanada.ca:2880/atlas/atlasdatadisk/rucio/tests/e5/c1/step14.16903.85431.recon.ESD.77518.21301?copy_mode=pull
Destination: davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/tests/e5/c1/step14.16903.85431.recon.ESD.77518.21301
(remaining raw monitoring fields: 1048576, -171938043-1772618692000, 1772618714, 1772611820000, step14.16903.85431.recon.ESD.77518, tests, UNKNOWN, UNKNOWN, true, UNKNOWN, 441705159, 1772618707000)
Yesterday, functional tests were moved to FTS4 with a slightly different gfal2 configuration:
* CERN FTS3 ATLAS: HTTP PLUGIN configured with RETRIEVE_BEARER_TOKEN=true
* CERN FTS4 ATLAS: HTTP PLUGIN configured with RETRIEVE_BEARER_TOKEN=false

and with the Glasgow storage implementation we are hitting a bug where MKDIR returns failure even though it "creates" the directory, e.g.

# test with SE-token (FTS3 configuration)
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=true' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir1
# now the directory was successfully created, so the next attempt fails
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=true' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir1
gfal-mkdir error: 17 (File exists) - HTTP 405 : Method Not Allowed, File Exist with url https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir1

# test with X.509 VOMS proxy (FTS4 configuration)
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=false' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir2
gfal-mkdir error: 1 (Operation not permitted) - HTTP 403 : Permission refused
# fails, but when I try the same command again, then according to the error message the directory was indeed created
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=false' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir2
gfal-mkdir error: 17 (File exists) - HTTP 405 : Method Not Allowed, File Exist with url https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir2

Functional tests for Glasgow were moved back to FTS3, so these transfer failures should disappear, but that doesn't mean the implementation issue at Glasgow is solved. RAL doesn't have this issue with MKDIR failing when called with an X.509 proxy.

Could you fix your storage so that MKDIR does not fail with an X.509 proxy?
The transfer efficiency seems to have recovered after 4 March 12am UTC+1.
$ gfal-copy /etc/hosts davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/tests/e5/c1/pants
Copying file:///etc/hosts [FAILED] after 0s
gfal-copy error: 17 (File exists) - TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 403 : Permission refused (destination)

but works on the internal door

$ gfal-copy /etc/hosts davs://cephc08.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/tests/e5/c1/pants

Copying file:///etc/hosts [DONE] after 0s

What is the difference? Should we use the internal door for WAN too?

Cheers,
Rod.
Hi Rod,

All the FTS transfers are going via the external door (and V6), and work fine?
Please don't use the internal door for external traffic.

Looks like you completely missed the issue described earlier; could you fix the external doors?
(let's hope GGUS doesn't break formatting again for tests that can be used to reproduce this issue)

# test with SE-token
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=true' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir1
# now directory was successfully created, so next attempt correctly fails
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=true' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir1
gfal-mkdir error: 17 (File exists) - HTTP 405 : Method Not Allowed, File Exist with url https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir1

# test with X.509 VOMS proxy
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=false' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir2
gfal-mkdir error: 1 (Operation not permitted) - HTTP 403 : Permission refused
# fails, but when I try the same command again, then according to the error message the directory was indeed created
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=false' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir2
gfal-mkdir error: 17 (File exists) - HTTP 405 : Method Not Allowed, File Exist with url https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_mkdir2
Just to be clear: the fact that `gfal-mkdir` fails with an X.509 VOMS proxy is what should be fixed.
hi Petr, indeed, but you will notice that *Rod's* test is for a file copy, not directory creation, and I was responding to him!
I note that there is no such thing as a directory on our storage, however, so there is no sense in which a directory can be created (the object store just has files with names that are paths).
We need to understand why FTS4 works at RAL and not at Glasgow. My first thought was that you might not have the same version of the plugin that ignores the directory creation?
The following command _must_ return OK:
$ gfal-mkdir -D 'HTTP PLUGIN:RETRIEVE_BEARER_TOKEN=false' https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_random_name_123
gfal-mkdir error: 1 (Operation not permitted) - HTTP 403 : Permission refused

We don't care what happens at the storage backend; we only care that this command returns HTTP 200.
You can also rely directly on curl to avoid complex gfal:

$ curl --verbose --silent --cert /tmp/x509up_u$(id -u) --key /tmp/x509up_u$(id -u) --cacert /tmp/x509up_u$(id -u) --capath /etc/grid-security/certificates -L -X MKCOL https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_random_name_$(uuidgen)
* Trying 2001:630:40:ef::72:1094...
* Connected to cephc04.gla.scotgrid.ac.uk (2001:630:40:ef::72) port 1094 (#0)
...
> MKCOL /atlas:datadisk/SAM/test_random_name_998c6562-712a-4148-8621-f798087f9e8a HTTP/1.1
> Host: cephc04.gla.scotgrid.ac.uk:1094
> User-Agent: curl/7.76.1
> Accept: */*
>
...
< HTTP/1.1 307 Temporary Redirect
< Connection: Keep-Alive
< Server: XrootD/v5.7.3
< Content-Length: 0
< Location: http://cephc07.gla.scotgrid.ac.uk:1094/atlas%3Adatadisk/SAM/test_random_name_998c6562-712a-4148-8621-f798087f9e8a
<
...
* Clear auth, redirects scheme from HTTPS to http
* Issue another request to this URL: 'http://cephc07.gla.scotgrid.ac.uk:1094/atlas%3Adatadisk/SAM/test_random_name_998c6562-712a-4148-8621-f798087f9e8a'
* Trying 2001:630:40:ef::75:1094...
* Connected to cephc07.gla.scotgrid.ac.uk (2001:630:40:ef::75) port 1094 (#1)
> MKCOL /atlas%3Adatadisk/SAM/test_random_name_998c6562-712a-4148-8621-f798087f9e8a HTTP/1.1
> Host: cephc07.gla.scotgrid.ac.uk:1094
> User-Agent: curl/7.76.1
> Accept: */*
>
...
< HTTP/1.1 403 Forbidden
< Connection: Close
< Server: XrootD/v5.9.1
< Content-Length: 109
<
Unable to mkdir /atlas:datadisk/SAM/test_random_name_998c6562-712a-4148-8621-f798087f9e8a; permission denied
This command must succeed and not fail with HTTP 403
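Since MKCOL is just an HTTP verb, the expected happy path can be sketched with Python's stdlib against a toy local server; the real probe would of course go over TLS with the X.509 proxy, as in the curl reproducer above, and the handler and path here are purely illustrative:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class MkcolHandler(BaseHTTPRequestHandler):
    """Toy door that, unlike the failing one, answers MKCOL with 201 Created."""
    def do_MKCOL(self):
        self.send_response(201)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

def mkcol(host, port, path):
    # Issue a bare MKCOL, as gfal2/FTS4 does when no bearer token is retrieved
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request("MKCOL", path)
    status = conn.getresponse().status
    conn.close()
    return status

server = HTTPServer(("127.0.0.1", 0), MkcolHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status = mkcol("127.0.0.1", server.server_port, "/atlas:datadisk/SAM/test_mkdir")
server.shutdown()
print(status)  # → 201
```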
I mean, I *wrote* the changes to the plugin that ignore directory creation... and I think they should be in all versions since I did that, which was a while ago. This isn't getting as far as the xrootd-posix layer, it's being rejected at the auth step before it...
Rod's test is more interesting, because it might be a failure to actually make a file. However, the authdb for both cephc04 and cephc08 are identical, and you'd think that would be what would determine "success" (both for directory creation and file creation). In fact, the main difference is that bearer tokens are currently *disabled* on cephc08 (the one that Rod got to *work*) for http [because we observe internal transfers tend to be root protocol not http]

So, that implies that it's "having scitokens/bearer tokens" enabled which is breaking things for x509 proxies...

[the config on cephc07 and cephc04 is identical here]
Do we know if RAL has bearer tokens enabled? It would be a check on this being the problem.
yes, https://testbed.farm.particle.cz/cgi-bin/se.cgi?search=RAL-LCG2
On a second point: bearer tokens are only used in TPC, not in uploads. There shouldn't be any bearer token used in gfal-copy.
gfal-copy with RETRIEVE_BEARER_TOKEN=true (the default configuration) uses SE-tokens for all operations
(this was changed in the gfal used by FTS4, because there is no benefit from the additional SE-token HTTP request & response)
@Sam this is how the FTS people explained the problem
> I think the difference is that in FTS3 the first thing the copy process did was retrieve a macaroon token from the storage, and then use this token for all subsequent operations. Whereas in FTS4, we only try to retrieve the macaroon right before issuing the COPY command. Therefore, in FTS4, we use the X509 certificate for all other operations, including MKCOL to create parent directories.

> And for some reason the MKCOL request does not go well along with X509 certificates on this particular storage cephc07.gla.scotgrid.ac.uk
Where does cephc07 even come into this? The doors are 04 and 08. I see redirection to cephc07 when I run the MKCOL curl command to cephc04, but not when I curl to 08. I can even run the curl against 07 directly without error.

Anyway, Petr was very clear - this needs to work

$ curl --verbose --silent --cert $X509_USER_PROXY --key $X509_USER_PROXY --cacert $X509_USER_PROXY --capath /etc/grid-security/certificates -L -X MKCOL https://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/SAM/test_random_name_$(uuidgen)

and does not.
So: cephc04 and cephc08 are the internal and external endpoints for transfers. cephc04 is a redirector, and spreads its load across more than one server (although it seems to be preferring cephc07 at the moment).
cephc08 is not a redirector (and even if it were, it wouldn't redirect to the same pool, because the point of having separate doors is to separate out the endpoints).
(It's understood that we need to make it work with FTS4's changed behaviour... but the first thing is to work out why the config doesn't do the expected thing)
Is there any news?
There are still 300 or so "Permission refused" failures a day, averaged over the last week, but the overall transfer efficiency with UKI-SCOTGRID-GLASGOW as destination has been >90% since March 5.

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?from=1772437550996&orgId=17&to=1775029550996&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-binning=%24__auto_interval_binning&var-columns=src_cloud&var-dst_cloud=UK&var-dst_country=All&var-dst_endpoint=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_token=All&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-groupby=dst_experiment_site&var-include=&var-include_es_dst=All&var-include_es_src=All&var-measurement=ddm_transfer&var-protocol=All&var-remote_access=All&var-retention_policy=raw&var-rows=dst_endpoint&var-src_cloud=All&var-src_country=All&var-src_endpoint=All&var-src_site=All&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_token=All&viewPanel=6
Over the last 12 hours we observe low deletion efficiency, ~31%, with 14.83K errors at the site, mainly due to "The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink HTTP 403 : Permission refused"
Link:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1775096792999&to=1775139992999&var-enr_filters=data.purged_reason%7C%3D%7CThe+requested+service+is+not+available+at+the+moment.%0ADetails%3A+An+unknown+exception+occurred.%0ADetails%3A+DavPosix%3A%3Aunlink++HTTP+403+%3A+Permission+refused
WLCG #1000554 (id:1000554) UKI-SCOTGRID-GLASGOW has high number of deletion failures
State: on hold  |  Priority: less urgent  |  Opened: 2025-09-11 22:32 (204d ago)  |  Updated: 2026-04-01 10:25
Conversation (11 messages)
Over the last 6h, deletion efficiency is ~63%, with 4.22K errors due to "The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s"
Link:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1757608231323&to=1757629831323&var-enr_filters=data.purged_reason%7C%3D%7CThe+requested+service+is+not+available+at+the+moment.%0ADetails%3A+An+unknown+exception+occurred.%0ADetails%3A+DavPosix%3A%3Aunlink++timeout+of+20s

Details:
timestamp: 09/11/2025, 10:30:23 PM
scope: mc15_13TeV
name: log.38001411._000869.job.log.tgz.1
reason_text: The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s
event_type: deletion-failed
rse: UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
duration: 20.0 s
bytes: 462.16 kB
url: davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/mc15_13TeV/40/c2/log.38001411._000869.job.log.tgz.1
(remaining raw monitoring fields: -29358868-1757629823000, 1757629826, UNKNOWN, UNKNOWN, log, UNKNOWN, true, UNKNOWN, -29358868-1757629823000, 1757629826, 331108833, 175762982300)
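The deletion-efficiency figures quoted in this ticket are simply successful deletions divided by all deletion attempts in the reporting window. A minimal sketch of that computation (the event dicts and the "event_type" values are hypothetical, not the actual MONIT/Rucio schema):

```python
# Minimal sketch: deletion efficiency = successful deletions / all attempts.
# The event dicts and the "event_type" field are hypothetical, not the
# actual MONIT/Rucio schema.
def deletion_efficiency(events):
    attempts = [e for e in events
                if e["event_type"] in ("deletion-done", "deletion-failed")]
    if not attempts:
        return None  # no deletion activity in the window
    ok = sum(1 for e in attempts if e["event_type"] == "deletion-done")
    return ok / len(attempts)

# 63 successes and 37 timeout failures -> 63% efficiency, as in the report.
events = ([{"event_type": "deletion-done"}] * 63
          + [{"event_type": "deletion-failed",
              "reason": "DavPosix::unlink timeout of 20s"}] * 37)
print(deletion_efficiency(events))  # 0.63
```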
Hi, deletion efficiency is <60% in the last 6h
Deletions fail with "The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s"

timestamp: 15/09/2025, 07:23:38
scope: mc23_13p6TeV
name: log.40005555._017643.job.log.tgz.1
reason_text: The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s
event_type: deletion-failed
rse: UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
duration: 20.0 s
bytes: 784.08 kB
url: davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/mc23_13p6TeV/f5/7e/log.40005555._017643.job.log.tgz.1
(remaining raw monitoring field: 1051012417-1757921018000)

Please have a look

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1757900908751&to=1757922508751
In the last 24h deletion efficiency has been >99%; have any issues been fixed?
Hi: we didn't do anything to the site during this period. Remember that we are currently operating on a single gateway due to IPv6 issues, so load spikes may cause service degradation, which resolves naturally once the load drops.
Right, I can confirm we still see some fluctuations and issues in the deletion efficiency:

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_cloud&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=UK&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1759311027883&to=1759354227883
Marking this on hold; the site is working on a permanent solution on the network end.
Hi, the number of UKI-SCOTGRID-GLASGOW's DDM Deletion failures is 2950 in the past 6 hours. Details: The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s.

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=$__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis%20Input&var-activity=Analysis%20Output&var-activity=Data%20Carousel%20Analysis&var-activity=Data%20Carousel%20Production&var-activity=Data%20Challenge&var-activity=Data%20Consolidation&var-activity=Data%20Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional%20Test&var-activity=Production%20Input&var-activity=Production%20Output&var-activity=Recovery&var-activity=Staging&var-activity=T0%20Export&var-activity=T0%20Recall&var-activity=T0%20Tape&var-activity=T0%20Tape%20Derived&var-activity=T0%20Tape%20RAW&var-activity=User%20Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis%20Input&var-activity_disabled=Data%20Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional%20Test&var-activity_disabled=Production%20Input&var-activity_disabled=Production%20Output&var-activity_disabled=User%20Subscriptions&var-protocol=All&var-remote_access=All&from=now-6h&to=now&var-enr_filters=data.dst_experiment_site%7C%3D%7CUKI-SCOTGRID-GLASGOW
Hi, in the last 6h, the number of DDM deletion errors at UKI-SCOTGRID-GLASGOW has exceeded 1.8K.
The detailed errors information is shown below:
The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s
DDM monitor link:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=UKI-SCOTGRID-GLASGOW&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1763415032566&to=1763436632566&refresh=5s
Hi,
The situation is the same for DDM deletions: the site's deletion efficiency has dropped below 50% over the past 6 hours and the number of errors has exceeded 2.6K. The error is: "The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s"

Here is the link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1764815591700&to=1764837191700
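The per-error counts quoted throughout this ticket (e.g. "the number of errors has exceeded 2.6K") come from grouping failure events by their reason string, analogous to the dashboard's group-by on the error message. A minimal sketch with hypothetical event dicts (not the actual MONIT schema):

```python
# Sketch: group failure events by their reason string and count them,
# analogous to the dashboard's group-by on the error message.
# Event dicts are hypothetical, not the actual MONIT schema.
from collections import Counter

def errors_by_reason(events):
    return Counter(e["reason"]
                   for e in events if e["event_type"].endswith("-failed"))

events = [
    {"event_type": "deletion-failed", "reason": "DavPosix::unlink timeout of 20s"},
    {"event_type": "deletion-failed", "reason": "DavPosix::unlink timeout of 20s"},
    {"event_type": "deletion-done", "reason": ""},
]
print(errors_by_reason(events).most_common(1))
# [('DavPosix::unlink timeout of 20s', 2)]
```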
Hi,

we still see DDM deletion errors / functional-test failures: >3k errors in the last 3 hours:
UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s

https://monit-grafana.cern.ch/goto/qToToG7vR?orgId=17

Could you check it?

ADCoS shifter
There are two other tickets for UKI-SCOTGRID-GLASGOW, with much more detailed discussion: https://helpdesk.ggus.eu/#ticket/zoom/1001975 (in progress) and https://helpdesk.ggus.eu/#ticket/zoom/3657 (on hold). This one seems redundant. **Unless there are objections, I propose to close this one.**

By the way, since Feb 21 deletions have never dropped below 90% efficiency, and transfers are generally >90% efficient (with a glitch down to roughly 60% on March 3-4).

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1759314251516&to=1775039051517&viewPanel=6

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Recall&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1759314272820&to=1775039072820&viewPanel=174
WLCG #683521 (id:3657) UKI-SCOTGRID-GLASGOW transfer errors
State: on hold  |  Priority: urgent  |  Opened: 2025-06-13 21:57 (294d ago)  |  Updated: 2026-04-01 08:00
Conversation (53 messages)
UKI-SCOTGRID-GLASGOW transfer errors in the last 12 hours.

For example:

timestamp: 06/13/2025, 09:54:24 PM
activity: Analysis Output
scope: user.ivyoung
name: user.ivyoung.45076750._000008.single_lepton_run2_nosys.root
reason_text: TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Connection terminated abruptly; Status of TPC request unknown
event_type: transfer-failed
src_rse: MPPMU_SCRATCHDISK
dst_rse: UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
duration: 2.5 hours
bytes: 4.45 GB
transfer_link: https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/66e55b2e-4881-11f0-8de4-fa163e5ce271
src_url: davs://grid-dav.rzg.mpg.de:2880/atlasscratchdisk/rucio/user/ivyoung/ef/f0/user.ivyoung.45076750._000008.single_lepton_run2_nosys.root?copy_mode=pull
dst_url: davs://cephc07.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/ivyoung/ef/f0/user.ivyoung.45076750._000008.single_lepton_run2_nosys.root
protocol: davs
(remaining raw monitoring fields: 4449998966, -354218960-1749851664000, 1749851688, 1749838010000, user.ivyoung.mc20_13TeV.508792.MGPy8_ttbar_SMEFTsim_topdecay_reweighted_nonallhad.deriv.DAOD_PHYSLITE.e8379_a907_r14860_p6490_gta199.v00_single_lepton_run2_nosys.606509192.606509192, user.ivyoung, UNKNOWN, UNKNOWN, true, UNKNOWN, 262895294)


Link:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=$__auto_interval_binning&var-groupby=dst_cloud&var-activity=Analysis%20Input&var-activity=Analysis%20Output&var-activity=Data%20Brokering&var-activity=Data%20Consolidation&var-activity=Data%20Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional%20Test&var-activity=Production%20Input&var-activity=Production%20Output&var-activity=Recovery&var-activity=Staging&var-activity=T0%20Export&var-activity=T0%20Tape&var-activity=User%20Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis%20Input&var-activity_disabled=Data%20Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional%20Test&var-activity_disabled=Production%20Input&var-activity_disabled=Production%20Output&var-activity_disabled=Staging&var-activity_disabled=User%20Subscriptions&var-protocol=All&var-remote_access=All&from=now-12h&to=now
As an update, the deletion efficiency is low for the site (around 20%).
See https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1749919632014&to=1749930432014
Still low transfer efficiency. Any update?
Link for the past 24 hours:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_cloud&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1750198675039&to=1750285075039
On Deletions - this has recovered since the weekend.
On transfers (as a destination - we're fine as a source): most of the "failures" are transfers being cancelled ["In the queue for too long" or "user canceled job"]. It may be that we have a backlog of transfers in FTS?
We certainly are not consuming anywhere near our total bandwidth so I am looking into it.
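The "In the queue for too long" cancellations mentioned above are FTS's max-time-in-queue behaviour: queued transfers older than the limit are cancelled rather than attempted. A toy sketch of that policy (the Job class and the 3-hour limit are illustrative, not FTS's actual implementation):

```python
# Toy sketch of FTS's max-time-in-queue policy: queued transfers older than
# the limit are cancelled rather than attempted. The Job class and the
# 3-hour limit are illustrative, not FTS's actual implementation.
from dataclasses import dataclass

MAX_TIME_IN_QUEUE = 3 * 3600  # seconds (illustrative)

@dataclass
class Job:
    job_id: str
    submitted: float  # epoch seconds

def expire_queue(queue, now):
    """Split the queue into jobs still eligible and jobs to cancel."""
    keep, cancel = [], []
    for job in queue:
        (cancel if now - job.submitted > MAX_TIME_IN_QUEUE else keep).append(job)
    return keep, cancel

queue = [Job("old", 0.0), Job("recent", 9000.0)]
keep, cancel = expire_queue(queue, now=12000.0)
print([j.job_id for j in cancel])  # ['old']
```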
Hi,
just an update on the current situation. The level of these TPC errors is stable:

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&theme=dark&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-activity=Data+Challenge&var-activity=Analysis+Output&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_cloud&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Express&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1750848907528&to=1751453707528&var-enr_filters=data.purged_reason%7C%3D%7CTRANSFER+ERROR%3A+Copy+failed+%283rd+pull%29.+Last+attempt%3A+Connection+terminated+abruptly%3B+Status+of+TPC+request+unknown&viewPanel=122

example of failed transfer:
timestamp: 02/07/2025, 10:54:50
activity: Data Consolidation
scope: data24_13p6TeV
name: AOD.44776797._002209.pool.root.1
reason_text: TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Connection terminated abruptly; Status of TPC request unknown
event_type: transfer-failed
duration: 14293
src_rse: BNL-OSG2_DATADISK
dst_rse: UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
bytes: 7172777879
transfer_link: https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/db516b7a-5361-11f0-99d1-fa163e39e2d9
src_url: davs://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data24_13p6TeV/bf/42/AOD.44776797._002209.pool.root.1?copy_mode=pull
dst_url: davs://cephc07.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/data24_13p6TeV/bf/42/AOD.44776797._002209.pool.root.1
protocol: davs
Hi,
Another update on this site: it is failing as a destination again, with ~7K errors over the past 6h, mostly transfers being cancelled ("In the queue for too long" or "user canceled job").

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1751832037354&to=1751853637355

Example:

timestamp: 07/07/2025, 02:00:03 AM
activity: Analysis Output
scope: user.ludovica
name: user.ludovica.45370011._002056.results.txt
reason_text: Job has been canceled because it stayed in the queue for too long (max-time-in-queue timeout)
event_type: transfer-failed
src_rse: RAL-LCG2-ECHO_SCRATCHDISK
dst_rse: UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
duration: -1.0 s
bytes: 91.00 B
transfer_link: https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/d8d8d4a0-5555-11f0-9142-fa163e5a92fb
src_url: davs://rdr.echo.stfc.ac.uk:1094/atlas:scratchdisk/rucio/user/ludovica/66/58/user.ludovica.45370011._002056.results.txt?copy_mode=pull
dst_url: davs://cephc07.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/ludovica/66/58/user.ludovica.45370011._002056.results.txt
protocol: davs
(remaining raw monitoring fields: 91, 1703265213-1751853603000, 1751853696, 1751248670000, user.ludovica.dyt_Ai_NLO_wm_13_CT18NNLO_mem23_kmuren_1_kmufac_1_qt_default_m_1_1_260_Y_0.0_2.0_VJLO_v2_results.txt.609862607.609862607, user.ludovica, UNKNOWN, UNKNOWN, true, UNKNOWN, 279450547)

Please take a look. Thank you.

Khanh.
Hi all,

We still observed the transfer errors, for example,
- Job has been canceled because it stayed in the queue for too long (max-time-in-queue timeout)
- TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Connection terminated abruptly; Status of TPC request unknown

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1752154939153&to=1752198139154

Could you check it?

ADCoS shifter
The site has been completely red since yesterday.

https://monit-grafana.cern.ch/goto/nx4IP48NR?orgId=17
Hi,
As an update, this site is failing as a destination with ~8.54K failures over the past 6h. The most common errors are, for example:
- Job canceled by the user

- HTTP 403 : Permission refused

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&from=1752688985883&to=1752710585883&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All

Please have a look. Thank you.

Khanh (ADCoS shifter).
Hi again,
It seems this site is also failing as a source, especially for transfers to BNL-ATLAS. There have been ~2.14K transfer failures over the past 3h, with the most common error: TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Transfer failure: socket timeout on GET : Read timed out; redirects IP|HOST:PORTtadisk/PATH

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=src_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=UKI-SCOTGRID-GLASGOW&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=All&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&from=1752708602389&to=1752719402389&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All
An example of transfer failures:

07/17/2025, 02:30:49 AM
User Subscriptions
mc16_13TeV
EVNT.35304445._001689.pool.root.1

TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Transfer failure: socket timeout on GET (received 0 B of data; 0 B pending): Read timed out; redirects [http://cephc07.gla.scotgrid.ac.uk:1095/atlas:datadisk/rucio/mc16_13TeV/11/d0/EVNT.35304445._001689.pool.root.1?<redacted>
transfer-failed
UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
BNL-OSG2_LOCALGROUPDISK
5.4 mins
255.81 MB
https://fts3.usatlas.bnl.gov:8449/fts3/ftsmon/#/job/babadc32-6286-11f0-9383-00163e105263

davs://cephc07.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/mc16_13TeV/11/d0/EVNT.35304445._001689.pool.root.1?copy_mode=pull
davs://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/LOCALGROUPDISK/rucio/mc16_13TeV/11/d0/EVNT.35304445._001689.pool.root.1
davs
255812644
1872224110-1752719449000
1752719455
1752688810000
mc16_13TeV.701110.Sh_2214_WqqZvv.merge.EVNT.e8547_e8455_tid35304445_00
mc16_13TeV
EVNT
UNKNOWN
true
UNKNOWN
284250869
1752719449000
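The long numeric fields in these dumps are epoch timestamps in milliseconds; converting them makes it easier to correlate with the FTS logs. A minimal sketch (the sample value is taken from the dump above):

```python
from datetime import datetime, timezone

def epoch_ms_to_utc(ms: int) -> str:
    """Convert an epoch timestamp in milliseconds to a UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# 1752719449000 appears in the dump above; it decodes to the
# displayed transfer time 07/17/2025, 02:30:49 (UTC).
print(epoch_ms_to_utc(1752719449000))
```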

Please also have a look.
Thank you.

Khanh (ADCoS shifter)
Hi,

The site continues to fail transfers. Please see:

https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&var-enr_filters=data.purged_reason%7C%3D%7CTRANSFER+ERROR%3A+Copy+failed+%283rd+pull%29.+Last+attempt%3A+copy+HTTP+500+%3A+Unexpected+server+error%3A+500+&from=1753112228688&to=1753123028689

Could you please have a look at this?

Gabriela (ADCoS shifter)
Hi,

The site is still failing as a destination, with 0% transfer efficiency over the past 48 hours.
We have observed over 67k transfer failures, most of which report "timeout of 300s".
See the monitoring link below for details.

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_cloud&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1753488000000&to=1753660799000

An example of transfer failures:
2025/07/27 12:59:22
Analysis Output
user.fgiuli
user.fgiuli.45763422._000118.results.root
timeout of 300s
transfer-failed
CSCS-LCG2_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
5.0 mins
23.99 kB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/22826cdc-6ab5-11f0-a0be-5a59366a71a6
davs://storage01.lcg.cscs.ch:2880/pnfs/lcg.cscs.ch/atlas/atlasscratchdisk/rucio/user/fgiuli/e6/7f/user.fgiuli.45763422._000118.results.root?copy_mode=pull
davs://cephc07.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/fgiuli/e6/7f/user.fgiuli.45763422._000118.results.root
davs
23989
-1433076373-1753621635000
1753621646
1753598559000
user.fgiuli.dyt_Ai_NLO_wp_5_CT18ZNNLO_as_0118_kmuren_1_kmufac_1_qt_default_m_1_1_260_Y_3.6_3.8_VJLO_v1_results.root.614645398.614645398
user.fgiuli
UNKNOWN
UNKNOWN
true
UNKNOWN
296181671
1753621639000
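The `e6/7f` component in the failing paths above follows Rucio's default deterministic lfn2pfn convention: the first four hex characters of the MD5 of `scope:name`, split into two directory levels. A quick check against the sample above (assuming the endpoint uses this default hash-based algorithm):

```python
import hashlib

def rucio_hash_dirs(scope: str, name: str) -> str:
    """First 2+2 hex chars of md5('scope:name'), as used in Rucio deterministic paths."""
    digest = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    return f"{digest[0:2]}/{digest[2:4]}"

# The failing file above lives under .../rucio/user/fgiuli/e6/7f/...
print(rucio_hash_dirs("user.fgiuli", "user.fgiuli.45763422._000118.results.root"))
```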

Could you please have a look at this?

Linjing Duan (ADCoS trainee)
Any update here?
Dear site admins,

For the past two days, the transfer and deletion efficiencies of UKI-SCOTGRID-GLASGOW have been almost zero, with ~80k transfer errors and ~40k deletion errors showing the following messages:

Transfer error: "timeout of 300s"

Deletion error: "The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s"
For more details:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-binning=%24__auto_interval_binning&var-columns=src_experiment_site&var-dst_cloud=All&var-dst_country=All&var-dst_endpoint=All&var-dst_site=All&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_token=All&var-enr_filters=data.dst_experiment_site%7C%3D%7CUKI-SCOTGRID-GLASGOW&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-groupby=dst_cloud&var-include=&var-include_es_dst=All&var-include_es_src=All&var-measurement=ddm_transfer&var-protocol=All&var-remote_access=All&var-retention_policy=raw&var-rows=dst_experiment_site&var-src_cloud=All&var-src_country=All&var-src_endpoint=All&var-src_site=All&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_token=All&from=1753781933254&to=1753868333254

Could you please have a look at this matter?
Cheers,
Mohammed Faraj (ADCoS senior shifter)
Transfer details:

07/30/2025, 09:39:16 AM
Analysis Output
user.fgiuli
user.fgiuli.45785631._001165.results.log

timeout of 300s
transfer-failed
IFAE_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
5.0 mins
6.81 kB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/454d33f6-6d25-11f0-b1c7-fa163e087e6b

davs://webdav-at2.pic.es:8446/t2atlasscratchdisk/rucio/user/fgiuli/9b/8d/user.fgiuli.45785631._001165.results.log?copy_mode=pull
davs://cephc07.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/fgiuli/9b/8d/user.fgiuli.45785631._001165.results.log
davs
6807
1339348039-1753868356000
1753868367
1753866554000
user.fgiuli.dyt_Ai_NNLO_wm_13_CT18ZNNLO_as_0118_kmuren_1_kmufac_1_qt_default_m_1_1_260_Y_1.2_1.4_VJREAL_v1_results.log.614983219.614983219
user.fgiuli
UNKNOWN
UNKNOWN
true
UNKNOWN
296820429
1753868363000

Deletion details:

07/30/2025, 09:39:54 AM
panda
panda.um.group.phys-hmbs.45771404._000620.hist-output.root

The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s
deletion-failed
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
20.0 s
26.57 MB
davs://cephc07.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/panda/08/1e/panda.um.group.phys-hmbs.45771404._000620.hist-output.root
1306367809-1753868394000
1753868400
UNKNOWN
UNKNOWN
UNKNOWN
UNKNOWN
true
UNKNOWN
1306367809-1753868394000
1753868400
294553713
1753868394000
To recover the transfers and the jobs while the more general networking problem is being understood, we have temporarily replaced the production machine in CRIC with an older machine which, while less performant, has a configuration that works for the moment. All of this depends on central university configuration, so I opened a ticket to track the CRIC changes: https://its.cern.ch/jira/browse/ADCINFR-280
The transfers are restarting. I'm putting the ticket on hold.
FYI: the transfer errors with this site as the destination still persist: https://monit-grafana.cern.ch/goto/NRkntywHR?orgId=17
Over 5.5k during the past 3h.
The deletion efficiency of UKI-SCOTGRID-GLASGOW has been below 50%, with ~5.46k failures during the past 6 hours. The failure error is: "The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s".
Deletion details:
15/08/2025, 14:07:27
tests
step14.46295.58541.recon.ESD.35323.22530

The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s
deletion-failed
UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
20.0 s
1.05 MB
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/tests/84/3d/step14.46295.58541.recon.ESD.35323.22530
-887837128-1755266847000
1755266855
UNKNOWN
UNKNOWN
UNKNOWN
UNKNOWN
true
UNKNOWN
-887837128-1755266847000
1755266855
305413898
1755266847000

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_cloud&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1755245506591&to=1755267106591

Regards,
ADCoS shifter
Dear site admins,
Just an update about the current situation of this site: it is failing as a destination with ~4.94K transfer failures over the past 6h. The most common transfer error message is "Job canceled by the user".

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1755542498638&to=1755564098638

Example:

08/19/2025, 12:32:21 AM
Analysis Output
user.mzaazoua
user.mzaazoua.45898571._000045.output-tree.root

Job canceled by the user
transfer-failed
TRIUMF-LCG2_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
-1.0 s
2.45 MB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/01ec7aba-7c37-11f0-83a1-fa163ea7ee69

davs://webdav.lcg.triumf.ca:2880/atlas/atlasscratchdisk/rucio/user/mzaazoua/98/97/user.mzaazoua.45898571._000045.output-tree.root?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/mzaazoua/98/97/user.mzaazoua.45898571._000045.output-tree.root
davs
2449954
1840446642-1755563541000
1755563900
1755523496000
user.mzaazoua.Test010_data_Run2.periodAllYear.grp16_v01_p6266_TREE.616580467.616580467
user.mzaazoua
UNKNOWN
UNKNOWN
true
UNKNOWN
307629778
1755563894000

Please have a look.

Best regards,
Khanh (ADCoS shifter).
Please also note that the site has high deletion failures: ~4.35k over the past 6h.
Error: The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s

Example:

08/19/2025, 12:45:56 AM
data24_13p6TeV
data24_13p6TeV.00486205.physics_Main.merge.AOD.f1519_m2259._lb0847._0003.1

The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s
deletion-failed
UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
20.0 s
2.99 GB
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/data24_13p6TeV/8c/6c/data24_13p6TeV.00486205.physics_Main.merge.AOD.f1519_m2259._lb0847._0003.1
1341492406-1755564356000
1755564362
UNKNOWN
UNKNOWN
AOD
UNKNOWN
true
UNKNOWN
1341492406-1755564356000
1755564362
311164857
1755564356000

Best regards,
Khanh.
Additional debugging info and details are available in the other ticket, which has been closed as a duplicate for now: https://helpdesk.ggus.eu/#ticket/zoom/3828
Dear site admins,
May we check if there are any updates on the debugging?

In fact, over the past 6h this site has been failing as a destination with ~4.09k transfer failures. The most common error messages are:
Job canceled by the user (~1.19k)

Result curl error (52): Server returned nothing (no headers, no data) after 1 attempts (~1.18k)

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1756328171298&to=1756349771298

Please have a look.

Best regards,
Baihong and Khanh (ADCoS shifters).
Dear site admins,
May we check if there are any updates on the debugging?
In the past 3 hours, this site has been failing as a destination with ~9.38k transfer failures. The most common error messages are:
TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Connection terminated abruptly; Status of TPC request unknown (~3.50k)
Result curl error (52): Server returned nothing (no headers, no data) after 1 attempts (~2.76k)
Example:
2025/09/04 05:20:46
Analysis Output
user.menke

user.menke.46286483.EXT0._000877.RadMaps.root

TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Connection terminated abruptly; Status of TPC request unknown
transfer-failed
CERN-PROD_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
51.0 s
50.67 MB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/4f2490a2-8924-11f0-a059-fa163ea7ee69

davs://eosatlas.cern.ch:443/eos/atlas/atlasscratchdisk/rucio/user/menke/ba/51/user.menke.46286483.EXT0._000877.RadMaps.root?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/menke/ba/51/user.menke.46286483.EXT0._000877.RadMaps.root
davs
50671672
-828652927-1756963246000
1756963278
1756944902000
user.menke.RadMaps-R3S-2021-03-02-01-13p6TeV-G4p10p6_photevap-Py8EG_A3_NNPDF23LO_minbias_inelastic.v1_EXT0.620070434.620070434
user.menke
UNKNOWN
UNKNOWN
true
UNKNOWN
318240364
1756963264000

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1756952639835&to=1756963439835
Please have a look.
Best regards,
Zelin and Khanh (ADCoS shifters).
There were multiple errors reported in this ticket. Checking now, it seems the site is working relatively well. Average transfer efficiencies for the last day are 99% as source and 97% as destination, so I am closing this ticket.
Hi, the issue has occurred again over the past 3 hours, with 24.7% efficiency and 18.2k errors.
The main error is: "HTTP 403 : Permission refused"
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Brokering&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=User+Subscriptions&var-activity=default&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1759036193024&to=1759046993024
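For a rough sense of scale: with efficiency defined as successes / (successes + failures), the quoted 24.7% and 18.2k errors imply roughly 6k successful transfers in the window. A back-of-envelope helper (the formula is an assumption about how the dashboard computes efficiency):

```python
def implied_successes(efficiency: float, failures: int) -> float:
    """Infer the successful-transfer count from efficiency = s / (s + f)."""
    return efficiency * failures / (1 - efficiency)

print(round(implied_successes(0.247, 18_200)))  # roughly 6k successes
```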

Here is the detail:
2025/09/28 07:10:32
Functional Test
tests
step14.90376.6018.recon.ESD.46564.68515

HTTP 403 : Permission refused
transfer-failed
UKI-NORTHGRID-LIV-HEP_DATADISK
UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
0 s
1.05 MB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/64a1e59c-9c39-11f0-8a5b-fa163ec3b00c

davs://hepgrid11.ph.liv.ac.uk:443/dpm/ph.liv.ac.uk/home/atlas/atlasdatadisk/rucio/tests/30/2a/step14.90376.6018.recon.ESD.46564.68515?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/tests/30/2a/step14.90376.6018.recon.ESD.46564.68515
davs
1048576
-147178199-1759043432000
1759043436
1759043008000
step14.90376.6018.recon.ESD.46564
tests
UNKNOWN
UNKNOWN
true
UNKNOWN
338694825
1759043434000
Thanks, this is a known issue and the site is working on a long-term solution.
As this work is ongoing and might take some time, I am setting the ticket on hold.
Hi folks,

The overall transfer efficiency of TW-FTT is currently down 20% solely because of these failures. Is there an ETA for a solution? Do we have to change the distance to the site until this problem is solved?
Cheers,
Yi-Ru & Ivan
Unfortunately, we don't have an ETA, although we are pushing our campus networking team for a solution. If there are adjustments that can be made to site distance in order to lessen the pressure elsewhere, then that might be a good idea?
Gordon
What is the known problem, and what are the precise symptoms? I don't think everything can be blindly blamed on it. For example, I cannot gfal-copy any file from Glasgow. I also notice successful transfers for FZK-LCG2_DATADISK and 100% failures for KIT-T2_DATADISK. This is a smoking gun, because this is the same storage and the same doors; the difference is in the Rucio config, where tokens are enabled for FZK-LCG2 but not KIT-T2:

$ rucio-admin rse info FZK-LCG2_DATADISK | grep oid
oidc_base_path: /pnfs/gridka.de/atlas
oidc_support: True
[lxplus956] ~ $ rucio-admin rse info KIT-T2_DATADISK | grep oid
[lxplus956] ~ $
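The asymmetry above can be checked per RSE by inspecting the `rucio-admin rse info` output for the OIDC attributes. A small helper that parses that output (the sample strings mirror the two RSEs above):

```python
def rse_supports_tokens(rse_info: str) -> bool:
    """True if 'oidc_support: True' appears in `rucio-admin rse info` output."""
    for line in rse_info.splitlines():
        key, _, value = line.strip().partition(":")
        if key.strip() == "oidc_support":
            return value.strip() == "True"
    return False  # attribute absent entirely, as for KIT-T2_DATADISK above

fzk = "oidc_base_path: /pnfs/gridka.de/atlas\noidc_support: True"
kit = ""  # no oidc attributes at all
print(rse_supports_tokens(fzk), rse_supports_tokens(kit))
```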

gfal-copy and KIT-T2 are using x509, so I'd say that is broken:

DEBUG Impossible to get string_list parameter HTTP PLUGIN:HEADERS, set to a default value (null), err Key file does not have key “HEADERS” in group “HTTP PLUGIN”
DEBUG Impossible to get integer parameter HTTP PLUGIN:OPERATION_TIMEOUT, set to default value 1800, err Key file does not have key “OPERATION_TIMEOUT” in group “HTTP PLUGIN”
DEBUG Using client X509 for HTTPS session authorization
DEBUG Impossible to get string parameter BEARER:TOKEN, set to default value (null), err Key file does not have group “BEARER”
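The DEBUG lines above show gfal2 falling back to built-in defaults because the key file defines no [HTTP PLUGIN] or [BEARER] entries. For local debugging, the same keys can be set explicitly in a gfal2 config drop-in; a sketch under assumptions (the file path and value are illustrative, only the group/key names come from the log above):

```ini
# e.g. /etc/gfal2.d/http_debug.conf (illustrative path)
[HTTP PLUGIN]
# Default is 1800 s per the DEBUG line above; lower it to fail fast while testing.
OPERATION_TIMEOUT=300
```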

The rest of the apparently random failures between sites, e.g. failures also from FZK despite its supporting tokens, comes down to mixing transfers with sites that do not support tokens: then all transfers must use certs.

This was being looked at by dev.

Cheers,
Rod.
Hi,
As an update, this site is failing as a destination with ~4k failures over the past 3h.
Main error:
Job canceled by the user

example:

2025/11/07 06:32:28
Analysis Output
user.fgiuli
user.fgiuli.47069510._004974.results.txt

Job canceled by the user
transfer-failed
NET2_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
-1.0 s
383.00 B
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/cda5cb48-bb61-11f0-8293-5a59366a71a6

davs://webdav.data.net2.mghpcc.org:2880/atlasscratchdisk/rucio/user/fgiuli/62/4f/user.fgiuli.47069510._004974.results.txt?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/fgiuli/62/4f/user.fgiuli.47069510._004974.results.txt
davs
383
894286054-1762497148000
1762497340
1762468827000
user.fgiuli.dyt_Z3D_8TeV_NNLO_sin2th_scheme_CC_costh_bin_1_NNPDF31_nnlo_as_0118_Y_0.4_0.6_VJVIRT_v0_results.txt.628808337.628808337
user.fgiuli
UNKNOWN
UNKNOWN
true
UNKNOWN
363648895
1762497329000

Link is: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_endpoint&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_endpoint&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&var-enr_filters=data.dst_experiment_site%7C%3D%7CUKI-SCOTGRID-GLASGOW&from=1762486700633&to=1762497500633
Can you have a look at this?
Best,
Zelin and Baihong
Dear site admins,
Could you please comment on your x509 configuration mentioned above by Rod?
Cheers,
Ivan
Hi, over the past 3 hours there have been 9.36k transfer errors "Job has been canceled because it stayed in the queue for too long (max-time-in-queue timeout)" and 3.26k deletion errors "The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s" for UKI-SCOTGRID-GLASGOW. The link is:
https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_cloud&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=Staging&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1762913318771&to=1762924118771
Here is an example of the transfer error:
2025/11/12 05:16:16
Analysis Output
user.hslien
user.hslien.47281889._000055.output-tree.root
TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Transfer failure: Internal transfer failure, local=/atlas:scratchdisk/rucio/user/hslien/0b/a5/user.hslien.47281889._000055.output-tree.root, remote=https://ccdavatlas.in2p3.fr:2880/atlasscratchdisk/rucio/user/hslien/0b/a5/user.hslien.47281889._000055.output-tree.root?<redacted>, HTTP library failure=SSL connect error
transfer-failed
IN2P3-CC_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
4.0 mins
3.70 MB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/11456174-bf86-11f0-afab-fa163e9a529a
davs://ccdavatlas.in2p3.fr:2880/atlasscratchdisk/rucio/user/hslien/0b/a5/user.hslien.47281889._000055.output-tree.root?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/hslien/0b/a5/user.hslien.47281889._000055.output-tree.root
davs
3697594
939328499-1762924576000
1762924583
1762924197000
user.hslien.mc23a_sys_v01.2025_11_07_T104339.601355.e8551_s4162_r14622_p6697_TREE.630747418.630747418
user.hslien
UNKNOWN
UNKNOWN
true
UNKNOWN
367139501
1762924579000

And the deletion error:
2025/11/12 05:16:24
user.chhultqu
user.chhultqu.46062497.EXT0._000442.AOD.pool.root

The requested service is not available at the moment. Details: An unknown exception occurred. Details: DavPosix::unlink timeout of 20s
deletion-failed
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
20.0 s
7.03 GB
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/chhultqu/64/c4/user.chhultqu.46062497.EXT0._000442.AOD.pool.root
-1896086664-1762924584000
1762924594
UNKNOWN
UNKNOWN
UNKNOWN
UNKNOWN
true
UNKNOWN
-1896086664-1762924584000
1762924594
369074280
1762924584000

Could you please have a look at this issue? Thanks.

Xiang Li (ADCoS shifter)
Dear site admins,
Could you please comment on your x509 configuration mentioned above by Rod?
Cheers,
Ivan
Hi,
there might be a second-order problem with x509, but to me the error pattern stems from the only IPv6-accessible server being undersized. So in this sense it is true that the answer is always the same.
One of the things I was thinking this morning that Glasgow could try, since a general IPv6 solution is far away, is to rename cephc07 and assign it the IPv6 address of cephc04, i.e. replace the hardware. @gordon, how difficult would that be?
Hi Alessandra,
In principle, I don't think that would be difficult. Our concern was always that we weren't entirely sure *why* cephc04 was accessible via IPv6, as really all the hosts should behave in the same way; the worry was that if we started playing around with things, we might end up with nothing working at all. However, given the amount of time it's taking to get a proper resolution, it may be worth trying.
Also: re x509 - no, this was just a (known) problem where sometimes xrootd servers break their certificate chains and need to be restarted [but still authenticate things that don't need x509 fine themselves]. Sorry, this was resolved off-ticket with Rod.
Production transfers seem mostly OK; what is broken is user subscriptions and analysis output, I assume because they have a much lower priority. Unfortunately, Glasgow cannot benefit from alt storage on the same queue. The load, from the ATLAS point of view, doesn't seem that high. I've also increased the parallel transfers from 100 to 150.
Thank you all. So, what is the bottom line here? Are we looking forward to implementing what Alessandra proposed (rename cephc07 and assign it the IPv6 IP of cephc04, i.e. replace the hardware)? I don't think the fact that what is failing is analysis-related rather than production-related should keep us from following up on this - the site still has only ~60% transfer efficiency as destination (https://monit-grafana.cern.ch/goto/wXQkLZivg?orgId=17)
Hi Ivan,
we agreed yesterday that next week cephc07 will be renamed and will get cephc04's IPv4 and IPv6 IPs, and we will take it from there. There might be other problems behind that, but until the hardware is replaced it's all speculation.
Thank you Ale. Do we have an ETA (after which we can ask for a status update)?
Hi all,

We are temporarily forcing a hop between the sites until this is solved.
Tracked in https://its.cern.ch/jira/browse/ATLDDMOPS-5798

Cheers,
Fabio for DDM Ops
> Do we have ETA?
This question is for @gordon. He is going to do the work. I assume he'll have to declare a DT.
No ETA at the moment, and I'll need support from Sam / Bruno as well to make sure I don't mess up something on the storage side.
However, right at this very minute, our campus network team is looking at the problem, so we'll hold off until they've finished. We might even get a fix which makes shuffling things around irrelevant... fingers crossed!
x509 auth is not working but tokens are OK. Last time a restart of the door helped.
Any news here?
So, we did reset the xrootd door (twice, in fact, because the first time only helped for about 12 hrs). The current plans are:

1) move the xrootd servers to new releases of xrootd (and xrootd-ceph and ceph backends)
2) move around the xrootd servers that can be seen by the outside world on IPv6 to try to improve capacity for ATLAS (trading off against capacity for other VOs)
3) continue trying to solve the core problem (most of our servers can't speak IPv6 outside of the University network, which limits not just storage but also, for example, our ability to move CEs to v6 as well), working with University network admins. This has actually been ongoing - a number of potential issues have been fixed even this week - but without apparently finding whatever's causing the issue in question.
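The asymmetry described above (hosts that advertise AAAA records but are not actually reachable over IPv6 from outside) can be probed from any client. A minimal sketch with a hypothetical helper name, Python standard library only; the port default matches the `cephc04.gla.scotgrid.ac.uk:1094` endpoint seen in the failure logs above:

```python
import socket

def addrs_by_family(host, port=1094):
    """Resolve host and group its addresses by IP family.

    Returns a dict like {"ipv4": [...], "ipv6": [...]}. A host that is
    dual-stack on paper but IPv6-dead in practice will still show AAAA
    records here; actual reachability needs a connect test per address.
    """
    out = {"ipv4": [], "ipv6": []}
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, proto=socket.IPPROTO_TCP):
        if family == socket.AF_INET:
            out["ipv4"].append(sockaddr[0])
        elif family == socket.AF_INET6:
            out["ipv6"].append(sockaddr[0])
    return out

if __name__ == "__main__":
    # "localhost" keeps the demo self-contained; substitute the SE host.
    print(addrs_by_family("localhost"))
```

Each address in the two lists can then be tested with an explicit per-family TCP connect to see which family actually carries traffic.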
Thanks for the info. As it looks like this can take some time, I have set the ticket on hold.
Dear site admins,
Over the past 3h, this site has been failing as both source and destination with error: "HTTP 500 : Unexpected server error: 500".

Link: https://monit-grafana.cern.ch/d/FtSFfwdmk/ddm-transfers?orgId=17&var-binning=%24__auto_interval_binning&var-groupby=dst_experiment_site&var-activity=Analysis+Input&var-activity=Analysis+Output&var-activity=Data+Carousel+Analysis&var-activity=Data+Carousel+Production&var-activity=Data+Challenge&var-activity=Data+Consolidation&var-activity=Data+Rebalancing&var-activity=Deletion&var-activity=Express&var-activity=Functional+Test&var-activity=Production+Input&var-activity=Production+Output&var-activity=Recovery&var-activity=Staging&var-activity=T0+Export&var-activity=T0+Tape&var-activity=T0+Tape+Derived&var-activity=T0+Tape+RAW&var-activity=User+Subscriptions&var-activity=default&var-activity=T0+Recall&var-src_tier=0&var-src_tier=1&var-src_tier=2&var-src_country=All&var-src_cloud=All&var-src_site=All&var-src_endpoint=All&var-src_token=All&var-columns=src_experiment_site&var-dst_tier=0&var-dst_tier=1&var-dst_tier=2&var-dst_country=All&var-dst_cloud=All&var-dst_site=UKI-SCOTGRID-GLASGOW&var-dst_endpoint=All&var-dst_token=All&var-rows=dst_experiment_site&var-measurement=ddm_transfer&var-retention_policy=raw&var-include=&var-exclude=TEST%7CPPS%7CGRIDFTP%7CLAKE%7CAWS&var-exclude_es=All&var-include_es_dst=All&var-include_es_src=All&var-activity_disabled=Analysis+Input&var-activity_disabled=Data+Consolidation&var-activity_disabled=Deletion&var-activity_disabled=Functional+Test&var-activity_disabled=Production+Input&var-activity_disabled=Production+Output&var-activity_disabled=User+Subscriptions&var-protocol=All&var-remote_access=All&from=1771453886901&to=1771464686901

Example of a transfer failure:

02/19/2026, 01:29:54 AM
Analysis Output
user.menke
user.menke.48610805._002006.output.root

HTTP 500 : Unexpected server error: 500
transfer-failed
RO-07-NIPNE_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
5.2 mins
7.13 GB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/c2928e4a-0d31-11f1-a545-fa163e9896b6

davs://tbit00.nipne.ro:443/dpm/nipne.ro/home/atlas/atlasscratchdisk/rucio/user/menke/57/50/user.menke.48610805._002006.output.root?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/menke/57/50/user.menke.48610805._002006.output.root
davs
7130591004
1889496570-1771464594000
1771464618
1771464017000
user.menke.mc16_13TeV.410471.PhPy8EG.mc16d_Feb2026_ttccReversed_WithPL_Sys_output.root.648112593.648112593
user.menke
UNKNOWN
UNKNOWN
true
UNKNOWN
436665608
1771464609000

02/19/2026, 01:29:53 AM
Functional Test
tests
step14.20397.62852.recon.ESD.21178.16559

HTTP 500 : Unexpected server error: 500
transfer-failed
UKI-LT2-QMUL_DATADISK
UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
2.0 mins
1.05 MB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/e98338a0-0d28-11f1-bfa1-5a59366a71a6

davs://webdav.esc.qmul.ac.uk:8443/webdav/atlas/atlasdatadisk/rucio/tests/0c/76/step14.20397.62852.recon.ESD.21178.16559?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:datadisk/rucio/tests/0c/76/step14.20397.62852.recon.ESD.21178.16559
davs
1048576
-2058997527-1771464593000
1771464618
1771460406000
step14.20397.62852.recon.ESD.21178
tests
UNKNOWN
UNKNOWN
true
UNKNOWN
432132965
1771464608000

02/19/2026, 01:29:44 AM

Analysis Output
user.chhultqu
user.chhultqu.48565189.EXT0._146731.DAOD_EGAM4.DAOD_EGAM4.pool.root

HTTP 500 : Unexpected server error: 500
transfer-failed
UKI-LT2-QMUL_SCRATCHDISK
UKI-SCOTGRID-GLASGOW-CEPH_SCRATCHDISK
2.0 mins
7.97 MB
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/24462eaa-0d30-11f1-8660-5a59366a71a6

davs://webdav.esc.qmul.ac.uk:8443/webdav/atlas/atlasscratchdisk/rucio/user/chhultqu/86/70/user.chhultqu.48565189.EXT0._146731.DAOD_EGAM4.DAOD_EGAM4.pool.root?copy_mode=pull
davs://cephc04.gla.scotgrid.ac.uk:1094/atlas:scratchdisk/rucio/user/chhultqu/86/70/user.chhultqu.48565189.EXT0._146731.DAOD_EGAM4.DAOD_EGAM4.pool.root
davs
7965457
-1831937552-1771464584000
1771464618
1771463537000
user.chhultqu.data24.physics_Main.DAOD_EGAM4_v6_EXT0.646424600.646424600
user.chhultqu
UNKNOWN
UNKNOWN
true
UNKNOWN
432439748
1771464610000

Please have a look.

Khanh (ADCoS shifter).
Any update here?
If you look at ddm, you should see that we now have no 500 errors for the past 5+ days. The issue was caused by excessive logging from xrootd filling up disks and needing some cleanup (and drastic ameliorative action to control the log sizes).
Thanks Sam.

The previous discussion on this ticket was about the IPv6 issue (cf https://helpdesk.ggus.eu/#ticket/zoom/3828, https://its.cern.ch/jira/browse/ADCINFR-280, https://its.cern.ch/jira/browse/ATLDDMOPS-5787) leading to transfer failures to RAL-LCG2.

Are there any news about this? Can we remove the workaround on the Rucio side that prevents direct transfers between Glasgow and RAL?
Hi, sorry for the delayed reply. As far as we can see, the IPv6 issues are resolved (at the University level), so we'd be happy to enable the GLA->RAL links again
All the metrics discussed in this ticket have looked quite good since March 5 (deletions, transfers as source and destination). There is also a complementary ticket https://helpdesk.ggus.eu/#ticket/zoom/1001975 that is still open.

Is there any news about the GLA->RAL link? As far as I can tell, this is the only outstanding question that is unique to this ticket.
WLCG #1000957 (id:1000957) IGWN glide-ins are getting 'stuck' at SGridGLA
State: waiting for submitter's reply  |  Priority: less urgent  |  Opened: 2025-10-23 09:46 (163d ago)  |  Updated: 2025-11-25 08:57
Conversation (6 messages)
This ticket is mainly intended to get in touch with the admins at Glasgow. I am relaying this on behalf of the igwn operations.
LIGO/VIRGO jobs submitted to SGridGLA through the ARC-CE are not going through. There may be some configuration needed to accept glide-ins, and to get the ID mapping between SciTokens and the local Unix system sorted out.

The front-end mentioned is ce04.gla.scotgrid.ac.uk
I don't actually see anything hitting the CE at all. We already had a mapping for SciTokens issued by https://scitokens.org/ligo. I've now added one for https://cilogon.org/igwn (from a quick web search, but I'm not sure if this is correct?).
Gordon
Hi Gordon,
thanks for checking. James Clark (who cannot open this ticket, unfortunately) asked me to relay this message:

the original mapping should be correct. E.g., here's what we have on the condor CEs:
# LIGO ##
SCITOKENS /^https\:\/\/scitokens\.org\/ligo,/ igwn-pilot
And the Virgo pilots *should* be presenting both types of credential, after all.
So we'll need someone from OSG to take a look at the logs on the factory itself.
The suggestion is to check that the CE accepts the SSL cert for https://scitokens.org/ligo as it may not be a part of the standard IGTF bundle.
I've installed the GTS WE1 cert on our CEs, and also reverted the addition of the https://cilogon.org/igwn mapping. Let's see if that makes a difference.
Hi, since making this change we haven't seen any activity from your VO; do you see anything on your end?
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM100%95%100%95%100%98%99%99%100%100%?100%100%95%83%96%
HammerCloud100%100%100%100%100%100%100%100%100%100%100%100%100%100%100%98%
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (1)

CMS tickets (1)
CMS #681826 (id:1927) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_Baylor
State: assigned  |  Priority: less urgent  |  Opened: 2025-01-29 10:48 (430d ago)  |  Updated: 2025-09-02 08:34
Conversation (13 messages)
GGUS ID: 168899
Last modifier: Jakrapop Akaranee
Date: 2024-11-05 11:14:27
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_Baylor
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: CMS_SAM tests
Description:
Dear Baylor Site Administrators,
We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack, specifically for the following endpoint:

kodiak-se.baylor.edu (XrootD [1] and WebDAV [2] )

Could you please review dual stack support on your storage element?
Thank you for your assistance.
Best Regards, Jakrapop
-----------
[1] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dkodiak-se.baylor.edu%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice

[2] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dkodiak-se.baylor.edu%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
GGUS ID: 168899
Last modifier: Kenichi Hatakeyama
Date: 2024-11-06 22:19:34

Public Diary:
Will try to take a look.
Internal Diary:

----------- e-mail with large body ------
added in total as mailbody.2024-12-18_19.05.44.txt

------------ e-mail with large body ------
GGUS ID: 168899
Last modifier: Kenichi Hatakeyama
Date: 2024-12-09 14:34:39

Public Diary:
I am just a little struggling to find time. If this is urgent, please let me know. Thank you.
Internal Diary:

GGUS ID: 168899
Last modifier: Jakrapop Akaranee
Date: 2024-12-16 14:06:12

Public Diary:
Dear Kenichi,

Thank you for providing the progress and update. It’s not extremely urgent, but could you let me know when you expect it to be completed?

Best regards,
Jakrapop

Internal Diary:

GGUS ID: 168899
Last modifier: Kenichi Hatakeyama
Date: 2024-12-18 13:57:17

Public Diary:
Hoping later this week or next week..
Internal Diary:

Assigning: the CMS site name was missing because of an error during the import to the new GGUS.

Jakrapop
Hi. I probably need some help to get an idea of what this is about. What config needs to be revisited?
Hello Kenichi, could you please provide an update on this ticket? You need to ask your network engineers/ISP to confirm that your network infrastructure already supports IPv6, then modify some network configuration on your storage endpoint so it is reachable over both IPv4 and IPv6.
Thank you,
Noy
I reached out to our cluster admins. Will keep you posted.
Is this relevant just for the SE? Or is it important to understand the IPv6 plan for the CE and compute nodes as well?
Hi. Dr. Kenichi Hatakeyama forwarded to me your message about enabling IPv6 on our network infrastructure so we can set up a dual stack on our storage element (SE). I have discussed the project with our network infrastructure team. They are going to determine what is needed to enable IPv6 from our border routers to our SE. They wanted me to ask GGUS support what information you need from them for configuring the dual network stack on the host.

Along those same lines, I will need instructions for how to configure the host.

Thanks very much for your help,

Mike Hutcheson
Director – Research Technology
Information Technology Services
One Bear Place #97268
Waco, TX 76798
Tel: 254-710-4110


GGUS Helpdesk Notification
Ticket #1927 "Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_Baylor"
was updated by Chan-Anun Rungphitakchai on 2025-07-21 10:55 (UTC).

Hello Kenichi,
Could you please provide any update about this ticket. You need to ask network engineer/ISP to confirm your computer network infrastructure already support IPv6. Then modify some network configurations on your storage endpoint to obtain IPv4/IPv6.
Thank you,
Noy


Ticket is assigned to NGIs › USCMS

Thank you, Mike.
Noy or other experts, do you have response to Mike's questions?
Hello, Kenichi. WLCG plans to get rid of IPv4 in the future. At the moment, site support asks every site to make sure their endpoints support IPv6; we do not ask you to remove IPv4. Dual stack is the best solution for now.
Thank you,
Noy
[1]
https://twiki.cern.ch/twiki/bin/view/CMS/IPv6Status4Sites
-16d-15d-14d-13d-12d-11d-10d-9d-8d-7d-6d-5d-4d-3d-2d-1d
SAM0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%0%
HammerCloud— no data —
FTS50%100%100%100%0%100%100%98%0%0%100%100%100%87%100%0%

Open GGUS tickets (4)

CMS tickets (4)
CMS #1000989 (id:1000989) XRootD SAM test failure at T3_US_Brown
State: assigned  |  Priority: less urgent  |  Opened: 2025-10-28 10:02 (158d ago)  |  Updated: 2026-01-12 14:29
Conversation (4 messages)
Hello Brown admins.
Since Wednesday (22 Oct), your XRootD "pbrux30cit.hep.brown.edu" endpoint has been failing the SAM "1connection" test [1]. The log file shows an "Endpoint does not have a reachable IPv4 nor IPv6 address" error [2], and a manual test gave the same result [3]. Could you please take a look and check this server's status/connection/services?
Best regards,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_US_Brown
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-1connection&var-dst_hostname=pbrux30cit.hep.brown.edu&var-timestamp=1761636969000
[3]
[crungphi@lxplus810 ~]$ ping -c 5 pbrux30cit.hep.brown.edu
PING pbrux30cit.hep.brown.edu (128.148.128.20) 56(84) bytes of data.
--- pbrux30cit.hep.brown.edu ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4096ms

[crungphi@lxplus810 ~]$ nc -zv -4 -w 30 128.148.128.20 1094
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
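The manual probe above can be scripted so the shifter does not rely on `ping` alone (100% ICMP loss by itself does not prove the service is down, since some sites filter ping). A minimal sketch with a hypothetical helper, Python stdlib, equivalent in spirit to the `nc -zv -w 30` check:

```python
import socket

def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within
    `timeout` seconds; False on refusal, timeout, or resolution failure.
    Roughly what `nc -zv -w <timeout> host port` reports."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Port 9 (discard) is closed on most hosts, so this usually prints False.
    print(tcp_reachable("127.0.0.1", 9, timeout=2))
```

Running it against the endpoint and port from the ticket (pbrux30cit.hep.brown.edu:1094) reproduces the SAM "1connection" check from any client with outbound connectivity.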
Hello,
pbrux30cit.hep.brown.edu is no longer our endpoint. Our new endpoint is bruxmg.hep.brown.edu and should be configured as such in the siteconf.

Thank you,
Mike
Extra endpoint list updated.
bruxmg.hep.brown.edu / XROOTD is failing SAM tests too.

- Stephan
Hello, Brown admins. The issue is still ongoing. Could you please provide an update about the storage endpoint? Thank you,
Noy
CMS #682622 (id:2755) Request for XRootD Upgrade & Network Packet Labeling Configuration at T3_US_Brown
State: in progress  |  Priority: urgent  |  Opened: 2025-03-12 14:34 (388d ago)  |  Updated: 2025-10-28 14:07
Conversation (6 messages)
Dear Brown Site Administrators,

CMS is resuming data taking next month and extending the usage of tokens. There have been a lot of improvements made in XRootD and bug fixes for tokens. We would like all CMS sites to upgrade the XRootD services, i.e. site redirectors and storage services, if native XRootD is being used.

We would also like to take this opportunity to encourage sites to enable network packet labeling by adding the following configuration lines.

xrootd.pmark ffdest us.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path-to-store>/store cms
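Email line-wrapping makes it easy to paste these directives incorrectly, so a mechanical check before restarting xrootd can help. A minimal sketch with a hypothetical helper in Python; the accepted subcommand list is taken only from the four lines above, not from the full xrootd reference:

```python
# Known pmark subcommands, per the snippet above (not exhaustive).
PMARK_SUBCMDS = {"ffdest", "domain", "defsfile", "map2exp"}

def check_pmark_lines(text):
    """Return a list of (lineno, problem) tuples for xrootd.pmark lines."""
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line.startswith("xrootd.pmark"):
            continue  # other directives are out of scope for this check
        parts = line.split()
        if len(parts) < 3:
            problems.append((n, "directive needs a subcommand and a value"))
        elif parts[1] not in PMARK_SUBCMDS:
            problems.append((n, f"unknown subcommand {parts[1]!r}"))
    return problems

cfg = """\
xrootd.pmark ffdest us.scitags.org:10514
xrootd.pmark domain any
xrootd.pmark defsfile curl https://scitags.docs.cern.ch/api.json
xrootd.pmark map2exp path /<path-to-store>/store cms
"""
print(check_pmark_lines(cfg))  # → [] when all lines are well-formed
```

A directive broken across two lines by mail wrapping (e.g. a bare `us.scitags.org:10514` line) simply fails the `startswith` filter, while a truncated `xrootd.pmark ffdest` line is flagged as missing its value.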



If your site supports multiple VOs on the same XRootD service or requires additional details, please refer to the SciTag Network Packet Labeling Twiki page: https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesXrootdScitagPacketLabeling.



Our target date for completing the upgrade and configuration update is April 5th, in preparation for the LHC commissioning for 2025.

Finally, if I am not wrong, your site is running a very old XRootD version. Upgrading to v5.7.3 will introduce changes to the VOMS library and configuration currently used by your XRootD v5.4.3. Please see Bockjoo's twiki page at https://twiki.cern.ch/twiki/bin/view/CMSPublic/OSG36XRootDINFO for the changes this requires.

Thank you for your cooperation. Please let us know if you have any questions or concerns.

Kind regards,

Jakrapop Akaranee

CMS Site Support
Any update? -- Thanks, Noy
Hello Brown admins. We suggest you upgrade the XRootD endpoint to version 5.8.3 and implement the packet-labelling configuration. Could you please provide an update plan?
Cheers,
Noy
Hello Noy,
Apologies for the silence on this subject. I have a new server built with xrootd 5.7.3. Early next week I am going to make PRs to change the endpoint in some central OSG configs.

Mike
Hello Mike. Thank you for your response. By the way, could you please provide any updates?
Best regards,
Noy
Good Morning,
Our endpoint was upgraded a few months back to use: xrootd-5.7.3-1.5.osg23.el9.x86_64

Thank you,
Mike
CMS #681797 (id:1898) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_Brown
State: in progress  |  Priority: less urgent  |  Opened: 2025-01-29 10:46 (430d ago)  |  Updated: 2025-10-09 13:01
Conversation (5 messages)
GGUS ID: 168934
Last modifier: Jakrapop Akaranee
Date: 2024-11-07 08:58:53
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_Brown
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: CMS_SAM tests
Description:
Dear Brown Site Administrators,

We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack network, specifically for the following endpoint:

Both WebDAV [1] and XRootD [2] services at pbrux30cit.hep.brown.edu

Could you please review dual stack support on your storage element? Thank you for your assistance. Best Regards, Jakrapop
-----------
[1] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dpbrux30cit.hep.brown.edu%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
[2] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dpbrux30cit.hep.brown.edu%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
GGUS ID: 168934
Last modifier: Michael Antonelli
Date: 2024-11-07 15:44:11

Status: in progress
Responsible Unit: USCMS
Public Diary:
Hello,

I will work on getting ipv6 support working.

Thank you for the notice,
Mike
GGUS ID: 168934
Last modifier: Jakrapop Akaranee
Date: 2024-11-27 12:52:37

Public Diary:
Dear Michael,

Just following up regarding IPv6 support. Would you be able to provide an update on the current progress?

Cheers,
Jakrapop
Assigning: the CMS site name was missing because of an error during the import to the new GGUS.
Jakrapop
Hello, just a reminder about this old ticket. Could you please provide an update?
Thank you,
Noy
CMS #682006 (id:2107) SAM mkdir failure in /store/temp/user area at (T3_US_Brown)
State: in progress  |  Priority: less urgent  |  Opened: 2025-02-03 13:00 (425d ago)  |  Updated: 2025-04-02 16:37
Conversation (20 messages)
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-01-13 18:23:53
Subject: SAM mkdir failure in /store/temp/user area at (T3_US_Brown)
Ticket Type: USER
CC: ;cms-comp-ops-site-support-team@cern.ch
Status: assigned
Responsible Unit: USCMS
Issue type: File Access
Description:
Dear site admin,
we are switching SAM tests to use /store/temp/user for write testing instead of /store/unmerged. Any CMS user, i.e. a user presenting a certificate with a CMS VOMS extension, should be able to read/write/delete in the /store/temp/user area. We discovered that this does not work for the SAM test at your site. Could you please take a look/adjust the permissions.
Thanks,
cheers, Stephan

https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fcheck_command%3D%26filled_in%3Dfilter%26host_last_check_from%3D%26host_last_check_until%3D%26host_last_state_change_from%3D%26host_last_state_change_until%3D%26host_regex%3D%26is_host_in_notification_period%3D-1%26is_in_downtime%3D-1%26is_service_acknowledged%3D-1%26is_service_active_checks_enabled%3D-1%26is_service_in_notification_period%3D-1%26is_service_is_flapping%3D-1%26is_service_notifications_enabled%3D-1%26is_service_scheduled_downtime_depth%3D-1%26limit%3Dhard%26opthost_group%3D%26optservice_group%3D%26selection%3Db0012f76-9eb6-4674-b095-0cc7de39e1f3%26service_output%3D%26service_regex%3Dorg.cms.SE-WebDAV%26svc_last_check_from%3D%26svc_last_check_until%3D%26svc_last_state_change_from%3D%26svc_last_state_change_until%3D%26view_name%3Dsearchsvc
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-01-13 18:44:44

Public Diary:
Sorry, the issue is not the /store/temp/user write, but SAM being able to write into /store/unmerged, and not just /store/unmerged/SAM, during the transition.
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-03-24 14:55:28

Public Diary:
CMS switched write tests to /store/temp/user. - Stephan
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Michael Antonelli
Date: 2023-06-08 13:09:24

Public Diary:
Hello,

I updated the auth_file at our endpoint to allow cmsprod to access /store/temp/user. Looking at the XRD logs now no longer shows permission denied errors but instead shows:
u04.1377606:298@etf-01.cern.ch Xrootd_Response: sending err 3011: Unable to open /store/test/xrootd/T2_DE_DESY/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root; no such file or directory

thank you,
Mike

Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Michael Antonelli
Date: 2023-06-08 13:41:39

Public Diary:
Sorry I meant to post this error message instead:
cmsprod.4561:301@etf-28.cern.ch Unable to locate /store/temp/user/cmssam/se_webdav_20230608-0746_etf-28_2710624_cert.txt; no such file or directory

thank you,
Mike

Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-06-09 06:51:54

Public Diary:
Thanks for investigating and updating things, Mike!
So, on SE-WebDAV-7crt-write the "05:34:02 [I] Davix: < HTTP/1.1 403 Forbidden" in
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-7crt-write&var-dst_hostname=pbrux30cit.hep.brown.edu&var-timestamp=1686288843000
is the relevant error message. (The "no such file or directory" is a follow-on error.) So, the authorization for SAM to write to /store/temp/user/ is not yet working.
For the SE-XRootD-9federation test, the files at pbrux30cit.hep.brown.edu:1094 are not reachable via the federation. Your xrootd service should subscribe to the transitional federation. Take a look at "Joining Federation (production or transitional)" on https://twiki.cern.ch/twiki/bin/view/CMSPublic/CMSXrootDArchitecture
Thanks,
cheers, Stephan
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Michael Antonelli
Date: 2023-06-09 13:56:36

Public Diary:
Thank you for pointing this out, Stephan; I will work through this documentation.

Mike
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Michael Antonelli
Date: 2023-06-20 15:55:47

Public Diary:
Hello Stephan,

I am trying to get subscribed to the transitional federation using the documentation you sent. Since we have just one server, I followed the doc for no local redirector.

I added these config lines to xrootd-standalone.cfg (I tried xrootd-clustered.cfg as well, same results):
all.role server
all.manager cms-xrd-transit.cern.ch+ 1213


Fallback access is set up correctly following this document: https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDFallback

Xrootd service was restarted and I waited about 60 mins to make sure the change had time to propagate. Then I initialized a voms proxy (on lxplus) with the cms voms and used this command to test:
xrdfs cms-xrd-transit.cern.ch locate -d -m /store/test/xrootd/T3_US_Brown/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root

I am still getting "[ERROR] Server responded with an error: [3011] No servers have the file"... Any thoughts?

thank you,
Mike
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-06-20 16:49:39
Changed CC to cms-comp-ops-site-support-team@cern.ch;

Public Diary:
Hallo Michael,
if I query the transitional federation, i.e. do a
/usr/bin/xrdmapc --list all cms-xrd-transit.cern.ch:1094
I see brux11.hep.brown.edu:1094 subscribed, but not pbrux30cit.hep.brown.edu:1094.
You may want to shut down the xrootd service on the old machine and remove the startup files.
For the new service, take a look at the xrootd and cmsd logs; there should be a subscription failure logged there.
Thanks,
cheers, Stephan
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Michael Antonelli
Date: 2023-06-20 17:56:16

Public Diary:
Thanks for the response! My apologies, it turns out cmsd was not configured and once I got it up and running the subscription appeared and the test succeeded.

Mike
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-06-20 21:30:13

Public Diary:
Very good, thanks Mike. The SE-XRootD-9federation test, however, is still failing:
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-9federation&var-dst_hostname=pbrux30cit.hep.brown.edu&var-timestamp=1687293482000
and also for me an xrdfs root://cms-xrd-global.cern.ch/ ls -l /store/test/xrootd/T3_US_Brown/... or xrdcp -f root://cms-xrd-global.cern.ch//store/test/xrootd/T3_US_Brown/... don't work. Can you please check your VOMS access config?

The ticket is originally for the SE-WebDAV-7crt-write failures, i.e. SAM writing to davs://pbrux30cit.hep.brown.edu:1094/store/temp/user/cmssam . This is still failing with "HTTP 403: Forbidden". A recent log file is at
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-7crt-write&var-dst_hostname=pbrux30cit.hep.brown.edu&var-timestamp=1687291849000
Could you please take a look at this too? Thanks,
- Stephan
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Michael Antonelli
Date: 2023-06-23 17:26:39

Public Diary:
Hello Stephan,

I don't have the VOMS plugin installed or configured, we are still using OSG 3.5 and lcmaps. The grid-mapfile is currently handling all the mapping. Unless you mean something different when you say VOMS access config?

Mike
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-06-23 17:49:56

Public Diary:
You would need to pull a gridmap file with all users from the VOMS admin server, or construct one yourself from CRIC, to provide all CMS users with read access; using VOMS is thus the recommended approach. For WebDAV you can probably maintain the Rucio, production, SAM and local user DNs manually. For XRootD you should not subscribe to the federation if access is limited to local users.
Thanks,
- Stephan
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Michael Antonelli
Date: 2023-06-26 15:19:12

Public Diary:
Hello again,

Hope you enjoyed your weekend. Is there any way you could elaborate more on your last message? For instance, what do you mean by "construct one yourself from CRIC" and "Rucio, production, SAM and local user DNs manually"? Is there a document somewhere where I could collect all of these DNs to put in our grid-mapfile to potentially address this issue? Am I right in understanding that you are suggesting I would just be better off upgrading and using the VOMS plugin?

Thanks for your time and your help,
Mike
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2023-06-26 16:01:54

Public Diary:
Hallo Mike,
yes, installing/configuring the VOMS plugin module would be much simpler than maintaining your own gridmap file. (We don't have an example script, but some sites do this, i.e. it's possible. The CMS CRIC instance knows all CMS users and their DNs, i.e. its API can be used to generate a gridmap file containing all valid CMS users.) For IAM-issued token access you really want to be at v5.5.4 or higher; adding the RPM/config at that time would be the simplest approach. Bockjoo has the configuration of the new VOMS module documented at "XRootD Installation/Configuration with OSG 3.6" on https://twiki.cern.ch/twiki/bin/view/CMSPublic/CMSXrootDArchitecture
Thanks,
cheers, Stephan
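Stephan's pointer above — that the CMS CRIC API knows all CMS users and their DNs and can be used to generate a gridmap file — can be sketched as below. The JSON field name `dn` and the query URL in the comment are assumptions, not the documented CRIC schema; only the grid-mapfile line format (a quoted DN followed by the local account) is standard.

```python
def format_gridmap(users, local_account="cms"):
    """Render CRIC-style user records as grid-mapfile lines:
    a quoted certificate DN followed by the local unix account."""
    lines = []
    for user in users:
        dn = user.get("dn")  # field name is an assumption
        if dn:
            lines.append('"%s" %s' % (dn, local_account))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # In reality the records would be fetched from the CRIC API
    # (hypothetical URL): https://cms-cric.cern.ch/api/accounts/user/query/?json
    sample = [{"dn": "/DC=ch/DC=cern/OU=Users/CN=jdoe/CN=000001/CN=Jane Doe"}]
    print(format_gridmap(sample), end="")
```

The output would be written to the path the storage element's `gridmap` directive points at and regenerated periodically, e.g. from cron.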
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
GGUS ID: 160088
Last modifier: Stephan Lammel
Date: 2024-10-28 13:05:35

Status: in progress
Responsible Unit: USCMS
Public Diary:
writing to /store/temp/user still fails. The ticket should remain open.
Thanks,
- Stephan
Internal Diary:
Involved CMS AAA - WAN Access in this ticket.
Any update? -- Thanks Noy
Assign ticket to Brown site.
             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM            0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%
HammerCloud  — no data —
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

No open GGUS tickets

             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM          100% 100%  95%  97% 100%  97%  84%  41% 100%  97%  95% 100% 100%  97%  85%  96%
HammerCloud     ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

No open GGUS tickets

             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM            0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%
HammerCloud     ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

No open GGUS tickets

             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM            0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%
HammerCloud  — no data —
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

No open GGUS tickets

             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM          — no data —
HammerCloud  — no data —
FTS          — no data —

No open GGUS tickets

             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM            0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%
HammerCloud     ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (2)

CMS tickets (2)
CMS #681810 (id:1911) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_NotreDame
State: on hold  |  Priority: less urgent  |  Opened: 2025-01-29 10:46 (430d ago)  |  Updated: 2025-10-09 13:05
Conversation (7 messages)
GGUS ID: 168903
Last modifier: Jakrapop Akaranee
Date: 2024-11-05 14:19:04
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_NotreDame
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: CMS_SAM tests
Description:
Dear NotreDame Site Administrators,

We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack, specifically for the following endpoint:

hactar01.crc.nd.edu (XrootD [1] and WebDAV [2])

Could you please review dual stack support on your storage element?

Thank you for your assistance.
Best Regards,
Jakrapop
-----------
[1]https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhactar01.crc.nd.edu%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
[2]https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhactar01.crc.nd.edu%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
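The dual-stack check being requested here boils down to whether the endpoint's hostname resolves to both IPv4 (A) and IPv6 (AAAA) addresses. A minimal DNS-level sketch — it tests resolution only, not actual connectivity:

```python
import socket

def resolved_families(host):
    """Address families (AF_INET, AF_INET6) the host resolves to."""
    try:
        return {info[0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return set()

def classify(families):
    """Summarize a set of address families as a stack label."""
    has4 = socket.AF_INET in families
    has6 = socket.AF_INET6 in families
    if has4 and has6:
        return "dual-stack"
    if has6:
        return "ipv6-only"
    if has4:
        return "ipv4-only"
    return "unresolvable"

if __name__ == "__main__":
    host = "hactar01.crc.nd.edu"  # endpoint named in this ticket
    print(host, "->", classify(resolved_families(host)))
```

A host that comes back anything other than "dual-stack" would fail the ETF dual-stack expectation described above.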
GGUS ID: 168903
Last modifier: Irena Johnson
Date: 2024-11-07 21:59:42

Public Diary:
Hello Jakrapop,

This summer we upgraded our infrastructure from RHEL7 to RHEL9, along with other software upgrades. Could you please share documentation so I can double-check our configuration?

The xrootd packages we have installed on our SE are listed below:

gfal2-plugin-xrootd-2.23.1-1.el9.x86_64
osg-xrootd-23-6.osg23.el9.noarch
osg-xrootd-standalone-23-6.osg23.el9.noarch
xrootd-5.7.1-1.3.osg23.el9.x86_64
xrootd-client-5.7.1-1.3.osg23.el9.x86_64
xrootd-client-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-client-libs-5.7.1-1.3.osg23.el9.x86_64
xrootd-cmstfc-1.5.2-8.osg23.el9.x86_64
xrootd-debuginfo-5.7.1-1.3.osg23.el9.x86_64
xrootd-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-fuse-5.7.1-1.3.osg23.el9.x86_64
xrootd-libs-5.7.1-1.3.osg23.el9.x86_64
xrootd-multiuser-2.2.0-1.1.osg23.el9.x86_64
xrootd-private-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-scitokens-5.7.1-1.3.osg23.el9.x86_64
xrootd-selinux-5.7.1-1.3.osg23.el9.noarch
xrootd-server-5.7.1-1.3.osg23.el9.x86_64
xrootd-server-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-server-libs-5.7.1-1.3.osg23.el9.x86_64
xrootd-voms-5.7.1-1.3.osg23.el9.x86_64
xrootd-voms-debuginfo-5.7.1-1.3.osg23.el9.x86_64

Thanks,
Irena
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 168903
Last modifier: Irena Johnson
Date: 2024-11-11 16:20:54

Public Diary:
Hello Jakrapop,

This summer we upgraded our infrastructure from RHEL7 to RHEL9, along with other software upgrades. Could you please share documentation so I can double-check our configuration?

The xrootd packages we have installed on our SE are listed below:

gfal2-plugin-xrootd-2.23.1-1.el9.x86_64
osg-xrootd-23-6.osg23.el9.noarch
osg-xrootd-standalone-23-6.osg23.el9.noarch
xrootd-5.7.1-1.3.osg23.el9.x86_64
xrootd-client-5.7.1-1.3.osg23.el9.x86_64
xrootd-client-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-client-libs-5.7.1-1.3.osg23.el9.x86_64
xrootd-cmstfc-1.5.2-8.osg23.el9.x86_64
xrootd-debuginfo-5.7.1-1.3.osg23.el9.x86_64
xrootd-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-fuse-5.7.1-1.3.osg23.el9.x86_64
xrootd-libs-5.7.1-1.3.osg23.el9.x86_64
xrootd-multiuser-2.2.0-1.1.osg23.el9.x86_64
xrootd-private-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-scitokens-5.7.1-1.3.osg23.el9.x86_64
xrootd-selinux-5.7.1-1.3.osg23.el9.noarch
xrootd-server-5.7.1-1.3.osg23.el9.x86_64
xrootd-server-devel-5.7.1-1.3.osg23.el9.x86_64
xrootd-server-libs-5.7.1-1.3.osg23.el9.x86_64
xrootd-voms-5.7.1-1.3.osg23.el9.x86_64
xrootd-voms-debuginfo-5.7.1-1.3.osg23.el9.x86_64

Thanks,
Irena
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 168903
Last modifier: Stephan Lammel
Date: 2024-11-11 16:59:24

Public Diary:
Hallo Irena,
you need to get an IPv6 address from your networking
group. (It's not an RPM to install, etc.)
Thanks,
cheers, Stephan

https://access.redhat.com/solutions/347693

Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 168903
Last modifier: Irena Johnson
Date: 2024-11-18 17:22:50

Public Diary:
Dear Stephan,

Unfortunately, it is not possible. Notre Dame does not have IPv6 addresses and also there is no routing at the campus networking level for IPv6 networking.


Thanks,
Irena
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
GGUS ID: 168903
Last modifier: Stephan Lammel
Date: 2024-11-18 18:45:46

Status: on hold
Responsible Unit: USCMS
Public Diary:
Thanks Irena!
I am surprised your networking team has no plans to transition
to IPv6. I'll place the ticket on hold for the time being.
Thanks,
- Stephan
Internal Diary:
Sent notification on this ticket still waiting for user input to GGUS ticket monitoring team.
Assigned missing CMS site name (error during import to new GGUS).
Jakrapop
CMS #681846 (id:1947) Update HTCONDOR config for new issuer token support at T3_US_NotreDame
State: assigned  |  Priority: very urgent  |  Opened: 2025-01-29 10:48 (430d ago)  |  Updated: 2025-07-21 10:47
Conversation (5 messages)
GGUS ID: 169422
Last modifier: Chan-anun Rungphitakchai
Date: 2024-12-16 21:08:36
Subject: Update HTCONDOR config for new issuer token support at T3_US_NotreDame
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: Other
Description:
Good afternoon, NotreDame admin
There is a new k8s issuer server (https://cms-auth.cern.ch). Your HTCONDOR endpoint does not support token authentication with the new issuer [1]. Could you check your OSG version? If your site still uses OSG 3.5, please upgrade to a newer OSG version and update osg-scitokens-mapfile.rpm to version 13-2, or add new configurations for cms-auth.cern.ch in osg-scitokens-mapfile.conf.
I attach a document and examples [2]
Best Regards,
Noy
[1]
https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Ddeepthought.crc.nd.edu%26site%3Detf%26view_name%3Dhost
[2]
https://osg-htc.org/docs/compute-element/install-htcondor-ce/
# CMS IAM development instance:
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
# SAM/ETF tests (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
# CMS ITB pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
# CMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
# CMS ITB LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
# USCMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
# USCMS HEPCloud pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms
Good afternoon, NotreDame admin. The configuration from the last comment is incorrect: the issuer URL should be lowercase ("cms", not "CMS"). The OSG team has already updated the OSG RPM. Could you please consider upgrading to OSG 23/24, or manually add the new config to mapfile.conf according to the OSG site [1]?
Thank you,
Noy
[1]
topology.opensciencegrid.org/collaborations/osg-scitokens-mapfile.conf
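The root cause in this exchange is simply a capitalised issuer URL ("CMS-auth" instead of "cms-auth"). A small lint pass over the mapfile catches that class of mistake before HTCondor-CE ever sees it; `find_uppercase_issuers` is a hypothetical helper assuming the plain `SCITOKENS /regex/ account` line shape shown above, not part of any OSG tooling:

```python
import re

# Matches lines of the form: SCITOKENS /<escaped issuer>,<subject>/ <account>
SCITOKENS_LINE = re.compile(r'^SCITOKENS\s+/(?P<pattern>.*)/\s+(?P<account>\S+)$')

def find_uppercase_issuers(mapfile_text):
    """Return (line_no, issuer_pattern) pairs whose issuer part,
    i.e. everything before the ',' separator, contains upper-case
    letters -- issuer URLs must be lower-case."""
    bad = []
    for n, line in enumerate(mapfile_text.splitlines(), 1):
        m = SCITOKENS_LINE.match(line.strip())
        if not m:
            continue  # comments, blank lines, other auth methods
        issuer = m.group("pattern").split(",")[0]
        if issuer != issuer.lower():
            bad.append((n, issuer))
    return bad
```

Run against the example block above, it would flag every `CMS-auth` line and pass the corrected `cms-auth` ones.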
correct ticket assignment which seems to have gotten messed up during the import. - Stephan
Can you please add support for the K8s IAM instance today or tomorrow? ETF production will switch on Monday; your site would then fail SAM tests.
Thanks,
- Stephan
Hello NotreDame admins Could you please confirm you already updated your config to support K8s IAM instance?
Thank you,
Noy
             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM           87%  65%  78%  87%  91%  94%  76%  95%  76%  79%  97%  86%  97%  87%  88%  97%
HammerCloud     ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
FTS          — no data —

Open GGUS tickets (1)

CMS tickets (1)
CMS #682202 (id:2329) Update HTCONDOR config for new issuer token support at T3_US_OSG
State: in progress  |  Priority: very urgent  |  Opened: 2025-02-13 21:33 (414d ago)  |  Updated: 2025-05-01 14:17
Conversation (31 messages)
Good afternoon, OSG admin
There is a new k8s issuer server (https://cms-auth.cern.ch). Your HTCONDOR endpoint does not support token authentication with the new issuer [1]. Could you update to the latest OSG 23/24, or add new configurations for cms-auth.cern.ch to osg-scitokens-mapfile.conf this week? ETF production will switch on Monday.
I attach a document and examples [2]
Best Regards,
Noy
[1]
https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dgate02.aglt2.org%26site%3Detf%26view_name%3Dhost
https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dits-condor-ce2.syr.edu%26site%3Detf%26view_name%3Dhost
[2]
https://osg-htc.org/docs/compute-element/install-htcondor-ce/
topology.opensciencegrid.org/collaborations/osg-scitokens-mapfile.conf
Hi, I can't open the [1] link.
For AGLT2, this is what I see from the token mapfile; it seems to be up-to-date?

[root@gate02 mapfiles.d]# more /usr/share/condor-ce/mapfiles.d/osg-scitokens-mapfile.conf |grep -i cms|grep -v "#"
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms

Cheers!

Wenjing (AGLT2)
Hallo Wenjing,
thanks for taking a look. The last seven lines have a capital "CMS" which should be lower case "cms".
Could you please adjust this?
Thanks,
cheers, Stephan
Good morning, Wenjing.

ETF production already switched to the new K8s instance this morning. Your
storage endpoint has been failing since 10:00 UTC today. Could you please take a
look?

Thank you,

Noy
Hi, the capital CMS problem seems to be fixed, and this is the current content from the token map file:
[root@gate02 ~]# more /usr/share/condor-ce/mapfiles.d/osg-scitokens-mapfile.conf |grep -i cms
## CMS ##
# CMS pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
# CMS ITB pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
# CMS LOCAL pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
# CMS ITB LOCAL pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
# USCMS LOCAL pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
# USCMS HEPCloud pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms
# CMS IAM development instance:
SCITOKENS /^https\:\/\/cms\-auth\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
SCITOKENS /^https\:\/\/cms\-auth\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
# CMS ITB pilots (development):
SCITOKENS /^https\:\/\/cms\-auth\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
# CMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/cms\-auth\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
# CMS ITB LOCAL pilots (development):
SCITOKENS /^https\:\/\/cms\-auth\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
# USCMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/cms\-auth\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
# USCMS HEPCloud pilots (development):
SCITOKENS /^https\:\/\/cms\-auth\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms

Not sure why it is still failing?

-Wenjing
Hallo Wenjing,
gate02.aglt2.org is fine. Thank you!

We still have an issue with its-condor-ce[234].syr.edu.
The ticket needs to be assigned to site "SU-ITS", which I don't find in the site selector. Assigning to helpdesk instead.

Thanks,
- Stephan
reassigning to GGUS Helpdesk. The ticket needs to be assigned to grid site "SU-ITS" not OSG software support.

- Stephan
Hi Stephan,

Thank you for pointing it out. Indeed, the site "SU-ITS" is missing for some reason. We need to check our site database, which is managed by Günter. I'm afraid we can't fix it now as it's already late here, but I will write Günter immediately so it at least gets done tomorrow. I hope that will still be ok.
No problem, take your time, Pavel.
(I was checking on ticket status and saw it re-assigned to the odd OSG group with no reply.)
Thanks,

- Stephan
Hi Stephan,

Günter reported that the site is missing in the OSG database dumps, so we should find out why it's missing there. Do you know whom we can contact about it?
Thanks Pavel!
So, Brian Lin from OSG is probably the person who can shed some light on why the CEs of Syracuse University are assigned to OSG resource group SU-ITS while the resource group is missing in the OSG database.
I assume/hope Brian is on the "OSG Software Support" unit, so I am re-assigning the ticket to that support unit.
Please correct me if this is incorrect.
Thanks,
- Stephan
Looks like RTU (Riga Technical University) another OSG site is also missing in GGUS.
- Stephan
Hallo Pavel,
we were going to make a ticket for T2_LV_HPCNET / RTU which still
doesn't have a GGUS entry. So i was going through the tickets and before
contacting OSG checked the topology data at
https://my.opensciencegrid.org/rgsummary/xml

I find an entry for both SU-ITS and RTU. Which URL are you/GGUS using
for topology information? Can you/Guenter please provide details on
what URL you are using, etc.?
Thanks,
cheers, Stephan
Hello Stephan,
Just checked with Guenter. We are using the following URL: https://my.opensciencegrid.org/rgsummary/xml?gridtype=on&gridtype_1=on&voown=on&voown_3=on&active=on&active_value=1

Changing the URL on our side could be tricky. Would it be possible to reconfigure the sites you mentioned so that they show up in our search results?

Best regards,
Aliaksei
Thanks Aliaksei!

Looks like it's the "&voown=on&voown_3=on" query that makes the sites disappear. I'll check with OSG.

Thanks,
- Stephan
Ok, checked with OSG.
I have a question for Guenter Grein now: Do you make an OSG site query for each VO, is that correct? (I naively assumed GGUS makes one query and not one per VO.)
Thanks,
- Stephan
So, the OSG team added the VO tags for both SU-ITS and RTU yesterday.
Still neither of the sites shows up in the CMS site selection in GGUS.
Something is still wrong, or not automatic!
- Stephan
@Stefan Lammel:

Hi Stefan,
Sorry for my late reply.
We make an OSG site query for each VO.
Günter
Just to mention that RTU and SU-ITS sites are now available if the group NGIs>USCMS is selected. They appeared automatically. There could be some delay wrt OSG update though, since the sync is performed daily on our side.
Ok, Thanks Guenter!
Do you really need to use the VO filter? The VO-feed has a list of the sites in use,
which will be more accurate than the VO flags of the grid consortia, which have little use.
My recommendation would be to remove the VO filter and consider the sites
listed in each experiment's VO-feed instead. It makes things simpler and takes
the information from the most relevant place.

Thanks for checking, Aliaksei! Odd, I thought I checked 24 hours after.

Thanks,
- Stephan
Hi Stephan,
The main reason is that we don't have all required email addresses in the VO feeds. We need the site contacts and for T1 the site alarm mail addresses.
For ATLAS we have at least a site contact mail address in the VO feed, but for CMS we don't have any mail addresses in the VO feed. In case there is a way to provide the required mail addresses via VO feed, we can think about getting rid of the OIM thing.

Regards,
Günter
Hallo Guenter,
thanks for your reply. I was not suggesting to avoid the myOSG/OIM query
but to do it without the VO filter. That way we don't rely on proper site-VO
association on the OSG side but take it from the more accurate experiment side.
(We can add an email address for each site in the VO-feed if this is desirable.
We do this for non-grid sites already. We should probably query sites and get
their input for such a change.)
Thanks,
- Stephan
Hallo Stephan,
I see. Nevertheless, it would make things easier for us in the long term if we could get rid of one data source.
For now we have three sources which we need to match somehow: GOC DB, MyOSG/OIM and VO feeds. Reducing this to two sources would require some adaptation of the scripts but reduce complexity in general. Ideally we would get all the required information from a single data source, but I admit the world is not ideal. Replacing "MyOSG/OIM plus VO feed" by "VO feed" would be an improvement.
As already said, we need both types of mail addresses: a site contact mail for every site and additionally an alarm contact for T1 sites.

Best regards,
Guenter
Ok, works for CMS to have the contacts in the VO-feed. So, we should
bring that to WLCG and see if all experiments can provide this in the
VO-feed and if sites agree. Maarten Litmaath ?

I would still suggest to remove the "&voown=on&voown_3=on" in the
meantime.
Thanks,
cheers, Stephan
Hi all, for the contacts to be in the VO feeds, CRIC would have to import them
from the respective sources. Furthermore, the VO feeds would at least
have to require an X509 certificate for access, because we cannot have
e-mail addresses made available to the world. And we would need to
ensure other VO feed consumers still work with those extra attributes
as well as the certificate requirement.

All that would need to be discussed at least with the CRIC devs.
Hi all,
I didn't expect so much impact when suggesting to use only VO feeds. I will try removing the filter
"&voown=on&voown_3=on" and check the outcome as suggested by Stephan. If this works we can keep everything as is for now.
Hi again,
I just checked without filter
"&voown=on&voown_3=on" and I received a complete list of all sites regardless of the VO.
Hence I would have to filter out the relevant sites for CMS by myself.

I prefer using this filter in the URL, as I don't see any advantage in not using it besides more work for me in my scripts.
Thanks for looking into this Guenter.
The VO-feed has the list of sites. Couldn't you just pick up the contacts from
the OIM/myOSG query for the sites that are in the VO-feed? (Yes, there will be extra
site entries in the query that aren't used by CMS, but that might be true already
now.)

- Stephan
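Stephan's alternative — query MyOSG/topology without the `&voown=...` filter and intersect the result with the experiment's own VO-feed site list — amounts to a few lines of client-side filtering. The `GroupName` element name follows the rgsummary XML shown at https://my.opensciencegrid.org/rgsummary/xml; the VO-feed site list below is a stand-in:

```python
import xml.etree.ElementTree as ET

def resource_groups(rgsummary_xml):
    """All resource-group names in an rgsummary XML dump."""
    root = ET.fromstring(rgsummary_xml)
    return {gn.text for gn in root.iter("GroupName") if gn.text}

def sites_for_vo(rgsummary_xml, vo_feed_sites):
    """Client-side VO filter: keep only the groups the experiment lists."""
    return resource_groups(rgsummary_xml) & set(vo_feed_sites)

if __name__ == "__main__":
    # In production the XML would come from the unfiltered query:
    #   https://my.opensciencegrid.org/rgsummary/xml
    sample = ("<ResourceSummary>"
              "<ResourceGroup><GroupName>SU-ITS</GroupName></ResourceGroup>"
              "<ResourceGroup><GroupName>RTU</GroupName></ResourceGroup>"
              "<ResourceGroup><GroupName>Unrelated</GroupName></ResourceGroup>"
              "</ResourceSummary>")
    print(sorted(sites_for_vo(sample, {"SU-ITS", "RTU"})))
```

This way a missing VO flag on the OSG side (the SU-ITS/RTU problem above) can no longer hide a site the experiment actually uses.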
Dear All,

Today we have had, I think, the most significant incident of GGUS since we
switched to Zammad. I don't know if you noticed it, but the system was
not available for a few minutes, which could sound like a minor break at first glance.

But we spent multiple hours understanding what had happened and finding workarounds for the current errors in the system. I don't know exactly how it relates to the discussion in this ticket, but I think it might, which is why I'm posting it here.

So around 14:00 UTC we noticed that the system was not
responding. After a restart and reboot we found an error in the
synchronization procedure between the staging DB and Zammad.

Moreover, we found that today around 7:00 UTC around 220 (!) records had
been added to the staging DB, and at 14:00 Zammad tried to sync all of them,
which caused a crash and multiple errors. We also found duplication of
some entries in the staging DB and other peculiarities, like the same site
names with different IDs, etc. Most of the sites were added under
USCMS; here is a short snapshot of only a small portion of the 200:

..........

NGIs › USCMS › GP-ARGO-astate
NGIs › USCMS › GP-ARGO-cameron
NGIs › USCMS › GP-ARGO-creighton
NGIs › USCMS › GP-ARGO-doane
NGIs › USCMS › GP-ARGO-dsu
NGIs › USCMS › GP-ARGO-emporia
NGIs › USCMS › GP-ARGO-ksu
NGIs › USCMS › GP-ARGO-ku
NGIs › USCMS › GP-ARGO-langston
NGIs › USCMS › GP-ARGO-mst
NGIs › USCMS › GP-ARGO-oru
NGIs › USCMS › GP-ARGO-osu
NGIs › USCMS › GP-ARGO-sdsmt
NGIs › USCMS › GP-ARGO-sdsu
NGIs › USCMS › GP-ARGO-semo
NGIs › USCMS › GP-ARGO-uams
NGIs › USCMS › GP-ARGO-uark
NGIs › USCMS › GP-ARGO-usd
NGIs › USCMS › GP-ARGO-wichita
NGIs › USCMS › GPGRID
NGIs › USCMS › GPN-GP-ARGO
NGIs › USCMS › GRID_ce2
NGIs › USCMS › GridUNESP
NGIs › USCMS › GSU-ACIDS

Hallo Pavel,
the CMS site list did not change, as far as I can tell. The VO-feed does not seem
to have grown. I doubt OSG added the CMS VO flag to those sites yesterday.
Could this be a side effect of Guenter testing the myOSG/OIM query without
the VO filter?
Thanks,
cheers, Stephan
Hi Stephan,

thank you. Good to know that CMS didn't intend to increase the number of sites. Ok, if that is some technical issue, that is no problem, we will check it with Guenter and fix asap. Thank you for confirming that this change was unintended.

Cheers,

Pavel
             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM            0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%
HammerCloud  — no data —
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

No open GGUS tickets

             -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM            0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%
HammerCloud     ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    ?
FTS           50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

Open GGUS tickets (2)

CMS tickets (2)
CMS #1001154 (id:1001154) Unreachable to storage endpoint at T3_US_PuertoRico
State: in progress  |  Priority: less urgent  |  Opened: 2025-11-17 13:58 (138d ago)  |  Updated: 2026-01-12 20:56
Conversation (4 messages)
Hello PuertoRico admins.
Since 12:00 UTC Tuesday (Nov 11), your storage endpoint has been failing the SAM "1connection" test [1]. The log files show a "Connection attempt to 136.145.77.39 timed out" message [2]. Tracepath and ping tests show the same result [3]. Could you please take a look and check this server's connection/status?
Cheers,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_US_PuertoRico
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-1connection&var-dst_hostname=cms-se.hep.uprm.edu&var-timestamp=1763384106000
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-1connection&var-dst_hostname=cms-se.hep.uprm.edu&var-timestamp=1763384145000
[3]
[crungphi@lxplus814 ~]$ ping -c 5 cms-se.hep.uprm.edu
PING cms-se.hep.uprm.edu (136.145.77.39) 56(84) bytes of data.
--- cms-se.hep.uprm.edu ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4134ms

[crungphi@lxplus814 ~]$ tracepath cms-se.hep.uprm.edu
1?: [LOCALHOST] pmtu 1500
1: k513-v-rjuxl-v12-ip851.cern.ch 18.789ms
1: k513-v-rjuxl-v11-ip851.cern.ch 31.207ms
2: k513-b-rjupl-v3-cc41.cern.ch 10.759ms
3: b773-b-rjuxl-2-cd30.cern.ch 12.943ms
4: g773-e-rjuxm-20-sg4.cern.ch 0.910ms
5: g773-e-fpa78-2-fi2.cern.ch 8.200ms
6: e773-e-rjuxm-v20-fe2.cern.ch 9.075ms
7: e773-e-rjup1-2-se4.cern.ch 14.184ms
8: cern2.mx1.gen.ch.geant.net 18.818ms asymm 9
9: no reply
10: no reply
11: bundle-ether1.102.core1.bost2.net.internet2.edu 88.459ms asymm 12
12: fourhundredge-0-0-0-2.4079.core1.hart2.net.internet2.edu 123.942ms asymm 22
13: fourhundredge-0-0-0-0.4079.core1.newy32aoa.net.internet2.edu 117.929ms asymm 24
14: fourhundredge-0-0-0-2.4079.core1.ashb.net.internet2.edu 136.757ms asymm 20
15: fourhundredge-0-0-0-18.4079.core2.ashb.net.internet2.edu 113.913ms asymm 20
16: fourhundredge-0-0-0-1.4079.core2.atla.net.internet2.edu 123.962ms asymm 21
17: fourhundredge-0-0-0-22.4079.core1.atla.net.internet2.edu 115.925ms
18: fourhundredge-0-0-0-6.4079.core1.jack.net.internet2.edu 116.866ms asymm 20
19: 198.71.45.187 117.761ms asymm 17
20: et-0-0-1-67.rt04.bb.ampath.net 117.401ms asymm 19
21: upr-819.ce.ampath.net 152.366ms asymm 19
22: rum-hub787-a.upr.edu 156.171ms asymm 23
23: no reply
24: no reply
25: no reply
26: no reply
27: no reply
28: no reply
29: no reply
30: no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
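The ping and ICMP-based tracepath above stop at the campus border, so they cannot distinguish a firewall that drops ICMP from a service that is actually down. A transport-level probe against the storage port answers that directly; port 1094 (the conventional XRootD port) is an assumption here:

```python
import socket

def tcp_reachable(host, port, timeout=5.0):
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unresolvable
        return False

if __name__ == "__main__":
    # Endpoint from this ticket; 1094 is the conventional XRootD port.
    print(tcp_reachable("cms-se.hep.uprm.edu", 1094))
```

A host behind an ICMP-filtering firewall but with a live service would fail ping yet return True here.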
Hi Noy,
the storage endpoint at T3_US_PuertoRico is online again. However, it sounds like our institution erased the reverse IP entry and probably put the server behind a firewall, so ping will fail.
Right now ping does not work from home either.
I also checked the logs; a power fluctuation took the host server down.
At this time the storage is read-only, since some nodes haven't been recovered yet.
Regards,
Eduardo

Any update? -- Thank you, Noy
Hi Noy,
I did an OS upgrade on the server. Working on the configuration.
Regards
Eduardo
CMS #683067 (id:3201) SAM test failure at T3_US_PuertoRico
State: in progress  |  Priority: less urgent  |  Opened: 2025-04-16 22:04 (352d ago)  |  Updated: 2025-08-13 11:27
Conversation (4 messages)
Good afternoon, Puerto Rico admin.
Since 12:00 UTC Saturday (Apr 12), your storage endpoint has been failing the SAM "1connection" test [1]. The log files show "Endpoint does not have a reachable IPv4 address" and "no IP addresses found for host cms-se.hep.uprm.edu" [2]. Could you please take a look at your server and check its status/connectivity?
Thank you,
Noy
[1]https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_US_PuertoRico
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-1connection&var-dst_hostname=cms-se.hep.uprm.edu&var-timestamp=1744830875000
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-1connection&var-dst_hostname=cms-se.hep.uprm.edu&var-timestamp=1744831985000
Hi,
Yes,
since Apr 12 last week, the building was without electricity; scheduled maintenance extended unexpectedly until midweek.
Then came an island-wide power outage. I was slowly recovering the cluster.
The cluster is now up again.
However, as a reminder, the server certificate is Let's Encrypt, not IGTF-compatible, so the SAM test will always fail.
Thanks
Eduardo

Hello Eduardo, The issue occurred again last Wednesday (Aug 6). The log file shows "Endpoint does not have a reachable IPv4 nor IPv6 address" [1]. Did your site have a power outage again?
Cheers,
Noy
[1]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-WebDAV-1connection&var-dst_hostname=cms-se.hep.uprm.edu&var-timestamp=1755067825000
Hi Noy,
There was a power outage during the weekend, while I was on vacation.
The server is up now, but soon I will need to upgrade it.
Regards,
Eduardo

-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
HammerCloud: — no data —
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

No open GGUS tickets

-16d -15d -14d -13d -12d -11d -10d -9d -8d -7d -6d -5d -4d -3d -2d -1d
SAM: 0% 0% 0% 0% 0% 22% 25% 0% ? ? ? ? 0% 0% 0% ?
HammerCloud: 0% 0% 0%
FTS: 50% 100% 100% 100% 0% 100% 100% 98% 0% 0% 100% 100% 100% 87% 100% 0%

Open GGUS tickets (3)

CMS tickets (3)
CMS #1001958 (id:1001958) Intermittent XRootD SAM test failures at T3_US_UMD
State: in progress  |  Priority: less urgent  |  Opened: 2026-03-03 13:16 (32d ago)  |  Updated: 2026-04-02 13:21
Conversation (15 messages)
Hello, Maryland admin
Your XRootD endpoint has been failing the SAM "9federation" test since 16:00 Thursday (Mar 26) [1]. There is a "block 0 checksum mismatch" message in the log file when the tester tries to get a file from your site via the global redirector [2]. Could you please take a look and check your XRootD subscription service?
Thank you,
Noy
[1]
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T3_US_UMD
[2]
https://monit-grafana.cern.ch/d/siYq3DxZz/wlcg-sitemon-test-details?orgId=20&var-metric=org.cms.SE-XRootD-9federation&var-dst_hostname=hepcms-se2.umd.edu&var-timestamp=1772542567000
Hi,

It is true that I have a new xrootd server running/under testing on hepcms-ce.umd.edu (RHEL9), but I have not changed the config file. So I do not understand why CMS is testing that machine before it is declared done. Below is the grafana output of the link above.

Anwar

Starting CMS XRootD federation test of hepcms-se2.umd.edu:1094 on 2026-Mar-03 12:55:32
Checking global re-director:
connected to data server u05@cms-xrd-global01.cern.ch:2094
global re-director serving
Checking data accesse via federation:
root://u05@cms-xrd-global.cern.ch//store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root
connected to data server u05@hepcms-ce.umd.edu:1094
[E] block 0 checksum mismatch, test file "CE860B10-5D76-E711-BCA8-FA163EAA761A.root", adler32 is "1" should be "b7994050"
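For what it's worth, the adler32 of zero bytes of data is exactly 1, so the reported adler32 of "1" is consistent with the server having returned an empty read rather than corrupted data. A minimal sketch for recomputing a file's adler32 locally (the file path is illustrative; the hex value is unpadded, as in the log line):

```python
import zlib

def adler32_hex(path, chunk_size=1024 * 1024):
    """Compute a file's adler32 checksum in unpadded hex, as the test log reports it."""
    value = 1  # adler32 seed; also the checksum of zero bytes
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            value = zlib.adler32(block, value)  # chain the running checksum
    return format(value & 0xFFFFFFFF, "x")
```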





--
Anwar Bhatti
Research Professor of Physics
Hallo Anwar,
it looks like you subscribed the server to the federation and it responds to the /store/test/xrootd/T3_US_UMD
path, thus it is showing up in the test of hepcms-se2.umd.edu:1094.

I would suggest not subscribing to the federation until the service is ready.
Thanks,
cheers, Stephan
But then how do I test it?

Is the current problem just the certificate issue? In principle, it should be working fine. I can run xrd commands locally.

Anwar

Anwar Bhatti
Research Professor of Physics

Do you really need the federation subscription for testing? If not, just stop the cmsd service.
- Stephan
I stopped cmsd@clustred.

As far as I understand, the xrootd server on hepcms-ce.umd.edu should be working. I did not change siteconf as I am working on installing condor-ce on the same machine and that part is not done yet.

Anwar



--
Anwar Bhatti
Research Professor of Physics
Thank you for your work. I am closing this ticket.
Thanks

Could you also check how the CE is working?

Anwar Bhatti
Research Professor of Physics

Hello Anwar. On the site status page there are several unknown SAM/HammerCloud results from your CE, but the glidein factory page shows your endpoint has some running jobs [1]. It looks like your endpoint still runs regular jobs but not the test jobs. Could you please check this node's status/services? If everything seems fine on your end, I will send this ticket to the submission infrastructure team for further investigation.
Cheers,
Noy
[1]
http://gfactory-2.opensciencegrid.org/factory/monitor/factoryEntryStatusNow.html?entry=CMS_T3_US_UMD_hepcms-ce2
[2]
https://monit-grafana.cern.ch/d/requested-cpu/requested-cpu?orgId=11&var-site=T3_US_UMD&var-binning=1h&from=1773792000000&to=1774483199000
I fixed the condor on my end today. It should be working now. Not sure how it will propagate or if there is still some problem.



--
Anwar Bhatti
Research Professor of Physics
Can you please ask someone to investigate it further?

Thanks,
Anwar



--
Anwar Bhatti
Research Professor of Physics
Hello Anwar,

I'll send some test pilots to the CE to check if we are able to run jobs there. I'll keep you updated.

Best regards,
Luís - CMS Submission Infrastructure.
Thanks.
Anwar



--
Anwar Bhatti
Research Professor of Physics
Hello Anwar,

I've validated the connectivity to your CE with condor_ce_ping [1]. The CE is reachable and our GlideinWMS Factory is able to authenticate with it using scitokens.

I sent some test jobs to your CE yesterday. From our monitoring, I can see that the pilots reached the CE but are waiting to be scheduled by your batch system to a worker node [2]. Checking our monitoring for the global CERN pool, it seems that no pilot has run in your CE since 15/03/2026 [3]. This doesn't necessarily indicate a problem, but could you check if you see any pending pilot jobs at your site?

Best regards,
Luís.

---

[1]
[lsimasde@vocms0204 ~]$ BEARER_TOKEN_FILE=/tmp/token _condor_SEC_CLIENT_AUTHENTICATION_METHODS=scitoken _condor_TOOL_DEBUG=D_FULLDEBUG,D_SECURITY condor_ce_ping -pool hepcms-ce2.umd.edu:9619 -name hepcms-ce2.umd.edu WRITE -d
...
04/01/26 09:38:53 SSL host check: host alias hepcms-ce2.umd.edu matches certificate SAN hepcms-ce2.umd.edu.
04/01/26 09:38:53 SSL Auth: SSL: continue read/write.
04/01/26 09:38:53 SSL authentication succeeded to
04/01/26 09:38:53 Authentication was a Success.
...
04/01/26 09:38:54 SECMAN: startCommand succeeded.
WRITE command using (AES, AES, and SCITOKENS) succeeded as cmspilot@users.htcondor.org to schedd hepcms-ce2.umd.edu.

[2]
https://monit-grafana.cern.ch/goto/-yni8VtDg?orgId=11

[3]
https://monit-grafana.cern.ch/goto/Wq3N84tDR?orgId=11
I have not seen any cms jobs on the US_T3_UMD cluster. None are in the queue.

Thanks,
Anwar



--
Anwar Bhatti
Research Professor of Physics
CMS #681825 (id:1926) Update HTCONDOR config for new issuer token support at T3_US_UMD
State: in progress  |  Priority: very urgent  |  Opened: 2025-01-29 10:47 (430d ago)  |  Updated: 2025-07-21 13:14
Conversation (17 messages)
GGUS ID: 169424
Last modifier: Chan-anun Rungphitakchai
Date: 2024-12-16 21:15:10
Subject: Update HTCONDOR config for new issuer token support at T3_US_UMD
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: Other
Description:
Good afternoon, Maryland admin
There is a new k8s issuer server (https://cms-auth.cern.ch). Your HTCONDOR endpoint does not support token authentication with the new issuer [1]. Could you check your OSG version? If your site still uses OSG 3.5, please upgrade to a newer OSG version and update osg-scitokens-mapfile.rpm to version 13-2, or add new configurations for cms-auth.cern.ch in osg-scitokens-mapfile.conf.
I attach documentation and examples [2]
Best Regards,
Noy
[1]
https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhepcms-ce2.umd.edu%26site%3Detf%26view_name%3Dhost
[2]
https://osg-htc.org/docs/compute-element/install-htcondor-ce/
# CMS IAM development instance:
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
# SAM/ETF tests (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
# CMS ITB pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
# CMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
# CMS ITB LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
# USCMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
# USCMS HEPCloud pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms
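Each SCITOKENS line maps a token's (issuer, subject) pair, matched as the string "issuer,subject" against the PCRE between the slashes, to a local account. A toy evaluator (an illustrative sketch, not the HTCondor-CE implementation; Python's re happens to accept these particular patterns):

```python
import re

def map_token(mapfile_lines, issuer, subject):
    """Return the local user from the first matching SCITOKENS line, else None."""
    credential = f"{issuer},{subject}"
    for line in mapfile_lines:
        line = line.strip()
        if not line.startswith("SCITOKENS"):
            continue  # skip comments and blank lines
        _, pattern, user = line.split(maxsplit=2)
        # The pattern sits between '/' delimiters in the mapfile.
        if re.search(pattern.strip("/"), credential):
            return user
    return None

mapfile = [
    "# SAM/ETF tests (development):",
    r"SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin",
]
map_token(mapfile, "https://CMS-auth.cern.ch/",
          "08ca855e-d715-410e-a6ff-ad77306e1763")  # -> "lcgadmin"
```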
GGUS ID: 169424
Last modifier: Anwar A Bhatti
Date: 2024-12-17 03:44:03

Public Diary:
Hi Julia,

We're pressing our provider about this, but so far no progress.

Best,
Internal Diary:
Added attachment image.png
https://ggus.eu/index.php?mode=download&attid=ATT119796
GGUS ID: 169424
Last modifier: Anwar A Bhatti
Date: 2024-12-17 03:44:04

Public Diary:
Hi Noy,
I am a bit confused. I see following on
https://etf-cms-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhepcms-ce2.umd.edu%26site%3Detf%26view_name%3Dhost
which shows the usual two warnings. Are your tests testing something else, or is a particular node not working?
Here is the current status.

GGUS ID: 169424
Last modifier: Chan-anun Rungphitakchai
Date: 2024-12-18 17:20:24

Status: in progress
Responsible Unit: USCMS
Public Diary:
Good morning, Anwar.
Thank you for the reply. The link you attached is the production one. Production ETF uses the old token issuer (https://cms-auth.web.cern.ch) for its tests. Don't worry about the 2 warnings; those are okay for now. I'm asking sites in the States to add configurations to support the new token issuer. You can check the status at the attached link [1].
Thank you,
Noy
[1]
https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhepcms-ce2.umd.edu%26site%3Detf%26view_name%3Dhost
GGUS ID: 169424
Last modifier: Anwar A Bhatti
Date: 2024-12-18 19:07:02

Public Diary:
Internal Diary:
Added attachment mailbody.2024-12-18_19.05.44.txt
https://ggus.eu/index.php?mode=download&attid=ATT119859
GGUS ID: 169424
Last modifier: Chan-anun Rungphitakchai
Date: 2025-01-09 02:53:04

Public Diary:
Hello Anwar
It looks like most of the sites have hardcoded mappings for each authentication server in their local batch system configuration. You need to add a mapping for the new k8s issuer. Carl (Wisconsin) found that /etc/condor-ce/mapfiles.d/10-scitokens.conf needed mapping entries for the new Kubernetes IAM-issued tokens. Maybe your system uses the same location?
Cheers,
Noy
Internal Diary:

----------- e-mail with large body ------
added in total as mailbody.2024-12-18_19.05.44.txt

------------ e-mail with large body ------
GGUS ID: 169424
Last modifier: Anwar A Bhatti
Date: 2025-01-09 03:06:02

Public Diary:
We have the file below. What do I need to add?
Thanks
Happy New Year
Anwar
[root@hepcms-ce2 condor-ce]# more mapfiles.d/10-scitokens.conf
##############################################################################
#
# HTCondor-CE manual SciTokens authentication mappings
#
# This file will NOT be overwritten upon RPM upgrade.
#
###############################################################################
# Authentication of SciTokens and WLCG tokens requires CA certificates
# installed in the standard system (/etc/pki/tls/certs/ca-bundle.crt)
# or Grid (/etc/grid-security/certificates) locations. If using Grid
# certificates, be sure to set 'AUTH_SSL_*' configuration values as
# appropriate in /etc/condor-ce/config.d/
# To allow clients with SciToken or WLCG tokens to submit jobs to your
# HTCondor-CE, add lines of the following format:
#
# SCITOKENS /,/
#
# Where the second field (between the '/') should be a Perl Compatible
# Regular Expression (PCRE). For example, to map all clients with
# SciTokens issued by the OSG VO regardless of subject to the local
# 'osg' user, add the following line to this file:
#
# SCITOKENS /^https:\/\/scitokens.org\/osg-connect,.*/ osg
# CMS IAM development instance:
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
# SAM/ETF tests (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
# CMS ITB pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
# CMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
# CMS ITB LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
# USCMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
# USCMS HEPCloud pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms
GGUS ID: 169424
Last modifier: Anwar A Bhatti
Date: 2024-12-18 19:07:03

Public Diary:
Hi Noy,
I am confused about what needs to be changed.

Anwar
On
[root@hepcms-ce2 condor-ce]# uname -a
Linux hepcms-ce2.umd.edu 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 8 07:11:46 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
we have
htcondor-ce-condor.noarch 23.0.17-1.el8 @osg
Package osg-scitokens-mapfile-13-2.osg23.el8.x86_64 is already installed.

/usr/share/condor-ce/mapfiles.d/osg-scitokens-mapfile.conf

has following lines.

## CMS ##
# CMS pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
# SAM/ETF tests:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
# CMS ITB pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
# CMS LOCAL pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
# CMS ITB LOCAL pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
# USCMS LOCAL pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
# USCMS HEPCloud pilots:
SCITOKENS /^https\:\/\/cms\-auth\.web\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms
# CMS IAM development instance:
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,bad55f4e\-602c\-4e8d\-a5c5\-bd8ffb762113$/ cmspilot
# SAM/ETF tests (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,08ca855e\-d715\-410e\-a6ff\-ad77306e1763$/ lcgadmin
# CMS ITB pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,490a9a36\-0268\-4070\-8813\-65af031be5a3$/ cmspilot
# CMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,07f75a9a\-bb78\-4735\-938b\-7e61b2b62d5c$/ cmslocal
# CMS ITB LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,efbed8c1\-f9a7\-4063\-92f7\-f89c04ce04a3$/ cmslocal
# USCMS LOCAL pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,99b97e4f\-5cf0\-4d3b\-9fcc\-82fa86dca1e8$/ uscmslocal
# USCMS HEPCloud pilots (development):
SCITOKENS /^https\:\/\/CMS\-auth\.cern\.ch\/,2f327ad0\-1934\-4635\-8ba3\-ee72da0a304c$/ uscms

________________________________
From: helpdesk@ggus.org
Sent: Monday, December 16, 2024 4:16 PM
To: Shabnam Jabeen ; bhatti@umd.edu
Subject: GGUS-Ticket-ID: #169424 Ticket for site "umd-cms" "cms" "Update HTCONDOR config for new issuer token support at T3_US_UMD"

[EXTERNAL] ? This message is from an external sender
Dear support staff,
Ticket #169424 for site "umd-cms" is ASSIGNED to you.
REFERENCE LINK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169424
SUBJECT: Update HTCONDOR config for new issuer token support at T3_US_UMD
TICKET INFORMATION:
DESCRIPTION:
Good afternoon, Maryland admin,
There is a new k8s issuer server (https://cms-auth.cern.ch). Your HTCondor endpoint does not support token authentication with the new issuer [1]. Could you check your OSG version? If your site is still on OSG 3.5, please upgrade to a newer OSG version and update osg-scitokens-mapfile.rpm to version 13-2, or add the new configuration for cms-auth.cern.ch to osg-scitokens-mapfile.conf.
I attach a document and e...
...
--- body truncated, see attachment mailbody.2024-12-18_19.05.44.txt for full text ---

Internal Diary:

----------- e-mail with large body ------
added in total as mailbody.2024-12-18_19.05.44.txt

------------ e-mail with large body ------
Added attachment image.png
Added attachment mailbody.2024-12-18_19.05.44.txt
Good afternoon, Anwar. The configuration on your device is not correct: the URL should be lowercase (cms, not CMS). The OSG team has already released an updated RPM. Could you please update the server to OSG 23/24, or manually change your configuration according to mapfile.conf on the OSG site [1].
Thank you,
Noy
[1]
topology.opensciencegrid.org/collaborations/osg-scitokens-mapfile.conf
Corrected the ticket assignment, which seems to have gotten messed up during the import. - Stephan
Can you please add support for the K8s IAM instance today/tomorrow. ETF production will switch on Monday. Your site would then fail SAM tests.
Thanks,
- Stephan
Hi Stephan,
I am not sure whom this message is directed to. If me, can you please send me more explicit instructions like which lines to change in which file.
I see that tests have been failing for some time, but there are cmspilot/uscmslocal jobs running.
Is there a link to check the success rate of cmspilot jobs?
Thanks,
Anwar

From: helpdesk@ggus.org <helpdesk@ggus.org>
Sent: Thursday, February 13, 2025 3:25 PM
To: bhatti@umd.edu <bhatti@umd.edu>
Subject: [GGUS-Ticket-ID: #681825] "IN PROGRESS" "USCMS" "Update HTCONDOR config for new issuer token support at T3_US_UMD" (Updated)

[EXTERNAL] – This message is from an external sender

GGUS Helpdesk Notification
Hi,

Ticket (Update HTCONDOR config for new issuer token support at T3_US_UMD) has been updated by "Stephan Lammel".

Changes:
Priority: urgent -> very urgent

Information:
Can you please add support for the K8s IAM instance today/tomorrow. ETF production will switch on Monday. Your site would then fail SAM tests.
Thanks,
- Stephan

View this in GGUS Helpdesk

You are receiving this because you were subscribed via Mailing List (Site Contact Email) in this ticket. | Manage your notification settings | EGI/WLCG
Hallo Anwar,
the best action depends on the OSG version you/your CE is at. If you are
at OSG 23 or 24 you can just "yum update" and restart. If you are on an
older OSG/OS version you will need to update the SciToken mapfile yourself.
Noy put the instructions for this into the ticket, see the earlier entries.
Thanks,
cheers, Stephan
Hello Anwar, Could you please update your OSG version and let us know.
Thanks,
Noy
It is a very generic request. I think the OSG software at UMD is almost up-to-date. I updated the node running condor and condor-ce just a few days ago. Can you please be more explicit?

Anwar

On Mon, Jul 21, 2025 at 6:25 AM <helpdesk@ggus.org> wrote:

GGUS Helpdesk Notification
Ticket #1926 "Update HTCONDOR config for new issuer token support at T3_US_UMD"
was updated by Chan-Anun Rungphitakchai on 2025-07-21 10:25 (UTC).

Hello Anwar,
Could you please update your OSG version and let us know.
Thanks,
Noy

Ticket is assigned to NGIs › USCMS

Updates:
Notified Users: →

https://helpdesk.ggus.eu/#ticket/zoom/1926

You are receiving this because you were subscribed via Mailing List (Site Contact Email) in this ticket. | Manage your notification settings | EGI/WLCG


--
Anwar Bhatti
Research Professor of Physics
CMS #681827 (id:1928) Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_UMD
State: assigned  |  Priority: urgent  |  Opened: 2025-01-29 10:48 (430d ago)  |  Updated: 2025-07-21 13:11
Conversation (14 messages)
GGUS ID: 168897
Last modifier: Jakrapop Akaranee
Date: 2024-11-05 10:57:41
Subject: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_UMD
Ticket Type: USER
CC: cms-comp-ops-site-support-team@cern.ch;
Status: assigned
Responsible Unit: USCMS
Issue type: CMS_SAM tests
Description:
Dear UMD Site Administrators,
We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack, specifically for the following endpoint:

hepcms-se2.umd.edu (XrootD [1] and WebDAV [2])

Could you please review dual stack support on your storage element?
Thank you for your assistance.
Best Regards,
Jakrapop
-----------
[1] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhepcms-se2.umd.edu%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
[2] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhepcms-se2.umd.edu%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
GGUS ID: 168897
Last modifier: Anwar A Bhatti
Date: 2024-11-05 14:29:01

Public Diary:
Dear Jakrapop Akaranee,

Can you please tell me exactly what needs to be changed? What is dual stack?
Best,
Anwar

________________________________
From: helpdesk@ggus.org
Sent: Tuesday, November 5, 2024 6:00 AM
To: Shabnam Jabeen ; bhatti@umd.edu
Subject: GGUS-Ticket-ID: #168897 Ticket for site "umd-cms" "cms" "Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_UMD"

[EXTERNAL] – This message is from an external sender
Dear support staff,
Ticket #168897 for site "umd-cms" is ASSIGNED to you.
REFERENCE LINK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168897
SUBJECT: Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_UMD
TICKET INFORMATION:
DESCRIPTION:
Dear UMD Site Administrators,
We are currently preparing the ETF pre-production instance and have found that your storage element no longer supports dual stack, specifically for the following endpoint:

* hepcms-se2.umd.edu (XrootD [1] and WebDAV [2])

Could you please review dual stack support on your storage element?
Thank you for your assistance.
Best Regards,
Jakrapop
-----------
[1] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhepcms-se2.umd.edu%26service%3Dorg.cms.SE-XRootD-1connection%26site%3Detf%26view_name%3Dservice
[2] https://etf-cms-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Dhepcms-se2.umd.edu%26service%3Dorg.cms.SE-WebDAV-1connection%26site%3Detf%26view_name%3Dservice
NOTIFIED SITE: umd-cms
CONCERNED VO: cms
PRIORITY: urgent
ISSUE TYPE: CMS_SAM tests
SUBMITTER: Jakrapop Akaranee
*********************************************************************
This is an automated mail. When replying don't change the subject line!
S T R I P P R E V I O U S M A I L S please!!
*********************************************************************
GGUS ID: 168897
Last modifier: Jakrapop Akaranee
Date: 2024-11-05 15:41:49

Public Diary:
Dear Anwar,

I apologize for any confusion in my previous communication. I would greatly appreciate it if you could ensure that your XRootD and WebDAV endpoints (hepcms-se2.umd.edu) support both IPv4 and IPv6, as your storage endpoint currently supports only IPv4 [1].

Thank you very much for your assistance!

Best regards,
Jakrapop
------------------
[1]
(base) [ajakrapo@lxplus961 ~]$ dig hepcms-se2.umd.edu

; <<>> DiG 9.16.23-RH <<>> hepcms-se2.umd.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<-
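
The dig check above can also be done programmatically: a host is dual-stacked when its name resolves to both an A (IPv4) and an AAAA (IPv6) record. A minimal sketch using only the Python standard library (localhost is used purely as a placeholder hostname):

```python
import socket

def address_families(host):
    """Return the set of IP versions ('IPv4'/'IPv6') a hostname resolves to."""
    families = set()
    try:
        for family, *_ in socket.getaddrinfo(host, None):
            if family == socket.AF_INET:
                families.add("IPv4")
            elif family == socket.AF_INET6:
                families.add("IPv6")
    except socket.gaierror:
        pass  # name does not resolve at all
    return families

# A dual-stack endpoint should report {'IPv4', 'IPv6'};
# an IPv4-only storage element reports just {'IPv4'}.
print(address_families("localhost"))
```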
GGUS ID: 168897
Last modifier: Anwar A Bhatti
Date: 2024-11-05 16:00:01

Public Diary:
I will work with my IT to get IPv6 support.
If you have any info on where changes are needed, it would be highly appreciated.
Best,
Anwar
GGUS ID: 168897
Last modifier: Stephan Lammel
Date: 2024-11-06 22:30:00

Public Diary:
https://access.redhat.com/solutions/347693
GGUS ID: 168897
Last modifier: Anwar A Bhatti
Date: 2024-11-12 15:41:02

Public Diary:
I had a discussion with the UMD IT/network group. They would like to know when the transition must be completed. Can you give me a date?
They have not provided this service to anyone except one group, so they need to re-design their networking as the VLAN is shared.
Also, do I need IPv6 addresses for our worker/data-storage nodes, which are currently on our private network only and do not have any public IP at this time?
Anwar
GGUS ID: 168897
Last modifier: Anwar A Bhatti
Date: 2024-11-08 23:45:02

Public Diary:
I am working on it. I enabled it in /etc/sysconfig/network-scripts/ifcfg-eno1 and eno2 but it needs more work.

[bhatti@hepcms-se2 ~]$ ifconfig -a
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet 10.1.0.34 netmask 255.255.0.0 broadcast 10.1.255.255
inet6 fe80::332f:e0fb:bb12:6870 prefixlen 64 scopeid 0x20<link>
ether 00:1d:09:68:09:97 txqueuelen 1000 (Ethernet)
RX packets 976910128 bytes 1325164819319 (1.2 TiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 781685088 bytes 997774373608 (929.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet 128.8.216.202 netmask 255.255.254.0 broadcast 128.8.217.255
inet6 fe80::a:ff74:b266:88d7 prefixlen 64 scopeid 0x20<link>
ether 00:1d:09:68:09:99 txqueuelen 1000 (Ethernet)
RX packets 752073620 bytes 895568241488 (834.0 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 841316781 bytes 1181283432503 (1.0 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
GGUS ID: 168897
Last modifier: Stephan Lammel
Date: 2024-11-10 18:09:13

Public Diary:
Hallo Anwar,
fe80 are link local addresses and not globally routable. You need
to get an IP address from your networking team.
Thanks,
cheers, Stephan
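
The distinction Stephan draws can be verified with Python's standard ipaddress module; a quick illustrative check, using the fe80 address from the ifconfig output in this ticket and, purely as an example of a global address, Google's public DNS:

```python
import ipaddress

# fe80::/10 addresses are link-local: valid only on the local network
# segment and never routed, so remote services (SAM/ETF tests, FTS
# transfers) cannot reach a host that has only such an address.
link_local = ipaddress.ip_address("fe80::332f:e0fb:bb12:6870")
print(link_local.is_link_local, link_local.is_global)  # True False

# A globally routable IPv6 address (2000::/3 global unicast) is what
# the SE endpoint needs from the campus networking team.
global_v6 = ipaddress.ip_address("2001:4860:4860::8888")
print(global_v6.is_global, global_v6.is_link_local)  # True False
```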
GGUS ID: 168897
Last modifier: Anwar A Bhatti
Date: 2024-11-10 19:34:02

Public Diary:
Yes, I am in contact with them.
Anwar
GGUS ID: 168897
Last modifier: Stephan Lammel
Date: 2024-11-12 15:52:24

Public Diary:
Hallo Anwar,
there is no hard deadline for Tier-3s. We expect to have IPv6-only
worker nodes within the next couple of weeks. Those will then not be
able to access your storage.
Yes, CE, SE endpoints, and worker nodes should be all dual stacked.
You can get a public/globally routable IPv6 address in addition to your
private/NAT IPv4 address (and reduce load on your NAT box).
Thanks,
cheers, Stephan
GGUS ID: 168897
Last modifier: Anwar A Bhatti
Date: 2024-11-12 16:07:02

Public Diary:
Hi Stephan,
Thanks,
I had told them early January. Let us see how it goes.
Anwar

Assigned the missing CMS site name, which was lost during the import to the new GGUS.

Jakrapop
Any update -- Thank you, Noy
Hi Noy,

I talked to our computing-infrastructure people and they have not completed it yet. I did not push, because as I understand it, this is not required for T3 sites.

Anwar

On Mon, Jul 21, 2025 at 6:26 AM <helpdesk@ggus.org> wrote:

GGUS Helpdesk Notification
Ticket #1928 "Request for Dual Stack Support on Storage Element in ETF Pre-Production at T3_US_UMD"
was updated by Chan-Anun Rungphitakchai on 2025-07-21 10:26 (UTC).

Any update -- Thank you, Noy

Ticket is assigned to NGIs › USCMS

Updates:
Priority: less urgent → urgent
Notified Users: →

https://helpdesk.ggus.eu/#ticket/zoom/1928

You are receiving this because you were subscribed via Mailing List (Site Contact Email) in this ticket. | Manage your notification settings | EGI/WLCG


--
Anwar Bhatti
Research Professor of Physics
            -16d -15d -14d -13d -12d -11d -10d  -9d  -8d  -7d  -6d  -5d  -4d  -3d  -2d  -1d
SAM          92%  92%  89%  91%  89%  81%  89%  76%  90% 100%  94%  88%  93%  91%  99%  89%
HammerCloud  — no data —
FTS          50% 100% 100% 100%   0% 100% 100%  98%   0%   0% 100% 100% 100%  87% 100%   0%

No open GGUS tickets

CMS Site Support Team  |  Status Summary  |  SR Report  |  GGUS