Well, as promised, I did set up a crude monitor for this functionality. It tries to verify my own verification each hour and logs the results.
According to my logs, the system failed on Monday, February 17th, somewhere between 14:53:02 CET 2020 and 15:53:01 CET 2020. It has not been restored yet.
BTW, it may be a coincidence, but it struck me that 15:00 CET is 9 AM on the US East Coast (EST). One could imagine a Mr. Parkinson arriving at the office at 9, reading a mail that says there is something wrong with the system and that the old version needs to be restored, which Parkinson dutifully does, only to re-introduce the nasty bug that ensures that the system does not work anymore.
However, can somebody pretty please look into it and resolve the matter, preferably once and for all?
Similar to what I have seen before, the system was reporting the correct results a couple of hours after I made my last post, and is currently reporting the correct results. Strange that your system still shows it as being down!
The system was dysfunctional from Mon Feb 17 15:53:01 CET 2020 until Wed Feb 19 16:53:02 CET 2020. It resumed working on Wed Feb 19 17:53:01 CET 2020.
It was down for about 50 hours.
It is currently still working; I'm still actively monitoring it to see if I can detect some pattern.
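For anyone who wants to check the "about 50 hours" figure: with hourly sampling, the real outage can only be bracketed, not pinned down. A minimal sketch of that arithmetic, using the timestamps quoted in this thread (the variable names are mine, and the CET token is simply dropped because all four timestamps are in the same zone):

```python
from datetime import datetime

def parse(stamp: str) -> datetime:
    # Drop the "CET" token; every timestamp here is in the same zone,
    # so naive datetimes are enough to compute differences.
    return datetime.strptime(stamp.replace(" CET", ""), "%a %b %d %H:%M:%S %Y")

last_ok_before = parse("Mon Feb 17 14:53:02 CET 2020")
first_failure = parse("Mon Feb 17 15:53:01 CET 2020")
last_failure = parse("Wed Feb 19 16:53:02 CET 2020")
first_ok_after = parse("Wed Feb 19 17:53:01 CET 2020")

# The outage lasted at least from the first to the last failed check,
# and at most from the last good check before to the first good check after.
shortest = last_failure - first_failure
longest = first_ok_after - last_ok_before

print("at least:", shortest)
print("at most: ", longest)
```

That brackets the outage between roughly 49 and 51 hours, so "about 50 hours" is a fair summary.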
Just an update: the verification site seems to be mostly working now. I'm running a script each hour, which does a query over the API that the verification site offers. These are the outages that I registered during the last month:
Sun Mar 15 15:53:01 CET 2020
Fri Mar 20 21:53:01 CET 2020
Sun Mar 22 13:53:01 CET 2020
Mon Mar 23 04:53:01 CET 2020
Tue Mar 24 14:53:01 CET 2020
Given these results, I reinstated the automatic checks during registration and also ran a batch to check the validity of our members' data. I had some outages during that batch, but all in all I have decided to label it "sufficient for use" for now.
Just to let you know, I received a message from Member Services yesterday about the ticket I opened with them on this issue.
They were just checking with me to see if I had experienced the issue recently, as they believe it's now resolved.
I told them I hadn't run into the issue for at least a couple of weeks, and, as you have also reported no serious issues recently, it looks like the verification tool should operate more consistently now - great news!
Well, according to my logs, it isn't that stable at all, assuming that the network connection is not the culprit. On these dates the connection failed at least once:
Fri Mar 20
Sun Mar 22
Mon Mar 23
Tue Mar 24
Sat Mar 28
Tue Apr 14
Wed Apr 15
Thu Apr 16
Wed Apr 22
Yeah, it doesn't have perfect availability, that's for sure!
I created my own system to monitor this too, on 31 March 2020.
Since then I have noted the following:
20200401203301 Verification check failed to run correctly
20200402173301 Verification check failed to run correctly
20200403123301 Verification check failed to run correctly
20200407063302 Verification check failed to run correctly
20200407083302 Verification check failed to run correctly
20200407103301 Verification check failed to run correctly
20200411221213 Verification tool is reporting incorrect results
20200411231212 Verification tool is reporting incorrect results
20200413043301 Verification check failed to run correctly
20200413063301 Verification check failed to run correctly
20200419113301 Verification check failed to run correctly
20200420103301 Verification check failed to run correctly
20200420153302 Verification check failed to run correctly
Times are UTC, naturally.
The check runs once every hour, and uses Puppeteer to automate a headless browser connection to my personal member verification URL and then reports the results back.
Where the check failed to run, that indicates the verification site was either down or responding very slowly, as it was on Monday (the 20th). These failures could potentially also be caused by my own Internet connection.
Where it wasn't reporting correct results, those entries were triggered because the order in which my certs were shown on the tool changed, and my script did not account for that. I've noticed this does still happen randomly, but my script no longer alerts on it since I adjusted it: it will only alert if not all of my certs are shown in the response.
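To illustrate the two kinds of log entry and the order-insensitive comparison described above, here is a rough sketch in Python. It is not the actual Puppeteer script: `fetch` is a hypothetical stand-in for whatever loads the verification page and extracts the list of certs, and the cert names in the usage lines are placeholders.

```python
from datetime import datetime, timezone

FAILED = "Verification check failed to run correctly"
INCORRECT = "Verification tool is reporting incorrect results"

def run_check(fetch, expected, now=None):
    """Run one check; return a log line, or None when all is well.

    `fetch` is a hypothetical callable that returns the certs shown by
    the site. It may raise OSError, which also covers timeouts.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d%H%M%S")  # same layout as the log above
    try:
        shown = fetch()
    except OSError:
        # Site down, too slow, or a local network problem.
        return f"{stamp} {FAILED}"
    # Compare as sets so that a change in display order is ignored;
    # only a missing cert counts as an incorrect result.
    if set(expected) - set(shown):
        return f"{stamp} {INCORRECT}"
    return None

# Placeholder cert names, shown in a different order: no alert.
print(run_check(lambda: ["CERT-B", "CERT-A"], ["CERT-A", "CERT-B"]))
```

Comparing sets rather than lists is what stops a reshuffled page from triggering a false "incorrect results" entry.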
EDIT: it's interesting to note how our logs don't seem to tally up. I didn't detect any errors on any of the days you did and vice versa!
Also, when I look at my logs, my check has been running for around 522 hours and only truly reported errors for 11 of those, which is a 2.1% error rate or, conversely, 97.9% uptime. That's not that bad really, all things considered!
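The numbers above can be reproduced from the log itself. A small sketch, assuming one check per hour and counting only the "failed to run" entries as true errors (the two "incorrect results" entries were false alarms caused by the ordering issue); the 522 hours is the approximate figure quoted above:

```python
LOG = """\
20200401203301 Verification check failed to run correctly
20200402173301 Verification check failed to run correctly
20200403123301 Verification check failed to run correctly
20200407063302 Verification check failed to run correctly
20200407083302 Verification check failed to run correctly
20200407103301 Verification check failed to run correctly
20200411221213 Verification tool is reporting incorrect results
20200411231212 Verification tool is reporting incorrect results
20200413043301 Verification check failed to run correctly
20200413063301 Verification check failed to run correctly
20200419113301 Verification check failed to run correctly
20200420103301 Verification check failed to run correctly
20200420153302 Verification check failed to run correctly
"""

HOURS_MONITORED = 522  # approximate, one check per hour

# Keep only the entries where the check itself could not complete.
failed = [line for line in LOG.splitlines() if "failed to run" in line]

error_rate = len(failed) / HOURS_MONITORED
uptime = 1 - error_rate

print(f"{len(failed)} failed checks out of {HOURS_MONITORED}")
print(f"error rate: {error_rate:.1%}, uptime: {uptime:.1%}")
```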
Good to see that you have been starting your own monitoring system, Alec. Kudos to you!
One of the possible explanations for the differences between your logs and mine is the method we use. I use the JSON API; you seem to use the standard HTTP(S) connection. I'd figure that they would end up on the same system, but perhaps there is a difference.
Another explanation you already gave is that availability is hampered by network congestion or outage. That can be a local phenomenon or it can be an international issue - e.g. when the American site is not reachable from here, in Holland. In this case, perhaps (ISC)2 could consider employing something like Akamai.
I'll keep monitoring the connection; perhaps we can compare notes from time to time.
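As a starting point for comparing notes, here is a minimal sketch that checks the two logs posted in this thread for days on which both monitors saw a problem. It assumes all dates are in 2020, only looks at days on or after 31 March (when the second monitor was created), and ignores the time-zone difference between the two logs, which could shift an entry near midnight to the neighbouring day.

```python
from datetime import date, datetime

SECOND_MONITOR_START = date(2020, 3, 31)

# Days with at least one failed connection, from the JSON API monitor.
api_days = {
    date(2020, 3, 20), date(2020, 3, 22), date(2020, 3, 23),
    date(2020, 3, 24), date(2020, 3, 28), date(2020, 4, 14),
    date(2020, 4, 15), date(2020, 4, 16), date(2020, 4, 22),
}

# Timestamps (UTC) of every entry from the headless-browser monitor.
browser_stamps = [
    "20200401203301", "20200402173301", "20200403123301",
    "20200407063302", "20200407083302", "20200407103301",
    "20200411221213", "20200411231212", "20200413043301",
    "20200413063301", "20200419113301", "20200420103301",
    "20200420153302",
]
browser_days = {
    datetime.strptime(s, "%Y%m%d%H%M%S").date() for s in browser_stamps
}

# Only days on which both monitors were running can be compared.
comparable_api_days = {d for d in api_days if d >= SECOND_MONITOR_START}
overlap = comparable_api_days & browser_days

print("API-only days:    ", sorted(comparable_api_days - browser_days))
print("browser-only days:", sorted(browser_days - comparable_api_days))
print("days in both logs:", sorted(overlap))
```

With the data posted so far the overlap is empty, which matches the observation that the two logs don't tally.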