Lync Server 2013 CU5 (August 2014) Unexpectedly Starts Front-End Services

So far, we’ve only patched our backup pool, but we normally (even in the backup pool) drain all the Lync services on a front end (Stop-CsWindowsService -Graceful -NoWait) before applying a CU.  This also helps avoid the need to reboot the server.  We use -NoWait because we still (at least as of CU4) have issues with the response group call performance counter going haywire (say, showing 1 billion active calls; I think it is actually due to a subtraction from zero).  This inaccurate performance counter prevents the RTCRGS service from ever draining, so in this scenario we usually give it a few minutes and then stop it without -Graceful.
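To illustrate the kind of bug I suspect behind that counter, here is a minimal Python sketch of what happens when a counter stored as an unsigned integer is decremented while it is already at zero. This is only a guess at the mechanism: the 32-bit width and the function names are my assumptions, not anything taken from the Lync code, and the number we actually saw (roughly a billion) does not match a clean 32-bit wrap, so treat it as an illustration rather than a diagnosis.

```python
def decrement_unsigned(value: int, bits: int = 32) -> int:
    """Decrement a counter that is stored as an unsigned integer.

    The width (32 bits by default) is an assumption for illustration.
    """
    return (value - 1) % (1 << bits)


def decrement_clamped(value: int) -> int:
    """Decrement a counter but never let it drop below zero."""
    return max(value - 1, 0)


active_calls = 0  # the service believes no calls are active

# An extra "call ended" event arrives and the counter is decremented anyway.
wrapped = decrement_unsigned(active_calls)
clamped = decrement_clamped(active_calls)

print(wrapped)  # 4294967295
print(clamped)  # 0
```

If something like this is going on, a graceful drain that waits for the active call count to reach zero would never finish, which matches what we see with RTCRGS.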

Anyway, we noticed that when the CU5 updater runs, it applies an update to Windows Fabric at the end.  This update appears to unexpectedly start all of the Lync Server services on the front end, even if they were all previously stopped, and it seems to wait for them to be running before finishing the fabric patch.  Because the services restart on their own, we can’t plan the timing of the restart or work in a desired reboot as we could with previous CUs.  Also, one of the 4 FEs in this pool actually required a reboot at the fabric update step, I believe because one of the services either started too early or didn’t start quickly enough.  In this case, I had to say “no” to the reboot and then drain the services again, so that I could reboot without forcing the Lync services to shut down with the OS.

After doing all the FEs, I applied the database update.  At a quick glance, rtcab and mgc appear to be the only databases whose version numbers changed.  I’ll be checking tomorrow to see if rtcab’s update reversed our ABSCONFIG custom settings, which has happened in past CUs.

 

Lync PIN authentication permanently locked out – event ID 47066

Thanks to SCOM and the Lync 2013 management pack, I came across the following error on our Lync 2013 front ends:

Log Name: Lync Server
Source: LS UserPin Service
Date: 3/26/2014 3:51:14 PM
Event ID: 47066
Task Category: (1044)
Level: Error
Keywords: Classic
User: N/A
Computer: FE3.ad.domain.com
Description:
Found users who are already or about to be permanently locked
The following users are either permanently locked out, or about to be permanently locked out:-
1. User: c92e5697-0ceb-4e90-83f9-34f1f642de5e@domain.com, FailedLogonAttempts: 5000, MaximumFailedAttemptsAllowed: 5000
2. User: fabd07f9-4375-4812-a825-5a1da1c589ae@domain.com, FailedLogonAttempts: 5000, MaximumFailedAttemptsAllowed: 5000
Cause: The affected users might be using very old pin, or they are under denial of service and spoofing attacks.
Resolution:
Please get in touch with the affected users and ask them to change their pins. Examine server logs to verify that this was not an intentional attack.

This appears to be an undocumented error message and an undocumented lockout feature.  (This is of course no surprise when administering Lync. Don’t you recall a time when Microsoft products at least had all the error messages they could produce documented on TechNet?)  The only hits on web searches are the SCOM MP dumps on Viacode.

So, here is some documentation.  In our scenario, these turned out to be common area phones whose PIN had been changed or updated at some point without ever being updated on the Lync Phone Edition end.  These phones are still out there, plugged into the network and chugging away at repeated attempts to register.  Looking at the logs, they are even still hitting our old Lync 2010 pool first, apparently because they are relying on their old cache rather than using DHCP options to contact the new pool.  They also continue to grab LPE firmware updates, which work even though their authentication is bad, since internally no successful authentication is required to download an update.

I also think this is an interesting find because I can’t readily find any documentation of a 5000-attempt lockout on PIN authentication.  The lockout is a good thing… so I guess it is documented here now!
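If you want to do something with this event beyond reading it in SCOM, the description text is regular enough to parse. Here is a minimal Python sketch that pulls the user, the failed attempt count, and the maximum out of a description like the one above. The regular expression is based only on the single sample event shown here, so the format on other builds may differ, and the function name and the shape of the returned dictionaries are my own choices.

```python
import re

# Pattern derived from the one sample event above; other builds may differ.
LOCKOUT_LINE = re.compile(
    r"User:\s*(?P<user>\S+),\s*"
    r"FailedLogonAttempts:\s*(?P<failed>\d+),\s*"
    r"MaximumFailedAttemptsAllowed:\s*(?P<maximum>\d+)"
)


def parse_locked_users(description):
    """Return one dict per user listed in an event 47066 description."""
    users = []
    for match in LOCKOUT_LINE.finditer(description):
        failed = int(match.group("failed"))
        maximum = int(match.group("maximum"))
        users.append({
            "user": match.group("user"),
            "failed": failed,
            "maximum": maximum,
            "locked": failed >= maximum,  # at or past the limit
        })
    return users


SAMPLE = """The following users are either permanently locked out, or about to be permanently locked out:-
1. User: c92e5697-0ceb-4e90-83f9-34f1f642de5e@domain.com, FailedLogonAttempts: 5000, MaximumFailedAttemptsAllowed: 5000
2. User: fabd07f9-4375-4812-a825-5a1da1c589ae@domain.com, FailedLogonAttempts: 5000, MaximumFailedAttemptsAllowed: 5000
"""

for entry in parse_locked_users(SAMPLE):
    print(entry["user"], entry["failed"], entry["maximum"], entry["locked"])
```

The "User: N/A" line in the event header does not match the pattern, so passing the whole event text instead of just the description should not produce a bogus entry.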

 

Bug in 2012/2012R2 DFS client kills Lync front-end servers

We have a reasonably large Lync Server 2013 deployment running on Windows Server 2012, and we use a DFS share for the file share.  After migrating user load from 2010 to the new 2013 pool, we started encountering periodic failures.

First, we would see oddness in some of the front-end services.  It seemed to happen most often on the active machine for pool backup, but not always.  We saw increasing problems completing calls, particularly to the response group service.  The most disturbing part was that when we opened a remote desktop session to the affected front-end server, the logon would hang on processing group policy.  If you waited long enough, or were fortunate enough to already have a session, you could get in.

Once logged in, it was clear that the OS was malfunctioning, not just Lync.  If you tried to access any file share from the command prompt or PowerShell, you would end up with a hung window.  Not only hung, but impossible to close, even with Task Manager!  In fact, it seemed like all the processes on the server were falling victim to this.  W3WP worker processes were building up and not exiting.  Lync services couldn’t be stopped and restarted.  Even the OS would not reboot without forcing it to crash or powering the server off remotely.

Looking at handles and threads, the affected front end would have many times more than the working front ends.  Our ‘normal’ front ends typically have 5,000 or fewer threads and about 90,000 handles in use.  On the affected front end, we would see 10,000 to 15,000 threads or more and sometimes 1,000,000 handles in use.  This was definitely looking like an operating system bug.
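Those numbers make for an easy early warning check. Here is a minimal Python sketch of the comparison, using the baselines from our environment. The factor of 2, the function name, and the FE1/FE2 sample values are assumptions for illustration; how you collect the thread and handle counts (performance counters, SCOM, etc.) is left out, and your own baselines will almost certainly differ.

```python
# Baselines from our 'normal' front ends; yours will differ.
BASELINE_THREADS = 5_000
BASELINE_HANDLES = 90_000


def looks_unhealthy(threads, handles, factor=2.0):
    """Flag a server whose thread or handle count is well above baseline.

    The default factor of 2 is an arbitrary threshold chosen for this sketch.
    """
    return (threads >= BASELINE_THREADS * factor
            or handles >= BASELINE_HANDLES * factor)


# Hypothetical samples: (threads, handles) per front end.
samples = {
    "FE1": (4_800, 88_000),       # typical healthy server
    "FE2": (12_000, 1_000_000),   # similar to what we saw on an affected server
}

suspects = sorted(
    name for name, (threads, handles) in samples.items()
    if looks_unhealthy(threads, handles)
)
print(suspects)  # ['FE2']
```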

Working with Microsoft Premier support, we configured the servers for a complete memory dump, and after some time we managed to capture one from a server having the problem.  Support analyzed the dump and found that the problem was caused by the DFS client service.  The bug was matched to a hotfix that was under development.  We were able to obtain an early version of the hotfix, and support has since told me: “This hotfix was made public early this week due to the high number of cases that were being reported.”

Given how often we experienced the problem (every few days), I’d think practically any reasonably active Lync Server 2013 deployment running on 2012/2012R2 and using DFS for the file share would run into this issue, and without the hotfix it will pretty much kill one front end at a time, slowly.  Now that the hotfix is public, I wanted to share not only the above experience but also the article number: KB2925981.  It looks like the actual knowledge base article is not public yet, but you should be able to obtain the hotfix from product support.