Bug in 2012/2012R2 DFS client kills Lync front-end servers

We have a reasonably large Lync Server 2013 deployment running on Windows Server 2012, and we use a DFS share for the Lync file share.  After migrating user load from the Lync Server 2010 pool to the new 2013 pool, we started encountering periodic failures.

First, we would see oddness in some of the front-end services.  It seemed to happen most often on the active machine for pool backup, but not always.  We were getting increasing problems completing calls, particularly to the response group service.  The most disturbing part was that when we opened a remote desktop session to the front-end server that was having problems, the logon would hang on processing group policy.  If you waited long enough, or were fortunate enough to already have a session, you could get in.

Once logged in, it was clear that the OS was malfunctioning, not just Lync.  If you tried to access any file share from the command prompt or PowerShell, you would end up with a hung window.  Not only was it hung, it could not be closed, even with Task Manager!  In fact, it seemed like all the processes on the server were falling victim to this.  W3WP worker processes were building up and not exiting.  Lync services couldn’t be stopped and restarted.  Even the OS would not reboot without forcing it to crash or powering the server off remotely.
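If you want to watch for the hung-share symptom without tying up your own console, something like the following could work.  This is only a sketch, not what we ran in production: the function name `probe_path` and the 10-second timeout are my own assumptions, and on a badly affected server the stuck worker thread may still keep the Python process from exiting cleanly, just as other processes could not be closed.

```python
import os
import threading


def probe_path(path, timeout_s=10.0):
    """Try to list a directory, giving up after timeout_s seconds.

    Returns "ok" if the listing succeeded, "error" if the OS reported a
    failure, and "hung" if the call did not return within the timeout.
    """
    result = {}

    def worker():
        try:
            os.listdir(path)
            result["status"] = "ok"
        except OSError:
            result["status"] = "error"

    # Daemon thread: the caller is not blocked if the listing never returns.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    return result.get("status", "hung")
```

Pointing it at the DFS path for the Lync file share from each front-end, on a schedule, would return "hung" on a server where file share access no longer completes.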

Looking at handles and threads, the affected front-end would have many times more than the working front-ends.  Our ‘normal’ front-ends typically have 5,000 or fewer threads and about 90,000 handles in use.  On the affected front-end, we would see 10,000 to 15,000 threads or more, and sometimes 1,000,000 handles in use.  This was definitely looking like an operating system bug.
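Those numbers make for an easy early-warning check.  Here is a minimal sketch, assuming you already collect system-wide thread and handle counts somehow (for example from the `\System\Threads` and `\Process(_Total)\Handle Count` performance counters).  The baseline values are the ones from our healthy front-ends; the `Counts` type, the `looks_affected` name and the 2x threshold are my own choices for illustration, so tune them for your environment.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    threads: int
    handles: int


# Baseline taken from our healthy front-ends; yours will differ.
BASELINE = Counts(threads=5_000, handles=90_000)


def looks_affected(sample, baseline=BASELINE, factor=2.0):
    """Return True if threads or handles exceed factor times the baseline.

    The default factor of 2.0 is an assumed threshold, not a documented one.
    """
    return (
        sample.threads > baseline.threads * factor
        or sample.handles > baseline.handles * factor
    )
```

With these defaults, a front-end showing 12,000 threads and 1,000,000 handles is flagged, while one at 4,500 threads and 88,000 handles is not.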

Working with Microsoft Premier support, we configured the servers for a complete memory dump, and after some time we managed to capture one from a server having the problem.  Support analyzed the dump and found that the problem was caused by the DFS client service.  The bug was matched to a hotfix that was still under development.  We were able to obtain an early version of the hotfix, and support has since told us, “This hotfix was made public early this week due to the high number of cases that were being reported.”

Given how often we experienced the problem (every few days), I’d think practically any reasonably active Lync Server 2013 deployment running on 2012/2012R2 and using DFS for the file share would run into this issue.  Without the hotfix, it will pretty much kill one front-end at a time, slowly.  Now that the hotfix is public, I wanted to share not only the above experience but also the article number: KB2925981.  It looks like the actual knowledge base article is not public yet, but you should be able to obtain the hotfix from product support.