Sunday, January 10, 2016

Troubleshooting and fixing Distributed Cache service in SharePoint 2013

The Distributed Cache (DC) is a new component in SharePoint 2013. It supports social computing features such as My Sites, microblogs, activity feeds and news feeds, and it also caches authentication tokens. This makes it one of the most critical components of SharePoint 2013 in terms of social computing.

The Distributed Cache service uses Windows AppFabric caching technology behind the scenes.
The cache can consume a lot of memory on the application and web servers. When implementing the DC service, there are two modes that can be used:

1.       Collocated mode: in this mode, the Distributed Cache service runs together with other services on the application server.
2.       Dedicated mode: in this mode, all services other than the Distributed Cache service are stopped on the application server that runs the Distributed Cache service.

Microsoft recommends using dedicated mode in the SharePoint farm.
Capacity planning is an important factor when you implement DC in the SharePoint farm.

These are Microsoft's recommendations for Distributed Cache capacity:

Deployment size | Small farm | Medium farm | Large farm
Total number of users | < 10,000 | < 100,000 | < 500,000
Recommended cache size for the Distributed Cache service | 1 GB | 2.5 GB | 12 GB
Total memory allocation for the Distributed Cache service (double the recommended cache size above, plus reserve 2 GB for the OS) | 2 GB | 5 GB | 34 GB
Recommended architectural configuration | Dedicated server or co-located on a front-end server | Dedicated server | Dedicated server
Minimum cache hosts per farm | 1 | 1 | 2

Note: The cache size of the Distributed Cache service should not exceed 16 GB on a server, so Microsoft recommends that you use two servers in a large farm environment.

When implementing DC, it is better to have a dedicated cache server even in a small farm.
What I found is that troubleshooting for DC is not well documented on TechNet, especially once you run into issues. Fortunately, there are blogs that help with troubleshooting DC. I have listed all the references at the end of this blog post.
My SharePoint Server 2013 farm is as follows:

OS: Windows Server 2012
SharePoint version: SharePoint Server 2013 Standard, build number 15.0.4420.1017 (RTM)
SQL Server: SQL Server 2012
Servers:
A) App Server, 8 GB RAM
B) Web Front End 01, 3 GB RAM
C) Web Front End 02, 3 GB RAM

First things first. I will list all the prerequisites for the Distributed Cache to function properly, so that you do not pull your hair out and become frustrated like me! :)
  1. Warning while setting up the DC service.

    Do not restart the AppFabric Caching service in the Services console. Microsoft strongly advises against this, and if you do it, you might need to rebuild your farm.
  2. Always use the Distributed Cache PowerShell cmdlets.
  3. Firewall ports
    1. Distributed Cache requires the following high ports: 22233, 22234, 22235, 22236.
      Note: If the firewall has not been opened for the above ports, set up DC with the Distributed Cache PowerShell cmdlets, which open the DC ports automatically.
    2. ICMPv4 and ICMPv6 have to be allowed for DC to function properly.
      Besides this, the following ports have to be opened as well:
      8, 138, 139, 445
  4. Firewalls in the organization

    If the network topology has 2 to 3 firewalls in front of the SharePoint farm, the ports have to be opened on all of them as well.

    Search and User Profile requirements

  5. Search: Continuous crawl has to be enabled.
  6. User Profile: The service account of the application pool of the My Site web application should have Full Control.
  7. Use Stop-SPDistributedCacheServiceInstance -Graceful to stop a Distributed Cache instance on any SharePoint server.
  8. Assign the Distributed Cache memory when you set up the Distributed Cache instance on each SharePoint server. DC eats memory like crazy and users will complain later on.
  9. Remote services have to be enabled.
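
The memory assignment in item 8 can be sketched with the Distributed Cache cmdlets. This is a minimal sketch rather than a tested procedure: the 1024 MB value is only an illustrative assumption (the small-farm cache size from the capacity table), and the steps assume that the cache service is stopped on every cache host before the size is changed on one of them.

```powershell
# Load the SharePoint cmdlets when not running in the SharePoint Management Shell.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Assumption: illustrative cache size in MB; take the real value from your capacity plan.
$cacheSizeInMB = 1024

# Step 1 (run on every cache host): stop the local DC instance gracefully.
Stop-SPDistributedCacheServiceInstance -Graceful

# Step 2 (run on one cache host only, once all hosts are stopped): set the new size.
Update-SPDistributedCacheSize -CacheSizeInMB $cacheSizeInMB

# Step 3 (run on every cache host): start the local DC instance again.
$instance = Get-SPServiceInstance | Where-Object {
    $_.TypeName -eq "Distributed Cache" -and $_.Server.Name -eq $env:COMPUTERNAME
}
$instance.Provision()
```

Because the three steps run on different servers, treat the block as a checklist to execute piece by piece, not as one script to run end to end.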

I will cover both collocated and dedicated modes for DC configuration.
  • In the collocated configuration, each server in the farm has a DC instance with the STARTED status.
  • In the dedicated configuration, you choose one server to be the dedicated Distributed Cache server, and the other web servers MUST have the STOPPED status. The Distributed Cache instance MUST still be present on all SharePoint servers.
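
To see which mode a farm is currently in, the status of the DC instance on each server can be listed from PowerShell. The following is a small sketch, assuming it runs on a farm server with the SharePoint snap-in available. Note that PowerShell reports the status as Online or Disabled, which corresponds to Started and Stopped in Central Administration.

```powershell
# Load the SharePoint cmdlets when not running in the SharePoint Management Shell.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# List the Distributed Cache service instance and its status for every server in the farm.
Get-SPServiceInstance |
    Where-Object { $_.TypeName -eq "Distributed Cache" } |
    Select-Object @{ Name = "Server"; Expression = { $_.Server.Name } }, Status
```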

Issue #1 Error: cacheHostInfo is null when removing an existing DC instance with Remove-SPDistributedCacheServiceInstance


Forcefully delete the Distributed Cache instance as follows:

# Find the DC service instance on the affected server ("SP2013App" is the server name here)
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? {($_.Service.ToString()) -eq $instanceName -and ($_.Server.Name) -eq "SP2013App"}
# Delete the instance
$serviceInstance.Delete()

Issue #2
Error starting the Distributed Cache instance

While provisioning a DC instance, you may receive the above error.


Remove and add the DC instance.

#Removing the service from SharePoint on local host.
Stop-SPDistributedCacheServiceInstance -Graceful
Remove-SPDistributedCacheServiceInstance
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? {($_.Service.ToString()) -eq $instanceName -and ($_.Server.Name) -eq $env:computername}
$serviceInstance.Delete()

#Add DC Instance

# Look up the cache cluster information for this farm
$SPFarm = Get-SPFarm
$cacheClusterName = "SPDistributedCacheCluster_" + $SPFarm.Id.ToString()
$cacheClusterManager = [Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheClusterInfoManager]::Local
$cacheClusterInfo = $cacheClusterManager.GetSPDistributedCacheClusterInfo($cacheClusterName)
# Check for a leftover DC service instance on this server
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? {($_.Service.ToString()) -eq $instanceName -and ($_.Server.Name) -eq $env:computername}
# Add the Distributed Cache instance back on this server
Add-SPDistributedCacheServiceInstance

Issue #3 ErrorCode<ERRPS002>:SubStatus<ES0001>:Invalid provider and connection string read. Please provide the values manually.

Somehow the connection string went missing, so we need to add the database entry for AppFabric manually as follows:

a) Run (Windows + R) and enter Regedit

c) Enter Connection String and Provider as follows:

Connection String:
Data Source=spsql;Initial Catalog=SPFarm_SharePoint_Config;Integrated Security=True;Enlist=False


Then use PowerShell to verify the Distributed Cache
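
One way to do this verification is with the AppFabric caching cmdlets, which are available in the SharePoint 2013 Management Shell. This is a sketch that assumes it runs on a cache host. Use-CacheCluster without parameters reads the provider and connection string from the local configuration, so it is also a quick check that the registry entries are readable again.

```powershell
# Connect to the cache cluster using the provider and connection string
# stored on this cache host (fails with ERRPS002 if they are still missing).
Use-CacheCluster

# List every cache host in the cluster with its service status (UP, DOWN, UNKNOWN).
Get-CacheHost

# Show the health of the named caches in the cluster.
Get-CacheClusterHealth
```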


Issue #4 Page load takes 6 seconds
Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage 'DistributedViewStateCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) ---> System.ServiceModel.ProtocolException

The Developer Dashboard showed that the page load took more than 6 seconds, and you can see an entry of exactly 6 seconds there.

In my SharePoint environment, I was getting the above errors with all servers in collocated mode for DC.

Fix:
It took me more than 4 weeks to find the actual issue. To troubleshoot the Distributed Cache, we first need to know which settings were incorrect in my environment.

As mentioned, I have 3 SharePoint servers: 1 application server and 2 web front ends.

a) On App Server


Only the App server status is UP.
Apps02: UP
Wfe01: Unknown
Wfe02: Unknown

The other WFE servers were each showing the error below:

Error: SubStatus(ES0001): Cache host is not reachable.

b) First Front End Server
Apps02: Unknown
Wfe01: Down
Wfe02: Unknown

c) Second Front End Server

Apps02: Unknown
Wfe01: Unknown
Wfe02: Down


Clearly, the cache hosts are not able to connect to each other. On the application server, the current server (Apps02) shows the UP service status, whereas the WFEs show the UNKNOWN status. Similarly, WFE01 and WFE02 each show UNKNOWN for the other servers. During my troubleshooting, I found that if any server has the UNKNOWN status, some configuration has to be fixed.

Collocated mode

Step 1: Create an inbound rule for the Distributed Cache ports (22233 - 22236) in the firewall of each server.

Perform this for each server.
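
On Windows Server 2012 the inbound rule can also be created from PowerShell instead of the firewall console. The following is a minimal sketch, assuming the Windows Firewall is the firewall in question. The rule display name is a hypothetical label chosen for this example, and the port range is the one listed in the prerequisites.

```powershell
# Run in an elevated PowerShell session on each SharePoint server.
# Assumption: the display name is an arbitrary label, not a name SharePoint expects.
New-NetFirewallRule -DisplayName "SharePoint Distributed Cache (TCP 22233-22236)" `
    -Direction Inbound `
    -Protocol TCP `
    -LocalPort "22233-22236" `
    -Action Allow

# Confirm that the rule exists and is enabled.
Get-NetFirewallRule -DisplayName "SharePoint Distributed Cache (TCP 22233-22236)" |
    Select-Object DisplayName, Enabled, Direction, Action
```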

Now, in my SharePoint farm, WFE02 shows these settings.

We have to open the firewall on WFE01 as well.

Start the Remote services on each server as shown: 

Turn on Ping for all SharePoint servers.
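
Turning on ping means allowing inbound ICMP echo requests. A sketch of how this could be done and checked from PowerShell is shown here. The rule display names are hypothetical labels, and the server names are the ones from my farm, so replace them with your own.

```powershell
# Run in an elevated PowerShell session on each SharePoint server.
# Allow inbound ICMPv4 echo requests (ICMP type 8).
New-NetFirewallRule -DisplayName "Allow ICMPv4 echo request" `
    -Direction Inbound -Protocol ICMPv4 -IcmpType 8 -Action Allow

# Allow inbound ICMPv6 echo requests (ICMP type 128).
New-NetFirewallRule -DisplayName "Allow ICMPv6 echo request" `
    -Direction Inbound -Protocol ICMPv6 -IcmpType 128 -Action Allow

# From any server, check that the other servers answer ping.
# Assumption: these are the server names of this farm.
foreach ($server in "Apps02", "Wfe01", "Wfe02") {
    $reachable = Test-Connection -ComputerName $server -Count 2 -Quiet
    "{0}: {1}" -f $server, $reachable
}
```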

Now each SharePoint server shows the server status as UP.

This works perfectly in collocated mode for the Distributed Cache.

Verify the page load. In my environment, the page load took 288.69 milliseconds with the Distributed Cache started.

To simulate a dedicated Distributed Cache server, I stopped the DC instance on both WFEs and left only the application server running the Distributed Cache instance.




I hope this blog post helps someone.

References:
1. Plan for feeds and the Distributed Cache service in SharePoint Server 2013
2. Manage the Distributed Cache service in SharePoint Server 2013
3. cacheHostInfo is null
4. Social MSDN
