Troubleshoot Storage Spaces Direct

Use the information in this article to troubleshoot your Storage Spaces Direct deployment.

In general, start with these steps:

  1. Confirm that the make and model of the SSDs are certified for Windows Server 2016 and Windows Server 2019 by using the Windows Server Catalog. Confirm with the vendor that the drives are supported for Storage Spaces Direct.
  2. Inspect the storage for any faulty drives. Use storage management software to check the status of the drives. If any of the drives are faulty, work with your vendor.
  3. Update the storage and drive firmware if necessary. Ensure that the latest Windows Updates are installed on all nodes. You can get the latest updates for Windows Server 2016 from Windows 10 and Windows Server 2016 update history. Get the latest updates for Windows Server 2019 from Windows 10 and Windows Server 2019 update history.
  4. Update the network adapter drivers and firmware.
  5. Run cluster validation and review the Storage Spaces Direct section. Ensure that the drives you use for the cache are reported correctly and have no errors.
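
As a minimal sketch of steps 2, 3, and 5, the following commands run cluster validation and list drives that need attention. The node names are placeholders, and the choice of validation categories and selected columns is an assumption for illustration, not a required set.

```powershell
# Placeholder node names; replace them with the nodes of your cluster.
$nodes = 'Node-01', 'Node-02'

# Run the validation categories that are most relevant to Storage Spaces Direct.
Test-Cluster -Node $nodes -Include 'Storage Spaces Direct', 'Inventory', 'Network', 'System Configuration'

# List physical disks that don't report a Healthy status.
Get-PhysicalDisk |
    Where-Object HealthStatus -ne 'Healthy' |
    Select-Object FriendlyName, SerialNumber, MediaType, Usage, OperationalStatus, HealthStatus

# Review the model and firmware version of every drive before you contact the vendor.
Get-PhysicalDisk |
    Sort-Object Model |
    Select-Object FriendlyName, Model, FirmwareVersion, MediaType, Usage
```

Test-Cluster writes a validation report that you can open to review the Storage Spaces Direct section.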

If you're still having problems, review the troubleshooting information for each of the specific issues in this article.

Virtual disk resources are in No Redundancy state

After the nodes of a Storage Spaces Direct system restart unexpectedly because of a crash or power failure, one or more of the virtual disks might not come online, and you see the description Not enough redundancy information.

FriendlyName  ResiliencySettingName  OperationalStatus  HealthStatus  IsManualAttach  Size   PSComputerName
Disk4         Mirror                 OK                 Healthy       True            10 TB  Node-01.contoso.
Disk3         Mirror                 OK                 Healthy       True            10 TB  Node-01.contoso.
Disk2         Mirror                 No Redundancy      Unhealthy     True            10 TB  Node-01.contoso.
Disk1         Mirror                                    Unhealthy     True            10 TB  Node-01.contoso.

Also, after an attempt to bring the virtual disk online, the following information, which refers to the DiskRecoveryAction property, is logged in the cluster log.

[Verbose] 00002904.00001040::YYYY/MM/DD-12:03:44.891 INFO [RES] Physical Disk <DiskName>: OnlineThread: SuGetSpace returned 0.
[Verbose] 00002904.00001040::YYYY/MM/DD-12:03:44.891 WARN [RES] Physical Disk <DiskName>: Underlying virtual disk is in 'no redundancy' state; its volume(s) may fail to mount.
[Verbose] 00002904.00001040::YYYY/MM/DD-12:03:44.891 ERR [RES] Physical Disk <DiskName>: Failing online due to virtual disk in 'no redundancy' state. If you would like to attempt to online the disk anyway, first set this resource's private property 'DiskRecoveryAction' to 1. We will try to bring the disk online for recovery, but even if successful, its volume(s) or CSV may be unavailable.

The No Redundancy operational status occurs if a disk fails or if the system can't access data on the virtual disk. This issue can happen if a node restarts while maintenance is being performed on the nodes.
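
To locate these entries yourself, you can generate the cluster logs and search them for the relevant messages. The following is a sketch only: the destination folder and the 60-minute window are assumptions chosen for illustration.

```powershell
# Hypothetical output folder; any writable path works.
$logDir = 'C:\Temp\ClusterLogs'
New-Item -ItemType Directory -Path $logDir -Force | Out-Null

# Generate cluster logs from every node, covering the last 60 minutes.
Get-ClusterLog -Destination $logDir -TimeSpan 60 | Out-Null

# Search the generated logs for the no-redundancy messages (case-insensitive).
Select-String -Path (Join-Path $logDir '*.log') -Pattern 'no redundancy', 'DiskRecoveryAction' -SimpleMatch |
    Select-Object Filename, LineNumber, Line
```

Widen the -TimeSpan value if the failed online attempt happened earlier.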

To fix this issue, follow these steps:

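  1. Remove the affected virtual disks from CSV. Doing so moves them to the Available Storage group in the cluster, where they appear with a ResourceType of Physical Disk.

Remove-ClusterSharedVolume -Name "CSV Name"

  2. Identify the node that owns the Available Storage group.

Get-ClusterGroup

  3. For every disk that's in a No Redundancy state, set the disk recovery action, and then start the disk.

Get-ClusterResource "Physical Disk Resource Name" | Set-ClusterParameter -Name DiskRecoveryAction -Value 1
Start-ClusterResource -Name "Physical Disk Resource Name"

  4. After the virtual disks are healthy again, set the disk recovery action back to 0.

Get-ClusterResource "Physical Disk Resource Name" | Set-ClusterParameter -Name DiskRecoveryAction -Value 0

  5. Take the disks offline and then bring them online again so that the DiskRecoveryAction change takes effect.

Stop-ClusterResource "Physical Disk Resource Name"
Start-ClusterResource "Physical Disk Resource Name"

  6. Add the affected virtual disks back to CSV.

Add-ClusterSharedVolume -Name "Physical Disk Resource Name"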
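
Before you set DiskRecoveryAction back to 0, it helps to confirm that any repair has finished and that the virtual disks are healthy. The following sketch assumes that a repair job starts after the disk comes online with DiskRecoveryAction set to 1. The resource name is the same placeholder used in the commands above.

```powershell
# Show storage jobs (such as repair) and their progress.
Get-StorageJob |
    Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# Confirm that every virtual disk reports a Healthy status.
Get-VirtualDisk |
    Select-Object FriendlyName, OperationalStatus, HealthStatus

# Check the current value of the DiskRecoveryAction private property.
Get-ClusterResource 'Physical Disk Resource Name' |
    Get-ClusterParameter -Name DiskRecoveryAction
```

Rerun Get-StorageJob until no repair jobs remain.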

DiskRecoveryAction is an override switch that lets you attach the Space volume in read-write mode without any checks. The property lets you diagnose why a volume isn't coming online. It's similar to maintenance mode, but you can invoke it on a resource in a failed state. It also lets you access the data so that you can copy it, which is helpful in no-redundancy situations. The DiskRecoveryAction property was added in the February 22, 2018 update, KB 4077525.

Detached status in a cluster

When you run the Get-VirtualDisk cmdlet, the OperationalStatus for one or more Storage Spaces Direct virtual disks is Detached. However, the HealthStatus reported by the Get-PhysicalDisk cmdlet indicates that all the physical disks are in a Healthy state.

This example shows the output from the Get-VirtualDisk cmdlet.

FriendlyName  ResiliencySettingName  OperationalStatus  HealthStatus  IsManualAttach  Size   PSComputerName
Disk4         Mirror                 OK                 Healthy       True            10 TB  Node-01.contoso.
Disk3         Mirror                 OK                 Healthy       True            10 TB  Node-01.contoso.
Disk2         Mirror                 Detached           Unknown       True            10 TB  Node-01.contoso.
Disk1         Mirror                 Detached           Unknown       True            10 TB  Node-01.contoso.
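
A view like the one above, limited to the affected disks, can be produced with a filter on OperationalStatus. This is a sketch, and the selected columns are an assumption based on the headers in the example output.

```powershell
# List only the virtual disks whose operational status includes Detached.
Get-VirtualDisk |
    Where-Object { $_.OperationalStatus -match 'Detached' } |
    Format-Table FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus, IsManualAttach, Size

# Summarize physical disk health to confirm that all drives report Healthy.
Get-PhysicalDisk |
    Group-Object HealthStatus |
    Select-Object Name, Count
```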

Also, the following events might be logged on the nodes:

Log Name: Microsoft-Windows-StorageSpaces-Driver/Operational
Source: Microsoft-Windows-StorageSpaces-Driver
Event ID: 311
Level: Error
User: SYSTEM
Computer: Node#.contoso.local
Description: Virtual disk requires a data integrity scan.
Data on the disk is out-of-sync and a data integrity scan is required.
To start the scan, run this command:
Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
Once you have resolved that condition, you can online the disk by using these commands in PowerShell:
Get-VirtualDisk | ?{ $_.ObjectId -Match "" } | Get-Disk | Set-Disk -IsReadOnly $false
Get-VirtualDisk | ?{ $_.ObjectId -Match "" } | Get-Disk | Set-Disk -IsOffline $false
------------------------------------------------------------
Log Name: System
Source: Microsoft-Windows-ReFS
Event ID: 134
Level: Error
User: SYSTEM
Computer: Node#.contoso.local
Description: The file system was unable to write metadata to the media backing volume . A write failed with status "A device which does not exist was specified." ReFS will take the volume offline. It might be mounted again automatically.
------------------------------------------------------------
Log Name: Microsoft-Windows-ReFS/Operational
Source: Microsoft-Windows-ReFS
Event ID: 5
Level: Error
User: SYSTEM
Computer: Node#.contoso.local
Description: ReFS failed to mount the volume.
Context: 0xffffbb89f53f4180
Error: A device which does not exist was specified.
Volume GUID:
DeviceName:
Volume Name:
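
To check whether a node logged these events, you can query the three logs directly. The following sketch uses the log names, sources, and event IDs listed above. The limit of 20 events per query is an arbitrary choice for illustration.

```powershell
# Event 311: virtual disk requires a data integrity scan.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-StorageSpaces-Driver/Operational'
    Id      = 311
} -MaxEvents 20 -ErrorAction SilentlyContinue

# Event 134: ReFS was unable to write metadata and took the volume offline.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-ReFS'
    Id           = 134
} -MaxEvents 20 -ErrorAction SilentlyContinue

# Event 5: ReFS failed to mount the volume.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-ReFS/Operational'
    Id      = 5
} -MaxEvents 20 -ErrorAction SilentlyContinue
```

The -ErrorAction SilentlyContinue setting suppresses the error that Get-WinEvent raises when no matching events exist.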

The Detached operational status occurs if the dirty region tracking (DRT) log is full. Storage Spaces uses DRT for mirrored spaces to ensure that when a power failure occurs, any in-flight updates to metadata are logged. The logged updates let the storage space redo or undo operations, which returns it to a flexible and consistent state after power is restored and the system comes back up. If the DRT log is full, the virtual disk can't be brought online until the DRT metadata is synchronized and flushed. This process requires running a full scan, which can take several hours to finish.

To fix this issue, follow these steps:

  1. Remove the affected virtual disks from CSV.

Remove-ClusterSharedVolume -Name "CSV Name"

  2. For every disk that isn't coming online, set the DiskRunChkDsk parameter to 7, and then start the disk.

Get-ClusterResource -Name "Physical Disk Resource Name" | Set-ClusterParameter DiskRunChkDsk 7
Start-ClusterResource -Name "Physical Disk Resource Name"

  3. Start the data integrity scan.

Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
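
Because the scan can take several hours, you might want to check whether the scheduled task is still running. The following is a minimal sketch. It assumes that the task state is an adequate indicator of scan activity, and it reuses the placeholder resource name from the commands above.

```powershell
$taskName = 'Data Integrity Scan for Crash Recovery'

# A State of Running indicates that the scan is still in progress on this node.
Get-ScheduledTask -TaskName $taskName |
    Select-Object TaskName, State

# Show when the task last ran and the result code that it returned.
Get-ScheduledTask -TaskName $taskName |
    Get-ScheduledTaskInfo |
    Select-Object LastRunTime, LastTaskResult

# Confirm the DiskRunChkDsk value that is set on the physical disk resource.
Get-ClusterResource -Name 'Physical Disk Resource Name' |
    Get-ClusterParameter -Name DiskRunChkDsk
```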