While testing the VNXe 3100 (OE 188.8.131.5208) I found a problem when changing the MTU setting of a link aggregation. With a specific combination of configurations, changing the MTU causes ESXi (4.1.0, 502767) to lose all iSCSI datastores, and even after changing the setting back the datastores are still not visible on ESXi. The VNXe also can’t provision new datastores to ESXi while this problem is occurring. There are a couple of workarounds, but no official fix is available yet to avoid this kind of situation.
How did I find it?
After the initial configuration I created a link aggregation from two ports, set its MTU to 9000 and created one iSCSI server on SP A with two IP addresses. I then configured ESXi to use MTU 9000 as well. Datastore creation on the VNXe side went through successfully, but on the ESXi side I could see an error saying that the VMFS volume couldn’t be created.
I could see the LUN under the iSCSI adapter, but manually creating a VMFS datastore also failed. I then realized that I hadn’t configured jumbo frames on the switch and decided to change the ESXi and VNXe MTUs back to 1500. After I changed the VNXe MTU the LUN disappeared from ESXi. Manually changing the datastore access settings on the VNXe didn’t help either. I just couldn’t get ESXi to see the LUN anymore. I then tried to provision a new datastore to ESXi but got this error:
OK, so I deleted the datastore and the iSCSI server, then recreated the iSCSI server and provisioned a new datastore for ESXi without any problems. I suspected that the MTU change had caused the problem, so I tried it again. I changed the link aggregation MTU on the VNXe from 1500 to 9000, and once that was done the datastore disappeared from ESXi. Changing the MTU back to 1500 didn’t help: the datastore and LUN were still not visible on ESXi. Creating a new datastore also gave the same error as before. The datastore was created on the VNXe but was not accessible from ESXi. Deleting and recreating the datastores and iSCSI servers resolved the issue again.
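The original mistake here (MTU 9000 on the VNXe and ESXi but not on the switch) is the kind of thing that can be checked before provisioning anything. Below is a minimal Python sketch of the arithmetic behind such a check: the largest ICMP payload that fits into one unfragmented IPv4 frame is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The helper names and the target address are made up for illustration, and the vmkping flags (-d for "don't fragment", -s for payload size) are an assumption that should be verified against the ESXi release in use.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload <= 0:
        raise ValueError("MTU %d is too small for an ICMP echo" % mtu)
    return payload


def vmkping_command(target_ip, mtu):
    """Build (not run) a vmkping command line for an end-to-end MTU test.

    Assumption: vmkping accepts -d (don't fragment) and -s (payload size).
    Check this on your ESXi build before relying on it.
    """
    return ["vmkping", "-d", "-s", str(icmp_payload_for_mtu(mtu)), target_ip]


if __name__ == "__main__":
    # 192.168.1.50 is a hypothetical iSCSI server address.
    print(" ".join(vmkping_command("192.168.1.50", 9000)))
```

If a ping built this way for MTU 9000 fails while the MTU 1500 variant works, some device on the path is most likely not passing jumbo frames.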
What is the cause of this problem?
So it seemed that the MTU change was causing the problem. I started testing with different scenarios and found out that the problem was the combination of the MTU change and also the iSCSI server having two IP addresses. Here are some scenarios that I tested (sorry about the rough grammar, tried to keep the descriptions short):
Link aggregation MTU 1500 and iSCSI server with two IP addresses. Provisioned storage works on ESXi. Changing VNXe link aggregation MTU to 9000 and ESXi loses connection to the datastore. Changing VNXe MTU back to 1500 and ESXi still can’t see the datastore. Trying to provision a new datastore to ESXi results in an error. Removing the other IP address doesn’t resolve the problem.
Link aggregation MTU 1500 and iSCSI server with two IP addresses. Provisioned storage works on ESXi. Removing the other IP from the iSCSI server and changing MTU to 9000. Datastore is still visible and accessible from ESXi. Changing MTU back to 1500 and datastore is still visible and accessible from ESXi. Datastore provisioning to ESXi is successful. After adding another IP address to the iSCSI server ESXi loses the connection to the datastore. Provisioning a new datastore to ESXi results in an error. Removing the other IP address doesn’t resolve the problem either.
Link aggregation MTU 1500 and iSCSI server with one IP address. Provisioned storage works on ESXi. Changing MTU to 9000. Datastore is still visible and accessible from ESXi. Changing MTU back to 1500 and datastore is still visible and accessible from ESXi. Datastore provisioning to ESXi is successful. After adding another IP address to the iSCSI server ESXi loses the connection to the datastore. Provisioning a new datastore to ESXi results in an error. Removing the other IP doesn’t resolve the problem.
Link aggregation MTU 1500 and two iSCSI servers on one SP, both configured with one IP. One datastore on each iSCSI server (there is also an issue getting the datastore on the other iSCSI server provisioned, see my previous post). Adding a second IP to the first iSCSI server and both datastores are still accessible from ESXi. When changing MTU to 9000 ESXi loses connection to both datastores. Changing MTU back to 1500 and both datastores are still not visible on ESXi. Also getting the same error as previously when trying to provision new storage.
I also tested different combinations with iSCSI servers on different SPs. If the iSCSI server on SP A has two IP addresses, the iSCSI server on SP B has only one IP and the MTU is changed, then the datastores on the SP B iSCSI server are not affected.
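To keep the scenarios straight, here is a small Python toy model that does nothing more than replay the observations above. It assumes the rule that seems to fit all of them: the datastores on an SP are lost as soon as the MTU has been changed and any iSCSI server on that SP has two IP addresses, in either order, and nothing short of recreating things brings them back. The class and method names are made up, and this is only a summary of what I saw, not a description of how the VNXe works internally.

```python
class StorageProcessorModel:
    """Toy model of one SP that only replays the observed behaviour."""

    def __init__(self):
        self.ips_per_server = {}   # iSCSI server name -> number of IP addresses
        self.mtu_changed = False   # has the link aggregation MTU been changed?
        self.locked = False        # True = datastores on this SP invisible to ESXi

    def _check(self):
        # Assumed trigger: an MTU change combined with a two-IP iSCSI server.
        two_ips = any(count >= 2 for count in self.ips_per_server.values())
        if self.mtu_changed and two_ips:
            # Sticky: reverting the MTU or removing the IP did not help.
            self.locked = True

    def set_ip_count(self, server, count):
        self.ips_per_server[server] = count
        self._check()

    def change_mtu(self):
        self.mtu_changed = True
        self._check()


# Replay of the third scenario: one IP, MTU change, then a second IP is added.
sp_a = StorageProcessorModel()
sp_a.set_ip_count("iscsi1", 1)
sp_a.change_mtu()                # datastore still visible
sp_a.set_ip_count("iscsi1", 2)   # connection lost
sp_a.set_ip_count("iscsi1", 1)   # removing the IP does not bring it back
```

In this model an SP whose iSCSI servers all have a single IP never locks, which matches the SP B observation.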
How to fix this?
Currently there is no official fix for this. I have reported the problem to EMC support, demonstrated the issue to an EMC support technician and uploaded all the logs, so they are working on finding the root cause.
Apparently when an iSCSI server has two IP addresses and the MTU is changed, the iSCSI server goes into some kind of “lockdown” mode and doesn’t allow any connections to be initiated. As I already described, the VNXe can be returned to an operational state by removing all datastores and iSCSI servers and recreating them. Of course this is not an option when there is production data on the datastores.
The EMC support technician showed me a quicker and less radical workaround to get the array back to an operational state: restarting the iSCSI service on the VNXe. CAUTION: Restarting the iSCSI service will disconnect all provisioned datastores from the hosts. The connections to the datastores will be re-established after the iSCSI service has restarted, but the interruption will cause all running VMs to crash.
The easiest way to restart the iSCSI service is to enable the iSNS server in the iSCSI server settings, give it an IP address and apply the changes. After the changes have been applied, the iSNS server can be disabled again. This triggers a restart of the iSCSI service, and all datastores that were disconnected are again visible and usable on ESXi.
After this finding I would suggest not configuring iSCSI servers with two IP addresses. If an MTU change can do this much damage, what about other changes?
If you have iSCSI servers with two IP addresses, I would advise against changing the MTU even during a planned service break. If for some reason the change is mandatory, contact EMC support before making it. If you have arrays affected by this issue, I would encourage you to contact EMC support before trying to restart the iSCSI service.
Once again I have to give credit to EMC support. They have some great people working there.