Question : VMware cluster issues

Dear fellow admins,

I have issues with my vSphere cluster.

First, a brief explanation of my setup.

4 Fujitsu Siemens RX servers, each with 3 NICs.
IP net for ESX servers and virtual servers: 192.168.10.x
IP nets for iSCSI: 192.168.11.x and 192.168.12.x

Each server has:
2 NICs for iSCSI - IP nets 11.x and 12.x
1 NIC for everything else - 10.x net

1 48-port 1000 Mbit switch (ZyXEL layer 3 switch). I am not using VLANs at all. All traffic goes through this switch.

1 IBM DS3300 iSCSI array with dual controllers.
Controller setup:
Controller A:
Port 1: 192.168.11.6
Port 2: 192.168.12.6

Controller B:
Port 1: 192.168.11.7
Port 2: 192.168.12.7

iSCSI is configured with 2 paths to each LUN, active/active (I/O), plus 2 inactive paths.
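
As a sanity check on the addressing above, here is a minimal Python sketch that groups the controller ports by subnet. The /24 prefix length and the port labels are assumptions on my part; the netmask is not stated anywhere in this post.

```python
import ipaddress
from collections import defaultdict

# Controller port addresses as listed above.
# The labels ("A/port1", ...) are illustrative names only.
PORTALS = {
    "A/port1": "192.168.11.6",
    "A/port2": "192.168.12.6",
    "B/port1": "192.168.11.7",
    "B/port2": "192.168.12.7",
}


def group_by_subnet(portals, prefix=24):
    """Map each subnet to the controller ports that live in it.

    prefix=24 is an assumed netmask (255.255.255.0).
    """
    groups = defaultdict(list)
    for name, addr in portals.items():
        # strict=False lets us pass a host address and get its network back.
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        groups[str(net)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for net, ports in sorted(group_by_subnet(PORTALS).items()):
        print(net, ports)
```

With these addresses, each iSCSI subnet contains one port from controller A and one from controller B, so either iSCSI NIC should be able to reach both controllers.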

There is no internal DNS.
There is 1 VMware infrastructure server which controls the cluster, IP: 192.168.10.37
Snippet from the hosts file on the ESX servers:

192.168.10.20 ESX1.Hosting ESX1
192.168.10.21 ESX2.Hosting ESX2
192.168.10.22 ESX3.Hosting ESX3
192.168.10.23 ESX4.Hosting ESX4
192.168.10.37 VSS.Hosting VSS

All right, I have attached a picture of the network setup on one server (it is the same on all).

Now my problems:

Problem 1:

Some of the ESX hosts sometimes get disconnected from the cluster and then reconnect afterwards.
From the event log:
*Host 192.168.10.21 in datacenter Datacenter is not responding.*
Then the server gets disconnected, and then:

Alarm 'Virtual machine cpu usage' on SERVER changed
from Green to Gray
info
12-05-2010 10:19:44

Alarm 'Host connection failure' on 192.168.10.21 triggered
an action
info
12-05-2010 10:19:44
Alarm 'Host connection failure' on entity 192.168.10.21
send SNMP trap
info
12-05-2010 10:19:44

Then it normally reconnects by itself.


Problem 2:
Sometimes the virtual servers lose connection with the VMware network.
Today one virtual server even got shut down and started again. But before that happened, I noticed this when looking at Tasks and Events for "backup server":

192.168.10.21 is disconnected (.21 is the ESX server where the backup server is located)
And then:
Host is connected
info
12-05-2010 06:40:27

info
12-05-2010 05:58:19
This occurred 8 times within 1½ hours, and then the backup server shut down.

Problem 3:
One of the ESX servers shows this in the event log:

Alarm 'Cannot connect to storage' on entity 192.168.10.20
send SNMP trap
info
12-05-2010 01:43:13
Alarm 'Cannot connect to storage' on 192.168.10.20
changed from Gray to Gray
info
12-05-2010 01:43:13
Alarm 'Cannot connect to storage' on 192.168.10.20
changed from Gray to Gray
info
12-05-2010 01:43:13

Lost connectivity to storage device naa.600a0b80005aedbb00000ac54b17dfff. Path vmhba33:C7:T0:L3 is down. Affected datastores: "IBM LUN3".
error
12-05-2010 01:41:03

Lost connectivity to storage device naa.600a0b80005aedbb00000f7e4b656403. Path vmhba33:C7:T0:L5 is down. Affected datastores: "IBM LUN5".
error
12-05-2010 01:41:03

Lost access to volume 4b2a36fb-986d63ce-b6ea-000ae48a8ba7 (IBM LUN2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
12-05-2010 01:41:04
Successfully restored access to volume 4b2a36fb-986d63ce-b6ea-000ae48a8ba7 (IBM LUN2) following connectivity issues.
info
12-05-2010 01:42:03


and so on
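
To see whether the failures cluster on a single path, the "Path ... is down" events above could be tallied with something like the following Python sketch. The sample text is copied from the two events in this post with the wrapped lines joined; the regular expression is my own guess at the message layout based only on these two samples, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Two events copied from the log above, wrapped lines joined.
EVENTS = """\
Lost connectivity to storage device naa.600a0b80005aedbb00000ac54b17dfff. Path vmhba33:C7:T0:L3 is down. Affected datastores: "IBM LUN3".
Lost connectivity to storage device naa.600a0b80005aedbb00000f7e4b656403. Path vmhba33:C7:T0:L5 is down. Affected datastores: "IBM LUN5".
"""

# Assumed layout: "Path <adapter>:C<channel>:T<target>:L<lun> is down".
PATH_DOWN = re.compile(
    r"Path\s+(?P<hba>vmhba\d+):C(?P<channel>\d+):T(?P<target>\d+):L(?P<lun>\d+)\s+is down"
)


def failed_paths(text):
    """Return one dict (hba, channel, target, lun) per 'path is down' event."""
    return [m.groupdict() for m in PATH_DOWN.finditer(text)]


def count_by_channel_target(text):
    """Count failures per (adapter, channel, target), ignoring the LUN number."""
    return Counter(
        (p["hba"], p["channel"], p["target"]) for p in failed_paths(text)
    )


if __name__ == "__main__":
    for key, n in count_by_channel_target(EVENTS).items():
        print(key, n)
```

In the two events shown, both LUNs fail on the same adapter, channel and target (vmhba33:C7:T0). If that holds for the rest of the log, it may point to one physical path (a NIC, cable, switch port or controller port) rather than to the individual LUNs.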


Please help me, guys. I have read a lot of documentation on this before posting here, but frankly, I no longer know what to do. These are serious problems.

Maybe installing a second NIC in each server and dedicating it to VM traffic, so that service console traffic is separated, would be a good idea?

Answer : VMware cluster issues

I suggest you start by checking the network (cables, switches, etc.), since there seem to be intermittent disconnections. Check the speed and duplex settings: I had a lot of issues with Cisco switches and Dell servers when they were set to auto-negotiate. Manually setting them to 1000 Mbps and full duplex resolved some of the issues.
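
One rough way to catch intermittent drops is to poll the iSCSI portals over TCP and log every state change. The Python sketch below is only an illustration of that idea: it assumes the array listens on the default iSCSI port 3260, that the script runs on a machine with access to the 11.x and 12.x networks, and the function names are made up. A successful TCP connect only shows that the network path is up at that moment; it says nothing about duplex mismatches or iSCSI sessions as such.

```python
import socket
import time
from datetime import datetime

ISCSI_PORT = 3260  # default iSCSI target port; change if the array differs


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def watch(targets, interval=5.0, rounds=None):
    """Poll targets and print a line whenever one changes state.

    rounds=None polls forever; an integer stops after that many passes.
    Returns the last known state per (host, port).
    """
    state = {}
    done = 0
    while rounds is None or done < rounds:
        for host, port in targets:
            up = tcp_reachable(host, port)
            if state.get((host, port)) != up:
                stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                print(f"{stamp} {host}:{port} {'up' if up else 'DOWN'}")
                state[(host, port)] = up
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return state


if __name__ == "__main__":
    # Portal addresses taken from the question.
    portals = ["192.168.11.6", "192.168.12.6", "192.168.11.7", "192.168.12.7"]
    watch([(ip, ISCSI_PORT) for ip in portals])
```

Comparing the timestamps it prints with the vCenter events should show whether the storage alarms line up with real network drops on one subnet, one controller, or everything at once.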

What RAID level are you using on the SAN?
