Wednesday 26 October 2016

vSphere On-disk Metadata Analyzer

vSphere On-disk Metadata Analyzer (VOMA) is a utility that checks the metadata of a VMFS file system for consistency.

We might need to check the metadata consistency of a file system after a storage outage, or after a disk replacement, when errors start appearing in the vmkernel.log file.

What do these errors look like? I haven't hit them in my Home Lab environment, but when I deliver the VMware vSphere troubleshooting workshop, one of the troubleshooting topics covers metadata inconsistencies and the steps we need to take to deal with this kind of problem.

So I went looking for related KB articles and found KB article 2036767, which describes using vSphere On-disk Metadata Analyzer (VOMA) to check VMFS metadata consistency. The errors look like this in vmkernel.log:

vmkernel: 25:21:39:57.861 cpu15:1047)FS3: 130: <START termserv2-5160fe37.vswp>
vmkernel: 25:21:39:57.861 cpu15:1047)Lock [type 10c00001 offset 52076544 v 69, hb offset 4017152
vmkernel: gen 109, mode 1, owner 4a15b3a2-fd2f4020-3625-001a64353e5c mtime 3420]
vmkernel: 25:21:39:57.861 cpu15:1047)Addr <4, 1011, 10>, gen 36, links 1, type reg, flags 0x0, uid 0, gid 0, mode 600
vmkernel: 25:21:39:57.861 cpu15:1047)len 3221225472, nb 3072 tbz 0, zla 3, bs 1048576
vmkernel: 25:21:39:57.861 cpu15:1047)FS3: 132: <END termserv2-5160fe37.vswp>
vmkernel: 0:00:20:51.964 cpu3:1085)WARNING: Swap: vm 1086: 2268: Failed to open swap file '/volumes/4730e995-faa64138-6e6f-001a640a8998/mule/mule-560e1410.vswp': Invalid metadata
vmkernel: 0:00:20:51.964 cpu3:1085)WARNING: Swap: vm 1086: 3586: Failed to initialize swap file '/volumes/4730e995-faa64138-6e6f-001a640a8998/mule/mule-560e1410.vswp': Invalid metadata
cpu11:268057)WARNING: HBX: 599: Volume 50fd60a3-3aae1ae2-3347-0017a4770402 ("<Datastore_name>") may be damaged on disk. Corrupt heartbeat detected at offset 3305472: [HB state 0 offset 6052837899185946624 gen 15439450 stampUS 5 $
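
To spot entries like these in a large vmkernel.log without reading it line by line, a small script can help. The following is only a minimal sketch: the list of signatures is taken from the excerpt above and is an assumption for illustration, not a complete list of VMFS metadata errors, and the function name is made up for this example.

```python
import re

# Substrings taken from the log excerpt above. This list is an assumption
# for illustration, not an exhaustive set of VMFS metadata error messages.
SIGNATURES = (
    "Invalid metadata",
    "may be damaged on disk",
    "Corrupt heartbeat detected",
)

# Matches the datastore UUID in "Volume <uuid>" or "/volumes/<uuid>/".
VOLUME_RE = re.compile(
    r"(?:Volume |/volumes/)([0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def find_metadata_warnings(lines):
    """Return (volume_uuid_or_None, line) for each line with a known signature."""
    hits = []
    for line in lines:
        if any(signature in line for signature in SIGNATURES):
            match = VOLUME_RE.search(line)
            hits.append((match.group(1) if match else None, line.rstrip()))
    return hits


if __name__ == "__main__":
    sample = [
        "vmkernel: 0:00:20:51.964 cpu3:1085)WARNING: Swap: vm 1086: 2268: "
        "Failed to open swap file '/volumes/4730e995-faa64138-6e6f-001a640a8998"
        "/mule/mule-560e1410.vswp': Invalid metadata",
        "cpu11:268057)WARNING: HBX: 599: Volume "
        '50fd60a3-3aae1ae2-3347-0017a4770402 ("<Datastore_name>") may be '
        "damaged on disk. Corrupt heartbeat detected at offset 3305472",
        "vmkernel: 25:21:39:57.861 cpu15:1047)len 3221225472, nb 3072 tbz 0",
    ]
    for volume, line in find_metadata_warnings(sample):
        print(volume, "->", line[:70])
```

The volume UUID it prints tells us which datastore to look at before running VOMA.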

Before running VOMA, ensure that all virtual machines on the affected datastore are powered off or migrated to another datastore.

Next, we need the name and partition number of the device that backs the VMFS datastore we are planning to check.

We can get both by running the esxcli storage vmfs extent list command after connecting to the ESXi host over SSH (for example with PuTTY).

Now that we have the details we need, it's time to run vSphere On-disk Metadata Analyzer.

*Note: The installation, configuration and specification methods used here have been tested in my nested Home Lab environment.

 voma -m vmfs -f check -d /vmfs/devices/disks/eui.5adcee56739fb3ea:1

Here eui.5adcee56739fb3ea is the device name and 1 is the partition number.

***** VOMA ran successfully in my lab because there were no metadata consistency problems. If your environment does have metadata consistency issues, you may see errors such as (Error: Missing LVM Magic. Disk doesn’t have a valid LVM Device Error: Failed to Initialize LVM Metadata)

We can also run the above command with a log file specified, so that the output is stored and can be sent to the VMware support team.
*****When the corruption is irreversible, VMware recommends restoring the datastore files from a backup.

In my environment I stored the output in an output.txt file in the /tmp directory, by adding the name and location of the file to the same command.

 voma -m vmfs -f check -d /vmfs/devices/disks/eui.5adcee56739fb3ea:1 -s /tmp/output.txt
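
If we need to run the same check against several devices, it can help to build the command line from the device name and partition number instead of typing it each time. Here is a minimal sketch that uses only the flags shown in this post (-m, -f, -d and -s). The function name is hypothetical and the script only builds the argument list; it does not run anything on the host.

```python
def build_voma_command(device, partition, log_path=None):
    """Build the argument list for a VOMA metadata check.

    device    -- device name as reported by the extent list command
    partition -- partition number of the VMFS extent
    log_path  -- optional file in which VOMA stores its output (-s)
    """
    if not device:
        raise ValueError("device name must not be empty")
    partition = int(partition)
    if partition < 1:
        raise ValueError("partition number must be 1 or greater")
    target = "/vmfs/devices/disks/{}:{}".format(device, partition)
    command = ["voma", "-m", "vmfs", "-f", "check", "-d", target]
    if log_path:
        command += ["-s", log_path]
    return command


if __name__ == "__main__":
    # Device name and log path are the ones used in this post.
    print(" ".join(build_voma_command("eui.5adcee56739fb3ea", 1, "/tmp/output.txt")))
```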

Tuesday 25 October 2016

ESXi Frequently Disconnects from vCenter Server

Recently in my Home Lab environment I noticed one of my ESXi hosts frequently getting disconnected from vCenter Server.

Before we go further with troubleshooting, we need to understand that an ESXi host sends heartbeats (UDP 902) to vCenter Server to let it know that the host is accessible over the management network. So the problem appears to be heartbeat packets being dropped, blocked or lost between vCenter Server and the ESXi host.

As part of the troubleshooting process we can come up with many possible causes, because three different layers are involved (vCenter Server, ESXi and hardware).

Starting with the vCenter Server layer, one possible cause is a Windows Firewall rule on the vCenter Server system blocking UDP port 902.

At the ESXi layer, the host may be configured to use a port other than 902 for heartbeats, or the internal ESXi firewall may be blocking that port.

At the bottom layer, the hardware layer, we suspect that the network between the ESXi hosts and vCenter Server is congested.

In my Home Lab I am using a Windows-based vCenter Server, so I checked its firewall settings first. No ports were configured there, so to be on the safer side I disabled Windows Firewall.

Next I checked whether the ESXi host was using the default port 902 or some other port, as I wasn't sure whether I had changed it as part of some testing.

*Note: The installation, configuration and specification methods used here have been tested in my nested Home Lab environment.

As suspected, it wasn't using the default port 902. That left two ways to fix the problem: either add a rule to the ESXi firewall to allow the port being used, or change the port back to 902.

Changing the port back to default 902 can be done by editing the vpxa.cfg file residing at /etc/vmware/vpxa/.
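
Before editing the file it is worth confirming which port is currently configured. The sketch below reads a copy of vpxa.cfg and reports the heartbeat port. It assumes the port is stored in a `<serverPort>` element, which you should confirm against your own file; the function name and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

DEFAULT_HEARTBEAT_PORT = 902


def read_heartbeat_port(cfg_text):
    """Return the port from the first <serverPort> element, or None if absent."""
    root = ET.fromstring(cfg_text)
    node = root.find(".//serverPort")
    if node is None or not (node.text or "").strip():
        return None
    return int(node.text.strip())


if __name__ == "__main__":
    # Hypothetical excerpt; a real vpxa.cfg contains many more elements.
    sample = """<config>
      <vpxa>
        <serverIp>192.0.2.10</serverIp>
        <serverPort>9020</serverPort>
      </vpxa>
    </config>"""
    port = read_heartbeat_port(sample)
    if port != DEFAULT_HEARTBEAT_PORT:
        print("Heartbeat port is", port, "instead of", DEFAULT_HEARTBEAT_PORT)
```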

To create an ESXi firewall rule that allows other ports, I referred to KB article 2020100, which covers the details we need to enter when creating the heartbeat.xml file and adding rules to it.

We can use the command vi /etc/vmware/firewall/heartbeat.xml, which creates the heartbeat.xml file if it does not already exist and opens it for editing.
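
For reference, the content of such a rule file can also be generated rather than typed by hand. The sketch below builds the XML for a single outbound UDP rule. The element layout follows the custom firewall rule format as I understand it, so treat it as an assumption and compare it with the KB article before using it; the function name, the service id and the port value 9020 are placeholders.

```python
import xml.etree.ElementTree as ET


def build_heartbeat_rule(port, service_id="heartbeat"):
    """Return firewall rule XML allowing outbound UDP traffic to `port`.

    The element names below are an assumption based on the ESXi custom
    firewall rule format; verify them against the KB article.
    """
    root = ET.Element("ConfigRoot")
    service = ET.SubElement(root, "service", id="0000")
    ET.SubElement(service, "id").text = service_id
    rule = ET.SubElement(service, "rule", id="0000")
    ET.SubElement(rule, "direction").text = "outbound"
    ET.SubElement(rule, "protocol").text = "udp"
    ET.SubElement(rule, "porttype").text = "dst"
    ET.SubElement(rule, "port").text = str(int(port))
    ET.SubElement(service, "enabled").text = "true"
    ET.SubElement(service, "required").text = "false"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 9020 is a placeholder; use the port your host is actually configured with.
    print(build_heartbeat_rule(9020))
```

After saving the file on the host, the firewall rules normally need to be reloaded (esxcli network firewall refresh) before the new rule shows up.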

That fixed the problem, since the issue was with one of my ESXi hosts. Had that not been the cause, the last resort would have been to check the hardware layer, i.e. whether there is network congestion between my vCenter Server and the ESXi host.

We can either use a third-party tool such as Wireshark to analyze the live traffic, or use the esxtop command-line utility, which can give us insight into any congestion in our network.

Thursday 20 October 2016

Roll Back to Older ESXi Version Home Lab Results

While delivering another VMware vSphere class this week, I got a question from someone who wanted to downgrade the version of an ESXi host.

At a time when we are talking about upgrading ESXi from 6 to 6.5, why would someone want to downgrade ESXi? (Did I just mention 6.5? Yes. For those of you who are not aware of vSphere 6.5, here is the link for your reference: Introducing VMware vSphere 6.5, announced yesterday at VMworld Barcelona.)

The answer I got was interesting enough to make me test it in my Home Lab environment: "We accidentally applied a patch in our environment, after which the ESXi host got disconnected from vCenter Server and we are not able to add it back again. So we want to go back to the last known good configuration of our ESXi server."

So it was time to install VMware vSphere Update Manager in my Home Lab, apply a patch to one of my ESXi servers, and find a way to roll it back to the old version.

I installed Update Manager using the embedded SQL Server 2012 Express database.

I also downloaded the vSphere Update Manager client plug-in using the VMware vSphere Client.

After Update Manager was installed successfully, I created a host upgrade baseline so that I could upgrade ESXi 6.0 to ESXi 6.0 Update 1 and later roll it back to the same version.

Before we can remediate the ESXi host, we need to attach the newly created baseline to it and stage it. Then we remediate it.

After successful remediation, I connected to my ESXi host and compared the old build with the new one to confirm that Update 1 had been installed successfully.

Now it's time to roll ESXi back to the previous version. While the host is rebooting, press Shift+R (KB article 1033604). This shows the details of the recently installed update and lets us roll back to the older version.

*Note: The installation, configuration and specification methods used here have been tested in my nested Home Lab environment.

Let's connect back to the ESXi host using SSH and look at the version again. We can see that the Update 1 build 3029758 we installed has been rolled back, and my ESXi host is once again running the old version, i.e. ESXi 6.0 build 2494585.
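
If we want to confirm the rollback from a script rather than by eye, the build number can be pulled out of the version string. This is a rough sketch that assumes the string has the form printed by vmware -v on the host ("VMware ESXi 6.0.0 build-2494585"); the function names are made up, and the build numbers are the two from this post.

```python
import re

# Assumed format of the version string, e.g. "VMware ESXi 6.0.0 build-2494585".
VERSION_RE = re.compile(r"VMware ESXi (\d+(?:\.\d+)*) build-(\d+)")


def parse_esxi_version(text):
    """Return (version, build_number) parsed from a version string."""
    match = VERSION_RE.search(text)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(text))
    return match.group(1), int(match.group(2))


def is_rolled_back(version_text, original_build):
    """True when the host reports the build it was running before the update."""
    _, build = parse_esxi_version(version_text)
    return build == original_build


if __name__ == "__main__":
    ORIGINAL_BUILD = 2494585  # ESXi 6.0 build before the update
    UPDATED_BUILD = 3029758   # ESXi 6.0 Update 1 build
    print(is_rolled_back("VMware ESXi 6.0.0 build-2494585", ORIGINAL_BUILD))
    print(is_rolled_back("VMware ESXi 6.0.0 build-%d" % UPDATED_BUILD, ORIGINAL_BUILD))
```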

Friday 14 October 2016

VCAP6-DCV Design Objective 2.3

We have already had a detailed discussion about the VCAP6-DCV Design exam, covering the path we need to follow and the objectives we need to take care of. In case you missed it, here is the link for your reference: Kick Start Your Journey Towards VCAP6-DCV Design.

We have also covered the earlier objectives, in which we discussed business and application requirements, risks, constraints and assumptions, how to map business requirements into a VMware vSphere logical design, and how to map service dependencies. In case you missed them, here are the links for your reference: VCAP6-DCV Design Objective 1.1, VCAP6-DCV Design Objective 1.2, VCAP6-DCV Design Objective 1.3, VCAP6-DCV Design Objective 2.1, VCAP6-DCV Design Objective 2.2

Objective 2.3 – Build Availability Requirements into a vSphere 6 Logical Design

Skills and Abilities
  • Evaluate which logical availability services can be used with a given vSphere solution.
  • Differentiate infrastructure qualities related to availability.
  • Describe the concept of redundancy and the risks associated with single points of failure.
  • Explain class of nines methodology.
  • Determine availability component of service level agreements (SLAs) and service level management processes.
  • Determine potential availability solutions for a logical design based on customer requirements.
  • Create an availability plan, including maintenance processes.
  • Balance availability requirements with other infrastructure qualities.
  • Analyze a vSphere design and determine possible single points of failure.
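
The class of nines methodology listed above is simple arithmetic: each extra nine in the availability target cuts the allowed downtime by a factor of ten. Here is a small sketch of the calculation, assuming a 365-day year (leap years ignored); the function names are just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def availability_percent(nines):
    """Availability for a given number of nines, e.g. 3 -> 99.9 (percent)."""
    return 100 * (1 - 10 ** (-nines))


def downtime_minutes_per_year(nines):
    """Maximum downtime per year allowed by a given class of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


if __name__ == "__main__":
    for nines in range(2, 6):
        minutes = downtime_minutes_per_year(nines)
        print("{} nines ({:.3f}%): {:.2f} minutes/year ({:.2f} hours)".format(
            nines, availability_percent(nines), minutes, minutes / 60))
```

Three nines (99.9%) works out to roughly 8.76 hours of downtime a year, which is a useful number to have in mind when a customer states an SLA target.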
How to Prepare
Now that we have seen the skills and abilities required for VCAP6-DCV Design Objective 2.3, it's time to work through the content in the links mentioned above and understand how it helps us build availability requirements into a vSphere 6 logical design.

I still remember one of the classes I delivered for the VMware vSphere Design & Deploy 6 fast track course (which helps the audience prepare for the VCAP6-DCV Design and VCAP6-DCV Deploy exams).

During the class we had an awesome discussion about gathering availability requirements and how these requirements play a vital role in a VMware vSphere design.

Let's start with High Availability, which is one of the key features of VMware vSphere and provides protection against various failures (application failure, guest OS failure, network isolation, host failure and datastore-related issues).

Image source: VMware

Before we talk about how VMware vSphere High Availability protects against these failures, let's spend some time understanding planned and unplanned downtime. Both are downtime and both affect production machines, but understanding them better helps us reduce them and improve our service level agreements.

Planned Downtime - Hardware maintenance, server migration and firmware updates all require downtime for physical servers.

With VMware vSphere we can reduce planned downtime by migrating our workloads to a different ESXi host without downtime or service disruption.

Unplanned Downtime - Various VMware vSphere capabilities help reduce unplanned downtime, including:

Shared storage, which lets us eliminate single points of failure by storing virtual machine files on shared storage.

NIC teaming, which provides tolerance of individual network card failures.
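
The effect of removing a single point of failure can be estimated with the standard series and parallel availability formulas. This is a minimal sketch, assuming that components fail independently of each other (which real hardware only approximates). The 99% figure for a single network card is a made-up number for illustration.

```python
def series_availability(availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def parallel_availability(availability, copies):
    """Availability of `copies` redundant components where one is enough."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    single_nic = 0.99  # hypothetical availability of one network card
    teamed = parallel_availability(single_nic, 2)
    print("single NIC: {:.4%}".format(single_nic))
    print("two teamed NICs: {:.4%}".format(teamed))
    # A host that needs both its network path and its storage path:
    print("network + storage in series: {:.4%}".format(
        series_availability([teamed, 0.999])))
```

Redundant components multiply their failure probabilities, while components in series multiply their availabilities, which is why one non-redundant component can drag down an otherwise redundant design.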

Coming back to VMware vSphere High Availability: it protects against the failures mentioned above, but whether we actually need each of these features is an important question that has to be addressed.

My design requirements, including the application requirements, help me answer this question. Maybe the client for whom I am preparing this design doesn't need VM and application level monitoring because they already use a third-party tool for that.

But they do want protection against host-level failures, so that the virtual machines can be restarted on a different ESXi host with a minimal amount of downtime. Yes, there is still downtime, because the affected virtual machines are restarted on different ESXi hosts.

Another requirement may come up in the design where the client has some mission-critical virtual machines that cannot afford any downtime. As solution architects we can propose another important feature of VMware vSphere, Fault Tolerance (FT), which provides protection without data loss or loss of TCP connections. If you are new to how FT works, here is a quick link for your reference: Back to Basics - Part 10 Fault Tolerance
Another beautiful white paper, which talks about how we can Protect our Business with Automated Business Continuity Solutions by overcoming various cost and complexity challenges, and how we can Extend our Virtualization Investment to Achieve High Availability, contributes directly to achieving Objective 2.3.

Key Note - When working on a design it is always worth considering the VMware design best practices, which cover host placement, Auto Deploy, VMware vCenter Server availability considerations, networking design considerations (redundancy, NIC teaming and the load balancing policies to be used) and storage considerations.

Wednesday 12 October 2016

Replicate VM's with Nakivo Backup & Replication

We have already dedicated a couple of articles to Nakivo Backup and Replication v6.1, in which we looked at the architectural components and the new features available in v6.1. Here is a link for your quick reference: Demystifying Nakivo Backup and Replication v6.1

In our last post in the Nakivo Backup and Replication series we also discussed how to back up and recover Active Directory objects with Nakivo Backup and Replication v6.1. In case you missed it, here is the link for your quick reference: Backup/Recover Active Directory Objects with Nakivo

It's time to dig deeper and test some more replication-related functionality of Nakivo Backup and Replication v6.1. For those of you who are not aware of how this is set up in my environment: I downloaded the Windows-based installer for Nakivo Backup and Replication v6.1 and integrated it with my Home Lab environment. Here is the quick link for you to download the same: Nakivo Backup and Replication v6.1.

Replicating VMs with Nakivo Backup & Replication

NAKIVO Backup & Replication lets us replicate VMware virtual machines by creating and maintaining an identical copy of the source VM/instance at the target location.

Using Nakivo Backup & Replication we can replicate our virtual machines by creating a job that specifies the VMs we are planning to replicate and the location where the replicas should be stored.

In my Home Lab environment I am planning to replicate one of the virtual machines running in my USGC PROD datacenter.

Once we have selected the virtual machine to replicate, it's time to choose the destination host (ESXi02), destination datastore (Shared) and destination network (VM Network).

*Note: The installation, configuration and specification methods used here have been tested in my nested Home Lab environment.

Next we specify the job schedule, i.e. when the replication job we have created for our virtual machines should run. To get on with the testing I did not schedule it, but ran it immediately instead.

In the final step we can specify parameters such as the job name and App Aware Mode (enabled by default, it provides application and database consistency by relying on VMware quiescing technology), and we can also select the recovery point retention.

There are also some advanced parameters, such as whether to append text to replica names or leave them as they are, and which pre and post actions to run (including execution of scripts).

It's always good to see the job run successfully, at least for a person like me who has little exposure to backup and replication products. It's the user-friendly GUI that helps me feel more comfortable.

I will be writing another article to explain each of the options mentioned above, and we will also have a look at the advanced parameters in our next post in the Nakivo Backup and Replication series.