On 11-12 April 2015, I thoroughly bungled what should have been a simple move of a virtual machine from one physical DOM0 to another. This document describes the mistakes I made and makes recommendations for avoiding them in the future.
The virtual was temporarily housed on one of our DOM0s (Xen Server) while we took the client's DOM0 down for maintenance. In addition to the hardware maintenance, we reinstalled the entire operating system from scratch, as it had been carried through several major upgrades over a 4-5 year period. When the work was completed, we returned the client machine to the NOC and prepared to copy the virtual from our machine back to its normal home. My preparations failed to take several things into account.
The virtual consisted of two LVM LVs: a 5G root volume and a 150G /var volume. The optimum time to transfer this over the gigabit switch connecting the two machines was about 20 minutes, but I told the client to expect a 2 hour downtime. Because of some significant failures in planning, the operation took over 16 hours.
The primary failure was not taking the network configuration into account. Both DOM0s in question had their primary interfaces set to xxx.xxx.xxx.xxx/32, meaning they cannot talk to anyone, for security purposes. Their secondary interfaces are set to private IP ranges, and in both cases the private range reaches the internet via a virtual router. So, while both machines were connected to the same gigabit switch, the speed was governed by the settings in the two routers. See the simplified diagram at left.
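A few seconds at the command line would have revealed the real path. This is a sketch; the address below is a placeholder standing in for the other DOM0's private IP:

```shell
# Ask the routing table which interface and gateway will carry the
# transfer; the address is a stand-in for the other DOM0's private IP.
ip route get 192.168.10.20
# If the output shows the traffic going "via" the virtual router rather
# than straight out an interface on the shared switch, the transfer is
# subject to that router's QoS limits, not the switch's line rate.
```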
On the client side, no major tuning had been done to the virtual router (bottom of image), so its impact was minimal: simply the time necessary to process the traffic through an IPFire firewall/router, which is very efficient. However, on the source router (top of image), we had turned on Quality of Service (QoS) to limit bandwidth so we do not easily exceed our contracted 20Mb/s usage.
The end result was that the transfer ran at 20Mb/s, about 2% of the available gigabit bandwidth, solely because of the router-based QoS.
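The arithmetic lines up with what actually happened. A back-of-envelope check, using the rounded sizes from above (protocol overhead ignored):

```shell
# 155G of LV data (5G root + 150G /var) pushed at the QoS-limited
# 20 Mb/s versus the gigabit line rate.
size_gb=155
slow_secs=$(( size_gb * 8 * 1000 / 20 ))     # seconds at 20 Mb/s
fast_secs=$(( size_gb * 8 * 1000 / 1000 ))   # seconds at 1 Gb/s
echo "At 20 Mb/s: about $(( slow_secs / 3600 )) hours"   # about 17 hours
echo "At 1 Gb/s:  about $(( fast_secs / 60 )) minutes"   # about 20 minutes
```

Seventeen-odd hours at 20Mb/s versus twenty minutes at line rate: almost exactly the 16+ hours observed versus the 20 minute optimum.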
Notice that we had an alternate path available, via the black lines. It would have required minor reconfiguration of the source and target, but would have let us bypass the routers completely and operate at the limits imposed only by the gigabit switch.
When the first transfer began, I noticed it was running far slower than expected. Because of this, I stopped the process about 5 minutes into the job and added compression to the pipeline.
dd if=/dev/vg0/image bs=4M | pv -petrs 150G | bzip2 -c | ssh target "bunzip2 -c | dd of=/dev/vg0/image bs=4M"
When I noticed that the compression was not helping much, I let it continue, hoping it would speed up when it hit some text data. However, since I had not cleaned up the disk before the process began, compression had almost no effect.
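A few seconds of testing would have shown whether compression was worth the CPU cost before committing the whole image to it. A sketch, using the device path from the transfer command above (the 64 MiB sample size is an arbitrary choice):

```shell
# Compress a small sample of the image and compare sizes before
# committing the whole 150G to a bzip2 pipeline.
raw=$(( 64 * 1024 * 1024 ))
packed=$(dd if=/dev/vg0/image bs=1M count=64 2>/dev/null | bzip2 -c | wc -c)
echo "64 MiB sample compressed to $(( packed * 100 / raw ))% of original"
# Anywhere near 100%, and the compression is pure overhead.
```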
Know your environment
In this case, I was working on a system I designed, and on which I have been the senior sysadmin for about 5 years. I had made "one little change", beginning to bring virtuals behind a DMZ, and did not consider that when I began work. Additionally, I had set up the QoS and then, for all intents and purposes, forgotten it existed. I did not plan the move, since it was similar to ones I have done many times in the past. However, a quick sketch of the new network could have alerted me to the QoS problem. There were alternatives: there are spare NICs on both machines, and I could quite easily have made a direct connection between them and pushed the image across at NIC speeds. Also, the primary interface on both machines could have been reconfigured for a private network (using aliases) without shutting down either machine, achieving much the same result.
However, since I did not plan this (because it was "too easy"), I did not notice the problem.
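The direct-connect alternative amounts to a few lines of configuration. A sketch, assuming eth1 is the spare NIC on each machine and that the 10.255.0.0/30 range is unused on this network (both are assumptions, not details from the incident):

```shell
# On the source DOM0: bring up the spare NIC on a tiny private subnet.
ip link set eth1 up
ip addr add 10.255.0.1/30 dev eth1

# On the target DOM0: the same, with the other usable address in the /30.
ip link set eth1 up
ip addr add 10.255.0.2/30 dev eth1

# The transfer then targets 10.255.0.2 directly, bypassing both virtual
# routers and their QoS. Clean up afterwards with "ip addr del".
```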
Always Plan: Someone said "fast is slow and slow is fast." Taking an extra 15 minutes or so to plan the task would have prevented the problem. The client would have had the level of service they deserve, and I would not have spent 16 hours working on one issue.
When a problem arises, stop and think
Probably the biggest mistake was not stopping to find the cause when the problem first appeared. Instead, I let my need to follow the poorly thought-out "plan" carry on without modification. Blindly adding compression to what was effectively random data is not a solution. Looking at the system, of the 150G to be transferred, over 33% (52G) was unused space. The system could have been restarted and zeros written to the unoccupied space. That would have taken time, but the site could have been up the whole while. There is a quick and dirty mechanism to put zeros in the empty sectors:
dd if=/dev/zero of=/path/mounted/deleteme bs=4M; sync; rm /path/mounted/deleteme
Simply running the above command (where /path/mounted is the mountpoint of the partition) would have filled most of the unused space in the partition with zeros. It would have taken 5-10 minutes while the client machine was operating, and resulted in a savings of up to 50G of transferred data, since long runs of zeros compress to almost nothing. Most likely the job would have been done in 10 hours instead of 16.
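Once the free space is zeroed, the receiving side can also be told not to write the all-zero blocks back out. A sketch based on the transfer pipeline used earlier (conv=sparse is a GNU dd feature, and skipping writes is only safe if the target volume holds no stale data that those writes would need to overwrite):

```shell
# Zero runs collapse to almost nothing under bzip2 on the wire, and
# conv=sparse makes the receiving dd seek past all-zero blocks instead
# of writing them out.
dd if=/dev/vg0/image bs=4M | bzip2 -c | \
  ssh target "bunzip2 -c | dd of=/dev/vg0/image bs=4M conv=sparse"
```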
When problems arise, analyze the cause and the solution. Taking a half hour, or even more if necessary, to find a better approach is worthwhile. In this case, it would have saved at least a third of the transfer time, and could conceivably have given me the chance to find the correct solution (see "Know your environment"), allowing me to easily meet the 2 hour goal. Failing to "waste" a half hour analyzing the problem cost me the better part of a day forcing the incorrect one.
Do not blindly follow the "plan"
We have serviced this client for well over a decade, so we know them very well. And we had a good fallback plan: simply restart the virtual on our server. Our relationship with the client makes me confident that, had we aborted, she would have been fine with it. We could have rescheduled for the next day or, better still, the next weekend. Then we could have analyzed the problem, done some test runs, and arrived at the appropriate solution.
One of my many failings is that, when I have a plan, I follow it without thinking. If I had spent a few minutes thinking about the problem that arose and decided I had insufficient information, I could easily have aborted the move, informed the client, and then solved the issue at my leisure.
When problems arise, be prepared to throw out the plan and come up with a better one. There are some instances where this is not applicable, where rescheduling is simply not in the cards. However, if you take the time to think, you can judge that possibility better.
The common thread above is not taking the time to think, at any stage. No matter how many times you have done a task, do not fall into the trap of assuming this project will be the same. Take a few minutes and itemize what is different about this task compared to the similar ones (the firewalls and routers, in this case). If something goes wrong, stop, think, analyze and, if the solution is not immediately obvious, abort if possible.

Tags: mistake, virtual move, xen