Slides from my talk yesterday are here, uploaded to the Bratislava.pm.org page.
My thanks go to everyone who took part in this great conference and helped make this event happen!
It's already 2 years since I had a chance to burn in and install an HP ProLiant DL320 G5p server. Even though I took notes back then, I never found the time to write them up until now, when a good reason suddenly came up... So I'll write a bit about the hardware, the installation of Debian Lenny with Xen paravirtualization, and the performance tests that I ran back then. Someone might find the test numbers useful as a reference.
Here are the HW parts:
1x SERV HP DL320G5p QC-3210, 2.13GHz/1333 2x1GB SAS/SATA, Rack
4x HP 2GB UB PC2-6400 1x2GB Kit (ML110G5, ML310G5, DL320G5p)
3x HDD HP 72GB DP 3.5" Hot Plug, SAS 15k
1x HDD HP 500GB 3,5" Hot Plug, SATA 7.2K
2x HP SC44Ge PCI-Ex HBA
1x HP DL320G5p iLO Port Opt Kit
1x HP DL1U 4 Drive Cage
1x HP HBA SAS-SATA 4x1LN Cable Kit
Or in short: a quad-core 2.13GHz Xeon CPU, 8GB RAM, 3x 72GB SAS + 500GB SATA HDD, with an iLO hardware remote console.
neo:~# lspci
00:00.0 Host bridge: Intel Corporation 3200/3210 Chipset DRAM Controller (rev 01)
00:01.0 PCI bridge: Intel Corporation 3200/3210 Chipset Host-Primary PCI Express Bridge (rev 01)
00:06.0 PCI bridge: Intel Corporation 3210 Chipset Host-Secondary PCI Express Bridge (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 02)
00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 02)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IR (ICH9R) LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02)
01:02.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
01:04.0 System peripheral: Compaq Computer Corporation Integrated Lights Out Controller (rev 03)
01:04.2 System peripheral: Compaq Computer Corporation Integrated Lights Out Processor (rev 03)
01:04.4 USB Controller: Hewlett-Packard Company Proliant iLO2 virtual USB controller
01:04.6 IPMI SMIC interface: Hewlett-Packard Company Proliant iLO2 virtual UART
02:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev b5)
03:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
03:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
15:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
Putting the hardware pieces together was quite straightforward. Only later did I realize that I had to put the SATA disk first and only then the SAS ones: I had to change the cabling when I found out that the write performance of the SATA disk was 11.3 MB/s through the SC44Ge PCI-Ex controller versus 31.6 MB/s through the on-board Intel ICH9 controller. So I left the three SAS disks connected to the SC44Ge and the single SATA disk connected to the main board.
To install the base Debian Lenny system I didn't have to do any special tricks; I just used the virtual CD-ROM and went through the installer.
neo:~# fdisk -l /dev/sda

Disk /dev/sda: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000c51f0

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          31      248976   fd  Linux raid autodetect
/dev/sda2              32        8924    71433022+  fd  Linux raid autodetect

neo:~# fdisk -l /dev/sdb

Disk /dev/sdb: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000da224

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          31      248976   fd  Linux raid autodetect
/dev/sdb2              32        8924    71433022+  fd  Linux raid autodetect

neo:~# fdisk -l /dev/sdc

Disk /dev/sdc: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000ee4f3

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1          31      248976   fd  Linux raid autodetect
/dev/sdc2              32        8924    71433022+  8e  Linux LVM

neo:~# fdisk -l /dev/sdd

Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00017cb0

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1       60801   488384001   8e  Linux LVM
The first partition of each of the three SAS disks (sda, sdb, sdc) is in a RAID1 for /boot with the GRUB loader. The rest of the first two disks (sda, sdb) is in a RAID1 used as a PV for LVM. The third SAS disk and the SATA disk (sdc, sdd) are stand-alone and are also PVs for LVM. So in total there are three volume groups: vg00, vg01 and vg02. vg00 and vg01 have a capacity of 68GB each and vg02 has 465GB. Each of them has a different characteristic: vg00 is protected against a single disk failure, vg01 is a stand-alone fast 15k SAS disk, and vg02 is a stand-alone SATA disk with a big capacity.
neo:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               vg00
  PV Size               68.12 GB / not usable 2.69 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              17439
  Free PE               2819
  Allocated PE          14620
  PV UUID               FiVmLS-7f3H-0S9x-7YjQ-bKnE-7M1t-E1aYIw

  --- Physical volume ---
  PV Name               /dev/sdc2
  VG Name               vg01
  PV Size               68.12 GB / not usable 2.81 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              17439
  Free PE               11071
  Allocated PE          6368
  PV UUID               Jxx0zz-8hBR-jfUW-ZhaQ-QXCY-J9sb-qJwqoJ

  --- Physical volume ---
  PV Name               /dev/sdd1
  VG Name               vg02
  PV Size               465.76 GB / not usable 1.50 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              119234
  Free PE               73858
  Allocated PE          45376
  PV UUID               wMdrbc-s7G0-rsHj-O5CS-6JH8-MJTf-lHCKcy
The hardware remote console is an independent piece of hardware inside the server chassis, sharing only the power supply. Through this console the server can be powered on or off. It also gives remote access to the "screen", the "keyboard" and a virtual USB CD-ROM of the machine, which makes it possible to reinstall the server from anywhere.
Why paravirtualization when the hardware was capable of full virtualization? Even though the guest systems have to run modified Xen domU kernels, paravirtualization brings the advantage that single partitions from the host system (dom0) can be used directly in the guest (domU) systems. So there is no need to partition and use LVM inside the domUs again. It's easy to shut down a domU, mount its partition on the host and do maintenance on it, for example a resize.
I verified that the system was able to use the full speed of the disks in the three volume groups independently by running the badblocks read/write check on one of them and then on all three at the same time, without noticing a significant difference. So this hardware with 4 cores has the potential to run 4 really independent machines at once; with my set-up it is 3, as two of the disks are joined in RAID1.
testing with hdparm -t:
/dev/sda:
 Timing buffered disk reads: 352 MB in 3.00 seconds = 117.26 MB/sec
neo:~# hdparm -t /dev/sdb
/dev/sdb:
 Timing buffered disk reads: 352 MB in 3.00 seconds = 117.33 MB/sec
neo:~# hdparm -t /dev/sdc
/dev/sdc:
 Timing buffered disk reads: 354 MB in 3.01 seconds = 117.44 MB/sec
neo:~# hdparm -t /dev/sdd
/dev/sdd:
 Timing buffered disk reads: 322 MB in 3.01 seconds = 106.94 MB/sec
---
/dev/mapper/vg00-mirror--sas:
 Timing buffered disk reads: 352 MB in 3.00 seconds = 117.25 MB/sec
/dev/mapper/vg01-single--sas:
 Timing buffered disk reads: 354 MB in 3.01 seconds = 117.59 MB/sec
/dev/mapper/vg02-single--sata:
 Timing buffered disk reads: 324 MB in 3.02 seconds = 107.42 MB/sec
SAS RAID1 disk sync:
### sas disks sync
md1 : active raid1 sda2[0] sdb2[2]
71432896 blocks [2/1] [U_]
[>....................] recovery = 2.9% (2091456/71432896) finish=10.4min speed=110076K/sec
----total-cpu-usage---- --dsk/sda-----dsk/sdb-----dsk/sdc-----dsk/sdd-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ: read writ: read writ: read writ| recv send| in out | int csw
0 0 100 0 0 0| 110M 0 : 0 110M: 0 0 : 102M 0 |2534B 3590B| 0 0 |8306 14k
0 0 100 0 0 0| 109M 0 : 0 109M: 0 0 : 100M 0 | 384B 3476B| 0 0 |8217 13k
bad blocks check - only one disk running
neo:~# time badblocks -s -w /dev/mapper/vg01-single--sas
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done

real 134m39.806s
user 3m0.267s
sys 0m16.969s
bad blocks check - all three VGs at once (output stripped, only the real time left):
neo:/mnt# time badblocks -s -w /dev/mapper/vg00-mirror--sas
real 135m0.663s
neo:~# time badblocks -s -w /dev/mapper/vg01-single--sas
real 135m4.589s
neo:~# time badblocks -s -w /dev/mapper/vg02-single--sata
real 255m34.293s
bad blocks check - all three VGs at once (output stripped, only the real time left), this time inside Xen virtual machines:
mirror:~# time badblocks -s -w /dev/sdb1
real 135m31.627s
sas:~# time badblocks -s -w /dev/sdb1
real 135m28.460s
sata:~# time badblocks -s -w /dev/sdb1
real 257m33.197s
note there is no difference in speed between a single disk and the same two disks in SW RAID1
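To put a number on "no significant difference", the real times reported above can be converted to seconds and compared. This is only a small sketch; the helper names `to_seconds` and `slowdown_percent` are made up for this example and are not part of any tool used in the tests.

```shell
#!/bin/sh
# Convert a `time`-style duration such as "135m4.589s" to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, part, "m")          # part[1] = minutes, part[2] = "4.589s"
        sub(/s$/, "", part[2])       # drop the trailing "s"
        printf "%.3f\n", part[1] * 60 + part[2]
    }'
}

# Print how much slower (in percent) the second run was than the first.
slowdown_percent() {
    base=$(to_seconds "$1")
    other=$(to_seconds "$2")
    awk -v b="$base" -v o="$other" 'BEGIN { printf "%.2f\n", (o - b) / b * 100 }'
}

# vg01 tested alone vs. vg01 while all three volume groups were tested at once
slowdown_percent 134m39.806s 135m4.589s
```

For the two vg01 runs this prints 0.31, so running all three checks in parallel cost about a third of a percent.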
read speeds SAS vs SATA with dd
sas:~# time dd if=/dev/sdb1 of=/dev/null bs=1M count=10000 skip=10000
10485760000 bytes (10 GB) copied, 88.9769 s, 118 MB/s
# read
sata:~# time dd if=/dev/sdb1 of=/dev/null bs=1M count=10000 skip=10000
10485760000 bytes (10 GB) copied, 93.1479 s, 113 MB/s
write speeds SAS vs SATA with dd
sas:~# time dd if=/dev/zero of=/dev/sdb1 bs=1M count=10000
10485760000 bytes (10 GB) copied, 87.8604 s, 119 MB/s
# through SC44Ge
sata:~# time dd if=/dev/zero of=/dev/sdb1 bs=1M count=10000
10485760000 bytes (10 GB) copied, 927.064 s, 11.3 MB/s
# through on-board controller later
sata:~# time dd if=/dev/zero of=/dev/sdb1 bs=1M count=10000
10485760000 bytes (10 GB) copied, 305.611 s, 34.3 MB/s
It turned out that the read speeds of the SAS disk and the SATA disk were about equal, while the SATA write speed was about ⅓ of the SAS one. Let's try to confirm this with the badblocks run times. During the run the same amount of data is written and read. So for the SAS disk it was 135min/2 = 67.5min for each operation. If the SATA writes were 3x slower, then the total time would be 67.5min + 3x 67.5min = 270min, which is actually close to the real run time of 257min. I'm pretty sure some mathematician will beat me for this proof, but ...
The two Ethernet ports are configured in active-backup bonding mode. Both ports can be plugged into one or two switches and only one of them is communicating at a time. When one switch goes down (power down, port shut-down), the other port takes over. The bond0 virtual interface that is created from the eth0 and eth1 interfaces is then put into a bridge interface, so that the virtual machines can get a public IP from the same IP pool as the physical interface. Here is the system configuration:
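The back-of-the-envelope estimate above can be written down as a few lines of shell. This is a sketch under the same assumption as the text, namely that a badblocks run reads and writes the same amount of data; `predict_minutes` is a name made up for this example.

```shell
#!/bin/sh
# Predict the badblocks run time of a disk that reads as fast as the
# reference disk but writes `factor` times slower.
# $1 = run time of the reference disk in minutes, $2 = write slow-down factor
predict_minutes() {
    awk -v total="$1" -v factor="$2" 'BEGIN {
        half = total / 2             # reference disk: reads and writes take equal time
        printf "%.1f\n", half + factor * half
    }'
}

# Write slow-down measured with dd: SAS 119 MB/s vs. SATA 34.3 MB/s
awk 'BEGIN { printf "%.2f\n", 119 / 34.3 }'

# Estimate from the text: 135 minutes on SAS, writes 3x slower on SATA
predict_minutes 135 3
```

The last call prints 270.0, the figure that is compared against the measured 257 minutes.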
# apt-get install bridge-utils ifenslave-2.6
# echo bonding >> /etc/modules
# echo "alias bond0 bonding" >> /etc/modprobe.d/aliases
# echo "options bonding mode=active-backup miimon=100 max_bonds=1" >> /etc/modprobe.d/aliases
# vim /etc/network/interfaces
auto br0
iface br0 inet static
address 62.40.64.245
netmask 255.255.255.240
broadcast 62.40.64.247
gateway 62.40.64.241
bridge_ports bond0
bridge_fd 0
bridge_stp off
pre-up ifconfig bond0 up
pre-up ifconfig eth0 up
pre-up ifconfig eth1 up
pre-up ifenslave bond0 eth0 eth1
I've crimped a 1Gb cross-over Ethernet cable and used netperf and ifstat to measure the throughput between my laptop and the server.
ifstat (one way and then the other way):
eth0
Kbps in Kbps out
961336.7 13976.19
961350.0 14070.64
961197.4 14146.69
961434.5 14138.59
eth0
Kbps in Kbps out
20822.81 958567.5
20731.41 953624.3
20842.09 959024.5
20700.29 952988.1
netperf (both sides had netserver and the netperf client running at the same time):
sas:~# netperf -l 60 -H 62.40.64.241
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 62.40.64.241 (62.40.64.241) port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.02     936.60
ant:~# netperf -l 60 -H 62.40.64.242
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 62.40.64.242 (62.40.64.242) port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.01     107.55
4x:
neo:~# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU X3210 @ 2.13GHz
stepping        : 11
cpu MHz         : 2128.046
cache size      : 4096 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae cx8 apic mtrr cmov pat clflush acpi mmx fxsr sse sse2 ss ht constant_tsc pni ssse3
bogomips        : 4262.53
clflush size    : 64
power management:
The server has been running for over 2 years now, hosting a couple of virtual machines. Some are stand-alone with their own public IP, some are private ones behind an nginx reverse proxy. So far so good! :-)
PS: it took me 3.5h to write this blog entry...
Here are some variants that I've seen so far.
Careful people who are afraid to spoil the system use:
Crazy people always have an original place to hide files in:
Linux distributions put files according to the Filesystem Hierarchy Standard.
Some years ago I belonged to the first group and put everything into /usr/local to be sure that only my stuff was there. Today I belong to the second one... I use /data on my laptop. That is not such a great idea, but it is my laptop and no one else will ever have to (or even be able to) touch it, so it is my mess. Some of the other ones, like /corp, /shared and /usr/local64, I have to use at $work, because it is an ops decision or there is a historical reason (sniff, sniff).
I hope one day I'll evolve further and work according to the standards. Tidying up my mess would be the easy part of the journey...
Once upon a time I was a young system administrator. Having all the strange-looking /usr, /var, /etc, ... all around me was scary and I was not sure what to "do" with all those folder trees. At some point I started to compile the extra programs that I needed. With the default prefix everything ended up in /usr/local, which looked safe to me. I knew that my stuff was there and the mysterious system stuff was everywhere else.
Well it worked. Having to maintain some more servers later I started to do some packaging. I was using Slackware at that time and the Slackware packaging system was really simple - just a tarball that got extracted to the root of the file system with some scripts that got executed during the package installation time. Simple and worked for me. Still I kept the stuff in /usr/local to be on the safe side.
And time passed :-) and I'm using Debian now, but the most important change is that I've lost all my awe of the file system and I've learned where and why to put files. Why use /etc for configs, /var/log for log files, /var/lib for state files, /var/cache for cache files, /usr/share for templates or static data files, etc.? Simply because it is the standard. Because it is the standard, standard-compliant distributions will stick to it. Besides being standard, there are some really good reasons too. Helper tools understand the files and act based on the category of the files: automatically rotating logs, backing up the important (non-static, non-distribution) files, cleaning up the temporary files, etc. And there are also humans out there: co-admins or newcomers who will log in to the machine and look for files or troubleshoot the programs. Knowing where to look for stuff really helps!
With today's advances in virtualization techniques there is no reason to mix too many things (projects) on one server. So there is no reason to play safe with the paths, and files should be put in the right place where they belong according to the FHS.
(to be continued with the Perl part of the story...)
A reply to two comments on "when virtualization is too expensive".
The first was from grantm. Thanks for pointing out OpenVZ. It's a much more professional tool than debootstrapping and chrooting, but it comes with more configuration and set-up complexity. With OpenVZ it's possible to share memory, which is the resource that we never have enough of.
The second was from Alias. Alias points out the deduplication of disk data. Disk space is kind of cheap these days compared to memory, bandwidth and CPU processing costs when dealing with huge data. Besides saving disk space, deduplication can offer better cache efficiency, which increases disk read speed, and that is never fast enough. ZFS should soon (first quarter of 2010) support deduplication. It seems that there will be no native ZFS support for Linux because of licensing problems :-( . An easy Linux deduplication that can come in handy is cowdancer. `cp -al some-folder/ some-folder.copy && cd some-folder.copy && cow-shell` will copy "some-folder" by creating hardlinks, which is really fast. Then any write operation under this shell and folder will result in removing the hardlink and creating a new file => copy-on-write. This works well with chroot-ed systems.
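What `cp -al` does, and what breaking a hardlink on write means, can be seen without cowdancer at all. The sketch below assumes GNU coreutils (`cp -l`, `stat -c`) and uses throwaway temporary directories. The copy-on-write step is done by hand here; `cow-shell` itself is not invoked, and the function names are made up for this example.

```shell
#!/bin/sh
# Succeeds when both paths point at the same inode, i.e. are hard links.
same_inode() {
    [ "$(stat -c %i "$1")" = "$(stat -c %i "$2")" ]
}

hardlink_demo() {
    work=$(mktemp -d) || return 1
    mkdir "$work/some-folder"
    echo "original" > "$work/some-folder/file.txt"

    # Fast "copy": only directory entries are created, no data is duplicated.
    cp -al "$work/some-folder" "$work/some-folder.copy"
    same_inode "$work/some-folder/file.txt" "$work/some-folder.copy/file.txt" \
        && echo "after cp -al: shared"

    # Copy-on-write done by hand: break the link, then write a new file.
    rm "$work/some-folder.copy/file.txt"
    echo "changed" > "$work/some-folder.copy/file.txt"
    same_inode "$work/some-folder/file.txt" "$work/some-folder.copy/file.txt" \
        || echo "after write: separate"

    cat "$work/some-folder/file.txt"   # the original is untouched
    rm -rf "$work"
}

hardlink_demo
```

Writing to the linked file in place, without removing the link first, would change both trees at once. That is the step a copy-on-write shell has to intercept.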
Yes, the thing that turns your single computer into multiple computers. Why should it be expensive? Simply because the virtual machines do not share one disk space and one memory. To make a virtual system run smoothly, it needs a decent amount of dedicated memory and a disk volume with some reserve, so that it doesn't have to be resized too often.
First, let's find the reasons for doing virtualization: why would anyone want to run multiple machines on a single piece of hardware? Most likely to clearly separate the programs and the whole operating systems, to give each system strictly defined virtual hardware resources, and to limit access for security reasons. And also to add one level of abstraction, which then allows systems to live in a cloud. But that is a different topic.
Let's look for ways to avoid virtualization and still fulfil some (!) of its requirements, mainly the clear separation of files and of the whole system for Perl development.
If just Perl is in play, then compiling and installing the user's own Perl is an option: simply having the Perl binary and all the installed CPAN modules in the $HOME directory of the user.
If Apache or some extra binary libraries are needed, it is still possible to compile and install them into the user's home, but it is quite a lot of "hand work" and not everyone has the time and passion to do it. A much simpler way is to use chroot. What chroot does is set the root of the filesystem for the child processes to a folder. And as we are in UNIX, where (nearly) everything is a file, this means a different machine. Both systems, the parent and the chroot-ed one, still share the same /proc, /dev, network devices etc., but the separation is enough to be able to install programs with the standard distribution commands and run them. That is good enough for having a chroot-ed machine as a development machine, with the benefits of shared memory and disk space, easy file sharing (one filesystem) and not having to maintain virtualization software.
Here is how to create a chrooted system on Debian and switch to it:
debootstrap lenny /usr/chroot/$MACHINENAME
echo ${MACHINENAME}_chroot > /usr/chroot/$MACHINENAME/etc/debian_chroot
chroot /usr/chroot/$MACHINENAME su -
Thanks to Yuval Kogman for pointing out this great talk - Systems that Never Stop (and Erlang).
Yesterday I was watching "Google Chrome OS Open Source Project Announcement":
The "Chrome OS" got demystified, well at least for me. It's nothing more, but also nothing less, than a project to throw away some conventions of current systems. Or, as some people say, a return to the thin client era. Basically "Chrome OS" is (will be?) a damn fast init loader: an init that loads just the things needed to run a browser. Then it is up to the browser to fulfil the consumer's needs.
The challenging part will be to add all the HTML5 and other features needed to give the browser all the power of classical desktop applications, like access to HW acceleration, offline storage, popups, etc. Another challenging part will be to convince people to write real web apps using these features.
Everyone knows make and Makefiles? At least in the Perl world everyone uses them when installing ExtUtils::MakeMaker CPAN distributions. make is present on most systems, so why not find some more uses for it? (read: save time, ease the work)
make was originally created to help with compiling and linking C source code. So? So the point is that we don't have to compile just source code; we can use Makefiles to process any kind of dependency-based chain of files. Inside the Makefile there is always a target file that should be generated, together with the dependency files needed for the generation and a set of commands that perform the task. In addition, targets can be made PHONY, which means that the target commands will always be executed. This is most often used for the "clean" target - `make clean`, which should remove all temporary build/generated files and tidy up the folder.
The PHONY functionality can be used beyond housekeeping to define a set of commands (or a library of commands) that make sense for the current folder. For a development project this can be `make upload`, `make deploy` or `make ajoke` or whatever comes in handy.
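Here is a minimal sketch of one file target and one PHONY target, wrapped in a shell function so it can be tried in a scratch directory. It assumes make is installed; the file names input.txt and upper.txt and the function name `make_demo` are hypothetical, chosen only for this example. Recipe lines in a Makefile must start with a TAB, so the file is generated with printf to make the tabs explicit.

```shell
#!/bin/sh
make_demo() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        echo "hello" > input.txt

        # target: dependency, followed by TAB-indented commands
        {
            printf 'upper.txt: input.txt\n'
            printf '\ttr a-z A-Z < input.txt > upper.txt\n'
            printf '\n.PHONY: clean\n'
            printf 'clean:\n'
            printf '\trm -f upper.txt\n'
        } > Makefile

        make -s upper.txt                   # generates upper.txt from input.txt
        cat upper.txt
        make -s clean                       # PHONY target: its commands always run
        [ -e upper.txt ] || echo "cleaned"
    )
    rm -rf "$dir"
}

make_demo
```

Running `make upper.txt` a second time without touching input.txt would do nothing, because the target is newer than its dependency. The `clean` commands run every time because the target is declared PHONY.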
The Makefile of ba.pm.org has a couple of targets: transforming .po files to .js (using po2json), .xml to .tt2, .rdf or .js (using XSLT), minifying js and css (using yuicompressor), etc. There are also a couple of PHONY targets like 'all' (to build the page), 'test' (to test the xml and site linking), and 'clean' + 'distclean' to tidy up.
This MT4 installation seems to have some encoding problems. It's destroying the non-ASCII characters. I'm not suspecting the database and also not the web server, but I don't know... Let's see first some screen shots.
Writing new Entry
after clicking "Preview"(note the "You are previewing the entry titled" on top compared to the title in the preview)
after just clicking "Re-Edit this Entry" and again "Preview"
So it seems that the characters arrive properly in the code and are properly written to the preview file, but when they are used again in the MT4 administration interface, they get messed up.
A little more investigation and it looks like FastCGI is doing the double encoding. When using just the CGI versions everything works fine and I can write (šžčťľúüô) whatever I like...