Monday, December 8, 2008

Copying VirtualBox snapshots

I've really become comfortable nesting all of my Windows installs inside of VirtualBox lately (my main systems run both RedHat and Ubuntu Linux). Just being able to shuttle that image around to wherever I happen to be working is one big help. And the value of working with VMs was reinforced this week when I learned that my recently installed XP Service Pack 3 introduced an incompatibility with the version of Microsoft SQL Server 2005 Express I needed to install to extract some client data. I rolled back to the snapshot I took just before installing that service pack (I am that paranoid), and then SQL Server installed fine. While I've been known to do a backup before such an operation even on a real hard disk, that's painful; VM snapshots are so trivial I can take them far more often.

This left me with a dilemma though: my XP install now has this bloated SQL Server mess installed on it that I can't delete until that project is done, but I need to do some real work that I want to keep beyond when I blow that image away. The snapshot tool in the VirtualBox GUI is pretty coarse: the only thing it will do is destructively revert a snapshot. What I want to do is fork the previous snapshot into another machine.

I found some clues for image import/export; it seemed easy. As usual, it wasn't at all.

Background: each disk image in VirtualBox gets a unique UUID. This is why you can't just copy the underlying disk image files somewhere else--the UUID will still be the same when you import the copy, and it won't work. The "clonevdi" export tool takes care of that for you, but you'll need all the relevant UUIDs for it to do anything useful. I'll show how to get that information below, but first I want to cover the unpleasant surprise I got the first time I tried the clone utility:


$ VBoxManage clonevdi b534a21d-a24a-44b2-35ae-66502938a0b9 image.vdi
VirtualBox Command Line Management Interface Version 1.6.4
(C) 2005-2008 Sun Microsystems, Inc.
All rights reserved.

[!] FAILED calling hardDisk->CloneToImage(Bstr(argv[1]), vdiOut.asOutParam(), progress.asOutParam()) at line 3314!
[!] Primary RC = NS_ERROR_FAILURE (0x80004005) - Operation failed
[!] Full error info present: true , basic error info present: true
[!] Result Code = NS_ERROR_FAILURE (0x80004005) - Operation failed
[!] Text = Cloning differencing VDI images is not yet supported ('/d2/virtualbox/Machines/XP Pro/Snapshots/{b534a21d-a24a-44b2-35ae-66502938a0b9}.vdi')
[!] Component = HardDisk, Interface: IHardDisk, {fd443ec1-000f-4f5b-9282-d72760a66916}
[!] Callee = IHardDisk, {fd443ec1-000f-4f5b-9282-d72760a66916}


How fun is that? Differencing image, no copies for you!

After some poking around with the forum uber-thread covering copies, it sounded to me like anybody who makes any sort of snapshot is just screwed here. If it's not a plain old disk image, too bad.

As it would take far too long to recreate what's in this VM, I was plenty motivated to find a workaround. Here's how I eventually managed to extract those images:


  • With Virtual Box not running, backup the entire .VirtualBox directory.
  • Start the main image tool and navigate to the snapshot list for the relevant image.
  • If the snapshot you want is in the history rather than being the current state, start at the bottom and blow away any change sets below it; right-click on them and choose "Revert to current snapshot" if you just want to get rid of the changes made since then, or "Discard current snapshot and state" if the one you want is actually below either of those.
  • Once you've gotten to where the image you want is current, select each of the snapshots above it and right-click for "Discard snapshot" to merge their differences into the image below. You'll see the encouraging "Preserving changes to normal hard disk" message here.
  • Only the one image you want left? Its description should read something like this: "IDE Primary Master: [Normal, 10.00GB]". Now you're set to use the command-line tools! Exit the GUI, create the image as shown below, save that file somewhere else (it's created in the same VDI directory all the other images live in), then you can restore your original config to get everything back.


Here's how the command-line tools worked once I'd done the above to slim things down to just the one image I wanted. First I take a look at all the available UUIDs:


$ VBoxManage list vms | grep UUID
UUID: aeb41dd3-d9f5-44fb-b1a2-a32e84f79e64
Primary master: /d2/virtualbox/VDI/XP Pro.vdi (UUID: 328bcae8-ddf8-4121-139c-f7d0566526f4)


That first UUID is for the whole VM (including the associated config files), which we can ignore. What I want to do is copy the current running image, the one labeled "Primary master", to a new image file. First I confirm I can access the right one via the command-line tools:


$ VBoxManage showvdiinfo 328bcae8-ddf8-4121-139c-f7d0566526f4


That shows what I expected, so now I can carefully edit that working line via the old up arrow, changing it to the clone command instead. This time it works:


$ VBoxManage clonevdi 328bcae8-ddf8-4121-139c-f7d0566526f4 imagename.vdi


Now, how to actually use one of these images? You need a configuration XML file in the "Machines" directory that matches the one associated with this image copy. Make sure to save that matching Machines/[Machine]/[Machine].xml file before you do anything drastic (like restore the original configuration with all the snapshots); we'll need it later.

Once I made all the VDI images and had their matching config files, I put back the original .VirtualBox directory, then copied the new images into its VDI directory. Setting up a new VM to use those copies went like this:


  • Create a new virtual machine of the correct type. When you get to "Virtual Hard Disk", select "Existing". Click on "Add".
  • You'll see everything listed in your VDI directory. Select the image you cloned and finish making this entry. Exit VirtualBox.
  • Now what you want to do is copy the long Machine... line from the new Machines/[Machine]/[Machine].xml file you just created to somewhere else (a text editor perhaps), along with the HardDiskAttachment... one that refers to your relocated snapshot. There are more details about this part at Cloning a complete VM.
  • Overwrite the new machine XML file with the original one associated with your VM, replacing just those two lines with the ones you saved. Basically, you need the new machine UUID and hard disk UUID; everything else should stay the same as your original configuration, so this machine clones the original as closely as possible. (One way to pull those two lines out is sketched just after this list.)
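
Here's a rough sketch of grabbing those two lines with grep; the machine name and .VirtualBox location are just examples, and the element names are only the ones mentioned above:

grep "<Machine "            ~/.VirtualBox/Machines/"XP Clone"/"XP Clone.xml"
grep "<HardDiskAttachment " ~/.VirtualBox/Machines/"XP Clone"/"XP Clone.xml"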


After going through all that, I had everything: the original VM, with all its snapshots, was still there. I had a copy of the install with SQL Server I could tinker with separately, while continuing my regular work in the original VM. Quite an unexpectedly long diversion, but now that I understand what does and doesn't work here, I'll make sure to structure my images and snapshots accordingly.

Since I now realize snapshots are a lot less useful than I originally thought because of these limitations, it strikes me I might even switch to using the more portable VMDK image files instead of the native VDI format. Snapshot compatibility was the only reason I didn't do that in the first place (you can't use snapshots with VMDK). I think I can do that by creating a new virtual disk in VMDK format, attaching it as the secondary master, using a boot CD image to dd the VDI one to the VMDK one, then detaching those disk images and making a new machine based on the VMDK-formatted one. Easy as can be, right? I got the idea from How To Resize a VirtualBox Virtual Disk. But not right now; I've had enough of a VM disk manipulation workout already today.

Sunday, November 30, 2008

Automating initial postgresql.conf tuning

About two years ago I returned to full-time work on PostgreSQL performance tuning for a contract job, after having done some earlier work in that area for a personal project. One of the things that struck me as odd was that no progress had been made in the intervening years toward providing an automatic tool to help with that goal. Since many of the existing guides only covered older versions of PostgreSQL (the 8.1 release changed what you can usefully adjust quite a bit), even the few template suggestions for various machine sizes floating around the web were largely obsolete--not to mention that modern hardware is a lot more capable. When I tried writing an automatic tool myself, I started to realize just how much harder a problem it was than it seemed.

One stumbling block is that you need a fair amount of information about the parameters in the postgresql.conf file, and some of that information wasn't available any way except by reading the associated source code. There's enough information embedded only there that it was impractical to expect a potential tool author to extract and maintain their own copy of the settings database. For example, I had to submit a patch myself earlier this year to make it possible to easily discover what value a parameter you don't set will get. Now you can look at the pg_settings view, where that shows up as "boot_val".
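
As a quick illustration of that view (the parameter names here are just examples, and boot_val only exists on versions with that patch):

psql -c "SELECT name, boot_val, unit FROM pg_settings WHERE name IN ('shared_buffers', 'work_mem');"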

A second problem is that coming up with a robust model for what to set everything to was harder than it seemed at first. Earlier this year I worked with some other community members on codifying some of that into tuning guidelines, and there was a somewhat arcane tuning spreadsheet from Josh Berkus floating around. After I showed I was close to an initial solution to this long-outstanding TODO item, Josh came to my aid with a simplified model that was easy to implement as code.

More than one attempt to write this kind of tool has fizzled after some initial work. The other classic trap people have fallen into here is worrying about the user interface before getting all the boring parts done. It takes a fair amount of code to read a postgresql.conf file usefully, learn the rules for what you can do with it (again, some of which are only documented in the database source code), and publish an updated file. As an example, while it's possible to just maintain a list of which parameters are specified using "memory units" and what unit they default to (some are multiples of a single kB, others in 8kB blocks), what you really should do is look at the "unit" column that comes out of pg_settings to figure that out. And to do that, you need to save a copy of pg_settings into a file (since at the point this tool is being run, the server isn't up yet) and load it into memory so the related code can rely on it. Then it's trivial for the tool to distinguish between a value like "max_connections" (a simple integer that accepts no additional units) and "work_mem", which is specified in 1kB units but accepts 'MB' and 'GB' suffixes. Oh, and don't forget the upper limit for that particular setting differs depending on whether you're on a 32 or 64 bit platform. What fun.
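
One way to capture that settings information to a file while the server is still up, so a tool can read it later without a running database, might look like this (the file name and column list are examples):

psql -At -F $'\t' -c "SELECT name, setting, unit, vartype, min_val, max_val FROM pg_settings" > pg_settings.tsv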

One of the guidelines I try to live by is "don't complain about anything you're not willing to help fix yourself", and I've done my share of complaining on this topic. This past holiday weekend, in between rounds of turkey overdose, I reached a new milestone: my post to the pgsql-hackers mailing list yesterday includes a Python implementation that solves all the traditional problems besetting tuning tool authors, in the context of the upcoming PostgreSQL 8.4. You feed it an existing postgresql.conf file and tell it what kind of database application you expect to build (data warehouse, web app, etc.). It tries to figure out how much RAM you have, then produces an updated postgresql.conf file for you with updates to 10 critical parameters.

Since the default configuration file for PostgreSQL is aimed at working with the conservative parameters many operating systems ship with, those defaults are only appropriate for a system with approximately 128MB of total RAM. If you have more than that and haven't considered performance tuning for your PostgreSQL database, this tool, the model expressed in its source code, or a future version of it (once 8.4 is done I plan a version backward compatible with versions 8.1-8.3) may eventually be useful to you.

And if you're one of the frustrated potential tool authors who was stopped by one of the problems I mentioned here, consider this code release an initial resource you can use. Even if it's not your preferred language, Python is pretty easy to read. I think it's easier to start by using this program as a spec to reference and emulate than it is to dig through the same source code and resources I did in order to write it.

Tuesday, August 26, 2008

Linux disk failures: Areca is not so SMART

One of the most frequently asked questions on the PostgreSQL Performance list is "what disk controller works best with PostgreSQL?" It's a tough question; every product has different things it's good and bad at, and it can be hard to sort through all that to figure out what makes sense for your application. You need to focus equally on performance and reliability, as disk drives are very prone to failure.

I have a mini-server at home I use mainly for performance testing. The main thing that makes it a mini-server rather than an oversized desktop is that I've got an Areca ARC-1210 as the disk controller, which when I bought it cost more than the whole rest of the computer I installed it in. Areca is a popular recommendation because for a while there, their products were considerably faster per dollar than the competition in the SATA RAID space (mainly 3Ware and LSI Logic), and reports of how the controllers handled disk failure suggested reasonable behavior there.

I knew the performance was good, but with the relatively small number of drives I run at home I hadn't yet run into a SATA disk failure, so I couldn't comment on that part of the controller's firmware. Luckily Murphy stalks my life, and during a weekend when I really didn't have time to fool around with hardware I got my chance. Snipped from /var/log/messages:

sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 0:0:1:0: [sdb] Sense Key : Medium Error [current]
Info fld=0x0
kernel: sd 0:0:1:0: [sdb] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdb, sector 33470855
sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sd 0:0:1:0: [sdb] Sense Key : Medium Error [current]
kernel: Info fld=0x0
kernel: sd 0:0:1:0: [sdb] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdb, sector 33470855

Here the end_request lines show the sector on the drive that went bad--a couple of errors are shown here, and there were plenty more. This disk was part of a 3-disk software RAID0 set, which meant the file these sectors were in the middle of was trashed. Obviously I wasn't optimizing for reliability using RAID0, so the file wasn't important. What was important is that I finally had a chance to see exactly how the Areca controller handled this situation.

Normally what you do when a drive looks like it's going bad is to run smartctl. Actually, you should be running smartctl proactively all the time to try and catch these errors as soon as possible, and getting e-mailed the results. Again, this is a development and test system so I hadn't set all that up. And the results when I tried to run a scan on this disk were disappointing:


# smartctl --all /dev/sdb
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: WDC WD1600AAJS-00WAA Version: R001
Serial number: WD-WCAS20994885
Device type: disk
Transport protocol: Fibre channel (FCP-2)
Local Time is: Sun Aug 24 02:23:58 2008 EDT
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 25 C
Manufactured in week 30 of year 2002
Current start stop count: 256 times
Recommended maximum start stop count: 4278190080 times
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count: 0
Device does not support Self Test logging


The data returned here is clearly bogus--every time I looked at it after encountering errors, the counters were still all 0. Generally the best way to figure out what's going on is to run a self-test:

# smartctl -t long /dev/sdb
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Long (extended) offline self test failed [unsupported scsi opcode]

Oops. Much like the other counters that aren't supported, the Areca controller doesn't pass through SMART test requests. I vaguely recalled that from when I was doing the shopping for this card. Areca supplies a couple of utilities on their FTP site to handle chores like this. The first one, the HTTP server, lets you access all the card configuration you might normally do as part of the BIOS setup from a web browser. You run that utility and then access it via ports 81 and 82 on localhost. That worked OK but didn't tell me too much.

The other thing they provide, the part I was intending to rely on here, is the command-line interface program. They provide versions for both 32-bit and 64-bit Linux. The problem with this utility is that, unlike the disk driver, the CLI is not open source.

The Areca driver made its way into the mainstream Linux kernel as of 2.6.19, and it has been improved significantly by contributors outside of Areca since. The version that ships with kernel 2.6.22 is the one I recommend; that rev is solid. Areca's CLI, on the other hand, is provided only in binary form, tested against a couple of supported distributions, and that's all you get.

My server system was recently upgraded to run Ubuntu 8.04.1 in addition to the CentOS 5.2 it's had on it for a while now. The CLI tool didn't work at all on Ubuntu; it didn't find the device, and I have no idea why. If you're really committed to having a true open-source stack for your controller card, perhaps because you value being able to understand and possibly resolve issues yourself when they happen, cross Areca off your list. Their essential CLI utility can easily break on you and leave you stuck.

Luckily I still had the CentOS partition set up, and since that's a derivative of the supported RHEL releases their CLI is aimed at, it worked fine there. Here's a little session I had with it, looking for what went wrong here:
Copyright (c) 2004 Areca, Inc. All Rights Reserved.
Areca CLI, Version: 1.80A, Arclib: 260, Date: Oct 12 2007( Linux )

Controllers List
----------------------------------------
Controller#01(PCI): ARC-1210
Current Controller: Controller#01
----------------------------------------

CMD Description
==========================================================
main Show Command Categories.
set General Settings.
rsf RaidSet Functions.
vsf VolumeSet Functions.
disk Physical Drive Functions.
sys System Functions.
net Ethernet Functions.
event Event Functions.
hw Hardware Monitor Functions.
mail Mail Notification Functions.
snmp SNMP Functions.
exit Exit CLI.
==========================================================
Command Format: cmd [Sub-Command] [Parameters].
Note: Use cmd -h or -help to get details.

CLI> event info
Date-Time Device Event Type
===============================================================================
2008-08-24 06:14:51 H/W MONITOR Raid Powered On
2008-08-24 06:01:35 H/W MONITOR Raid Powered On
2008-08-24 05:48:13 Proxy Or Inband HTTP Log In
2008-08-24 05:44:24 IDE Channel #02 Reading Error
2008-08-24 05:44:16 IDE Channel #02 Reading Error
2008-08-24 05:44:08 IDE Channel #02 Reading Error
2008-08-24 05:43:54 IDE Channel #02 Reading Error
2008-08-24 05:43:46 IDE Channel #02 Reading Error
2008-08-24 05:37:12 Proxy Or Inband HTTP Log In
2008-08-24 05:20:18 H/W MONITOR Raid Powered On

That's pretty clear: adjusting for time zone, the disk attached to channel #02 (which is in fact /dev/sdb) had a read error at the same time the I/O errors showed up in the kernel logs. So the controller did catch that. Clearly any production system with an Areca controller should be monitoring the output from "cli event info" for these messages.
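
A crude way to automate that from cron might look like the following; it assumes the CLI binary is installed as /usr/local/bin/cli64 and accepts subcommands non-interactively, and the mail address is obviously an example:

/usr/local/bin/cli64 event info | grep -q "Reading Error" && \
    echo "Areca read errors reported; run 'cli64 event info' for details" | \
    mail -s "Areca disk warning on $(hostname)" admin@example.com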

If I were running this in a hardware RAID environment, I suspect the Areca controller would have kicked the disk out after that. Sometimes, though, the SMART errors are more subtle. I like to schedule an extended disk test for any production drive at least once a week, trying to find issues before they actually wipe data out. Here are the options for that, and what the SMART info for a good disk in the system looks like:
CLI> disk -h
smart Parameter:
Fn: Display Disk S.M.A.R.T Data.
Ex: Display Disk#1 S.M.A.R.T Data.
Command: disk smart drv=1 [Enter]

dctest Parameter: .
Fn: Perform A S.M.A.R.T Off-Line Data Collection.
Ex: Perform A Test On Disk#1~4.
Command: disk dctest drv=1~4 [Enter]

sttest Parameter: [mode=].
Fn: Perform A S.M.A.R.T Off-Line Short or Extended Self-Test.
Ex: Perform A Short Self-Test On Disk#1~4.
Command: disk sttest drv=1~4 mode=short [Enter]
Note: Default Mode=Short.

CLI> disk smart drv=1
S.M.A.R.T Information For Drive[#01]
# Attribute Items Flag Value Thres State
===============================================================================
1 Raw Read Error Rate 0x0f 200 51 OK
3 Spin Up Time 0x03 152 21 OK
4 Start/Stop Count 0x32 100 0 OK
5 Reallocated Sector Count 0x33 200 140 OK
7 Seek Error Rate 0x0e 100 51 OK
9 Power-on Hours Count 0x32 99 0 OK
10 Spin Retry Count 0x12 100 51 OK
11 Calibration Retry Count 0x12 100 51 OK
12 Device Power Cycle Count 0x32 100 0 OK
192 Power-off Retract Count 0x32 200 0 OK
193 Load Cycle Count 0x32 200 0 OK
194 Temperature 0x22 102 0 OK
196 Reallocation Event Count 0x32 200 0 OK
197 Current Pending Sector Count 0x12 200 0 OK
198 Off-line Scan Uncorrectable Sector Count 0x10 200 0 OK
199 Ultra DMA CRC Error Count 0x3e 200 0 OK
===============================================================================
Here's what the bad one reported:
CLI> disk smart drv=2
S.M.A.R.T Information For Drive[#02]
# Attribute Items Flag Value Thres State
===============================================================================
1 Raw Read Error Rate 0x0f 200 51 OK
3 Spin Up Time 0x03 153 21 OK
4 Start/Stop Count 0x32 100 0 OK
5 Reallocated Sector Count 0x33 200 140 OK
7 Seek Error Rate 0x0e 100 51 OK
9 Power-on Hours Count 0x32 99 0 OK
10 Spin Retry Count 0x12 100 51 OK
11 Calibration Retry Count 0x12 100 51 OK
12 Device Power Cycle Count 0x32 100 0 OK
192 Power-off Retract Count 0x32 200 0 OK
193 Load Cycle Count 0x32 200 0 OK
194 Temperature 0x22 104 0 OK
196 Reallocation Event Count 0x32 200 0 OK
197 Current Pending Sector Count 0x12 193 0 OK
198 Off-line Scan Uncorrectable Sector Count 0x10 193 0 OK
199 Ultra DMA CRC Error Count 0x3e 200 0 OK
===============================================================================

Ah ha--so the old "Uncorrectable Sector Count" quality factor is going down. That's usually a sign of early failure. But how to trigger a more comprehensive test?

CLI> disk sttest drv=2 mode=extended
ErrMsg: Drive 2 Is Not Available

CLI> disk dctest drv=2
ErrMsg: Drive 2 Is Not Available

Oops. A bit of research reinforced my fear here: if you've got a device that's part of a RAID group or, like in this case, assigned as a JBOD disk, the Areca controller won't let you run a SMART test on it. This is really unreasonable behavior for a disk controller aimed at enterprise use. I only ran into these bad sectors because I wrote to them and then tried to read the result back. An extended test might have caught the read problem before I did that, sparing me the data loss. That's certainly how it's worked for the last two disk failures I ran into at home.

The next question, then, is how to run a full scan over this disk if I can't do it via SMART. First I tried rebuilding the RAID0 /dev/md0 filesystem this disk is part of using the bad block checker you can invoke as part of mkfs:


mkfs -j -c -c /dev/md0

This returned clean, with no I/O errors at the kernel level. It took many hours to complete, though. What about the Areca logs?


Date-Time Device Event Type
===============================================================================
2008-08-24 12:26:37 IDE Channel #02 Reading Error
2008-08-24 12:26:21 IDE Channel #02 Reading Error
2008-08-24 12:26:13 IDE Channel #02 Reading Error
2008-08-24 12:26:05 IDE Channel #02 Reading Error
2008-08-24 12:25:57 IDE Channel #02 Reading Error
2008-08-24 12:25:43 IDE Channel #02 Reading Error
2008-08-24 12:25:35 IDE Channel #02 Reading Error
2008-08-24 12:25:27 IDE Channel #02 Reading Error
2008-08-24 12:25:06 IDE Channel #02 Reading Error
2008-08-24 12:24:59 IDE Channel #02 Reading Error
...

That goes on for a while. Seeing a bunch of errors in the controller that didn't show up in the kernel is kind of disconcerting. I'm left distrusting the way mkfs does this test; perhaps the writes it does to disk are being cached by the controller, so the reads return that cached data? I'm not sure what's going on here, but I sure don't like it.

Time to test this out at the filesystem level by forcing a read of everything instead, which is easier to do anyway because it doesn't write:
e2fsck -c /dev/md0

Ah, here come the kernel-level errors in /var/log/messages. The end_request lines are the important ones; here are the first of those (so many showed up that they didn't even all appear in the logs):
end_request: I/O error, dev sdb, sector 34426095
end_request: I/O error, dev sdb, sector 34426127
end_request: I/O error, dev sdb, sector 34426151
end_request: I/O error, dev sdb, sector 34428623
end_request: I/O error, dev sdb, sector 34428623
end_request: I/O error, dev sdb, sector 34444831
end_request: I/O error, dev sdb, sector 34453535
end_request: I/O error, dev sdb, sector 34453631

It ends by saying that it made modifications to the filesystem, presumably to mark the blocks that were bad. But how do you find out where those are? You can get that info from the superblock-level output of dumpe2fs:
# dumpe2fs -f /dev/md0
Bad blocks: 12909760, 12909778, 12909779, 12909785, 12909790, 12909791, 12909796, 12909797, 12909802, 12909803, 12909809, 12909814, 12909815, 12909820, 12909821, 12910712, 12910714, 12916792, 12920048, 12920092
What have we learned:
  • The Areca controller has serious problems reporting SMART errors when you're using it in JBOD mode. It seems it's only really designed for handling all that internally.
  • The CLI they provide is a closed-source mess that you must test for compatibility.
  • You must monitor the Areca event log using their CLI if you're using their controller in production.
  • The controller doesn't support any user-requested SMART scans.
  • On a caching controller, using "e2fsck -c" may give more reliable results than the full "mkfs -c".
Since the disk here is obviously trashed, and I can't do any of the SMART operations to try and make it better when attached to the Areca controller, at this point I moved it to another system and replaced it with the spare I had for this array. I didn't lose any data here, but I did find myself a lot less comfortable recommending Areca controllers.

I just pointed out last month that current 3ware controllers have caught up with Areca's performance. Of course, as far as I can tell, 3ware's tw_cli utility is just as closed source as Areca's. But since you can run SMART tests directly with the Linux smartctl on 3ware controllers, all you really need the CLI for is looking inside the hardware RAID to see what the disk status is there.
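
For comparison, here's roughly what that looks like through smartctl's 3ware support; the device name and port number are examples, and whether your card shows up as /dev/twe or /dev/twa depends on the model:

smartctl -a -d 3ware,0 /dev/twa0       # full SMART report for the disk on port 0
smartctl -t long -d 3ware,0 /dev/twa0  # kick off an extended self-test on that disk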

As for what I went through trying to get the failing drive back to operating order again, that's a whole 'nother story.

Friday, August 15, 2008

"The essential postgresql.conf" at the BWPUG

One of the common complaints about how you set up and tune a PostgreSQL server is that the postgresql.conf file is way too big, and there's no good guidance on where you should start. I've been talking and corresponding with many PostgreSQL administrators over the last few months, with writing help from Christopher Browne and then Robert Treat, to work on that problem. I'm proud to announce a document addressing that need that is now both current and pretty complete, while not being overwhelming: Tuning Your PostgreSQL Server.

Robert and I finished this off just in time to present the results at this week's Baltimore/Washington PostgreSQL User Group meeting, providing edutainment by going through the whole list of the 18 most important parameters (plus logging). Thanks to Robert, Chris, the regulars on the PostgreSQL mailing lists, and everyone who gave us feedback at the meeting (now applied to the Wiki document) for helping to bring all this information together into one relatively easy-to-digest bite.

Monday, August 4, 2008

VirtualBox and custom kernels

I've been using VirtualBox successfully for a few months now. On my laptop, with Linux as the host operating system, the virtual Windows XP install runs faster than the real XP install on another partition, even though I've only given it 512MB of RAM to work with. Very interesting program.

But on my main desktop system I run a custom kernel, and the same VirtualBox install didn't work there. The kernel module it loads just failed. You can get that module to rebuild using

/etc/init.d/vboxdrv setup

Or at least you're supposed to be able to. The error message it gave me in /var/log/vbox-install.log suggested I needed to set KERN_DIR to point to the root of my kernel source, so I did that. No help, and now there's nothing useful in that log file. The module was there, it just didn't work. Manually running "modprobe vboxdrv" said it was missing symbols.
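
For reference, this is the sort of thing I mean by setting KERN_DIR; the kernel source path is just an example for a hand-built 2.6.25:

export KERN_DIR=/usr/src/linux-2.6.25
/etc/init.d/vboxdrv setup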

To track this down, I started by reading the source for the vboxdrv init script. Under the hood, it runs /usr/share/virtualbox/src/build_in_tmp to recompile the module. You can run this by hand to see what's really going on. Annoyingly, there's a crummy test that generates this bogus warning:

test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( \
echo; \
echo " ERROR: Kernel configuration is invalid."; \
echo " include/linux/autoconf.h or include/config/auto.conf are missing."; \
echo " Run 'make oldconfig && make prepare' on kernel src to fix it."; \

This is a lie; both files are there, and it prints this even in the final working configuration after I fixed the problem. I also saw some warnings about deprecated symbols in the middle, and the linker complained about missing things, just like when I tried to load the module.

I found some notes about using VirtualBox with a custom kernel in an Ubuntu forum, which was interesting background but no help. Another nice link that I want to remember for future use shows how to repackage VirtualBox using dpkg to support a custom kernel. But since this was on a RedHat system, that wasn't going to help, and ultimately I suspected the same compilation problem would kick in anyway.

After some more searching, I found the root cause here. Some of the kernel interfaces that existed in the original 2.6.18 kernel I was using the last time this worked were changed in the 2.6.25 kernel I was running now, which is why the symbols it was looking for were missing. A Gentoo bug report goes over the ugly details and suggests that VirtualBox versions after 1.6.2 fix the problem (I had been running 1.5.6). Sure enough: download the latest version, install it, and the module rebuilds itself just fine now, as long as KERN_DIR is set correctly.
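
A quick way to confirm the rebuilt module is actually healthy after running the setup script, using nothing but the standard module tools:

modprobe vboxdrv && lsmod | grep vboxdrv
dmesg | tail        # any unresolved symbol complaints show up here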

This is one of those interesting bits to debate: the reason this problem was introduced was that an API was changed "to fix bugs and add a really important feature". Linux benefits from being free to make such enhancements as necessary. But at the same time, the complete lack of concern for backward compatibility in kernel modules can really make things difficult for regular users of Linux software that must link to the kernel. The obligatory reading here for the developer's side of this argument is Stable API Nonsense.

Sunday, August 3, 2008

A Linux write cache mystery

One happy coincidence for me last month is that I discovered a friend of mine had built a Linux-based server running PostgreSQL and put it into a colo, something I had been pricing out myself. He created an account for me, and now I've got a place to host some personal MediaWiki projects I'd been planning. One known issue my friend mentioned is that he'd been running into regular problems where the whole server would just freeze up for a few seconds. Because of the pattern of when it happened, he suspected it was related to heavy writes, and furthermore suspected the software RAID-1 implementation. Since that seemed unlikely to me, I told him to hold off on breaking the RAID until I could take a look at things.

The server is a quad-core system with 8GB of RAM and a pair of SATA disks in software RAID-1. The OS is CentOS 5.2, based on the RHEL5 release, and it's using kernel 2.6.18 (the regular one, not the Xen one).

I started by doing the same read/write testing I always do on a system whose I/O I don't necessarily trust or understand; that procedure is outlined at Testing your disks for PostgreSQL. Since the server has 8GB of RAM I used 2,000,000 blocks. I opened a couple of sessions to the server, executed that in one window, top in a second, and "vmstat 1" in a third. Watching vmstat at one-second intervals is one of the most useful things you can do for figuring out where the bottlenecks on a system are.
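
For reference, the write half of that procedure is just a big sequential dd, sized here to match this run (2,000,000 8kB blocks, roughly twice RAM); the output path is an example, and you'd normally aim it at the filesystem you actually care about:

dd if=/dev/zero of=/tmp/ddtest bs=8k count=2000000 && sync
vmstat 1    # in another session, to watch the resulting write rate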

In this case, what quickly became apparent is that the system was alternating between healthy periods that looked like this:


procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 5 5196 46428 28160 6188640 0 0 0 539720 1261 673 0 28 25 47 0
0 6 5196 47268 28212 6047744 0 0 0 988400 1235 615 0 21 25 54 0
0 6 5196 46980 28272 6047216 0 0 0 64032 1260 658 0 6 13 81 0
1 5 5196 45884 28336 6046788 0 0 0 61568 1281 677 0 7 0 93 0
1 5 5196 47276 28400 6043408 0 0 0 65632 1267 665 0 7 0 93 0
0 6 5196 46272 28460 6044080 0 0 0 65568 1272 641 0 6 0 94 0
0 6 5196 48188 28524 6042420 0 0 0 65536 1271 626 0 6 0 94 0
0 6 5196 46228 28592 6044836 0 0 0 66928 1298 668 0 7 0 94 0
0 5 5196 46648 28652 6044812 0 0 0 61504 1264 648 0 6 8 86 0

The bo (blocks out) number is the one to watch on this write test. That's in KB/s, so the later entries here are all approximately 65MB/s. But at the start it's writing into the Linux disk cache at a really high speed, as much as 988MB/s. Note that these numbers are total I/O, which includes both of the disks in the RAID-1 pair. That means the actual per-disk write rate is closer to 32MB/s, a bit on the low side, but that's presumably because the disks are already backlogged with writes from the initial burst.

That's a healthy period. Here's what the unhealthy ones looked like:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 3780 935592 32048 5205528 0 0 0 0 1270 371 0 4 25 71 0
0 4 3780 945140 32048 5205528 0 0 0 0 1293 383 0 3 25 72 0
0 4 3780 954316 32048 5205528 0 0 0 0 1271 370 0 3 25 72 0
0 4 3780 963616 32048 5205528 0 0 0 0 1293 385 0 3 25 72 0
1 4 3780 973288 32048 5205528 0 0 0 0 1293 372 0 3 25 72 0
0 4 3780 982464 32048 5205528 0 0 0 0 1280 395 0 3 25 72 0
0 4 3780 992384 32048 5205528 0 0 0 0 1283 373 0 4 25 71 0
0 4 3780 1002180 32048 5205528 0 0 0 0 1320 380 0 3 25 72 0
0 4 3780 1011480 32048 5205528 0 0 0 0 1280 371 0 3 25 72 0
0 4 3780 1021028 32048 5205528 0 0 0 0 1315 389 0 4 25 71 0
0 4 3780 1030204 32048 5205528 0 0 0 0 1280 372 0 3 25 72 0
0 4 3780 1039132 32048 5205528 0 0 0 0 1279 383 0 3 25 72 0
0 4 3780 1049052 32048 5205528 0 0 0 0 1273 369 0 3 25 72 0
0 4 3780 1058600 32048 5205528 0 0 0 0 1285 388 0 3 25 72 0
0 4 3780 1067900 32048 5205528 0 0 0 0 1259 373 0 3 25 72 0
0 4 3780 1077696 32048 5205528 0 0 0 0 1293 388 0 4 25 71 0
0 4 3780 1087740 32048 5205528 0 0 0 0 1267 371 0 3 25 72 0
0 4 3780 1096296 32048 5205528 0 0 0 0 1262 385 0 3 25 72 0
0 4 3780 1106216 32048 5205528 0 0 0 0 1274 367 0 3 25 72 0
1 3 3780 898384 32268 5428364 0 0 0 8 1344 489 1 19 25 55 0
2 4 3780 313476 32816 5991000 0 0 0 0 1248 374 0 28 25 47 0
1 5 5196 47324 27828 6253400 0 1928 0 1928 1289 470 0 19 17 64 0

That's over 20 seconds straight where zero blocks were written. That certainly seems to match the reported problem behavior of a long unresponsive period, and sure enough some of the sessions I had open were less responsive while this was going on. The question, then, is why it was happening. The disks seem to be working well enough; here's the summary at the end of the dd (the version of dd included in RHEL5 now provides this for you):

16384000000 bytes (16 GB) copied, 209.117 seconds, 78.3 MB/s

78MB/s to each disk in the pair is completely reasonable.

I wrote a long paper on how Linux handles heavy writes called The Linux Page Cache and pdflush because I never found a source that really covered what happens in this situation. What I recommend there is watching /proc/meminfo to see what's going on. Here's a little shell bit you can execute to do that:

while [ 1 ]; do cat /proc/meminfo; sleep 1; done

With some practice you can note what numbers are constantly moving, catch when the bad behavior occurs, then hit control-C to break and peruse the last few entries in your terminal app scrollback buffer. Here's what I found right around the same time as the lull periods:

MemTotal: 8174540 kB
MemFree: 62076 kB
Buffers: 21724 kB
Cached: 6158912 kB
SwapCached: 0 kB
Active: 1126936 kB
Inactive: 6101688 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 8174540 kB
LowFree: 62076 kB
SwapTotal: 16771840 kB
SwapFree: 16766644 kB
Dirty: 6640 kB
Writeback: 3230792 kB ***
AnonPages: 1047724 kB
Mapped: 65736 kB
Slab: 823308 kB
PageTables: 23036 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 20859108 kB
Committed_AS: 1893448 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 271572 kB
VmallocChunk: 34359465955 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

Note the line I starred there for Writeback. At the point where the system was stalling, a full 3.2GB of data was queued to write. That's 40% of RAM. Going back to my Linux page cache paper, you'll find that number listed: 40% is the point where Linux switches to the high dirty_ratio behavior, where all processes are blocked for writes. On a fast server with this much RAM, you can fill gigabytes of RAM in seconds, but writing that out to disk is still going to take a long time. If we work out the math, 3.2GB to write to two disks capable of 78MB/s each works out to...20.5 seconds. Well look at that, almost exactly the length of our long slow period, where process writes were stalled waiting for the kernel to clean up. I love it when the math comes together.

So, what to do? Well, this problem (way too many writes buffered on systems with large amounts of RAM) was so obvious that in the 2.6.22 Linux kernel, the defaults for the caching here were all lowered substantially. This is from the release notes to 2.6.22:

Change default dirty-writeback limits. This means the kernel will write "dirty" caches differently...dirty_background_ratio defaulted to 10, now defaults to 5. dirty_ratio defaulted to 40, now it's 10

A check of this server showed it was using the 2.6.18 defaults as expected:

[gsmith@server ~]$ cat /proc/sys/vm/dirty_ratio
40
[gsmith@server ~]$ cat /proc/sys/vm/dirty_background_ratio
10

So what I suggested to my friend the server admin was to change these to the new values that are now standard in later kernels. It's easy to put these lines into /etc/rc.d/rc.local to make this change permanent after trying it out:

echo 10 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
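
If you'd rather manage these through sysctl, the same two knobs are exposed under their vm.* names; adding them to /etc/sysctl.conf makes the change permanent as well. Either way sets the same values:

sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5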

After doing that, I re-ran the dd test and things were much better. There were a few seconds where there was a small blip in throughput; during the 4-minute-long test I found one 4-second period where writes dropped to the 2MB/s level. But for the most part, the giant bursts followed by lulls were gone, replaced by a fairly steady 130MB/s of writing the whole time. The final dd numbers looked like this after the retuning:

16384000000 bytes (16 GB) copied, 261.112 seconds, 62.7 MB/s

So this did drop average and peak throughput a bit. That's usually how things work: the best throughput to disks usually comes from writing in larger bursts, which is efficient but very disruptive. But that's a small price to pay for making the many-second-long pauses go away.

This particular problem shows up in all kinds of places where heavy writing is being done. Tuning these parameters is also one of the suggestions I make for people running PostgreSQL 8.2 or earlier in particular who want to tune checkpoint behavior better. In that context, there have even been reports of people turning this particular write cache off altogether, which again would lower average throughput, but in that case it was worth it for how much it decreased worst-case behavior.

Time will tell if there's anything else going on that was contributing to the pauses originally reported that is still happening on this server, but this looks like a pretty clear smoking gun that's now been holstered.

Friday, May 23, 2008

pgbench suffering with Linux 2.6.23-2.6.26

About a month ago I got Linux installed on some new quad-core hardware at home, with the intention that the box would primarily be used for my PostgreSQL work, which includes a lot of benchmarking. The hardware on the system is new enough that I needed a very recent Linux kernel version for it to work well. It was possible to install my standard distributions, CentOS 5 and Ubuntu 7.04, if I set the motherboard to "ATA/IDE Mode=legacy". But that limited me to old ATA modes without DMA, which topped out at 3MB/s of hard drive speed as measured by hdparm -t, an issue well documented in the libATA FAQ. In order to get reasonable speed (close to 70MB/s) I had to use a much newer kernel that allowed me to switch to AHCI mode. The earliest kernel that worked well was 2.6.22-19.

I should have stopped there.

For kicks, I decided to try the latest kernel at the time, 2.6.24.3, to see if it was any faster. The big change in recent kernels I was curious about was the introduction of the controversial Completely Fair Scheduler (CFS). Because that was merged into the kernel quite a bit faster than I find comfortable, I was rather suspicious of its performance, and I launched into a full set of tests using pgbench.
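
For those who haven't used it, a pgbench run of the general sort I do looks something like this; the scale and client counts are illustrative rather than my exact test case:

createdb pgbench                    # a throwaway database for the test
pgbench -i -s 10 pgbench            # initialize a scale-10 set of tables
pgbench -S -c 8 -t 10000 pgbench    # SELECT-only run with 8 clients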

Those tests went very badly, with an extreme drop in performance compared with earlier versions. By the time I had quantified that and reported it to PostgreSQL land, I was told Nick Piggin had already reported the problem and a fix was due for 2.6.25, which literally came out in the middle of the night on the day I was coming up with a repeatable test case for 2.6.24.

2.6.25 gave the worst pgbench results yet. I did confirm that the problem isn't in the server, though; if I run the pgbench client on a remote system, the results scale fine to the level I expect.

It took me some time to organize all my test results and automate my test cases well enough that I thought they could be replicated. This week I submitted a kernel regression report about this problem. Mike Galbraith was easily able to reproduce it, and I've already gotten and tested one patch to improve things.

It's still a ways back from 2.6.22 though, and with 2.6.26 at -rc3 I have my doubts this will be fully wrapped up before that goes live. So here are my suggestions for those of you using PostgreSQL on recent Linux versions who care about performance:


  • Kernel 2.6.18 has been a fairly solid kernel for me under RHEL5 and the matching CentOS.
  • Kernel 2.6.19 was the first to merge the libata drivers and I avoid it because of that.
  • Kernel 2.6.20 in the form of Ubuntu 7.04 Feisty Fawn seems to have stabilized after the libata merge. I can't comment on the kernel.org distribution.
  • Kernel 2.6.22 seems solid on everything I throw at it, this is the latest Linux kernel I recommend using for PostgreSQL without serious caveats. The first 2.6.22 came out in July of last year, so it's almost a year old now, and it's currently at 2.6.22-19. Ubuntu's 7.10 Gutsy Gibbon includes 2.6.22.
  • Kernels 2.6.23 and 2.6.24, the first to use CFS, have known serious server-side PostgreSQL performance issues. The main distributions I'm aware of that are likely impacted here are Fedora Core 8 and Ubuntu 8.04. I expect Fedora to be running bleeding-edge kernels with regressions (you could argue the existence of Fedora has pushed kernel testing toward the users rather than it being done as much by the developers). I'm really disappointed Ubuntu adopted such a kernel given the rawness of CFS; I already ranted about this a bunch in my last blog post. I think this is just a timing problem for them. Had they adopted 2.6.22 instead, they'd be facing 5 years of LTS with a kernel using a model abandoned by mainstream development, and therefore have nowhere to turn when issues popped up. The right answer would have been not to do an LTS right now, but I digress.
  • Kernel 2.6.25 seems to have resolved the worst of the server-side issues. This is what Fedora Core 9 is running. But results from pgbench should be considered suspect, particularly under high client loads, and without that working I can't really prove to myself that 2.6.25 is fine in all the situations I like to test.
  • Kernel 2.6.26 may get some fixes in to improve pgbench; it depends on how things go. The patch I've already gotten closed much of the performance gap with only a few lines of code changed. You can follow my thread on lkml to see how things are going.


The really important lesson here, one I drill into people whenever I talk about performance and benchmarking issues, is that you've got to measure baseline expectations on any system you put together, continue to check periodically to make sure things haven't eroded, and use that as guidance for any new system. Here I compared new hardware about to go into service against my older, trusted "production" unit to confirm it was faster, discovering both the disk and the pgbench issues. It's really handy to be able to quantify how fast your reference benchmarks run on a system, from a CPU/memory/disk perspective, when you get to the point of looking to upgrade it. Not only will that help guide what should be changed, but it will let you know whether the upgrade really worked when you're done.

Monday, May 19, 2008

Linux printing, as "fun" as ever

I've been using Ubuntu 7.04 "Feisty Fawn" as my primary desktop for the last six months. Let me start by saying I'm a big fan of the distribution for desktop use, and I'm looking into deeper use of Ubuntu in the next six months despite the occasional problems with it and its underlying infrastructure. (Debian lets a guy who doesn't know how to read C repackage crypto code? Seriously?)

Anyway, I needed to get that Feisty install talking to my QMS/Minolta (now Konica) Magicolor 2350 laser printer over the network, the only way I ever talk to that printer. Recalling Eric Raymond's Linux usability rant The Luxury of Ignorance, I was curious to see how much things have improved since then.

This printer has all sorts of network interfaces and genuine PostScript support. The last time I networked it on a Linux host was all via the command line, and that was pretty simple, as command-prompt interfaces go. Under Windows, all I have to do is install the driver and tell it the IP address of the printer. Ubuntu also provides a GUI to handle all this, gnome-cups-manager. I started there. This printer supports the Internet Printing Protocol, which seemed promising for an easy configuration.

I set things up for IPP to that address and just got errors, with no useful descriptions in the GUI. Turning to Debugging Ubuntu Printing Problems, I found that on Feisty I had to manually edit the cupsd.conf file to get any useful feedback here. Basically, on this version of Ubuntu, the Gnome tool is useless the minute you have a problem. Supposedly it's better on Gutsy, not that it would have helped me.
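
The edit I mean is just the standard CUPS LogLevel directive; roughly, the debugging setup looks like this (stock Ubuntu paths):

# set "LogLevel debug" in /etc/cups/cupsd.conf, then:
sudo /etc/init.d/cupsys restart
tail -f /var/log/cups/error_log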

I also learned a bit about troubleshooting this type of error from the IPP: Only Raw Printing Works thread on the forums, which wasn't my problem but was in the same area. The other troubleshooting suggestion there for helping with reinstalls was:

sudo foomatic-cleanupdrivers

That was nice for making sure I was starting properly from closer to scratch every time, but it was also no help at actually resolving the issue.

The debugging level logs led me to this bit that seemed my real problem:

E No %%BoundingBox: comment in header!
E PID 10539 (/usr/lib/cups/backend/socket) stopped with status 1!
E [Job 63] Unable to write print data: Broken pipe

What the hell? I found a bunch of people with this same error, but few suggestions. Subsequent reading suggested the whole gnome-cups-manager does little but get in the way if it doesn't work right off the bat, so now it was back to using

http://localhost:631/

to manage CUPS directly.

I tried a few more variations on the IPP protocol before deciding that whole protocol was yet another thing getting in the way. Back to the old standby of talking directly to the printer with lpd. While playing with that, cups went completely insane on multiple occasions, prompting a need for:

/etc/init.d/cupsys restart

Sigh. Did some digging on the default queue info for this printer, and the magic URI that finally worked is:

lpd://192.168.0.6/ps

So basically the old-school setup I used to use. The additional layers of cups and its gnome interface did nothing but get in my way by obfuscating what was going on underneath, and cups remains as buggy as ever. Gosh, maybe I should follow the link on the localhost cups page to purchase Easy Software's "ESP Print Pro"? The only thing worse than spamming my printer setup page with their ad is that when I click on it, the page doesn't even exist. Come on--you put a damn ad on Linux systems all over the world, and you can't even keep a redirect alive to the URL you used? Not exactly a way to get me so confident in your skills that I'd want to give you money.
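
For what it's worth, the same queue can also be defined without either GUI; here's a minimal lpadmin sketch using the URI above (the queue name is made up, and you'd normally point -P at the printer's PPD file as well):

lpadmin -p magicolor2350 -E -v lpd://192.168.0.6/ps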

Now, at this point the obvious flame is "what the hell are you complaining about Ubuntu 7.04 for when 8.04 is available?". That's easy--I only use LTS versions because they're the only ones I'd recommend a business deploy, and dissapointingly Hardy is a good beta quality release that was pushed out the door anyway to meet a pre-planned release schedule. It doesn't work anywhere close to well enough for my standards yet. They decided to use a Linux kernel so fresh (2.6.24) it's gone through almost no QA before release, with a crippling bug in the brand-new scheduler model that completely bogs down the application I spend most of my day using, PostgreSQL. Again, seriously? They just introduced a whole new scheduler in 2.6.23 and you expect it will work already? Have you ever actually developed software before? That sucker is many kernel revs away from having all the unexpected corner cases knocked out. Hardy should have shipped with 2.6.22 or earlier if they wanted a stable kernel at launch. Combined with the beta standard Firefox (I must have a reliable Flash plugin for my work as well) and the whole Pulseaudio mess, Hardy isn't even on my radar until service pack, err, update 1 comes out in a couple of months.

A final twist to my story: just after I got printing going, CUPS recognized my printer by itself! It suggested:

ipp://192.168.0.6:631/COM1

Of course, that didn't actually work either--just got IPP errors about not being able to get the status of the printer. Looks like Eric Raymond's Aunt Tillie is still a considerable distance away from easy Linux printing. As for my setup, web pages print fine but lpd output gets truncated at the margins. If only I had a fancy GUI to help set that up...

Friday, April 11, 2008

Conference East 08 and Solaris Notes

I had fun and picked up a lot of information at the East Coast PostgreSQL conference two weeks ago. I was waiting for a few things to come online before writing about it, rather than post before I had links to everything.

I was surprised at how good the reaction to my own presentation on the internals of the PostgreSQL buffer cache was. Much of that information you can only pick up from staring at the source code to the database, and apparently that's not so popular. There are theoretical aspects to the whole thing, but I did close with some practical suggestions for tuning with queries that use the pg_buffercache module. Slides and sample query source code are now posted at my PostgreSQL Performance Pitstop. I'm hoping to present that again at the upcoming OSCON PostgreSQL Day in Portland. And this time I'll be prepared for the question Jeff Davis nailed me on, about how the 8.3 buffer re-use changes actually work.
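
To give a flavor of the practical side, here's a typical pg_buffercache summary of the sort those queries build on; it assumes the contrib module is installed, the database name is an example, and this simplified join ignores which database and tablespace each buffer belongs to:

psql -d mydb -c "
  SELECT c.relname, count(*) AS buffers
    FROM pg_buffercache b JOIN pg_class c ON b.relfilenode = c.relfilenode
   GROUP BY c.relname
   ORDER BY buffers DESC
   LIMIT 10;"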

The first really informative talk I saw was Greg Sabino Mullane's discussion of replication solutions. I just used some of the insight gained there to write an updated intro to the Replication, Clustering, and Connection Pooling page on the PostgreSQL Wiki. That page still needs a lot more attention.

Solaris Adventures

One of the things I've been doing recently is deploying more systems on Solaris 10 instead of my traditional Linux environment. There were a few talks discussing Solaris-specific issues and I picked up some tricks at all of them.

Theo from OmniTI talked about their experiences as well as tools they've made available, like their PostgreSQL-Solaris Tools and Project DTrace. Some things I picked up there:
  • ZFS has a pretty substantial overhead compared to the traditional UFS filesystem. It takes some tuning to pull that down for PostgreSQL use: dropping the default large block/stripe size values, making the ZFS buffer cache less aggressive, and adjusting the direct I/O parameters as appropriate (a rough sketch of some of these ZFS commands follows this list).
  • They've found that the shared lock pg_dump takes on large tables makes it really inappropriate for use on larger databases. What they do instead is take a ZFS snapshot, then run the dump against that snapshot. Then there are no database locks on the snapshot to worry about stomping on.
  • PostgreSQL major version upgrades are hard to pull off because the recommended procedure is a dump/reload; that doesn't work on TB-scale databases very well. Slony is often recommended here, but using it requires a lot of hardware: you basically need a whole additional server just to handle the version upgrade, in addition to the primary/backup combination the main business is running on. Here again you can play some tricks with ZFS snapshots to try things out on the production boxes while leaving the ability to roll back if there's a problem. This greatly enhances the potential for a tool like pg_migrator to be useful for version upgrades of large databases, instead of a full dump/reload.
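
Here's the rough shape of the ZFS commands behind those last two points; the pool and filesystem names are examples, and the right recordsize and cache settings are exactly the sort of thing those talks cover in more detail:

zfs set recordsize=8k tank/pgdata        # match PostgreSQL's 8kB block size
zfs snapshot tank/pgdata@pre-upgrade     # cheap point to dump from or roll back to
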
Jignesh from Sun did two talks about Solaris deployments. The benchmarking one was interesting for me personally as a performance tuning geek, but the focus on commercial benchmarks (TPC is not cheap) and practical suggestions aimed at gigantic user loads isn't exactly mainstream. But his Best Practices with PostgreSQL on Solaris talk is must reading for people using this combination. The comments about UFS and ZFS tuning cleared up many mysteries about the performance differences I'd noticed between those two and the Linux ext3 setups I'm used to tuning. Some details from the talk to expand on the slides a bit:
  • Disabling the freebehind algorithm turns off the somewhat unexpected Solaris behavior that avoids caching larger files, like those you'll find in a database application. It's worth your time to look into all the parameters he mentions for filesystem tuning; they're pretty obscure, but the defaults are not tuned well for database use--by PostgreSQL at least, which relies much more heavily on the OS filesystem cache than, say, Oracle.
  • One non-obvious reason that direct I/O really doesn't work well on the main database filesystem is that PostgreSQL really expects that access to the pg_clog files will be readily cached by the OS. When they're not that causes some ugly performance issues.
  • It wasn't really anticipated by the original design, but apparently the postgresql.conf commit_delay parameter has a highly useful application in very high throughput situations because it allows database writes to be much larger on average. That makes them better match the underlying storage/filesystem block sizes while also decreasing the number of I/O operations aka iops (which can become a bottleneck if you're doing lots of small writes).
  • There are alternative memory allocation and scheduler methods available (libumem and the FX scheduler) that seem more appropriate for PostgreSQL use than the defaults.
Finally, another talk that mentioned Solaris tools like ZFS snapshots was Robert Treat's "Big, Bad, Broken PostgreSQL". I would recommend waiting to see if the video for this talk becomes available; it was so funny in a black comedy vein to see how his disaster unfolded that I'd hate to see you ruin the surprises by just reading the slides first.

Saturday, February 23, 2008

"The stage was now set for..."

I've never been much of a blog fan. I really like wikis instead, and I've spent way too much time fiddling with them for fun the last few years. But I've noticed that lately I find myself spending an increasing amount of time reading blogs that primarily consist of people posting little technical tidbits about software that I use, and I've realized I should be one of those people. One of my long standing organizational problems is that I collect more information than can be organized usefully with the time I've got. While I've got a private wiki I use for that, it fails to work well for jotting down quick notes I'd like to be able to find again one day, but that don't need to be put into a larger structure.

Thus is born the note magnet. Let the shredding of scribbled post-its I've transcribed begin.