Posted: March 1st, 2010 | Author: charlie | Filed under: IT Management | Tags: capacity, performance, tuning, virtualization | 3 Comments »
When purchasing server hardware, do you tend to purchase more power than you need, or not enough? Specifying the correct server for your current need is a fine art, and it’s easy to get wrong. Here are some helpful hints and considerations to remember that will ensure you make the right server purchasing decision.
We’re going to focus on standalone (non-blade) servers for the moment, but many aspects are also applicable to blade servers. Blade servers are wonderful for centralized management of the hardware, but the specs of the individual server blades can vary tremendously.
Hardware Management
Want to avoid trudging down to the datacenter late at night, or even worse, across the world, if something breaks? Then don’t skimp on the management controller, lights-out manager, or whatever the vendor is calling it. Many vendors ship a simple version by default: it may allow serial console access only, for example. Make sure to get the full-featured controller, because even if the hardware is only a few doors down, getting up from your desk should never be necessary.
If you aren’t thinking of switching vendors any time soon, you might think that the management interface will always work the same as it has on all your other servers. Unfortunately, that’s not the case. Sun x86 hardware, for example, has many different hardware management controllers to choose from. The more expensive and feature-rich servers have the better controllers, but don’t make the mistake of thinking the interface never changes. The unfortunate part is that you never know how well it works until you get a server on-site.
Hardware management comes in two forms: IPMI (which most servers support), and the vendor’s user interface. The user interface is, more often than not, a Web-based Java application that provides remote console access. Some are extremely buggy, and others work quite well from all Web browsers. We can’t make a recommendation, though, because these things change often.
Memory
Shucks, this one is a no-brainer: as much as you can afford. Within reason, that is. If you aren’t going to run virtual machines, and this server’s only job is to serve up some simple Web pages, then 16GB of RAM is likely overkill. Likewise, make sure you know what your application can support. Many Java applications are limited to a heap size of 2 or 4GB.
It’s also overkill to purchase more than 4GB of RAM if you need to run a 32-bit operating system. Yes, Windows Server does some tricks and it can use more than 4GB, but it’s a huge performance hit.
If virtualization is in your future, load up as much as possible. You also want to pay attention to how many DIMM slots the server has. The 8GB DIMMs are horribly expensive now, so you’ll probably want to stick with 4GB sticks. Just remember, if you fill all the slots in the server, the only memory upgrade path is to buy higher capacity DIMMs.
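The slot math is worth doing explicitly before you order. A quick sketch (the 12-slot board is a hypothetical example, not any particular vendor’s):

```shell
# Hypothetical board with 12 DIMM slots, choosing between 4GB and 8GB sticks.
slots=12
max_with_4gb=$((slots * 4))   # every slot filled with 4GB DIMMs
max_with_8gb=$((slots * 8))   # the only upgrade path once slots are full
echo "4GB DIMMs: ${max_with_4gb}GB max; 8GB DIMMs: ${max_with_8gb}GB max"
```

Leaving a few slots empty when using the cheaper DIMMs keeps an inexpensive upgrade path open.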
CPU
Do you want to run many threads at an even pace, or just a few threads as fast as possible? Sun’s T2 processors aren’t fast by any measure, but they can run many threads at consistent speeds. These are ideal for Web servers, but not for database workloads that depend on fast single-threaded performance.
Will this server be executing a wide variety of processes over and over again, as opposed to just running the same big application server constantly? If so, make sure you pay attention to the amount of cache each core of the CPU has.
For virtualization, you want the fastest multi-core processors available, with the largest amount of L2 cache. Cache is very important as it minimizes the number of times the CPU needs to fetch data from slower RAM. It makes a very noticeable difference on heavily used servers.
Disks, Controllers, and RAID
If you need local storage, do pay attention to the type of disks you’re ordering. A SATA disk is likely to disappoint if you have an IO-heavy workload. SAS and FC disks should perform equally well, since they are both SCSI disks underneath.
Even if you don’t need much local storage, you should always buy a server with a RAID controller that can mirror the operating system disks, unless you’re SAN booting of course. You don’t want the OS to crash just because of a failed disk. Likewise, if you’re keeping tons of local storage for some reason, make sure to get a RAID card that does RAID-5, so that you can at least lose one disk at a time without losing data. If performance is a concern you should really be using iSCSI or SAN storage, but you may also think about a RAID 0+1 configuration to avoid the slower RAID-5 parity calculations.
If you’re attaching to a SAN, make sure to include the correct HBA as well.
Network
When servers started showing up with two or four gigabit NICs, I must admit I was confused. Why would someone need that many? Aside from large servers that do a lot of network IO, you might also want to separate your iSCSI traffic from normal Ethernet. It’s also important these days to make sure that the network cards support TOE, a TCP Offload Engine. This moves TCP processing work, such as checksum and segmentation handling, onto the network card, freeing your CPUs for more important things.
In summary, most of these things may seem common sense, but you need to remember to ask all the right questions every time you spec a server. Here’s a good checklist:
- Adequate hardware management controller
- Enough (but not too much) RAM, that’s fast enough, but not faster than the CPU’s front-side bus
- Enough memory slots for expansion, if that seems likely
- Correct CPU for this server’s needs
- RAID-1 for the OS, and (optionally) other RAID levels for other local storage
- FC HBAs?
- Multiple gigabit NICs with TOE capabilities
Posted: February 24th, 2010 | Author: charlie | Filed under: Linux / Unix | Tags: education, linux, monitoring, performance, solaris, tuning | No Comments »
Unix and Linux systems have forever been obtuse and mysterious for many people. They generally don’t have nice graphical utilities for displaying system performance information; you need to know how to coax out the information you need. Furthermore, you need to know how to interpret the information you’re given. Let’s take a look at some common system tools that can provide tons of visibility into what the opaque OS is really doing.
Unfortunately, the same tools don’t exist universally across all Unix variants. A few commonly underused ones do, however, and that is what we’ll focus on first.
A common source of “slowness” is disk I/O, or rather the lack of available I/O. On Linux especially, it may be a difficult diagnosis. Often the load average will climb quickly, but without any corresponding processes in top eating much CPU: Linux counts processes blocked waiting on I/O toward the load average. I’ve seen load numbers in the tens of thousands on more than one occasion.
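On Linux, the load average itself comes straight from /proc/loadavg, so a quick look there (or at the top of uptime output) is the first diagnostic step. A minimal sketch, assuming a Linux /proc:

```shell
# Print the 1-, 5-, and 15-minute load averages from /proc/loadavg.
# High numbers here while top shows idle CPUs usually mean blocked I/O.
read one five fifteen rest < /proc/loadavg
echo "load averages: $one (1m) $five (5m) $fifteen (15m)"
```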
The easiest way to see what’s happening to your disks is to run the ‘iostat’ program. Via iostat, you can see how many read and write operations are happening per device, how much CPU is being utilized, and how long each transaction takes. Many arguments are available for iostat, so do spend some time with the man page on your specific system. By default, running ‘iostat’ with no arguments produces a report about disk IO since boot. To get a snapshot of “now,” add a numerical interval argument last, which prompts iostat to report statistics gathered over each interval of that many seconds.
Linux will show the number of blocks read or written per second, along with some useful CPU statistics. This is one particularly busy server:
avg-cpu: %user %nice %system %iowait %steal %idle
1.36 0.07 5.21 23.80 0.00 69.57
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 18.22 15723.35 643.25 65474958946 2678596632
Notice that iowait is at 23%. This means that 23% of the time, this server is waiting on disk I/O. Solaris iostat output shows a similar thing, just represented differently (iostat -xnz):
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
295.3 79.7 5657.8 211.0 0.0 10.3 0.0 27.4 0 100 d101
134.8 16.4 4069.8 116.0 0.0 3.5 0.0 23.3 0 90 d105
The %b (percent busy) column shows that device d101 is busy servicing I/O 100% of the time. The average service time isn’t good either: disk reads shouldn’t take 27.4ms. Arguably, Solaris’s output is friendlier to parse, since it gives the reads per second in kilobytes rather than blocks. We can quickly calculate that this server is reading about 19KB per read by dividing the number of KB read per second by the number of reads that happened. In short: this disk array is being taxed by large amounts of read requests.
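That per-read arithmetic is easy to script; here’s a sketch using awk with the sample figures for device d101 from the iostat -xnz output above:

```shell
# Average KB per read = kr/s divided by r/s (sample figures from above).
kb_per_read=$(awk 'BEGIN { printf "%.1f", 5657.8 / 295.3 }')
echo "average read size: ${kb_per_read}KB"
```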
The ‘vmstat’ program is also universally available, and extremely useful. It, too, provides vastly different information among operating systems. The vmstat utility will show you statistics about the virtual memory subsystem, or to put it simply: swap space. It is much more complex than just swap, though, as nearly every IO operation involves the VM system when pages of memory are allocated. A disk write, a network packet send, and the obvious “program allocates RAM” all impact what you see in vmstat.
Running vmstat with the -p argument will print out statistics about disk IO. In Solaris you get some disk information anyway, as seen below:
 kthr      memory            page            disk          faults      cpu
 r b w   swap    free  re   mf pi po fr de sr m0 m1 m2 m7    in     sy    cs us sy id
 0 0 0 7856104 526824 386 2401  0  0  0  0  0  3  0  0  0 16586  22969 12576  8  9 83
 1 0 0 7851344 522016  18  678 32  0  0  0  0  2  0  0  0 13048  11737 10197  7  6 86
 0 0 0 7843584 514128  76 3330 197 0  0  0  0  2  0  0  0  4762 131492  4441 16  8 76
A subtle but important difference between Solaris and Linux is that Solaris will start scanning for pages of memory that can be freed before it will actually start swapping RAM to disk. The ‘sr’ (scan rate) column will start increasing right before swapping takes place, and continue until some RAM is available. The normal items are available in all operating systems: swap space, free memory, pages in and out (careful, this doesn’t mean swapping is happening), page faults, context switches, and some CPU idle/system/user statistics. Once you know how to interpret these items, you quickly learn to infer what they indicate about the usage of your system.
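The scan rate is also easy to watch for programmatically. A sketch that counts intervals with a nonzero ‘sr’ column; the here-document stands in for live `vmstat 5` output, and the column position ($12) follows the Solaris layout above:

```shell
# Count vmstat intervals where the Solaris 'sr' (scan rate) column is nonzero,
# i.e. where the page scanner is running and swapping may be imminent.
scans=$(awk 'NR > 2 && $12 > 0 { n++ } END { print n + 0 }' <<'EOF'
 kthr      memory            page
 r b w   swap    free  re   mf pi po fr de sr
 0 0 0 7856104 526824 386 2401  0  0  0  0  0
 1 0 0 7851344 522016  18  678 32  0  0  0 25
EOF
)
echo "intervals with page scanning: $scans"
```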
The two main programs for finding “slowness” are therefore iostat and vmstat. Before the obligatory tangent into “what Dtrace can do for you,” here are a few other tools that no Unix junkie should leave home without:
- lsof: lists open files (including network ports) for all processes
- netstat: lists all sockets in use by the system
- mpstat: shows CPU statistics (including IO), per-processor
We cannot talk about system visibility without mentioning Dtrace. Invented by Sun, Dtrace provides dynamic tracing of everything about a system. Dtrace gives you the ability to ask nearly any arbitrary question about the state of a system, and works by enabling “probes” within the kernel. That sounds intimidating, doesn’t it?
Let’s say that we wanted to know what files were being read or written on our Linux server that has a high iowait percentage. There’s simply no way to know. Let’s ask the same question of Solaris, and instead of learning Dtrace, we’ll find something useful in the Dtrace ToolKit. In the kit, you’ll find a few neat programs like iosnoop and iotop, which will tell you which processes are doing all the disk IO operations. Neat, but we really want to know what files are being accessed so much. In the FS directory, the rfileio.d script will provide this information. Run it, and you’ll see every file that’s read or written, and cache hit statistics. There’s no way to get this information in other Unixes, and this is just one simple example of how Dtrace is invaluable.
The script itself is about 90 lines, inclusive of comments, but the bulk of it is dealing with cache statistics. An excellent way to start learning Dtrace is to simply read the Dtrace ToolKit scripts.
Don’t worry if you’re not a Solaris admin: Dtrace is coming soon to a FreeBSD near you. SystemTap, a Linux project inspired by Dtrace, will be available soon as well. Until then, and even afterward, the above-mentioned tools will still be invaluable. If you can quickly get disk IO statistics and see whether you’re swapping, the majority of system performance problems become much easier to solve. Dtrace also provides amazing application tracing functionality, and if you’re looking at the application itself, you already know the slowness isn’t likely being caused by a system problem.
Soon, I’ll publish a few Dtrace tutorials.
Some things have surely been left out – discuss below!
Posted: February 15th, 2010 | Author: charlie | Filed under: Linux / Unix | Tags: linux, performance, swap, tuning, vmm | No Comments »
Virtual memory is one of the most important, and accordingly confusing, pieces of an operating system. Understanding the basics of virtual memory is a prerequisite to understanding operating system performance. Beyond the basics, a deeper understanding allows a systems administrator to interpret system profiling tools better, leading to quicker troubleshooting and better decisions.
The concept of virtual memory is generally taught as though it’s only used for extending the amount of physical RAM in a system. Indeed, paging to disk is important, but virtual memory is used by nearly every aspect of an operating system.
In addition to swapping, virtual memory is used to manage all pages of memory, which incidentally are required for file caching, process isolation, and even network communication. Anything that queues data, you can be assured, traverses the virtual memory system. Depending on a server’s role, the default virtual memory behavior may not be optimal. An administrator can dramatically improve overall system performance by adjusting certain virtual memory manager settings.
To optimally configure your Virtual Memory Manager (VMM), it’s necessary to understand how it does its job. We’re using Linux for example’s sake, but the concepts apply across the board, though some slight architectural differences will exist between the Unixes.
Nearly every VMM interaction involves the MMU, or Memory Management Unit, excluding the disk subsystem. The MMU allows the operating system to access memory through virtual addresses by using data structures to track these translations. Its main job is to translate these virtual addresses into physical addresses, so that the right section of RAM is accessed.
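With the common 4KB page size, the translation split is simple arithmetic: the low 12 bits of a virtual address are the offset within a page, and the remaining bits select the page-table entry that maps to a physical frame. A sketch (the address is an arbitrary example):

```shell
# Split a virtual address into its page number and page offset (4KB pages).
vaddr=912091                 # arbitrary example address (0xDEADB)
page=$((vaddr / 4096))       # selects the page-table entry to translate
offset=$((vaddr % 4096))     # carried into the physical address unchanged
echo "page $page, offset $offset"
```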
The Zoned Buddy Allocator interacts directly with the MMU, providing valid pages when the kernel asks for them. It also manages lists of pages and keeps track of different categories of memory addresses.
The Slab Allocator is another layer in front of the Buddy Allocator, and provides the ability to create caches of memory objects. On x86 hardware, pages of memory must be allocated in 4KB blocks, but the Slab Allocator allows the kernel to store objects of other sizes, and will manage and allocate real pages appropriately.
Finally, a few kernel tasks run to manage specific aspects of the VMM. The bdflush daemon manages block device pages (disk IO), and kswapd handles swapping pages to disk. Pages of memory are either Free (available to allocate), Active (in use), or Inactive. Inactive pages are either dirty or clean: an inactive dirty page is no longer in use, but is not yet available for re-use. The operating system must scan for dirty pages and decide to deallocate them. Once a page is guaranteed to be synced to disk, it becomes “clean,” and is ready for re-use.
Tunable parameters may be adjusted in real time via the proc file system, but to persist across a reboot, /etc/sysctl.conf is the preferred method. Parameters can be entered in real time via the sysctl command, and then recorded in the configuration file for reboot persistence.
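For example, with vm.swappiness standing in for whichever tunable you’re adjusting, the runtime change is `sysctl -w vm.swappiness=10`, and the persistent form is a line in the configuration file:

```
# /etc/sysctl.conf -- read at boot; apply immediately with `sysctl -p`
vm.swappiness = 10
```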
You can adjust everything from the interval pages are scanned to the amount of memory to reserve for pagecache use. Let’s see a few examples.
Often we’ll want to optimize a system for IO performance. A busy database server, for example, is generally only going to run the database, and it doesn’t matter whether the interactive user experience is good or not. If the system doesn’t require much memory for user applications, decreasing the bdflush tunables is beneficial. The specific parameters being adjusted are too lengthy to explain here, but definitely look into them if you wish to adjust the values further. They are fully explained in vm.txt, usually located at /usr/src/linux/Documentation/sysctl/vm.txt.
In general, an IO-heavy server will benefit from setting these values in sysctl.conf:
vm.bdflush="100 5000 640 2560 150 30000 5000 1884 2"
The pagecache values control how much memory is used for pagecache. The amount of pagecache allowed translates directly to how many programs and open files can be held in memory.
The three tunable parameters with pagecache are:
- Min: the minimum amount of memory reserved for pagecache
- Borrow: the percentage of pages used in the process of reclaiming pages
- Max: percentage at which kswapd will only page pagecache pages; once it falls below, it can swap out process pages again
On a file server, we’d want to increase the amount of pagecache available, so that data isn’t moved to disk as often. Using vm.pagecache="10 50 100" provides more caching, allowing larger and less frequent disk writes for file IO intensive work loads.
On a single-user machine, say your workstation, a large max value will keep more pages in memory, allowing programs to execute faster. Once the upper limit is reached, however, you will start swapping constantly.
Conversely, a server with many users that frequently executes many different programs will not want high amounts of pagecache. The pagecache can easily eat up available memory if it’s too large, so something like vm.pagecache="10 20 30" is a good compromise.
Finally, the swappiness and vm.overcommit parameters are also very powerful. The overcommit number allows more memory to be allocated than physical RAM exists. Programs that have a habit of trying to allocate many gigabytes of memory are a hassle, and frequently they don’t use nearly that much. Upping the overcommit factor will allow these allocations to happen, but if the applications really do use all that RAM, you’ll be swapping like crazy in no time (or worse: running out of swap).
The swappiness concept is heavily debated. If you want to decrease the amount of swapping done by the system, just echo a small number in the range 0-100 into /proc/sys/vm/swappiness. You don’t generally want to play with this, as it is more mysterious and non-deterministic than the parameters described above. In general, you do want idle applications to be swapped out, so they aren’t holding memory for no reason. Task-specific servers, where you know the amount of RAM and the application requirements, are best suited for swappiness tuning (using a low number to decrease swapping).
These parameters all require a bit of testing, but in the end, you can dramatically increase the performance of many types of servers. The common case of disappointing disk performance stands to gain the most: give the settings a try before going out and buying a faster disk array.
Posted: February 13th, 2010 | Author: charlie | Filed under: Networking | Tags: linux, networks, NIC, tuning, windows | 1 Comment »
Many new workstations and servers are coming with integrated gigabit network cards nowadays, but quite a few people soon discover that they can’t transfer data much faster than they did with 100 Mb/s network cards. Multiple factors can affect your ability to transfer at higher speeds, and most of them revolve around operating system settings. In this article we will discuss the necessary steps to make your new gigabit enabled server obtain close to gigabit speeds in Linux, FreeBSD, and Windows.
First and foremost we must realize that there are hardware limitations to consider. Just because someone throws a gigabit network card in a server doesn’t mean the hardware can keep up. Network cards are normally connected to the PCI bus via a free PCI slot. In older workstation and non-server-class motherboards, the PCI slots are normally 32-bit, 33MHz. This means they can transfer at speeds of 133MB/s, but since the bus is shared between many parts of the computer, realistically it’s limited to around 80MB/s in the best case. Gigabit network cards are 1000Mb/s, or 125MB/s. If the PCI bus is only capable of 80MB/s, this is a major limiting factor for gigabit network cards. The math works out to 640Mb/s, which is still quite a bit faster than most gigabit network card installations actually achieve, but remember this is probably the best-case scenario. If there are other data-hungry PCI cards in the server, you’ll likely see much less throughput. One solution for overcoming this bottleneck is to purchase a motherboard with a 66MHz PCI slot, which can do 266MB/s. Also, the newer 64-bit PCI slots are capable of 533MB/s on a 66MHz bus. These are beginning to come standard on all server-class motherboards.
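The arithmetic behind those numbers: a 32-bit bus moves 4 bytes per cycle, so ~33MHz yields roughly 133MB/s in theory, and the realistic 80MB/s shared-bus ceiling converts to well under gigabit line rate:

```shell
# Realistic 32-bit/33MHz PCI throughput versus a gigabit NIC's line rate.
realistic_mbytes=80                        # shared-bus best case, in MB/s
realistic_mbits=$((realistic_mbytes * 8))  # convert MB/s to Mb/s
gig_mbits=1000
echo "usable: ${realistic_mbits}Mb/s of ${gig_mbits}Mb/s line rate"
```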
Assuming we’re using decent hardware that can keep up with the data rates necessary for gigabit, there is now another obstacle – the operating system. For testing, we used two identical servers: Intel Server motherboards, Pentium 4 3.0 GHz, 1GB RAM, integrated 10/100/1000 Intel network card. One was running Gentoo Linux with a 2.6 SMP kernel, and the other was running FreeBSD 5.3 with an SMP kernel to take advantage of the Pentium 4’s HyperThreading capabilities. We were lucky to have a gigabit-capable switch, but the same results could be accomplished by connecting both servers directly to each other.
For testing speeds between two servers, we don’t want to use FTP or anything that will require data be fetched from disk. Memory to memory transfers are a much better test, and many tools exist to do this. For our tests, we used ttcp (http://www.pcausa.com/Utilities/pcattcp.htm).
The first test between these two servers was not pretty. The maximum rate was around 230 Mb/s, about two times as fast as a 100Mb/s network card. This is an improvement, but far from optimal. In actuality, most people will see even worse performance out of the box. However, with a few minor setting changes, we quickly realized major speed improvements – more than a threefold improvement over the initial test.
Many people recommend setting the MTU of your network interface larger. This basically means telling the network card to send a larger sized Ethernet frame. While this may be useful when connecting two hosts directly together, it becomes less useful when connecting through a switch that doesn’t support larger MTUs. At any rate, this isn’t necessary. 900Mb/s can be attained at the normal 1500 byte MTU setting.
For attaining maximum throughput, the most important options involve TCP window sizes. The TCP window controls the flow of data, and is negotiated during the start of a TCP connection. Using too small of a size will result in slowness, since TCP can only use the smaller of the two end system’s capabilities. It is quite a bit more complex than this, but here’s the information you really need to know:
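As a rule of thumb, the window needs to cover the bandwidth-delay product: link speed multiplied by round-trip time. A sketch for a gigabit link with a hypothetical 1ms LAN round trip:

```shell
# Bandwidth-delay product: bytes that must be "in flight" to keep a pipe full.
link_bps=1000000000                          # gigabit, in bits per second
rtt_ms=1                                     # hypothetical LAN round trip
bdp_bytes=$((link_bps / 8 * rtt_ms / 1000))  # bytes per round trip
echo "minimum useful window: ${bdp_bytes} bytes"
```

A 64KB default window on that link caps throughput at roughly half of gigabit speed, which is why these settings matter.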
For both Linux and FreeBSD we’re using the sysctl utility. For all of the following options, entering the command ‘sysctl variable=number’ should do the trick. To view a current setting, use ‘sysctl variable’ (or ‘sysctl -a’ to list everything).
Maximum window size (8MB here, as a reasonable example):

FreeBSD:
kern.ipc.maxsockbuf=8388608

Linux:
net.core.wmem_max=8388608
net.core.rmem_max=8388608

Default window size:

FreeBSD, sending and receiving:
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536

Linux, sending and receiving:
net.core.wmem_default = 65536
net.core.rmem_default = 65536
Window scaling must also be enabled (net.inet.tcp.rfc1323=1 on FreeBSD, net.ipv4.tcp_window_scaling=1 on Linux); this turns on the useful options defined in RFC 1323, which allow the windows to dynamically grow larger than the defaults specified above.
When sending large amounts of data, we can run the operating system out of network buffers, so this should be addressed before attempting to use the above settings. On FreeBSD these buffers are “mbufs,” and more can be made available by raising kern.ipc.nmbclusters. On Linux, the memory pool TCP may use is sized with:
net.ipv4.tcp_mem = 98304 131072 196608
These quick changes will skyrocket TCP performance. Afterwards we were able to run ttcp and attain around 895 Mb/s every time – quite an impressive data rate. There are other options available for adjusting the UDP datagram sizes as well, but we’re mainly focusing on TCP here.
Windows XP / 2000 Server / Server 2003
The magical location for TCP settings in the registry editor is:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
We need to add a registry DWORD named TcpWindowSize under that key, and enter a sufficiently large size. 131400 (make sure you click on decimal) should be enough.
Tcp1323Opts should be set to 3. This enables both rfc1323 scaling and timestamps.
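In .reg form, the two values above look like this (DWORD data is hexadecimal in .reg files, so 131400 decimal becomes 0x20148; the path is the standard TCP/IP parameters key):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"TcpWindowSize"=dword:00020148
"Tcp1323Opts"=dword:00000003
```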
And similarly to Unix, we also want to raise the system-wide ceiling on TCP buffer sizes, via the GlobalMaxTcpWindowSize DWORD in the same key.
One last important note for Windows XP users needs to be made. If you’ve installed Service Pack 2, then there is another likely culprit of poor network performance. As explained in Knowledge Base article 842264, Microsoft says that disabling Internet Connection Sharing after an SP2 install should fix performance issues.
The above tweaks should enable your sufficiently fast server to attain much faster data rates over TCP. If your specific application makes significant use of UDP, then it will be worth looking into similar options relating to UDP datagram sizes. Remember, we obtained close to 900Mb/s with a very fast Pentium 4 machine, server-class motherboard, and quality Intel network card. Results may vary wildly, but adjusting the above settings is a necessary step toward realizing your server’s capabilities.