Posted: May 8th, 2010 | Author: charlie | Filed under: Networking | Tags: cisco, networks, performance, VoIP | No Comments »
WANs often need Quality of Service (QoS) configured to ensure that certain traffic is classified as “more important” than other traffic. Until now, it took a serious Cisco guru to configure a network properly for VoIP if the network was at all bandwidth constrained. AutoQoS, a new IOS feature for Cisco routers, makes deploying VoIP easy, even on busy WAN links. In this article we’ll cover the basics, what AutoQoS does, and some of its limitations.
AutoQoS was Cisco's first whack at simplifying VoIP traffic prioritization. VoIP is especially sensitive to any latency, jitter, or loss, and users will notice problems. To ensure the best possible VoIP call, the network must ensure that lower priority traffic does not interfere with time-sensitive VoIP. AutoQoS can be enabled on both WAN links and Ethernet switches to automatically provide a nice best-practices based template for VoIP prioritization. If you’re lucky enough to have metro Ethernet service, like AT&T ethernet service for example, you should contact your provider to find out if QoS settings on your switches can be duplicated through theirs.
How it Works
QoS allows a router to classify which types of traffic are most important, and ensure that that traffic is passed as quickly as possible. If necessary, other traffic will be queued until the higher priority traffic has had a chance to pass. Before a router can know when to queue versus when to attempt to pass all traffic, it must be configured with bandwidth settings for each link.
Configuring QoS on a Cisco router normally involves a complex series of interactions, which require understanding not only the protocols, but a router’s strange way of associating policies. The basic steps are:
- Use an ACL to define which traffic gets matched
- A class-map classifies matched traffic into classes
- A policy-map assigns priorities to the classes
- The policy-map is applied to the interface, which enables the processing of all packets through the ACL, class-map, and policy-map
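To make the four steps concrete, here is a minimal sketch in IOS syntax; the ACL number, class and policy names, port range, and bandwidth figure are all hypothetical, chosen only to illustrate the flow:

```text
! 1. ACL defines which traffic gets matched (hypothetical RTP port range)
access-list 101 permit udp any any range 16384 32767
! 2. class-map classifies the matched traffic into a class
class-map match-all VOIP-TRAFFIC
 match access-group 101
! 3. policy-map assigns a priority to the class (128 Kbps of priority queue)
policy-map WAN-EDGE
 class VOIP-TRAFFIC
  priority 128
! 4. apply the policy-map to the interface
interface Serial0
 service-policy output WAN-EDGE
```

Every packet leaving Serial0 is then processed through the ACL, class-map, and policy-map, which is exactly the machinery AutoQoS generates for you.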
Each of these “maps” is quite complicated and prone to error, and most sites end up duplicating effort on common problems, like VoIP needing QoS help.
QoS configuration is not simple. It requires understanding the protocols your network interfaces are using, as well as the type of data you’re passing. To configure QoS for VoIP, for example, you must understand how VoIP works. In short, it requires a guru. If you’re like me, you literally giggled out loud the first time you encountered the word, “AutoQoS.”
AutoQoS enables any network administrator to just “turn on” a solid solution for ensuring VoIP is happy. VoIP is the pain point for most organizations, so that’s what Cisco focused on first, and that’s what we’re focusing on here. Given the limited scope of AutoQoS, it’s believable that it works well enough. In reality, QoS configurations generally classify many types of traffic, and then place a priority on each one.
The main benefit of AutoQoS is that administrator training is much quicker. It also means that VoIP deployments often go much smoother, and upgrading WAN links isn’t usually required. Finally, AutoQoS creates templates that can be modified as needed and copied elsewhere for deployment.
Before talking about how to enable AutoQoS, which is literally three commands, let’s talk about where this works best, and what’s required to use AutoQoS.
First and foremost, you can only configure AutoQoS on a few types of router interfaces. These interfaces include:
- PPP or HDLC serial interfaces
- ATM PVCs
- Frame Relay (point-to-point links only)
Cisco Catalyst switches also support an AutoQoS command to prioritize Cisco VoIP phones, but you cannot prioritize generic VoIP protocols using AutoQoS.
Next, there are some limitations with ATM sub-interfaces. If you have a low-speed ATM link (less than 768Kbps), then AutoQoS will only work on point-to-point sub-interfaces. Higher speed ATM PVCs are fully supported though. For standard serial links, AutoQoS is not supported at all on sub-interfaces. A quick litmus test to see if AutoQoS will work on your desired interfaces or not is to verify that the service-policy configuration is supported. If not, you’ll probably have to reconfigure some links.
AutoQoS will not work if an existing QoS configuration exists on an interface. Likewise, when you disable the AutoQoS configuration, any changes you may have made to the template after the initial configuration will be lost.
Bandwidth statements are used by AutoQoS to determine what settings it should use, so remember that after updating bandwidth statements in the future, you have to re-run the AutoQoS commands.
Making it Work
In the most standard situation, where VoIP isn’t performing as it was promised, the network admin can quickly save the day by running the following on the WAN interface:
interface Serial0
bandwidth 256
auto qos voip
If it’s the local network that needs tuning, the following can be run on Catalyst switches (if running Enhanced Images):
auto qos voip cisco-phone
auto qos voip trust
It really couldn’t be easier than that. For the WAN example, we told the router that interface Serial0 has 256 Kbps, and to enable VoIP QoS. The switch example is similar, for Cisco phones.
The neat part about this is that AutoQoS is actually doing more than just generating a configuration for you and forgetting about it. If you run the command show auto qos interface s0, you will see much more than just your standard old interface configuration. It will show that a Virtual Template “interface” has been created, and that a class is applied to the interface. The same output will also show you the configuration of the template and class-map, with an asterisk next to each entry that was generated by AutoQoS. It’s actually keeping track of what was done automatically so that you can learn what AutoQoS is doing. As mentioned previously, however, don’t forget that removing the AutoQoS configuration will destroy all QoS settings on an interface, not just the ones that AutoQoS configured.
Finally, remember to enable QoS on both sides of a WAN link to truly prioritize VoIP packets. And even though AutoQoS is comparatively simple, don’t forget to read through the Cisco documentation before deploying it: the more prepared you are, the easier the deployment will be.
Cisco will hopefully continue this trend of providing Auto features for complicated, but common tasks. AutoQoS for VoIP sure does enable a much larger audience to correctly deploy VoIP over a wide variety of networks.
Posted: March 1st, 2010 | Author: charlie | Filed under: IT Management | Tags: capacity, performance, tuning, virtualization | 3 Comments »
When purchasing server hardware, do you tend to purchase more power than you need, or not enough? Specifying the correct server for your current need is a fine art, and it’s easy to get wrong. Here are some helpful hints and considerations to remember that will ensure you make the right server purchasing decision.
We’re going to focus on standalone (non-blade) servers for the moment, but many aspects are also applicable to blade servers. Blade servers are wonderful for centralized management of the hardware, but the specs of the individual server blades can vary tremendously.
Hardware Management
Want to avoid trudging down to the datacenter late at night, or even worse, across the world if something breaks? Then don’t skimp on the management controller, lights out manager, or whatever the vendor is calling it. Many vendors ship a simple version by default: it may allow serial console access only, for example. Make sure to get the full-featured controller, because even if the hardware is only a few doors down, getting up from your desk should never be necessary.
If you aren’t thinking of switching vendors any time soon, you might think that the management interface will always work the same as it has on all your other servers. Unfortunately, that’s not the case. Sun x86 hardware, for example, has many different hardware management controllers to choose from. The more expensive and feature-rich servers have the better controllers, but don’t make the mistake of thinking the interface never changes. The unfortunate part is that you never know how well it works until you get a server on-site.
Hardware management comes in two forms: IPMI (which most vendors support), and the user interface. The user interface is more often than not a Web-based Java application that provides remote console access. Some are extremely buggy, and others work quite well from all Web browsers. We can’t make a recommendation, though, because these things change often.
Memory
Shucks, this one is a no-brainer: as much as you can afford. Within reason, that is. If you aren’t going to run virtual machines, and this server’s only job is to serve up some simple Web pages, then 16GB of RAM is likely overkill. Likewise, make sure you know what your application can support. Many java applications are limited to a heap size of 2 or 4GB.
It’s also overkill to purchase more than 4GB of RAM if you need to run a 32-bit operating system. Yes, Windows Server does some tricks and can use more than 4GB, but it’s a huge performance hit.
If virtualization is in your future, load up as much as possible. You also want to pay attention to how many DIMM slots the server has. The 8GB DIMMs are horribly expensive now, so you’ll probably want to stick with 4GB sticks. Just remember, if you fill all the slots in the server, the only memory upgrade path is to buy higher capacity DIMMs.
CPU
Do you want to run many threads at an even pace or just a few threads as fast as possible? Sun’s T2 processors aren’t fast by any measure, but they can run many threads at the same speeds consistently. These are ideal for Web servers handling many concurrent requests, but not for workloads that depend on fast single-thread performance.
Will this server be executing a wide variety of processes over and over again, as opposed to just running the same big application server constantly? If so, make sure you pay attention to the amount of cache each core of the CPU has.
For virtualization, you want the fastest multi-core processors available, with the largest amount of L2 cache. Cache is very important as it minimizes the number of times the CPU needs to fetch data from slower RAM. It makes a very noticeable difference on heavily used servers.
Disks, Controllers, and RAID
If you need local storage, do pay attention to the type of disks you’re ordering. A SATA disk is likely to disappoint if you have an IO-heavy workload. SAS and FC disks should perform equally well, since they are both SCSI disks underneath.
Even if you don’t need much local storage, you should always buy a server with a RAID controller that can mirror the operating system disks, unless you’re SAN booting of course. You don’t want the OS to crash just because of a failed disk. Likewise, if you’re keeping tons of local storage for some reason, make sure to get a RAID card that does RAID-5, so that you can at least lose one disk at a time without losing data. If performance is a concern you should really be using iSCSI or SAN storage, but you may also think about a RAID 0+1 configuration to avoid the slower RAID-5 parity calculations.
If you’re attaching to a SAN, make sure to include the correct HBA as well.
Networking
When servers started showing up with two or four gigabit NICs I must admit, I was confused. Why would someone need that many? Aside from large servers that do a lot of network IO, you might also want to separate out your iSCSI traffic from normal Ethernet. It’s also important these days to make sure that the network cards support TOE, or a TCP Offload Engine. This will task the network card with computing TCP checksums, freeing your CPUs for more important things.
In summary, most of these things may seem common sense, but you need to remember to ask all the right questions every time you spec a server. Here’s a good checklist:
- Adequate hardware management controller
- Enough (but not too much) RAM, that’s fast enough, but not faster than the CPU’s front-side bus
- Enough memory slots for expansion, if that seems likely
- Correct CPU for this server’s needs
- RAID-1 for the OS, and (optionally) other RAID levels for other local storage
- FC HBAs?
- Multiple gigabit NICs with TOE capabilities
Posted: February 24th, 2010 | Author: charlie | Filed under: Linux / Unix | Tags: education, linux, monitoring, performance, solaris, tuning | No Comments »
Unix and Linux systems have forever been obtuse and mysterious for many people. They generally don’t have nice graphical utilities for displaying system performance information; you need to know how to coax the information you need. Furthermore, you need to know how to interpret the information you’re given. Let’s take a look at some common system tools that can provide tons of visibility into what the opaque OS is really doing.
Unfortunately, the same tools don’t exist universally across all Unix variants. A few commonly underused ones do, however, and that is what we’ll focus on first.
A common source of “slowness” is disk I/O, or rather the lack of available I/O. On Linux especially, it may be a difficult diagnosis. Often the load average will climb quickly, but without any corresponding processes in top eating much CPU. That’s because Linux counts processes stuck waiting on I/O toward the load average. I’ve seen load numbers in the tens of thousands, on more than one occasion.
The easiest way to see what’s happening to your disks is to run the ‘iostat’ program. Via iostat, you can see how many read and write operations are happening per device, how much CPU is being utilized, and how long each transaction takes. Many arguments are available for iostat, so do spend some time with the man page on your specific system. By default, running ‘iostat’ with no arguments produces a report about disk IO since boot. To get a snapshot of “now” add a numerical argument last, which will prompt iostat to gather statistics for that number of seconds.
Linux will show number of blocks read or written per second, along with some useful CPU statistics. This is one particularly busy server:
avg-cpu: %user %nice %system %iowait %steal %idle
1.36 0.07 5.21 23.80 0.00 69.57
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 18.22 15723.35 643.25 65474958946 2678596632
Notice that iowait is at 23%. This means that 23% of the time, this server is waiting on disk I/O. Solaris iostat output shows a similar thing, just represented differently (iostat -xnz):
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
295.3 79.7 5657.8 211.0 0.0 10.3 0.0 27.4 0 100 d101
134.8 16.4 4069.8 116.0 0.0 3.5 0.0 23.3 0 90 d105
The %b (percent busy) column shows that device d101 is busy servicing transactions 100% of the time. The average service time isn’t good either: disk reads shouldn’t take 27.4ms. Arguably, Solaris’s output is friendlier to parse, since it gives the reads per second in kilobytes rather than blocks. We can quickly calculate that this server is reading about 19KB per read by dividing the kilobytes read per second by the number of reads per second. In short: this disk array is being taxed by large amounts of read requests.
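That per-read size falls straight out of the iostat columns shown above (kr/s divided by r/s); a quick one-liner with awk, using the d101 numbers:

```shell
# average KB per read for d101: kr/s divided by r/s (values from the output above)
awk 'BEGIN { printf "%.1f KB per read\n", 5657.8 / 295.3 }'
# prints: 19.2 KB per read
```

The same arithmetic works for writes (kw/s divided by w/s) if you suspect small, frequent writes instead.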
The ‘vmstat’ program is also universally available, and extremely useful. It, too, provides vastly different information among operating systems. The vmstat utility will show you statistics about the virtual memory subsystem, or to put it simply: swap space. It is much more complex than just swap, as nearly every IO operation involves the VM system when pages of memory are allocated. A disk write, a network packet send, and the obvious “program allocates RAM” all impact what you see in vmstat.
Running vmstat with the -p argument will print out statistics about disk IO. In Solaris you get some disk information anyway, as seen below:
 kthr      memory            page            disk          faults      cpu
 r b w   swap    free   re   mf  pi po fr de sr m0 m1 m2 m7   in     sy    cs us sy id
 0 0 0 7856104 526824  386 2401   0  0  0  0  0  3  0  0  0 16586  22969 12576  8  9 83
 1 0 0 7851344 522016   18  678  32  0  0  0  0  2  0  0  0 13048  11737 10197  7  6 86
 0 0 0 7843584 514128   76 3330 197  0  0  0  0  2  0  0  0  4762 131492  4441 16  8 76
A subtle but important difference between Solaris and Linux is that Solaris will start scanning for pages of memory that can be freed before it will actually start swapping RAM to disk. The ‘sr’ column, scan rate, will start increasing right before swapping takes place, and continue until some RAM is available. The normal things are available in all operating systems; these include: swap space, free memory, pages in and out (careful, this doesn’t mean swapping is happening), page faults, context switches, and some CPU idle/system/user statistics. Once you know how to interpret these items, you quickly learn to infer what they indicate about the usage of your system.
The two main programs for finding “slowness” are therefore iostat and vmstat. Before the obligatory tangent into “what Dtrace can do for you,” here are a few other tools that no Unix junkie should leave home without:
- lsof: lists open files (including network ports) for all processes
- netstat: lists all sockets in use by the system
- mpstat: shows CPU statistics (including IO), per-processor
We cannot talk about system visibility without mentioning Dtrace. Invented by Sun, Dtrace provides dynamic tracing of everything about a system. Dtrace gives you the ability to ask any arbitrary question about the state of a system, which works by calling “probes” within the kernel. That sounds intimidating, doesn’t it?
Let’s say that we wanted to know what files were being read or written on our Linux server that has a high iowait percentage. There’s simply no way to know. Let’s ask the same question of Solaris, and instead of learning Dtrace, we’ll find something useful in the Dtrace ToolKit. In the kit, you’ll find a few neat programs like iosnoop and iotop, which will tell you which processes are doing all the disk IO operations. Neat, but we really want to know what files are being accessed so much. In the FS directory, the rfileio.d script will provide this information. Run it, and you’ll see every file that’s read or written, and cache hit statistics. There’s no way to get this information in other Unixes, and this is just one simple example of how Dtrace is invaluable.
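You don’t even need the ToolKit for a first taste. As a sketch (run as root on Solaris), a one-liner that counts read system calls by process name, which is often enough to finger the culprit behind heavy read traffic:

```text
# count read(2) calls by process name; Ctrl-C prints the totals
dtrace -n 'syscall::read:entry { @[execname] = count(); }'
```

The @[execname] aggregation is the idiom that makes Dtrace one-liners so compact: every probe firing increments a per-process counter, and the totals print when you stop tracing.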
The script itself is about 90 lines, inclusive of comments, but the bulk of it is dealing with cache statistics. An excellent way to start learning Dtrace is to simply read the Dtrace ToolKit scripts.
Don’t worry if you’re not a Solaris admin: Dtrace is coming soon to a FreeBSD near you. SystemTap, a replica of Dtrace, will be available for Linux soon as well. Until then, and even afterward, the above-mentioned tools will still be invaluable. If you can quickly get disk IO statistics and see if you’re swapping, the majority of system performance problems are solved. Dtrace also provides amazing application tracing functionality, and if you’re looking at the application itself, you already know the slowness isn’t likely being caused by a system problem.
Soon, I’ll publish a few Dtrace tutorials.
Some things have surely been left out – discuss below!
Posted: February 15th, 2010 | Author: charlie | Filed under: Linux / Unix | Tags: linux, performance, swap, tuning, vmm | No Comments »
Virtual memory is one of the most important, and accordingly confusing, pieces of an operating system. Understanding the basics of virtual memory is a requisite to understanding operating system performance. Beyond the basics, a deeper understanding allows a systems administrator to interpret system profiling tools better, leading to quicker troubleshooting and better decisions.
The concept of virtual memory is generally taught as though it’s only used for extending the amount of physical RAM in a system. Indeed, paging to disk is important, but virtual memory is used by nearly every aspect of an operating system.
In addition to swapping, virtual memory is used to manage all pages of memory, which incidentally are required for file caching, process isolation, and even network communication. Anything that queues data, you can be assured, traverses the virtual memory system. Depending on a server’s role, virtual memory functionality may not be optimal. An administrator can dramatically improve overall system performance by adjusting certain virtual memory manager settings.
To optimally configure your Virtual Memory Manager (VMM), it’s necessary to understand how it does its job. We’re using Linux for example’s sake, but the concepts apply across the board, though some slight architectural differences will exist between the Unixes.
Nearly every VMM interaction involves the MMU, or Memory Management Unit, excluding the disk subsystem. The MMU allows the operating system to access memory through virtual addresses by using data structures to track these translations. Its main job is to translate these virtual addresses into physical addresses, so that the right section of RAM is accessed.
The Zoned Buddy Allocator interacts directly with the MMU, providing valid pages when the kernel asks for them. It also manages lists of pages and keeps track of different categories of memory addresses.
The Slab Allocator is another layer in front of the Buddy Allocator, and provides the ability to create caches of memory objects. On x86 hardware, pages of memory must be allocated in 4KB blocks, but the Slab Allocator allows the kernel to store objects of different sizes, and will manage and allocate real pages appropriately.
Finally, a few kernel tasks run to manage specific aspects of the VMM. The bdflush task manages block device pages (disk IO), and kswapd handles swapping pages to disk. Pages of memory are either Free (available to allocate), Active (in use), or Inactive. Inactive pages of memory are either dirty or clean, depending on whether they have been selected for removal yet. An inactive dirty page is no longer in use, but is not yet available for re-use. The operating system must scan for dirty pages and decide to deallocate them. After it has been guaranteed sync’d to disk, an inactive page may be “clean,” or ready for re-use.
Tunable parameters may be adjusted in real-time via the proc file system, but to persist across a reboot, /etc/sysctl.conf is the preferred method. Parameters can be entered in real-time via the sysctl command, and then recorded in the configuration file for reboot persistence.
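As a sketch of the two approaches, using swappiness as the example (the value of 10 here is purely illustrative, not a recommendation):

```text
# apply immediately via the sysctl command...
sysctl -w vm.swappiness=10
# ...or equivalently via the proc file system:
echo 10 > /proc/sys/vm/swappiness

# then persist across reboots by adding the same line to /etc/sysctl.conf:
vm.swappiness = 10
```

A common habit is to experiment through /proc or sysctl -w until a value proves itself, then record it in sysctl.conf.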
You can adjust everything from the interval pages are scanned to the amount of memory to reserve for pagecache use. Let’s see a few examples.
Often we’ll want to optimize a system for IO performance. A busy database server, for example, is generally only going to run the database, and it doesn’t matter if the interactive user experience is good or not. If the system doesn’t require much memory for user applications, decreasing the bdflush tunables is beneficial. The specific parameters being adjusted are too lengthy to explain here, but definitely look into them if you wish to adjust the values further. They are fully explained in vm.txt, usually located at: /usr/src/linux/Documentation/sysctl/vm.txt.
In general, an IO-heavy server will benefit from setting the following values in sysctl.conf:
vm.bdflush="100 5000 640 2560 150 30000 5000 1884 2"
The pagecache values control how much memory is used for pagecache. The amount of pagecache allowed translates directly to how many programs and open files can be held in memory.
The three tunable parameters with pagecache are:
- Min: the minimum amount of memory reserved for pagecache
- Borrow: the percentage of pages used in the process of reclaiming pages
- Max: percentage at which kswapd will only page pagecache pages; once it falls below, it can swap out process pages again
On a file server, we’d want to increase the amount of pagecache available, so that data isn’t moved to disk as often. Using vm.pagecache="10 50 100" provides more caching, allowing larger and less frequent disk writes for file IO intensive work loads.
On a single-user machine, say your workstation, a large number will keep pages in memory, allowing programs to execute faster. Once the upper limit is reached, however, you will start swapping constantly.
Conversely, a server with many users that frequently executes many different programs will not want high amounts of pagecache. The pagecache can easily eat up available memory if it’s too large, so something like vm.pagecache="10 20 30" is a good compromise.
Finally, the swappiness and vm.overcommit parameters are also very powerful. The overcommit setting allows more memory to be allocated than there is physical RAM to back it. Programs that have a habit of trying to allocate many gigabytes of memory are a hassle, and frequently they don’t use nearly that much memory. Upping the overcommit factor will allow these allocations to happen, but if the application really does use all the RAM, you’ll be swapping like crazy in no time (or worse: running out of swap).
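On Linux the relevant knob is vm.overcommit_memory; a sketch of a sysctl.conf fragment, with the caveat that mode 1 trades safety for allocation success and should be chosen deliberately:

```text
# 0 = heuristic overcommit (the default), 1 = always allow allocations,
# 2 = strict accounting against swap + a ratio of RAM
vm.overcommit_memory = 1
```

Mode 1 suits applications that reserve huge address spaces they never fully touch; mode 2 is the conservative choice when an out-of-memory kill is unacceptable.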
The swappiness concept is heavily debated. If you want to decrease the amount of swapping done by the system, just echo a small number in the range 0-100 into /proc/sys/vm/swappiness. You don’t generally want to play with this, as it is more mysterious and non-deterministic than the advanced parameters described above. In general, you want applications to be able to swap out idle pages rather than hold memory for no reason. Task-specific servers, where you know the amount of RAM and the application requirements, are best suited for swappiness tuning (using a low number to decrease swapping).
These parameters all require a bit of testing, but in the end, you can dramatically increase the performance of many types of servers. The common case of disappointing disk performance stands to gain the most: give the settings a try before going out and buying a faster disk array.