Now, when I discovered this, it *really* blew me away. Linux 2.2 comes with everything to manage bandwidth in ways comparable to high-end dedicated bandwidth management systems.
Linux even goes far beyond what Frame and ATM provide.
The two basic units of Traffic Control are filters and queues. Filters place traffic into queues, and queues gather traffic and decide what to send first, send later, or drop. There are several flavours of filters and queues.
The most common filters are fwmark and u32: the first lets you use the Linux netfilter code to select traffic, and the second allows you to select traffic based on ANY header. The most notable queue is the Class Based Queue (CBQ). CBQ is a super-queue, in that it contains other queues (even other CBQs).
It may not be immediately clear what queueing has to do with bandwidth management, but it really does work.
For our frame of reference, I have modelled this section on an ISP where I learned the ropes, so to speak, Casema Internet in The Netherlands. Casema, which is actually a cable company, has internet needs both for their customers and for their own office. Most corporate computers there have access to the internet. In reality, they have lots of money to spend and do not use Linux for bandwidth management.
We will explore how our ISP could have used Linux to manage their bandwidth.
With queueing we determine the order in which data is *sent*. It is important to realise this: we can only shape data that we transmit. How does changing the order determine the speed of transmission? Imagine a cash register which is able to process 3 customers per minute.
People wishing to pay go stand in line at the 'tail end' of the queue. This is 'fifo queueing'. Let's suppose however that we let certain people always join in the middle of the queue, instead of at the end. These people spend a lot less time in the queue and are therefore able to shop faster.
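To put numbers on the cash register, here is a small sketch. The function name wait_seconds and the queue length of 12 are made up for illustration; only the rate of 3 customers per minute comes from the text.

```shell
# Toy model of the cash register: RATE customers are served per minute.
# wait_seconds POSITION RATE prints how many seconds pass before the
# customer standing at POSITION (1 = next in line) has been served.
wait_seconds() {
    position=$1
    rate=$2
    echo $(( position * 60 / rate ))
}

wait_seconds 12 3   # joins at the tail of a 12-person queue
wait_seconds 6 3    # is allowed to join in the middle
```

This prints 240 and 120: the register is no faster, but the customer who jumps the queue is done in half the time. That is all queueing does for a network link too.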
With the way the internet works, we have no direct control of what people send us. It's a bit like your (physical!) mailbox at home. There is no way you can influence the world to modify the amount of mail they send you, short of contacting everybody.
However, the internet is mostly based on TCP/IP which has a few features that help us. TCP/IP has no way of knowing the capacity of the network between two hosts, so it just starts sending data faster and faster ('slow start') and when packets start getting lost, because there is no room to send them, it will slow down.
This is the equivalent of not reading half of your mail, and hoping that people will stop sending it to you. With the difference that it works for the Internet :-)
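The 'faster and faster until packets get lost' behaviour can be caricatured in a few lines. This is a deliberately crude sketch and not real TCP: the function name simulate_cwnd, the doubling and halving steps, and the capacity of 10 packets per round trip are all assumptions chosen to keep the loop short.

```shell
# simulate_cwnd CAPACITY ROUNDS
# Prints the number of packets sent in each round trip. The sender doubles
# its window while nothing is lost and halves it after a round in which it
# sent more than the link could carry.
simulate_cwnd() {
    capacity=$1
    rounds=$2
    cwnd=1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        echo "$cwnd"
        if [ "$cwnd" -gt "$capacity" ]; then
            cwnd=$(( cwnd / 2 ))   # packets were dropped: back off
        else
            cwnd=$(( cwnd * 2 ))   # everything arrived: speed up
        fi
        i=$(( i + 1 ))
    done
}

simulate_cwnd 10 8
```

The output is 1 2 4 8 16 8 16 8, one number per line. The sender never learns the capacity, it only reacts to loss, and that reaction is the handle a router dropping or delaying packets has on a remote sender.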
FIXME: explain that normally, ACKs are used to determine speed
[The Internet] ---<E3, T3, whatever>--- [Linux router] --- [Office+ISP]
                                    eth1               eth0
Now, our Linux router has two interfaces which I shall dub eth0 and eth1. Eth1 is connected to our router which moves packets to and from our fibre link.
Eth0 is connected to a subnet which contains both the corporate firewall and our network head ends, through which we can connect to our customers.
Because we can only limit what we send, we need two separate but possibly very similar sets of rules. By modifying queueing on eth0, we determine how fast data gets sent to our customers, and therefore how much downstream bandwidth is available for them. Their 'download speed' in short.
On eth1, we determine how fast we send data to The Internet, and thus how fast our users, both corporate and commercial, can upload data.
CBQ enables us to generate several classes, and even classes within classes. The larger divisions might be called 'agencies'. Within these classes may be things like 'bulk' or 'interactive'.
For example, we may have a 10 megabit internet connection to 'the internet' which is to be shared by our customers, and our corporate needs. We should not allow a few people at the office to steal away large amounts of bandwidth which we should sell to our customers.
On the other hand, our customers should not be able to drown out the traffic from our field offices to the customer database.
Previously, one way to solve this was to use Frame Relay or ATM and create virtual circuits. This works, but Frame is not very fine grained, ATM is terribly inefficient at carrying IP traffic, and neither has a standardised way to segregate different types of traffic into different VCs.
However, if you do use ATM, Linux can happily perform deft acts of fancy traffic classification for you too. Another way is to order separate connections, but this is not very practical, not very elegant, and still does not solve all your problems.
CBQ to the rescue!
Clearly we have two main classes, 'ISP' and 'Office'. Initially, we really don't care what the divisions do with their bandwidth, so we don't further subdivide their classes.
We decide that the customers should always be guaranteed 8 megabits of downstream traffic, and our office 2 megabits.
Setting up traffic control is done with the iproute2 tool tc.
# tc qdisc add dev eth0 root handle 10: cbq bandwidth 10Mbit avpkt 1000
Ok, lots of numbers here. What has happened? We have configured the 'queueing discipline' of eth0. With 'root' we denote that this is the root discipline. We have given it the handle '10:'. We want to do CBQ, so we mention that on the command line as well. We tell the kernel that it can allocate 10Mbit and that the average packet size is somewhere around 1000 octets.
FIXME: Double check with Alexey that the built in cell calculation is sufficient.
FIXME: With a 1500 mtu, the default cell is calculated same as the old example.
FIXME: I checked the sources (userspace and kernel), so we should be safe omitting it.
Now we need to generate our root class, from which all others descend:
# tc class add dev eth0 parent 10:0 classid 10:1 cbq bandwidth 10Mbit rate \
10Mbit allot 1514 weight 1Mbit prio 8 maxburst 20 avpkt 1000
Even more numbers to worry about - the Linux CBQ implementation is very generic. With 'parent 10:0' we indicate that this class descends from the root of qdisc handle '10:' we generated earlier. With 'classid 10:1' we name this class.
We really don't tell the kernel a lot more, we just generate a class that completely fills the available device. We also specify that the MTU (plus some overhead) is 1514 octets. We also 'weigh' this class with 1Mbit - a tuning parameter.
We now generate our ISP class:
# tc class add dev eth0 parent 10:1 classid 10:100 cbq bandwidth 10Mbit rate \
8Mbit allot 1514 weight 800Kbit prio 5 maxburst 20 avpkt 1000 \
bounded
We allocate 8Mbit, and indicate that this class must not exceed this by adding the 'bounded' parameter. Otherwise this class would have started borrowing bandwidth from other classes, something we will discuss later on.
To top it off, we generate the root Office class:
# tc class add dev eth0 parent 10:1 classid 10:200 cbq bandwidth 10Mbit rate \
2Mbit allot 1514 weight 200Kbit prio 5 maxburst 20 avpkt 1000 \
bounded
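In all the class definitions in this section the 'weight' happens to be one tenth of the 'rate'. That is an observation about these examples, not a documented rule, but it makes the lines easy to generate. The sketch below assumes a helper called cbq_class (my name, not part of iproute2) that takes the rate in Kbit, so 8Mbit is written as 8000. It prints the commands by default; set TC=tc to really run them, which needs root.

```shell
TC="${TC:-echo tc}"   # dry run by default; set TC=tc to apply (needs root)

# cbq_class DEV PARENT CLASSID RATE_KBIT [FLAG]
# Emits one 'tc class add' line in the style used above. The weight is
# derived as rate/10, following the pattern of the examples in this section.
cbq_class() {
    dev=$1 parent=$2 classid=$3 rate=$4 flag=${5:-}
    weight=$(( rate / 10 ))
    $TC class add dev "$dev" parent "$parent" classid "$classid" \
        cbq bandwidth 10Mbit rate "${rate}Kbit" allot 1514 \
        weight "${weight}Kbit" prio 5 maxburst 20 avpkt 1000 $flag
}

cbq_class eth0 10:1 10:100 8000 bounded   # ISP
cbq_class eth0 10:1 10:200 2000 bounded   # Office
```

Keeping the numbers in one place makes it harder to change a rate and forget the matching weight.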
To make this a bit clearer, a diagram which shows our classes:
+-------------[10: 10Mbit]----------------------+
|+-------------[10:1 root 10Mbit]--------------+|
||                                             ||
|| +-[10:100 8Mbit]-+ +--[10:200 2Mbit]-----+  ||
|| |                | |                     |  ||
|| |      ISP       | |       Office        |  ||
|| |                | |                     |  ||
|| +----------------+ +---------------------+  ||
||                                             ||
|+---------------------------------------------+|
+-----------------------------------------------+
Ok, now we have told the kernel what our classes are, but not yet how to manage the queues. We do this presently, in one fell swoop for both classes.
# tc qdisc add dev eth0 parent 10:100 sfq quantum 1514b perturb 15
# tc qdisc add dev eth0 parent 10:200 sfq quantum 1514b perturb 15
In this case we install the Stochastic Fairness Queueing discipline (sfq), which is not quite fair, but works well up to high bandwidths without burning up CPU cycles. There are other queueing disciplines available which are better, but need more CPU. The Token Bucket Filter is often used.
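At this point it is worth looking at what the kernel thinks we configured. A minimal sketch, assuming a wrapper function show_tree (my name): 'tc -s qdisc show' and 'tc -s class show' are ordinary tc invocations, with -s asking for statistics. As written it only prints the commands; set TC=tc to query a real interface.

```shell
TC="${TC:-echo tc}"   # dry run by default; set TC=tc to query the kernel

# show_tree DEV: list the queueing disciplines and classes on DEV,
# including byte and packet counters (-s).
show_tree() {
    $TC -s qdisc show dev "$1"
    $TC -s class show dev "$1"
}

show_tree eth0
```

On a live system the counters per class are a quick way to see whether traffic is ending up where you expected.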
Now there is only one thing left to do and that is to explain to the kernel which packets belong to which class. Initially we will do this natively with iproute2, but more interesting applications are possible in combination with netfilter.
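# tc filter add dev eth0 parent 10:0 protocol ip prio 25 u32 match ip dst \
150.151.23.24 flowid 10:200
# tc filter add dev eth0 parent 10:0 protocol ip prio 100 u32 match ip dst \
150.151.0.0/16 flowid 10:100
Here it is assumed that our office hides behind a firewall with IP address 150.151.23.24 and that all our other IP addresses should be considered to be part of the ISP. Filters with a lower 'prio' number are consulted first, so the rule for the office address must have a lower number than the rule for the whole /16, which would otherwise catch the office traffic as well.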
The u32 match is a very simple one - more sophisticated matching rules are possible when using netfilter to mark our packets, which we can then match on in tc.
Now that we have fairly divided the downstream bandwidth, we need to do the same for the upstream. For brevity's sake, all in one go:
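As a sketch of the netfilter route, the following expresses the office rule with a packet mark instead of u32. Several things here are assumptions and not from the text: that the kernel has iptables with a mangle table offering a POSTROUTING chain, the mark value 2, the prio 10, and the function name mark_office. The 'handle N fw' form is how the fw classifier selects packets carrying mark N. Both commands are only printed unless you set TC=tc and IPTABLES=iptables.

```shell
TC="${TC:-echo tc}"                     # dry run by default
IPTABLES="${IPTABLES:-echo iptables}"   # likewise; set to 'iptables' to apply

# mark_office MARK
# 1. netfilter tags every packet leaving eth0 for the office firewall;
# 2. tc steers packets carrying that tag into the Office class 10:200.
mark_office() {
    mark=$1
    $IPTABLES -t mangle -A POSTROUTING -o eth0 -d 150.151.23.24 \
        -j MARK --set-mark "$mark"
    $TC filter add dev eth0 parent 10:0 protocol ip prio 10 \
        handle "$mark" fw flowid 10:200
}

mark_office 2
```

The gain is that the selection can use anything netfilter can match on, while tc only has to look at the mark.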
# tc qdisc add dev eth1 root handle 20: cbq bandwidth 10Mbit avpkt 1000
# tc class add dev eth1 parent 20:0 classid 20:1 cbq bandwidth 10Mbit rate \
10Mbit allot 1514 weight 1Mbit prio 8 maxburst 20 avpkt 1000
# tc class add dev eth1 parent 20:1 classid 20:100 cbq bandwidth 10Mbit rate \
8Mbit allot 1514 weight 800Kbit prio 5 maxburst 20 avpkt 1000 \
bounded
# tc class add dev eth1 parent 20:1 classid 20:200 cbq bandwidth 10Mbit rate \
2Mbit allot 1514 weight 200Kbit prio 5 maxburst 20 avpkt 1000 \
bounded
# tc qdisc add dev eth1 parent 20:100 sfq quantum 1514b perturb 15
# tc qdisc add dev eth1 parent 20:200 sfq quantum 1514b perturb 15
# tc filter add dev eth1 parent 20:0 protocol ip prio 25 u32 match ip src \
150.151.23.24 flowid 20:200
# tc filter add dev eth1 parent 20:0 protocol ip prio 100 u32 match ip src \
150.151.0.0/16 flowid 20:100
In our hypothetical case, we will find that even when the ISP customers are mostly offline (say, at 8AM), our office still gets only 2Mbit, which is rather wasteful.
By removing the 'bounded' statements, classes will be able to borrow bandwidth from each other.
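The eth0 and eth1 setups differ only in the device name, the handle, and whether we match on destination or source, and the borrowing variant differs only in the 'bounded' keyword. A sketch that captures this in one function follows. The name shape and its arguments are my own. The filter for the single office address is given the lower prio number so that it is tried before the /16 rule. Commands are printed, not executed, unless TC=tc is set.

```shell
TC="${TC:-echo tc}"   # dry run by default; set TC=tc to apply (needs root)

# shape DEV HANDLE DIR [FLAG]
#   DIR  is 'dst' on the side facing our customers, 'src' on the Internet side
#   FLAG is 'bounded' to forbid borrowing, or left out to allow it
shape() {
    dev=$1 h=$2 dir=$3 flag=${4:-}
    $TC qdisc add dev "$dev" root handle "${h}:" cbq bandwidth 10Mbit avpkt 1000
    $TC class add dev "$dev" parent "${h}:0" classid "${h}:1" cbq bandwidth 10Mbit \
        rate 10Mbit allot 1514 weight 1Mbit prio 8 maxburst 20 avpkt 1000
    $TC class add dev "$dev" parent "${h}:1" classid "${h}:100" cbq bandwidth 10Mbit \
        rate 8Mbit allot 1514 weight 800Kbit prio 5 maxburst 20 avpkt 1000 $flag
    $TC class add dev "$dev" parent "${h}:1" classid "${h}:200" cbq bandwidth 10Mbit \
        rate 2Mbit allot 1514 weight 200Kbit prio 5 maxburst 20 avpkt 1000 $flag
    $TC qdisc add dev "$dev" parent "${h}:100" sfq quantum 1514b perturb 15
    $TC qdisc add dev "$dev" parent "${h}:200" sfq quantum 1514b perturb 15
    # The single-host rule gets the lower prio number so it is tried first.
    $TC filter add dev "$dev" parent "${h}:0" protocol ip prio 25 u32 \
        match ip "$dir" 150.151.23.24 flowid "${h}:200"
    $TC filter add dev "$dev" parent "${h}:0" protocol ip prio 100 u32 \
        match ip "$dir" 150.151.0.0/16 flowid "${h}:100"
}

shape eth0 10 dst   # downstream, classes may borrow
shape eth1 20 src   # upstream, classes may borrow
```

If a configuration is already in place, 'tc qdisc del dev eth0 root' clears it (classes and filters included) so that the add commands do not collide with what is there. Passing 'bounded' as a fourth argument gives back the strict 8/2 split.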
Some classes may not wish to lend their bandwidth to other classes. Two rival ISPs on a single link may not want to offer each other freebies. In such a case, you can add the keyword 'isolated' at the end of your 'tc class add' lines.
FIXME: completely untested suppositions! Try this!
We can go further than this. Should the employees at the office decide to all fire up their 'napster' clients, it is still possible that our database runs out of bandwidth. Therefore, we create two subclasses, 'Human' and 'Database'.
Our database always needs 500Kbit, so we have 1.5Mbit left for Human consumption.
We now need to create two new classes, within our Office class:
# tc class add dev eth0 parent 10:200 classid 10:250 cbq bandwidth 10Mbit rate \
500Kbit allot 1514 weight 50Kbit prio 5 maxburst 20 avpkt 1000 \
bounded
# tc class add dev eth0 parent 10:200 classid 10:251 cbq bandwidth 10Mbit rate \
1500Kbit allot 1514 weight 150Kbit prio 5 maxburst 20 avpkt 1000 \
bounded
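One possible way to carry on from here, offered as an untested sketch and not as the finished example: give each subclass its own leaf queue, and add filters that pick out database traffic before the general office rule gets a look. How the database traffic is recognised is not stated anywhere above, so the destination port used here (DB_PORT, default 5432) is purely hypothetical, as are the prio numbers 10 and 20 and the function name finish_office. Nothing is executed unless TC=tc is set.

```shell
TC="${TC:-echo tc}"          # dry run by default; set TC=tc to apply (needs root)
DB_PORT="${DB_PORT:-5432}"   # hypothetical database port, not from the text

finish_office() {
    # Leaf queues for the two new subclasses, same sfq settings as before.
    $TC qdisc add dev eth0 parent 10:250 sfq quantum 1514b perturb 15
    $TC qdisc add dev eth0 parent 10:251 sfq quantum 1514b perturb 15
    # Database traffic is tested first (lowest prio number) ...
    $TC filter add dev eth0 parent 10:0 protocol ip prio 10 u32 \
        match ip dst 150.151.23.24 match ip dport "$DB_PORT" 0xffff flowid 10:250
    # ... and whatever else goes to the office lands in 'Human'.
    $TC filter add dev eth0 parent 10:0 protocol ip prio 20 u32 \
        match ip dst 150.151.23.24 flowid 10:251
}

finish_office
```

Both new filters use prio numbers below those of the existing office and ISP rules, so they are consulted before them. The same would have to be repeated on eth1 with 'src' matches for the upstream direction.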
FIXME: Finish this example!
FIXME: document TEQL