Debian ready for your netbook?

April 28th, 2012 by Gary Richards - Categories: Linux, Operating Systems

A few years ago I caved and began using Ubuntu on various machines just because it was quick and easy. No more compiling stuff constantly on my Gentoo machines. Far fewer issues trying to get totally random packages that I might want to use than I’ve had in the past with Fedora/Red Hat…

Recently, with various upgrades and now the new Ubuntu LTS Precise, I found my netbook (a 1.6GHz Intel Atom with 2GB RAM) was starting to struggle a little bit. The Unity interface was pretty slow, so I tried switching to xfce. Things were better, but everything still seemed slower than it needed to be. From boot through login to the stage where I was able to type into a terminal was now about 45 seconds. I felt like I was running Windows on a significantly underpowered machine again! By the time I’d logged in and was using my terminal, I was using around 800MB of RAM. Once I fired up my browser, mail, etc. I was using almost all of my 2GB of RAM. So I began to look for alternatives.

The obvious choice, seeing as I’ve become so much more comfortable with Debian style systems from all the Ubuntu work I’ve done, is Debian. I use it on a whole bunch of servers, but decided to see what I could do with it on my netbook.

I set out to give myself something similar to what I was using on Ubuntu so that I had some sort of comparison. So the plan was a graphical login manager, xfce as the desktop environment and the fairly standard apps that we’re all so used to, plus the simplicity of things like network-manager and related applets.

As you may expect, Ubuntu’s and Debian’s setups are still quite similar, therefore most of the packages in one are available for the other. It didn’t take long to get Debian Wheezy installed to a stage where I was presented with the still rather unattractive xdm and could log in to xfce. I have wired networking, so I installed Iceweasel (yeah Iceweasel, it’s what everyone else calls Firefox) and as expected it worked fine (I’m even writing this post on it!). I could go and install Thunderbird (or whatever its name is in Debian) or Evolution but I’m sure they will work fine too. So I set out to see if some of the other features that I use most days work.

Suspend… nope. Things happen, the screensaver locks, but no suspend.

Wireless… I already knew from past experience that I would need to make sure various firmware was installed, so I won’t go into it too much. Basically the non-free part of the wheezy distribution contains various packages with firmware in the name. Install the relevant one and away you go. I had a wlan0 device, and iwlist scan showed me some access points.

Network-manager and its applet… Easy enough to install and network-manager itself seems to be doing something. But I can’t actually configure a wireless network within the applet because whenever I try it tells me:

(32) Not authorized to control networking.

I thought it was as simple as forgetting to add myself to the netdev group. But it wasn’t. I could log in as root and configure things, so there was definitely something permission related. I searched and found various Debian bugs with people having similar issues. Most suggestions seemed to be related to your login manager not talking to ConsoleKit correctly. There was one suggestion to add a small config snippet somewhere under the PolicyKit config. But that was like voodoo to me. I tried it and I was able to control networking. I didn’t really like it but thought I would live with it for the moment as it meant that I could continue my work on the sofa rather than at my desk!

I then tried out bluetooth and had a similar issue. The bluetooth applet couldn’t start due to the permissions on /dev/rfkill. Further searching… Ok, similar suggestions. The permissions of /dev/rfkill are ‘wrong’ but various things should take care of this for you so long as the whole policykit/consolekit stuff is working. It seemed that more voodoo would be required, so I left it for now. I did look at ck-list-sessions and, just like everything I read, it told me:

ACTIVE = false

People appear to have raised relevant bugs against things like xdm because they don’t support the necessary things properly. So there’s not much I can do. One solution is to install gdm, but when I looked at that it pulls in almost all of gnome by the looks of it. So I’d rather not. My ultimate goal was to have as little unnecessary stuff as possible after all!

At this stage I started to get a little fed up. I’d spent most of the morning fighting stuff that should just work. I’m sure I was missing something, so I went to shut down the machine… But no, I could log out, but the shutdown and restart buttons were greyed out. Once logged out, I could see a plain xdm login screen, with no buttons or anything other than a box to type my username.

I gave up for a bit and came back to it today. I started looking at alternative lightweight display managers (or login managers, whatever you want to call them) and came across lightdm. It seems to be nice and simple, it’s a bit nicer looking than xdm and most of all it looked like it had a button I could click to shut my machine down after I logged out of xfce.

Installation was as easy as ever: 4 additional packages on top of what I already had installed, confirming that I wanted to use it instead of xdm, and that was it. I rebooted to try it (I probably could have switched runlevels, but remember I’ve got a machine that now boots in 10 seconds rather than 45 so it’s a novelty again!).

I was presented with the login box and all seemed well. I logged in, great. Everything still works… let’s log out. Erm? What happened? Now I can shut down AND restart from within xfce? Why?

So I began to wonder… does lightdm actually support the whole policykit/consolekit/whatever stuff that xdm doesn’t? I quickly found that I could now start the bluetooth applet and it no longer complained about not being able to access rfkill. I tried to suspend… and the netbook suspended. I reverted the voodoo config addition that I had made to allow me to configure wifi access points. I rebooted again (yep, still all about the novelty) and wow… I could configure access points with the network manager applet.

So, I’ve wasted 3 or so hours trying to work out why everything wasn’t working, when it seems that xdm itself is the problem as it doesn’t correctly support the policykit/consolekit stuff.

The last few things I tried out were:

Screen brightness – check, works like a charm

Audio – I can control volume with the volume keys, mute with the mute key. That seems to be enough for now.

The only outstanding things now are that I can’t scroll with the right side of my trackpad yet, and some other special keyboard keys don’t work, such as the suspend key and the wifi kill switch (not the software one that I fixed by switching to lightdm, the ‘hardware’ one).

That’s about it. Oh, all in all, once logged in my netbook now uses 200MB of RAM. With Firefox open with about 10 tabs and a terminal with a few tabs, I’m currently just pushing past 400MB.

Oh and my netbook feels like a new one once again!

So my conclusion… Yes, get rid of Ubuntu and get Debian on your netbook! Just remember to use lightdm instead of xdm if you’re not planning to use gnome and all should mostly be well!

LVS/IPVS NAT without the LVS server being the default gateway

April 13th, 2012 by Gary Richards - Categories: IPVS / LVS, Linux, Operating Systems

I’m in a situation where we have a whole bunch of virtual servers configured (some on the same IP addresses, but different ports).

Some of the services that we load balance serve clients on the internet, some serve other services on the same network, and some serve both.

The LVS machines aren’t the default gateway for our network, so using LVS in NAT mode would usually be out of the question. If we use DR mode, everything would usually work. However, as mentioned above, some of the virtual servers we have configured are on the same IP addresses.

So imagine machine A which has application A and application B running on it.
There are virtual servers X and Y on the LVS machine that are configured with application A and B respectively.
As it’s DR mode, we put the IP address used for the virtual servers X and Y on loopback of machine A.
Everything works fine until application A needs to talk to the virtual server for application B.

What happens is that application A tries to talk to the IP address of the virtual server but, instead of talking to the LVS box, it talks to the IP on loopback of its own machine and reaches application B that way. This is fine and it works… but what happens if application B on this machine is down? Everything else on the network can still talk to virtual server Y, but application A cannot.

One obvious solution would be to separate every application onto its own real server and to make sure that every virtual server is on an entirely different IP address. Although possible, we didn’t want to go down this route for a few reasons.

However, it did occur to us that if it were possible to use NAT mode and force responses from any application that’s part of a virtual server back through the machine that owns the IP used by the virtual server for this application, then in theory we could use NAT mode for everything.

It turns out there’s a fairly nice way to do this. I originally tested it using iproute2 and iptables on our real servers; the way we’ll be implementing it is with policy based routing on some layer 3 switches.

To do it in Linux there’s a little magic involved, but nothing too serious.

With iproute2 you can configure additional routing tables. You can then specify rules that govern which traffic uses which routing table.

Firstly edit /etc/iproute2/rt_tables and add a new routing table (the line marked with +):

 #
 # reserved values
 #
 255     local
 254     main
 253     default
 0       unspec
 #
 # local
 #
 #1      inr.ruhep
+100     lb
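
If you would rather script that edit than do it by hand, something like the following should work. It is only a sketch: add_rt_table is a helper name I have made up for illustration, and the optional third argument exists purely so it can be tried against a scratch file first.

```shell
#!/bin/sh
# Sketch: add "100 lb" to rt_tables only if it is not already there.
# add_rt_table is an illustrative helper, not part of iproute2.
add_rt_table() {
    # $1 = table id, $2 = table name, $3 = file (optional, defaults to the real one)
    file="${3:-/etc/iproute2/rt_tables}"
    # Only append when no existing line already maps this id to this name
    if ! grep -Eq "^[[:space:]]*$1[[:space:]]+$2[[:space:]]*\$" "$file"; then
        printf '%s\t%s\n' "$1" "$2" >> "$file"
    fi
}

# usage (as root): add_rt_table 100 lb
```

Running it twice leaves a single entry, so it is safe to keep in a provisioning script.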

Using route, we see the default routing table:

root@server:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.140.0   0.0.0.0         255.255.255.0   U     0      0        0 bond0
0.0.0.0         192.168.140.254 0.0.0.0         UG    100    0        0 bond0

We see the same config, just in a different format with ip route (the default routing table is the ‘main’ table, not the ‘default’ table like you might imagine):

root@server:~# ip route list table main
192.168.140.0/24 dev bond0  proto kernel  scope link  src 192.168.140.4
default via 192.168.140.254 dev bond0  metric 100

We need our ‘lb’ table to have a route that performs the job that we require. We’re not too fussy: anything that ends up using the ‘lb’ routing table should route via a specific IP (it makes sense for this to be the IP address used by the virtual server that load balances to these real servers). The simplest way to do that is to add a default route to our ‘lb’ routing table:

root@server:~# ip route add default via 192.168.140.100 dev bond0 table lb
root@server:~# ip route list table lb
default via 192.168.140.100 dev bond0

So now we have to get certain packets coming from our services to use this routing table. We do that using a combination of iproute2 and iptables.

The way we do it is by marking packets with iptables and then making those marked packets use our new routing table.
The only packets that we care about are ones that are returning from the service that we’re load balancing to. So we need to mark packets with a source port (or ports) of the service(s) that we’re load balancing (optionally with a destination of the server/network/whatever will be accessing this service):

root@server:~# iptables -t mangle -A OUTPUT -d 192.168.140.0/24 -p tcp -m multiport --sports 8001,8002 -j MARK --set-xmark 0x1/0xffffffff

Then we configure a rule to make packets with this mark use our new routing table:

root@server:~# ip rule add fwmark 0x1 table lb

That should be it. The virtual server on 192.168.140.100 is handing off to ports 8001 and 8002 on this machine. Packets returning from these two ports should now return via the machine which is hosting the 192.168.140.100 IP address.
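
The steps so far can be pulled together into one small script. This is a minimal sketch using the addresses, ports and table name from this post; the RUN variable and the setup_lb_routing function are my own additions for illustration. By default it only prints the commands, so nothing is changed until you run it as root with RUN set to an empty string. It also assumes the "100 lb" line is already in /etc/iproute2/rt_tables.

```shell
#!/bin/sh
# Sketch: send replies from load-balanced ports back via the LVS box.
# By default the commands are only printed; run with RUN= (empty) as root
# to actually apply them. RUN and setup_lb_routing are illustrative names.
RUN="${RUN-echo}"

VIP=192.168.140.100        # IP of the virtual server (from the post)
NET=192.168.140.0/24       # clients whose replies should go via the VIP
PORTS=8001,8002            # ports the real server listens on
DEV=bond0

setup_lb_routing() {
    # Default route in the 'lb' table
    $RUN ip route add default via "$VIP" dev "$DEV" table lb
    # Mark replies leaving the load-balanced ports
    $RUN iptables -t mangle -A OUTPUT -d "$NET" -p tcp \
        -m multiport --sports "$PORTS" -j MARK --set-xmark 0x1/0xffffffff
    # Marked packets use the 'lb' table
    $RUN ip rule add fwmark 0x1 table lb
}

setup_lb_routing
```

Once applied for real, ip rule show should list the fwmark rule, and ip route get with a mark of 0x1 should report the route via 192.168.140.100.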

That’s not quite it. Our LVS servers (yes, we have two of them) perform healthchecks of our services. The healthchecks talk to the services whose packets are now returning via a specific IP address. That IP will always be on one of our two LVS machines, so healthchecks from the machine currently holding it work fine. However, that machine (because it has its own firewall) doesn’t forward the replies meant for the other machine (because its healthchecks don’t go via the virtual server, they talk to the real servers directly). So we have to exclude the LVS machines from having their packets forced back through the machine hosting the IP of the virtual server.

Again, we do this with iptables, in the same table and the same chain, but with rules inserted BEFORE the rule above (hence -I rather than -A) that simply accept the packet so that it is not processed by any further rule:

root@server:~# iptables -t mangle -I OUTPUT -d 192.168.140.252/32 -p tcp -m multiport --sports 8001,8002 -j ACCEPT
root@server:~# iptables -t mangle -I OUTPUT -d 192.168.140.253/32 -p tcp -m multiport --sports 8001,8002 -j ACCEPT

Now both of our pair of LVS machines can perform healthchecks against our real servers.

IPVS/LVS slow with larger payloads

March 23rd, 2012 by Gary Richards - Categories: IPVS / LVS, Linux, Operating Systems

So… an existing IPVS setup that has worked fine for quite a while. We configure a new virtual server for some real servers and everything seems to be working fine.

Until someone comes along and says “When we talk to one of those servers directly we get our response and data back in 45 seconds. When we talk to it via the load balancer, the response and data take 11 minutes”

Say what?!

Further investigation and conversations turned up a number of things. The main one is that this application is a little different to the other applications as there’s significantly more data transfer involved. The other is that tcpdump shows various ICMP unreachable messages like:

10:31:43.561006 IP 192.168.140.2.35804 > 192.168.140.100.80: Flags [.], seq 728543:731439, ack 1, win 46, options [nop,nop,TS val 1182942909 ecr 1182984655], length 2896
10:31:43.561009 IP 192.168.140.2.35804 > 192.168.140.100.80: Flags [.], seq 728543:731439, ack 1, win 46, options [nop,nop,TS val 1182942909 ecr 1182984655], length 2896
10:31:43.561231 IP 192.168.140.100 > 192.168.140.2: ICMP 192.168.140.100 unreachable - need to frag (mtu 1500), length 556
10:31:43.780560 IP 192.168.140.2.35804 > 192.168.140.100.80: Flags [.], seq 660487:661935, ack 1, win 46, options [nop,nop,TS val 1182942931 ecr 1182984655], length 1448
10:31:43.780568 IP 192.168.140.2.35804 > 192.168.140.100.80: Flags [.], seq 660487:661935, ack 1, win 46, options [nop,nop,TS val 1182942931 ecr 1182984655], length 1448
10:31:43.780823 IP 192.168.140.100.80 > 192.168.140.2.35804: Flags [.], ack 661935, win 1856, options [nop,nop,TS val 1182984677 ecr 1182942931,nop,nop,sack 2 {692343:693791}{686551:689447}], length 0
10:31:43.780861 IP 192.168.140.2.35804 > 192.168.140.100.80: Flags [.], seq 731439:734335, ack 1, win 46, options [nop,nop,TS val 1182942931 ecr 1182984677], length 2896
10:31:43.780865 IP 192.168.140.2.35804 > 192.168.140.100.80: Flags [.], seq 731439:734335, ack 1, win 46, options [nop,nop,TS val 1182942931 ecr 1182984677], length 2896
10:31:43.781022 IP 192.168.140.100 > 192.168.140.2: ICMP 192.168.140.100 unreachable - need to frag (mtu 1500), length 556

So it looks like every time the client machine that is sending all of this data to the application tries to send IP packets with a length greater than the MTU of the ethernet interface, the ICMP unreachable is sent. Odd. Surely if your IP packets are bigger than that then the tcp/ip stack is supposed to work magic for you and split these packets up into ethernet frame sized chunks? So why isn’t it?

Much googlage later, it turns out that LVS, when used with Linux kernels prior to 2.6.39, is incompatible with the ethernet device option ‘generic-receive-offload’ (or gro).

So I turned off gro using ethtool for the interfaces on the LVS machine. All of a sudden accessing the app via the load balancer works pretty much the same as accessing it directly *grumble*
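
For reference, here is roughly what that looks like. This is a sketch under a few assumptions: eth0 is just a placeholder for whichever interfaces your LVS machine uses, needs_gro_workaround is a helper name I have invented, and the 2.6.39 cut-off is simply the version mentioned above. It only prints the ethtool command rather than running it.

```shell
#!/bin/sh
# Sketch: decide whether the GRO workaround applies and show the command.
# needs_gro_workaround and IFACE are illustrative names, not from the post.
IFACE="${IFACE:-eth0}"     # assumed interface name on the LVS machine

# Succeeds (exit 0) when the given kernel release is older than 2.6.39.
needs_gro_workaround() {
    ver="${1%%-*}"         # "2.6.32-5-amd64" -> "2.6.32"
    lowest=$(printf '%s\n%s\n' "$ver" 2.6.39 | sort -V | head -n 1)
    [ "$lowest" = "$ver" ] && [ "$ver" != 2.6.39 ]
}

if needs_gro_workaround "$(uname -r)"; then
    echo "ethtool -K $IFACE gro off   # run as root to disable GRO"
else
    echo "kernel $(uname -r) is 2.6.39 or newer, GRO should be fine"
fi
```

ethtool -k (lower case) on the same interface lists the current offload settings, which is a quick way to confirm that generic-receive-offload is now off. Note that sort -V is a GNU coreutils option.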

Puppet create_resources

March 20th, 2012 by Gary Richards - Categories: Linux, Puppet

Whilst working on some code to handle Debian/Ubuntu networking with Puppet, I wanted a way to create the various configuration resources from hiera data (to support some legacy systems configured with a horrible networking module that’s in use).

I had recently read about create_resources, but all documentation about it seemed very vague and confusing.

I eventually worked it out and this is what I found.

You pass create_resources two arguments. The first is the name of the resource type that you wish to create (file, package, the name of a define, etc.).

The second is a hash of hashes. Each key in the outer hash will be used as the name/title of a resource. Every element in the inner hash that the key refers to will be used as a parameter for that resource.

A really simple example creating a bunch of resources with the file type from a hash:

  $myhash = {
    '/some/file/1' => { 'owner' => 'root', 'group' => 'root', 'mode' => '0644' },
    '/some/directory' => { 'ensure' => 'directory' }
  }
  create_resources(file, $myhash)

Would be equivalent to:

  file { '/some/file/1':
    owner => 'root',
    group => 'root',
    mode => '0644'
  }
  file { '/some/directory':
    ensure => directory
  }

At first this may not seem amazingly useful. But consider when your hash isn’t written as part of your manifest, but instead comes from some sort of external node classifier (or in my case from hiera).

I have a networking::interface define that takes all manner of parameters. With create_resources, I can have something like this in one of the yaml files that provide data via hiera:

networking:
  bond0:
    slaves:
      - eth1
      - eth3

  bond0.1000:
    ensure: present

  bond0.1001:
    ensure: present

  br-vlan1000:
    ipaddress: 192.168.1.1
    netmask: 255.255.255.0
    bridge_ports:
      - bond0.1000

  br-vlan1001:
    ipaddress: 192.168.2.1
    netmask: 255.255.255.0
    bridge_ports:
      - bond0.1001

Using create_resources I then retrieve that data from hiera (the way the data above is defined means that hiera returns me a hash of hashes):

create_resources( networking::interface, hiera('networking') )

And all my network interfaces get defined from hiera.

TCP Keepalives with Java to keep your load balancer happy

February 2nd, 2012 by Gary Richards - Categories: IPVS / LVS, Java, Linux

So…

IPVS load balancer configured in Direct Routing mode. Due to this we have the shiny Netfilter Connection tracking support for IPVS turned off.

Anyhow, along comes our real servers, server A and server B, each running an instance of application X.

Application Y on servers C, D and E talks to the virtual server for the above real servers and everything’s happy.

No real rocket science happening here, a lot of people have been using IPVS without it for years…

Server F comes along, also running application Y, but makes significantly longer requests to the virtual server. Requests that are made and then take a fairly substantial time to return any data.

Application Y on server F regularly logs socket read timeouts (after hitting the ridiculously high default socket read timeout that application Y appears to be configured with by default).

Further investigation shows that server F (up until the point that the read timeout occurs) still thinks that the connection is established. Similarly, server A (the real server that this request happened to end up on) still thinks that the connection is established. The machine running IPVS however, has removed the tracked connection from its internal connection tracking.

I see this using ipvsadm:
# ipvsadm -Lnc
IPVS connection entries
pro expire state source virtual destination
TCP 13:20 ESTABLISHED 192.168.130.8:37183 192.168.130.200:9700 192.168.130.10:9700

Notice the expire field? It seems that every time the virtual server receives a packet destined for this connection the expire timeout is reset. So what’s the default timeout? Watching the same ipvsadm output as seen above when a new connection is established suggests 15 mins 02 secs… I wonder if we can confirm that somehow?

# ipvsadm -L --timeout
Timeout (tcp tcpfin udp): 900 120 300

So we can (almost). TCP connection timeout is 900 seconds (aka 15 mins).

So the reason our connection breaks is due to this timeout removing the tracked connection from IPVS’s own connection tracking table. Excellent, we know what’s causing it… So how do we fix it?

Most people would simply up the TCP timeout, but I don’t like that idea. If something is going to hold onto a connection for longer than 15 mins and not send any data over that connection, shouldn’t it try to help me in confirming that its connection is still really alive, rather than me blindly increasing what seems like a fairly reasonable (if not already too high) timeout?

Enter TCP Keepalive.

It seems that TCP keepalive is an optional feature of the TCP protocol. Fortunately both Linux and Java (which is what both ends of applications X and Y are written in) seem to support it. So can we test that it would solve our problem?

I wrote a simple Java class to connect to our virtual server above and to enable TCP keepalive on the socket:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;

class SocketTest {
  public static void main(String[] args) {
    // try-with-resources closes the socket when we are done
    try (Socket sock = new Socket("virtual.server.name", 9700)) {
      // ask the kernel to send TCP keepalive probes on this connection
      sock.setKeepAlive(true);
      BufferedReader input = new BufferedReader(new InputStreamReader(sock.getInputStream()));
      // block until the server sends a line (or the connection dies)
      input.readLine();
    } catch (Exception e) {
      System.out.println(e.toString());
    }
  }
}

The important line being:
sock.setKeepAlive(true);
This tells our socket to send keepalives. Great, let’s try it out…

Ok, it runs, it connects (I’m using tcpdump to confirm) and… oh, beyond the three way TCP handshake I see nothing.

I wonder how often the TCP keepalive is sent when enabled? A quick trip in the Googlecopter suggested 2 hours.

Yep, 2 hours… So, I open a socket, enable tcp keepalive on that socket, then I need to wait 2 hours before the first keepalive is sent. That’s not fun is it! I must be able to change the value….

Scary, Java doesn’t seem to be able to do anything other than enable or disable the SO_KEEPALIVE socket option on the underlying socket. So what is the underlying socket in this instance?

Ok, it’s native Linux socket code, so what/how can I configure those? Linux itself does have per-socket options for this (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT), but Java’s Socket only lets me enable or disable SO_KEEPALIVE, so from Java I can’t change the timeout. So how do I change it?

Enter /proc and the magic that is system wide settings for things like this on a Linux box.


$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

Now we’re onto something. The time before a keepalive is sent is 7200 seconds (or 2 hours). If a response isn’t received then we send another in 75 seconds, then another, then another, up to 9 times. So if the other end of my idle connection goes away, it would take up to 7200 + (75*9) = 7875 seconds before the connection is declared dead.
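
That arithmetic is easy to redo for other values. A tiny sketch follows; keepalive_worst_case is just a helper name I have made up, and it takes the three settings in the order time, interval, probes.

```shell
#!/bin/sh
# Sketch: worst-case seconds before an idle, dead connection is noticed.
# keepalive_worst_case is an illustrative helper, not a standard tool.
keepalive_worst_case() {
    # $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
    echo $(( $1 + $2 * $3 ))
}

keepalive_worst_case 7200 75 9   # the defaults shown above, prints 7875
```

Feeding it the three values read from /proc/sys/net/ipv4/ gives the figure for whatever the box is currently configured with.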

Obviously that’s far too long for our load balancer. So what happens if we tweak these values slightly?
If I set keepalive_time to 600 seconds (or 10 mins)
And I set keepalive_intvl to 60 seconds
And I set keepalive_probes to 4 (so 240 seconds of probing)
(I think that totals 14 mins?)
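
For anyone wanting to try the same thing, the values can be set with sysctl. This is a sketch rather than a recipe: RUN and set_keepalive are names I have made up, the script only prints the commands unless RUN is set to an empty string, and the probe count of 4 is my assumption (4 probes at 60 seconds apart is what gives 240 seconds and the 14 minute total).

```shell
#!/bin/sh
# Sketch: apply the tweaked keepalive values system-wide.
# Prints the commands by default; run with RUN= (empty) as root to apply.
# The probe count of 4 is an assumption (4 x 60s = 240s of probing).
RUN="${RUN-echo}"

set_keepalive() {
    # $1 = idle seconds before the first probe, $2 = seconds between probes, $3 = probe count
    $RUN sysctl -w net.ipv4.tcp_keepalive_time="$1"
    $RUN sysctl -w net.ipv4.tcp_keepalive_intvl="$2"
    $RUN sysctl -w net.ipv4.tcp_keepalive_probes="$3"
}

set_keepalive 600 60 4
```

Settings made with sysctl -w are lost on reboot; the same three keys can go in /etc/sysctl.conf to make them stick.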

What happens?

Lo and behold, every 10 mins, the underlying Linux socket code appears to be sending a tcp keepalive to the server. Almost forgetting about the actual IPVS problem, what happens there? As expected, the tcp keepalive is causing the expire counter of each entry in the IPVS connection tracking tables to reset and my long running idle connections are no longer lost.

So I don’t have to increase the load balancer timeout after all! Win win for everyone I’d say ;)

LXC host network freezing

September 19th, 2011 by Gary Richards - Categories: Linux, Operating Systems, Virtualisation

I’ve been performing some testing for a client who wanted to set up a few Linux Containers on some of their systems.

Everything was working fine fairly quickly, but sometimes starting or stopping a container would cause the host’s networking to freeze (or hang, whatever you’d like to call it) for a small period (10-30 seconds generally).

Investigating further, we’re using bridged networking just like I’ve seen with a whole bunch of other virtualisation setups before:

root@host:~# ifconfig br0
br0       Link encap:Ethernet  HWaddr b4:99:ba:XX:XX:XX
inet addr:192.168.121.61  Bcast:192.168.121.255  Mask:255.255.255.0
inet6 addr: fe80::b699:baff:feXX:XXXX/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:934628 errors:0 dropped:0 overruns:0 frame:0
TX packets:777998 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:257326709 (257.3 MB)  TX bytes:4874452976 (4.8 GB)

root@host:~# brctl show
bridge name	bridge id		STP enabled	interfaces
br0		8000.b499baXXXXXX	no		bond0

Which all looks fine. Having the bonded interface added to the bridge ‘should’ in theory be ok. At least, I’m pretty sure I’ve done it before…

Start up a container and look again:

root@host:~# ifconfig
br0       Link encap:Ethernet  HWaddr b2:c8:45:94:e4:0a
inet addr:192.168.121.61  Bcast:192.168.121.255  Mask:255.255.255.0
inet6 addr: fe80::b699:baff:feXX:XXXX/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:910208 errors:0 dropped:0 overruns:0 frame:0
TX packets:757874 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:254991889 (254.9 MB)  TX bytes:4769801361 (4.7 GB)

vethYldGtj Link encap:Ethernet  HWaddr b2:c8:45:94:e4:0a
inet6 addr: fe80::b0c8:45ff:fe94:e40a/64 Scope:Link
UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
RX packets:7 errors:0 dropped:0 overruns:0 frame:0
TX packets:42 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:648 (648.0 B)  TX bytes:3048 (3.0 KB)

root@host:~# brctl show
bridge name	bridge id		STP enabled	interfaces
br0		8000.b2c84594e40a	no		bond0
                                                    vethYldGtj

Everything looked ok (I hadn’t at this point spotted the MAC address change!). Starting/stopping this container had caused the host networking to freeze. Now I know for a fact that this doesn’t happen EVERY single time that a container is started, so I tried again:

root@host:~# ifconfig
br0       Link encap:Ethernet  HWaddr  b4:99:ba:XX:XX:XX
inet addr:192.168.121.61  Bcast:192.168.121.255  Mask:255.255.255.0
inet6 addr: fe80::b699:baff:feXX:XXXX/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:910208 errors:0 dropped:0 overruns:0 frame:0
TX packets:757874 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:254991889 (254.9 MB)  TX bytes:4769801361 (4.7 GB)

vethLxwrPl Link encap:Ethernet  HWaddr c6:bf:b2:18:db:4e
inet6 addr: fe80::24bf:b2ff:fe18:db4e/64 Scope:Link
UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
RX packets:46 errors:0 dropped:0 overruns:0 frame:0
TX packets:871 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4013 (4.0 KB)  TX bytes:189734 (189.7 KB)

root@host:~# brctl show
bridge name	bridge id		STP enabled	interfaces
br0		8000.b499baXXXXXX	no		bond0
                                                        vethLxwrPl

Wait.. why did nothing go wrong this time?!

After much pondering (and various tcpdumping, lxc configuration changes, etc.) I was still no closer to a solution. It seemed that randomly my containers would cause this to happen and other times they wouldn’t. At no point was the containers’ networking affected (even though the host machine would be affected).

I even looked at the networking of one of the containers:

root@lxc1:~# ifconfig
eth0      Link encap:Ethernet  HWaddr 54:52:00:3d:40:87
inet addr:192.168.121.62  Bcast:192.168.121.255  Mask:255.255.255.0
inet6 addr: fe80::5652:ff:fe3d:4087/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:859641 errors:0 dropped:0 overruns:0 frame:0
TX packets:125441 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2129954 (203 MiB)  TX bytes:2323974545 (2.0 GiB)

Nothing obviously out of the ordinary. The MAC address is the same as we’ve told LXC to configure (it’s correct when it breaks too).

So what’s the problem? Stopping lxc from adding/removing the veth device from the bridge allowed me to start/stop my container without problems. So the point at which the veth device was added to the bridge was definitely the problem. But why?

We set up a loop that continually started/stopped containers to see if there was anything useful that we could find by watching various things whilst containers were started/stopped. One of the things I had open was watch on ifconfig. It didn’t take long before I realised that sometimes the bridge’s MAC address was changing. Sometimes a container would start, the MAC address of the bridge would stay the same and everything was fine. Other times, the container would start, the MAC address of the bridge would change, then the host’s networking would break.

Ok, now we’re on to something… what’s causing the MAC address to change (sometimes) and why is it changing to totally random things that seem so totally unrelated to any MAC address that we’ve configured anywhere?

After this it was fairly simple. The MAC address given to our bridge was b4:99:ba:XX:XX:XX (the MAC of bond0, which happens to be the MAC of eth0 too, although that’s less important). Whenever the MAC of the bridge changed (and we saw problems with the host networking) it always changed to a MAC address ‘lower’ in value than the original MAC of the bridge.

So why were random MAC addresses lower than our bridge’s MAC address even being created? We’ve configured our containers to have MAC addresses in the range that KVM seems to use, which start 52:54:….. So that wasn’t it. I then backtracked to my original comments about not noticing the MAC address of the bridge changing and also realised that the MAC of the bridge was the same as that of the veth device that was created and is associated with my container. Ok, one step closer, so why is this? LXC’s source reveals that the veth devices are created in the simplest way possible with no extra configuration (so a colleague tells me!) and therefore the MAC address of the new veth device is totally random.

Adding this veth device with its random MAC to our bridge causes the bridge to change its MAC to the MAC of the veth device if the veth device’s MAC is lower than the bridge’s current MAC. I’m unsure why this happens, but it seems like whenever you create a bridge, something decides that the bridge will assume the MAC address of the interface with the lowest value MAC address that’s added to the bridge. I have no idea if this is a standard or a choice a kernel developer has made or what? But it’s causing me problems!
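
To make the "lowest MAC wins" behaviour concrete, here is a small sketch that takes the MACs of a bridge's ports and tells you which one the bridge would end up with. It assumes the bridge really does just take the lowest address, which is what I observed rather than something I have confirmed. lowest_mac is a made-up helper name, and b4:99:ba:00:00:01 is a stand-in for the redacted bond0 MAC.

```shell
#!/bin/sh
# Sketch: which MAC would a bridge end up with, given its ports' MACs?
# lowest_mac is an illustrative helper; it just picks the lowest address.
lowest_mac() {
    # Lowercase, fixed-width hex sorts the same as numerically in the C locale
    printf '%s\n' "$@" | tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}

# b4:99:ba:00:00:01 stands in for the (redacted) bond0 MAC
lowest_mac b4:99:ba:00:00:01 b2:c8:45:94:e4:0a   # veth is lower: bridge MAC changes
lowest_mac b4:99:ba:00:00:01 c6:bf:b2:18:db:4e   # veth is higher: bridge MAC stays
```

The two calls mirror the two container starts shown earlier: the first veth MAC sorts below bond0's so the bridge takes it, the second sorts above so the bridge keeps its original address.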

Why haven’t any of my KVM or Xen boxes i’ve built/used in the past suffered from this problem? It seems that KVM (at least KVM vm’s created with libvirt) have their associated veth device created with a MAC address specified (rather than random like the LXC). The MAC used is the MAC that the virtual machine is configured with, but with the first part replaced with fe (ie. 01:23:45:67:89:ab would become fe:23:45:67:89:ab). Now presumably they do this because they know most users are probably using bridged networking and when these devices are added to the bridge, they’re probably going to have a higher MAC than any real device and therefore probably aren’t going to break your hosts networking!
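
The fe trick is simple enough to show in a couple of lines. This is just an illustration of the rewrite described above, not libvirt's actual code, and host_side_mac is a name I have invented.

```shell
#!/bin/sh
# Sketch: derive the host-side MAC the way described above,
# by replacing the first octet of the guest MAC with fe.
# host_side_mac is an illustrative name.
host_side_mac() {
    echo "$1" | sed 's/^[0-9A-Fa-f][0-9A-Fa-f]/fe/'
}

host_side_mac 01:23:45:67:89:ab   # prints fe:23:45:67:89:ab
```

Because fe is about as high as a first octet gets, a device named this way will almost never sort below a real NIC's MAC.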

What can we do to solve this problem? Hopefully the people behind LXC will adopt a similar approach in the future. Until then, the only solution that I can find is to tell the bridge what its MAC address is. It then doesn’t seem to get changed when devices with lower MAC addresses are added to the bridge.

It seems that on Debian style systems you can do this with this option in /etc/network/interfaces:
bridge_hw <MAC addr>

You could do it manually with ip:
ip link set br0 address <MAC addr>

I’d imagine that you can do it with ifconfig too. I’m sure RedHat style systems have a way as well.
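If the bridge cannot be pinned at configuration time, a small check can put the address back by hand. This is a rough sketch rather than a tested fix: needs_repin and repin_bridge are hypothetical helpers, the bridge name and MAC are whatever you pass in, and the ip call needs root.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the two MACs differ, ignoring case.
needs_repin() {
    local wanted current
    wanted=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    current=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$wanted" != "$current" ]
}

# Hypothetical helper: read the bridge's current MAC from sysfs and
# set it back to the wanted value only if it has drifted.
repin_bridge() {
    local bridge=$1 wanted=$2 current
    current=$(cat "/sys/class/net/$bridge/address") || return 1
    if needs_repin "$wanted" "$current"; then
        ip link set dev "$bridge" address "$wanted"
    fi
}

# Example call (placeholder MAC): repin_bridge br0 "<MAC addr>"
```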

I filed a bug on the LXC bug tracker at sourceforge.

Fixing slony after implementing postgres authentication

May 8th, 2011 by Gary Richards - Categories: Linux, Postgres, Slony

Woah, neglected blog!

I was tasked with implementing postgres database auth for a client, and the task seemed fairly simple: add some users, edit pg_hba.conf, remove all of the trust config, and implement password auth for various users from various hosts… what could possibly go wrong?

After reconfiguring the apps to match, things were fine, with the exception of Slony (Slony-I 2.0.6 in this instance, ‘installed’ from source).

In this setup, slon_start seems to take the client auth details from slon_tools.conf when the cluster is originally configured, but editing that file afterwards doesn’t change them. So when slon is started, it gets its auth details from somewhere else (there’s no slon.conf on these systems at all). Where was it?

I remember once looking into the postgres schema added by slony when you initialise a cluster. So I investigated further…

mydb=# set search_path TO _repl;
SET
mydb=# select * from sl_path;
 pa_server | pa_client |                     pa_conninfo                     | pa_connretry
-----------+-----------+-----------------------------------------------------+--------------
         1 |         2 | host=172.30.50.231 dbname=mydb user=slony port=5432 |           10
         2 |         1 | host=172.30.50.232 dbname=mydb user=slony port=5432 |           10

OK, so that’s where it’s stored. Is it really that easy to fix my problem?

mydb=# update sl_path set pa_conninfo=pa_conninfo || ' password=xxxx';
UPDATE 12
mydb=# select * from sl_path;
 pa_server | pa_client |                            pa_conninfo                            | pa_connretry
-----------+-----------+-------------------------------------------------------------------+--------------
         1 |         2 | host=172.30.50.231 dbname=mydb user=slony port=5432 password=xxxx |           10
         2 |         1 | host=172.30.50.232 dbname=mydb user=slony port=5432 password=xxxx |           10

Then I started slon on each of my nodes and… wow. slon now auths to my DB servers and replication syncs back up!
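One thing to watch: the update appends unconditionally, so running it a second time would leave two password= entries in each conninfo string. The guarded version of that string edit can be sketched in shell; add_password is a hypothetical helper for illustration and is not part of Slony.

```shell
#!/bin/bash
# Hypothetical helper: append "password=..." to a conninfo string
# unless a password keyword is already present.
add_password() {
    local conninfo=$1 password=$2
    case " $conninfo " in
        *" password="*) printf '%s\n' "$conninfo" ;;                      # already set, leave alone
        *)              printf '%s password=%s\n' "$conninfo" "$password" ;;
    esac
}

add_password "host=172.30.50.231 dbname=mydb user=slony port=5432" "xxxx"
# prints host=172.30.50.231 dbname=mydb user=slony port=5432 password=xxxx
```

Calling it again on its own output returns the string unchanged.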

Bash 4 associative arrays

June 14th, 2009 by Gary Richards - Categories: Linux, Operating Systems, Scripting

When scripting Bash, I’ve often come across problems where using an associative array would make things so much cleaner. Unfortunately Bash has never supported them… until now.

With Bash4 I can do this:

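#!/bin/bash
# Declare an associative array (requires Bash 4 or later)
declare -A info
info[blah]="Blah"
info[test]="Test"
# Quote the expansions so values containing spaces survive intact
echo "${info[blah]}"
echo "${info[test]}"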

As you can see, associative arrays are where you use strings as the index of the array, rather than integers. Very useful.
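Two more things tend to come up straight away: walking over the keys and checking whether a key exists. A short sketch using the same array as above (the key names are just the ones from the example):

```shell
#!/bin/bash
declare -A info
info[blah]="Blah"
info[test]="Test"

# "${!info[@]}" expands to the keys. Their order is unspecified,
# so sort them to get stable output.
for key in $(printf '%s\n' "${!info[@]}" | sort); do
    printf '%s=%s\n' "$key" "${info[$key]}"
done
# prints:
# blah=Blah
# test=Test

# ${info[key]+set} expands to "set" only when the key exists,
# even if its value is an empty string.
if [[ -n ${info[blah]+set} ]]; then
    echo "blah is defined"
fi

# Number of entries in the array
echo "${#info[@]}"   # prints 2
```

The unquoted command substitution in the for loop splits on whitespace, so this form only suits keys without spaces.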

Removing a top level cyrus imap mailbox

May 28th, 2009 by Gary Richards - Categories: Email, Linux, Operating Systems, Redhat

When using cyrus imap, a new mailbox is generally added to the system using the following commands:

$ cyradm -u cyrus localhost
Password:
localhost> cm user.someusername

Generally, this would create the directory /var/spool/imap/user/s/someusername, containing the various cyrus files.

It turns out that you can also do the following:

localhost> cm someusername
localhost> cm someuser.name

When you list users you would see:

localhost> lm
someuser.name (\HasNoChildren)
someusername (\HasNoChildren)
user.someusername (\HasNoChildren)

I came across a (RHEL4) cyrus server like this today. It seems that someone had created some mailboxes like this accidentally and then not been able to remove them.

When I tried I received the following messages:

localhost> dm someusername
deletemailbox: Permission denied
localhost> sam someusername cyrus c
setaclmailbox: cyrus: c: System I/O error

Based on what I said above, I would have expected both /var/spool/imap/someusername and /var/spool/imap/someuser/n/name (or /var/spool/imap/someuser/name) to exist and to contain the various cyrus files, but they don’t. I can only assume that this is why I see the errors above.

Anyhow, the fix was fairly simple:

$ cd /var/spool/imap
$ mkdir someusername
$ reconstruct someusername
$ cyradm -u cyrus localhost
localhost> sam someusername cyrus c
localhost> dm someusername

The same applies for someuser.name. However, rather than creating an extravagant directory structure, it appears that you can just create the directory someuser.name and then reconstruct someuser.name. If that fails then I’d suggest creating someuser/name and reconstructing someuser.name or, failing that, creating someuser/n/name and reconstructing someuser.name.

This looks as though it may have been a bug in an older build of cyrus imapd, as I tried some similar steps on a test cyrus setup on a Gentoo system and could create and delete these mailboxes fine without the above steps.

All very odd.

PXEBOOT/BIOS flashage fun

May 20th, 2009 by Gary Richards - Categories: Linux, Operating Systems

Whilst repurposing a machine that’s now a couple of years old, I came across a problem: the system couldn’t ‘see’ a 1TB SATA hard disk, but other smaller disks seemed to work fine. I noticed a BIOS update for this board was available, but I hate floppies, so I had a little think and this is what I came up with…

Download a FreeDOS floppy disk image (fdboot.img)

Mount the disk image and copy the required files to it:

$ mount -t vfat -o loop fdboot.img /mnt/floppy
$ cp viaflash.exe /mnt/floppy
$ cp I1000107.BIN /mnt/floppy
$ umount /mnt/floppy

At this point my BIOS flashing floppy disk is technically ready to go, but how do I boot it without writing it to a floppy? So let’s have a look at PXE boot… I already have a DHCP server configured to boot pxelinux from the syslinux project. I won’t go into too much detail here, but the relevant additions to a simple dhcpd.conf for this appear to be:

# Specify the TFTP server to be used
next-server 10.3.0.254;

# Declare a vendor-specific option buffer for PXE clients:
# Code 1: Multicast IP address of boot file server
# Code 2: UDP port that client should monitor for MTFTP responses
# Code 3: UDP port that MTFTP servers are using to listen for MTFTP requests
# Code 4: Number of seconds a client must listen for activity before trying
#         to start a new MTFTP transfer
# Code 5: Number of seconds a client must listen before trying to restart
#         a MTFTP transfer

option space PXE;
option PXE.mtftp-ip               code 1 = ip-address;
option PXE.mtftp-cport            code 2 = unsigned integer 16;
option PXE.mtftp-sport            code 3 = unsigned integer 16;
option PXE.mtftp-tmout            code 4 = unsigned integer 8;
option PXE.mtftp-delay            code 5 = unsigned integer 8;
option PXE.discovery-control      code 6 = unsigned integer 8;
option PXE.discovery-mcast-addr   code 7 = ip-address;

subnet 10.3.0.0 netmask 255.255.255.0 {
# 8< SNIP ALL THE USUAL OPTIONS >8

# Provide PXE clients with appropriate information
class "pxeclient" {
 match if substring(option vendor-class-identifier, 0, 9) = "PXEClient";
 vendor-option-space PXE;

 # At least one of the vendor-specific PXE options must be set in
 # order for the client boot ROMs to realize that we are a PXE-compliant
 # server.  We set the MCAST IP address to 0.0.0.0 to tell the boot ROM
 # that we can't provide multicast TFTP.

 option PXE.mtftp-ip 0.0.0.0;

 # This is the name of the file the boot ROMs should download.
 filename "pxelinux.0";
 }
}

If we tried to boot the machine now, once its network boot kicks in and gets an IP address from the DHCP server, it will (using tftp) look for the file pxelinux.0 on the IP address specified in next-server, so obviously we need a tftp server configured on the machine specified in next-server…

I have atftp installed on this server, and it defaults to sharing /tftproot for its contents, so that is where I put the pxelinux.0 file. Booting the system now, it runs pxelinux and, because there is no configuration, I eventually get dumped out at a boot> prompt.

A little further reading suggests that to boot a floppy image I need to use memdisk, yet another part of the syslinux package. I copied this file and my fdboot.img file to /tftproot on the dhcp server.

At the boot> prompt on the machine, I then:

boot> memdisk initrd=fdboot.img

Lo and behold… I end up at a pxe booted FreeDOS prompt.
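To avoid typing at the prompt every time, PXELINUX can be given a config file; it looks for one under pxelinux.cfg/ next to pxelinux.0, falling back to a file named default. A minimal sketch that writes such a file: write_pxe_config and the label name flash are my own inventions, and the /tftproot default simply mirrors the atftp setup above.

```shell
#!/bin/bash
# Hypothetical helper: write a default PXELINUX config that boots the
# FreeDOS floppy image through memdisk. Pass the TFTP root as $1.
write_pxe_config() {
    local tftp_root=${1:-/tftproot}
    mkdir -p "$tftp_root/pxelinux.cfg" || return 1
    # Quoted heredoc delimiter: the contents are written out literally.
    cat > "$tftp_root/pxelinux.cfg/default" <<'EOF'
DEFAULT flash
LABEL flash
  KERNEL memdisk
  APPEND initrd=fdboot.img
EOF
}

# Example: write_pxe_config /tftproot
```

The KERNEL and APPEND lines are the same memdisk initrd=fdboot.img command entered by hand above, so memdisk and fdboot.img still need to be in the TFTP root.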

Unfortunately viaflash.exe seems to hang as soon as I try to flash the BIOS. So perhaps that’s an issue with FreeDOS rather than /real/ DOS.

Further investigation needed, but hopefully I’ll never need a real floppy again!