On 23/04/15 2:54 pm, Nathan Ward wrote:
On 23 April 2015 at 13:48:09, Glen Eustace wrote:
The problem is also intermittent. I have heard that the two Xtra
servers are actually LB VIPs in front of a farm of name servers. With
the intermittent nature of the issue I wonder whether one server in
the farm might be broken/misconfigured, just a thought.
I’d be surprised if there was an actual load balancer in the way, though
it is entirely possible that there are some ECMP routes to the servers or
something. It would surprise me if they had enough DNS traffic to
require such a thing, but, what do I know.
We have solid evidence that there is a load balancer in front of
the IP address Spark uses for its resolver.
NZRS actually has an article that explores that:
There’s a couple of ways to easily validate whether you’re hitting
different servers. It’s difficult to prove the negative, but it’s easy
to prove the positive (with very good confidence).
1) Look at the TTL the server offers; it'll jump around between queries.
2) Ask it for names it has to recurse, and on your name server see where
the queries come from, it’ll likely change between queries - though some
providers pass recursive queries to a higher level caching server which
would mask that.
We followed this methodology and saw those jumps in the TTL, as well
as in the validation status of responses. We came to the conclusion that
there is a set of servers behind the "service address", and that some of
them validate and others don't.
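
For what it's worth, the TTL check can be scripted. Below is a rough Python sketch, not what we actually ran: the function name, the sample format and the 2-second tolerance are my own assumptions. It takes (time, TTL) pairs observed for one name and type at the service address and relies on the fact that a single cache counts the TTL down, so time + TTL stays constant until the entry expires. Seeing a different expiry while an earlier one is still live suggests a second cache.

```python
def count_concurrent_caches(samples, tolerance=2):
    """Estimate how many distinct caches answered.

    samples: iterable of (unix_time_seconds, ttl_seconds) pairs, all for
    the same name and record type (hypothetical input format).
    tolerance: slack in seconds for rounding and network delay (assumed).
    """
    live = []      # expiry times of cache entries that should still be valid
    max_live = 0
    for t, ttl in sorted(samples):
        expiry = t + ttl
        # Forget entries that have expired by the time of this sample.
        live = [e for e in live if e > t]
        # An expiry we have not seen among the live ones means another cache.
        if not any(abs(e - expiry) <= tolerance for e in live):
            live.append(expiry)
        max_live = max(max_live, len(live))
    return max_live
```

A result of 1 does not prove there is a single server (the negative is hard to prove), and resolvers that cap or prefetch TTLs may confuse it, but a result of 2 or more is a fairly strong hint.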
3) Ask for the hostname "dig chaos txt hostname.bind @<server>" and see
if it changes (assuming they offer it).
If the customer is on a dynamic IP, get them to reconnect to get a
different IP, that might be when you see the change happen - assuming
whatever the load sharing function is does it by an L3 hash. If it’s L4
you’d see it changing between queries, which I suspect isn’t happening
in your case given how you describe the problem.
If any of the above things is true, then there’s a strong chance you’re
hitting different servers. If you can isolate it to a specific server
(or set of servers), I imagine when you do get in touch with someone
about the issue you’ll be able to resolve it much faster.
Assuming it's a hiccup in one of the servers in the pool, it won't be
possible to identify it positively, only by elimination. I'm not completely
sure the servers respond to hostname.bind queries. At least they don't
provide information using NSID.
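
If the servers do turn out to answer hostname.bind, repeating the query and tallying the answers would show how many backends sit behind the address. A minimal Python sketch follows, assuming dig is installed; the helper names, the attempt count and the timeout values are my own choices, and nothing here is confirmed to work against these particular resolvers.

```python
import subprocess
from collections import Counter

def dig_hostname_bind_cmd(server):
    # Same query as `dig chaos txt hostname.bind @<server>`, written with
    # explicit class/type flags and short output.
    return ["dig", "+short", "+time=2", "+tries=1",
            "-c", "CH", "-t", "TXT", "hostname.bind", "@" + server]

def clean_txt(answer):
    # dig +short prints TXT data inside double quotes; empty output is
    # recorded as "(no answer)".
    return answer.strip().strip('"') or "(no answer)"

def tally_backends(server, attempts=20):
    # Repeat the query and count how often each identity string comes back.
    seen = Counter()
    for _ in range(attempts):
        result = subprocess.run(dig_hostname_bind_cmd(server),
                                capture_output=True, text=True)
        seen[clean_txt(result.stdout)] += 1
    return seen
```

Calling tally_backends("<server>") returns a Counter of identities. More than one key means more than one backend; a single key only means the balancing did not change during the run.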
NZNOG mailing list
Technical Research Manager
desk: +64 4 495 2337
mobile: +64 21 400535