Strange network behaviour

Douglas K. Rand

2008-05-05 18:08:05 UTC

We had a very strange problem today. Two of our hosts could not reach
a particular server, but only those two hosts. All of our other hosts
could reach that server fine. (OK, I didn't try ALL of our IPs, but
the half dozen I did try worked fine.)

I checked all of our firewalls and routers, and everywhere I looked
the traffic was exiting our network just fine. On our edge routers I
saw the traffic going out, but no traffic coming back to the two
hosts in question. (We had good bidirectional traffic to all of our
other hosts.) And the two hosts in question were only having problems
connecting to ftp.agnewsonline.com.

Let's start with a traceroute from a working host. The originating
host is 12.192.92.14:

[~]% traceroute -I ftp.agnewsonline.com
traceroute to agnewsonline.com (64.46.45.226), 64 hops max, 60 byte packets
1 12.192.92.3 (12.192.92.3) 0.257 ms 0.171 ms 0.163 ms
2 pluto-0 (12.192.93.13) 0.401 ms 0.296 ms 0.294 ms
3 ixion-att (12.192.93.244) 1.260 ms 0.463 ms 1.116 ms
4 12.87.125.249 (12.87.125.249) 14.838 ms 9.314 ms 9.755 ms
5 tbr2.cgcil.ip.att.net (12.122.99.122) 24.528 ms 24.788 ms 23.009 ms
6 ggr2.cgcil.ip.att.net (12.123.6.69) 22.362 ms 23.410 ms 22.335 ms
7 192.205.33.186 (192.205.33.186) 23.448 ms 24.074 ms 29.405 ms
8 ae-31-53.ebr1.Chicago1.Level3.net (4.68.101.94) 22.800 ms 32.598 ms 36.093 ms
9 ae-68.ebr3.Chicago1.Level3.net (4.69.134.58) 23.446 ms 21.599 ms 34.060 ms
10 ae-3.ebr2.Denver1.Level3.net (4.69.132.61) 61.517 ms 57.482 ms 56.606 ms
11 ae-2.ebr2.Seattle1.Level3.net (4.69.132.53) 96.484 ms 114.264 ms 96.984 ms
12 ae-23-52.car3.Seattle1.Level3.net (4.68.105.36) 91.295 ms 88.700 ms 89.705 ms
13 BIG-PIPE-IN.car3.Seattle1.Level3.net (4.71.152.26) 90.053 ms 90.511 ms 92.072 ms
14 rc1wh-pos14-0.vc.shawcable.net (66.163.76.1) 90.062 ms 93.489 ms 90.757 ms
15 rc2wh-pos0-15-2-0.vc.shawcable.net (66.163.69.181) 96.527 ms 91.743 ms 97.254 ms
16 rd1ht-tge1-1-1.ok.shawcable.net (66.163.77.18) 101.412 ms 114.160 ms 100.530 ms
17 ra1ht-ge3-1.ok.shawcable.net (66.163.72.134) 105.651 ms 101.336 ms 101.628 ms
18 rx0ht-rack-force-2.ok.bigpipeinc.com (64.251.64.50) 111.960 ms 101.535 ms 116.136 ms
19 rf1.01.rackforce.net (69.10.128.198) 583.192 ms 491.170 ms 598.406 ms
20 64.46.45.226 (64.46.45.226) 110.207 ms 108.718 ms 107.279 ms

A traceroute from one of the hosts that doesn't work would reach
ae-3.ebr2.Denver1.Level3.net but go no further. I then tried pinging
the routers I couldn't reach. I could not ping:

ae-3.ebr2.Denver1.Level3.net (4.69.132.61)
ae-2.ebr2.Seattle1.Level3.net (4.69.132.53)
ae-23-52.car3.Seattle1.Level3.net (4.68.105.36)
BIG-PIPE-IN.car3.Seattle1.Level3.net (4.71.152.26)

but when I started pinging rc1wh-pos14-0.vc.shawcable.net (66.163.76.1)
not only did I start getting responses, but everything started working
to ftp.agnewsonline.com too, but just from that host. It really seemed
that pinging that router somehow fixed my problem.
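
The hop-by-hop check described above can be sketched in Python. This
is only an illustration, not what was actually run: the names HOPS,
ping_once and probe_hops are made up for this example, and it assumes
a system ping that accepts -c (count). Keep in mind that a router
that ignores pings is not necessarily failing to forward traffic.

```python
import subprocess
from typing import Callable, List, Tuple

# Hops taken from the working traceroute in this post.
HOPS = [
    "4.69.132.61",   # ae-3.ebr2.Denver1.Level3.net
    "4.69.132.53",   # ae-2.ebr2.Seattle1.Level3.net
    "4.68.105.36",   # ae-23-52.car3.Seattle1.Level3.net
    "4.71.152.26",   # BIG-PIPE-IN.car3.Seattle1.Level3.net
    "66.163.76.1",   # rc1wh-pos14-0.vc.shawcable.net
]


def ping_once(address: str, timeout_s: float = 3.0) -> bool:
    """Send one echo request with the system ping; True if it exits 0.

    Only the -c (count) flag is used. The deadline is enforced by
    subprocess rather than by a platform-specific ping option.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def probe_hops(
    hops: List[str],
    probe: Callable[[str], bool] = ping_once,
) -> List[Tuple[str, bool]]:
    """Probe each hop in path order and record whether it answered."""
    return [(hop, probe(hop)) for hop in hops]
```

Calling probe_hops(HOPS) from an affected host would list which of
these routers answer. The probe argument exists so the loop can be
exercised with a stand-in function instead of real pings.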

Well, I'm not sure I really believed that, but I still had another
host that couldn't reach ftp.agnewsonline.com, so on that host I
started a ping. My comments in /* */ describe what I was doing in
another window:

[~]% ping ftp.agnewsonline.com
PING agnewsonline.com (64.46.45.226): 56 data bytes
/* At this point in another window I started another ping: */
/* ping 66.163.76.1 and immediately this ping started working ... */
64 bytes from 64.46.45.226: icmp_seq=18 ttl=108 time=104.617 ms
64 bytes from 64.46.45.226: icmp_seq=19 ttl=108 time=105.775 ms
64 bytes from 64.46.45.226: icmp_seq=20 ttl=108 time=101.569 ms
--- agnewsonline.com ping statistics ---
22 packets transmitted, 3 packets received, 86% packet loss
round-trip min/avg/max/stddev = 101.569/103.987/105.775/1.774 ms
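
As a sanity check on the transcript above, the loss figure can be
recomputed from the counters, and the sequence numbers show that
nothing came back before icmp_seq=18. This is a small sketch; the
names TRANSCRIPT, reply_sequences and loss_percent are made up for
this example.

```python
import re
from typing import List

# Reply and summary lines copied from the ping output in this post.
TRANSCRIPT = """\
64 bytes from 64.46.45.226: icmp_seq=18 ttl=108 time=104.617 ms
64 bytes from 64.46.45.226: icmp_seq=19 ttl=108 time=105.775 ms
64 bytes from 64.46.45.226: icmp_seq=20 ttl=108 time=101.569 ms
22 packets transmitted, 3 packets received, 86% packet loss
"""


def reply_sequences(text: str) -> List[int]:
    """Return the icmp_seq numbers of the replies that were received."""
    return [int(seq) for seq in re.findall(r"icmp_seq=(\d+)", text)]


def loss_percent(transmitted: int, received: int) -> float:
    """Share of probes that got no reply, as a percentage."""
    return 100.0 * (transmitted - received) / transmitted


first_reply = min(reply_sequences(TRANSCRIPT))   # earliest answered probe
loss = loss_percent(22, 3)                       # 19 of 22 probes lost
```

Nineteen of 22 probes were lost, which matches the 86% that ping
reported once the fraction is dropped.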

It was like I threw a switch. The single outbound ICMP packet to
rc1wh-pos14-0.vc.shawcable.net (66.163.76.1) fixed everything for that
host.

I was wondering if anybody has any clue what might be going on. I've
never experienced a problem like this before.

Douglas K. Rand

2008-05-05 19:28:03 UTC

In the popular tradition of replying to my own post ...

It seems that this problem started right around the time I changed our
BGP configuration. I did:

config term
route-map att_out permit 9999
set as-path prepend 19317 19317
exit
clear ip bgp 12.87.125.249 out

This change increased the prepending of our own AS from one copy to
two. Previously it was: set as-path prepend 19317
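
To illustrate what the extra prepend does to the AS path that remote
networks see, here is a rough Python sketch. Only AS 19317 comes from
the config above. The other AS numbers are hypothetical, taken from
the range reserved for documentation, and the function names are made
up. Real BGP best-path selection checks other attributes (local
preference, for example) before AS-path length, so a longer path only
shifts traffic where those earlier steps tie.

```python
from typing import List, Sequence

OUR_AS = 19317        # from the route-map in this post
PROVIDER_A = 64500    # hypothetical: the provider we prepend toward
PROVIDER_B = 64501    # hypothetical: our other provider
TRANSIT_X = 64510     # hypothetical: a network between PROVIDER_B and the remote site


def announced_path(origin_as: int, prepend_count: int) -> List[int]:
    """AS path sent to a neighbor: the origin AS plus the prepended copies."""
    return [origin_as] * (1 + prepend_count)


def as_seen_remotely(transit: Sequence[int], announced: Sequence[int]) -> List[int]:
    """Path at a remote network: transit ASes first, then what we announced."""
    return list(transit) + list(announced)


# Via provider A, before the change (one prepend) and after (two prepends).
via_a_before = as_seen_remotely([PROVIDER_A], announced_path(OUR_AS, 1))
via_a_after = as_seen_remotely([PROVIDER_A], announced_path(OUR_AS, 2))

# Via provider B, which receives our announcement without prepending.
via_b = as_seen_remotely([TRANSIT_X, PROVIDER_B], announced_path(OUR_AS, 0))

# True when the path via A is strictly longer than the path via B.
prefers_b_before = len(via_a_before) > len(via_b)
prefers_b_after = len(via_a_after) > len(via_b)
```

In this made-up topology the two paths are the same length before the
change and the path via provider A is one AS longer afterwards, so a
remote network deciding on AS-path length alone would start sending
traffic in via the other provider.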

And we /could/ have had another host with a similar problem but with a
different end point, this time in Canada, and the (outbound) route to
our Canadian server does not transit either Level3 or Shaw cable.

We did not identify exactly when it started working again. We were
poking around this problem and then happened to check, and it was
working. It is *possible*, but by no means confirmed, that a
traceroute allowed to go through all its timeouts to the Canadian
server may have also switched that problem off.

Doug "Still searching for answers" Rand.

Deepak Jain

2008-05-05 19:37:23 UTC

Did your inbound path change as a result? Sounds like a path asymmetry
issue might be involved.

_______________________________________________
NANOG mailing list
http://mailman.nanog.org/mailman/listinfo/nanog

Douglas K. Rand

2008-05-05 20:23:25 UTC

Post by Deepak Jain
Did your inbound path change as a result?

Yes, I was trying to re-balance our inbound traffic a bit better. The
route-map change resulted in about 30% of our traffic coming in via
our other provider. The change was made around 16:00 (CDT) last
Friday, about 72 hours before this problem was brought to my attention.

Post by Deepak Jain
Sounds like a path asymmetry issue might be involved.

But why would the problem last almost 72 hours and be solved by a
single ICMP packet to a particular router? And why would only two of
our hosts be affected while all of our other systems work just fine?

I don't mean to whine, really! And I really don't mean to disagree;
I'm no expert in this stuff. I just don't understand what seems like
a very fine-grained problem.