Using Traceroute

Traceroute is the program that shows you the route over the network between two systems, listing all the intermediate routers a connection must pass through to get to its destination. It can help you determine why your connections to a given server might be poor, and can often help you figure out where exactly the problem is. It also shows you how systems are connected to each other, letting you see how your ISP connects to the Internet as well as how the target system is connected.
This tutorial was written for users of premium Usenet services, but can be useful for anyone wanting to learn to use traceroute.


Running a traceroute

The traceroute program is available on most computers which support networking, including most Unix systems, Mac OS X, and Windows 95 and later.
On a Unix system, including Mac OS X, run a traceroute at the command line like this:
traceroute server.name
If the traceroute command is not found, it may be present but not in your shell's search path. On some systems, traceroute can be found in /usr/sbin, which is often not in the default user path. In this case, run it with the full path:
/usr/sbin/traceroute server.name
On Mac OS X, if you would rather not open a terminal and use the command line, a GUI front-end for traceroute (and several other utilities) called Network Utility can be found in the Utilities folder within the Applications folder. Run it, click the “Traceroute” tab, and enter an address to run a trace to.
MTR is an alternative implementation of traceroute for Unix. It combines a trace with continuous pings of each hop to provide a more complete report all at once.
If you're stuck with Windows, the command is called tracert. Open a DOS window and enter the command:
tracert server.name
You can also download VisualRoute, a graphical traceroute program available for Windows, Sparc Solaris, and Linux. VisualRoute helps you analyze the traceroute, and provides a nifty world map showing you where your packets are going (though it's not always geographically accurate).
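If you want to launch a trace from a script, the platform differences described above can be sketched roughly as follows. This is only an illustration: the function name trace_command and its system parameter are my own inventions, and the only locations it checks are the normal search path and /usr/sbin.

```python
import platform
import shutil


def trace_command(host, system=None):
    """Build the command line for a trace to `host` (hypothetical helper).

    `system` defaults to the current platform name; it is a parameter
    only so the choice can be exercised without changing machines.
    """
    system = system or platform.system()
    if system == "Windows":
        # On Windows the program is called tracert.
        return ["tracert", host]
    # Look in the normal search path first, then in /usr/sbin,
    # which is often missing from a regular user's path.
    found = shutil.which("traceroute") or shutil.which("traceroute", path="/usr/sbin")
    # Fall back to the bare name if nothing was found.
    return [found or "traceroute", host]


if __name__ == "__main__":
    print(trace_command("server.name"))
```

The resulting list can be handed to subprocess.run() to actually run the trace.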

Reading the output

Here is some example traceroute output, from a Unix system:

traceroute to library.airnews.net (206.66.12.202), 30 hops max, 40 byte packets
 1  rbrt3 (208.225.64.50)  4.867 ms  4.893 ms  3.449 ms
 2  519.Hssi2-0-0.GW1.EWR1.ALTER.NET (157.130.0.17)  6.918 ms  8.721 ms  16.476 ms
 3  113.ATM3-0.XR2.EWR1.ALTER.NET (146.188.176.38)  6.323 ms  6.123 ms  7.011 ms
 4  192.ATM2-0.TR2.EWR1.ALTER.NET (146.188.176.82)  6.955 ms  15.400 ms  6.684 ms
 5  105.ATM6-0.TR2.DFW4.ALTER.NET (146.188.136.245)  49.105 ms  49.921 ms  47.371 ms
 6  298.ATM7-0.XR2.DFW4.ALTER.NET (146.188.240.77)  48.162 ms  48.052 ms  47.565 ms
 7  194.ATM9-0-0.GW1.DFW1.ALTER.NET (146.188.240.45)  47.886 ms  47.380 ms  50.690 ms
 8  iadfw3-gw.customer.ALTER.NET (137.39.138.74)  69.827 ms  68.112 ms  66.859 ms
 9  library.airnews.net (206.66.12.202)  174.853 ms  163.945 ms  147.501 ms
Here, I am tracing the route to library.airnews.net, the news server name at Airnews. The first line of output is information about what I'm doing; it shows the target system, that system's IP address, the maximum number of hops that will be allowed, and the size of the packets being sent.
Then we have one line for each system or router in the path between me and the target system. Each line shows the name of the system (as determined from DNS), the system's IP address, and three round trip times in milliseconds. The round trip times (or RTTs) tell us how long it took a packet to get from me to that system and back again, called the latency between the two systems. By default, three packets are sent to each system along the route, so we get three RTTs.
Sometimes, a line in the output may have one or two of the times missing, with an asterisk where it should be:

 9  host230-142.uuweb.com (208.229.230.142)  12.619 ms * *
In this case, the machine is up and responding, but for whatever reason it did not respond to the second and third packets. This does not necessarily indicate a problem; in fact, it is usually normal, and just means that the system discarded the packet for some reason. Many systems do this normally. These are most often computers, rather than dedicated routers. Systems running Solaris routinely show an asterisk instead of the second RTT.
It's important to remember that an occasional timeout in traceroute output is not necessarily an indication of packet loss, although this is a common misconception. With only three probes per hop, one missing response is too small a sample to mean much.
Sometimes you will see an entry with just an IP address and no name:

 1  207.126.101.2 (207.126.101.2)  0.858 ms  1.003 ms  1.152 ms
This simply means that a reverse DNS lookup on the address failed, so the name of the system could not be determined.
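If you want to process traces with a script, the line format described above is easy to pick apart. Here is a minimal sketch, assuming Unix-style output like the examples shown; the function name parse_hop and the dictionary layout are my own choices, and a timed-out probe is recorded as None. Lines where replies come from more than one address are only partly handled (the first name and address win).

```python
import re

# Hop number at the start of the line, then everything else.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
# "name (a.b.c.d)" as printed by Unix traceroute.
HOST_RE = re.compile(r"(\S+) \((\d+\.\d+\.\d+\.\d+)\)")
# Either a round trip time such as "12.619 ms" or a "*" timeout.
RTT_RE = re.compile(r"(\d+(?:\.\d+)?) ms|\*")


def parse_hop(line):
    """Parse one hop line; return None for lines that are not hops."""
    m = HOP_RE.match(line)
    if m is None:
        return None
    hop, rest = int(m.group(1)), m.group(2)
    host = HOST_RE.search(rest)
    name, ip = (host.group(1), host.group(2)) if host else (None, None)
    rtts = []
    for probe in RTT_RE.finditer(rest):
        # group(1) is None when the match was an asterisk.
        rtts.append(float(probe.group(1)) if probe.group(1) else None)
    return {"hop": hop, "name": name, "ip": ip, "rtts": rtts}


if __name__ == "__main__":
    print(parse_hop(" 9  host230-142.uuweb.com (208.229.230.142)  12.619 ms * *"))
```

When the reverse DNS lookup fails, name and ip simply come back as the same string, just as they appear in the output.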
If your trace ends in all timeouts, like this:

12  al-fa3-0-0.austtx.ixcis.net (216.140.128.242)  84.585 ms  92.399 ms  87.805 ms
13  * * *
14  * * *
15  * * *
This means that the target system could not be reached. More accurately, it means that the packets could not make it there and back; they may actually be reaching the target system but encountering problems on the return trip (more on this later). This is possibly due to some kind of problem, but it may also be an intentional block due to a firewall or other security measures, and the block may affect traceroute but not actual server connections.
A trace can end with one of several error indications indicating why the trace cannot proceed. In this example, the router is indicating that it has no route to the target host:

 4  rbrt3.exit109.com (208.225.64.50)  35.931 ms !H *  39.970 ms !H
The !H is a “host unreachable” error message (it indicates that an ICMP error message was received). The trace will stop at this point. Possible ICMP error messages of this nature include:
  • !H: Host unreachable. The router has no route to the target system.
  • !N: Network unreachable.
  • !P: Protocol unreachable.
  • !S: Source route failed. You tried to use source routing, but the router is configured to block source-routed packets.
  • !F: Fragmentation needed. This indicates that the router is misconfigured.
  • !X: Communication administratively prohibited. The network administrator has blocked traceroute at this router.
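The flags listed above can be turned into a small lookup table, which is handy if you are scanning saved traces. This is just a sketch; the names ICMP_FLAGS and error_flags are made up for the example, and only the six flags described here are recognized.

```python
import re

# Flag meanings, taken from the list above.
ICMP_FLAGS = {
    "!H": "host unreachable",
    "!N": "network unreachable",
    "!P": "protocol unreachable",
    "!S": "source route failed",
    "!F": "fragmentation needed",
    "!X": "communication administratively prohibited",
}

FLAG_RE = re.compile(r"![HNPSFX]")


def error_flags(line):
    """Return (flag, meaning) pairs for each distinct flag on a hop line."""
    seen = []
    for flag in FLAG_RE.findall(line):
        if flag not in seen:
            seen.append(flag)
    return [(flag, ICMP_FLAGS[flag]) for flag in seen]


if __name__ == "__main__":
    line = " 4  rbrt3.exit109.com (208.225.64.50)  35.931 ms !H *  39.970 ms !H"
    print(error_flags(line))
```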
Sometimes, with some versions of traceroute, you will see TTL warnings after the times:

 6  qwest-nyc-oc12.above.net (208.185.156.26)  90.0 ms (ttl=251!)  90.0 ms (ttl=251!)  90.0 ms (ttl=251!)
This merely indicates that the TTL (time-to-live) value on the reply packet was different from what was expected. This probably means that your route is asymmetric (see below). This is not shown by all versions of traceroute, and can be safely ignored.
The output of the Windows version of traceroute is slightly different from the Unix examples (I have censored my router's name and IP address from the listing):

Tracing route to news-east.usenetserver.com [63.211.125.90]
over a maximum of 30 hops:

  1     3 ms     3 ms     2 ms  my.router [xxx.xxx.xx.xxx]
  2    35 ms    36 ms    35 ms  rbtserv5.exit109.com [208.225.64.56]
  3    36 ms    37 ms    36 ms  rbrt3.exit109.com [208.225.64.50]
  4    41 ms    40 ms    41 ms  571.Hssi5-0.GW1.EWR1.ALTER.NET [157.130.3.205]
  5    42 ms    44 ms    52 ms  113.ATM2-0.XR1.EWR1.ALTER.NET [146.188.176.34]
  6    43 ms    41 ms    41 ms  193.at-1-0-0.XR1.NYC9.ALTER.NET [152.63.17.218]
  7    61 ms    41 ms    41 ms  181.ATM6-0.BR2.NYC9.ALTER.NET [152.63.22.225]
  8    41 ms    42 ms    47 ms  137.39.52.10
  9    47 ms    42 ms    42 ms  so-6-0-0.mp2.NewYork1.level3.net [209.247.10.45]
 10    65 ms    63 ms    68 ms  loopback0.hsipaccess1.Atlanta1.Level3.net [209.244.3.2]
 11   104 ms    68 ms    80 ms  news-east.usenetserver.com [63.211.125.90]

Trace complete.
The Windows version does not show ICMP error messages in the manner described above. Errors are shown as (possibly ambiguous or confusing) text. For example, a “host unreachable” error will be shown as “Destination net unreachable” on Windows.
The rest of the examples will be in Unix format.
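For completeness, here is a rough sketch of how the Windows layout differs when read by a script: the three times come first, and the address is in square brackets (or stands alone when there is no name). The function name parse_tracert_hop is hypothetical, and the pattern covers only whole-millisecond times as in the listing above, plus an asterisk for a timed-out probe.

```python
import re

# Hop number, three probes ("41 ms" or "*"), then the name/address part.
WIN_HOP_RE = re.compile(r"^\s*(\d+)((?:\s+(?:\d+ ms|\*)){3})\s+(.*)$")


def parse_tracert_hop(line):
    """Parse one hop line of tracert output; None for any other line."""
    m = WIN_HOP_RE.match(line)
    if m is None:
        return None
    # findall yields "" for an asterisk, which becomes None.
    rtts = [float(v) if v else None
            for v in re.findall(r"(\d+) ms|\*", m.group(2))]
    target = m.group(3).strip()
    named = re.match(r"(\S+) \[(.+)\]$", target)
    if named:
        name, ip = named.group(1), named.group(2)
    else:
        # No reverse DNS name: only the bare address is printed.
        name, ip = None, target
    return {"hop": int(m.group(1)), "name": name, "ip": ip, "rtts": rtts}


if __name__ == "__main__":
    print(parse_tracert_hop("  8    41 ms    42 ms    47 ms  137.39.52.10"))
```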

The reverse route

Any connection over the Internet actually depends on two routes: the route from your system to the server, and the route from that server back to your system. These routes may be (and often are) completely different (asymmetric). If they differ, a problem in your connection could be a problem with either the route to the server, or with the route back from the server. A problem reflected in a traceroute output may actually not lie with the obvious system in your trace; it may rather be with some other system on the reverse route back from the system that looks, from the trace, to be the cause of the problem.
So a traceroute from you to the server is only showing you half of the picture. The other half is the return route or reverse route. So how can you see that route?
In the good old days, you could use source routing with traceroute to see the reverse trace back to you from a host. The idea is to specify what is called a loose source route, which specifies a system your packets should pass through before proceeding on to their destination.
The ability to use loose source routing to see the reverse route could be pretty handy. Unfortunately, source routing has a great potential for abuse, and therefore most network administrators block all source-routed packets at their border routers. So, in practice, loose source routes aren't going to work.
These days, the only hope you likely have of running a reverse traceroute is if the system you want to trace from has a traceroute facility on their web site. Many systems, and Usenet providers in particular, have a web page where you can run a traceroute from their system back to yours. In combination with your trace to their system, this can give you the other half of the picture. A list of Usenet provider traceroute pages is linked at the end of this document.

Tracing from elsewhere

It can also be useful to see the result of a traceroute from somewhere else on the net. There are many public traceroute pages available which let you trace from those systems to other systems or back to your own system. There is an exhaustive list at www.traceroute.org.
Since many systems are multi-homed (have more than one connection to the Internet), you may have to run traces to a system from multiple locations in order to “see” all of its connections. In addition to diagnosing technical problems, this can be useful to determine what kind of connections a system has to the Internet.

Finding the problem: timeouts

If your trace to a system ends in timeouts, and never completes, there could be a problem. (The other explanation is that a system is blocking traceroute attempts, either by filtering all ICMP messages or by other means.) Your next step is to figure out where the problem is.
Well, obviously, if the trace stops at a particular system and can't go any further, then that system is where the problem lies, right? Possibly, but not necessarily.
If your traceroute ends in timeouts at a certain system, it's likely that either the connection between that system and the next system on the route, or the next system itself, is the source of the problem. The system may be down, or the network connecting them may be down. You may just have to wait for the problem to be fixed, especially if the problem system is not at your ISP and thus you aren't a paying customer of that network.
The problem could, however, not be with that system. Recall that the packets must travel from your system to the router and back again before you can see the results, and that the return route may be different from the forward route. Thus, the problem could lie somewhere on the return route between the system giving the timeouts and your own system, and that problem may not be reflected in the previous parts of the trace because the route may be entirely different.
Let's say you have a timeout like this:

16  c1-pos5-3.snjsca1.home.net (24.7.66.77)  136.612 ms  129.795 ms  129.133 ms
17  bb1-pos6-0-0.rdc1.sfba.home.net (24.7.72.18)  130.473 ms  137.609 ms  134.162 ms
18  * * *
The last reachable system on the route is at hop 17. The problem may be with the system at hop 18, or with the network connection between hops 17 and 18. Or it may be on the return route. It's very possible that the routers at hop 17 and hop 18 have different return routes to your system. The return route from 17 may work just fine, while the return route from 18 has a problem. That problem could be with that system, or it could be a totally different system, many hops away. It could even be a problem at your own ISP. The only way to tell is to see the reverse trace. A reverse trace from hop 17 would be useful here as well, to verify that the routes are indeed different. Of course, it may be difficult or impossible to obtain traceroutes from those systems, because the network administrator at home.net would have to run them for you, and is probably too busy to worry about such a request.
In this case, you can try running traces to the target system from various other places (use the list at traceroute.org) to see if it is reachable from elsewhere. In the above example, if you knew what router was normally at hop 18 (from seeing it in previous traces), you could try a trace to that router from another site.
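When you are looking at a long trace, the first thing to pin down is the last hop that answered at all. A minimal sketch, assuming each hop has already been reduced to a (hop number, list of RTTs) pair with None for a timed-out probe; the function name is my own:

```python
def last_responding_hop(hops):
    """Return the number of the last hop that answered at least one probe.

    `hops` is a list of (hop_number, rtts) pairs in trace order.
    Returns None if nothing answered at all.
    """
    last = None
    for number, rtts in hops:
        if any(rtt is not None for rtt in rtts):
            last = number
    return last


if __name__ == "__main__":
    trace = [
        (16, [136.612, 129.795, 129.133]),
        (17, [130.473, 137.609, 134.162]),
        (18, [None, None, None]),
    ]
    print("last hop that answered:", last_responding_hop(trace))
```

Keep in mind what the text above says: this only tells you where to start looking. The fault may be at the next hop, on the link to it, or somewhere on a return route you cannot see.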

Finding the problem: long routes

If your route to a server is very long, performance is going to suffer. A long route can be due to less-than-optimal configuration within some network along the way. Take a look at this route:

traceroute to 24.48.145.237 (24.48.145.237), 30 hops max, 40 byte packets
 1  main2-249-97.iad.above.net (209.249.97.3)  1.143 ms  0.559 ms  0.382 ms
 2  core1-main2-oc3-1.iad.above.net (209.249.0.25)  0.574 ms  0.886 ms  0.429 ms
 3  sjc-iad-oc12-1.sjc.above.net (207.126.96.121)  82.134 ms  82.537 ms  82.158 ms
 4  sl-gw8-sj-0-1.sprintlink.net (144.232.192.129)  82.523 ms  82.383 ms  82.949 ms
 5  sl-bb12-sj-6-0.sprintlink.net (144.232.3.109)  82.348 ms  82.762 ms  83.029 ms
 6  sl-bb10-sj-8-0.sprintlink.net (144.232.3.85)  83.346 ms  83.012 ms  83.006 ms
 7  sl-bb10-rly-6-0.sprintlink.net (144.232.9.13)  136.004 ms  135.804 ms  136.274 ms
 8  sl-bb6-dc-0-0-0.sprintlink.net (144.232.7.170)  137.625 ms  137.204 ms  136.794 ms
 9  gip-dc-2-fddi1-0.gip.net (204.59.144.194)  137.344 ms  138.156 ms  139.390 ms
10  gip-arch-1-atm2-0-0-132-atm.gip.net (204.59.5.25)  311.850 ms  325.246 ms  285.607 ms
11  gip-telehouse-1-atm0-0-0-333-atm.gip.net (204.59.5.14)  281.472 ms  291.957 ms  314.661 ms
12  gip-linx-fddi0.gip.net (204.59.2.198)  277.425 ms  297.364 ms  248.030 ms
13  linx-gw1.UK.EU.net (195.66.224.90)  291.800 ms  213.447 ms  221.377 ms
14  Nyk-nr01.NY.US.EU.net (134.222.228.158)  266.863 ms  301.220 ms  320.008 ms
15  nyc-core-02.inet.qwest.net (205.171.17.9)  206.191 ms  233.207 ms *
16  nyc-core-03.inet.qwest.net (205.171.17.85)  235.085 ms  270.805 ms  252.668 ms
17  nyc-core-01.inet.qwest.net (205.171.17.82)  281.931 ms  277.519 ms  278.152 ms
18  wdc-core-02.inet.qwest.net (205.171.5.235)  265.548 ms  233.789 ms  219.698 ms
19  wdc-core-03.inet.qwest.net (205.171.24.6)  200.913 ms  225.456 ms  246.335 ms
20  atl-core-01.inet.qwest.net (205.171.5.243)  237.049 ms  253.304 ms  215.435 ms
21  atl-edge-04.inet.qwest.net (205.171.21.50)  234.406 ms  289.490 ms  300.829 ms
22  205.171.45.150 (205.171.45.150)  296.876 ms  333.235 ms  272.397 ms
23  Adelphia-pvc55-t3-gw.aibusiness.net (208.235.111.18)  287.180 ms  268.736 ms  276.649 ms
24  surf4-145-237.pbc.adelphia.net (24.48.145.237)  382.868 ms  420.165 ms  393.398 ms
In this example, both the source and destination of the trace are in the United States. However, note that between hops 11 and 14, the route goes to London and back (LINX is the London Internet Exchange). Obviously, this is a problem; there are two transatlantic hops here which are completely unnecessary. Sprintlink is handing the traffic off to gip.net, which is taking it across the ocean before giving it to Qwest.

Finding the problem: high latency

Recall that the three numbers given on each line of output show the round trip times (latency) in milliseconds. Smaller numbers generally mean better connections. As the latency of a connection increases, interactive response suffers. Download speed can also suffer as a result of high latency (due to TCP windowing), or as a result of whatever is actually causing that high latency.
Typically, a modem connection's inherent latency will be around 120-130ms. The latency on an ISDN line is usually around 40-45ms. If you use a connection of this type, you won't see any better than these numbers.
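To see why latency alone can limit download speed, consider that a TCP sender can have at most one window of unacknowledged data in flight per round trip. Here is a back-of-the-envelope sketch of that ceiling. It assumes a window of 65,535 bytes, the largest possible without the TCP window scaling option; the window your system actually uses may well be different, and real transfers are affected by other things too.

```python
def max_tcp_throughput(window_bytes, rtt_ms):
    """Upper bound on throughput in bytes per second for one connection.

    At most `window_bytes` can be in flight per round trip, so the
    ceiling is simply window size divided by round trip time.
    """
    return window_bytes * 1000.0 / rtt_ms


if __name__ == "__main__":
    window = 65535  # assumed window size in bytes, see the note above
    for rtt in (10, 50, 150, 400):
        kbytes = max_tcp_throughput(window, rtt) / 1024
        print(f"RTT {rtt:3d} ms: at most about {kbytes:.0f} KB/s")
```

The point is the shape of the relationship: with the same window, quadrupling the round trip time cuts the ceiling to a quarter.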
If you see, in a trace output, a large “jump” in latency from one hop to the next, that could indicate a problem. It could be a saturated (overused) network link; a slow network link; an overloaded router; or some other problem at that hop. Of course, it could also be a problem anywhere on the return route from the high-latency hop as well. You can use the ping program (described below) to get a better idea of the latency as well as the packet loss to a given site or router; traceroute only does three probes per router (by default), which isn't a very good sample on its own.
A jump in latency can also indicate a long hop, such as a cross-country link or one that crosses an ocean. A long line is naturally going to have higher latency than a short one. For example:

 4  core1.telehouse.level3.net (195.66.224.77)  2.355 ms  4.932 ms  3.473 ms
 5  core1.London1.Level3.net (212.113.2.65)  2.550 ms  1.934 ms  3.110 ms
 6  atm10-0-100.core1.NewYork1.Level3.net (209.244.3.229)  77.629 ms  75.664 ms  75.351 ms
The link between hops 5 and 6 is transatlantic, and thus is adding more than 70ms to the latency. This is normal.
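Spotting jumps like this one can be automated. The sketch below compares the best (lowest) RTT of each hop with that of the previous answering hop and reports any increase above a threshold. The 50 ms threshold is an arbitrary assumption of mine, and the function name is hypothetical. As the text says, a jump may be a perfectly normal long-haul link, so treat the result as a list of places to look rather than a diagnosis.

```python
def latency_jumps(hops, threshold_ms=50.0):
    """Find consecutive answering hops whose best RTT rises sharply.

    `hops` is a list of (hop_number, rtts) pairs, with None for a
    timed-out probe. Returns (from_hop, to_hop, increase_ms) tuples.
    """
    jumps = []
    previous = None  # (hop_number, best_rtt) of the last answering hop
    for number, rtts in hops:
        answered = [rtt for rtt in rtts if rtt is not None]
        if not answered:
            continue  # skip hops that timed out completely
        best = min(answered)
        if previous is not None and best - previous[1] > threshold_ms:
            jumps.append((previous[0], number, best - previous[1]))
        previous = (number, best)
    return jumps


if __name__ == "__main__":
    trace = [
        (4, [2.355, 4.932, 3.473]),
        (5, [2.550, 1.934, 3.110]),
        (6, [77.629, 75.664, 75.351]),
    ]
    for start, end, increase in latency_jumps(trace):
        print(f"hops {start} -> {end}: +{increase:.1f} ms")
```

Using the lowest of the three times is a design choice: the minimum is less affected by a single slow reply than the average would be.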

Finding the problem: routing weirdness

One example of “weirdness” that you might see in traceroute output is exposure of private address space. Certain ranges of IP addresses are reserved for private, non-Internet use. These address ranges are not assigned to anyone, and are open for use by any system. They cannot be routed over the Internet, and thus are for internal use only. Sending traffic between private address space and outside networks must be done via internal routing or address translation.
The reserved private address ranges are:
  • 10.*
  • 172.[16-31].*
  • 192.168.*
Private addresses should never be visible over the Internet. But, sometimes you will see them in traceroute output. If they appear within your local network, this is okay; private addresses inside your own network can be visible to you. If, however, they appear within someone else's network, this can be problematic:

10  ebay-2-gw.customer.ALTER.NET (157.130.197.90)  114.204 ms  123.232 ms  120.957 ms
11  10.1.2.5 (10.1.2.5)  110.693 ms  114.475 ms  107.747 ms
12  * * *
13  * * *
The private address 10.1.2.5 within another network should not be visible to us. In this case, though, it is the last visible address before the trace ends in timeouts.
Visibility of private IP addresses doesn't necessarily (or even usually) mean that the route does not work. It is often simply the way the administrators of the target network have set up their system. In fact, the output above, despite the private IP address and the timeouts, shows a route that works perfectly well for web access.
However, a route which includes private addresses is difficult to troubleshoot. You can't ping the private routers to see if there is any packet loss. You can't trace directly to them from other sites. And in general, they show a certain level of cluelessness in how the network is set up.
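Checking whether an address from a trace falls in one of the private ranges listed above is simple with Python's standard ipaddress module. In this sketch I have written the three ranges in CIDR notation (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), which is my translation of the wildcard forms used above; the names PRIVATE_NETS and is_private are made up for the example.

```python
import ipaddress

# The three reserved private ranges from the list above, in CIDR form.
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.*
    ipaddress.ip_network("172.16.0.0/12"),   # 172.[16-31].*
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.*
]


def is_private(address):
    """Return True if `address` lies in one of the three private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in PRIVATE_NETS)


if __name__ == "__main__":
    for addr in ("157.130.197.90", "10.1.2.5"):
        print(addr, "private" if is_private(addr) else "public")
```

I check against an explicit list rather than the module's own is_private attribute because that attribute also covers other special-purpose ranges beyond the three discussed here.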
Here is another example of routing weirdness:

11  USW-phx-gw.customer.ALTER.NET (137.39.162.10)  142.840 ms  151.245 ms  129.564 ms
12  206.80.192.221 (206.80.192.221)  127.569 ms vdsla121.phnx.uswest.net (216.161.182.121)  185.214 ms *
13  vdsla121.phnx.uswest.net (216.161.182.121)  442.912 ms  205.956 ms  221.537 ms
14  vdsla121.phnx.uswest.net (216.161.182.121)  164.728 ms  186.997 ms  190.414 ms
15  vdsla121.phnx.uswest.net (216.161.182.121)  306.964 ms  189.152 ms  221.288 ms
All looks well until hop 12. At that hop, the first packet is replied to from 206.80.192.221, but the second and third (which should be coming from the same place) are being returned from a different address, and timing out, respectively. After that, hops 13, 14, and 15 are all showing the same address! Since the response times are actually different, though, we can guess that they are, in reality, different systems. The trace ends normally at hop 15.
So what the heck is going on here? US West says this is a security measure, to hide the details of their internal network. The last few hops all return the address of the end-user's ADSL line, rather than their actual address. I'm not entirely sure what kind of “security” this is meant to provide.
Obviously, this makes any kind of troubleshooting of this connection next to impossible. If you encounter problems in this situation, the best you can do is contact the network provider and let them deal with it.
Sometimes you might see a route start “looping” back and forth between two routers, until the 30-hop limit is reached. This is a routing loop. This usually means that one router has lost communication (BGP) with another, and thus has dropped that route. Since the router has lost the route it needs, it sends the packet back where it came from, thinking maybe that is the best route. That router knows better and sends it back to the other one, over and over. Here's an example of a loop:

14  hou-core-03.inet.qwest.net (205.171.5.146)  165.484 ms  164.335 ms  175.928 ms
15  hou-core-02.inet.qwest.net (205.171.23.5)  162.291 ms  172.713 ms  171.532 ms
16  kcm-core-01.inet.qwest.net (205.171.5.201)  212.967 ms  193.454 ms  199.457 ms
17  dal-core-01.inet.qwest.net (205.171.5.203)  206.296 ms  212.383 ms  189.592 ms
18  kcm-core-01.inet.qwest.net (205.171.5.201)  210.201 ms  225.674 ms  208.124 ms
19  dal-core-01.inet.qwest.net (205.171.5.203)  189.089 ms  201.505 ms  201.659 ms
20  kcm-core-01.inet.qwest.net (205.171.5.201)  334.19 ms  320.39 ms  245.182 ms
21  dal-core-01.inet.qwest.net (205.171.5.203)  218.519 ms  210.519 ms  246.635 ms
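A loop like this can be detected mechanically: an address shows up again after a different address has answered in between. A minimal sketch, assuming the trace has been reduced to (hop number, address) pairs with None for a hop that timed out; the function name find_loop is my own. Note that the same address on consecutive hops, as in the US West example earlier, is deliberately not counted as a loop.

```python
def find_loop(hops):
    """Return (address, first_hop, repeat_hop) for the first loop, or None.

    `hops` is a list of (hop_number, address) pairs in trace order.
    """
    first_seen = {}   # address -> hop number where it first appeared
    previous = None   # address of the last hop that answered
    for number, address in hops:
        if address is None:
            continue
        # A revisit only counts if some other router answered in between.
        if address in first_seen and address != previous:
            return address, first_seen[address], number
        first_seen.setdefault(address, number)
        previous = address
    return None


if __name__ == "__main__":
    trace = [
        (14, "205.171.5.146"),
        (15, "205.171.23.5"),
        (16, "205.171.5.201"),
        (17, "205.171.5.203"),
        (18, "205.171.5.201"),
        (19, "205.171.5.203"),
    ]
    print(find_loop(trace))
```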

Finding the problem: using ping

The ping program is used to determine whether a route is experiencing packet loss, and to measure latency.
On a Unix SVR4 system (such as Solaris), use the command:

ping -s news.server.name
On BSD Unix, Mac OS X, or Linux, use:
ping news.server.name
And if you're stuck with Windows, open a DOS window and type:
ping -t news.server.name
The output will consist of one line per ping (one per second), giving you the round-trip response time (RTT, or latency). The lower, the better. Note that if you can't traceroute to a system due to administrative blocking, you may not be able to ping it either.
Let the pings go for a while, then press Control-C to stop them. You'll see a summary like this, on Unix:

----usenet73.supernews.com PING Statistics----
76 packets transmitted, 76 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 138/144/179
Or like this, on Windows:

Ping statistics for 207.126.101.73:
    Packets: Sent = 73, Received = 73, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 132ms, Maximum =  164ms, Average =  139ms
First you see an indication of packet loss. The more loss you see, the worse your connection will be, because every lost packet on a data connection must be retransmitted. If you see 20% packet loss, it's going to be painful. This number is more meaningful if you let ping run for a while; if you only do five pings, 20% packet loss means it dropped one packet, which could be no big deal. Let it go for a while.
Latency times are important for performance; the lower the better. If you play online games like Quake you are probably familiar with this concept. For Usenet reading, this will matter most if you read news online, interactively, staying connected to the server the whole time. If you use an offline newsreader which downloads articles all at once and lets you read them from your local disk, latency is much less important (it can affect sustained download speeds, but that is beyond the scope of this document). What the output is showing you is the minimum, average, and maximum latency times seen during the ping run. A few systems may include a fourth number showing the standard deviation.
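The arithmetic behind the summary is nothing fancy. Here is a sketch that computes the same figures from a list of individual ping results, with None standing for a lost packet. The function name and the sample values are invented for illustration; they are not output from a real run.

```python
def ping_summary(rtts):
    """Summarize ping results: loss percentage and min/avg/max latency.

    `rtts` holds one round trip time in ms per ping sent, or None
    where no reply came back.
    """
    sent = len(rtts)
    replies = [rtt for rtt in rtts if rtt is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    summary = {"sent": sent, "received": len(replies), "loss_pct": loss_pct,
               "min": None, "avg": None, "max": None}
    if replies:
        summary["min"] = min(replies)
        summary["avg"] = sum(replies) / len(replies)
        summary["max"] = max(replies)
    return summary


if __name__ == "__main__":
    # Invented sample: five pings, one of them lost.
    print(ping_summary([138, 144, None, 150, 179]))
```

With only five pings, the single lost packet in the sample already works out to 20% loss, which is exactly why a short run tells you so little.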
If you see packet loss on a connection, you can use ping with your traceroute output to find the source of the loss. Start by pinging the next to last router in the trace. If you still see packet loss, ping the one before that. Eventually the packet loss will disappear, and you have found the part of the path where the problem begins.
Note, however, that as with other problems, the cause of the loss could be the first router on the path showing packet loss, or it could be anywhere on the return path from that router. Remember that the return path can be totally different from what you see in your trace output. But, this gives you a good place to start pointing fingers.
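The backward walk described above can be written down as a short procedure. This sketch assumes you have already pinged each router and noted the loss percentage for each; the function name, the tolerance parameter, and the sample figures are all hypothetical.

```python
def first_lossy_hop(loss_by_hop, tolerance_pct=0.0):
    """Walk back from the far end until the loss disappears.

    `loss_by_hop` is a list of (hop_number, loss_pct) pairs ordered
    from your own system outward. Returns the nearest hop in the
    unbroken run of lossy hops at the far end, or None if the last
    hop shows no loss.
    """
    suspect = None
    for hop, loss in reversed(loss_by_hop):
        if loss > tolerance_pct:
            suspect = hop   # still lossy, keep walking back
        else:
            break           # loss has disappeared, stop here
    return suspect


if __name__ == "__main__":
    # Invented measurements, for illustration only.
    measured = [(1, 0.0), (2, 0.0), (3, 0.0), (4, 12.0), (5, 15.0), (6, 14.0)]
    print("loss first appears at hop", first_lossy_hop(measured))
```

As the text cautions, the hop this returns is only where the loss becomes visible from your side; the real cause may be on the return path from that router.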

Under the hood

You don't need to worry about the low-level details of how traceroute works in order to use it. But, if you're interested, here they are.
Traceroute works by causing each router along a network path to return an ICMP (Internet Control Message Protocol) error message. An IP packet contains a time-to-live (TTL) value which limits how many routers it can pass through on the way to its destination before being discarded. Each time a packet passes through a router, its TTL value is decremented by one; when it reaches zero, the packet is dropped, and an ICMP Time-To-Live Exceeded error message is returned to the sender.
The traceroute program sends its first group of packets with a TTL value of one. The first router along the path will therefore discard the packet (its TTL is decremented to zero) and return the TTL Exceeded error. Thus, we have found the first router on the path. Packets can then be sent with a TTL of two, and then three, and so on, causing each router along the path to return an error, identifying it to us. Eventually either the final destination is reached, or the maximum value (default is 30) is reached and the traceroute ends.
At the final destination, a different error is returned. Most traceroute programs work by sending UDP datagrams to some random high-numbered port where nothing is likely to be listening. When that final system is reached, since nothing is answering on that port, an ICMP Port Unreachable error message is returned, and we are finished.
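The mechanism is easier to follow as a simulation. The sketch below does not touch the network at all (a real implementation needs raw sockets and usually administrator privileges); it just models a path as a list of names and applies the TTL rules described above. The function names and the sample path are invented for the example.

```python
def send_probe(path, ttl):
    """Simulate one probe; return (responder, icmp_message).

    `path` lists the routers in order, with the destination last.
    """
    for router in path[:-1]:
        ttl -= 1  # each router decrements the TTL by one
        if ttl == 0:
            # The packet is dropped here and the router reports it.
            return router, "time exceeded"
    # The probe reached the destination, where nothing is listening
    # on the chosen UDP port.
    return path[-1], "port unreachable"


def simulate_traceroute(path, max_hops=30):
    """Send probes with TTL 1, 2, 3, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, message = send_probe(path, ttl)
        hops.append((ttl, responder, message))
        if message == "port unreachable":
            break  # final destination reached, we are finished
    return hops


if __name__ == "__main__":
    for hop in simulate_traceroute(["router-a", "router-b", "target"]):
        print(hop)
```

A real traceroute sends three probes per TTL value and times each reply; the simulation sends one and leaves timing out to keep the TTL logic in plain view.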
The Windows version of traceroute uses ICMP Echo Request packets (ping packets) rather than UDP datagrams. In practice, this seems to make little difference in the outcome, unless a system along the route is blocking one type of traffic but not the other.
In the unlikely event that some program happens to be listening on the UDP port that traceroute is trying to contact, the trace will fail at the last hop. You can run another trace using ICMP Echo Requests, which will probably succeed, or specify a different target port for the UDP datagrams.
A few versions of traceroute, such as the one on Solaris, allow you to choose either method (high-port UDP or ICMP echo requests).


Usenet provider traceroute pages - traceroute from a provider to you
Traceroute.org - traceroute from just about anywhere to anywhere else
VisualRoute - a graphical traceroute program

http://www.exit109.com/~jeremy/news/providers/traceroute.html
