Slides and Notes 04-August-2011

klausz went into a strange state yesterday--it would accept a login but then freeze, and it would not reply to nagios or do pretty much anything else. John rebooted it. Yesterday at about 19:30 it took another nap (pace nagios) and recovered spontaneously (?) at about 23:45. This was just after a spike in CPU wait time, with a little bit of ethernet traffic. When it recovered there was a short spike in ethernet traffic (30 minutes of about 40-50 MB/sec) and then a lot of not much. Notice that the wait time is the lion's share of the most recent activity.

Turns out it was trying to transfer an 87GB file, and that didn't quite fit in the 50GB allowed area. With 100GB allowed, it sort of transferred--the last of it finally arrived after 4 hours and 44 minutes! The file on disk was 93596478700 bytes and the copy in the collector directory is 93687881512 bytes, which is actually bigger. When the allowed area was about 83% full, rfs started throttling.
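
A quick check of the numbers above, as a sketch only. The byte counts and the 4 hours 44 minutes are taken from these notes; the variable names are made up for illustration, and the average rate assumes the whole file moved within that span. The "87GB" figure matches the on-disk size in binary units (GiB). The extra bytes in the collector copy also happen to equal the file size divided by 1024, rounded up; whether that reflects some per-block overhead is a guess and not something these notes confirm.

```python
# Figures quoted in the notes; names are illustrative only.
SIZE_ON_DISK = 93_596_478_700        # bytes, original file
SIZE_IN_COLLECTOR = 93_687_881_512   # bytes, copy in the collector directory
ELAPSED_S = 4 * 3600 + 44 * 60       # 4 hours 44 minutes, in seconds

GIB = 2 ** 30
MIB = 2 ** 20

size_gib = SIZE_ON_DISK / GIB                      # size in binary gigabytes
extra_bytes = SIZE_IN_COLLECTOR - SIZE_ON_DISK     # how much bigger the copy is
extra_percent = 100 * extra_bytes / SIZE_ON_DISK

# Assumption: the transfer ran for the full 4h44m, so this is only an average.
avg_rate_mib_s = SIZE_ON_DISK / ELAPSED_S / MIB

# Ceiling division: file size / 1024, rounded up.
size_over_1024 = -(-SIZE_ON_DISK // 1024)

print(f"file size:     {size_gib:.2f} GiB")
print(f"extra in copy: {extra_bytes} bytes ({extra_percent:.3f}%)")
print(f"average rate:  {avg_rate_mib_s:.2f} MiB/sec")
print(f"extra == ceil(size/1024): {extra_bytes == size_over_1024}")
```

The average comes out far below the 40-50MB/sec burst mentioned earlier, which fits the picture of a transfer that spent most of its time throttled.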

After the ball is over... The transfer started Thursday morning. You can see in klausz the relics of the failed transfer and the hang.

You can see from the lustre server below that there is no particular contention for access to the files.

You can see that sam starts at medium speed, really picks up, and then throttles to almost nothing. After the transfer is done, it does a little bookkeeping and sends the file to chuma in a fast burst.

On chuma there seems to have been some additional activity--not sure what--just before the transfer kicked in.


Modified 04-August-2011 at 08:27