Notification Type
Technical Information
Service Affecting
Yes
Message
We are currently experiencing network connectivity issues at our Interxion facility. Our engineers are working to resolve this as soon as possible.
Update: 05:40
The following timeline details tonight's events.
03:00 Switch swap maintenance begins. The engineer decides he cannot proceed and instead attempts some non-intrusive maintenance on access router 2 (the hot standby router for customers on unfirewalled VLANs, BGP customers, and customers with HA firewall setups).
03:10 Access router 1 reboots and traffic to the above-mentioned VLANs goes down.
03:20 The on-call engineer calls the engineer doing the maintenance to inform him of the issue.
03:22 The onsite engineer begins investigating access router 1 over its console cable.
03:29 access router 1 is power cycled
03:30 access router 1 returns to service.
03:45 - 04:25 Access router 1 was down again due to human error. While investigating access router 1's problems, the onsite engineer was using the same console cable he had been using on access router 2. He then worked on access router 1 as if it were access router 2, and this caused the downtime. It took until 04:00 to realise the mistake and a further 25 minutes to undo what had been done. Unfortunately the JunOS rollback command was not used in this case; it would have put the system back online in under 60 seconds. In future, as part of our maintenance policy, we will do a forced rollback in the event of any issues and ensure that all engineering staff are up to date on both JunOS and IOS procedures for rolling back config changes.
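The forced-rollback policy described above can be sketched roughly as follows. This is an illustrative Python model only, not actual tooling and not the JunOS CLI: the `Router` class, the `apply_with_forced_rollback` function, the hostnames and the health check are all hypothetical names invented for the example. It shows two safeguards: confirming which device the session is attached to before changing anything, and reverting to the last committed configuration as soon as a check fails.

```python
class Router:
    """Toy stand-in for a router (hypothetical, for illustration only)."""

    def __init__(self, hostname, config):
        self.hostname = hostname
        self.active = dict(config)   # currently running configuration
        self.history = []            # previously committed configs, newest last

    def commit(self, changes):
        # Save the running config so it can be restored, then apply changes.
        self.history.append(dict(self.active))
        self.active.update(changes)

    def rollback(self):
        # Restore the most recently saved configuration, if there is one.
        if self.history:
            self.active = self.history.pop()


def apply_with_forced_rollback(router, expected_hostname, changes, health_check):
    """Apply changes only to the intended device; revert if the check fails."""
    # Safeguard 1: refuse to touch a device other than the one intended.
    if router.hostname != expected_hostname:
        return "refused: wrong device"
    router.commit(changes)
    # Safeguard 2: roll back immediately rather than debugging in place.
    if not health_check(router):
        router.rollback()
        return "rolled back"
    return "committed"


def uplink_is_up(router):
    # Assumed health check: the (hypothetical) "uplink" setting must be "up".
    return router.active.get("uplink") == "up"


router1 = Router("access-router-1", {"uplink": "up"})
# The session is attached to router 1 while the engineer believes it is router 2.
print(apply_with_forced_rollback(router1, "access-router-2",
                                 {"uplink": "down"}, uplink_is_up))
```

On JunOS itself, the manual equivalent of the second safeguard is `rollback 1` followed by `commit` in configuration mode. `commit confirmed` goes a step further and reverts automatically unless the change is confirmed within a time window.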