September 16, 2014
It was everyone’s commitment to the project that got us to the actual cutover day. But it was our resilience and sense of purpose that helped us persevere through the next three days, which would prove to be the most difficult challenge we had faced to date.
The Friday prior to launch, the database migration was on track and confidence was high that we would be able to access the new Social Intranet production site on Saturday morning. At 8am Saturday it was confirmed: the database migration had completed and we were ready to start our changes. By 7pm, we felt we were at a good stopping point and that the IT team could kick off the search index rebuild, which would take approximately 16 hours. This is where the challenges began.
At 5:30am on Sunday, the search index process appeared to be hung at 97%. Rather than confirm with the vendor or project team, a unilateral decision was made to restart the service and start all over again. This was one of the first mistakes we made as a team, and I can’t help but think that a lack of sleep over the prior three days had something to do with it. Around 11am the search index was back up to 94%, and the vendor confirmed this was typical: the process slows down the closer it gets to finishing, and it isn’t uncommon for several hours to pass between 97%, 98%, 99%… In hindsight, if we had let the process continue, it would have been completed by this time. But we pushed on, realizing we couldn’t change anything at this point and that we were well beyond the point of no return.
By 3pm that Sunday, it was time to get out of the command center, get some fresh air and gain a new perspective on status and morale. It was a beautiful sunny day, and we talked about how we felt and where we were in the upgrade process. Morale was surprisingly high. There was still more to do, but everyone felt we were continuing to make progress and that nothing would prevent us from completing our goal. Most of the old intranet pages had been moved over, all 500+ communities had been updated with the new layout, widgets and template, home page content was being populated, and the training space was all set up and ready to go.
It was 8pm on Sunday night and the APAC offices were coming online. We decided to split the team at that point so that at least some of us could get some rest and be fresh in the morning when users began accessing the site. At that time, a member of the team made a statement that really caught us off guard: “I am seeing users beginning to access the system. Should I bring the other three nodes of the application cluster online?” We all just looked at each other and said, “Why are they down? Yes, bring them online.”
All the work we had been doing over the last 48 hours, including uploading images, formatting spaces and organizing content, had been completed against Node1 of the four-node cluster. So, when the site came up, our beautiful baby turned into an ugly monster that was out of control. The nodes hadn’t been synchronized, which resulted in missing images, broken links and an inconsistent experience. By the time we realized it, user questions had started rolling in from the Australia offices: “Why can’t I access this link?” “Why is the home page missing images?” We fielded as many questions as possible and tried to explain, using this as a test of the experience, and it was immediately clear it wasn’t good.
It was about midnight now, and we knew EMEA would be coming online in a few hours. The experience was terrible, the search index wasn’t complete, and there wasn’t much we could do other than let the process finish, cycle the servers, clear the cache and keep our fingers crossed. The site was a disaster, and we quickly realized we needed a maintenance page that would prevent users from accessing the site. This would also give the team a few hours of sleep and let the index process complete. The UX team said they could modify the entire theme with a maintenance message and would quickly test it. We agreed and got to work.
By 1am, the maintenance theme was live. It wasn’t pretty, but it would do the trick, and we agreed to meet at 5am to reassess where we were. This was a critical point in our go-live preparation. Everything depended on the search index process completing. Everyone was tired and mistakes had been made, but we still believed we could get through this, and the reward of delivering the experience was still in sight.
Have you experienced any particular challenges where you persevered and achieved a positive outcome?