On a recommendation from a fellow quality engineer, I watched the GTAC 2014 keynote by Ankit Mehta on how different engineering and cultural practices help Google move fast and NOT break things (almost). Overall, the goal of Google’s test engineering is to build infrastructure and tooling that enable rapid launches of high-quality, innovative products that delight end users.
It was interesting to learn that Google, too, is somewhat dirty on the inside: it has flaky tests, manual testing in the evening, deadlines slipping from 2 days to 2 months, and production failures that affect a lot of people. I would never have thought that some time ago Gmail was on a weekly release cycle, with code hitting production only 3 weeks after it was committed. You can imagine that fixes took a lot longer too. Of course, this was a problem, and Google has evolved a lot since then.
Below are some points Ankit mentioned that help them release daily (all credit goes to him, as this post is mostly note-taking for myself).
Push on amber
Google pushes to production on amber. It’s not always justified to stop a release because some unimportant tests fail or some minor issues are found. There is a defined set of critical tests that must pass. Failures of other tests are pending review. Flaky tests are quarantined and a bug is created. Critical tests cannot be quarantined. Once bugs are fixed, only the subset of tests related to the updated code is executed, something Google calls “smart regression testing”.
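To make the gating rule concrete for myself, here is a minimal Python sketch. It is my own illustration, not Google’s tooling: `Outcome`, `Verdict` and `release_verdict` are made-up names, and I am assuming that failures of quarantined tests are ignored by the gate because a bug already tracks them.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    GREEN = "green"  # nothing to look at: push
    AMBER = "amber"  # only non-critical failures: push, failures pend review
    RED = "red"      # at least one critical test failed: block the release


@dataclass(frozen=True)
class Outcome:
    """Result of one test in the release run (hypothetical structure)."""
    name: str
    passed: bool
    critical: bool = False
    quarantined: bool = False


def release_verdict(outcomes):
    """Return (verdict, names of the failing tests that drove the verdict)."""
    for o in outcomes:
        if o.critical and o.quarantined:
            # Rule from the talk: critical tests cannot be quarantined.
            raise ValueError(f"critical test {o.name!r} cannot be quarantined")

    critical_failures = [o.name for o in outcomes if o.critical and not o.passed]
    if critical_failures:
        return Verdict.RED, critical_failures

    # Assumption: quarantined failures are skipped, a bug already covers them.
    to_review = [o.name for o in outcomes if not o.passed and not o.quarantined]
    if to_review:
        return Verdict.AMBER, to_review
    return Verdict.GREEN, []


if __name__ == "__main__":
    run = [
        Outcome("login", passed=True, critical=True),
        Outcome("tooltip_position", passed=False),
        Outcome("sync_retry", passed=False, quarantined=True),
    ]
    print(release_verdict(run))
```

The point of the three-way verdict is that a red non-critical test produces work for a reviewer instead of stopping the release.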
Focus on preventing bugs
As I understood it, not all tests are run locally, and branches/mainline can go red (IIRC Google doesn’t use branches extensively and relies on feature switches in mainline instead). Many commits land in mainline each day. Around 6 PM the testing team creates a release candidate, pushes it to staging, and runs a set of manual regression and exploratory tests, which takes 2-6 hours. Usually 3-4 issues are still found despite the thousands of automated tests in place. (Whoa! It’s not surprising that test automation is not that good, but given that it’s Google’s automation, I am still surprised.)
To avoid breaking things, Google relies on several practices.
Deterministic hermetic tests – mocked services and backends. A simple check: the test can run without a network and without access to third-party services. If one component uses 3 backends, then 3 tests are created, each with 1 real backend and the 2 others mocked. Sometimes there are regressions when these don’t work together (bad integration tests? No integration tests?)
High presubmit test coverage and quality requirements.
Speculative rollback fix – when you knit and find a bug, you unknit and do it again the right way. Test regressions in mainline are treated as build blockers. Failed tests == regression, rollback == green build. This way developers are not under the gun to fix things (though I bet it still happens when something is business/marketing critical 🙂 ). By default 4 commits are removed and added back one by one to find the latest green build (isn’t it clear right away what the green build was?)
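Here is how I picture the “remove 4 commits, add them back one by one” step, as a small Python sketch. It is only my reading of the talk: `find_latest_green` and `is_green` are hypothetical names, `is_green` stands in for “build this list of commits and run the tests”, and I assume the breakage was introduced somewhere in the rolled-back window.

```python
def find_latest_green(commits, is_green, window=4):
    """Roll back the last `window` commits, then re-apply them one at a time.

    commits  -- ordered list of commit ids on mainline, oldest first
    is_green -- callable taking a list of commits, True if the build passes
    Returns (kept, culprit): the commits of the latest green build and the
    first commit that turned it red (None if everything re-applied cleanly).
    """
    split = max(len(commits) - window, 0)
    kept = list(commits[:split])
    if not is_green(kept):
        # The regression is older than the window; this sketch gives up here.
        raise RuntimeError("still red after rollback; widen the window")

    for commit in commits[split:]:
        candidate = kept + [commit]
        if not is_green(candidate):
            return kept, commit  # mainline stays at the last green state
        kept = candidate
    return kept, None


if __name__ == "__main__":
    mainline = ["a", "b", "c", "d", "e", "f"]
    # Pretend commit "e" is the one that breaks the tests.
    print(find_latest_green(mainline, lambda cs: "e" not in cs))
```

In this toy version everything after the culprit is dropped as well; whether later commits get re-applied on top of the green build is not something the talk covered, as far as I remember.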
Additionally, there is a regular assessment of manual testing vs automated testing vs where bugs are found vs where code changes most often. Bonus – it’s a nice way to see if there’s a testing ice cream cone (too many UI tests, too few unit tests): you catch NPEs with UI tests.
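Going back to the hermetic tests from the list above, this is roughly what “1 real backend, 2 mocked, no network” could look like in Python with `unittest.mock`. Everything here is invented for illustration: the `Inbox` component, its three backends, and the `forbid_network` helper. The in-memory storage only plays the role of the “real” backend so that the example stays self-contained.

```python
import socket
from unittest import mock


class Inbox:
    """Hypothetical component that depends on three backends."""

    def __init__(self, storage, spam_filter, contacts):
        self.storage = storage
        self.spam_filter = spam_filter
        self.contacts = contacts

    def visible_messages(self, user):
        messages = self.storage.fetch(user)
        clean = [m for m in messages if not self.spam_filter.is_spam(m)]
        return [(self.contacts.display_name(m["from"]), m["subject"]) for m in clean]


class InMemoryStorage:
    """Stands in for the one backend that is exercised for real in this test."""

    def __init__(self, data):
        self._data = data

    def fetch(self, user):
        return list(self._data.get(user, []))


def forbid_network():
    """Make any attempt to open a socket fail loudly while the patch is active."""
    return mock.patch.object(
        socket, "socket",
        side_effect=AssertionError("network access in a hermetic test"),
    )


def test_inbox_with_real_storage_only():
    storage = InMemoryStorage({"ann": [
        {"from": "bob@example.com", "subject": "lunch?"},
        {"from": "x@example.com", "subject": "WIN MONEY"},
    ]})
    spam_filter = mock.Mock()  # mocked backend 1
    spam_filter.is_spam.side_effect = lambda m: "WIN" in m["subject"]
    contacts = mock.Mock()     # mocked backend 2
    contacts.display_name.return_value = "Bob"

    with forbid_network():
        result = Inbox(storage, spam_filter, contacts).visible_messages("ann")

    assert result == [("Bob", "lunch?")]
    contacts.display_name.assert_called_once_with("bob@example.com")
    return result


if __name__ == "__main__":
    test_inbox_with_real_storage_only()
```

Two more tests of the same shape, each with a different backend kept real, would complete the “3 backends, 3 tests” pattern. None of them checks that all three real backends work together, which is exactly the gap I wondered about above.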
Testing pushed upstream
“Spread the love of testing”. Business owners, developers and managers – everyone is involved in test planning and bug bashing.
Fishfooding – an evolution of dogfooding, and more gross 🙂 Everyone is on the bleeding edge – the most used functionality is in use by thousands of employees. It is possible to iterate in 4 hours from design implementation to a bug bash on the internal network.
Product and feature releases are delineated
Release != feature release. Feature flags are used to do feature releases. There is no rollback if things go wrong; the flag is simply turned off. Features don’t have hard dependencies, and releases keep rolling.
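A tiny Python sketch of the idea, for my own reference. The `FeatureFlags` class and the `new_compose` flag are hypothetical, and a real setup would presumably read flags from some configuration service instead of memory.

```python
class FeatureFlags:
    """In-memory flag store (an assumption; stands in for a config service)."""

    def __init__(self, **flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        # Unknown flags default to off, so shipped-but-unreleased code stays dark.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def render_compose_box(flags):
    """Both code paths ship in the same binary; the flag picks one at runtime."""
    if flags.is_enabled("new_compose"):
        return "new compose UI"
    return "classic compose UI"


if __name__ == "__main__":
    flags = FeatureFlags(new_compose=True)
    print(render_compose_box(flags))
    flags.set("new_compose", False)  # the "rollback": no redeploy needed
    print(render_compose_box(flags))
```

The binary release and the feature release are separate events here: the code was already in production, and turning the feature off is a configuration change.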
Flaky test detector. Threshold – 3 false positives out of 100 runs. A bug is created and the test is quarantined. Critical tests cannot be quarantined.
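A rough Python sketch of such a detector, under my own assumptions: the test is re-run against unchanged code, “3 out of 100” means at least 3 failures, and a test that fails every single time counts as a real failure, not a flaky one. The names `detect_flaky` and `triage` are invented.

```python
def detect_flaky(run_test, runs=100, threshold=3):
    """Run `run_test` repeatedly on unchanged code and count failures.

    run_test -- callable returning True on pass, False on failure
    Returns (is_flaky, failures).
    """
    outcomes = [bool(run_test()) for _ in range(runs)]
    failures = outcomes.count(False)
    # Assumption: failing on every run points to a real bug, not flakiness.
    consistently_failing = failures == runs
    return failures >= threshold and not consistently_failing, failures


def triage(is_flaky, critical):
    """Decide what to do with a test after detection."""
    if not is_flaky:
        return "keep"
    if critical:
        # Rule from the talk: critical tests cannot be quarantined.
        return "file bug, fix in place"
    return "file bug, quarantine"


if __name__ == "__main__":
    # Scripted outcomes instead of a real test: 3 failures, then 97 passes.
    scripted = iter([False] * 3 + [True] * 97)
    flaky, failures = detect_flaky(lambda: next(scripted))
    print(flaky, failures, triage(flaky, critical=False))
```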
Balance release velocity – users don’t want to update apps often. You cannot roll back, so you need a higher quality bar.
Smaller releases are verified by business owners. Bigger ones are handed to business owners for review once QA has done some searching and the product is more stable.
Kill old feature switches, otherwise there are too many. Do big refactorings behind feature switches. Post-mortems on bigger feature releases and screw-ups. Trusted testers program. Prioritize releases – smaller ones get less attention and PO
Below are two screens from the video. The first shows how Google evolved in 3 years.
The second one shows a summary of practices. Unfortunately, Ankit spent more time on the first ones and rushed through the last ones, so nothing new or interesting was mentioned there.
- Gmail was on a weekly release cadence, but due to testing and regressions, code reached production 3 weeks after the final release commit.
- Once, adding a feedback form to Gmail took 2 months instead of 2 days because of refactoring and the agreed testing requirements (unit, integration, UI tests), which needed additional work.
- At some point Gmail tests were so flaky that they stopped adding them for a while.
- One team had 1000 UI tests, but only 70 of them were working and being executed 🙂
- Once, a testing network config was pushed to live; it turned off several backend services and blacked out Gmail and G+ for 25 minutes.