Open source and the auto software assembly line

There’s been a lot of coverage about the Toyota recalls and the role of software in automobiles. One of the most interesting headlines, in my opinion, was this one from Techdirt, which echoes a call from the Software Freedom Law Center for automakers like Toyota to open source their software code.

These both demonstrate a lack of understanding of the automotive software assembly line. Likewise, this CNET article cites a vehicle tester at Edmunds.com who equates auto software to simple calculators. Again, that is not the view of an engineer or programmer with a clear understanding of what goes on under the hood.

Two things stand out here:

First, auto software is not a standalone package. It comprises up to 100 million lines of code that come from multiple suppliers. In the Toyota case, it’s not clear whether the faulty brake software in the Prius was coded by Toyota engineers or by a supplier down the chain. And even if each individual code base were 100% bug free, there’s no guarantee that when the different pieces are glued together into a vehicle, there won’t be ‘glitches.’ To borrow from the Edmunds.com tester, it’s more accurate to say that auto software is like a set of different calculators duct-taped together.

Secondly, the Software Freedom Law Center’s call for Toyota to open source its code isn’t a realistic step. Judging by its name, the center promotes free software, and as such its members are proponents of Linus’ Law: given enough eyeballs, all bugs are shallow, so code becomes better and more secure. Logically, this makes sense. But just how many eyeballs are enough? At a rough estimate, 50 lines of code are equal to one printed page. That means 100 million lines of code run to about 2,000,000 pages.
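As a back-of-the-envelope check, here is a minimal C sketch of that page estimate. The function name `pages_for_lines` is my own invention for illustration, and the 50-lines-per-page figure is only the rough estimate used above.

```c
#include <assert.h>

/* Estimate how many printed pages a code base would fill, rounding up.
 * lines_per_page is a rough assumption (the text uses 50). */
long long pages_for_lines(long long lines_of_code, long long lines_per_page)
{
    assert(lines_of_code >= 0 && lines_per_page > 0);
    return (lines_of_code + lines_per_page - 1) / lines_per_page;
}
```

Calling `pages_for_lines(100000000LL, 50)` reproduces the 2,000,000-page figure, which is the scale of review that "enough eyeballs" would have to cover.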

The ingenuity and excellence of open source software development has created a wealth of excellent software. But as detailed in the 2009 Scan Open Source Report, there is a lot more to gain by using static analysis to improve the quality, security and integrity of software. Shawn Herman, a Microsoft programmer, wrote a very thoughtful piece on this subject looking at where human code review (eyeballs) stops and where automated analysis (static analysis and dynamic analysis) picks up. He explains that “a static analysis tool looks at source code the same way a compiler does, and so it has much more knowledge of what will happen than a human does.”
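To make the eyeballs-versus-tools point concrete, here is a small hypothetical C sketch of a defect that is easy for a human reviewer to skim past and that path-following tools typically catch: a resource leaked on an error path. The function `header_sum` is an illustration I made up, not code from the Scan report or from the article quoted above. The version shown is the corrected one, with the leaking variant described in a comment.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns the sum of the first n bytes of the file, or -1 on any failure. */
long header_sum(const char *path, size_t n)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    unsigned char *buf = malloc(n > 0 ? n : 1);
    if (buf == NULL) {
        /* A bare "return -1;" here would leak fp. The leak only shows up
         * when malloc fails, so ordinary testing rarely exercises it. */
        fclose(fp);
        return -1;
    }

    long sum = -1;
    if (fread(buf, 1, n, fp) == n) {
        sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    }

    /* Single cleanup point for both the success and short-read cases. */
    free(buf);
    fclose(fp);
    return sum;
}
```

The leaking variant compiles cleanly and passes every test that does not simulate an allocation failure. That is why a tool that follows each branch the way a compiler does has an advantage here.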

Instead of just open sourcing the software, the real effort needs to focus on Deming-like quality control for the automotive software assembly line. At a minimum, we believe that automated analysis needs to be part of this new software assembly line, preferably as early in the development process as possible. Right now vehicles are rolling consumer electronics devices, but we expect that they will go through the same production maturity that we’ve seen in avionics and will be developed under mandated safety and quality procedures.

C/C++’s enduring popularity

Last time I promised some more thoughts on the paper that we recently published in CACM. One area that some readers noticed is that the article spends a lot of time on issues that are specific to C/C++. Let’s dive into a question I sometimes hear: why don’t companies just write software in a better language where these problems can’t happen?

Part of the answer is that many companies have already switched. But many more have stuck with C/C++ (see the Tiobe data on language popularity for one measure). Here are a few of the many reasons we have observed among our customers:

  • Legacy code. Consider the ton of code written in C/C++ out there. Coverity Scan analyzes over 300 of the most popular open source projects written in C/C++, including the Linux kernel and Firefox browser. There are about 60 million lines of code (MLOC) in these projects combined. In the commercial world, some of our larger customers have over 30 MLOC in a single project. To give you some perspective, 1 MLOC amounts to a stack of printed paper about 6 feet high. There would need to be a really, really good reason to consider rewriting anything of that magnitude.
  • Risk aversion. Code “hardens” as it ages and enters repeated production use. Large code bases are tested over a period of years, if not decades. The reality is that all production software, no matter what language it’s written in, needs to be tested for functionality, stability, and performance reasons. Using a great language might help with stability to some degree, but rarely functionality or performance. Rewriting a (sub)system reaps certain benefits, but also introduces risks to existing functionality, stability, and performance. Sometimes, the risk-reward trade-off is worthwhile, but often it is not.
  • Embedded software. Think mobile phones, cable set-top boxes, internet routers, wireless base stations, firewalls, network-attached storage systems, weapons control systems, and medical devices. They aren’t running on commodity PC hardware, processors, and operating systems. The hardware is often chosen for power-performance and cost reasons, and not infrequently there are large chunks of kernel-mode code. Sometimes these systems start from the Linux or FreeBSD kernel, or perhaps a commercial offering like Wind River VxWorks. Specialized hardware devices are common in this world, and low-level direct memory access is frequently necessary. Much of this software is written in C/C++.
  • Performance and control. C/C++ compilers generally produce very efficient object code. Still, the choice of algorithm, and careful tuning and performance evaluation can trump language choice. However, in my experience, C/C++ provides a degree of flexibility and control that can sometimes help in optimizing the performance of a system in a way that many other languages don’t allow. Of course, the same flexibility and control can lead to bugs that are hard to diagnose.

When I was a researcher back at Stanford, my favorite programming language was O’Caml. I felt wonderfully productive writing code in a language that had type safety, type inference, pattern matching, and purely functional data structures (still a novelty at the time). We initially started off writing most of our core functionality in O’Caml, but eventually, had to abandon that implementation in favor of one written in C++. It was just too hard to hire programmers who knew O’Caml, and we also found that certain features that helped eliminate defects also reduced our control of the program’s performance, scalability behavior, and flexibility. Writing in C++ was often more verbose and less elegant, but it was possible to get to the result needed for the business.

At the end of the day, the features of a programming language are only one aspect of what makes it suitable or not for a particular job.

Digital Deming for the new software assembly line

It is now official that the problems with the 2010 Toyota Hybrid Prius cars are directly related to defects in the braking control software.

As I was reading various news items, the part that seemed interesting to me is that there is an Office of Defect Investigation under the National Highway Traffic Safety Administration. Modern automobiles have many millions of lines of code in embedded software running inside the car, roughly an order of magnitude more than an F-35 or a Boeing 787 Dreamliner, according to Robert N. Charette’s article in IEEE Spectrum. From connecting and controlling mechanics and ABS engagement systems to traditional electronics such as navigation, audio, heating and cooling, software affects various parts of the automobile. Modern cars have brake-by-wire, adaptive cruise control, active steering, tire sensors and many other parts all controlled by embedded systems and software. I imagine that the overwhelming number of lines of code in today’s vehicles brings new meaning to the word “defect” for the Office of Defect Investigation.

Robert Mitchell, in his blog post, points out the real challenge for auto manufacturers like Toyota. What started as a manufacturing assembly-line process allowing many vehicles to be built in a predictable way evolved into a Deming process driving quality procedures for predictable, repeatable, scalable manufacturing. It has now reached a point where well-built vehicles are failing miserably because there is no Deming-like rigor on quality in the new software assembly line. Vehicles at any speed, just like computer systems at wire speeds, are unsafe without an engine of high-integrity software driving them.

So, it should follow that in the new software assembly line we too have an Office of Defect Investigation – only, build it early into the development cycle, and not after products have gone into the field. Once software has rolled into the field, it’s too late – and the results can be damaging (as Toyota knows only too well). In particular, this office may want to focus on three “departments”:

  • Architecture Analysis as we design the software.
  • Static Analysis as we write the code.
  • Dynamic Analysis as we detect and fix functional issues.

A while back, there was an urban legend making the rounds in the news cycle. According to the most popular version, Bill Gates allegedly stated, “If GM had kept up with the technology like the computer industry has, we would all be driving $25.00 cars that got 1,000 miles to the gallon.” In this story, GM responded with a smart press release asserting that if it developed products the way Microsoft did, its cars would have all the problems of Microsoft’s software. The story has been shown to be untrue, but in the context of the Toyota recall, the debate it describes still stands on its own.

As software professionals, we cannot be on the receiving end of such debates. We need our own Offices of Defect Investigation and an emphasis on production quality early and often to ensure software integrity in every piece of software that we ship.

A few billion lines of code later

Coverity has a relatively unique back story as a company.  We were largely bootstrapped for the first four years, and the founding team had no business experience.  How did we do it?  Some of the story is told in this paper just published in the Communications of the ACM.

The paper discusses some of the technical and social challenges that we faced when bringing some cutting edge technology to the brutal commercial world.  Most of this was written by Dawson based on feedback from talking with lots of people at Coverity, so credit for the style of the paper goes to him.

I was surprised to find this comment on the article from the inventor of the C++ programming language:

I hugely enjoyed this article. I can’t think when I last read an article that made so many important points. Thanks for not sugar-coating the description of real-world problems and real solutions. And thanks for not drowning those points in jargon.

— Bjarne Stroustrup, January 29, 2010

Next time, I’ll dive into some of the topics covered in this article for those who want a bit more information.

When is -652 days later?

Imagine this – You are looking at your monitor screen, observing data and evaluating the information. Suddenly a prompt that warns you about your password expiring in ‘-652 days’ pops up. On the surface, it is trivial, but it has interrupted your work and you cannot proceed unless you press the ‘OK’ button. Now imagine the system you are using is a Fetal Heart Rate Monitor…

Screenshot showing 'Your password expires in -652 days' alert message.

Photo Credit http://thedailywtf.com/Articles/Divisive-Placeholder-.aspx#Pic4

The issue here is a software bug, probably due to the misuse of negative integers. The code needs to be written so that integers and function return values that can potentially be negative are checked before being used.  The impact that such a defect can have is not limited to the cosmetic error in the prompt. Negative integer misuses can cause memory corruption, process crashes, infinite loops, integer overflows, and security vulnerabilities.
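I don’t know what the actual code behind that prompt looks like, so the following is only a guess at how a message like it could arise, written as a minimal C sketch. The names `days_until_expiry` and `format_expiry_prompt` are hypothetical. The point is simply that a difference of two timestamps goes negative once the expiry date has passed, and the caller has to check for that before formatting it.

```c
#include <assert.h>
#include <stdio.h>

#define SECONDS_PER_DAY 86400LL

/* Whole days remaining. The result is negative once the expiry time has
 * passed (C integer division truncates toward zero). */
long long days_until_expiry(long long now_secs, long long expiry_secs)
{
    return (expiry_secs - now_secs) / SECONDS_PER_DAY;
}

/* Writes the prompt into buf and returns snprintf's result.
 * The negative case is handled explicitly instead of being printed raw. */
int format_expiry_prompt(char *buf, size_t size, long long days)
{
    assert(buf != NULL && size > 0);
    if (days < 0)
        return snprintf(buf, size, "Your password expired %lld days ago", -days);
    return snprintf(buf, size, "Your password expires in %lld days", days);
}
```

Without the `days < 0` branch, a password that lapsed 652 days ago would yield exactly the kind of "expires in -652 days" text shown in the screenshot.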

Good QA techniques and dynamic analysis might help you to catch a limited portion of such bugs. However, a comprehensive analysis of the source code that walks through every possible execution path and an infinite number of inputs and outputs is only possible through solid static analysis.  An inter-procedural static analysis checker would report errors when function return values that can potentially be negative are not checked before being used or typecast to an unsigned integer.
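Here is a minimal C sketch of the pattern such a checker looks for. Everything in it (`find_index`, `lookup`, the key/value arrays) is a made-up illustration and not the output or internals of any particular tool. The unchecked version is left as a comment so that the block stays safe to compile and run.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of key in keys, or -1 if it is absent. */
int find_index(const int *keys, int count, int key)
{
    for (int i = 0; i < count; i++)
        if (keys[i] == key)
            return i;
    return -1;
}

/* Defect pattern (deliberately not compiled):
 *
 *     size_t idx = find_index(keys, count, key);  -1 converts to SIZE_MAX
 *     *out = values[idx];                         out-of-bounds read
 *
 * The return value can be negative, yet it is converted to an unsigned
 * type and used as an index without being checked. */

/* Checked version: returns 1 and stores the value on success, 0 otherwise. */
int lookup(const int *keys, const int *values, int count, int key, int *out)
{
    assert(out != NULL);
    int idx = find_index(keys, count, key);
    if (idx < 0)
        return 0;          /* negative result handled before any use */
    *out = values[idx];
    return 1;
}
```

The defect and the function that produces the negative value usually live in different files, which is why the analysis needs to be inter-procedural to connect them.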

As software developers, we sometimes fail to see the deep impact of a defect. What if the negative integer in the password-expiry prompt on the fetal heart rate monitor resulted in a process crash? That would be unacceptable. It is imperative that we rigorously analyze our code, identify such defects and eliminate the potential for such failures.

Why Go Agile?

I joined one of our recent webinars on agile development and was quite surprised to learn that half of the attendees needed basic education on agile development.

Coverity has seen companies transition to agile for many reasons, and there’s no shortage of opinions advocating pulling the plug on the waterfall process – most of which sum up to some simple facts: it’s cheaper, faster, has more flexible processes, responds better to changes in market demands and, though not perfect, agile environments can bring a certain honesty to team dynamics by exposing who’s behind contributions and progress.

In general, I think agile is a way of life. I think every advanced organization is moving to it, if it hasn’t adopted it already. Quality is always a challenge, but with agile you don’t have excess time in which to produce a high-quality product. In many cases, agile developers are fully accountable for designing and writing software and for ensuring it meets or exceeds quality, performance, safety, and compliance thresholds – all while delivering it in a fraction of the time.

So, why am I a fan of agile? I’m an advocate of any methodology that empowers engineers to work toward higher standards of software integrity. Agile and test-driven methodologies have found a way to do that by distributing ownership and responsibility for quality. And with so many ways for static analysis to boost the type of automated testing this requires, agile is a better formula for better code that can keep up with shorter scrum cycles and frequently produce “potentially shippable” products.

There’s still a sizeable need for more agile education, however. Starting this week, in conjunction with our ALM partners (AccuRev, AnthillPro and Rally Software), we’re going to bring our free agile seminar series to the public. If you’re in any of these cities, reserve your spot today:

  • February 4 – Millbrae, Calif.
  • February 9 – San Diego, Calif.
  • February 11 – Boston, Mass.
  • February 24 – Atlanta, Ga.
  • March 17 – Seattle, Wash.
  • March 18 – Dallas, Texas
  • April 6 – Colorado Springs, Colo.
  • April 7 – Minneapolis, Minn.

If you can’t join us, you can always watch our pre-recorded webcasts on demand.

Use-after-free gets Google out of China

Google recently announced that it may cease operations in China, partly due to a large scale cyberattack that it believes originated in that country.  According to initial reports, one of the security vulnerabilities involves Microsoft Internet Explorer, including the latest versions of IE8 on the newly released Windows 7 operating system.

The exploit of the vulnerability involved getting users to click a link to a malicious web site in a specially crafted email, so beware of links in email yet again.  But, as a user, why should you expect that a web site can take over your machine? For many users, that’s a bit like having your house broken into because you watched a particular channel on TV.  You shouldn’t have to expect that.  Software should not be so fragile.  But because IE has a software bug, the browser itself is vulnerable, and it provides a hidden back door whose lock the malicious web site can pick.

So what’s the root cause?  According to Microsoft:

The vulnerability exists as an invalid pointer reference within Internet Explorer. It is possible under certain conditions for the invalid pointer to be accessed after an object is deleted. In a specially-crafted attack, in attempting to access a freed object, Internet Explorer can be caused to allow remote code execution.

Without the source code, it’s hard to tell what the vulnerability looks like at a technical level, but the description sounds a lot like what we call a “Use After Free” in our source code scanner.  It’s a vanilla software defect that usually causes applications to crash.  Some security auditors might even see this as a software defect instead of a security hole.  But as this defect shows, almost any software defect can be a security hole, if it can be tickled in just the right way.
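For readers who haven’t run into one, here is a generic C sketch of the use-after-free pattern and one common mitigation. It has nothing to do with the actual Internet Explorer code, which I haven’t seen. The `Widget` type and its functions are hypothetical, and the defective sequence appears only in a comment because executing it would be undefined behavior.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int id;
} Widget;

/* Allocates a widget; returns NULL if allocation fails. */
Widget *widget_create(int id)
{
    Widget *w = malloc(sizeof *w);
    if (w != NULL)
        w->id = id;
    return w;
}

/* Defect pattern (deliberately not compiled):
 *
 *     Widget *w = widget_create(7);
 *     free(w);
 *     return w->id;     use after free: w still points at released memory
 */

/* Frees *slot and clears it so this handle cannot be dereferenced again. */
void widget_destroy(Widget **slot)
{
    assert(slot != NULL);
    free(*slot);
    *slot = NULL;
}

/* Returns the id, or -1 if the widget has already been destroyed. */
int widget_id(const Widget *w)
{
    return w != NULL ? w->id : -1;
}
```

Clearing the pointer only protects that one handle. Any other copy of the pointer elsewhere in the program is still dangling, which is part of why these defects are hard to find by inspection in a large code base.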

I’m an optimist, so I hope that incidents like this will raise the profile of this problem.  But I’ve also seen how there’s a tendency to invest in building out fantastically expensive organizations responsible for responding to attacks, and very little in going after the root cause.  No matter how effectively and quickly we respond to attacks, we will not fundamentally improve the situation until we harden our software infrastructure by addressing the root cause of buggy software.

Lessons from 2009 Software Failures

As we start a new year, I can’t help but wonder if we are doomed to repeat the same software failure headlines of past years.

Whether it’s a security hole that allowed militants to use cheap off-the-shelf software to hack into Predator drones; a glitch that grounded and delayed air travel during the busiest travel week of the year; or a software flaw that caused a very costly product delay – the fact is, software failure continues despite the industry’s best efforts.

Why is software increasingly prone to spectacular failure even as it becomes more and more critical to the core infrastructure that we all expect to “just work”? For one thing, code bases today are much larger than they were just a few years ago. To illustrate, the Windows code base has grown from 6 million lines of code (MLOC) in 1993 with NT 3.1 to 50 MLOC with Vista. All of this code extends software capabilities, but at the same time increases complexity.

In addition, software has become much more mission-critical, pervading many aspects of our personal and professional lives. It not only controls familiar PCs and mobile phones, but also plays an invisible role in making the modern world run, in transit systems, communications infrastructure, internet services and defense systems.

As we depend more and more on software to control and manage mission-critical systems, we open ourselves up to huge risks when defects occur. Can we get a handle on software bugs? Is software failure a process, people or technology issue?

At Coverity, we believe software will continue to be plagued with high-cost and high-risk failures unless there is a fundamental change in the software development lifecycle – that is, defects need to be attacked and remedied early and often.

Which software failures come to mind most for you? Which ones do you think could have been avoided?