
Using static analysis to audit C/C++ codebases for GDPR compliance

The EU's General Data Protection Regulation (GDPR) was introduced to enshrine in law a set of principles intended to protect the privacy of EU citizens. Infringements of certain GDPR rules by companies or bodies can carry very heavy financial penalties of up to €20 million or 4% of annual turnover, whichever is higher. Basically, significant stuff, and most definitely something to care about. According to an article at ITProPortal, only a tiny 0.25% of infractions resulted in enforcement penalties, but GDPR compliance also requires companies to report breaches, and over 11 thousand reports were raised. The reasons for such a small percentage of fines could be many, but the point is that the potential for massive fines seems to have made sure companies take their obligations seriously.

Among the varied systems and processes that therefore need to be checked are the applications that are central to many organisations' day-to-day activities. Some of these systems may be legacy ones in maintenance mode only; others may be actively developed. In either case, thoroughly exposing all the avenues of execution within an application's codebase is one of the acknowledged difficulties of unearthing flaws, and the same issue applies to finding paths of execution along which GDPR-violating data leaks may occur. Take, for example, systems that have access to private data and also have connections to other systems, where there's a need to check that only certain endpoints are connected. Quite often the points where data is accessed, for example databases and files, can be identified reasonably easily, as can the points of exposure or connection to other systems. The difficulty is making sure that these two sets of points are not connected in unexpected ways. Simply having a large body of code through which these endpoints (known technically as sources and sinks, respectively) may connect makes it nigh on impossible to check with good confidence, and to complicate matters, the data along those myriad execution paths may be passed around transitively through variable and pointer assignment and aliasing.

Take, for example, the following simple program:
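The sketch below gives a flavour of it. The routine names follow those discussed in this post, but the exact signatures and near-empty stub bodies are illustrative assumptions rather than the original listing.

/* A sketch of a program that fetches user data, transforms it, and
   passes it on to external tools and systems via near-empty stubs.
   Signatures and stub bodies are illustrative assumptions. */
#include <stdio.h>

/* Fetch a record for the named user (stub). */
static char *getUser(const char *name) {
    static char record[128];
    snprintf(record, sizeof record, "user:%s", name);
    return record;
}

/* Extract personal details from a user record (stub). */
static char *getInfo(const char *user) {
    static char info[128];
    snprintf(info, sizeof info, "info:%s", user);
    return info;
}

/* Combine fields into a complete data set (stub). */
static char *getCompleteData(const char *info) {
    static char data[256];
    snprintf(data, sizeof data, "complete:%s", info);
    return data;
}

/* Forward data to an external email service (stub). */
static void sendToEmail(const char *str) {
    printf("email: %s\n", str);
}

/* Store data in an internal database (stub). */
static void sendToDatabase(const char *str) {
    printf("database: %s\n", str);
}

int main(void) {
    char *user = getUser("alice");
    char *info = getInfo(user);
    char *data = getCompleteData(info);

    /* The private data is passed transitively to both an internal
       endpoint and an external one. */
    sendToDatabase(data);
    sendToEmail(data);
    return 0;
}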

The full program is just approximately 80 lines of code: a very short, almost pseudocode implementation of a program that retrieves data (via getUser()), potentially transforms it somehow (getInfo(), getCompleteData()), and then “connects” with external tools (sendToEmail()) and systems (sendToDatabase()). For the sake of brevity, the implementations of these routines are almost empty stubs; there is no actual code implementing detailed behaviour. Imagine how much more difficult it would be, whether manually or through the use of automated assistance in the form of static analysis, to determine that certain user data is not exposed or forwarded to externalities such as email.

Enter CodeSonar, our C/C++ static analysis tool, which rather nicely has support for modelling the sources and sinks within a program, as well as the routines that indicate a particular function can be relied on to sanitise the flow of data between a source and a sink.

To harness CodeSonar's abilities, we need to define models that label routines as sources, sinks, or sanitisers. This is done in the form of C-like function implementations. For example, here is one such model for getInfo():
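A sketch of what that model might look like is shown below. The prototype of csonar_taint_source_any() and the exact way the taint is attached to the buffer are assumptions made for illustration, based on the description that follows.

/* Model for getInfo(), kept in a separate source file supplied to the
   analysis. The prototype of csonar_taint_source_any() below is an
   assumption made for illustration. */
char *getInfo(const char *user);       /* the real routine */
char *csonar_taint_source_any(void);   /* assumed prototype */

char *csonar_replace_getInfo(const char *user) {
    /* Call the real getInfo() so its behaviour is still analysed. */
    char *info = getInfo(user);

    /* Mark the returned buffer as a source of sensitive data
       (assumed usage: the tainted value returned by the marker is
       assigned to the buffer pointer). */
    info = csonar_taint_source_any();

    return info;
}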

Here, we essentially create an overload by defining a replacement for getInfo() in a separate source file, prefixing the getInfo() prototype with csonar_replace_. Note how the first line of the implementation calls the real getInfo(), so the correct program behaviour is still visible during the later analysis. All we had to do in this case was mark the data that we consider sensitive, by assigning the buffer returned from getInfo() the value of csonar_taint_source_any(). There are several types of source marker we could have used, including file, environment, and network, among others. So that's the source of data we don't want misappropriated marked up, but CodeSonar also needs to know the points that should not touch the data. Here's a model for sendToEmail() that does just that:
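A sketch of that sink model, in the same csonar_replace_ style, is shown below. The exact CodeSonar primitive used to flag the sink isn't named here, so mark_unwanted_exposure() is a hypothetical stand-in for it, not the real API name.

/* Model for sendToEmail(). mark_unwanted_exposure() is a hypothetical
   stand-in for the CodeSonar primitive that declares the "str"
   parameter a sink that tainted data must not reach. */
void sendToEmail(const char *str);           /* the real routine */
void mark_unwanted_exposure(const char *p);  /* hypothetical stand-in */

void csonar_replace_sendToEmail(const char *str) {
    /* Flag the parameter as data we do not want exposed here. */
    mark_unwanted_exposure(str);

    /* Call the real routine so its behaviour is still analysed. */
    sendToEmail(str);
}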

Again a rudimentary implementation, for the sake of space and ease of appreciation; here we mark the “str” parameter passed to sendToEmail() as something that we don't want exposed when the program executes sendToEmail(). With just these two models in place, running the CodeSonar analysis will have it examine all the execution paths in the codebase, and wherever data that has been marked as tainted (as it will be on execution paths where getInfo() has been called) is transitively propagated into a call to sendToEmail(), an “Unexpected Data Leak” warning will be flagged for the reviewer's consideration:

There are three issues flagged because additional models for other sources and sinks were also active. Here's the actual failure report in detail:

Missing from this example are suitable sanitising functions, whose purpose would be to perform checks on the data to make sure nothing was leaking. As previously mentioned, in that case we could have included a model to mark the sanitisation function appropriately, on the assumption that it does what is expected. With such models in place, our analysis would then find all execution paths between source and sink where no acceptable sanitisation occurred.
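Such a model could follow the same pattern as the others. Both sanitiseUserData() and mark_as_sanitised() below are hypothetical names used purely for illustration; they are not routines from the example or from the CodeSonar API.

/* Hypothetical sanitiser model: sanitiseUserData() and
   mark_as_sanitised() are illustrative stand-ins only. */
char *sanitiseUserData(const char *data);   /* hypothetical routine */
void mark_as_sanitised(const char *p);      /* hypothetical stand-in */

char *csonar_replace_sanitiseUserData(const char *data) {
    /* Call the real routine so its behaviour is still analysed. */
    char *cleaned = sanitiseUserData(data);

    /* Tell the analysis that data flowing through this routine is
       considered safe, so source-to-sink paths passing through it are
       not flagged. */
    mark_as_sanitised(cleaned);

    return cleaned;
}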
