How I Detected Sensitive Data Leaks, Such as Log Leaks, in Open Source Projects Using Piiano Flows

Guy Feigenblat

Director of AI, PhD

December 22, 2023

Introduction

I’ve been working for high-tech companies for two decades. It's no secret that organizations put significant focus on perimeter security. However, they may have inadvertently left vulnerabilities open within their web application code. Duolingo's recent data leak incident really scared me. It serves as a stark reminder that, despite all these safeguards, application data security is far from bulletproof.

These sensitive data leaks often happen due to casual, honest coding mistakes, which are very common yet very difficult to detect - for example, log leaks. My current focus is developing a proactive, automated way to detect these log leaks before they reach production. To figure out how common this problem is, and whether the solution I’m working on is effective, I ran the new scanner we’ve developed against a few chosen open source projects to see whether it could find leaks. This article describes how I did it and what I found.

But first, what are application data leaks?

Application data leaks occur when sensitive or confidential information is unintentionally exposed or transmitted by application software.

These leaks can occur for various reasons, including programming errors. Such leaks can involve a wide range of data, such as personal user information, financial records, credit card holder information, or any other sensitive data that the application processes or stores. 

Application data leaks pose a significant risk to individuals and organizations as they can lead to privacy violations, financial losses, and reputational damage. Protecting against these leaks requires rigorous security measures, regular and manual code audits, and proactive monitoring to identify and mitigate vulnerabilities before they lead to data exposure.

Why are application data leaks hard to detect, yet very common?

Detecting data leaks, particularly through logs and third-party APIs, can be challenging due to several reasons:

  • Log leaks often involve large volumes of unstructured data, making it difficult to distinguish between normal and potentially sensitive information. Identifying patterns or anomalies within this data can be a complex and time-consuming process. Also, these leaks look like expected application behavior. They’re not like a data breach, where you see abnormal activity such as someone downloading GBs of your data.
  • Third-party APIs can introduce additional layers of complexity because organizations often lack full control over these external services. This lack of control makes it harder to monitor and secure the data flow. Moreover, the APIs may be used for legitimate purposes, making it tricky to differentiate between authorized and unauthorized data transfers. Also, the use of third-party APIs may involve additional stakeholders and organizations, further complicating the detection process. 

These challenges, combined with the growing complexity of modern software ecosystems, make data leaks through logs and third-party APIs all too common and a persistent concern for cybersecurity professionals. 

During a recent conversation with a colleague, they revealed that their organization endured a staggering 50 data leak incidents annually. Each of these incidents triggered a high-pressure response, with multiple teams collaborating in a war room to rectify the situation swiftly. To tackle this issue, they implemented a dynamic tool designed to sample a portion of the data entering their pipelines and scan for sensitive information, such as personally identifiable information (PII) or credit card numbers. Despite their efforts, they remained aware of potential blind spots and understood that by the time they detected an issue with this limited sampling, it was often too late to prevent harm. 
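To show why sampling leaves blind spots, here is a minimal sketch of what such a dynamic check could look like. It is not my colleague's actual tool: the class name, the method names, and the deliberately simple regular expressions are all assumptions made for illustration. It samples a fraction of records, then flags those containing an email address or a digit run that passes the Luhn checksum used by card numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a sampling-based PII detector; not any specific product.
class SampledPiiScanner {
    // Deliberately simple patterns; real detectors use far more robust rules.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    // 13 to 19 digits, optionally separated by single spaces or dashes.
    private static final Pattern CARD_CANDIDATE =
            Pattern.compile("\\b\\d(?:[ -]?\\d){12,18}\\b");

    // Luhn checksum, used to reduce false positives on long digit runs.
    static boolean passesLuhn(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Returns the kinds of sensitive data found in one record.
    static List<String> findings(String record) {
        List<String> found = new ArrayList<>();
        if (EMAIL.matcher(record).find()) found.add("email");
        Matcher m = CARD_CANDIDATE.matcher(record);
        while (m.find()) {
            String digits = m.group().replaceAll("[ -]", "");
            if (passesLuhn(digits)) {
                found.add("card-number");
                break;
            }
        }
        return found;
    }

    // Inspects only a random fraction of records; anything not sampled is never checked.
    static List<String> scanSample(List<String> records, double sampleRate, Random rng) {
        List<String> flagged = new ArrayList<>();
        for (String r : records) {
            if (rng.nextDouble() < sampleRate && !findings(r).isEmpty()) flagged.add(r);
        }
        return flagged;
    }
}
```

With a sample rate of 0.01, roughly 99 out of 100 leaking records pass through unseen, and the ones that are caught have already reached the pipeline. That is the limitation my colleague described.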

Exploring Solutions - Application Data Leak Solutions

Solutions are starting to emerge that make it easier for companies to move away from reactive approaches to identifying data leaks. These tools scan a codebase to identify potential issues with stored, incoming, and outgoing data and log entries.

I used Piiano Flows, which offers free scans for projects in a range of hosted Git services, such as GitHub and Bitbucket. It can also scan private repositories on GitHub. A commercial version is available for scanning projects in your infrastructure using a CLI. It's currently limited to scanning Java code, but support for Go and Ruby is in beta.

I thought it would be an interesting exercise to see what issues this tool could tell us about some public open source projects.

To get started, I visited scanner.piiano.io where I took the option to sign up with my Google account. I could’ve signed up with a GitHub account or simply used my email address. 

When the scanner dashboard opens, you see that Piiano has already loaded a scan for Shopizer, an open source project for headless commerce that can be used to create online stores, marketplaces, product listings, B2B applications, transactional portals, and the like.

Screenshot of the Piiano Flows scanner dashboard

Adding projects is easy; select Add Project and provide your project’s URL and a custom name, if you wish.

How to add a project on Piiano Flows

You can also specify a directory within the repository to scan, a useful option if you have a monorepo. Then you simply select scan to get things going. If you’re scanning a private repo on GitHub, you’re prompted for your credentials before the scan starts. 

A scan can take a few minutes to complete, depending on the size of the repository.

So what do you get as the result of a scan? To find out, select the project name or, for projects you scan, View Report from the icon in the Actions column.

How to view reports after a scan in Piiano Flows

The report has seven sections:

  • Dashboard, a summary of the scan findings.
  • Storage, details of sensitive data types stored by your application.
  • Log Leaks, details of code that writes sensitive data to external logs.
  • Outbound, details of third-party API calls that access sensitive data.
  • Inbound, details of the declaration (class member) and use of sensitive data in your code.
  • Report, a formatted version of the report that you can download as a PDF.
  • Exclusions, details of any files not scanned.

The overview of a Shopizer report in Piiano Flows

So, what do the scan reports help you do?

Identify sensitive data storage 

Where your project defines its database in code, this section of the report details the content of tables that may contain personal or sensitive data. For example, from the Shopizer scan you see that the database includes a table for Billing details. This table includes the customer's first and last name and phone number, among other details.

How to identify sensitive data storage in Piiano Flows

This report is of great use to your privacy officer. With it, they can see what sensitive or personal data the application stores without needing to read the code or find a developer to extract the information.

Identify sensitive data leaking through logs 

We've probably all done it: while tracking down a bug during development, we add a log entry. We fix the bug but then forget to remove the log. What if that log included personal or sensitive data and the entry makes its way to production?

You probably don't secure your logs in the same way as your database or other transactional data. Indeed, you may use a third-party service to store and analyze logs. 
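To make this concrete, here is a small hypothetical sketch of the pattern. None of it is taken from a scanned project: the class, method names, and message text are invented for illustration. The first method builds the kind of leftover debug message that writes a raw email address to the log, and the second masks the address before logging.

```java
import java.util.logging.Logger;

// Hypothetical example; the class, method, and field names are illustrative only.
class LoginAuditExample {
    private static final Logger LOG = Logger.getLogger(LoginAuditExample.class.getName());

    // Leaky: a leftover debug line that puts the raw email address in the log.
    static String leakyMessage(String studentId, String email) {
        return "Login failed for student " + studentId + " with email " + email;
    }

    // Safer: keep the identifier needed for debugging, mask the personal data.
    static String maskedMessage(String studentId, String email) {
        return "Login failed for student " + studentId + " with email " + maskEmail(email);
    }

    // Keeps only the first character and the domain of the address.
    static String maskEmail(String email) {
        int at = email.indexOf('@');
        if (at <= 0) return "***"; // not a recognizable address; hide it entirely
        return email.charAt(0) + "***" + email.substring(at);
    }

    public static void main(String[] args) {
        LOG.info(leakyMessage("s-1024", "alice@example.com"));  // leaks PII
        LOG.info(maskedMessage("s-1024", "alice@example.com")); // masked
    }
}
```

Both calls look identical at the point where the message reaches the logger, which is part of why this kind of leak is easy to miss in review. Whether the masked form is acceptable depends on your own policy; in some contexts the ID or the email domain may itself be sensitive.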

The log leaks scan looks for this type of potential leak. Take this example from Teammates.

How to identify sensitive data leaking through logs in Piiano Flows

Here a log entry is written that includes the student ID and their unsanitized email address. 

This is a good example of why log leaks can be difficult to track down. In the right-hand section of the report you can see the destination for the log record.

How to find the destination for the log record in Piiano Flows

Nothing obviously problematic here, just a log line being output.

However, when you look at the flow on the left-hand side you can see the creation of the log line details, including the email address:

Details of the log lines in Piiano Flows report

Depending on the scope of the issue that resulted in these unsanitized email addresses, the log could be a trove for a hacker.

Identify outbound data leaks

This section of the report looks at third-party API calls that could be accessing sensitive data. 

You might assume that the production version of your app only sends data to third parties that are approved and known to be secure. However, oversights and omissions are always a risk. For example, as with logs, someone could forget to remove third-party calls used during testing or development. And, even if all the calls are planned, an audit you can use to check that all third-party libraries are secure could be invaluable.

Here is an outbound scan result from Teammates.

Screenshot from the outbound scan result from Teammates in Piiano Flows

In this example, the application is sending a person's name to an external email service. This may be perfectly legitimate and part of the system design, but there is a need to flag this and make sure it is by design. The benefit of Flows is that the egress of data from the system can be quickly and easily identified, allowing effort to focus on determining its legitimacy.
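One way to make "by design" explicit in code is to build outbound payloads from an approved list of fields. The sketch below is hypothetical and is not how Teammates or Flows works: the class name, the field names, and the idea of an approved-field set are my own assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: only explicitly approved fields leave the system.
class OutboundPayloadExample {
    // Fields the team has agreed may be sent to this (hypothetical) email provider.
    private static final Set<String> APPROVED_FIELDS = Set.of("email", "name");

    // Copies only approved fields from the user record into the outbound payload.
    static Map<String, String> buildPayload(Map<String, String> user) {
        Map<String, String> payload = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : user.entrySet()) {
            if (APPROVED_FIELDS.contains(e.getKey())) {
                payload.put(e.getKey(), e.getValue());
            }
        }
        return payload;
    }
}
```

With this shape, sending the person's name is a visible decision recorded in one place, and a field such as a phone number added to the user record later does not silently reach the third party.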

Identify inbound data leaks

The inbound section of the report provides details of the declaration (class member) and use of sensitive data in your code.

This report is useful because, for example, GDPR requires you to monitor personal and sensitive data being processed by your application. 

This example shows how Shopizer obtains a billing telephone number. 

Screenshot of a Shopizer scan in Piiano Flows, showing how it obtains a billing telephone number

From this, you can see that Shopizer is obtaining order information from a shopping cart through a REST API.

Flows also identifies data stored in the database (see the “Persistent” label). This is something you should review to ensure it was implemented as designed.

Conclusion: Application Data Leaks 

Locking the stable door after the horse has bolted is not a particularly effective approach to managing the security of personal and other sensitive data. However, for many organizations, this is the approach they take.

The emergence of tools like Piiano Flows enables organizations to take a proactive and automatic approach to data security and take practical steps to ensure their system embodies the principles of security by design. This approach secures the data from the source, making sure your customers' sensitive data is much less likely to be unwittingly exposed.

Create your account today and get started for free!

About the author

Guy Feigenblat

Director of AI, PhD

Guy, Director of AI at Piiano, spearheads the development of diverse AI and deep learning models, overseeing statistical data analysis, AI infrastructure, automation, and model deployment. He was formerly a research lead at IBM Research AI's Language and Retrieval group.
