No Black Box: Securing Our Data Analytics

February 18, 2019

By Allison Proffitt

February 18, 2019 | Data scientists and lawyers have fundamentally different goals, Matt Carroll explains. A data scientist wants as much relevant data as possible. A lawyer wants to limit unauthorized use of the data. How do you create a symbiotic relationship? How do you mitigate the concerns of both?

When Carroll first tackled this problem, he was an intelligence specialist with the U.S. government. There was no technology to manage that problem, he said; it was all done by hand. Now Carroll is CEO of Immuta, a four-year-old company building a data management and permissions platform meant to satisfy both data scientists and attorneys.

“I consider us a unified access and control layer,” Carroll told Bio-IT World. “Our mission in the company is to provide legal and ethical control of any data going inside or out… The platform is really a decision layer, a control plane for data: who should have access to what data for what purpose.”

The applications are broad. Finance, of course, stands out. Carroll tells of a financial customer with 70% of its data inaccessible to internal data scientists because of permission locks. Regulatory compliance was a burden on the process, and human intervention was slow and cumbersome. They didn’t have a system for managing risk, Carroll said, and needed to augment the relationship between data owners, data scientists, and lawyers.

Immuta’s machine learning platform includes a policy builder designed in collaboration with three attorneys. The policy builder brings governance, risk, and compliance into the data management process, logging everything for compliance. Globally, a legal team puts rules into place. Data scientists can then work within the rules, or request to opt out of one if needed. Every data science action is “fingerprinted” with the rule set used.
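
In rough terms, such a rule can be thought of as a declarative policy evaluated at query time, with every evaluation logged against the rule set in force. The Python sketch below is illustrative only; it is not Immuta's actual policy language, and names such as Policy, evaluate, and fingerprint are hypothetical.

    from dataclasses import dataclass
    import hashlib, json, time

    # Hypothetical, simplified policy object: the governance team defines it once,
    # and every query from a data scientist is checked against it.
    @dataclass
    class Policy:
        name: str
        allowed_purposes: set
        masked_columns: set

        def evaluate(self, user, purpose, columns):
            """Return only the columns this user may see for this purpose."""
            if purpose not in self.allowed_purposes:
                raise PermissionError(f"{user} may not query data for purpose '{purpose}'")
            return [c for c in columns if c not in self.masked_columns]

    def fingerprint(user, purpose, policy, columns):
        """Record who queried what under which rule set (illustrative audit entry)."""
        record = {"user": user, "purpose": purpose, "policy": policy.name,
                  "columns": sorted(columns), "time": time.time()}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        return record

    # Example: a lawyer-authored rule applied to a data scientist's query.
    hipaa_rule = Policy("hipaa-deid", allowed_purposes={"research"},
                        masked_columns={"name", "ssn", "dob"})
    visible = hipaa_rule.evaluate("analyst_1", "research", ["name", "dob", "diagnosis", "age"])
    audit_entry = fingerprint("analyst_1", "research", hipaa_rule, visible)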

Regulatory compliance can be very complex when it comes to analytics, Carroll says. Once you have a dashboard and you can view and analyze data, are those analytics HIPAA compliant, for instance? But he calls the Immuta software itself “really simple,” a highly logical decision engine, and the logic layer can evolve.

“How do we allow a lawyer, or a governance person, or a risk officer to work with a data owner who’s afraid of handing that data over to a data scientist who doesn’t really care about any of this, but just wants the data to build predictions that answer a question?” Carroll asks.

But that’s only access control, step one. “Where we’re going is risk management,” he said. Here the algorithms and machine learning become more mature. “We need to focus more on outcomes than input,” Carroll says. “The raw data isn’t as important as the derived data. That’s what is risky.”

Of all of the stakeholders—data owners, governance teams, and data scientists—Carroll says data owners are most pleased by the risk management offered by the Immuta platform.

“Data owners love it because they don’t have to write [permissions] code anymore. The lawyers can just quickly build a rule, and they’ve [lifted] all of that liability off the data owner, and the governance team now owns that liability. They’re OK, because that’s their job: to own liability.”

Carroll said the company cares deeply about privacy preservation within machine learning, and wants to enable companies to be privacy-first: to make money and do business, but with privacy as the starting point. That means building what he calls “circuit breakers” into data access and analytics, shutting down access if rules aren’t met.
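
The circuit-breaker idea can be sketched simply: a query is allowed only while every policy check passes, and the first failed check cuts off access until someone reviews it. The Python below is a hypothetical illustration of that pattern, not Immuta's implementation; the check functions are invented for the example.

    # Hypothetical checks a governance team might register.
    def purpose_is_approved(query):
        return query.get("purpose") in {"research", "quality-improvement"}

    def row_limit_respected(query):
        return query.get("rows", 0) <= 10_000

    class CircuitBreaker:
        """Illustrative circuit breaker: data flows only while every policy check passes."""

        def __init__(self, checks):
            self.checks = checks   # callables returning True or False
            self.tripped = False   # once tripped, access stays off until reviewed

        def allow(self, query):
            if self.tripped:
                raise PermissionError("circuit open: access suspended pending review")
            for check in self.checks:
                if not check(query):
                    self.tripped = True   # a single failed rule shuts the path down
                    raise PermissionError(f"policy check '{check.__name__}' failed; access revoked")
            return True

    breaker = CircuitBreaker([purpose_is_approved, row_limit_respected])
    breaker.allow({"purpose": "research", "rows": 500})      # passes
    # breaker.allow({"purpose": "marketing", "rows": 500})   # would trip the breaker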

These data control needs are most evident at scale, he says. “How do we all agree on the different approaches that were taken? How do we document that? How do we have the appropriate enterprise controls to understand what happened, what we’re doing? [How do we have] the proper auditing behind so we can reproduce [our findings]?”

Immuta’s fingerprinting technology tracks not only algorithms used in analysis, but also the training dataset used to develop the algorithm. “The code is just as important as the training set, the validation set, the environment variables that went into it. No one is thinking about that! No one is thinking about the policies and the data and how that can change the algorithm. That’s the maturity that needs to occur.”
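
As a rough illustration, such a fingerprint might bundle hashes of the code, the training and validation sets, and the relevant environment variables into a single auditable record. The sketch below is a minimal, hypothetical structure in Python, not Immuta's format; the file paths and variable names are assumptions.

    import hashlib, json, os, sys

    def sha256_file(path):
        """Hash a file's contents so any change to code or data changes the fingerprint."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def model_fingerprint(code_path, train_path, valid_path, env_vars):
        """Bundle code, training set, validation set, and environment into one record."""
        record = {
            "code": sha256_file(code_path),
            "training_set": sha256_file(train_path),
            "validation_set": sha256_file(valid_path),
            "environment": {k: os.environ.get(k, "") for k in env_vars},
            "python": sys.version.split()[0],
        }
        record["fingerprint"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        return record

    # Example with hypothetical paths and variables:
    # fp = model_fingerprint("train.py", "train.csv", "valid.csv", ["CUDA_VISIBLE_DEVICES"])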

Life Sciences Pain Points

Carroll sees two areas of security weakness in life sciences in particular: genomics and medical imaging.

First, we are making some pretty big assumptions about genomics, he says. We don’t know what we don’t know about the genome; we are inferring from a fairly limited data pool. How can we be sure that the way we manage data now will still be secure in the future?

The second area of concern: medical imaging. You can’t trust an algorithm to fully process medical image data, Carroll says. A radiologist still needs to see it. While computer vision algorithms are very good at change detection, there are incidental findings and derivative data that must be dealt with.

And there are so many medical images available, most saved in hospital file systems.

Carroll concedes that a wealth of images must be made available to train computer imaging algorithms. “How does everyone have fair access to training data? I think it’s for the greater good that we have access to each other’s data for the sake of better healthcare.” But those images are still personal medical information, with possible implications for not only the patient but the patient’s family.

Other challenges in medical imaging: overfitting and derivative data, both of which Carroll finds particularly scary with unsupervised learning. “It’s opaque. Maybe it works. Maybe it doesn’t.”

Immuta's goal is to take “scary” out of the equation, Carroll says. “How do we make that process 1) more transparent, and 2) how do we make that process speedy?”