Powerful predictions and privacy

Thursday, July 14, 2022

Our devices collect vast troves of data about our lives. This data is useful for myriad reasons: alerting us to health concerns, optimizing our bank accounts, providing an endless stream of fascinating content to consume. But once that data is collected and centralized, it becomes a target for hackers, marketers, political campaigns, and more. So how do we get what we need out of the data we collect without opening ourselves to risk?

The research of Dr. Samuel Horváth, MBZUAI Assistant Professor of Machine Learning, attempts to answer some of these questions. His interests lie at the intersection of mathematics, computer science, machine learning, optimization, and statistics, with one particular area of focus: federated learning, a sub-discipline of machine learning.

Machine learning has by now entered the popular lexicon as the way in which we “train” computer systems to understand and use data sets to accomplish many different things. This is how Netflix makes relevant suggestions based on your previous viewing patterns, for example.

“Federated learning introduces a new paradigm for machine learning by bringing training directly to clients where the original data never leaves the device with the goal to mitigate risks of centralized data collection,” Horváth said. “The challenge is then to design systems that can respect privacy, train on data that cannot be directly collected and where users can only be irregularly asked for updates.”

A good example is the health data collected and stored on our phones. While data related to your heart is important to you, your doctor, and your heart-monitoring app, it should be kept private. The consequence of this privacy, unfortunately, is that the models that might save our lives cannot become the powerful predictors we need.

Machine learning, in some incarnations, is akin to computer models feasting at the buffet of data our wearables and hospital tests create. If we starve these computer models, they simply cannot do their job. A natural tension occurs then between powerful predictive capacity on the one hand, and privacy on the other. This is precisely where federated learning really shines, according to Horváth.

“These are learning paradigms where the data is distributed across clients and you need to respect the privacy of client data,” Horváth said. “My research is about understanding the challenges that these types of distributed data sets bring, as well as designing and analyzing methods and algorithms for more efficient training.”

“The question is: How can we still progress the machine learning field if we might lose access to data due to privacy requirements?” Horváth mused. This is where federated learning offers a viable solution, according to Horváth: the goal is to analyze and learn from decentralized data distributed among many owners or clients without exposing their data. “We believe that some of the federated learning models we are working towards now aim to respect the privacy of users, while also fulfilling the needs of clients that need to leverage this data to advance healthcare outcomes, for example,” Horváth added. “This way, we can enable training on much larger decentralized data sets for predicting heart attacks, for example.”

“This is the ultimate goal of federated learning. And actually, in some instances, you might be able to boost your overall computing power because you are using a little bit of the capacity of millions of devices, rather than one main resource,” Horváth said.
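The paradigm Horváth describes can be illustrated with a toy sketch of federated averaging (FedAvg), the canonical federated learning algorithm: each client trains a model on its own private data, and the central server only ever receives model parameters, never the raw data. The one-parameter linear model, function names, and synthetic data below are illustrative assumptions for this article, not a description of Horváth's own systems.

```python
# Toy sketch of federated averaging (FedAvg). Each "client" holds
# private (x, y) samples that never leave it; only trained model
# parameters are sent to the server, which aggregates them.
import random

def local_update(w, client_data, lr=0.1, epochs=5):
    """One client's local training: gradient descent on squared error
    for a one-parameter linear model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in client_data) / len(client_data)
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server step: average client models, weighted by local data size."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

def fedavg(clients, rounds=25):
    """Alternate local training and server aggregation for several rounds.
    (Real deployments sample only a subset of devices each round.)"""
    global_w = 0.0
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_w = federated_average(updates, sizes)
    return global_w

# Synthetic example: three "devices" each privately hold samples of y = 3x.
random.seed(0)
clients = [[(x, 3 * x) for x in (random.uniform(-1, 1) for _ in range(n))]
           for n in (10, 20, 30)]
w = fedavg(clients)
print(f"learned slope: {w:.2f}")  # converges toward the true slope 3.0
```

The server here learns a model close to the one it would have learned on the pooled data, yet no client ever transmits a single raw sample — only a trained weight.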

Federated learning of this kind typically takes two main forms, according to Horváth. The first, as outlined above, involves researchers or service providers that need access to massive amounts of data to boost the capabilities of their apps and information systems more broadly. The second involves institutions that want to collaborate but have some reason not to share data directly.

“So, for example, if you have two institutions that are competitors — banks or hospitals are good examples — how do you boost their knowledge base and effectiveness, while ensuring that they don’t share highly sensitive or perhaps even compromising data?” Horváth said. “We want to get to the bottom of how data use can be done securely, privately, and most efficiently.”

About Samuel Horváth

Horváth completed both an M.Sc. and a Ph.D. in statistics at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. He earned a bachelor's degree (summa cum laude) in mathematics of economics and finance from Comenius University in Slovakia.

Horváth has received several awards during his studies, including a best paper award at the NeurIPS Workshop on Scalability, Privacy, and Security in Federated Learning; the best poster award at the Data Science Summer School (DS3) at École Polytechnique, France; and a best reviewer award at NeurIPS. Horváth regularly serves as a program committee member for leading machine learning journals and conferences, including the Journal of Machine Learning Research, ICML, and NeurIPS.
