All work done is contained within my GitHub repository\footnote{\href{https://github.com/SamuelTrew/FederatedLearning}{https://github.com/SamuelTrew/FederatedLearning}}, with the results from all of the experiments and investigations in a separate repository\footnote{\href{https://github.com/SamuelTrew/FYP_Results}{https://github.com/SamuelTrew/FYP\_Results}}.
\section{Motivation}
The use of Machine Learning (ML) to enhance user experiences, to provide powerful tools, or to make accurate predictions requires large amounts of data --- normally collected by a single body or organisation.
This has its limitations; there are cases where no one group can collect enough data.
An example of this is with medical data.
\\ \\
Medical data is typically well-protected, with many regulations in place to prevent its misuse \cite{nhs_digital_data}.
This is due to the sensitive nature of the data, which pertains to an individual's state of health and other private information.
Directly exposing such information to researchers is undesirable.
While these protections (in the form of legislation such as the General Data Protection Regulation (GDPR) \cite{gdpr} or the Health Insurance Portability and Accountability Act (HIPAA) \cite{hipaa}) are beneficial to user confidentiality, they pose a problem for researchers, limiting their ability to apply data-intensive ML procedures in the medical sector.\\
Typical methods of training ML models would not work, as the personal data should not leave the hospitals or the users' devices.
This is where Federated Learning (FL) \cite{federated_comic} comes in.
FL is a setup in which each client involved in the procedure trains an agreed-upon model on their own data.
Instead of sharing the data with anyone else, the clients share the models they trained with a central server.
This server then combines the models from all of the clients into one unified model.
The unified model is then distributed back to the clients, who train it further on their own data.
The process is repeated until the model converges.
\\ \\
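To make the round structure concrete, here is a minimal sketch of one such round with simple FedAvg-style parameter averaging. The \texttt{Client} object and its \texttt{local\_train} method are hypothetical stand-ins for illustration, not the interface used in this project's codebase.
\begin{verbatim}
import copy
import torch

def federated_round(global_model, clients):
    """One FL round: clients train local copies, the server averages them."""
    client_states = []
    for client in clients:
        local_model = copy.deepcopy(global_model)
        client.local_train(local_model)      # trains on the client's own data
        client_states.append(local_model.state_dict())

    # FedAvg-style aggregation: parameter-wise mean over all client models
    averaged = copy.deepcopy(client_states[0])
    for key in averaged:
        averaged[key] = torch.stack(
            [state[key].float() for state in client_states]).mean(dim=0)

    global_model.load_state_dict(averaged)   # redistributed for the next round
    return global_model
\end{verbatim}
In this sketch the server trusts every client's update equally when taking the mean, which is exactly the assumption the attacks below exploit.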
However, this style of collaborative learning can be exploited by malicious or faulty clients \textbf{sending bad updates} to the server, interfering with the global model.
There are already several methods for defending against this to allow robust aggregation, but given the varying styles and complexities of current attacks, they can be shown to be insufficient.
\\ \\
There is also the case of clients trying to \textbf{steal the model} without contributing, and this kind of attack proves a formidable challenge for current aggregators.
Investigating, enhancing and creating a robust aggregator able to handle all of these attacks is a tricky problem, but one that will be detailed throughout this report.
\section{Objectives}
The main focus of this project is to implement and evaluate robust aggregation methods for FL.
In our work, we use well-known datasets such as MNIST, which are well suited to covering the variety of attacks we are considering.
However, other forms of ML (e.g. LSTMs) might be susceptible to different types of attacks, and it should be noted that these are not covered here.
\\ \\
The initial experiments use MNIST \cite{mnist} digit classification, as it serves as a good baseline for initial investigation. This is to gauge the performance of the robust aggregation algorithms when applied to such common datasets.
Then, we conduct an evaluation of the different strategies and determine the driving characteristics behind their performance. This will provide insight into what helps tackle the issues faced by FL.
\\ \\
We then move on to creating our own robust aggregation strategy, one that is more capable of handling a large variety and combination of attacks in a robust manner.
The aim is to end up with a State of the Art (SotA) algorithm that outperforms the other solutions.
\\ \\
In the future, we would like to carry out further investigation in the medical domain, to evaluate the performance of our aggregator on a real-world dataset.
This might be hard to achieve in practice due to the sensitive nature of the data; a publicly available competition dataset might prove to be the better option.
\section{Contributions}
\begin{enumerate}
\item \textbf{Improved FedMGDA+:}
This project enhances an aggregator called FedMGDA+ in a variety of ways.
It increases the global model's accuracy under this aggregation, and also makes the final results more deterministic.
The smoothness of aggregation and the speed of convergence are also improved, resulting in a more robust algorithm.
Finally, the accuracy and capability of the blocking are far better, blocking most (if not all) of the malicious clients while almost never blocking benign clients.
\item \textbf{Highlighting Limitations of Aggregators:}
Current robust aggregators are unable to defend against free-riding clients, with certain ones (e.g. FedMGDA+) performing particularly poorly under this attack.
This project explains why and how they fail, suggesting that it could be due to a lack of understanding of the space.
\item \textbf{Enhancing Aggregation through Clustering:}
This project introduces a novel solution to enhance current robust aggregation methods (e.g. COMED) through the use of K-Means Clustering; a simplified sketch of the idea appears after this list.
This takes COMED from blocking clients with less-than-ideal accuracy at 15 malicious clients (out of a total of 30) all the way up to smoothly aggregating against 22 malicious clients with much better accuracy.
\item \textbf{Creating my own Robust Aggregator:}
This project creates a novel robust aggregator that outperforms all other robust aggregators tested, which are to date some of the best available (SotA).
It uses a combination of clustering, dimensionality reduction, personalisation and adaptiveness to create a robust algorithm capable of blocking up to 26 malicious or free-riding clients (out of a total of 30).
It does this to the extent that the error rate is similar to that of FedAvg with no malicious clients at all.
It is also the only aggregation method tested that does not give free-riders access to any learned model; as such, it is the only system that prevents information theft by free-riders, ensuring they are left with a useless model.
\item \textbf{Future Work:}
This project includes some short, yet thorough, ideas and concepts that could be used to further the work done here.
Some of this content is specific to the work done with FedPADRC, while the rest consists of ideas and postulations that I had over the course of the project but was not able to incorporate.
\item \textbf{Codebase:}
The codebase associated with this project is not only open for anyone to use, but has also been greatly improved (from previous versions) to allow greater extensibility with other robust aggregation or privacy-preserving methods.
This was important to make sure that other researchers and students can easily test the work done in this project and integrate it freely into their own work, whatever their needs are.
The code is in a ready and workable state for anyone who wants to use it.
\end{enumerate}
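To make the clustering idea from the third contribution concrete, the following is a simplified sketch rather than the project's actual COMED or FedPADRC implementation: each client's update is flattened into a vector, the vectors are clustered with K-Means, the largest cluster is kept, and the kept cluster is aggregated with a coordinate-wise (COMED-style) median. Keeping the largest cluster assumes a benign majority; the project's own aggregators are built precisely to cope with settings where that assumption fails.
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def clustered_median(updates, n_clusters=2):
    """updates: array of shape (n_clients, n_params), one flattened
    model update per client.  Returns a robust aggregate vector."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(updates)
    majority = np.argmax(np.bincount(labels))   # assumed-benign cluster
    kept = updates[labels == majority]
    # COMED-style coordinate-wise median within the kept cluster
    return np.median(kept, axis=0)
\end{verbatim}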
\section{Challenges}
\begin{enumerate}
\item \textbf{Tuning hyper-parameters:}
The robust aggregation methods all have their own set of unique hyper-parameters that need fine-tuning. This adds to the complexity of running our experiments, as it is difficult to obtain a definitive, optimal parameterisation of each method.
\item \textbf{Type of attacks:}
Identifying viable and representative real-world attacks could prove complicated.
There is a wide range of attacks, and so there might not be time to do full-scale testing on all of them.
Instead, this project will focus on some of the current attack strategies to a deeper, rather than broader, extent.
\item \textbf{Number of configurations:}
Every possible attack, variation in the structure of the attack, and variation in the number of clients requires a complete re-training and testing of the model (an illustrative tally appears after this list).
This is hugely time-consuming if not planned well, given the limited duration of this project.
\item \textbf{Computational resources:}
The Machine Learning and distributed nature of this project make it resource-intensive.
Increasing the number of users we test with, or using a more complex Deep Learning network, both contribute to this.
Having a more powerful system at my disposal would make it more feasible to get more done within the time-frame.
\end{enumerate}
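To illustrate how quickly the number of configurations grows, consider a hypothetical experimental grid; the counts below are made up for illustration and are not the actual grid used in this project.
\begin{verbatim}
# Hypothetical grid sizes -- illustrative only.
attacks       = 5   # distinct attack types
variants      = 3   # structural variations per attack
client_counts = 4   # e.g. 10, 20, 30, 40 clients
aggregators   = 6   # aggregation strategies under test

total_runs = attacks * variants * client_counts * aggregators
print(total_runs)   # 360 full train-and-test cycles, before repeat runs
\end{verbatim}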