The project began with training in the Multivariate Big Data Analysis (MBDA) framework, and training in multivariate data analysis and machine learning techniques continued throughout the project. Research related to the core of the project covered SDN data analysis, Principal Component Analysis (PCA)-based MBDA, and Partial Least Squares (PLS)-based MBDA. The MBDA methodology consists of five steps: data parsing, data fusion, detection, diagnosis, and deparsing. It has been proposed as a solution that supports root cause analysis in anomaly detection and traffic classification problems, as opposed to black-box models such as deep learning. In the detection phase we used two techniques: PCA and PLS. We experimented with publicly available datasets. Regarding the PCA-based MBDA methodology, we wrote a paper that is currently under revision. We are still working on PLS-based MBDA: we have performed initial experiments, but the results are too preliminary to report at this point.
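To illustrate the detection step, the following is a minimal sketch of PCA-based anomaly detection, not the MBDA implementation itself. It assumes that the parsed and fused data are available as a numeric matrix (rows are observations, columns are features), that detection relies on the two statistics commonly used in PCA-based monitoring (the D-statistic, or Hotelling's T², and the Q-statistic, or squared prediction error), and that control limits are set as empirical percentiles of calibration data. The function names and the percentile-based limits are illustrative assumptions.

```python
import numpy as np


def pca_statistics(model, X):
    """Return the D-statistic (Hotelling's T^2) and Q-statistic (SPE) per row of X."""
    Z = (X - model["mean"]) / model["std"]          # autoscale with calibration parameters
    T = Z @ model["P"]                              # scores in the PCA subspace
    d = np.sum(T ** 2 / model["score_var"], axis=1) # variance-normalised distance in the model
    E = Z - T @ model["P"].T                        # residuals not explained by the model
    q = np.sum(E ** 2, axis=1)
    return d, q


def fit_pca_monitor(X_cal, n_components=2, percentile=99.0):
    """Fit a PCA model on calibration data and derive empirical control limits.

    Assumption: limits are percentiles of the calibration statistics,
    chosen here for simplicity rather than taken from the project.
    """
    mean = X_cal.mean(axis=0)
    std = X_cal.std(axis=0, ddof=1)
    std[std == 0] = 1.0                             # avoid division by zero for constant features
    Z = (X_cal - mean) / std
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    model = {
        "mean": mean,
        "std": std,
        "P": Vt[:n_components].T,                   # loadings
        "score_var": s[:n_components] ** 2 / (Z.shape[0] - 1),
    }
    d_cal, q_cal = pca_statistics(model, X_cal)
    model["d_limit"] = np.percentile(d_cal, percentile)
    model["q_limit"] = np.percentile(q_cal, percentile)
    return model


def detect(model, X):
    """Flag observations whose D- or Q-statistic exceeds its control limit."""
    d, q = pca_statistics(model, X)
    return (d > model["d_limit"]) | (q > model["q_limit"])
```

In this sketch, a model is fitted once on traffic assumed to be normal with `fit_pca_monitor`, and `detect` is then applied to new observations. Flagged observations would be passed on to the diagnosis step.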
We have also worked on other multivariate techniques based on linear matrix factorization that can be used in the network analytics process. Apart from PCA and PLS, we are using ANOVA Simultaneous Component Analysis (ASCA), Parallel Factor Analysis (PARAFAC), and sparse methods. We are preparing both a conference paper and a journal paper. Moreover, we have worked on the extension of MBDA to Federated Learning.
Soon after the beginning of the project, we noticed that machine learning results depend strongly on the quality of the dataset, and we found that there is a general shortage of high-quality datasets in the networking area. Building a real, high-quality dataset in a laboratory environment is a very challenging and time-consuming task. Moreover, evaluating the quality of a dataset is itself a common problem. Although there are several methods in the data quality assessment field, none is completely functional, and this problem is generally overlooked in network research. During the project, we devoted much effort to this problem. Our goal was to find a way to evaluate the quality of a dataset holistically and automatically, so that it could be used in an autonomous network to assess the quality of datasets built from automatic telemetry measurements. We proposed the PerQoDA methodology, which is based on permutation testing. This method tests the strength of the relationship between observations and labels. If this relationship is weak, we cannot expect an ML model trained on the dataset to perform well. We obtained significant results in this field and prepared two papers that were presented at two high-quality conferences as well as during a research visit to Paris.
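The idea of testing the relationship between observations and labels by permutation can be sketched as follows. This is a generic label-permutation test, not the PerQoDA implementation. The nearest-centroid classifier, the single train/test split, and all function names are assumptions made to keep the example short. The performance obtained with the true labels is compared with the performance obtained after shuffling the labels. If the true labels do not clearly outperform the shuffled ones, the relationship between observations and labels is weak.

```python
import numpy as np


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a nearest-centroid classifier (illustrative stand-in for any ML model)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test observation to every class centroid
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_test).mean())


def permutation_test(X, y, n_permutations=200, test_fraction=0.3, seed=0):
    """Compare the score with true labels against scores with permuted labels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]

    true_score = nearest_centroid_accuracy(X[train], y[train], X[test], y[test])

    perm_scores = np.empty(n_permutations)
    for i in range(n_permutations):
        y_perm = rng.permutation(y)  # breaks any link between observations and labels
        perm_scores[i] = nearest_centroid_accuracy(
            X[train], y_perm[train], X[test], y_perm[test]
        )

    # Fraction of permutations scoring at least as well as the true labels,
    # with the usual +1 correction so the p-value is never exactly zero
    p_value = (1 + np.sum(perm_scores >= true_score)) / (n_permutations + 1)
    return true_score, perm_scores, p_value
```

A small p-value suggests that the labels carry information the model can exploit. A large one suggests that the dataset, as labelled, is unlikely to support a well-performing model.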