Bachelor's thesis submission talk (Informatics). Veselina is advised by Dr. Felix Dietrich.
Previous talks at the SCCS Colloquium
Veselina Vazova: Scalable Manifold Learning through Landmark Diffusion
SCCS Colloquium |
Manifold learning by spectral embedding is a technique that can be used for non-linear dimensionality reduction. By extracting the spectral properties of high-dimensional data, the intrinsic manifold where the data is located on can be fitted into a lower dimension.
A newly proposed algorithm in the field of spectral embedding that has the goal of providing a scalable and robust approach to dimensionality reduction is Roseland.
The algorithm takes two sets: the data set and the landmark set and returns an embedding of the data set in a desired lower dimension q'. To achieve this, the data is first fitted: a landmark-set affinity matrix is computed that represents the affinities between the points in the data set and the points in the landmark set. After normalization of the resulting matrix, the singular value decomposition of the normalized matrix is computed. Using the singular vectors and the singular values, the Roseland embedding for a given diffusion time t is finally computed.
In its core, the algorithm is similar to the Diffusion Maps (DM) algorithm, whereas the main differences lie in the affinity matrix and the decomposition. In Diffusion Maps, the affinities between the data points themselves are calculated, without first "detouring" through a landmark set. Instead of the singular value decomposition, the eigendecomposition is performed. Since the affinity matrix of the DM is larger than the Roseland landmark-set affinity matrix, Roseland fits the data set faster than DM.
In this thesis, we implement the Roseland algorithm in the datafold package. We aim to show that Roseland can be used in the same context of dimensionality reduction as Diffusion Maps. We consider different approaches to constructing the landmark set when it is not given, and we compare the results. Finally, we evaluate the efficiency of the novel algorithm by comparing it to the performance of Diffusion Maps.