# Algorithms, Hashing, Parallel processing,          and the Nearest Neighbor problem in High Dimensions

Exact Near Neighbor in high dimensional spaces is a quite hard problem, because of the curse of dimensionality, where the spaces become so sparse that known algorithms in low dimensional spaces tend to be comparable with the brute force approach. For that reason, the community has focused on geometric approximate algorithms, where one trade offs accuracy for speed. Moreover, the Approximate Nearest Neighbor problem can be reduced to that problem, which strengthens the interest in Approximate Near Neighbor. The c-approximate Near Neighbor decision problem in high-dimensional spaces has been mainly addressed by Locality Sensitive Hashing (LSH).

While, in practice, it is important to ensure linear space usage, most previous work in this regime focuses on the case that c exceeds 1 by a constant term. We present a new, simple data structure using linear space and sublinear query time for any c > 1: Given an LSH family of functions for some metric space, we randomly project points to vertices of the Hamming cube in dimension ≤ log n, where n is the number of input points. The search algorithm projects the query, then examines points which are assigned to nearby vertices on the cube.

Our algorithm can be broken down to smaller, easier and more manageable subalgorithms (subproblems). We investigate every single subalgorithm further and try to optimize the ones that are going to execute frequently enough to affect the overall performance, while at the same avoiding becoming a victim of premature optimization. In particular, we ex- amine recursive and iterative approaches for the same subproblem and compare them, both theoretically and in practise. Due to the nature and simplicity of our algorithm, parallelization can be applied, both in the preprocessing stage, as well as in the query phase. The data we target are big enough to allow parallelization to give a significant boost in the execution time of our algorithm.

Hashing is at the heart of our implementation, which prompts us to examine it gingerly. Namely, we examine a plethora of approaches regarding data representation and data type. We combine these approaches with custom-made simple, helper functions or functions provided by the standard library of C++, and compare them. Despite the huge effort in optimizing our code, we preserve its readability, which comes along with the popularity of our project in GitHub. Quite remarkable is the fact that our implementation amounts to only 716 lines of code, including the extensive documentation in every file.

We report on several experiments with synthetic and real datasets in dimension ≤ 1024 and n ≤ 106. Our software is faster than brute force by several orders of magnitude. More- over, we compare it against the state-of-the-art LSH-based library FALCONN: our algorithm and implementation are significantly simpler, with comparable performance in terms of query time and storage for one of the LSH families used by FALCONN, whereas our code is orders of magnitude faster than brute force.

Master Thesis PDF. GitHub: Dolphinn. Paper: Practical linear-space Approximate Near Neighbors in high dimension, also presented in EuroCG ’17 conference.

Date : Winter 2016

# Virtual Reality Game

User Experience Quality Assurance was the main focus here. A Serious Game named “Feed the Polar Bear”, where the students help a bear that just woke up from hibernation to its feeding grounds, while having to overcome many natural obstacles and answer a plethora of educative questions, provided by the teacher.

Moreover, the children are gaining awareness of the Global Warming effect and how this effects the Wildlife. The game scales in difficulty, both on the scenery stage as well as in the questions asked by Polar Birds, who are indirectly guiding the guest to the island of food. The game was the final project of the “Virtual Reality” Masters Course in DIT, and the game was developed by Giannoutos, Koulalis and me (Samaras).

Character selection with real information to give a reality sense. Live Demo.

Technologies : HTML5, Construct2 and Unity.

Date : Winter 2016

# Spark Similarity Search

Yahoo! project: The goal was to produce quantized codes for the yfcc100m dataset, a Flickr dataset of 100m images. First, deep-learning features should be extracted, named CroW features. Then, after applying dimensionality reduction with PCA, the LOPQ codes should be produced (something like hashing for a rough intuition). Finally, this pipeline would compete the existing image features Yahoo uses in production and shall claim their replacement, since they were proven to be better.

The most challenging part though for this project was scaling with Spark and Hadoop, in such a high magnitude (over 15T input data) that even the people of the Grid section wouldn’t be able to help. From this project I learned a lot in debugging and fine tuning Spark jobs with Big Data, since the experience I gained from this was on a real-world problem, exploring areas where no one ever was. The understanding of the ecosystem of Spark is something remarkable, and I documented it well, so that others can learn from it, and try to debug their Spark jobs. 7 Spark issues/bug were discovered in the process, opening new JIRA tickets and giving us hope for a better Spark for future users, such as the Spark implementation of k-means||. Through that project, documentation regarding Partitions in Spark (intuitively the chunks to distribute your data, critical for Spark jobs) was created by me in Stack Overflow.

Query image

Fully connected results

Crow features with LOPQ (our approach)

Demo, as Flickr introduces visually similar search, which scales to dozens of billions of Images.

Technologies : HTML5 , CSS3 , javaScript, jQuery, Bootstrap, and other libraries.

Date : Summer 2016

# Big Data Visualization

Yahoo! project: After conducting a series of k-means experiments on 100m Flickr images and computing the Cost (sum of squared distances of points to their nearest center) and Unbalanced Factor (a heuristic that gives an intuition of how much balanced your data are across clusters), we wanted to see whether minifying these two entities would produce a better visual effect for the user, since many applications, such as similarity search are based on k-means, and the better the quality of k-means, the better the visual results.

An elegant, Client-only web application to visualize the k-means clusters. The idea is that this should be as easy to use as downloading the project and launching the “index.html”. The project is so flexible, that accepts a variety of input files and formats. Two implementations are provided; a conservative one, that will warn the user when the file is too big and abort the process, and an aggressive one, that will try to parse the data, no matter of how big they are, allowing the user to make use of the ‘max clusters’ and ‘max images per cluster’ input parameters.

k-means Data Visualization:Images in action.

Run k-means with Random Init 3 times, one per {10, …, 100} iteration, and report the overall best run (min Cost). Same with Fixed Init.

Run k-means with Random Init 3 times, one per {10, …, 100} iteration, and report the overall best run (min Unbalanced Factor). Same with Fixed Init.