Spark Similarity Search

Yahoo! project: The goal was to produce quantized codes for the YFCC100M dataset, a Flickr dataset of 100 million images. First, deep-learning features (CroW features) are extracted. Then, after dimensionality reduction with PCA, the LOPQ codes are produced (roughly comparable to hashing, for intuition). Finally, this pipeline would compete with the image features Yahoo uses in production and, having proven better, make the case for their replacement.

The most challenging part of this project, though, was scaling with Spark and Hadoop at such a magnitude (over 15 TB of input data) that even the people of the Grid section couldn't help. I learned a lot about debugging and fine-tuning Spark jobs on Big Data, since the experience came from a real-world problem in largely uncharted territory. I built a solid understanding of the Spark ecosystem and documented it well, so that others can learn from it and debug their own Spark jobs. Seven Spark issues/bugs were discovered in the process (including in Spark's implementation of k-means||); new JIRA tickets were opened, giving hope for a better Spark for future users. Through that project, I also wrote documentation on Stack Overflow about partitions in Spark (intuitively, the chunks into which your data is distributed, critical for Spark jobs).
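As a rough, framework-free sketch of what a partition is (in real Spark you would tune this with `repartition`/`coalesce` on an RDD or DataFrame; the numbers here are hypothetical):

```python
def partition(records, num_partitions):
    """Split records round-robin into num_partitions chunks,
    mimicking how Spark distributes data across tasks."""
    chunks = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        chunks[i % num_partitions].append(rec)
    return chunks

records = list(range(1000))
few = partition(records, 4)     # few large chunks -> memory pressure per task
many = partition(records, 200)  # many small chunks -> scheduling overhead
print(max(len(c) for c in few), max(len(c) for c in many))  # 250 5
```

Tuning the partition count is exactly this trade-off at cluster scale: too few partitions and a single task must hold too much data; too many and the scheduler overhead dominates.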

[Figure: retrieval results for a query image — fully connected features vs. CroW features with LOPQ (our approach)]

Technologies: HTML5, CSS3, JavaScript, jQuery, Bootstrap, and other libraries.

Date: Summer 2016


Big Data Visualization

Yahoo! project: After conducting a series of k-means experiments on 100 million Flickr images and computing the Cost (the sum of squared distances of points to their nearest center) and the Unbalanced Factor (a heuristic that gives an intuition of how balanced your data are across clusters), we wanted to see whether minimizing these two quantities would produce a better visual effect for the user. Many applications, such as similarity search, are built on k-means, and the better the quality of k-means, the better the visual results.
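The two quantities can be sketched in plain Python. The Unbalanced Factor formula below (K·Σnᵢ²/N², which equals 1.0 for perfectly balanced clusters) is one common definition of such a balance heuristic and is an assumption here, not necessarily the exact formula used in the experiments:

```python
def cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

def unbalanced_factor(cluster_sizes):
    """K * sum(n_i^2) / N^2 (assumed form); 1.0 means perfectly balanced clusters,
    larger values mean points are concentrated in few clusters."""
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    return k * sum(s * s for s in cluster_sizes) / (n * n)
```

For example, four clusters of 25 points each give an Unbalanced Factor of 1.0, while sizes like [97, 1, 1, 1] push it well above 1.0.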

An elegant, client-only web application to visualize the k-means clusters. The idea is that it should be as easy to use as downloading the project and launching “index.html”. The project is flexible enough to accept a variety of input files and formats. Two implementations are provided: a conservative one, which warns the user and aborts when the file is too big, and an aggressive one, which tries to parse the data no matter how big it is, letting the user cap the result via the ‘max clusters’ and ‘max images per cluster’ input parameters.
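The two parsing modes could look roughly like this (a Python sketch of the logic; the actual app is client-side JavaScript, and the `cluster_id,image_url` line format and size threshold are hypothetical):

```python
def parse_aggressive(lines, max_clusters, max_images_per_cluster):
    """Parse no matter how big the input is, but cap what is kept:
    ignore clusters beyond max_clusters and images beyond max_images_per_cluster."""
    clusters = {}
    for line in lines:
        cluster_id, image = line.strip().split(",", 1)
        if cluster_id not in clusters:
            if len(clusters) >= max_clusters:
                continue  # cluster cap reached: skip new clusters
            clusters[cluster_id] = []
        if len(clusters[cluster_id]) < max_images_per_cluster:
            clusters[cluster_id].append(image)
    return clusters

def parse_conservative(lines, max_bytes=1_000_000):
    """Refuse oversized input up front instead of trying to render it."""
    data = list(lines)
    if sum(len(line) for line in data) > max_bytes:
        raise ValueError("input too large; aborting (conservative mode)")
    return parse_aggressive(data, max_clusters=10**9, max_images_per_cluster=10**9)
```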

k-means Data Visualization: images in action.

Run k-means with Random Init 3 times for each iteration count in {10, …, 100}, and report the overall best run (minimum Cost). Repeat with Fixed Init.

Run k-means with Random Init 3 times for each iteration count in {10, …, 100}, and report the overall best run (minimum Unbalanced Factor). Repeat with Fixed Init.
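The best-of-runs protocol above can be sketched end to end in pure Python with a toy Lloyd's k-means (the data and the mini implementation are illustrative, not the actual Spark job):

```python
import random

def kmeans(points, k, iterations, seed=None):
    """Tiny Lloyd's k-means; returns (centers, cost).
    seed=None gives Random Init; a fixed seed stands in for Fixed Init."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    total = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
                for p in points)
    return centers, total

# Toy data: three tight, well-separated 2-D blobs of 30 points each.
points = [(random.gauss(cx, 0.1), random.gauss(cy, 0.1))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]

# 3 Random-Init runs per iteration count in {10, 20, ..., 100}; keep the min Cost.
best = min(kmeans(points, k=3, iterations=iters)[1]
           for iters in range(10, 101, 10)
           for _ in range(3))
print("best (min Cost) run:", best)
```

The same loop with a `seed=` argument fixed across runs would reproduce the Fixed Init variant, and swapping the Cost for the Unbalanced Factor of the resulting cluster sizes covers the second experiment.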