Word2vec Word Vectors in JavaScript

Word2vec is a program that takes natural language words and assigns them vectors whose components capture aspects of what those words mean. I downloaded a large list of word vectors online and converted the vectors for the most common English words into JSON to lower the barrier to entry.

You can find the files here: https://github.com/turbomaze/word2vecjson/tree/master/data

Process

First, I downloaded a 1.6 GB .bin file containing word vectors for nearly 3 million words. Then I wrote a Java program that loads frequency information for around 150,000 English words. Given an input $n$, the program reads through the large .bin file and outputs the word vectors for the $n$ most common English words as JSON. I ran the program for several values of $n$ (1000, 5000, 10000, and 25000) and uploaded the results.
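The selection step can be sketched in JavaScript. This is only an illustration, not the original Java program: it assumes the vectors have already been parsed out of the .bin file into a plain object, and the names `topNWordVectors`, `frequencies`, and `vectors`, along with the toy data, are made up for the example.

```javascript
// Hypothetical helper: keep the n most frequent words that also have a vector.
// `frequencies` maps word -> count, `vectors` maps word -> array of numbers.
function topNWordVectors(frequencies, vectors, n) {
  const ranked = Object.keys(frequencies)
    // Skip words that appear in the frequency list but have no vector.
    .filter((word) => Object.prototype.hasOwnProperty.call(vectors, word))
    // Most frequent first.
    .sort((a, b) => frequencies[b] - frequencies[a])
    .slice(0, n);

  const result = {};
  for (const word of ranked) {
    result[word] = vectors[word];
  }
  return result;
}

// Toy data (made up; real vectors have many more components).
const frequencies = { the: 1000, of: 800, cat: 50, zyzzyva: 1 };
const vectors = { the: [0.1, 0.2], of: [0.3, 0.1], cat: [0.9, 0.4] };

const json = JSON.stringify(topNWordVectors(frequencies, vectors, 2));
// json is '{"the":[0.1,0.2],"of":[0.3,0.1]}'
```

Filtering before slicing means a frequent word with no vector does not take up one of the $n$ slots.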

Demo

You can play around with word vectors in your browser here: http://turbomaze.github.io/word2vecjson/

You can evaluate expressions like "berlin" + ("france" - "paris") and get sensible answers, like "germany". The difference between two word vectors encodes a relationship that often carries over to other pairs of words. Here, "france" - "paris" captures the relationship between a country and its capital city, so adding it to "berlin" lands near "germany".
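As a rough sketch of how such an expression can be evaluated: add and subtract the vectors component by component, then look for the word whose vector is closest to the result. Using cosine similarity as the measure of closeness is an assumption here, not necessarily what the demo does, and the function names and 2-D toy vectors below are invented for illustration.

```javascript
// Component-wise vector arithmetic.
function add(a, b) {
  return a.map((x, i) => x + b[i]);
}

function subtract(a, b) {
  return a.map((x, i) => x - b[i]);
}

// Cosine of the angle between two vectors (1 means same direction).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: the word whose vector is most similar to `target`,
// ignoring the words that were used to build the query.
function nearestWord(vectors, target, exclude = []) {
  let best = null;
  let bestScore = -Infinity;
  for (const word of Object.keys(vectors)) {
    if (exclude.includes(word)) continue;
    const score = cosineSimilarity(vectors[word], target);
    if (score > bestScore) {
      bestScore = score;
      best = word;
    }
  }
  return best;
}

// Made-up 2-D vectors chosen so the analogy works out exactly.
const toyVectors = {
  paris: [1, 0],
  france: [1, 1],
  berlin: [3, 0],
  germany: [3, 1],
  banana: [-1, -2],
};

// "berlin" + ("france" - "paris") = [3, 0] + [0, 1] = [3, 1]
const query = add(toyVectors.berlin, subtract(toyVectors.france, toyVectors.paris));
const answer = nearestWord(toyVectors, query, ["berlin", "france", "paris"]);
// answer is "germany"
```

Excluding the query words matters in practice, because the result of the arithmetic often stays close to one of its own inputs.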

For more info about word vectors, check out the word2vec project: https://code.google.com/p/word2vec/
