Word2vec Word Vectors in JavaScript
Word2vec is a program that takes natural language words and assigns them vectors whose components capture what those words mean. I downloaded a large list of word vectors online and converted the vectors for the most common English words into JSON to lower the barrier to entry.
You can find the files here: https://github.com/turbomaze/word2vecjson/tree/master/data
Process
First, I downloaded a 1.6 GB .bin file containing word vectors for nearly 3 million words. Then, I wrote a Java program that loaded frequency information for around 150,000 English words. Given an input of $n$, that program would read through the large .bin file and output the word vectors for the $n$ most common English words in JSON. I ran the program for various values of $n$ (1000, 5000, 10000, and 25000) and uploaded the results.
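The selection step can be sketched in JavaScript as follows. This is only an illustration, not the original Java program: it assumes the frequency counts and the full set of vectors are already in memory as plain objects, and the names topNVectors, toyFrequencies, and toyAllVectors are made up for this example, as is the toy data. The output layout (a plain word-to-array object) is also an assumption and may not match the published files.

```javascript
// Hypothetical toy data: word -> frequency count, and word -> vector.
// Real vectors have hundreds of components; these have two.
const toyFrequencies = { the: 1000, of: 800, cat: 50, zyzzyva: 1 };
const toyAllVectors = {
  cat: [0.2, 0.7],
  the: [0.1, 0.3],
  zyzzyva: [0.9, 0.4],
  of: [0.5, 0.5],
  qwerty: [0.3, 0.3], // has a vector but no frequency entry, so it is skipped
};

// Return an object holding the vectors of the n most frequent words
// that also have a vector, ordered from most to least frequent.
function topNVectors(frequencies, vectors, n) {
  const hasVector = (word) =>
    Object.prototype.hasOwnProperty.call(vectors, word);

  const words = Object.keys(frequencies)
    .filter(hasVector)
    .sort((a, b) => frequencies[b] - frequencies[a])
    .slice(0, n);

  const result = {};
  for (const word of words) {
    result[word] = vectors[word];
  }
  return result;
}

const topTwoJson = JSON.stringify(topNVectors(toyFrequencies, toyAllVectors, 2));
console.log(topTwoJson); // {"the":[0.1,0.3],"of":[0.5,0.5]}
```

Sorting the whole frequency list is fine at this scale (around 150,000 words); the expensive part in practice is reading the large .bin file, which this sketch leaves out.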
Demo
You can play around with word vectors in your browser here: http://turbomaze.github.io/word2vecjson/
You can do things like "berlin" + ("france" - "paris") and get logical answers, like "germany". The difference of two word vectors is a relationship that often applies to other word vectors. "france" - "paris" is the offset from a capital city to its country, so adding it to "berlin" lands near "germany".
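Here is a minimal sketch of that arithmetic. The three-component vectors are invented so the example works out by hand; real word2vec vectors have hundreds of components. Ranking candidates by cosine similarity and skipping the input words are common choices for this kind of query, but they are assumptions here and the demo may do it differently. All names (toyVectors, nearestWord, and so on) are made up for this example.

```javascript
// Hypothetical toy vectors, chosen by hand for illustration.
const toyVectors = {
  paris: [1.0, 0.0, 1.0],
  france: [1.0, 1.0, 1.0],
  berlin: [0.0, 0.0, 1.0],
  germany: [0.0, 1.0, 1.0],
  banana: [1.0, 0.0, 0.0],
};

// Component-wise vector helpers.
function add(a, b) {
  return a.map((x, i) => x + b[i]);
}

function subtract(a, b) {
  return a.map((x, i) => x - b[i]);
}

function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Cosine of the angle between a and b: 1 means same direction.
function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Find the word whose vector is most similar to target,
// ignoring any words listed in exclude.
function nearestWord(target, vectors, exclude = []) {
  let best = null;
  let bestScore = -Infinity;
  for (const [word, vec] of Object.entries(vectors)) {
    if (exclude.includes(word)) continue;
    const score = cosineSimilarity(target, vec);
    if (score > bestScore) {
      best = word;
      bestScore = score;
    }
  }
  return best;
}

// "berlin" + ("france" - "paris")
const target = add(
  toyVectors.berlin,
  subtract(toyVectors.france, toyVectors.paris)
);
const answer = nearestWord(target, toyVectors, ['berlin', 'france', 'paris']);
console.log(answer); // germany
```

Excluding the input words matters: without it, the nearest vector to the result is often one of the words that went into the expression.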
For more info about word vectors you can check out the project here: https://code.google.com/p/word2vec/.