Word2vec is a program that takes natural language words and assigns them vectors whose components capture what those words mean. I downloaded a big list of word vectors online and converted the vectors for the most common English words into JSON to lower the barrier to entry.
You can find the files here: https://github.com/turbomaze/word2vecjson/tree/master/data
First, I downloaded a 1.6 GB .bin containing word vectors for nearly 3 million words. Then, I wrote a Java program that loaded frequency information for around 150,000 English words. Given an input of $n$, that program would read through the large .bin file and output the word vectors for the $n$ most common English words in JSON. I ran the program for various values of $n$ (1000, 5000, 10000, and 25000) and uploaded the results.
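The selection step can be sketched roughly like this. This is not the original Java program, just a minimal Python illustration of "keep the $n$ most frequent words and dump their vectors as JSON". The function name `top_n_vectors`, the toy data, the choice to skip words that have no vector, and the flat `{word: [numbers]}` JSON layout are all assumptions made for the example. Reading the actual .bin file is left out entirely.

```python
import json


def top_n_vectors(frequencies, vectors, n):
    """Return {word: vector} for the n most frequent words that have a vector.

    frequencies: dict mapping word -> count (hypothetical input format)
    vectors:     dict mapping word -> list of floats (hypothetical input format)
    """
    # Rank words from most to least frequent.
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    selected = {}
    for word in ranked:
        if len(selected) >= n:
            break
        # Assumption: words missing from the vector file are simply skipped.
        if word in vectors:
            selected[word] = vectors[word]
    return selected


# Toy data standing in for the frequency list and the big vector file.
frequencies = {"the": 1000, "of": 800, "cat": 50, "zyzzyva": 1}
vectors = {"the": [0.1, 0.2], "cat": [0.3, -0.4], "zyzzyva": [0.0, 0.9]}

# Serialize the two most common words that have vectors.
output_json = json.dumps(top_n_vectors(frequencies, vectors, 2))
print(output_json)
```

With the toy data, "of" is passed over because it has no vector, so the output holds "the" and "cat".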
You can play around with word vectors in your browser here: http://turbomaze.github.io/word2vecjson/
You can do things like "berlin" + ("france" - "paris") and get logical answers, like "germany". The difference of two word vectors is a relationship that often applies to other word vectors. "france" - "paris" is the country to capital city relationship.
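Here is a small sketch of that arithmetic in Python. The three-dimensional vectors are made up so the example works out by hand; real word2vec vectors have hundreds of dimensions and give approximate answers. Ranking candidates by cosine similarity and excluding the input words are assumptions for the example, not a description of how the demo page is implemented, and the helper names (`cosine`, `analogy`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(vectors, base, plus, minus):
    """Return the word whose vector is closest to base + (plus - minus)."""
    target = [
        b + (p - m)
        for b, p, m in zip(vectors[base], vectors[plus], vectors[minus])
    ]
    # Assumption: the three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (base, plus, minus)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, invented for illustration only.
toy = {
    "paris": [1.0, 0.0, 0.0],
    "france": [1.0, 1.0, 0.0],
    "berlin": [0.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 1.0],
    "banana": [0.5, 0.1, 0.2],
}

print(analogy(toy, "berlin", "france", "paris"))
```

In the toy data, "france" - "paris" is [0, 1, 0], and adding it to "berlin" lands exactly on the vector for "germany".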
For more info about word vectors you can check out the project here: https://code.google.com/p/word2vec/.