Google used part of the data from its 15 million scanned books to build the Google Books Ngram Viewer.
"The datasets we're making available today to further humanities research are based on a subset of that corpus, weighing in at 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish. The datasets contain phrases of up to five words with counts of how often they occurred in each year. (...) The Ngram Viewer lets you graph and compare phrases from these datasets over time, showing how their usage has waxed and waned over the years," says Jon Orwant, from the Google Books team.
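To make the idea of "phrases of up to five words with counts of how often they occurred in each year" concrete, here is a minimal sketch in Python. It is not Google's pipeline: the function name `count_ngrams_by_year`, the whitespace tokenization, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def count_ngrams_by_year(docs, max_n=5):
    """Count every phrase of 1..max_n words, keyed by (phrase, year).

    `docs` is an iterable of (year, text) pairs. Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    counts = Counter()
    for year, text in docs:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n tokens across the text.
            for i in range(len(tokens) - n + 1):
                phrase = " ".join(tokens[i:i + n])
                counts[(phrase, year)] += 1
    return counts


# Hypothetical toy corpus: (publication year, text).
toy_corpus = [
    (1900, "the quick brown fox"),
    (1900, "the quick red fox"),
    (1950, "the quick brown fox jumps over the lazy dog"),
]

counts = count_ngrams_by_year(toy_corpus)
print(counts[("the quick", 1900)])  # 2
print(counts[("the quick", 1950)])  # 1
```

Plotting such counts per year, normalized by the total number of words published that year, gives the kind of usage curve the Ngram Viewer displays.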
The nice thing is that the raw data is licensed under Creative Commons Attribution and can be downloaded for free. Maybe Google should use the same license for the n-gram database it obtained from indexing the web.