2 thoughts on “Three nasty problems.”

  1. Regarding #1, if you’re willing to give up words, you might be able to have vectors. I haven’t had a chance to try it out yet, but I want to do an LR (logistic regression) run on the paceofchange data with locality-sensitive-hash-binned word vectors instead of words (a rough sketch of the binning step appears after this thread). You could probably reconstruct the exact text from the plain embeddings in order, but the binning should add enough fuzziness that you couldn’t recover the exact wording, even if the binned embeddings were left in their original order. It would be a sort of machine-readable paraphrase of the original.

    You’d also be throwing out a lot of information, but at least some of the semantic content would survive. Would enough survive to get decent classification performance? If so, you should be able to use the most heavily weighted bins to partially reverse the process, getting a list of words close in word-embedding space to the words that correlate with the salient category (the second sketch below illustrates this reversal).

    There are some kinds of study that this would be useless for, but I bet there are still lots of interesting phenomena that would be preserved.

    I’m not sure it would work, but I think it’s a promising line of inquiry.

    • That is really interesting. I think word embeddings are definitely something people will want to extract, and proving that they can’t support text reconstruction is going to be an important part of that.
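
For concreteness, here is a minimal sketch of the binning step proposed above: random-hyperplane locality-sensitive hashing over word vectors, then logistic regression on bag-of-bins counts. Everything in it is an assumption for illustration; the embedding dimensionality, the hash width, and the commented-out `docs`/`labels` inputs are placeholders, not objects from the paceofchange repo.

```python
# Minimal sketch: locality-sensitive-hash binning of word vectors,
# then logistic regression over bag-of-bins counts. All parameters
# here are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

EMB_DIM = 100  # dimensionality of the (assumed pretrained) word vectors
N_BITS = 12    # hash width: 12 random hyperplanes -> 2**12 = 4096 bins

# Random hyperplanes: nearby vectors fall on the same side of most
# hyperplanes, so semantically close words tend to share a bin.
hyperplanes = rng.standard_normal((N_BITS, EMB_DIM))

def lsh_bin(vec):
    """Map one word vector to an integer bin id via sign-of-projection bits."""
    bits = hyperplanes @ vec > 0
    return int(bits @ (1 << np.arange(N_BITS)))

def doc_features(word_vectors):
    """Bag of bins: count how many of a document's words land in each bin.

    Word order is discarded here, which is part of what makes the
    representation hard to invert back into the original text.
    """
    counts = np.zeros(2 ** N_BITS)
    for vec in word_vectors:
        counts[lsh_bin(vec)] += 1
    return counts

# Hypothetical usage: `docs` is a list of documents (each a list of word
# vectors) and `labels` marks the salient category.
# X = np.array([doc_features(doc) for doc in docs])
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

Whether the binned counts retain enough signal for decent classification is exactly the open question the comment raises.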
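
And a sketch of the partial-reversal step, continuing from the code above: bins aren’t words, but you can ask which vocabulary items hash into the bins the classifier weighted most heavily. The `vocab` and `embedding` arguments are assumed inputs, again not real paceofchange objects.

```python
def words_in_top_bins(clf, vocab, embedding, k_bins=10):
    """List the words whose vectors land in the k most positively
    weighted bins of a fitted bag-of-bins LogisticRegression.

    vocab     -- list of word strings (hypothetical input)
    embedding -- dict mapping each word to its vector (hypothetical input)
    """
    weights = clf.coef_[0]  # one weight per bin
    top_bins = set(np.argsort(weights)[-k_bins:].tolist())
    return [w for w in vocab if lsh_bin(embedding[w]) in top_bins]
```

The output is not the original text, just a cloud of words near the regions of embedding space that correlate with the category: the “machine-readable paraphrase” flavor of the proposal.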
