RE: [Programming] Digging In the DB and Hitting a Wall
One of the big problems with Steemit, as I see it, is that trying to find content you're interested in is like sipping from a fire hose. One directed straight into your face.
My current plan, at least in this sketch code, is to fetch the text of everything you've uploaded over the last time period (currently 30 days), tokenize it, create a set of eigenvectors which describe all of those documents, do some hand-waving ("magic goes here") to determine an average eigenvector, then take any post on the blockchain, project it through the same vector space, and see how close it lands to the things you've liked. If it's over some arbitrary threshold, let you know about it.
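For the curious, here's a minimal sketch of how that pipeline might look, assuming scikit-learn (tf-idf plus truncated SVD is the usual LSA recipe). The posts and the candidate are made-up placeholder strings, not real chain data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

my_posts = [  # everything "you" wrote in the last 30 days (hypothetical)
    "bitcoin price analysis and market trends this week",
    "how to stake steem and earn curation rewards",
    "a beginner guide to cryptocurrency wallets and keys",
]
candidate = "ethereum wallet security and private key tips"

# Tokenize and weight terms, then reduce to a low-rank "topic" space.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(my_posts)
svd = TruncatedSVD(n_components=2)   # tiny here; ~100 is more typical
doc_vecs = svd.fit_transform(tfidf)  # one vector per post

# The "magic goes here" step: the average eigenvector is the centroid.
interest = doc_vecs.mean(axis=0, keepdims=True)

# Project a new post into the same space and measure closeness.
cand_vec = svd.transform(vectorizer.transform([candidate]))
score = cosine_similarity(interest, cand_vec)[0, 0]

THRESHOLD = 0.5  # arbitrary, as noted above
if score > THRESHOLD:
    print(f"Might interest you (similarity {score:.2f})")
```

Everything interesting hides in `n_components` and the threshold, both of which would need empirical tuning.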
I actually have my code set up to allow me to switch between n-gram extractions and word tokens on a whim, so at least I'll be able to test both of those to see if one consistently gives me better stuff than the other.
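If you go the scikit-learn route, that toggle can be as small as one vectorizer parameter; `use_ngrams` here is a made-up flag:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

use_ngrams = True  # hypothetical switch between the two extraction modes
if use_ngrams:
    # Character n-grams (3- to 5-grams, respecting word boundaries).
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
else:
    # Plain word tokens.
    vectorizer = TfidfVectorizer(analyzer="word")
```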
This is experimental programming. It's like mad science but with slightly fewer explosions.
Hmm, are you sure the average eigenvector thing would work?
Is your plan just to compute the LSA on the set of your own posts (in a 30-day period) and then determine how close other posts are? I guess the dataset will most likely be too small. Instead of filtering noise, the LSA might even amplify it unless you are a posting machine (could this explain why you end up with quite common words in your topics?).
Or do you wish to compute the LSA on many posts (let's say all Steemit publications of the last month) and try to infer an average representation of the subset of your own posts? Even then, I don't know if this works. What would happen if half of your posts were about cryptocurrency and the other half about vaccines (:-0)? Presumably, these would be projected into different parts of the LSA space, and the average would be meaningless (maybe something like prepper homeopathy?). Maybe it's better to first compute the similarity to each of your recent posts individually, and then take the average, median, or some percentile of those scores to decide whether a new post is worth reading and caters to your interests.
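A sketch of that per-post variant; `doc_vecs` and `cand_vec` stand in for whatever embedding you end up with (LSA, Doc2Vec, ...), filled with random placeholders so the snippet runs on its own:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(20, 100))  # 20 of "your" recent posts
cand_vec = rng.normal(size=(1, 100))   # one candidate post

sims = cosine_similarity(doc_vecs, cand_vec).ravel()  # one score per post

# Aggregating the per-post scores instead of averaging the vectors
# avoids landing in the empty middle ground between unrelated topics.
mean_sim = np.mean(sims)
median_sim = np.median(sims)
p75 = np.percentile(sims, 75)  # "some percentile"
interesting = p75 > 0.5        # i.e. close to at least a quarter of them
```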
If you are still in favor of averaging your posts and computing your interest vector directly, another approach could be to take a look at Doc2Vec. There the average or sum of word and document vectors seems to work kinda well. Still, as before, you might end up somewhere in the Doc2Vec space that is just the empty middle ground between different posts of yours. Moreover, Doc2Vec is incredibly data-hungry and requires a couple of tens of thousands, or better hundreds of thousands, of documents. Fortunately, as you said, Steemit is a fire hose, so that should be the least of your concerns.
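For completeness, a rough sketch of that, assuming gensim 4.x; the two-document `corpus` is just a stand-in for the actual fire hose of tens of thousands of posts:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [  # would be 10k-100k+ Steemit posts in practice
    "bitcoin price analysis and market trends",
    "how to stake steem for curation rewards",
]
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)

# Trained document vectors live in model.dv; an interest vector could be
# the mean of your own posts' vectors (with the caveats above).
interest = sum(model.dv[i] for i in range(len(corpus))) / len(corpus)

# Unseen posts get embedded into the same space via inference.
new_vec = model.infer_vector("ethereum wallet security tips".split())
```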
I still haven't fully grasped what you are trying to do, so sorry if I've misunderstood you. Anyway, I'm curious how your experiment progresses because I want to do something similar. I started with a bot that predicts post payouts, and I'm curious whether the LSA part of my bot could be used for content recommendations as well.