[Programming] Digging In the DB and Hitting a Wall

egmracer01 (49) 6 years ago

I use this list to filter out common words: https://github.com/evanmarshall/steemit-cloud/blob/master/src/ignored-words.js

lextenebris (62) 6 years ago

Thanks, though at this point I think that beyond a very basic handful, I'm simply going to let the corpus decide what words are not significant enough indicators to be useful guides.

If the SVD can't figure out at least that much, it really has no hope of determining what words are useful enough to be used as discriminators, after all.

$0.00

egmracer01 (49) 6 years ago

I don't necessarily agree with that. My linear algebra isn't great but in probabilistic terms, if you know a solid prior, you should apply it to your model so it has to do less work.

lextenebris (62) 6 years ago

In this case I literally want it to do the work to prove that it can do the work.

At least on a first pass.

If it can't do that much, if it can't at least determine what the least useful things are – we need another method. Maybe a smarter method, maybe a dumber method, but a method that can actually determine something fairly elemental.

Who knows? It may be that the frequency or offset of what we would normally think of as junk words are actually useful in making some sort of determination. Complex discovery like this isn't a finished science quite yet.

Though I may take a break and just try to put together something based on some sort of limited, spreading activation energy friend network thing. I am way better with graph theory that I am with this stuff off-the-cuff.

Unfortunately, I don't think there are too many people working on this sort of thing in the steem space, so I guess everyone is stuck with me.

$0.00

boucaron (57) 6 years ago

Just trying to understand, it looks like you want to perform a kind of summarization of the document with the extraction of "topics" followed by a classification of the document according to the set of "topics" in the corpus.

lextenebris (62) 6 years ago

I actually don't care about topics in common usage. "Topics" is the term that gensim uses to refer to each dimension of differentiation in the vector space of the abstraction which summarizes each individual document.

Essentially, I want to create a high dimensional vector which describes/summarizes/distills the documents into something that can be given a distance. Each eigenvector is really a descriptive point in high dimensional space where that document is located.

I'm trying to figure out, given a set of documents, handed a brand-new document that's never been seen before, how far is that document from the cloud of points which represent the documents I already have?

Ultimately, it's intended to be a lens through which you can look at posts to the steem blockchain and get an estimation of how likely any individual post is to be something that you would be interested in up voting/reading.

Essentially I'm trying to run a classification system in reverse. That makes it an interesting problem.

$0.04

boucaron (57) 6 years ago

Ok, I see. It will be useful, too much noise to filter out.
The summarization of documents is probably the difficult part to have a good, small and accurate enough representation on which to filter onto.

lextenebris (62) 6 years ago

One of the big problems with Steemit as I see it is the fact that trying to find content that you're interested in is like sipping from a fire hose. One directed straight into your face.

My current plan, at least in this sketch code, is to fetch the text of everything that you've uploaded over the last time period (currently 30 days), tokenize it, create a set of eigenvectors which describe all of those documents, magic goes here in determining an average eigenvector, and then take any post to the blockchain, process it through the same vector space, I then see how close it is to the things that you've liked. If it's over some arbitrary threshold, let you know about it.

I actually have my code set up to allow me to switch between n-gram extractions and word tokens on a whim, so at least I'll be able to test both of those to see if one consistently gives me better stuff than the other.

This is experimental programming. It's like mad science but with slightly fewer explosions.

$0.04

smcaterpillar (52) 6 years ago

Hmm, are you sure the average eigen vector thing would work?

Is your plan just to compute the LSA on the set of your own posts (in a 30 day period)? And then determine how close other posts are? I guess the dataset will most likely be too small. Instead of filtering noise, the LSA might even enhance it unless you are a posting machine (could this explain why you end up with quite common words in your topics?).

Or do you wish to compute the LSA on many posts (let's say all Steemit publications of last month) and try to infer an average representation of the subset of your own posts? Even then I don't know if this works. What would happen if half of your posts are about cryptocurrency and the other half about vaccines (:-0). Presumably, these would be projected into different parts of the LSA space. The average would be meaningless here (maybe something like prepper homeopathy?). Maybe it's better to compute the similarity to all of your recent posts individually at first and then take the average, or median, or some percentile to determine if it's worth reading and may cater to your interests.

If you are still in favor of averaging your posts and compute your interest vector directly, another approach could be to take a look at Doc2Vec. There the average or sum of word and document vectors seem to work kinda well. Still, as before, you might end up somewhere in the Doc2Vec space that is just the middle empty ground between different posts of yours. Moreover, Doc2Vec is incredibly data hungry and requires a couple of 10k or better 100k documents. Fortunately, as you said, Steemit is a fire hose, so that should be the least of your concerns.

I haven't still fully grasped what you are trying to do, sorry if I misunderstood you. Anyway, I'm curious how your experiment progresses because I want to do something similar. I started with a bot that predicts payouts of posts and I am curious if the LSA part of my bot could potentially be used for content recommendations as well.

$0.00

thenikola (31) 6 years ago

Hi. What program's that screenshot? I don't see standard window controls, is that on Linux actually?

$0.02

lextenebris (62) 6 years ago

Would you believe that's running on Windows 10 in Jupyter Lab? And everything you see on the screen there is actually running in my browser window (and I tend to use Vivaldi when I'm working with content).

Yes, it's an integrated multilanguage IDE/data research environment which runs in a web browser as its primary interface, with draggable windows, frames, built-in management systems – the whole 9 yards.

I have no idea how I got by as a programmer many many years ago without something this nice.

Also – it's free.

$0.00

thenikola (31) 6 years ago

Aye, thanks for the info! Lemme ask some more questions, as I am a workflow afficionado and based ona large response you seem to be one too :D
If you're full screen, how do you manipulate files? No windows taskbar :( I personally use brackets for web design because of that handy live preview feature, for other coding purposes Sublime, and text writing Notepad++.

(n++ pictured, oh how I love custom styling, cached files... <3 )

I've tried Vivaldi and it was b-e-a-utiful and then I realized it is very low on extensions and had to let it go. Maybe I give it another go since it's been I believe more than a year.

Cheers

$0.00

lextenebris (62) 6 years ago

Well, in Jupiter Lab, the interface is capable of hosting multiple panels, so I often have at least two panels with code in them, and an interpreter monitor at the bottom of one of those columns.

For browsing, it's definitely Vivaldi, at least when I'm trying to put together some work, and that the browser itself can tile tabs across the screen. If you use a web-hosted editor like StackEdit, that can be tucked into any screen you like. When I'm developing longer pieces, my usual text editing environment is Gingko, which makes managing section content extremely easy, and that's just running in a tab somewhere, over to the side or at the bottom of her frame.

Vivaldi can use any of the plug-ins in the Chrome architecture, so I have no idea what you're talking about when you suggest that it's low on extensions. That's a big pile of extensions.

But mainly – I just cheat. I have two monitors. The right hand monitor is typically used for email, Discord, chats, social media, and my editing environment while the main screen on the left is primarily for reference material, video editing, 3D modeling systems, games – whatever needs to be the focus of what I'm doing.

I'm at the point where a third monitor wouldn't go unused, but I have literally nowhere to put it in the tight confines here with two microphones, a mixing board, studio speakers, and a reference bookshelf.

But – yes, my main solution is cheating.

$0.09