The Auto Curator ~ aicu

bauloewe (58) in #blog • 5 years ago (edited)

After a couple of days and some feature engineering, I finished the first version of my curation bot. It's currently doing a test run on the account @aicu.

Dataset

The dataset consists of 5,000 curator-selected posts and 5,000 random posts from various categories. It is split into 67% training data and 33% test data.
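A split like this can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `split_posts`, the label convention (1 for curator-selected, 0 for random), the seed, and the placeholder post identifiers are all made up here and are not taken from the bot's code.

```python
import random

def split_posts(curated, random_posts, test_fraction=0.33, seed=42):
    """Label, shuffle and split posts into a training and a test set."""
    # Assumed label convention: 1 = curator selected, 0 = randomly sampled.
    labelled = [(post, 1) for post in curated] + [(post, 0) for post in random_posts]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    n_test = round(len(labelled) * test_fraction)
    # The first n_test shuffled items become the test set, the rest the training set.
    return labelled[n_test:], labelled[:n_test]

# Placeholder identifiers stand in for the real post bodies.
curated = [f"curated-{i}" for i in range(5000)]
random_posts = [f"random-{i}" for i in range(5000)]

train, test = split_posts(curated, random_posts)
print(len(train), len(test))  # 6700 3300
```

A plain shuffle does not guarantee that both classes are equally represented in each part. With 5,000 posts per class the imbalance is usually small, but a stratified split would remove it entirely.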

The Basis

After some testing, I decided to use a Random Forest classifier for the prototype. Random Forests are robust non-linear classifiers and one of my favourite algorithms in the field. They yield high classification performance with low overhead, and they are certainly a lot faster to get running than setting up a neural network architecture and training it. A neural network is a step that can always be taken later, once adding and improving features has produced a robust feature set.
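To give an idea of how little setup a Random Forest needs, here is a minimal sketch using scikit-learn. The post does not say which library or hyperparameters the bot uses, so the choice of scikit-learn, `n_estimators=100`, and the synthetic data from `make_classification` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real post feature matrix (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

# Hold out 33% of the samples, keeping the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Hyperparameters are illustrative defaults, not the bot's settings.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

score = f1_score(y_test, clf.predict(X_test))
print(f"F1 on held-out data: {score:.3f}")
```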

Features

The first iteration uses some basic features from the post itself along with a TF-IDF model, HTML tag frequencies and part-of-speech (POS) tags. Below you can find a table of the top 30 features by decision weight for each feature set.
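The basic and HTML features could be extracted roughly as follows. This is a sketch using only the standard library. The feature names mirror those in the table, but how the bot actually defines them (tokenisation, sentence splitting, which tags count as images or links) is not stated in the post, so the definitions in `basic_features` are assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs in a post body."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def basic_features(body):
    """Return simple length, layout and HTML tag features for one post."""
    parser = TagCounter()
    parser.feed(body)
    parser.close()

    # Strip the tags before counting words and sentences.
    text = re.sub(r"<[^>]+>", " ", body)
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    return {
        "#Words": len(words),
        "#NewLines": body.count("\n"),
        "#Images": parser.tags["img"],
        "#Links": parser.tags["a"],
        "ØWordLen": sum(len(w) for w in words) / len(words) if words else 0.0,
        "ØSentLen": len(words) / len(sentences) if sentences else 0.0,
        "tags": dict(parser.tags),
    }

post = (
    "<p>I think this is a good song.</p>\n"
    '<img src="cover.png">\n'
    '<a href="https://example.com">Listen here</a>'
)
features = basic_features(post)
print(features)
```

POS tag counts and TF-IDF weights would be added to the same feature dictionary by an NLP library and a vectoriser. They are left out here to keep the sketch free of dependencies.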

As you can see, the most decisive feature is the main category of a post. Aside from that, the number of words and the part-of-speech categories lead the top ten features. The part-of-speech categories are a bit unexpected: determiners, particles and coordinating conjunctions are usually stop words that are removed during preprocessing, so you wouldn't expect any meaningful relationship there. Adjectives are less surprising, as they spice up a text. The #ERROR# feature refers to unknown words, which include foreign words and stop words. The first actual word to carry a lot of weight is the lemma of the word "think".
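A ranking like this can be read off a trained forest. The sketch below assumes that the "decision weight" corresponds to scikit-learn's impurity-based `feature_importances_`, which the post does not confirm. The feature names are borrowed from the table, while the data and labels are random toy values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Names borrowed from the table; the values are random toy data (assumption).
feature_names = ["Category", "#Words", "DET", "think", "noise"]
X = rng.normal(size=(500, len(feature_names)))
# Toy labels depend only on the first two columns.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Pair every feature with its importance and sort in descending order.
ranking = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for rank, (name, weight) in enumerate(ranking, start=1):
    print(f"{rank}\t{name}({weight:.6f})")
```

In scikit-learn these importances are normalised to sum to 1 over all features. If the bot's weights are computed the same way, that would explain why even the top feature only reaches about 0.02 once thousands of word features share the total.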

I'm glad that each iteration led to some improvement in the F1 score. I'll keep you posted on how development goes and what the results of the first test run are.

	Basic+BOW	+HTML+POS	+TF-IDF
F1-Score	0.801	0.817	0.822
1	Category(0.021015)	Category(0.020211)	Category(0.019052)
2	PRON(0.013029)	#NewLines(0.009536)	DET(0.009565)
3	like(0.012215)	DET(0.008916)	#NewLines(0.009272)
4	DET(0.012076)	PART(0.008684)	PART(0.009090)
5	-PRON-(0.011227)	#Words(0.008643)	#Words(0.008748)
6	#Images(0.009926)	CCONJ(0.008533)	CCONJ(0.008719)
7	#ERROR#(0.009699)	#ERROR#(0.007995)	ADJ(0.007781)
8	know(0.008192)	-PRON-(0.007521)	#ERROR#(0.007094)
9	CCONJ(0.008129)	ADJ(0.007274)	#Images(0.006808)
10	#Words(0.008105)	ADV(0.006710)	think(0.006750)
11	#NewLines(0.007931)	think(0.006420)	-PRON-(0.006553)
12	VERB(0.007625)	PRON(0.006168)	ADV(0.006324)
13	ØWordLen(0.006338)	VERB(0.005915)	PRON(0.006192)
14	#Links(0.006010)	#Tags(0.005816)	ØWordLen(0.005976)
15	PROPN(0.005814)	ØWordLen(0.005418)	#Tags(0.005536)
16	ADV(0.005576)	ØSentLen(0.005409)	VERB(0.005523)
17	PART(0.005481)	#Links(0.005343)	ØSentLen(0.005347)
18	NOUN(0.005268)	#Images(0.005243)	bit(0.005086)
19	PUNCT(0.005155)	head(0.005012)	love(0.005021)
20	ADJ(0.005117)	bit(0.005011)	NUM(0.005012)
21	work(0.004646)	time(0.004984)	PROPN(0.004947)
22	#Tags(0.004602)	NUM(0.004968)	time(0.004923)
23	ØSentLen(0.004505)	leave(0.004755)	leave(0.004789)
24	feel(0.004306)	love(0.004733)	head(0.004728)
25	good(0.004149)	PROPN(0.004700)	NOUN(0.004662)
26	song(0.003634)	start(0.004674)	want(0.004621)
27	hope(0.003594)	img(0.004586)	PUNCT(0.004593)
28	think(0.003482)	NOUN(0.004568)	#Links(0.004527)
29	turn(0.003335)	PUNCT(0.004498)	light(0.004124)
30	fall(0.003210)	light(0.004121)	start(0.003759)
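For reference, the F1 score in the table is the harmonic mean of precision and recall. The sketch below computes it for a single positive class. Whether the reported values are for the curated class only or averaged over both classes is not stated, so treating label 1 as the positive class is an assumption, and the labels are made-up toy values.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 1 = curator selected, 0 = random post (assumed convention).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(round(score, 3))  # 0.75
```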

#bot #ai #curation #nlp
