INDEX
Explanations
References to manipulative or deceitful behavior in political or economic contexts
NEGATIVE LOGITS
通販              -0.63
TagMode           -0.63
tatuagens         -0.61
queryInterface    -0.60
staden            -0.58
habet             -0.57
ApiModelProperty  -0.56
pidana            -0.56
jambes            -0.55
nahilalakip       -0.53
POSITIVE LOGITS
RegressionTest    0.69
bullshit          0.51
ped               0.50
propaganda        0.49
obses             0.49
blink             0.48
BS                0.48
ziz               0.47
ftagPool          0.47
paganda           0.47
Activations Density: 0.879%