Explanations
negative words related to criticism or disapproval
negative descriptors or phrases related to unfavorable qualities
Negative Logits
ĸļ            -0.96
raltar        -0.77
ensional      -0.77
earchers      -0.75
ittees        -0.75
theless       -0.75
htaking       -0.74
conservancy   -0.73
eston         -0.73
xual          -0.72
Positive Logits
dies          1.09
dest          1.08
die           1.04
ger           0.96
gered         0.93
GES           0.88
ged           0.86
ges           0.86
karma         0.86
luck          0.85
Activations Density: 0.029%