Explanations
positive adjectives and descriptors expressing approval or admiration
Negative Logits
ngth        -0.93
rive        -0.81
ividual     -0.79
lished      -0.70
opez        -0.66
alks        -0.66
perty       -0.65
usalem      -0.65
cipl        -0.64
assemb      -0.64
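On dashboards of this kind, each logit score is commonly the dot product of the feature's decoder direction with a token's unembedding vector. The most negative scores form a list like the one above, and the most positive form the positive list. The sketch below shows that ranking step on made-up 3-dimensional vectors. `decoder_direction`, `unembed`, and both helper functions are hypothetical stand-ins, not real model weights or a specific library's API:

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, float]


def logit_effects(decoder_direction: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Dot the (hypothetical) feature direction with each token's unembedding vector."""
    return {
        token: sum(d * u for d, u in zip(decoder_direction, column))
        for token, column in unembed.items()
    }


def top_logits(effects: Dict[str, float], k: int) -> Tuple[List[Pair], List[Pair]]:
    """Return the k most negative and the k most positive (token, score) pairs."""
    ranked = sorted(effects.items(), key=lambda item: item[1])  # ascending by score
    negative = ranked[:k]
    positive = ranked[::-1][:k]  # descending by score
    return negative, positive


# Toy numbers chosen for illustration only; they are not the real model's weights.
direction = [0.5, -1.0, 0.25]
unembed = {
    "enough": [1.0, -0.5, 0.0],
    "XD": [0.2, -0.4, 0.8],
    "ngth": [-1.0, 0.5, 0.0],
    "rive": [0.0, 0.6, -0.4],
}
negative, positive = top_logits(logit_effects(direction, unembed), k=2)
print("negative:", negative)
print("positive:", positive)
```

Because the sort is a plain ranking, the same scores populate both lists. Only the direction of the sort differs.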
Positive Logits
soType       0.81
enough       0.80
considering  0.79
ðŁĻĤ         0.74   (byte-level token for 🙂)
explan       0.73
ECA          0.72
LY           0.71
NEWS         0.70
ðŁĺ          0.69   (first three UTF-8 bytes of an emoji in U+1F600 to U+1F63F)
XD           0.68
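Tokens such as `ðŁĻĤ` and `ðŁĺ` are not mojibake to be discarded. They are byte-level BPE tokens, in which each raw byte is shown as a printable character. The sketch below assumes the GPT-2-style byte-to-character table, which the token spellings suggest but the dashboard does not state. It maps a token back to its bytes and decodes them:

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2-style table mapping each byte 0..255 to a printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}  # printable bytes map to themselves
    n = 0
    for b in range(256):
        if b not in table:  # remaining bytes are shifted to code points 256 and up
            table[b] = chr(256 + n)
            n += 1
    return table


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_bytes(token: str) -> bytes:
    """Recover the raw bytes behind a byte-level BPE token string."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


for tok in ["ðŁĻĤ", "ðŁĺ", "enough"]:
    raw = token_bytes(tok)
    # errors="replace" keeps incomplete UTF-8 (a partial emoji) from raising.
    print(tok, raw.hex(" "), raw.decode("utf-8", errors="replace"))
```

`ðŁĻĤ` decodes to the complete four-byte sequence for 🙂. `ðŁĺ` is only a three-byte prefix, so on its own it decodes to a replacement character.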
Activation Density: 0.101%
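Activation density is commonly reported as the percentage of sampled tokens on which the feature's activation is above zero. Under that reading, 0.101% means the feature fires on roughly one token in a thousand. The sketch below assumes that definition and a hypothetical flat list of per-token activations. The dashboard's exact sampling and threshold are not stated:

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# Hypothetical sample: one firing token out of 1,000.
acts = [0.0] * 999 + [3.2]
print(f"{activation_density(acts):.3f}%")
```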