INDEX
Explanations
words related to making decisions or assessments
Negative Logits
bery        -0.17
ilities     -0.16
uary        -0.16
oms         -0.15
ses         -0.15
orman       -0.15
iler        -0.14
ove         -0.14
iev         -0.14
ilitating   -0.14
Positive Logits
316         0.15
ants        0.15
loose       0.15
whether     0.15
ffen        0.15
expr        0.15
lich        0.15
antt        0.14
esub        0.14
angling     0.14
Activations Density 0.022%