Explanation: words related to reactions and responses
Negative Logits
Token             Logit
(not recovered)   -0.18
ern               -0.17
kits              -0.16
ping              -0.15
enza              -0.15
passwd            -0.15
esen              -0.14
elters            -0.14
wik               -0.14
paid              -0.14
Positive Logits
Token             Logit
ivate             0.36
ively             0.26
iveness           0.25
ives              0.23
aries             0.21
uator             0.21
rice              0.19
uate              0.19
ants              0.19
ual               0.18
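The page does not say how these logit lists are computed. A common approach for feature dashboards of this kind, assumed here, is to take the feature's decoder direction, dot it with each token's unembedding vector, and report the most positive and most negative results. The sketch below shows only that ranking step on made-up toy vectors. `logit_effects`, `top_and_bottom`, and the toy data are hypothetical illustrations, not this page's actual pipeline.

```python
from typing import Dict, List, Tuple


def logit_effects(decoder_dir: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Dot product of a (hypothetical) feature decoder direction with each token's unembedding vector."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec)) for tok, vec in unembed.items()}


def top_and_bottom(
    effects: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and the k most negative token effects."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    bottom = ranked[:k]                  # most negative first
    top = list(reversed(ranked[-k:]))    # most positive first
    return top, bottom


if __name__ == "__main__":
    # Toy 2-dimensional example; the values are invented and unrelated to the lists above.
    direction = [1.0, -0.5]
    unembed = {
        "a": [0.3, 0.0],
        "b": [0.0, 0.4],
        "c": [0.1, 0.1],
        "d": [-0.2, 0.2],
    }
    effects = logit_effects(direction, unembed)
    top, bottom = top_and_bottom(effects, k=2)
    print("positive:", top)
    print("negative:", bottom)
```

In a real model the unembedding has one vector per vocabulary entry, so the same ranking over tens of thousands of tokens yields lists like the ones above.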
Activation density: 0.027%
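This sketch assumes that activation density means the share of sampled token positions on which the feature's activation is above zero. The page does not state the threshold or the sample size. Under that reading, 0.027% corresponds to about 27 active positions per 100,000 tokens. `activation_density` and its `threshold` parameter are hypothetical names used only for illustration.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold` (assumed definition)."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations given")
    return active / total


if __name__ == "__main__":
    # Hypothetical sample: the feature fires on 27 of 100,000 token positions.
    acts = [1.5 if i % 1000 == 0 and i < 27_000 else 0.0 for i in range(100_000)]
    print(f"{activation_density(acts):.3%}")  # prints 0.027%
```

A feature this sparse fires rarely, so a handful of top-activating examples can dominate any automated explanation of it.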