INDEX
Explanations
words related to observation and awareness
Negative Logits
isle        -0.18
oust        -0.16
utex        -0.16
lers        -0.16
lfw         -0.15
-legged     -0.14
Andrews     -0.14
>NN         -0.14
ãģ°         -0.14
/devices    -0.14
Positive Logits
vation      0.21
å¯Ł         0.21
asion       0.18
(obs        0.18
ably        0.17
235         0.17
ãĥ¥         0.17
ances       0.17
ant         0.16
yer         0.16
Activations Density: 0.024%