INDEX
Explanations
instances of the word "act" in various forms and contexts
Negative Logits
ahan      -0.19
anke      -0.17
/fw       -0.17
theless   -0.17
jet       -0.16
attern    -0.16
Bu        -0.15
rieg      -0.15
ied       -0.15
otti      -0.15
Positive Logits
uator     0.28
UAL       0.28
ual       0.25
uar       0.25
uality    0.23
uelle     0.23
uated     0.22
uation    0.21
ively     0.20
uary      0.20
Activations Density: 0.012%