Explanations
the word "OK" with a strong activation value
expressions indicating a sense of approval or acceptance
Negative Logits
urther        -0.60
ience         -0.59
cum           -0.59
guiActiveUn   -0.58
Shadow        -0.57
eries         -0.57
ensis         -0.55
cence         -0.55
rowth         -0.55
latent        -0.54
Positive Logits
OK            3.95
ok            2.70
okay          2.54
OK            2.32
alright       2.09
Okay          1.81
Ok            1.78
Ok            1.53
Okay          1.47
Alright       1.34
Activations Density: 0.005%