INDEX
Explanations
instances of acknowledgment or confession
Negative Logits
olding      -0.16
/Gate       -0.15
blo         -0.15
lei         -0.15
olds        -0.15
abouts      -0.14
ettle       -0.14
ìĶ          -0.14
Configurer  -0.14
¶Į          -0.13
POSITIVE LOGITS
defeat          0.29
freely          0.20
responsibility  0.19
defeats         0.19
ting            0.19
feeling         0.18
guilt           0.17
defeated        0.17
ration          0.16
wrongdoing      0.16
Activations Density: 0.030%