INDEX
Explanations
statements that challenge common misconceptions or beliefs
Negative Logits
increments   -0.18
996          -0.16
æľĽ          -0.15
rax          -0.15
shan         -0.15
yna          -0.14
Mature       -0.14
pped         -0.14
ging         -0.14
princ        -0.14
Positive Logits
acter        0.15
chos         0.15
lok          0.15
kü           0.14
ren          0.14
pau          0.14
æį           0.14
erval        0.14
":↵          0.13
fram         0.13
Activations Density: 0.487%