Explanations
Human-like states compared to AI
Negative Logits
поба            -0.10
indr            -0.09
局              -0.09
Arb             -0.09
avery           -0.09
aram            -0.08
-Al             -0.08
omi             -0.08
ivery           -0.08
.EventHandler   -0.08
Positive Logits
like            0.35
same            0.27
zoals           0.23
như             0.23
像              0.23
way             0.22
same            0.21
como            0.20
seperti         0.20
gibi            0.19
Activations Density: 0.101%