INDEX
Explanations
phrases indicating possession or existence
Negative Logits
763     -0.15
kom     -0.14
ه       -0.14
ppo     -0.14
ipeg    -0.14
f       -0.14
선      -0.14
Duy     -0.14
vậy     -0.14
ayım    -0.13
Positive Logits
why         0.31
how         0.27
where       0.23
why         0.22
what        0.22
precisely   0.22
为什么      0.19
how         0.18
exactly     0.17
true        0.17
Activations Density 0.090%