INDEX
Explanations
phrases that indicate actions or processes undertaken
Negative Logits
ud      -0.15
vert    -0.14
oi      -0.14
raž     -0.14
heim    -0.14
ved     -0.14
èĶ      -0.14
cake    -0.13
itta    -0.13
HG      -0.13
Positive Logits
so      0.41
如此    0.23
так     0.21
so      0.20
So      0.18
så      0.18
such    0.18
So      0.17
igin    0.17
.so     0.15
Activations Density 0.037%