INDEX
Explanations
index followed by parenthesis
Negative Logits
屃  -1.70
腘  -1.60
což  -1.59
ኧ  -1.52
騭  -1.42
protože  -1.41
dicho  -1.39
</h1>  -1.38
鐿  -1.38
我知道  -1.37
Positive Logits
er  1.84
at  1.60
now  1.52
most  1.51
still  1.49
ists  1.45
for  1.41
on  1.41
.  1.39
слегка  1.38
Activations Density 0.008%