Explanations
repeated word forms and grammatical structures in various languages
Negative Logits
myſelf       -1.09
iſt          -0.99
raiſ         -0.93
NameInMap    -0.93
་་           -0.93
Anſ          -0.91
ſche         -0.90
Diſ          -0.90
faſt         -0.89
ſind         -0.88
Positive Logits
-               0.66
,               0.63
<eos>           0.62
of              0.62
?               0.62
-               0.61
(               0.60
[toxicity=0]    0.56
</sub>          0.56
ubereitung      0.56
Activations Density 0.008%