INDEX
Explanations
assertive statements about truths or verifiable information
Negative Logits
andra       -0.55
خرى         -0.50
kring       -0.50
weer        -0.49
wieder      -0.49
mpe         -0.48
mélanger    -0.48
wnicy       -0.47
рант        -0.47
dragen      -0.47
Positive Logits
even        1.07
sogar       1.02
actually    0.92
Bahkan      0.91
Bahkan      0.91
zelfs       0.90
むしろ      0.85
anzi        0.84
Incluso     0.83
Actually    0.80
Activations Density 0.153%