Explanations
Multiple-choice answers followed by a period
Negative Logits
p  0.41
igten  0.39
icht  0.38
bepaalde  0.38
agia  0.38
ёз  0.38
drained  0.37
trataro  0.37
ómo  0.36
5  0.36
Positive Logits
None  0.54
none  0.53
lahat  0.50
nessuna  0.50
всех  0.49
żad  0.49
ninguna  0.48
всички  0.48
никаких  0.48
swarm  0.48
Activations Density 0.012%