Explanations
pronouns followed by specific contexts
Negative Logits
=       0.64
of      0.63
is      0.60
ü       0.57
ty      0.55
fumar   0.54
ya      0.53
var     0.52
om      0.52
ia      0.52
Positive Logits
basaltes   0.54
έχουν      0.52
ಸ್ಟ         0.52
슌         0.50
twierd     0.50
ႅ          0.49
जेसीबी      0.48
entraîne   0.47
𓍊          0.47
瑆         0.47
Activations Density 0.000%