Explanations
pronouns and associated actions
Negative Logits
um        0.47
animity   0.43
}$.       0.43
suunn     0.42
Z         0.41
uh        0.40
_         0.40
ap        0.40
BUS       0.40
aj        0.40
Positive Logits
мата      0.42
ścia      0.42
槎        0.41
взгля     0.40
իկ        0.40
кара      0.40
вании     0.40
нце       0.40
ка        0.39
karat     0.39
Activations Density: 0.169%