INDEX
Explanations
instances of the pronoun "we"
Negative Logits
Token     Logit
noon      -0.18
andon     -0.18
rav       -0.17
們        -0.17
mund      -0.17
umn       -0.16
wich      -0.16
semble    -0.15
water     -0.15
们        -0.15
Positive Logits
Token     Logit
aves      0.28
aved      0.26
ighb      0.24
avings    0.21
aver      0.21
arily     0.21
arnings   0.21
evil      0.20
eded      0.20
eping     0.20
Activations Density: 0.020%