Explanations
personal pronouns and expressions of self-identity
Negative Logits
Token     Logit
wards     -0.17
UBL       -0.16
772       -0.16
Civ       -0.16
ardo      -0.14
ubl       -0.14
uctor     -0.14
ecta      -0.14
atk       -0.14
ält       -0.14
Positive Logits
Token     Logit
aside     0.37
into      0.29
Aside     0.27
aside     0.27
together  0.26
INTO      0.22
forward   0.22
Aside     0.22
atively   0.21
Into      0.21
Activations Density 0.050%