Explanations
phrases indicating feelings of luck, privilege, and opportunities
Negative Logits
Token       Logit
sympathy    -0.17
swick       -0.16
arness      -0.16
arius       -0.15
ilot        -0.14
失          -0.14
ramid       -0.14
orno        -0.14
uc          -0.14
Reflection  -0.14
Positive Logits
Token       Logit
Priv        0.17
privilege   0.16
IES         0.16
uppe        0.15
/gui        0.14
iesen       0.14
privileged  0.14
TextLabel   0.14
priv        0.14
.selection  0.13
Activations Density: 0.048%