INDEX
Explanations
expressions of pride or self-importance related to achievements or identity
Negative Logits
merce   -0.15
ofday   -0.15
uran    -0.15
Fuj     -0.15
oran    -0.14
onth    -0.14
laws    -0.14
eki     -0.14
oth     -0.14
allon   -0.13
Positive Logits
ably    0.16
uzzi    0.16
Počet   0.14
proud   0.14
ór      0.14
Fior    0.14
ρια     0.14
準      0.14
mantle  0.14
unwrap  0.14
Activations Density 0.035%
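
The density figure can be read as the share of tokens on which the feature fires at all. The page does not define it, so the sketch below makes assumptions: density is taken to be the percentage of tokens with activation above a threshold of zero, and the function name `activation_density` and the sample data are hypothetical. With 7 active tokens out of 20,000, the result matches the 0.035% shown above.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens on which the
    feature is active; the source page does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the feature fires on 7 of 20,000 tokens.
acts = [0.0] * 19_993 + [1.2] * 7
density = activation_density(acts)
print(f"{density:.3f}%")
```

A density this low means the feature is sparse: it is silent on almost every token and fires only on a narrow set of contexts.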