Explanations
strong emphasis on positive sentiment or expressions of praise
Negative Logits
Token       Logit
horn        -0.17
teenth      -0.16
ic          -0.16
oon         -0.15
ered        -0.15
uld         -0.15
omer        -0.15
cy          -0.15
ham         -0.14
venience    -0.14
Positive Logits
Token       Logit
anford      0.18
ASE         0.16
aney        0.16
åı·         0.15
aller       0.15
-secret     0.15
lying       0.14
ignant      0.14
anka        0.14
gether      0.14
Activations Density: 0.075%