INDEX
Explanations
references to friendliness and positive social interactions
Negative Logits
Token    Logit
ngo      -0.17
stin     -0.15
lage     -0.14
域       -0.14
238      -0.14
ilion    -0.14
edBy     -0.14
orial    -0.14
sf       -0.14
ैज       -0.14
Positive Logits
Token    Logit
/lo      0.17
towards  0.17
toward   0.16
nature   0.16
faces    0.16
udge     0.16
nature   0.16
/help    0.16
tone     0.15
-faced   0.15
Activations Density 0.073%