INDEX
Explanations
expressions related to concern or indifference towards societal issues
Negative Logits
ンジ      -0.82
UES       -0.77
======    -0.69
kered     -0.68
onite     -0.64
oute      -0.64
ッド      -0.62
CHA       -0.62
KEN       -0.60
ト        -0.58
Positive Logits
taker     1.34
lessly    1.25
lessness  1.15
giving    0.98
ening     0.97
fully     0.93
taking    0.89
der       0.85
ful       0.85
eners     0.80
Activations Density 0.603%