Explanations
references to societal injustices and moral dilemmas
Negative Logits
bens        -0.15
tend        -0.15
ocket       -0.14
reu         -0.14
ullet       -0.14
aba         -0.14
sab         -0.14
兼          -0.14
jev         -0.14
Explicit    -0.13
Positive Logits
stejně      0.29
same        0.26
同          0.25
similarly   0.24
Similarly   0.23
Similarly   0.23
analogy     0.22
Same        0.22
一样        0.22
เหม         0.21
Activations Density: 0.279%