Explanations
references to accountability and complicity in societal issues
Negative Logits
aller         -0.19
åĭĴ           -0.15
iegel         -0.14
unfamiliar    -0.14
humble        -0.14
æŁĦ           -0.14
nerd          -0.14
iso           -0.14
mann          -0.14
èıľ           -0.13
Positive Logits
cond          0.23
support       0.21
permit        0.20
allowing      0.18
enabling      0.18
comp          0.18
enable        0.18
åħģ           0.18
allow         0.18
toler         0.18
Activations Density: 0.287%