INDEX
Explanations
concepts related to societal roles and evaluations
Negative Logits
instead    -0.19
etc        -0.19
instead    -0.17
çŃī        -0.17
Instead    -0.16
undan      -0.16
fak        -0.16
Instead    -0.15
kker       -0.15
(or        -0.15
Positive Logits
AND        0.48
lẫn        0.45
AND        0.34
että       0.27
as         0.27
nor        0.26
_AND       0.25
plus       0.25
PLUS       0.23
_          0.23
Activations Density 0.117%