INDEX
Explanations
phrases related to identity and societal roles
Negative Logits
quot      -0.17
‘         -0.15
quot      -0.14
...'      -0.14
ayıp      -0.14
aku       -0.14
,’”       -0.14
Diamonds  -0.13
uxt       -0.13
Ori       -0.13
Positive Logits
"     0.30
".↵   0.23
"↵    0.21
."↵   0.21
".    0.21
".    0.21
".↵   0.20
："   0.20
"↵    0.20
",    0.19
Activations Density 0.232%