Explanations
terms and phrases related to critiques of societal norms and cultural phenomena
Negative Logits
}.
-0.25
.").
-0.21
}.↵
-0.21
''.
-0.21
'.
-0.21
).
-0.20
“.
-0.20
("").
-0.20
].
-0.20
>().
-0.20
Positive Logits
”,
0.36
",
0.35
,”
0.33
»,
0.31
,"
0.31
’,
0.31
!",
0.30
”，
0.30
',
0.29
",
0.29
Activations Density 0.110%