INDEX
Explanations
controversial or negative associations and actions related to various groups or individuals
discussions of harmful societal issues and groups
Negative Logits
Token     Logit
ERG       -0.66
Ank       -0.63
Sym       -0.62
STRUCT    -0.61
OIL       -0.60
Sum       -0.59
Vert      -0.57
Shift     -0.57
Sund      -0.57
Var       -0.56
Positive Logits
Token          Logit
respectively   0.87
etc            0.72
isine          0.71
atics          0.70
.''.           0.69
.",            0.67
.[             0.67
backgrounds    0.66
perpetrated    0.64
¥µ             0.64
Activations Density: 0.629%