INDEX
Explanations
phrases related to opinions or beliefs about different groups or ideologies
phrases that assert the existence or characteristics of groups or entities
Negative Logits
Token       Logit
WER         -0.76
mentioned   -0.76
urry        -0.75
stood       -0.71
Moines      -0.70
ESE         -0.69
sers        -0.67
rief        -0.66
noticed     -0.66
Adds        -0.65
Positive Logits
Token           Logit
unfit           1.20
inherently      1.17
incapable       1.16
unworthy        1.14
illegitimate    1.09
somehow         1.08
intrinsically   1.05
insufficient    1.05
incompatible    1.05
conspiring      1.04
Activations Density: 0.293%