INDEX
Explanations
phrases related to social issues and disparities
Negative Logits
Token     Logit
izons     -0.78
agon      -0.72
onet      -0.72
oud       -0.71
gered     -0.71
ema       -0.69
20439     -0.69
iban      -0.68
itol      -0.68
alysed    -0.67
Positive Logits
Token     Logit
namely    0.95
Whenever  0.91
Anyone    0.89
Unless    0.85
Why       0.84
Until     0.84
When      0.84
Firstly   0.84
Where     0.83
Forget    0.83
Activations Density 0.120%