INDEX
Explanations
expressions related to negative judgment or criticism
derogatory remarks about intelligence
Negative Logits
AUT            -0.92
largeDownload  -0.88
APH            -0.87
quart          -0.76
cussion        -0.69
HI             -0.69
aver           -0.67
RH             -0.67
ILA            -0.67
soType         -0.66
POSITIVE LOGITS
stupid         1.01
nesses         0.93
silly          0.85
Stupid         0.84
dumb           0.83
gery           0.79
itude          0.77
upid           0.77
ulent          0.77
ishly          0.76
Activations Density 0.010%