Explanations
phrases related to criticism or negative judgment
negative descriptors, particularly related to criticism or derogatory terms
Negative Logits
Token        Logit
gow          -0.79
minster      -0.73
laun         -0.65
abre         -0.65
Gardens      -0.64
united       -0.64
anmar        -0.64
bnb          -0.63
quartered    -0.61
purch        -0.61
Positive Logits
Token        Logit
alion        0.90
ction        0.84
ible         0.80
dden         0.72
chio         0.71
iment        0.67
ilet         0.66
ype          0.65
fusc         0.65
ione         0.65
Activations Density: 0.042%