Explanations
negations and phrases expressing disagreement or denial
Negative Logits
Token     Logit
aler      -0.65
ims       -0.62
liest     -0.61
hesis     -0.60
kees      -0.60
ellen     -0.60
thing     -0.60
ogen      -0.60
IRE       -0.60
heads     -0.59
Positive Logits
Token       Logit
icia        0.69
pmwiki      0.68
pleasant    0.64
horm        0.63
Nerd        0.62
pload       0.61
accident    0.60
Punk        0.60
advisable   0.60
ice         0.59
Activations Density: 0.060%