Explanations
comparisons using the word "like"
Negative Logits
byn            -0.80
conservancy    -0.67
earance        -0.64
alez           -0.64
arcity         -0.64
itions         -0.64
alt            -0.63
oust           -0.63
edom           -0.63
izarre         -0.63
Positive Logits
crap           0.98
lier           0.93
shit           0.84
idiots         0.73
fools          0.71
liest          0.68
they           0.67
outsiders      0.67
lihood         0.67
THEY           0.65
Activations Density: 0.032%