INDEX
Explanations
expressions of subjective feelings or thoughts
phrases that express feelings of comparison or similarity
Negative Logits
byn            -0.76
oust           -0.71
conservancy    -0.70
omen           -0.68
alt            -0.66
itions         -0.65
ertain         -0.64
edient         -0.63
ais            -0.63
DonaldTrump    -0.62
Positive Logits
crap           1.02
shit           0.92
lier           0.84
pulling        0.78
stepping       0.75
spitting       0.74
jumping        0.73
throwing       0.72
admitting      0.72
quitting       0.70
Activations Density 0.032%