INDEX
Explanations
specific adjectives that describe actions or characteristics, such as "sparing", "terrifying", "vicious", "exact", and "leisure"
Negative Logits
Token          Logit
anders         -0.79
APH            -0.66
axy            -0.66
REL            -0.65
DonaldTrump    -0.64
ploma          -0.63
aden           -0.61
ARM            -0.61
ODUCT          -0.61
iltr           -0.60
Positive Logits
Token          Logit
ly             2.93
LY             1.84
lys            1.42
edly           1.31
liness         1.31
lies           1.25
fully          1.21
ity            1.16
ously          1.15
lly            1.15
Activations Density 1.945%
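
The page does not say how the density figure is defined. A common reading is the fraction of sampled token positions on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name, the threshold, and the sample values are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    The dashboard itself does not confirm this definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations for 8 token positions (illustrative values only).
sample = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0, 0.2]
print(f"{activation_density(sample):.3%}")  # prints 25.000%
```

Under this reading, a density of 1.945% would mean the feature fires on roughly 2 of every 100 sampled tokens.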