INDEX
Explanations
adverbs indicating degree or intensity
phrases that express varying degrees of surprise or amazement
Negative Logits
ature     -0.68
UME       -0.66
isher     -0.65
odder     -0.65
izu       -0.63
oris      -0.61
Ans       -0.60
agonists  -0.60
ograph    -0.59
uthor     -0.59
Positive Logits
HCR     0.93
ls      0.87
soever  0.83
much    0.81
ling    0.80
bill    0.77
beit    0.77
MUCH    0.77
ells    0.76
much    0.75
Activations Density 0.083%