INDEX
Explanations
adjectives that express opinions or perceptions
Negative Logits
Token        Logit
itton        -0.77
pour         -0.68
probably     -0.64
instead      -0.62
instead      -0.62
sometimes    -0.61
Probably     -0.60
Kings        -0.60
rather       -0.60
onen         -0.59
Positive Logits
Token        Logit
anymore      1.47
anywhere     1.10
nor          1.06
anything     0.99
any          0.99
slightest    0.97
necessarily  0.95
bothered     0.90
whatsoever   0.87
yet          0.85
Activations Density: 0.167%