INDEX
Explanations
phrases expressing extreme opinions or experiences
phrases emphasizing personal experiences and evaluations
Negative Logits
phasis       -0.60
gravity      -0.59
uish         -0.54
dominates    -0.53
inks         -0.53
Majority     -0.51
flix         -0.51
Kung         -0.51
tiny         -0.51
papers       -0.50
Positive Logits
ever          1.57
EVER          1.38
ever          1.13
imaginable    1.07
Ever          1.02
Ever          0.93
encountered   0.92
encount       0.85
possibly      0.77
anywhere      0.76
Activations Density: 0.105%