INDEX
Explanations
phrases indicating clarification or emphasis
phrases structured around the concept of "being" or existence
Negative Logits
yip       -0.81
naires    -0.65
rador     -0.61
Lines     -0.58
atis      -0.57
wi        -0.56
Bung      -0.56
nap       -0.56
rows      -0.56
ria       -0.55
Positive Logits
honest    1.12
sure      1.11
blunt     1.03
able      0.99
frank     0.93
fair      0.85
heading   0.85
eligible  0.82
truthful  0.81
careful   0.79
Activations Density 0.048%