INDEX
Explanations
phrases introducing general statements or observations
generalizing terms and phrases that imply common experiences or observations
Negative Logits
gent      -0.63
imm       -0.62
enting    -0.61
nearby    -0.61
avering   -0.57
pron      -0.57
driving   -0.57
andering  -0.56
chore     -0.56
rapp      -0.56
Positive Logits
entimes   0.87
chwitz    0.86
Helpful   0.85
terness   0.83
eus       0.83
resy      0.82
Strikes   0.82
Issue     0.81
yip       0.77
Called    0.77
Activations Density: 0.032%