Explanations
questions, especially those starting with the word "ask"
Negative Logits
Ĥ¬          -0.73
âĶĢâĶĢ      -0.67
cutting     -0.66
lim         -0.65
Scouting    -0.65
ccording    -0.64
edition     -0.62
swing       -0.62
absor       -0.61
zinski      -0.61
Positive Logits
rhet         1.25
questions    1.21
probing      1.03
forgiveness  0.97
naires       0.97
politely     0.94
plaint       0.93
Questions    0.92
permission   0.91
question     0.84
Activations Density: 0.415%