INDEX
Explanations
questions ending with a question mark
questions about understanding implications and dynamics
Negative Logits
toxin          -0.69
bul            -0.66
ality          -0.66
iannopoulos    -0.65
oki            -0.65
uther          -0.65
evening        -0.65
contr          -0.64
onte           -0.63
background     -0.63
Positive Logits
Well           1.56
Firstly        1.35
Probably       1.31
Quite          1.29
Answer         1.23
Certainly      1.23
Apparently     1.22
Possibly       1.20
Turns          1.20
Obviously      1.20
Activations Density: 0.130%