INDEX
Explanations
Questions, particularly those starting with the word "How"
Negative Logits
Token       Logit
room        -0.62
piece       -0.61
odder       -0.60
article     -0.59
oubted      -0.57
hereafter   -0.57
iculture    -0.56
Issue       -0.56
agonists    -0.56
goers       -0.56
Positive Logits
Token       Logit
soever      1.10
beit        0.97
ever        0.95
ells        0.91
itzer       0.90
ling        0.88
much        0.87
ls          0.86
exactly     0.75
much        0.75
Activations Density: 1.164%