INDEX
Explanations
phrases related to accomplishing tasks or decisions
occurrences of the word "the."
Negative Logits
strate      -0.73
wr          -0.66
ufact       -0.65
Cur         -0.64
Style       -0.64
exting      -0.62
greeted     -0.61
tions       -0.60
tion        -0.59
epad        -0.59
Positive Logits
slightest    1.10
mistake      1.09
leap         1.01
pilgrimage   0.99
distinction  0.96
same         0.95
decision     0.94
rounds       0.94
transition   0.94
difference   0.91
Activations Density 0.038%