INDEX
Explanations
references to individual actions or identities in a narrative context
Negative Logits
utto    -0.17
uzzi    -0.17
eyer    -0.16
ogan    -0.16
udo     -0.15
velt    -0.15
.aw     -0.15
rown    -0.15
oven    -0.14
agne    -0.14
Positive Logits
flat       0.18
oram       0.17
wid        0.16
flat       0.15
Wid        0.15
bp         0.15
wid        0.14
Burnett    0.14
788        0.14
cle        0.14
Activations Density: 0.018%