INDEX
Explanations
phrases indicating distinguishing characteristics or unique features
phrases that identify distinguishing characteristics or qualities
Negative Logits
Token      Logit
endix      -0.80
lance      -0.75
lex        -0.74
ixon       -0.72
xon        -0.70
erenn      -0.69
odder      -0.67
rentice    -0.67
tch        -0.66
cow        -0.66
Positive Logits
Token          Logit
them           0.71
orno           0.69
Cu             0.69
rament         0.67
distinguishes  0.67
humanity       0.67
Sanct          0.66
ably           0.64
these          0.63
Tyrann         0.63
Activations Density: 0.107%