INDEX
Explanations
phrases associated with instructions or guidance
Negative Logits
sk      -0.70
ste     -0.65
bet     -0.63
p       -0.63
inv     -0.63
-       -0.62
sti     -0.61
re      -0.61
見      -0.60
ri      -0.60
Positive Logits
myſelf        1.53
himſelf       1.50
itſelf        1.48
auffi         1.44
ſeveral       1.42
ſelf          1.42
themſelves    1.41
ainfi         1.39
againſt       1.38
ſhe           1.37
Activations Density 0.164%