Explanations
phrases related to roles, instructions or commands
phrases indicating legal or formal contexts
Negative Logits
Token          Logit
.:             -0.78
":-            -0.69
ciplinary      -0.66
shed           -0.63
.",            -0.63
ses            -0.62
Pieces         -0.61
usercontent    -0.60
reth           -0.57
Sep            -0.57
Positive Logits
Token          Logit
?)             1.04
!)             1.03
incidentally   1.02
!),            1.01
!).            0.96
theless        0.96
?).            0.94
?),            0.92
arently        0.90
admittedly     0.89
Activations Density: 0.364%