Explanations
phrases indicating effort or attempt
phrases related to effort and intention
Negative Logits
Token         Logit
ELD           -0.72
zynski        -0.70
OPS           -0.68
Worse         -0.66
EStream       -0.66
UTH           -0.63
activation    -0.62
Liberation    -0.60
IFE           -0.59
Forced        -0.57
Positive Logits
Token         Logit
minimize      1.55
avoid         1.41
ensure        1.30
minim         1.24
avoid         1.21
maintain      1.21
maximize      1.18
keep          1.18
adhere        1.18
emphasize     1.15
Activations Density 0.300%
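
The density figure above can be read as the share of token positions on which the feature fires. The sketch below shows one way such a number might be computed. It assumes that "density" means the fraction of positions with an activation above zero; the page does not define the term, so that reading, the function name `activation_density`, the `threshold` parameter, and the sample data are all illustrative.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of positions where the feature is
    active (activation > threshold). The dashboard does not state this.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 3 of which activate the feature.
sample = [0.0] * 997 + [0.8, 1.7, 0.2]
density = activation_density(sample)
print(f"Activations Density {density:.3%}")  # prints "Activations Density 0.300%"
```

Under this reading, a density of 0.300% corresponds to the feature being active on roughly 3 of every 1000 tokens.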