Explanations
phrases indicating strong opinions or evaluations
statements about problems or conditions that lead to significant consequences
Negative Logits
FN        -0.65
known     -0.61
SD        -0.60
zig       -0.60
pring     -0.59
than      -0.59
odd       -0.59
alias     -0.58
episode   -0.58
wig       -0.57
Positive Logits
deserves  1.45
attracts  1.24
evolves   1.19
requires  1.19
belongs   1.18
seeks     1.16
needs     1.15
inspires  1.14
strives   1.14
relies    1.14
Activations Density: 0.183%