INDEX
Explanations
ambiguous or unclear words and phrases
Negative Logits
sidew        -0.72
heel         -0.70
hement       -0.69
intuitive    -0.67
defamation   -0.67
suppressed   -0.66
sway         -0.66
flush        -0.66
disg         -0.65
positively   -0.64
Positive Logits
lihood       1.00
Else         0.93
âĦ¢          0.89
Limits       0.86
tons         0.84
ness         0.83
itarian      0.82
!,           0.80
cott         0.79
sburg        0.79
Activations Density 0.162%