Explanations
words or prefixes related to negation or reversal, particularly those starting with "un"
Negative Logits
�             -0.74
orate         -0.73
superhuman    -0.62
unpop         -0.61
uncontrolled  -0.60
mathemat      -0.59
oured         -0.59
Dian          -0.59
unden         -0.58
submerged     -0.58

Positive Logits
readable      0.87
nikov         0.86
interested    0.86
loading       0.85
miss          0.84
rave          0.84
rival         0.83
needed        0.82
paralle       0.82
break         0.80
Activations Density 0.015%