INDEX
Explanations
words related to opposition or contradiction
terms related to counter-arguments and counterproductive actions
Negative Logits
Token     Logit
©¶æ¥µ     -0.71
ogo       -0.63
é¾įå      -0.63
Bonds     -0.62
Vide      -0.61
FANTASY   -0.61
weeney    -0.61
ahime     -0.60
Likes     -0.60
livest    -0.59
Positive Logits
Token      Logit
measures   0.80
dict       0.78
attack     0.76
xual       0.75
intuitive  0.75
arya       0.73
ctive      0.72
atives     0.71
argument   0.71
rad        0.71
Activations Density 0.078%