INDEX
Explanations
positive affirmations or strong endorsements
words related to strong positive responses or approvals
Negative Logits
nan         -0.69
bil         -0.68
perature    -0.65
士          -0.65
nesota      -0.65
skelet      -0.65
procure     -0.63
othy        -0.63
ORT         -0.61
pool        -0.60
Positive Logits
ounded      1.15
ounding     1.10
oslav       0.92
ounds       0.91
OUND        0.84
onent       0.79
soType      0.79
Sadd        0.78
ogle        0.70
SourceFile  0.69
Activations Density: 0.009%