INDEX
Explanations
words related to judgment or evaluation
negations and expressions of skepticism or doubt
Negative Logits
Token        Logit
iets         -0.94
regation     -0.89
glers        -0.87
ortmund      -0.82
atari        -0.80
umps         -0.80
phia         -0.79
Ws           -0.78
olphins      -0.77
acas         -0.76
Positive Logits
Token          Logit
downright      1.24
impractical    1.20
addictive      1.20
impossible     1.13
profitable     1.13
dangerous      1.10
inspiring      1.10
irresistible   1.09
scary          1.08
desirable      1.07
Activations Density: 0.334%