Explanations
phrases relating to effects or consequences
Negative Logits
bow       -0.76
cele      -0.69
approved  -0.64
course    -0.61
mad       -0.60
away      -0.60
fter      -0.59
ption     -0.58
ilings    -0.58
media     -0.58
Positive Logits
tremend   0.84
alot      0.79
raining   0.78
bnb       0.77
ynthesis  0.71
CTR       0.70
ometimes  0.69
rontal    0.67
ichick    0.63
inently   0.63
Activations Density 0.241%