Explanations
words related to negative or controversial actions or situations
terms related to reputation and its implications
Negative Logits
sky        -0.65
nova       -0.65
Nig        -0.63
thur       -0.60
veins      -0.60
danger     -0.60
gray       -0.60
erb        -0.59
Mig        -0.59
Sherman    -0.59
Positive Logits
ction      0.87
enment     0.80
ndum       0.79
ctions     0.78
ance       0.76
eval       0.76
issance    0.76
atives     0.75
ENCY       0.73
essed      0.73
Activations Density: 0.164%