Explanations
words related to expressing strong opinions or beliefs
terms related to vocabulary and cultural references
Negative Logits
etter      -0.84
erness     -0.80
icipated   -0.78
raid       -0.74
esm        -0.74
ered       -0.73
ering      -0.72
ness       -0.72
resh       -0.71
ige        -0.71
Positive Logits
ations     0.92
acion      0.87
atures     0.82
entric     0.81
ATIONS     0.79
atis       0.76
ature      0.76
ates       0.76
adoes      0.75
acies      0.75
Activations Density 0.076%