Explanations
terms related to power dynamics or performances
Negative Logits
romeda    -0.72
Niet      -0.68
roit      -0.67
eret      -0.67
Bei       -0.67
OOL       -0.67
eryl      -0.65
algia     -0.64
Von       -0.63
olson     -0.63
Positive Logits
houses    1.05
stroke    1.03
lifting   0.97
outage    0.90
puff      0.89
train     0.85
lessness  0.82
full      0.80
chords    0.80
Reviewer  0.80
Activations Density: 0.035%