INDEX
Explanations
phrases related to negative or derogatory terms
terms associated with critique and negative characterization
Negative Logits
Token      Logit
ahon       -0.79
chwitz     -0.72
earch      -0.71
ensive     -0.71
large      -0.71
arbon      -0.70
ij         -0.70
range      -0.68
angan      -0.67
ascript    -0.67
Positive Logits
Token      Logit
extraord   1.47
gery       1.05
esses      1.03
hood       1.00
ry         0.93
archetype  0.91
liness     0.87
persona    0.87
who        0.86
doms       0.85
Activations Density: 0.360%