INDEX
Explanations
profanity and offensive language
affirmations or denials within discussion contexts
Negative Logits
curfew       -0.35
dams         -0.32
earthquakes  -0.31
genomes      -0.31
bases        -0.30
reactors     -0.30
pores        -0.30
polygamy     -0.29
jails        -0.29
roofs        -0.28
Positive Logits
orp       0.35
perty     0.35
icably    0.35
Ax        0.35
uine      0.33
orne      0.33
omew      0.33
Helpful   0.32
REDACTED  0.32
ohn       0.32
Activations Density 2.058%