INDEX
Explanations
phrases related to negative connotations or criticisms
recurring themes or patterns related to derogatory remarks or slurs
Negative Logits
spoilers       -0.63
unexplained    -0.62
ymes           -0.61
iots           -0.61
LLOW           -0.60
INO            -0.60
successor      -0.58
Annotations    -0.57
lip            -0.56
hypers         -0.55
Positive Logits
ufact          1.14
rius           0.95
stration       0.82
ESA            0.76
andise         0.75
brance         0.73
roe            0.73
strate         0.71
ogyn           0.68
ques           0.68
Activations Density: 0.063%