INDEX
Explanations
words or phrases related to controversial or sensitive topics, potentially skewed towards medical or political subjects
topics related to social issues and cultural sensitivity
Negative Logits
luster     -0.52
Nare       -0.50
farious    -0.50
sylv       -0.48
Ire        -0.47
withd      -0.47
orage      -0.46
anon       -0.46
nesota     -0.46
nesday     -0.45
Positive Logits
)?     0.77
?)     0.70
)</    0.68
)/     0.66
?).    0.65
)      0.65
!)     0.64
!).    0.63
-)     0.63
!),    0.62
Activations Density: 1.105%