INDEX
Explanations
phrases related to controversial topics or actions
Negative Logits
idth          -0.68
Shares        -0.68
Niet          -0.60
[+]           -0.59
assetsadobe   -0.57
cheon         -0.57
76561         -0.57
hrs           -0.56
illard        -0.56
saw           -0.56
Positive Logits
disappear      0.95
obsolete       0.94
happen         0.93
accessible     0.93
unavailable    0.89
inaccessible   0.85
easier         0.83
redundant      0.83
safer          0.82
solete         0.79
Activations Density: 0.202%