Explanations
negative statements or concepts related to relationships and social issues
Negative Logits
Token       Logit
[...]       -0.15
Potion      -0.14
...]↵↵      -0.14
.nano       -0.14
recent      -0.14
()['        -0.14
ahrain      -0.14
inati       -0.14
swick       -0.14
[...]↵↵     -0.13
Positive Logits
Token           Logit
libs            0.17
—               0.16
ideo            0.15
306             0.14
illeg           0.14
Americans       0.14
conservatives   0.14
Barack          0.14
sob             0.14
because         0.14
Activations Density 0.001%