Explanations
phrases expressing strong criticism or disbelief
instances of the word "nonsense" and related phrases
Negative Logits
hani     -0.75
redits   -0.72
ez       -0.70
irth     -0.70
lis      -0.70
ugal     -0.69
hold     -0.69
uve      -0.69
yer      -0.68
imb      -0.65
Positive Logits
nonsense      1.11
detector      0.91
excuses       0.89
bullshit      0.86
excuse        0.83
rubbish       0.81
crap          0.79
guiActiveUn   0.77
aceutical     0.77
blah          0.77
Activations Density 0.027%
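
To make the density figure concrete, here is a minimal sketch under one assumption: that activation density means the fraction of token positions on which the feature's activation is above zero. The source does not define the term, so treat this reading as a guess. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" counts a position as active when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 3 of which fire.
sample = [0.0] * 10_000
for position in (17, 4200, 9001):
    sample[position] = 1.5

density = activation_density(sample)
print(f"{density:.3%}")  # formatted as a percentage, like the figure above
```

On this reading, a density of 0.027% would mean the feature fires on roughly 27 of every 100,000 tokens, which fits a feature tied to a narrow set of words such as "nonsense" and "rubbish".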