INDEX
Explanations
the word "won't" with high activation values
negations or words indicating refusal or denial
Negative Logits
behavi         -0.77
CVE            -0.72
Reloaded       -0.72
Reviewer       -0.72
examiner       -0.70
Palestin       -0.70
HTTP           -0.67
Moroc          -0.67
Hardware       -0.66
ventilation    -0.65
Positive Logits
weet           1.01
ardless        0.91
acular         0.91
ruck           0.91
urtle          0.91
ravis          0.86
aylor          0.81
otally         0.80
itles          0.79
rees           0.79
Activations Density: 0.032%