INDEX
Explanations
adjectives or phrases describing potential harm or risks
references to danger and potentially harmful situations
Negative Logits
olitan      -0.82
roma        -0.79
ļéĨĴ        -0.78
mination    -0.76
via         -0.75
oration     -0.75
zzo         -0.75
hew         -0.74
arthed      -0.74
cedented    -0.73
Positive Logits
adolesc      0.91
endanger     0.81
undermin     0.75
sounding     0.74
nesses       0.73
dangerous    0.73
threats      0.71
overdose     0.71
combination  0.70
Danger       0.69
Activations Density 0.032%
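
The page does not define the density figure. One common reading is the fraction of token positions on which the feature activates at all. The sketch below illustrates that reading only. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from the dashboard's own implementation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active; the dashboard may compute it differently.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, feature active on 32 of them.
acts = [0.0] * 100_000
for i in range(0, 3200, 100):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.032% would correspond to the feature firing on roughly 32 of every 100,000 tokens.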