INDEX
Explanations
phrases related to potential risks or dangers
terms related to risk and danger
Negative Logits
eve       -0.74
elf       -0.72
ergy      -0.72
ENTS      -0.70
Nap       -0.69
lins      -0.67
Seasons   -0.66
gdala     -0.65
inters    -0.65
poons     -0.65
Positive Logits
risk      0.89
crow      0.86
risks     0.85
taking    0.81
risk      0.80
endanger  0.76
lessly    0.76
gamble    0.75
proble    0.74
contag    0.74
Activations Density: 0.021%