INDEX
Explanations
references to threats or dangers in various contexts
Negative Logits
etur        -0.15
arent       -0.15
gere        -0.14
ulton       -0.14
undra       -0.14
ignon       -0.13
.compose    -0.13
áo          -0.13
keit        -0.13
vr          -0.13
Positive Logits
danger      0.19
dangers     0.18
hã          0.17
ional       0.17
stell       0.17
Danger      0.17
-danger     0.16
threat      0.15
threat      0.15
ome         0.15
Activations Density 0.049%