INDEX
Explanations
references to harm and physics-related terms
terms related to harm and its effects
Negative Logits
Token       Logit
Ö¼          -0.87
onde        -0.74
stakes      -0.73
toes        -0.70
ducks       -0.67
Monteneg    -0.66
thumbs      -0.65
eyed        -0.65
craw        -0.64
gravel      -0.64
Positive Logits
Token       Logit
harm        3.17
phys        1.39
physi       1.34
Phys        1.32
Harm        1.28
pharmac     1.23
Phys        1.20
alter       1.13
aber        1.07
ulla        1.03
Activations Density: 0.056%