INDEX
Explanations
references to nuclear weapons and related incidents
Negative Logits
andal      -0.15
Scanner    -0.15
Noise      -0.14
оров       -0.14
å¥         -0.14
CAA        -0.14
Anchor     -0.14
hel        -0.14
Magnet     -0.13
pseud      -0.13
Positive Logits
nuclear    0.47
atomic     0.43
uclear     0.40
Nuclear    0.39
核         0.38
Atomic     0.37
Atomic     0.33
atomic     0.33
atom       0.32
النو       0.30
Activations Density 0.110%