INDEX
Explanations
mentions of guns and gun-related terminology
Negative Logits
ерин     -0.17
oha      -0.17
©        -0.16
acus     -0.15
cue      -0.15
oped     -0.15
hra      -0.14
casts    -0.14
tick     -0.14
ражд     -0.14
Positive Logits
pow      0.29
ning     0.22
ned      0.20
linger   0.20
ny       0.20
metal    0.20
shots    0.19
erals    0.18
boat     0.17
nen      0.17
Activations Density 0.023%