INDEX
Explanations
references to guns and gun-related terminology
Negative Logits
Token       Logit
ерин        -0.18
tte         -0.16
cue         -0.16
hid         -0.15
een         -0.15
crast       -0.15
cin         -0.15
hra         -0.15
gor         -0.15
AsStream    -0.15
Positive Logits
Token       Logit
pow         0.35
metal       0.27
ned         0.25
ning        0.25
ny          0.24
ners        0.24
shots       0.24
ner         0.23
boat        0.22
fight       0.22
Activations Density: 0.024%