INDEX
Explanations
references to Nazi Germany and related concepts
Negative Logits
onaut       -0.16
ft          -0.15
imos        -0.15
Pascal      -0.14
aab         -0.14
icense      -0.14
ække        -0.14
ayet        -0.14
ět          -0.14
utenberg    -0.14
Positive Logits
Hit         0.47
Hitler      0.43
Hit         0.42
.Hit        0.36
HIT         0.36
NS          0.35
Nazi        0.33
hit         0.32
Naz         0.32
SS          0.32
Activation density: 0.110%