INDEX
Explanations
Anti-Defamation League (ADL) links
Negative Logits
proposes     0.40
ROOT         0.39
sáb          0.39
manuscripts  0.38
WORK         0.37
stratum      0.37
STRICT       0.37
WORK         0.37
RULES        0.37
Work         0.36
Positive Logits
combat       0.45
hate         0.44
IDF          0.43
전투         0.42
Fett         0.41
Lauder       0.41
combats      0.41
Hitler       0.41
ጩ            0.41
Anti         0.40
Activations Density: 0.005%