INDEX
Explanations
categories of content related to location or classification
categories or types of entities
Negative Logits
elf           -0.42
y             -0.39
Opfer         -0.37
alu           -0.37
hy            -0.36
sy            -0.36
aloud         -0.35
long          -0.35
teasing       -0.35
harassment    -0.35
Positive Logits
tartalomajánló    0.96
Portail           0.92
بوابة             0.82
Datuak            0.79
Portale           0.73
الحره             0.73
DockStyle         0.72
Италијани         0.72
bezeichneter      0.69
extAlignment      0.67
Activations Density: 0.009%