Explanations
mentions of different groups of people
references to groups of people
Negative Logits
Token         Logit
tains         -0.88
opens         -0.71
iHUD          -0.70
¿½            -0.68
Increases     -0.68
manent        -0.67
Flavoring     -0.67
CONT          -0.65
forestation   -0.63
Prel          -0.63
Positive Logits
Token         Logit
aren          1.57
deserve       1.55
ARE           1.36
weren         1.33
shouldn       1.32
are           1.32
despise       1.30
don           1.29
ain           1.28
suck          1.28
Activations Density: 0.421%