INDEX
Explanations
various forms of negative or oppositional language
Negative Logits
Token       Logit
çĦ          -0.94
INGTON      -0.82
folders     -0.70
Cups        -0.69
enegger     -0.68
Folder      -0.68
ponds       -0.66
CSV         -0.64
wallets     -0.63
adapters    -0.63
Positive Logits
Token       Logit
xiety       0.99
acist       0.95
alyst       0.85
olic        0.85
apist       0.84
hesis       0.82
uro         0.81
hetical     0.81
acists      0.80
agog        0.78
Activations Density: 0.010%