INDEX
Explanations
words related to validity and authentication
expressions related to validation or legitimacy
Negative Logits
hedon       -0.82
xual        -0.77
Mania       -0.71
irez        -0.68
Grove       -0.66
mania       -0.65
Kut         -0.65
ynthesis    -0.64
hell        -0.64
opsy        -0.64
Positive Logits
ators       1.08
ating       1.08
ator        0.98
alties      0.90
ates        0.87
ations      0.87
iation      0.78
acies       0.78
ifiers      0.77
itable      0.75
Activations Density 0.017%