INDEX
Explanations
terms related to "unknown" or "undeveloped" concepts
Negative Logits
ership           -0.75
Catal            -0.70
tsky             -0.70
Tycoon           -0.70
understatement   -0.69
Gazette          -0.68
Reviewer         -0.67
Duchess          -0.63
Charl            -0.63
rejection        -0.62
Positive Logits
ored             1.16
oded             1.15
enced            1.06
itable           1.04
structed         1.03
ired             1.02
overed           1.01
ought            1.01
velop            0.99
served           0.96
Activations Density: 0.011%