INDEX
Explanations
references to specific items or categories of things
Negative Logits
Token         Logit
those         -0.75
those         -0.70
stuff         -0.69
stuff         -0.67
forskning     -0.66
Ausrüstung    -0.64
legislation   -0.63
)_/¯          -0.63
Kleidung      -0.63
Stuff         -0.62
Positive Logits
Token         Logit
days          0.85
two           0.85
kinds         0.82
guys          0.80
sorts         0.80
same          0.79
zelfde        0.79
cond          0.76
three         0.76
latter        0.72
Activations Density: 0.328%