INDEX
Explanations
words related to support or help
features related to characteristics of individuals or identity
Negative Logits
.</     -0.82
—       -0.77
.?      -0.76
.''.    -0.76
.—      -0.73
.[      -0.71
âĢķ     -0.71
.       -0.70
âĢł     -0.70
*.      -0.69
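The entries âĢķ and âĢł in the negative-logit list look like raw byte-level BPE token strings rather than ordinary text. Assuming the vocabulary uses the GPT-2 style byte-to-character table (an assumption; the page does not name the tokenizer), a minimal sketch for turning such strings back into readable characters could look like this. The helper names `byte_to_char_table` and `decode_token` are hypothetical.

```python
def byte_to_char_table():
    """Rebuild the GPT-2 style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Every remaining byte is shifted to a code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in byte_to_char_table().items()}


def decode_token(token: str) -> str:
    """Map each displayed character back to its byte, then decode the bytes as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The two byte-level entries from the negative-logit list.
    for tok in ["\u00e2\u0122\u0137", "\u00e2\u0122\u0142"]:
        decoded = decode_token(tok)
        print(repr(tok), "->", repr(decoded), f"U+{ord(decoded):04X}")
```

Under that assumption the two entries decode to U+2015 (horizontal bar) and U+2020 (dagger), which fits the rest of the list: punctuation that closes a sentence or clause.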
Positive Logits
however       0.94
tho           0.78
meanwhile     0.77
alot          0.72
organise      0.70
though        0.69
realise       0.64
organising    0.63
anwhile       0.63
learnt        0.60
Activations Density: 1.589%