INDEX
Explanations
phrases indicating recognition or awareness of specific information
Negative Logits
ares      -0.18
iasco     -0.17
ids       -0.17
dk        -0.15
ero       -0.15
go        -0.15
/ts       -0.15
role      -0.14
cigaret   -0.14
guard     -0.14
Positive Logits
æĻĵ       0.23
s         0.23
ledge     0.21
sand      0.20
ledged    0.20
-how      0.20
ingly     0.19
led       0.18
ledger    0.18
fact      0.18
Activations Density: 0.033%