2022-01-22 22:54:51 -05:00
|
|
|
( regex.tal )
|
|
|
|
( )
|
|
|
|
( compiles regex expression strings into regex nodes, then uses )
|
2022-02-20 15:06:57 -05:00
|
|
|
( regex nodes to match input strings. )
|
2022-01-22 22:54:51 -05:00
|
|
|
( )
|
2022-02-20 15:06:57 -05:00
|
|
|
( two methods are currently supported: )
|
|
|
|
( )
|
|
|
|
( 1. match )
|
|
|
|
( )
|
|
|
|
( when matching the regex must match the entire string. this means )
|
|
|
|
( that it is unnecessary to use ^ and $ when matching, since their )
|
|
|
|
( effect is implied. it also means that that dot nodes will match )
|
|
|
|
( any characters at all including newlines. )
|
|
|
|
( )
|
|
|
|
( match returns 01 if the string was matched and 00 otherwise. )
|
|
|
|
( )
|
|
|
|
( 2. search )
|
|
|
|
( )
|
|
|
|
( when searching the regex attempts to find matching substrings )
|
|
|
|
( in the given string. this means that after successfully finding )
|
|
|
|
( a match, search may be called on the remaining substring to find )
|
|
|
|
( more matches. )
|
|
|
|
( )
|
|
|
|
( when searching, ^ matches the beginning of the string OR a line. )
|
|
|
|
( $ matches the end of a line OR the end of the entire string. )
|
2022-04-15 01:40:52 -04:00
|
|
|
( the dot nodes will not match newline characters, which must be )
|
|
|
|
( matched explicitly. )
|
|
|
|
( )
|
|
|
|
( finally, search-multiline will cause ^ and $ to use the matching )
|
|
|
|
( behavior (i.e. only matching the beginning or end of a string). )
|
|
|
|
( however dot nodes will still not match newline characters. )
|
2022-02-20 15:06:57 -05:00
|
|
|
( )
|
|
|
|
( search returns 01 if the string was matched and 00 otherwise. )
|
|
|
|
( additionally, the @search-start and @search-end addresses will )
|
|
|
|
( contain the starting location and match boundary of the matching )
|
|
|
|
( substring. )
|
2022-01-22 22:54:51 -05:00
|
|
|
( )
|
|
|
|
( regex node types: )
|
|
|
|
( )
|
|
|
|
( NAME DESCRIPTION STRUCT )
|
|
|
|
( empty matches empty string [ #01 next* ] )
|
|
|
|
( dot matches any one char [ #02 next* ] )
|
|
|
|
( lit matches one specific char (c) [ #03 c^ next* ] )
|
|
|
|
( or matches either left or right [ #04 left* right* ] )
|
2022-02-21 15:59:13 -05:00
|
|
|
( star matches expr zero-or-more times [ #05 expr* next* ] )
|
2022-01-22 22:54:51 -05:00
|
|
|
( (NOTE: r.expr.next must be r) )
|
2022-02-20 15:06:57 -05:00
|
|
|
( caret matches start of line/string [ #06 next* ] )
|
|
|
|
( dollar matches end of line/string [ #07 next* ] )
|
2022-04-18 23:17:44 -04:00
|
|
|
( lpar starts subgroup region [ #08 i^ next* ] )
|
|
|
|
( rpar ends subgroup region [ #09 i^ next* ] )
|
2022-04-15 01:40:52 -04:00
|
|
|
( class character class, e.g. [a-z] [ #0a next* n^ ... ] )
|
|
|
|
( (NOTE: n is the number of pairs in ...) )
|
|
|
|
( nclass negative class, e.g. [^a-z] [ #0b next* n^ ... ] )
|
|
|
|
( (NOTE: n is the number of pairs in ...) )
|
2022-01-22 22:54:51 -05:00
|
|
|
( )
|
|
|
|
( `or` and `star` have the same structure and are handled by the )
|
|
|
|
( same code (;do-or). however, the node types are kept different )
|
|
|
|
( to make it clearer how to parse and assemble the nodes. )
|
|
|
|
( )
|
2022-02-21 15:59:13 -05:00
|
|
|
( dollar nodes contain a next pointer even though this usually )
|
|
|
|
( will not be needed. )
|
|
|
|
( )
|
|
|
|
( lpar and rpar contain addresses pointing between subgroup-bot )
|
|
|
|
( and subgroup-bot. rpar's address will always be +2 relative to )
|
|
|
|
( the corresponding lpar address. )
|
|
|
|
( )
|
2022-01-22 22:54:51 -05:00
|
|
|
( concatenation isn't a node, it is implied by the *next addr. )
|
|
|
|
( a next value of #0000 signals the end of the regex. )
|
|
|
|
( )
|
|
|
|
( in these docs str* is an address to a null-terminated string. )
|
|
|
|
( regexes should not include nulls and cannot match them (other )
|
|
|
|
( than the null which signals the end of a string). )
|
|
|
|
|
2022-02-21 15:59:13 -05:00
|
|
|
( TODO: we have lpar and rpar nodes but aren't using them yet )
|
|
|
|
( 1. need to modify c-lpar and c-par )
|
|
|
|
( 2. we need to store subgroup-posd in regions during parsing: )
|
|
|
|
( a. need to store the current pos in the region )
|
|
|
|
( b. need to call start to move subgroup-pos forward )
|
|
|
|
( 3. when finishing parsing a region we need lpar/rpar nodes )
|
|
|
|
( 4. we also need to store "last started subgroup" on the stack )
|
|
|
|
( 5. when backtracking we must rewind to "last started" subgroup )
|
|
|
|
|
2022-04-02 22:29:06 -04:00
|
|
|
%emit! { #18 DEO }
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-02-19 01:23:39 -05:00
|
|
|
( now that uxnasm throws errors about writing into the zero page )
|
|
|
|
( we have to do something like this to be able to compile library )
|
|
|
|
( code. we have to guess what offset to use since it needs to )
|
|
|
|
( avoid conficting with the program we're included in. )
|
|
|
|
( )
|
|
|
|
( remove this if needed when including it in other projects. )
|
2022-04-02 23:14:18 -04:00
|
|
|
( |2000 )
|
2022-02-19 01:23:39 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( ERROR HANDLING )
|
2022-01-30 14:34:50 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( using error! will print the given message before causing )
|
|
|
|
( the interpreter to halt. )
|
2023-11-28 12:08:42 -05:00
|
|
|
@errorm ( msg* -> )
|
2024-10-26 00:36:26 -04:00
|
|
|
LIT "! emit! #20 emit!
|
2023-11-28 12:08:42 -05:00
|
|
|
&loop LDAk #00 EQU ?&done
|
|
|
|
LDAk emit! INC2 !&loop
|
2024-10-26 00:36:26 -04:00
|
|
|
&done POP2 #0a emit! #ff0e DEO #010f DEO BRK
|
2022-01-30 14:34:50 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( error messages )
|
2022-01-30 14:34:50 -05:00
|
|
|
@unknown-node-type "unknown 20 "node 20 "type 00
|
|
|
|
@mismatched-parens "mismatched 20 "parenthesis 00
|
|
|
|
@stack-is-full "stack 20 "is 20 "full 00
|
|
|
|
@stack-is-empty "stack 20 "is 20 "empty 00
|
|
|
|
@arena-is-full "arena 20 "is 20 "full 00
|
2022-01-30 15:11:03 -05:00
|
|
|
@star-invariant "star 20 "invariant 20 "failed 00
|
2022-02-03 01:32:35 -05:00
|
|
|
@plus-invariant "plus 20 "invariant 20 "failed 00
|
|
|
|
@qmark-invariant "question 20 "mark 20 "invariant 20 "failed 00
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( REGEX MATCHING )
|
|
|
|
|
|
|
|
( use stored regex to match against a stored string. )
|
|
|
|
( )
|
|
|
|
( regex* should be the address of a compiled regex )
|
|
|
|
( such as that returned from ;compile. )
|
|
|
|
( )
|
|
|
|
( str* should be a null-terminated string. )
|
|
|
|
( )
|
|
|
|
( returns true if the string, and false otherwise. )
|
2022-04-11 00:16:31 -04:00
|
|
|
@rx-match ( str* regex* -> bool^ )
|
2022-02-20 15:06:57 -05:00
|
|
|
#01 ;match-multiline STA
|
2022-02-19 01:23:39 -05:00
|
|
|
#00 ;search-mode STA
|
2023-11-28 12:08:42 -05:00
|
|
|
rx-reset
|
|
|
|
!loop
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-04-11 00:16:31 -04:00
|
|
|
@rx-search-multiline ( str* regex* -> bool^ )
|
2022-02-20 15:06:57 -05:00
|
|
|
#01 ;match-multiline STA
|
|
|
|
#01 ;search-mode STA
|
2023-11-28 12:08:42 -05:00
|
|
|
!rx-search/main
|
2022-02-20 15:06:57 -05:00
|
|
|
|
2022-04-11 00:16:31 -04:00
|
|
|
@rx-search ( str* regex* -> bool^ )
|
|
|
|
#00 ;match-multiline STA
|
|
|
|
#01 ;search-mode STA
|
2022-04-18 23:17:44 -04:00
|
|
|
&main STH2 ( s* [r*] )
|
|
|
|
DUP2 ;string-start STA2 ( s* [r*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
&loop LDAk #00 EQU ?&eof ( s* [r*] )
|
|
|
|
rx-reset ( s* [r*] )
|
2022-02-19 01:23:39 -05:00
|
|
|
DUP2 ;search-start STA2 ( s* [r*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
DUP2 STH2kr loop ( s* b^ [r*] )
|
|
|
|
?&found ( s* [r*] )
|
|
|
|
INC2 !&loop ( s+1* [r*] )
|
2022-02-19 01:23:39 -05:00
|
|
|
&found POP2 POP2r #01 JMP2r ( 01 )
|
2023-11-28 12:08:42 -05:00
|
|
|
&eof rx-reset ( s* [r*] )
|
2022-02-19 01:23:39 -05:00
|
|
|
DUP2 ;search-start STA2 ( s* [r*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
STH2r !loop ( b^ )
|
2022-02-19 01:23:39 -05:00
|
|
|
|
2022-04-18 23:17:44 -04:00
|
|
|
( reset all "runtime" memory allocated during match/search )
|
|
|
|
@rx-reset ( -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
reset-stack
|
|
|
|
!subgroup-reset
|
2022-04-18 23:17:44 -04:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( loop used during matching )
|
|
|
|
( )
|
2022-01-22 22:54:51 -05:00
|
|
|
( we don't use the return stack here since that )
|
|
|
|
( complicates the back-tracking we need to do. )
|
|
|
|
( ultimately this code will issue a JMP2r to )
|
|
|
|
( return a boolean, which is where the stack )
|
|
|
|
( effects signature comes from. )
|
|
|
|
@loop ( s* r* -> bool^ )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk #01 EQU ?do-empty
|
|
|
|
LDAk #02 EQU ?do-dot
|
|
|
|
LDAk #03 EQU ?do-literal
|
|
|
|
LDAk #04 EQU ?do-or
|
|
|
|
LDAk #05 EQU ?do-or ( same code as the or case )
|
|
|
|
LDAk #06 EQU ?do-caret
|
|
|
|
LDAk #07 EQU ?do-dollar
|
|
|
|
LDAk #08 EQU ?do-lpar
|
|
|
|
LDAk #09 EQU ?do-rpar
|
|
|
|
LDAk #0a EQU ?do-ccls
|
|
|
|
LDAk #0b EQU ?do-ncls
|
|
|
|
LDAk #dd ;unknown-node-type errorm
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( used when we hit a dead-end during matching. )
|
|
|
|
( )
|
|
|
|
( if stack is non-empty we have a point we can resume from. )
|
2022-01-22 22:54:51 -05:00
|
|
|
@goto-backtrack ( -> bool^ )
|
2023-11-28 12:08:42 -05:00
|
|
|
stack-exist ?&has-stack ( do we have stack? )
|
2022-01-22 22:54:51 -05:00
|
|
|
#00 JMP2r ( no, return false )
|
2022-04-18 23:17:44 -04:00
|
|
|
&has-stack
|
2023-11-28 12:08:42 -05:00
|
|
|
pop4
|
|
|
|
subgroup-backtrack
|
|
|
|
!goto-next ( yes, resume from the top )
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( follow the given address (next*) to continue matching )
|
2022-01-22 22:54:51 -05:00
|
|
|
@goto-next ( str* next* -> bool^ )
|
2023-11-28 12:08:42 -05:00
|
|
|
DUP2 #0000 GTH2 ?&has-next
|
|
|
|
POP2 LDAk #00 EQU ?&end-of-string
|
|
|
|
;search-mode LDA ?&end-of-search
|
|
|
|
POP2 !goto-backtrack
|
2022-02-19 01:23:39 -05:00
|
|
|
&end-of-search DUP2 ;search-end STA2
|
|
|
|
&end-of-string POP2 #01 JMP2r
|
2023-11-28 12:08:42 -05:00
|
|
|
&has-next !loop
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-05-26 21:36:47 -04:00
|
|
|
( handle the empty node -- just follow the next pointer )
|
2022-01-22 22:54:51 -05:00
|
|
|
@do-empty ( str* regex* -> bool^ )
|
|
|
|
INC2 LDA2 ( load next )
|
2023-11-28 12:08:42 -05:00
|
|
|
!goto-next ( jump to next )
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-04-18 23:17:44 -04:00
|
|
|
( FIXME: not currently used )
|
2022-02-21 15:59:13 -05:00
|
|
|
@do-lpar ( str* regex* -> bool^ )
|
|
|
|
STH2 DUP2 ( s s [r] )
|
|
|
|
INC2r LDA2kr STH2r ( s s i [r+1] )
|
2023-11-28 12:08:42 -05:00
|
|
|
subgroup-start ( s [r+1] )
|
2022-02-21 15:59:13 -05:00
|
|
|
STH2r INC2 INC2 ( s r+3 )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDA2 !goto-next ( jump to next )
|
2022-02-21 15:59:13 -05:00
|
|
|
|
2022-04-18 23:17:44 -04:00
|
|
|
( FIXME: not currently used )
|
2022-02-21 15:59:13 -05:00
|
|
|
@do-rpar ( str* regex* -> bool^ )
|
|
|
|
STH2 DUP2 ( s s [r] )
|
|
|
|
INC2r LDA2kr STH2r ( s s i [r+1] )
|
2023-11-28 12:08:42 -05:00
|
|
|
subgroup-finish ( s [r+1] )
|
2022-02-21 15:59:13 -05:00
|
|
|
STH2r INC2 INC2 ( s r+3 )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDA2 !goto-next ( jump to next )
|
2022-02-21 15:59:13 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( handle dot -- match any one character )
|
2022-01-22 22:54:51 -05:00
|
|
|
@do-dot ( str* regex* -> bool^ )
|
2022-02-21 15:59:13 -05:00
|
|
|
INC2 LDA2 STH2 ( load and stash next )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk #00 NEQ ?&non-empty ( is there a char? )
|
|
|
|
&backtrack POP2r POP2 !goto-backtrack ( no, clear stacks and backtrack )
|
|
|
|
&non-empty LDAk #0a NEQ ?&match ( yes, match unless \n in search-mode )
|
|
|
|
;search-mode LDA ?&backtrack ( if \n and search-mode, treat as EOF )
|
|
|
|
&match INC2 STH2r !goto-next ( on match: inc s, restore and jump )
|
2022-02-20 15:06:57 -05:00
|
|
|
|
2022-02-21 15:59:13 -05:00
|
|
|
( hande caret -- match string start (or possibly after newline) without advancing )
|
2022-02-20 15:06:57 -05:00
|
|
|
@do-caret ( str* regex* -> bool^ )
|
2022-02-21 15:59:13 -05:00
|
|
|
INC2 LDA2 STH2 ( load and stash next )
|
2023-11-28 12:08:42 -05:00
|
|
|
DUP2 ;string-start LDA2 EQU2 ?&at-start ( at string start? )
|
|
|
|
;match-multiline LDA ?&no-match ( are we in multi-line mode? )
|
|
|
|
DUP2 #0001 SUB2 LDA #0a EQU ?&at-start ( just after newline? )
|
|
|
|
&no-match POP2r POP2 !goto-backtrack ( clear stacks and backtrack )
|
|
|
|
&at-start STH2r !goto-next ( go to next without advancing )
|
2022-02-20 15:06:57 -05:00
|
|
|
|
2022-02-21 15:59:13 -05:00
|
|
|
( hande dollar -- match string end (or possibly before newline) without advancing )
|
2022-02-20 15:06:57 -05:00
|
|
|
@do-dollar ( str* regex* -> bool^ )
|
2022-02-21 15:59:13 -05:00
|
|
|
INC2 LDA2 STH2 ( load and stash next )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk #00 EQU ?&at-end ( at string end? )
|
|
|
|
;match-multiline LDA ?&no-match ( are we in multi-line mode? )
|
|
|
|
LDAk #0a EQU ?&at-end ( at newline? )
|
|
|
|
&no-match POP2r POP2 !goto-backtrack ( clear stacks and backtrack )
|
|
|
|
&at-end STH2r !goto-next ( go to next without advancing )
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( handle literal -- match one specific character )
|
2022-01-22 22:54:51 -05:00
|
|
|
@do-literal ( str* regex* -> bool^ )
|
|
|
|
INC2
|
|
|
|
LDAk STH ( store c )
|
|
|
|
INC2 LDA2 STH2 ROTr ( store next, move c to top )
|
|
|
|
LDAk
|
2023-11-28 12:08:42 -05:00
|
|
|
STHr EQU ?&matches ( do we match this char? )
|
|
|
|
POP2r POP2 !goto-backtrack ( no, clear stacks and backtrack )
|
2022-01-22 22:54:51 -05:00
|
|
|
&matches
|
2023-11-28 12:08:42 -05:00
|
|
|
INC2 STH2r !goto-next ( yes, inc s, restore and jump )
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( handle or -- try the left branch but backtrack to the right if needed )
|
|
|
|
( )
|
2022-01-22 22:54:51 -05:00
|
|
|
( this also handles asteration, since it ends up having the same structure )
|
|
|
|
@do-or ( str* regex* -> bool^ )
|
|
|
|
INC2 OVR2 OVR2 #0002 ADD2 ( s r+1 s r+3 )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDA2 push4 ( save (s, right) in the stack for possible backtracking )
|
|
|
|
LDA2 !loop ( continue on left branch )
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-04-15 01:40:52 -04:00
|
|
|
@matches-cls ( str* regex* -> bool^ )
|
2023-11-28 12:08:42 -05:00
|
|
|
OVR2 LDA ?¬-null
|
2022-04-15 01:40:52 -04:00
|
|
|
( needs to have a character to match )
|
2023-11-28 12:08:42 -05:00
|
|
|
POP2 POP2 !goto-backtrack
|
2022-04-15 01:40:52 -04:00
|
|
|
¬-null
|
|
|
|
DUP2 INC2 LDA2 STH2 ( str regex [next] )
|
|
|
|
OVR2 INC2 STH2 ( str regex [str+1 next] )
|
|
|
|
SWP2 LDA STH ( regex [c str+1 next] )
|
|
|
|
#0003 ADD2 LDAk #00 SWP #0002 MUL2 ( r+3 len*2 [c str+1 next] )
|
|
|
|
SWP2 INC2 STH2k ADD2 STH2r ( r+4+len*2 r+4 [c str+1 next] )
|
|
|
|
&loop ( limit addr [c str+1 next] )
|
2023-11-28 12:08:42 -05:00
|
|
|
EQU2k ?&missing
|
|
|
|
LDAk STHkr GTH ?&next1 INC2
|
|
|
|
LDAk STHkr LTH ?&next2 !&found
|
2022-04-15 01:40:52 -04:00
|
|
|
&next1 INC2
|
2023-11-28 12:08:42 -05:00
|
|
|
&next2 INC2 !&loop
|
|
|
|
&missing POP2 POP2 POPr ,&negated LDR ?&match
|
|
|
|
&no-match POP2r POP2r !goto-backtrack
|
|
|
|
&found POP2 POP2 POPr ,&negated LDR ?&no-match
|
|
|
|
&match STH2r STH2r !goto-next
|
2022-04-15 01:40:52 -04:00
|
|
|
[ &negated $1 ]
|
|
|
|
|
|
|
|
( )
|
|
|
|
@do-ccls ( str* regex* -> bool^ )
|
2023-11-28 12:08:42 -05:00
|
|
|
#00 ,matches-cls/negated STR !matches-cls
|
2022-04-15 01:40:52 -04:00
|
|
|
|
|
|
|
( )
|
|
|
|
@do-ncls ( str* regex* -> bool^ )
|
2023-11-28 12:08:42 -05:00
|
|
|
#01 ,matches-cls/negated STR !matches-cls
|
2022-04-15 01:40:52 -04:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( REGEX PARSING )
|
|
|
|
|
2022-02-20 15:06:57 -05:00
|
|
|
( do we match across lines? )
|
|
|
|
( - should be true when matching )
|
|
|
|
( - can be true or false when searching )
|
|
|
|
( - affects syntax of . ^ and $ )
|
|
|
|
@match-multiline $1
|
|
|
|
|
2022-02-19 01:23:39 -05:00
|
|
|
( are we in searching mode? )
|
2022-02-20 15:06:57 -05:00
|
|
|
( - should be true when searching )
|
|
|
|
( - should be false when matching )
|
2022-02-19 01:23:39 -05:00
|
|
|
@search-mode $1
|
2022-02-20 15:06:57 -05:00
|
|
|
|
|
|
|
( )
|
|
|
|
@string-start $2
|
2022-02-19 01:23:39 -05:00
|
|
|
@search-start $2
|
2022-02-20 15:06:57 -05:00
|
|
|
@search-end $2
|
|
|
|
|
2022-01-29 23:49:53 -05:00
|
|
|
( track the position in the input string )
|
2022-01-29 23:13:10 -05:00
|
|
|
@pos $2
|
2022-01-29 23:49:53 -05:00
|
|
|
|
|
|
|
( track how many levels deep we are in parenthesis )
|
2022-01-29 23:13:10 -05:00
|
|
|
@parens $2
|
|
|
|
|
2022-04-18 23:17:44 -04:00
|
|
|
( how many subgroups have we seen so far? )
|
|
|
|
@groupnum $1
|
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( read and increment pos )
|
2022-01-29 23:13:10 -05:00
|
|
|
@read ( -> c^ )
|
|
|
|
;pos LDA2k ( pos s )
|
|
|
|
LDAk STHk #00 EQU ( pos s c=0 [c] )
|
2023-11-28 12:08:42 -05:00
|
|
|
?&is-eof ( pos s [c] )
|
2022-01-29 23:13:10 -05:00
|
|
|
INC2 ( pos s+1 [c] )
|
2023-11-28 12:08:42 -05:00
|
|
|
SWP2 STA2 !&return ( [c] )
|
2022-05-26 21:36:47 -04:00
|
|
|
&is-eof POP2 POP2
|
2022-01-30 15:11:03 -05:00
|
|
|
&return STHr ( c )
|
|
|
|
JMP2r
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( is pos currently pointing to a star? )
|
2022-01-30 14:59:30 -05:00
|
|
|
@peek-to-star ( -> is-star^ )
|
2022-09-10 13:30:15 -04:00
|
|
|
;pos LDA2 LDA LIT "* EQU JMP2r
|
2022-01-30 14:59:30 -05:00
|
|
|
|
2022-02-03 01:32:35 -05:00
|
|
|
( is pos currently pointing to a plus? )
|
|
|
|
@peek-to-plus ( -> is-plus^ )
|
2022-09-10 13:30:15 -04:00
|
|
|
;pos LDA2 LDA LIT "+ EQU JMP2r
|
2022-02-03 01:32:35 -05:00
|
|
|
|
|
|
|
( is pos currently pointing to a qmark? )
|
|
|
|
@peek-to-qmark ( -> is-qmark^ )
|
2022-09-10 13:30:15 -04:00
|
|
|
;pos LDA2 LDA LIT "? EQU JMP2r
|
2022-02-03 01:32:35 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( just increment pos )
|
2022-01-30 14:59:30 -05:00
|
|
|
@skip
|
2022-01-30 19:13:39 -05:00
|
|
|
;pos LDA2 INC2 ;pos STA2 JMP2r
|
2022-01-30 14:59:30 -05:00
|
|
|
|
2022-01-30 19:23:44 -05:00
|
|
|
( TODO: )
|
|
|
|
( 1. character groups: [] and [^] )
|
2022-02-03 01:32:35 -05:00
|
|
|
( 2. symbolic escapes, e.g. \n )
|
2022-01-30 19:23:44 -05:00
|
|
|
|
|
|
|
( STRETCH GOALS: )
|
|
|
|
( a. ^ and $ )
|
|
|
|
( b. counts: {n} and {m,n} )
|
|
|
|
( c. substring matching, i.e. searching )
|
|
|
|
( d. subgroup extraction )
|
|
|
|
( e. back-references, e.g \1 )
|
2022-02-20 15:06:57 -05:00
|
|
|
( f. non-capturing groups, e.g. (?:) )
|
2022-01-30 19:23:44 -05:00
|
|
|
|
2022-01-22 22:54:51 -05:00
|
|
|
( compile an expression string into a regex graph )
|
2022-01-30 19:13:39 -05:00
|
|
|
( )
|
|
|
|
( the regex will be allocated in the arena; if there is not )
|
|
|
|
( sufficient space an error will be thrown. )
|
|
|
|
( )
|
|
|
|
( the stack will also be used during parsing although unlike )
|
|
|
|
( the arena it will be released once compilation ends. )
|
2022-01-22 22:54:51 -05:00
|
|
|
@compile ( expr* -> regex* )
|
2022-01-29 23:13:10 -05:00
|
|
|
;pos STA2
|
|
|
|
#0000 ;parens STA2
|
2023-11-28 12:08:42 -05:00
|
|
|
rx-reset
|
|
|
|
!compile-region
|
2022-01-29 23:13:10 -05:00
|
|
|
|
|
|
|
( the basic strategy here is to build a stack of non-or )
|
|
|
|
( expressions to be joined together at the end of the )
|
|
|
|
( region. each stack entry has two regex addresses: )
|
|
|
|
( - the start of the regex )
|
|
|
|
( - the current tail of the regex )
|
|
|
|
( when we concatenate a new node to a regex we update )
|
|
|
|
( the second of these but not the first. )
|
|
|
|
( )
|
|
|
|
( the bottom of the stack for a given region is denoted )
|
|
|
|
( by #ffff #ffff. above that we start with #0000 #0000 )
|
|
|
|
( to signal an empty node. )
|
|
|
|
@compile-region ( -> r2* )
|
2023-11-28 12:08:42 -05:00
|
|
|
#ffff #ffff push4 ( stack delimiter )
|
|
|
|
#0000 #0000 push4 ( stack frame start )
|
2022-01-29 23:13:10 -05:00
|
|
|
@compile-region-loop
|
2023-11-28 12:08:42 -05:00
|
|
|
read
|
|
|
|
DUP #00 EQU ?c-done
|
|
|
|
DUP LIT "| EQU ?c-or
|
|
|
|
DUP LIT ". EQU ?c-dot
|
|
|
|
DUP LIT "^ EQU ?c-caret
|
|
|
|
DUP LIT "$ EQU ?c-dollar
|
|
|
|
DUP LIT "( EQU ?c-lpar
|
|
|
|
DUP LIT ") EQU ?c-rpar
|
|
|
|
DUP LIT "[ EQU ?c-lbrk
|
|
|
|
DUP LIT "] EQU ?c-rbrk
|
|
|
|
DUP LIT "\ EQU ?c-esc
|
|
|
|
DUP LIT "* EQU ?c-star
|
|
|
|
DUP LIT "+ EQU ?c-plus
|
|
|
|
DUP LIT "? EQU ?c-qmark
|
|
|
|
!c-char
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( either finalize the given r0/r1 or else wrap it in )
|
|
|
|
( a star node if a star is coming up next. )
|
|
|
|
( )
|
|
|
|
( we use this look-ahead approach rather than compiling )
|
|
|
|
( star nodes directly since the implementation is simpler. )
|
2022-01-30 15:11:03 -05:00
|
|
|
@c-peek-and-finalize ( r0* r1* -> r2* )
|
2023-11-28 12:08:42 -05:00
|
|
|
peek-to-star ( r0 r1 next-is-star? ) ?&next-is-star
|
|
|
|
peek-to-plus ( r0 r1 next-is-plus? ) ?&next-is-plus
|
|
|
|
peek-to-qmark ( r0 r1 next-is-qmark? ) ?&next-is-qmark
|
|
|
|
!&finally ( r0 r1 )
|
|
|
|
&next-is-star skip POP2 alloc-star DUP2 !&finally
|
|
|
|
&next-is-plus skip POP2 alloc-plus DUP2 !&finally
|
|
|
|
&next-is-qmark skip POP2 alloc-qmark DUP2 !&finally
|
|
|
|
&finally push-next !compile-region-loop
|
2022-01-30 14:59:30 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( called when we reach EOF of the input string )
|
|
|
|
( )
|
|
|
|
( as with c-rpar we have to unroll the current level )
|
|
|
|
( of the stack, building any or-nodes that are needed. )
|
|
|
|
( )
|
|
|
|
( this is where we detect unclosed parenthesis. )
|
2022-01-29 23:13:10 -05:00
|
|
|
@c-done ( c^ -> r2* )
|
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
;parens LDA2 #0000 GTH2 ?&mismatched-parens
|
|
|
|
unroll-stack POP2 JMP2r
|
|
|
|
&mismatched-parens ;mismatched-parens errorm
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( called when we read "|" )
|
|
|
|
( )
|
|
|
|
( since we defer building or-nodes until the end of the region )
|
|
|
|
( we just start a new stack frame and continue. )
|
2022-01-29 23:13:10 -05:00
|
|
|
@c-or ( c^ -> r2* )
|
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
#0000 #0000 push4
|
|
|
|
!compile-region-loop
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-04-18 23:17:44 -04:00
|
|
|
( called when we read left parenthesis )
|
2022-01-30 19:13:39 -05:00
|
|
|
( )
|
|
|
|
( this causes us to: )
|
|
|
|
( )
|
|
|
|
( 1. increment parens )
|
|
|
|
( 2. start a new region on the stack )
|
|
|
|
( 3. jump to compile-region to start parsing the new region )
|
2022-01-29 23:13:10 -05:00
|
|
|
@c-lpar ( c^ -> r2* )
|
|
|
|
POP
|
|
|
|
;parens LDA2 INC2 ;parens STA2 ( parens++ )
|
2023-11-28 12:08:42 -05:00
|
|
|
!compile-region
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-04-18 23:17:44 -04:00
|
|
|
( called when we read right parenthesis )
|
2022-01-30 19:13:39 -05:00
|
|
|
( )
|
|
|
|
( this causes us to: )
|
|
|
|
( )
|
|
|
|
( 1. check for mismatched parens )
|
|
|
|
( 2. decrement parens )
|
|
|
|
( 3. unroll the current region on the stack into one regex node )
|
|
|
|
( 4. finalize that node and append it to the previous region )
|
2022-05-26 21:36:47 -04:00
|
|
|
( 5. continue parsing )
|
2022-01-29 23:13:10 -05:00
|
|
|
@c-rpar ( c^ -> r2* )
|
2022-01-29 23:55:05 -05:00
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
;parens LDA2 #0000 EQU2 ?&mismatched-parens
|
2022-01-29 23:55:05 -05:00
|
|
|
;parens LDA2 #0001 SUB2 ;parens STA2 ( parens-- )
|
2023-11-28 12:08:42 -05:00
|
|
|
unroll-stack
|
|
|
|
!c-peek-and-finalize
|
|
|
|
&mismatched-parens ;mismatched-parens errorm
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-04-15 01:40:52 -04:00
|
|
|
( doesn't support weird things like []abc] or [-abc] or similar. )
|
|
|
|
( doesn't currently handle "special" escapes such as \n )
|
|
|
|
@c-lbrk ( c^ -> r2* )
|
|
|
|
POP LITr 00 ;pos LDA2 ( pos [0] )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk LIT "^ NEQ ?&normal INCr INC2 ( pos [negated?^] )
|
2022-04-15 01:40:52 -04:00
|
|
|
&normal
|
|
|
|
#0a STHr ADD ( src* type^ )
|
|
|
|
;arena-pos LDA2 STH2k ( src* type^ dst* [dst*] )
|
|
|
|
STA LIT2r 0004 ADD2r ( src* [dst+4] )
|
|
|
|
&left-parse ( src* [dst*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk LIT "] EQU ?&done
|
|
|
|
LDAk LIT "- EQU ?&error
|
|
|
|
LDAk LIT "\ NEQ ?&left INC2
|
2022-04-15 01:40:52 -04:00
|
|
|
&left
|
|
|
|
LDAk STH2kr STA INC2r
|
2023-11-28 12:08:42 -05:00
|
|
|
DUP2 INC2 LDA LIT "- NEQ ?&pre-right INC2 INC2
|
|
|
|
LDAk LIT "] EQU ?&error
|
|
|
|
LDAk LIT "- EQU ?&error
|
2022-04-15 01:40:52 -04:00
|
|
|
&pre-right
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk LIT "\ NEQ ?&right INC2
|
2022-04-15 01:40:52 -04:00
|
|
|
&right
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk STH2kr STA INC2 INC2r !&left-parse
|
2022-04-15 01:40:52 -04:00
|
|
|
&done ( src* [dst*] )
|
|
|
|
INC2 ;pos STA2 STH2r ( dst* )
|
|
|
|
DUP2 ;arena-pos LDA2 ( dst dst a )
|
|
|
|
#0004 ADD2 SUB2 #0002 DIV2 NIP ( dst (dst-(a+4))/2 )
|
|
|
|
;arena-pos LDA2 STH2k #0003 ADD2 STA ( dst [a] )
|
|
|
|
;arena-pos STA2 STH2r ( a )
|
2022-04-15 14:17:10 -04:00
|
|
|
#0000 OVR2 INC2 STA2 ( a )
|
2023-11-28 12:08:42 -05:00
|
|
|
DUP2 !c-peek-and-finalize
|
2022-04-15 01:40:52 -04:00
|
|
|
&error
|
|
|
|
#abcd #0000 DIV ( TODO error here )
|
|
|
|
|
|
|
|
@c-rbrk ( c^ -> r2* )
|
|
|
|
POP
|
|
|
|
#0000 DIV ( invariant: should never be seen )
|
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( called when we read "." )
|
|
|
|
( )
|
|
|
|
( allocates a dot-node and continues. )
|
2022-01-29 23:13:10 -05:00
|
|
|
@c-dot ( c^ -> r2* )
|
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
#02 alloc3
|
|
|
|
DUP2 !c-peek-and-finalize
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-02-20 15:06:57 -05:00
|
|
|
( called when we read "^" )
|
|
|
|
( )
|
|
|
|
( allocates a caret-node and continues. )
|
|
|
|
@c-caret ( c^ -> r2* )
|
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
#06 alloc3
|
|
|
|
DUP2 !c-peek-and-finalize
|
2022-02-20 15:06:57 -05:00
|
|
|
|
|
|
|
( called when we read "$" )
|
|
|
|
( )
|
|
|
|
( allocates a dollar-node and continues. )
|
|
|
|
@c-dollar ( c^ -> r2* )
|
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
#07 alloc3
|
|
|
|
DUP2 !c-peek-and-finalize
|
2022-01-30 19:13:39 -05:00
|
|
|
|
|
|
|
( called when we read "\" )
|
|
|
|
( )
|
2022-02-20 15:06:57 -05:00
|
|
|
( handles special sequences: \a \b \t \n \v \f \r )
|
2022-01-30 19:13:39 -05:00
|
|
|
( )
|
2022-02-20 15:06:57 -05:00
|
|
|
( otherwise, allocates a literal of the next character. )
|
2022-01-29 23:13:10 -05:00
|
|
|
@c-esc ( c^ -> r2* )
|
2023-11-28 12:08:42 -05:00
|
|
|
POP read
|
|
|
|
DUP LIT "a EQU ?&bel
|
|
|
|
DUP LIT "b EQU ?&bs
|
|
|
|
DUP LIT "t EQU ?&tab
|
|
|
|
DUP LIT "n EQU ?&nl
|
|
|
|
DUP LIT "v EQU ?&vtab
|
|
|
|
DUP LIT "f EQU ?&ff
|
|
|
|
DUP LIT "r EQU ?&cr
|
|
|
|
&default !c-char
|
|
|
|
&bel POP #07 !&default
|
|
|
|
&bs POP #08 !&default
|
|
|
|
&tab POP #09 !&default
|
|
|
|
&nl POP #0a !&default
|
|
|
|
&vtab POP #0b !&default
|
|
|
|
&ff POP #0c !&default
|
|
|
|
&cr POP #0d !&default
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( called when we read any other character )
|
|
|
|
( )
|
|
|
|
( allocates a literal-node and continues. )
|
|
|
|
@c-char ( c^ -> r2* )
|
2023-11-28 12:08:42 -05:00
|
|
|
alloc-lit ( lit )
|
|
|
|
DUP2 !c-peek-and-finalize
|
2022-01-30 19:13:39 -05:00
|
|
|
|
|
|
|
( called if we parse a "*" )
|
|
|
|
( )
|
|
|
|
( actually calling this means the code broke an invariant somewhere. )
|
2022-01-29 23:13:10 -05:00
|
|
|
@c-star ( c^ -> regex* )
|
2022-01-29 23:55:05 -05:00
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
;star-invariant errorm
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-02-03 01:32:35 -05:00
|
|
|
( called if we parse a "+" )
|
|
|
|
( )
|
|
|
|
( actually calling this means the code broke an invariant somewhere. )
|
|
|
|
@c-plus ( c^ -> regex* )
|
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
;plus-invariant errorm
|
2022-02-03 01:32:35 -05:00
|
|
|
|
|
|
|
( called if we parse a "?" )
|
|
|
|
( )
|
|
|
|
( actually calling this means the code broke an invariant somewhere. )
|
|
|
|
@c-qmark ( c^ -> regex* )
|
|
|
|
POP
|
2023-11-28 12:08:42 -05:00
|
|
|
;qmark-invariant errorm
|
2022-02-03 01:32:35 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( ALLOCATING REGEX NDOES )
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-04-11 00:16:31 -04:00
|
|
|
@rx-node-sizes
|
2022-04-15 01:40:52 -04:00
|
|
|
( 00 01 02 03 04 05 06 07 08 09 0a 0b )
|
2022-04-18 23:17:44 -04:00
|
|
|
[ 00 03 03 04 ] [ 05 05 03 03 ] [ 04 04 00 00 ]
|
2022-04-11 00:16:31 -04:00
|
|
|
|
2022-01-29 23:13:10 -05:00
|
|
|
@alloc3 ( mode^ -> r* )
|
|
|
|
#0000 ROT ( 00 00 mode^ )
|
2023-11-28 12:08:42 -05:00
|
|
|
#03 alloc ( 00 00 mode^ addr* )
|
2022-01-30 19:13:39 -05:00
|
|
|
STH2k STA ( addr <- mode )
|
|
|
|
STH2kr INC2 STA2 ( addr+1 <- 0000 )
|
|
|
|
STH2r JMP2r ( return addr )
|
2022-01-29 23:13:10 -05:00
|
|
|
|
|
|
|
@alloc-empty ( -> r* )
|
2023-11-28 12:08:42 -05:00
|
|
|
#01 !alloc3
|
2022-01-29 23:13:10 -05:00
|
|
|
|
|
|
|
@alloc-lit ( c^ -> r* )
|
2022-01-30 19:13:39 -05:00
|
|
|
#03 #0000 SWP2 ( 0000 c^ 03 )
|
2023-11-28 12:08:42 -05:00
|
|
|
#04 alloc ( 0000 c^ 03 addr* )
|
2022-01-30 19:13:39 -05:00
|
|
|
STH2k STA ( addr <- 03 )
|
|
|
|
STH2kr INC2 STA ( addr+1 <- c )
|
|
|
|
STH2kr #0002 ADD2 STA2 ( addr+2 <- 0000 )
|
|
|
|
STH2r JMP2r ( return addr )
|
2022-01-29 23:13:10 -05:00
|
|
|
|
|
|
|
@alloc-or ( right* left* -> r* )
|
2023-11-28 12:08:42 -05:00
|
|
|
#05 alloc STH2 ( r l [x] )
|
2022-01-29 23:13:10 -05:00
|
|
|
#04 STH2kr STA ( r l [x] )
|
|
|
|
STH2kr INC2 STA2 ( r [x] )
|
|
|
|
STH2kr #0003 ADD2 STA2 ( [x] )
|
|
|
|
STH2r JMP2r
|
|
|
|
|
|
|
|
@alloc-star ( expr* -> r* )
|
2023-11-28 12:08:42 -05:00
|
|
|
#05 alloc STH2 ( expr [r] )
|
2022-01-29 23:55:05 -05:00
|
|
|
#05 STH2kr STA ( expr [r] )
|
|
|
|
DUP2 STH2kr INC2 STA2 ( expr [r] )
|
|
|
|
#0000 STH2kr #0003 ADD2 STA2 ( expr [r] )
|
|
|
|
STH2kr SWP2 ( r expr [r] )
|
2023-11-28 12:08:42 -05:00
|
|
|
set-next ( [r] )
|
2022-01-29 23:13:10 -05:00
|
|
|
STH2r JMP2r
|
|
|
|
|
2022-02-03 01:32:35 -05:00
|
|
|
@alloc-plus ( expr* -> r* )
|
2023-11-28 12:08:42 -05:00
|
|
|
#05 alloc STH2 ( expr [r] )
|
2022-02-03 01:32:35 -05:00
|
|
|
#05 STH2kr STA ( expr [r] )
|
|
|
|
DUP2 STH2kr INC2 STA2 ( expr [r] )
|
|
|
|
#0000 STH2kr #0003 ADD2 STA2 ( expr [r] )
|
|
|
|
STH2r SWP2 STH2k ( r expr [expr] )
|
2023-11-28 12:08:42 -05:00
|
|
|
set-next ( [expr] )
|
2022-02-03 01:32:35 -05:00
|
|
|
STH2r JMP2r
|
|
|
|
|
|
|
|
@alloc-qmark ( expr* -> r* )
|
2023-11-28 12:08:42 -05:00
|
|
|
alloc-empty STH2k ( expr e [e] )
|
|
|
|
OVR2 set-next ( expr [e] )
|
|
|
|
#05 alloc STH2 ( expr [r e] )
|
2022-02-03 01:32:35 -05:00
|
|
|
#04 STH2kr STA ( expr [r e] )
|
|
|
|
STH2kr INC2 STA2 ( [r e] )
|
|
|
|
SWP2r STH2r STH2kr ( e r [r] )
|
|
|
|
#0003 ADD2 STA2 ( [r] )
|
|
|
|
STH2r JMP2r
|
|
|
|
|
2022-02-02 17:39:08 -05:00
|
|
|
( if r is 0000, allocate an empty node )
|
|
|
|
@alloc-if-null ( r* -> r2* )
|
2023-11-28 12:08:42 -05:00
|
|
|
ORAk ?&return POP2 alloc-empty &return JMP2r
|
2022-02-02 17:39:08 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( unroll one region of the parsing stack, returning )
|
2022-01-29 23:13:10 -05:00
|
|
|
( a single node consisting of an alternation of )
|
|
|
|
( all elements on the stack. )
|
|
|
|
( )
|
|
|
|
( this unrolls until it hits #ffff #ffff, which it )
|
|
|
|
( also removes from the stack. )
|
|
|
|
@unroll-stack ( -> start* end* )
|
2023-11-28 12:08:42 -05:00
|
|
|
pop4 STH2 ( r )
|
2022-01-29 23:56:57 -05:00
|
|
|
#00 STH ( count items in stack frame )
|
2023-11-28 12:08:42 -05:00
|
|
|
alloc-if-null ( replace 0000 with empty )
|
2022-01-29 23:56:57 -05:00
|
|
|
&loop ( r* )
|
2023-11-28 12:08:42 -05:00
|
|
|
pop4 POP2 ( r x )
|
|
|
|
DUP2 #ffff EQU2 ( r x x-is-end? ) ?&done
|
2022-01-29 23:56:57 -05:00
|
|
|
INCr ( items++ )
|
2023-11-28 12:08:42 -05:00
|
|
|
alloc-or ( r|x ) !&loop
|
2022-01-29 23:13:10 -05:00
|
|
|
&done
|
2022-01-29 23:56:57 -05:00
|
|
|
( r ffff )
|
2022-01-29 23:13:10 -05:00
|
|
|
POP2
|
2023-11-28 12:08:42 -05:00
|
|
|
STHr ?&is-or
|
2022-01-29 23:13:10 -05:00
|
|
|
STH2r JMP2r
|
|
|
|
&is-or
|
|
|
|
POP2r
|
2023-11-28 12:08:42 -05:00
|
|
|
alloc-empty OVR2 OVR2 SWP2 ( r empty empty r )
|
|
|
|
set-next-or
|
2022-01-29 23:13:10 -05:00
|
|
|
JMP2r
|
|
|
|
|
|
|
|
( add r to the top of the stock. )
|
|
|
|
( )
|
|
|
|
( in particular, this will write r into tail.next )
|
|
|
|
( before replacing tail with r. )
|
|
|
|
@push-next ( r0 r1 -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
pop4 ( r0 r1 x0 x1 )
|
|
|
|
DUP2 #0000 EQU2 ( r0 r1 x0 x1 x1=0? ) ?&is-zero
|
2022-01-29 23:13:10 -05:00
|
|
|
STH2 ROT2 STH2r ( r1 x0 r0 x1 )
|
2023-11-28 12:08:42 -05:00
|
|
|
set-next SWP2 ( x0 r1 )
|
|
|
|
push4
|
2022-01-29 23:13:10 -05:00
|
|
|
JMP2r
|
2023-11-28 12:08:42 -05:00
|
|
|
&is-zero POP2 POP2 !push4
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( load the given address: )
|
|
|
|
( )
|
|
|
|
( 1. if it points to 0000, update it to target )
|
|
|
|
( 2. otherwise, call set-next on it )
|
2022-01-29 23:13:10 -05:00
|
|
|
@set-next-addr ( target* addr* -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDA2k #0000 EQU2 ( target addr addr=0? ) ?&is-zero
|
|
|
|
LDA2 !set-next
|
2022-01-29 23:56:57 -05:00
|
|
|
&is-zero STA2 JMP2r
|
2022-01-29 23:13:10 -05:00
|
|
|
|
|
|
|
( set regex.next to target )
|
2022-02-21 15:59:13 -05:00
|
|
|
( )
|
|
|
|
( node types 1-7 are defined. )
|
|
|
|
( )
|
|
|
|
( all node types except star (5) and lit (3) store their next )
|
|
|
|
( pointer one byte off of their own address. )
|
|
|
|
( )
|
|
|
|
( since both branches of an or (4) node are supposed to meet )
|
|
|
|
( back up we only bother taking the left branch. otherwise )
|
|
|
|
( you can end up double-appending things. )
|
2022-01-29 23:13:10 -05:00
|
|
|
@set-next ( target* regex* -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk #01 LTH ?&unknown
|
|
|
|
LDAk #0b GTH ?&unknown
|
|
|
|
LDAk #09 GTH ?&cc
|
2022-04-11 00:16:31 -04:00
|
|
|
LDAk #00 SWP ;rx-node-sizes ADD2
|
|
|
|
LDA #00 SWP ADD2 #0002 SUB2
|
2023-11-28 12:08:42 -05:00
|
|
|
!set-next-addr
|
|
|
|
&cc INC2 !set-next-addr
|
|
|
|
&unknown LDAk #ee ;unknown-node-type errorm
|
2022-02-02 17:39:08 -05:00
|
|
|
|
|
|
|
@set-next-or-addr ( target* addr* -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDA2k #0000 EQU2 ( target addr addr=0? ) ?&is-zero
|
|
|
|
LDA2 !set-next-or
|
2022-02-02 17:39:08 -05:00
|
|
|
&is-zero STA2 JMP2r
|
|
|
|
|
|
|
|
( this is used when first building or-nodes )
|
|
|
|
( structure will always be: )
|
|
|
|
( [x1, [x2, [x3, ..., [xm, xn]]]] )
|
|
|
|
( so we recurse on the right side but not the left. )
|
|
|
|
@set-next-or ( target* regex* -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
LDAk #04 NEQ ?&!4
|
|
|
|
OVR2 OVR2 INC2 set-next-addr
|
|
|
|
#0003 ADD2 !set-next-or-addr
|
|
|
|
&!4 !set-next
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( STACK OPERATIONS )
|
|
|
|
( )
|
|
|
|
( we always push/pop 4 bytes at a time. the stack has a fixed )
|
|
|
|
( maximum size it can use, defined by ;stack-top. )
|
|
|
|
( )
|
|
|
|
( the stack can be cleared using ;reset-stack, which resets )
|
|
|
|
( the stack pointers but does not zero out any memory. )
|
2022-01-30 19:26:16 -05:00
|
|
|
( )
|
|
|
|
( stack size is 4096 bytes here but is configurable. )
|
|
|
|
( in some cases it could be very small but this will limit )
|
|
|
|
( how many branches can be parsed and executed. )
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( push 4 bytes onto the stack )
|
2022-01-29 23:13:10 -05:00
|
|
|
@push4 ( str* regex* -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
assert-stack-avail ( check for space )
|
2022-01-22 22:54:51 -05:00
|
|
|
;stack-pos LDA2 #0002 ADD2 STA2 ( cell[2:3] <- regex )
|
|
|
|
;stack-pos LDA2 STA2 ( cell[0:1] <- str )
|
|
|
|
;stack-pos LDA2 #0004 ADD2 ;stack-pos STA2 ( pos += 4 )
|
|
|
|
JMP2r
|
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( pop 4 bytes from the stack )
|
2022-01-29 23:13:10 -05:00
|
|
|
@pop4 ( -> str* regex* )
|
2023-11-28 12:08:42 -05:00
|
|
|
assert-stack-exist ( check for space )
|
2022-01-22 22:54:51 -05:00
|
|
|
;stack-pos LDA2 ( load stack-pos )
|
|
|
|
#0002 SUB2 LDA2k STH2 ( pop and stash regex )
|
|
|
|
#0002 SUB2 LDA2k STH2 ( pop and stash str )
|
|
|
|
;stack-pos STA2 ( save new stack-pos )
|
|
|
|
STH2r STH2r ( restore str and regex )
|
|
|
|
JMP2r
|
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( reset stack pointers )
|
2022-01-22 22:54:51 -05:00
|
|
|
@reset-stack ( -> )
|
|
|
|
;stack-bot ;stack-pos STA2 JMP2r ( pos <- 0 )
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( can more stack be allocated? )
|
2022-01-22 22:54:51 -05:00
|
|
|
@stack-avail ( -> bool^ )
|
|
|
|
;stack-pos LDA2 ;stack-top LTH2 JMP2r
|
2022-01-30 19:13:39 -05:00
|
|
|
|
|
|
|
( is the stack non-empty? )
|
2022-01-22 22:54:51 -05:00
|
|
|
@stack-exist ( -> bool^ )
|
|
|
|
;stack-pos LDA2 ;stack-bot GTH2 JMP2r
|
2022-01-30 19:13:39 -05:00
|
|
|
|
|
|
|
( error if stack is full )
|
2022-01-30 14:34:50 -05:00
|
|
|
@assert-stack-avail ( -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
stack-avail ?&ok ;stack-is-full errorm &ok JMP2r
|
2022-01-30 19:13:39 -05:00
|
|
|
|
|
|
|
( error is stack is empty )
|
2022-01-30 14:34:50 -05:00
|
|
|
@assert-stack-exist ( -> )
|
2023-11-28 12:08:42 -05:00
|
|
|
stack-exist ?&ok ;stack-is-empty errorm &ok JMP2r
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( stack-pos points to the next free stack position (or the top if full). )
|
2024-07-07 23:43:30 -04:00
|
|
|
@stack-pos =stack-bot ( the next position to insert at )
|
2022-01-30 19:13:39 -05:00
|
|
|
|
|
|
|
( stack-bot is the address of the first stack position. )
|
|
|
|
( stack-top is the address of the first byte beyond the stack. )
|
2022-04-18 23:17:44 -04:00
|
|
|
@stack-bot $800 @stack-top ( holds 512 steps (2048 bytes) )
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( ARENA OPERATIONS )
|
|
|
|
( )
|
|
|
|
( the arena represents a heap of memory that can easily be )
|
|
|
|
( allocated in small amounts. )
|
|
|
|
( )
|
|
|
|
( the entire arena can be reclaimed using ;reset-arena, but )
|
|
|
|
( unlike systems such as malloc/free, the arena cannot relcaim )
|
|
|
|
( smaller amounts of memory. )
|
|
|
|
( )
|
|
|
|
( the arena is used to allocate regex graph nodes, which are )
|
|
|
|
( dynamically-allocated as the regex string is parsed. once )
|
|
|
|
( a regex is no longer needed the arena may be reclaimed. )
|
2022-01-30 19:26:16 -05:00
|
|
|
( )
|
|
|
|
( arena size is 1024 bytes here but is configurable. )
|
|
|
|
( smaller sizes would likely be fine but will limit the )
|
|
|
|
( overall complexity of regexes to be parsed and executed. )
|
2022-01-29 23:13:10 -05:00
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( reclaim all the memory used by the arena )
|
2022-01-22 22:54:51 -05:00
|
|
|
@reset-arena ( -> )
|
|
|
|
;arena-bot ;arena-pos STA2 JMP2r
|
|
|
|
|
2022-01-30 19:13:39 -05:00
|
|
|
( currently caller is responsible for zeroing out memory if needed )
|
2022-01-22 22:54:51 -05:00
|
|
|
@alloc ( size^ -> addr* )
|
|
|
|
#00 SWP ( size* )
|
2022-01-30 14:34:50 -05:00
|
|
|
;arena-pos LDA2 STH2k ADD2 ( pos+size* [pos] )
|
|
|
|
DUP2 ;arena-top GTH2 ( pos+size pos+size>top? [pos] )
|
2023-11-28 12:08:42 -05:00
|
|
|
?&error ( pos+size [pos] )
|
2022-01-30 14:34:50 -05:00
|
|
|
;arena-pos STA2 ( pos += size [pos] )
|
|
|
|
STH2r JMP2r ( pos )
|
2023-11-28 12:08:42 -05:00
|
|
|
&error POP2 POP2r ;arena-is-full errorm
|
2022-01-22 22:54:51 -05:00
|
|
|
|
2024-07-07 23:43:30 -04:00
|
|
|
@arena-pos =arena-bot ( the next position to allocate )
|
2022-01-22 22:54:51 -05:00
|
|
|
@arena-bot $400 @arena-top ( holds up to 1024 bytes )
|
2022-02-06 17:07:11 -05:00
|
|
|
|
2022-02-21 15:59:13 -05:00
|
|
|
( SUBGROUP OPERATIONS )
|
|
|
|
( )
|
|
|
|
( subgroups are parts of the input string that are matched by )
|
|
|
|
( parenthesized subgroup expressions in a regex. )
|
|
|
|
( )
|
|
|
|
( for example, (a*)(b*)(c*) has 3 subgroup expressions. )
|
|
|
|
( )
|
2022-04-18 23:17:44 -04:00
|
|
|
( during matching, subgroups are represented by 5-bytes: )
|
2022-02-21 15:59:13 -05:00
|
|
|
( )
|
2022-04-18 23:17:44 -04:00
|
|
|
( - byte 1: subgroup index (1-255, 0 is a marker) )
|
|
|
|
( - bytes 2-3: absolute address of the start of the subgroup )
|
|
|
|
( - bytes 4-5: absolute address of the limit of the subgroup )
|
2022-02-21 15:59:13 -05:00
|
|
|
( )
|
|
|
|
( this means that to get a null-terminated subgroup string )
|
|
|
|
( you will need to copy it somewhere else with enough space, )
|
|
|
|
( or else mutate the input string to add a null. )
|
|
|
|
( )
|
|
|
|
( since input strings themselves are null-terminated, and since )
|
|
|
|
( subgroups never include null terminators, we will always have )
|
|
|
|
( a valid limit value even for input strings that end at #ffff. )
|
|
|
|
( )
|
|
|
|
( during regex parsing we will use subgroup-pos to track the )
|
|
|
|
( next available subgroup position. )
|
2022-04-18 23:17:44 -04:00
|
|
|
( )
|
|
|
|
( some regular expressions will write to a subgroup multiple times. )
|
|
|
|
( for example when matching ((.)x)+ against "axbx": )
|
|
|
|
( )
|
|
|
|
( - subgroup 1 will contain "bx" )
|
|
|
|
( - subgroup 2 will contain "b" )
|
|
|
|
( )
|
|
|
|
( this may necessitate backtracking. when matching ((.)x|(.)y)+ )
|
|
|
|
( against "axby" we will make the following assignments: )
|
|
|
|
( )
|
|
|
|
( - position 1: )
|
|
|
|
( + start subgroup 1 )
|
|
|
|
( + start then finish subgroup 2: "a" )
|
|
|
|
( - position 2: )
|
|
|
|
( + finish subgroup 1: "ax" )
|
|
|
|
( - position 3: )
|
|
|
|
( + start subgroup 1 )
|
|
|
|
( + start then finish subgroup 2: "b" )
|
|
|
|
( - position 4: )
|
|
|
|
( + backtrack, reverting subgroup 2 to "a" )
|
|
|
|
( - back to position 3 again: )
|
|
|
|
( + start then finish subgroup 3: "b" )
|
|
|
|
( - position 4 again: )
|
|
|
|
( + finish subgruop 1: "by" )
|
|
|
|
( )
|
|
|
|
( the final subgroups will be: {1: "by", 2: "a", 3: "b"} )
|
|
|
|
|
|
|
|
@subgroup-start ( s* i^ -> )
|
|
|
|
;subgroup-pos LDA2 STH2k ( s* i^ pos* [pos*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
;subgroup-top LTH2 ?&next #0000 DIV ( too many subgroups )
|
2022-04-18 23:17:44 -04:00
|
|
|
&next ( s* i^ [pos*] )
|
|
|
|
STH2kr STA
|
|
|
|
STH2r INC2 STA2
|
|
|
|
JMP2r
|
|
|
|
|
|
|
|
@subgroup-finish ( s* i^ -> )
|
|
|
|
;subgroup-pos LDA2 STH2k ( s* i^ pos* [pos*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
;subgroup-top LTH2 ?&next #0000 DIV ( too many subgroups )
|
2022-04-18 23:17:44 -04:00
|
|
|
&next ( s* i^ [pos*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
STH2kr LDA EQU ?&ok #0000 DIV ( mismatched subgroups )
|
2022-04-18 23:17:44 -04:00
|
|
|
&ok ( s* [pos*] )
|
|
|
|
STH2kr #0003 ADD2 STA2
|
|
|
|
STH2r #0005 ADD2 ;subgroup-pos STA2
|
|
|
|
JMP2r
|
|
|
|
|
|
|
|
@subgroup-branch ( -> )
|
|
|
|
;subgroup-pos LDA2 STH2k ( pos* [pos*] )
|
2023-11-28 12:08:42 -05:00
|
|
|
;subgroup-top LTH2 ?&next #0000 DIV ( too many subgroups )
|
2022-04-18 23:17:44 -04:00
|
|
|
&next
|
|
|
|
#00 STH2kr STA ( [*pos] )
|
|
|
|
STH2r #0005 ADD2 ;subgroup-pos STA2
|
|
|
|
JMP2r
|
|
|
|
|
|
|
|
@subgroup-backtrack ( -> )
|
|
|
|
;subgroup-bot ;subgroup-pos LDA2 ( bot* pos* )
|
|
|
|
&loop ( bot* pos* )
|
2023-11-28 12:08:42 -05:00
|
|
|
EQU2k ?&done
|
|
|
|
LDAk #00 EQU ?&done
|
|
|
|
#0005 SUB2 !&loop
|
2022-04-18 23:17:44 -04:00
|
|
|
&done ( bot* pos* )
|
|
|
|
NIP2 ;subgroup-pos STA2
|
|
|
|
JMP2r
|
|
|
|
|
|
|
|
( does not zero out the memory in question )
|
2022-02-21 15:59:13 -05:00
|
|
|
@subgroup-reset ( -> )
|
|
|
|
;subgroup-bot ;subgroup-pos STA2
|
2022-04-18 23:17:44 -04:00
|
|
|
JMP2r
|
|
|
|
|
2024-07-07 23:43:30 -04:00
|
|
|
@subgroup-pos =subgroup-bot ( the position of the first unallocated subgroup item )
|
2022-04-18 23:17:44 -04:00
|
|
|
@subgroup-bot $280 @subgroup-top ( holds up to 128 subgroup assignments (640 bytes) )
|