Generic element-grouping function
Several years ago, I wrote a simplistic HTML parser, for which I wrote a generalization of the function you request, that works on different levels of expression, and groups elements at specified start and end positions in specified heads.
Implementation
Here is the code (I made no effort to improve it, so it may not be optimal or the most elegant):
ClearAll[
listSplit,
reconstructIntervals,
groupElements,
groupPositions,
processPosList,
groupElementsNested
];
(* My analog of Internal`PartitionRagged*)
listSplit[x_List,lengthlist_List,headlist_List]:=
MapThread[
#1@@Take[x,#2]&,
{
headlist,
Transpose[({Most[#1]+1,Rest[#1]}&)[FoldList[Plus,0,lengthlist]]]
}
];
(* Reconstruct split intervals from positions *)
reconstructIntervals[listlen_Integer,ints_List]:=
Module[{missed,startint,lastint},
startint=If[ints[[1,1]]==1,{},{1,ints[[1,1]]-1}];
lastint=If[ints[[-1,-1]]==listlen,{},{ints[[-1,-1]]+1,listlen}];
missed=
(If[#1[[2,1]]-#1[[1,2]]>1,{#1[[1,2]]+1,#1[[2,1]]-1},{}]&)/@
Partition[ints,2,1];
missed=Join[missed,{lastint}];
Prepend[Flatten[Transpose[{ints,missed}],1],startint]
];
(* Groups elements and wraps them in specified heads *)
groupElements[lst_List,poslist_List,headlist_List]/;
OrderedQ[Flatten[Sort[poslist]]]&&Length[headlist]==Length[poslist]:=
Module[{totalheadlist,allints,llist},
totalheadlist=
Append[
Flatten[
Transpose[{Array[Sequence&,{Length[headlist]}],headlist}],1
],
Sequence
];
allints=reconstructIntervals[Length[lst],poslist];
llist=(If[#1==={},0,1-Subtract@@#1]&)/@allints;
listSplit[lst,llist,totalheadlist]
];
(* To work on general heads, we need this *)
groupElements[h_[x__],poslist_List,headlist_List]:=
h[Sequence@@groupElements[{x},poslist,headlist]];
(* If we have a single head *)
groupElements[expr_,poslist_List,head_]:=
groupElements[expr,poslist,Table[head,{Length[poslist]}]];
(* Group positions, to create a specification for nested grouping functon*)
groupPositions[plist_List]:=
Reap[(Sow[Last[#1],{Most[#1]}]&)/@plist,_,List][[2]];
(* Finds pairs of positions for opening and closing elements, to form intervals *)
processPosList::nomtchMessageName="Unmatched lists for positions `1`";
processPosList[{openlist_List,closelist_List}]:=
Module[{opengroup,closegroup,poslist},
{opengroup,closegroup}=groupPositions/@{openlist,closelist};
poslist=Transpose[(Transpose[Sort[#1]]&)/@{opengroup,closegroup}];
If[
UnsameQ@@poslist[[1]]
,
Message[processPosList::nomtchMessageName,{openlist,closelist}];
Return[{}]
,
(*else*)
poslist=Transpose[{poslist[[1,1]],Transpose/@Transpose[poslist[[2]]]}]
]
];
(* Main function to group elements in an expression *)
groupElementsNested[nested_,{openposlist_List,closeposlist_List},head_]/;
Head[head]=!=List:=
Fold[
Function[{x,y},MapAt[groupElements[#1,y[[2]],head]&,x,{y[[1]]}]],
nested,
Sort[
processPosList[{openposlist,closeposlist}],
Length[#2[[1]]]<Length[#1[[1]]]&
]
];
This code was instrumental for the kind of HTML parser I wanted - it was a breadth-first parser that could also (partially) parse certain types of malformed HTML documents. Speed of parsing was important, and the above functions were reasonably fast for my purposes.
Examples
Combining integers in a nested list based on a list of positions
Create a test expression (nested list here, but in general this can be a general expression)
testlist =
Partition[Partition[Sort[Table[Random[Integer, {1, 200}], {64}]], 8], {2, 8}]
(*
{{{{6, 8, 12, 13, 17, 17, 22, 23}, {26, 34, 35, 39, 39, 41,44, 49}}},
{{{51, 52, 53, 54, 61, 64, 65, 67}, {67, 69, 73, 78,82, 87, 91, 93}}},
{{{95,98,100,113,124,129,132,132}, {135,140,142,149,150,153,155, 157}}},
{{{160,162,163,165,167,167,170,181}, {183,185,185,191,193,194,196,197}}}}
*)
Here is a sample list of positions. Its first element is a list of opening positions for the element intervals, and the second list is a list of closing positions.
positions = {
{{1, 1, 1, 1}, {1, 1, 1, 6}, {2, 1, 2, 3}, {3,1, 1, 2}, {3, 1, 2, 1}},
{{1, 1, 1, 4}, {1, 1, 1, 8}, {2, 1, 2, 5}, {3, 1, 1, 7}, {3, 1, 2, 5}}
};
We now group elements according to the position specification above, and wrap them in the head hd:
groupElementsNested[testlist, positions , hd]
(*
{{{{hd[6, 8, 12, 13], 17, hd[17, 22, 23]}, {26, 34, 35, 39,39, 41, 44, 49}}},
{{{51, 52, 53, 54, 61, 64, 65, 67}, {67, 69, hd[73, 78, 82], 87, 91, 93}}},
{{{95, hd[98, 100, 113, 124, 129, 132], 132}, {hd[135, 140, 142, 149, 150], 153, 155, 157}}},
{{{160, 162, 163, 165, 167, 167, 170, 181}, {183, 185, 185, 191, 193, 194, 196, 197}}}}
*)
Combining consecutive elements matching the same pattern
One can also do more interesting things. Here are some additional functions that would allow us to group elements which all match certain pattern:
ClearAll[getOpenClosePositions];
getOpenClosePositions[expr_,patt_]:=
(If[#1==={},{},Transpose[#1]]&)[
Reap[
(Sow[#1,{Most[#1]}]&)/@Position[expr,patt],
_,
({First[#1],Last[#1]}&)[#2]&
][[2]]
];
ClearAll[groupMatched];
groupMatched[expr_,patt_,head_]:=
groupElementsNested[expr,getOpenClosePositions[expr,patt],head];
Now we can, for example, group according to some pattern, such as
groupMatched[
testlist,
x_Integer/;IntervalMemberQ[Interval[{1,15},{30,40},{100,130}],x],
hd
]
(*
{
{{{hd[6,8,12,13],17,17,22,23},{26,hd[34,35,39,39],41,44,49}}},
{{{51,52,53,54,61,64,65,67},{67,69,73,78,82,87,91,93}}},
{{{95,98,hd[100,113,124,129],132,132},{135,140,142,149,150,153,155,157}}},
{{{160,162,163,165,167,167,170,181},{183,185,185,191,193,194,196,197}}}
}
*)
Simple Mathematica FullForm parser
Here will be a less trivial example: parse a string of Mathematica code, given that it represents the FullForm (which makes the task of parsing vastly simpler). Here is one possible implementation for such a parser:
ClearAll[parse,parsedToCode,tokenize,Bracket];
(* "tokenize" our string *)
tokenize[code_String]:=
Module[{n=0,tokenrules},
tokenrules=
{
"[":>{"Open",++n},
"]":>{"Close",n--},
Whitespace|""~~","~~Whitespace|""
};
DeleteCases[StringSplit[code,tokenrules],"",\[Infinity]]
];
(*
** parses the "tokenized" string in the breadth-first manner starting
** with the outermost brackets, using Fold and groupElementsNested
*)
parse[preparsed_]:=
Module[
{
maxdepth=Max[Cases[preparsed,_Integer,\[Infinity]]],
popenlist,
parsed,
bracketPositions
},
bracketPositions[expr_,brdepth_Integer]:=
{Position[expr,{"Open",brdepth}],Position[expr,{"Close",brdepth}]};
parsed=
Fold[
groupElementsNested[#1,bracketPositions[#1,#2],Bracket]&,
preparsed,
Range[maxdepth]
];
parsed=DeleteCases[parsed,{"Open"|"Close",_},\[Infinity]];
parsed=parsed//.
h_[x___,y_,Bracket[z___],t___]:>h[x,y[z],t]
];
(* convert our parsed expression into a code that Mathematica can execute *)
parsedToCode[parsed_]:=
ReleaseHold[
(#1//.
x_String:>ToExpression[x,InputForm,HoldForm]&)//@parsed/.
HoldPattern[Sequence[x__][y__]]:>x[y]
];
This is a breadth-first parser. Basically, it starts from a list of string tokens and iteratively grows the resulting expression from it, in a breadth-first manner.
Here is an example to show how it works. First construct a string to parse:
stringToParse=ToString[DownValues[groupPositions]]
"List[RuleDelayed[HoldPattern[groupPositions[Pattern[plist, Blank[List]]]], Part[Reap[Map[Function[Sow[Last[Slot[1]], List[Most[Slot[1]]]]], plist], Blank[], List], 2]]]"
Now tokenize the string. The process of tokenizing also identifies square brackets and adds their depth.
initlist=tokenize[stringToParse]
(*
{"List", {"Open", 1}, "RuleDelayed", {"Open",2}, "HoldPattern", {"Open", 3},
"groupPositions", {"Open", 4}, "Pattern", {"Open", 5}, "plist", "Blank",
{"Open", 6}, "List", {"Close", 6}, {"Close", 5}, {"Close", 4}, {"Close", 3},
"Part", {"Open", 3}, "Reap", {"Open", 4}, "Map", {"Open", 5}, "Function",
{"Open", 6}, "Sow", {"Open", 7}, "Last", {"Open", 8}, "Slot", {"Open", 9},
"1", {"Close", 9}, {"Close", 8}, "List", {"Open", 8}, "Most", {"Open", 9},
"Slot", {"Open", 10}, "1", {"Close", 10}, {"Close", 9}, {"Close", 8},
{"Close", 7}, {"Close", 6}, "plist", {"Close", 5}, "Blank", {"Open", 5},
{"Close", 5}, "List", {"Close", 4}, "2", {"Close", 3}, {"Close", 2},
{"Close", 1}}
*)
We can now parse this. The idea is to combine together all elements in between the closest pair of opening and closing square brackets of the same depth. This is done in the line with Fold, in the parse function.
parse[tokenize[stringToParse]]
{
"List"[
"RuleDelayed"[
"HoldPattern"["groupPositions"["Pattern"["plist","Blank"["List"]]]],
"Part"[
"Reap"[
"Map"[
"Function"["Sow"["Last"["Slot"["1"]],"List"["Most"["Slot"["1"]]]]],
"plist"
],
"Blank"[],
"List"
],
"2"
]
]
]
}
At each pass (single iteration in Fold), we process all brackets at the same depth in an expression. The important thing here is that after the first pass, the resulting expression becomes nested, and is no longer a simple list of tokens. Therefore, here we do need the full power of groupElementsNested function to repeatedly combine elements deeper and deeper inside expression being built.
Finally, we can use the parsedToCode function to convert the above result (where all heads are still strings) into Mathematica code:
parsedToCode[parse[tokenize[stringToParse]]]
{
{
HoldPattern[groupPositions[plist_List]]:>
Reap[(Sow[Last[#1],{Most[#1]}]&)/@plist,_,List][[2]]
}
}
which is what we started with.
The HTML parser mentioned before is based on the same set of ideas, but it has more types of "brackets" - instead of just square brackets, all HTML tags play a role of brackets of different types.