I have a large amount of data stored in a .csv file. To ensure that the input data is OK, I run a lot of tests. Among other things, I would like to test whether a particular type (typA) has the same description more than once. In my case, that would be a mistake!
To detect this error, I would like to store all descriptions in an array. Whenever a new one is added, I would iterate through the array and check whether the string is already part of it.
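The check described above can be sketched with an expl3 sequence. This is only a minimal sketch under assumptions: the names \g_demo_seen_seq and \checkDescription are hypothetical and do not appear in the MWE, and it relies on \seq_if_in:NnTF for the membership test instead of a hand-written loop.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical names for illustration only
\seq_new:N \g_demo_seen_seq
\NewDocumentCommand{\checkDescription}{m}
 {
  % Is the description already stored in the sequence?
  \seq_if_in:NnTF \g_demo_seen_seq { #1 }
   { Duplicate:~#1 \par }
   {
    % Not seen yet: remember it globally
    \seq_gput_right:Nn \g_demo_seen_seq { #1 }
    New:~#1 \par
   }
 }
\ExplSyntaxOff
\begin{document}
\checkDescription{description0}
\checkDescription{description2}
\checkDescription{description2}
\end{document}
```

With literal strings as arguments, the third call should be reported as a duplicate. When the argument is a macro, it would have to be expanded before the comparison.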
To solve this, I started from this old question: Storing an array of strings in a command.
In particular, I took egreg's "Update for TUG 2024" answer here https://tex.stackexchange.com/a/215571/104358 for the array handling. I only added the edits from the comments so that the array is stored persistently.
Here is my MWE:
\documentclass[10pt,oneside,a4paper]{article}
\usepackage{ifthen}
\usepackage{pgfplotstable}
\pgfplotsset{compat=newest}
\setlength\parindent{0pt}
% Got it from here: https://tex.stackexchange.com/a/215571/104358 / Just small edits, see comments
\ExplSyntaxOn
\NewDocumentCommand{\storeData}{mm}
{
\bcp_store_data:nn { #1 } { #2 }
}
\NewDocumentCommand{\appendData}{mm}
{
\bcp_append_data:nn { #1 } { #2 }
}
\NewExpandableDocumentCommand{\getData}{O{1}m}
{
\bcp_get_data:nn { #1 } { #2 }
}
\NewExpandableDocumentCommand{\getLength}{m}
{
\seq_count:c { l_bcp_data_#1_seq }
}
\NewDocumentCommand{\removeLast}{om}
{
\IfNoValueTF { #1 }
{
\bcp_remove_last:Nn \g_tmpa_tl { #2 }
}
{
\bcp_remove_last:Nn #1 { #2 }
}
}
\NewDocumentCommand{\processData}{O{,~}mo}
{% #1 = separator, #2 = list name, #3 = template
\IfNoValueTF{#3}
{% no template, just use the list with the optional separator,
% default comma-space
\seq_use:cn { l_bcp_data_#2_seq } { #1 }
}
{% template given
\seq_map_inline:cn { l_bcp_data_#2_seq } { #3 }
}
}
\cs_new_protected:Npn \bcp_store_data:nn #1 #2
{
% create the sequence if it doesn't exist or clear it if it exists
\seq_gclear_new:c { l_bcp_data_#1_seq }
% append the items
\__bcp_append_data:nn { #1 } { #2 }
}
\cs_new_protected:Npn \bcp_append_data:nn #1 #2
{
% create the sequence if it doesn't exist, do nothing if it exists
\seq_if_exist:cF { l_bcp_data_#1_seq }
{ \seq_new:c { l_bcp_data_#1_seq } }
% append the items
\__bcp_append_data:nn { #1 } { #2 }
}
\cs_new_protected:Npn \__bcp_append_data:nn #1 #2
{
% append items one at a time
\tl_map_inline:nn { #2 }
{
\seq_gput_right:cn { l_bcp_data_#1_seq } { ##1 }
}
}
\cs_new:Npn \bcp_get_data:nn #1 #2
{
% retrieve the requested item
\seq_item:cn { l_bcp_data_#2_seq } { #1 }
}
\cs_new_protected:Nn \bcp_remove_last:Nn
{
\seq_gpop_right:cN { l_bcp_data_#2_seq } #1
}
\ExplSyntaxOff
\begin{filecontents}{data.csv}
Type,Description
typA,description0 % Add this to the array
typB,description1
typA,description2 % Add this to the array
typA,description2 % Add this to the array => ERROR, because this description is already there!
\end{filecontents}
\pgfplotstableread[col sep=comma]{data.csv}{\csvdata}
\pgfplotstablegetrowsof{\csvdata}
\newcommand{\checkRoomDuplicates}{
\newcount\counter
\newcount\maxCounter
\counter=0
\maxCounter=\getLength{roomArray}
\pgfplotstablegetelem{\pgfplotstablerow}{Type}\of{\csvdata}
\ifthenelse{\equal{\pgfplotsretval}{typA}}{ % Only do something, when Typ is typA
\pgfplotstablegetelem{\pgfplotstablerow}{Description}\of{\csvdata}
MaxCounter: \the\maxCounter\newline
\edef\roomArrayLength{\getLength{roomArray}}
\ifthenelse{\equal{\roomArrayLength}{0}}{
\appendData{roomArray}{{\pgfplotsretval}}
}{
\loop
\ifnum\counter<\maxCounter
\advance\counter by 1
Counter: \the\counter\newline
RETURN VALUE:\pgfplotsretval\newline
\edef\roomDescription{\getData[\counter]{roomArray}}
DESCRIPTION:\roomDescription\newline % => Why is this value "description2" and NOT "description0"?
\ifthenelse{\equal{\pgfplotsretval}{\roomDescription}}{
DOUBLE ROOM!\newline
}{}
\repeat
\appendData{roomArray}{{\pgfplotsretval}}
}
\processData{roomArray}\newline
\vspace{5mm}
}{}
}
\pgfplotstableset{
every column/.style={
assign cell content/.code={
\ifnum\pgfplotstablecol=0 % Do this only once per row
\checkRoomDuplicates{}
\fi
},
}
}
\begin{document}
\pgfplotstabletypeset{\csvdata}
\end{document}
Here is the output:
What I don't understand is why \processData{roomArray} always returns a repetition of the current description value and not the old, stored ones as well. I think this is the reason why my routine is not working as expected.
I would expect that the first time this routine is processed:
\edef\roomDescription{\getData[\counter]{roomArray}}
DESCRIPTION:\roomDescription
\roomDescription is "description0" and NOT "description2" for \counter = 1
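To make the expectation concrete, the following minimal sketch isolates the storing step. It rests on an assumption that is not confirmed here: that it matters whether the macro itself or its current value is put into the sequence. The names \current, \storeUnexpanded, \storeValue, \showItems and \clearItems are hypothetical; \seq_gput_right:NV is the expl3 variant that stores the value of its second argument.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical names for illustration only
\seq_new:N \g_demo_items_seq
% Stores the tokens exactly as given (here: the macro itself)
\NewDocumentCommand{\storeUnexpanded}{m}
 { \seq_gput_right:Nn \g_demo_items_seq { #1 } }
% Stores the current value of the macro passed as argument
\NewDocumentCommand{\storeValue}{m}
 { \seq_gput_right:NV \g_demo_items_seq #1 }
\NewDocumentCommand{\showItems}{}
 { \seq_use:Nn \g_demo_items_seq { ,~ } }
\NewDocumentCommand{\clearItems}{}
 { \seq_gclear:N \g_demo_items_seq }
\ExplSyntaxOff
\begin{document}
\def\current{description0}\storeUnexpanded{\current}
\def\current{description2}\storeUnexpanded{\current}
Macro stored: \showItems\par

\clearItems
\def\current{description0}\storeValue{\current}
\def\current{description2}\storeValue{\current}
Value stored: \showItems\par
\end{document}
```

If the assumption holds, the first line would repeat the latest value for every item, while the second would list both descriptions. In the MWE, the corresponding change would be to expand \pgfplotsretval before it reaches \appendData.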
25.07.24 EDIT based on the comments
In general, the data I'm dealing with belongs to my overall smartHome electrical plan. All this is based on this post, but it is much more complex now. The data contains room descriptions, component identifiers, KNX addresses, fuse and phase information, component coordinates within my house layout, and many more details. In the end, I will wire my house based on it.
All this information is provided by an input.csv file. My LaTeX routines then check and format the data. For example, the input data contains a component identifier, but in the generated table I want to see the circuit symbol, so I have implemented a routine that prints the correct symbol based on the component identifier. Another example: I generate a hash value from selected fields of the input data and store it in the table as well.
Additionally, I print a 2D graphic from the data. That means I get a "how to wire" table and also a TikZ image showing "where to install" these components. And much more... Most of these features are already working.
So to come back to my question, I'm trying to find a way to work with string arrays in LaTeX in an efficient and user-friendly way.
Of course, my MWE shows only a very small part of this. My original raw data has 20 columns and more than 300 rows. I know I could also do this with other tools, i.e. check the input data in Python (or wherever) and then pass only the corrected data to LaTeX, but I have a running CI/CD environment which only supports LaTeX at the moment. I just want to know how to do this within LaTeX :)

sort data.csv | uniq -d would give you all duplicate lines.