Sequence Search
The sequence search finds the locations of a given sequence in the data. Sequences in the data (and the search sequence itself) are matched independently of how often each item repeats consecutively, e.g. if a sequence AAABBCCCC appears in the data, a search for AABCC will find it.
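The length-independent matching described above can be sketched in Python. This is only an illustration, not the node's actual implementation: the function names `collapse_runs` and `find_sequence` are hypothetical, and reporting a match as the index range from the start of the first run to the end of the last run is an assumption.

```python
from itertools import groupby


def collapse_runs(values):
    """Collapse consecutive repeats into (value, first_index, last_index) runs."""
    runs = []
    pos = 0
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((value, pos, pos + length - 1))
        pos += length
    return runs


def find_sequence(data, pattern):
    """Return (start, end) index pairs in `data` that match `pattern`,
    ignoring how often each item repeats consecutively."""
    runs = collapse_runs(data)
    # The search sequence is collapsed the same way, so AABCC becomes A, B, C.
    needle = [value for value, _, _ in collapse_runs(pattern)]
    matches = []
    if not needle:
        return matches
    for i in range(len(runs) - len(needle) + 1):
        window = runs[i:i + len(needle)]
        if [value for value, _, _ in window] == needle:
            # Assumption: a match spans the complete first and last run.
            matches.append((window[0][1], window[-1][2]))
    return matches


print(find_sequence("AAABBCCCC", "AABCC"))  # [(0, 8)]
```

Both the data and the search sequence are reduced to their runs before comparison, which is why AABCC and AAABBCCCC are treated as the same sequence A, B, C.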
The sequence search does not provide a preview of its own output. Instead, when you add it to the window of a time-based visualization (or create it there directly from the context menu of the time-based visualization in the workflow explorer), the marking output is previewed in the time-based visualization itself.
Output
The output of the node is a marking that marks the found sequences in the data. The marking can be visualized or analyzed further on its own, or it can be shown in a visualization that displays the data on which the search was performed (see section Markings).
Aggregation
The sequence search node cannot aggregate, because sequences are always searched in the unchanged input data.
Settings
- Column:
- The column in which the sequence is searched. You can only select columns that contain a limited number of different values.
- Available:
- All the different values that occur in the selected column. Drag and drop values from here to the "Search Sequence" and "Ignored" lists.
- Search Sequence:
- The sequence to search for. Drag and drop items from the "Available" list on the left into this list. The order of the items matters; you can change it within the list itself by drag and drop.
- Ignored:
- These are the items that are ignored during the search. If you ignore item X and search for the sequence ABC, even a sequence AXBXC will be found.
- Split Results:
- Determines whether the result will be a single data set or a separate data set for each found sequence. Activating this option can be useful when each found sequence should be analyzed or visualized separately in the following node. Additionally, activating this option makes it possible to identify overlapping found sequences when the result is shown in a visualization of the base data.
- Search Type:
- Offers a selection of the different search types. "Simple Search" finds exact matches of the selected sequence. "Levenshtein Search" offers several options for a fuzzy search, finding sequences that are similar, but not necessarily equal, to the search sequence. The following parameters are only available if Levenshtein Search is selected.
- Levenshtein Properties:
- Parameters that affect the Levenshtein search:
- Max Dissimilarity:
- Sequences whose Levenshtein distance to the selected sequence is less than or equal to this value are found by the search. The Levenshtein distance is influenced by the following three parameters:
- Insert Cost:
- The cost of an insert operation, i.e. of an item that has to be inserted into the data sequence to obtain the search sequence. If the insert cost is X, a sequence ABD in the data has a Levenshtein distance of X to a search sequence ABCD.
- Delete Cost:
- The cost of a delete operation, i.e. of an item that has to be deleted from the data sequence to obtain the search sequence. If the delete cost is X, a sequence ABYCD in the data has a Levenshtein distance of X to a search sequence ABCD.
- Replace Cost:
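- The cost of a replace operation, i.e. of an item in the data sequence that has to be replaced by another item to obtain the search sequence. If the replace cost is X, a sequence ABYD in the data has a Levenshtein distance of X to a search sequence ABCD.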
- Max Length Difference:
- Only sequences whose length differs from the length of the search sequence by at most this value are found. This parameter is independent of the other Levenshtein parameters.
- Subsequence Treatment:
- Determines how found sequences that are subsequences of other found sequences are treated. "Use Largest Sequence" will only report the largest sequence, i.e. the sequence that is a supersequence of the other found sequences. "Use Smallest Sequence" will only report the smallest subsequence of such a set. "Use all Sequences" will report all found sequences. Note that this last option is only relevant if "Split Results" is active.
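How the "Ignored" list and the Levenshtein cost parameters could interact is illustrated by the following Python sketch. It is not the node's actual implementation: the names `drop_ignored`, `weighted_levenshtein` and `is_match` are hypothetical, the distance is the standard dynamic-programming edit distance with per-operation costs, and removing ignored items before the comparison is an assumption about how the setting works.

```python
def drop_ignored(sequence, ignored):
    """Remove ignored items before matching (assumed reading of "Ignored")."""
    return [item for item in sequence if item not in ignored]


def weighted_levenshtein(data, search, insert_cost=1, delete_cost=1, replace_cost=1):
    """Cost of editing the `data` sequence into the `search` sequence."""
    rows, cols = len(data) + 1, len(search) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost   # delete every remaining data item
    for j in range(1, cols):
        dist[0][j] = j * insert_cost   # insert every remaining search item
    for i in range(1, rows):
        for j in range(1, cols):
            same = data[i - 1] == search[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,
                dist[i][j - 1] + insert_cost,
                dist[i - 1][j - 1] + (0 if same else replace_cost),
            )
    return dist[-1][-1]


def is_match(data, search, max_dissimilarity, ignored=(), **costs):
    """True if `data` is within "Max Dissimilarity" of `search`."""
    cleaned = drop_ignored(data, ignored)
    return weighted_levenshtein(cleaned, search, **costs) <= max_dissimilarity


costs = dict(insert_cost=2, delete_cost=4, replace_cost=5)
print(weighted_levenshtein("ABD", "ABCD", **costs))    # 2 (one insert)
print(weighted_levenshtein("ABYCD", "ABCD", **costs))  # 4 (one delete)
print(weighted_levenshtein("ABYD", "ABCD", **costs))   # 5 (one replace)
```

In this sketch a replace never costs more than a delete followed by an insert, because the minimum over all edit paths is taken. Whether the node behaves the same way for large replace costs is not stated in this section.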