New step: Split column step

Assignee

Reporter

Sprint

Description

Implement a step that will allow Wrangler users to split column into multiple other columns based on a delimiter.

  • Step name: Split column

  • Step label: Split column based on a delimiter

  • Step description: Split data in column into multiple new columns based on a delimiter.

Step parameters:

  • Input column: a column to split, string

  • Delimiter: a delimiter, can be multiple characters, NOT a regular expression (we can introduce regexp delimiter later)

  • Number of columns to create: integer, must be at least 1. Defines how many columns to create at maximum (see example below).

Step label in UI

  • Split column 'inputColumnName' to 4 columns using ','

The step will take input string, search for delimiters and return text between those delimiters in newly created columns. New columns will have names like SourceColumn part 1, SourceColumn part 2, ..., SourceColumn part N where SourceColumn is the name of the input column and the parts are numbered from 1 to number of splits.

Empty value will produce empty output. If there are fewer delimiters than number of splits in the parameters, the remaining columns will be set to null. If there are more delimiters than the number of splits, the last column will receive the remainder of the data.

Note that this behaves almost like CTL split function but it does not use regexp to define a delimiter but uses just plain string.

Examples:

  • Split "1,2,3,4" into columns, delim=",", columns=4

    • Returns data in 4 columns: "1", "2", "3", "4"

  • Split "1,2,3,4" into columns, delim=",", columns=3

    • Returns data in 3 columns: "1", "2", "3,4" (note the last one which contains two values)

  • Split "1,2,3,4" into columns, delim=",", columns=5

    • Returns data in 5 columns: "1", "2", "3", "4", (null)

  • Split "1,,,4" into columns, delim=",", columns=4

    • Returns data in 4 columns: "1", "", "", "4"

    • This means that the empty parts in the middle are represented by empty strings

  • Split "  1 ,     2       ,    3      ,4" into columns, delim=",", columns=4

    • Returns data in 4 columns: "1", "2", "3", "4"

    • This means that all the individual parts should be trimmed automatically by the step

  • Split "" into columns, delim=",", columns=4

    • Returns data in 4 columns: (null), (null), (null), (null)

    • All the parts are null because this is special case - the input has no delimiter and no content therefore no parts after split

    • Same should happen if the input value is null

Steps to reproduce

None

Activity

Show:

Jiri Trnka August 2, 2023 at 6:05 AM

Implemented new 'Split column' step.

QA: Try some weird delimiters and single and multiple chars long.

Fixed

Details

Priority

Fix versions

QA Testing

UNDECIDED

Components

Created April 20, 2023 at 3:31 PM
Updated October 3, 2023 at 8:44 AM
Resolved August 2, 2023 at 6:05 AM