esProc, A Script Language for Data Analytics with Parallel Mechanism: Grouping In-memory Data Using esProc: Code Examples

Using esProc, it is quite convenient to group data in memory. There are several main types of grouping based on how data is grouped. Here we’ll illustrate equal grouping, alignment grouping and enumeration grouping respectively with an example.

Equal grouping

That data is grouped by one or more certain fields (or one or more computed fields derived from fields) of the current data set is equal grouping. By this method each group is a subset of the original data set.

Case description: Group sales orders by the year.

Data description: The original data is as follow:

The above data set (table sequence) can be imported from a database or a file. For example:

A1=file("E:/sales.txt").import@t()

esProc code:

A2=A1.group(year(OrderDate))

Computed result:

Code explanation:

1. In this example, the grouping criterion comes from the OrderDate field. The order dates can be converted to the years with year(OrderDate), then data of the same year will be grouped together.

2. There may be more than one field for grouping data. For example, the data could be grouped by both the year and the sellerID to put records of each seller in each year into a group. Below is the code for doing it:

A1.group(year(OrderDate),SellerId)

3.Often, the grouped data are to be aggregated, like getting each year’s sales amount based on A2’s data. The code is:

A2.new(year(OrderDate):y,~.sum(Amount):a)

Computed result is as follows:

Or we can combine grouping and summarizing into one step:

A1.group(year(OrderDate):y,~.sum(Amount):a)

Alternatively, we can choose the groups function with better performance yet less flexibility:

A1.groups(year(OrderDate):y; sum(Amount):a)

Of course, sometimes we have to perform grouping and aggregate separately in order to reuse the code and improve computational efficiency, like the scenario in which one of A2’s group needs filtering and another one requires relational computing. In another scenario, the summarized data of a certain group is unusual and worth further study, then this group can be used directly in the subsequent computations without the need of filtering it again.

4. By default, esProc’s group function will group data using hash algorithm. But for ordered data, the comparison of adjacent rows, which is equivalent to merge operation, may have higher performance. This approach can be implemented by using @o option with group function. For example:

A1.group@o(year(OrderDate),SellerId)

Alignment grouping

Equal grouping groups data by the field(s) coming from within the dataset. If the grouping criterion is one or more fields of another data set, a user-defined array, or a parameter list, etc., the grouping model will be referred as an alignment grouping.

Different from the equal grouping, alignment grouping may produce empty subsets, which means no members in the original data can satisfy a certain grouping condition. It may also lead to incomplete grouping, that is, there may be members that will not appear in any group. Neither would happen with equal grouping.

Case description: Group the orders table according to the list of best 10 sellers selected by KPIs.

Ungrouped data set:

The orders table in the previous example will also be used here. Data is stored in A1.

The best 10 sellers list is stored in B1 as follows:

The list of sellers may come from an intermediate table, or be generated by a piece of code. It’s not important how it is produced in this example.

esProc code:

A1.align@a(B1:empID,SellerId)

Computed result:

Code explanation:

1. In this example, the grouping criterion (list of sellers) is outside of the data set being grouped. After data is grouped, each group contains only the data of one seller, and groups are arranged according to the order of members in the sellers list.

2. Because sellers in the orders table outnumber the best sellers, some of the orders will not appear in any group. We can use function option @n to store those orders in one additional group, as shown below:

A1.align@a@n(B1:empID,SellerId)

This group will be put in the end:

3. Sometimes not all members of the grouping criterion will fall in the data set to be grouped. For instance, the grouping criterion is “a list of newly-employed sellers”. In this case, it’s normal to produce empty groups. If we modify the first record of the list into empID=100, the result will be:

Enumeration grouping

The grouping criterion for enumeration grouping is even more flexible. It could be any boolean expression. The records satisfying the value of the expression will be put into the same group.

Similar to alignment grouping, this is also the incomplete grouping as it probably produces empty subsets or a result in which some members are not included in any group. Moreover, this type of grouping may have the result that certain members appear in more than one group.

Case description: Dividing orders into four groups, they are: A. order amount is less than 1,000; B. order amount is less than 2,000; C. order amount is less than 3,000; D. order amount is less than 10,000. Special requirement: data cannot be grouped repeatedly, that is, if an order has been in group A, it must not be put into group B, C, or D.

Ungrouped data set:

The orders table in previous examples will still be used. Data is stored in A1.

esProc code:

A2=["?<=1000","?<=2000","?<=3000","?<=10000"]

A3=A1.enum(A2,Amount)

Computed result:

Case explanation:

1.In this example, grouping criteria are multiple flexible expressions. Each record will be compared with each of the expressions. Those records that can match the same expression will be put into the same group. Groups are arranged according to the order of the grouping criteria as well.

2.By default, enumeration grouping will not produce duplicate members in the result. Because after group A’s data is selected out, expression B will be matched with the rest of the records, as this example has shown so far. But the use of function option @r allows us to match expression B with all records, which will produce duplicate members. For example the result of A3=A1.enum@r(A2,Amount) is as follows:

3. Likewise, if values of an enumeration expression fall outside of the data to be grouped, it will correspond to an empty group. Besides, if certain records cannot match any expression, function option @n can be used to group these surplus records together.

menu

September 15, 2015

Grouping In-memory Data Using esProc: Code Examples

Equal grouping

Alignment grouping

Enumeration grouping

No comments:

Post a Comment