#### Foreword:

Association rule mining is one of the most active research areas in data mining. It refers to searching all records or transactions in a business system to find rules that associate one set of events or data items with another. By uncovering unknown or hidden information in a database, it focuses on discovering connections between different fields in the data, and it is the most common form of local-pattern mining in unsupervised learning.

Generally speaking, association rule mining means finding interesting associations or correlations in a large data set, that is, identifying sets of attribute values that frequently occur together, known as frequent itemsets, and then using these frequent itemsets to create rules that describe the relationships between them.

#### Association rule mining problem:

Discover frequent itemsets: all frequent itemsets are the basis for forming association rules. Given a user-specified minimum support, find all itemsets whose support is greater than or equal to Minsupport.

Generate association rules: given a user-specified minimum confidence, find, within each maximal frequent itemset, the association rules whose confidence is not less than Minconfidence.

How to find all frequent itemsets quickly and efficiently is the core problem of association rule mining, and it is also an important criterion for measuring the efficiency of an association rule mining algorithm.
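To fix notation, these are the textbook definitions of the two thresholds just mentioned (standard definitions, not specific to this article): for a rule A ⇒ B over a transaction database D,

```latex
% Support: fraction of transactions that contain both A and B
\mathrm{support}(A \Rightarrow B) = P(A \cup B)
  = \frac{|\{\, T \in D : A \cup B \subseteq T \,\}|}{|D|}

% Confidence: conditional probability of B given A
\mathrm{confidence}(A \Rightarrow B) = P(B \mid A)
  = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
```

A rule is output only if its support is at least Minsupport and its confidence is at least Minconfidence.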

The classic approach is to mine the complete set of frequent itemsets. Algorithms based on breadth-first search include the Apriori algorithm (which finds all frequent itemsets through multiple iterations) and its improvement, the DHP (Direct Hashing and Pruning) algorithm; algorithms based on a depth-first search strategy include the FP-Growth algorithm, the ECLAT algorithm, the COFI algorithm, and so on. Below I will introduce two classic algorithms: the Apriori algorithm and the FP-Growth algorithm.

#### 1. Apriori algorithm

The Apriori algorithm is based on prior knowledge of the properties of frequent itemsets and uses a bottom-up iterative search: starting from the frequent 1-itemsets, it uses the frequent k-itemsets to find the frequent (k+1)-itemsets, until no frequent itemsets with more items can be found.

The Apriori algorithm consists of the following steps; the core ones are the join step and the prune step:


(1) Generate the set of frequent 1-itemsets, L1.

(2) Join step: to find the frequent k-itemsets, first generate a candidate set Ck of potentially frequent k-itemsets. Each candidate is formed by joining two frequent (k-1)-itemsets in Lk-1 that differ in only one item. The join condition is: let l1 and l2 be itemsets in Lk-1, with the items in each itemset arranged in lexicographic order and li[j] denoting the j-th item of li; if the first k-2 items of l1 and l2 are identical and l1[k-1] < l2[k-1], then l1 and l2 are joinable, written l1 ⋈ l2, and the resulting candidate itemset is {l1[1], l1[2], ..., l1[k-1], l2[k-1]}.

(3) Prune step: the Ck generated by the join step is a superset of Lk; it contains all the frequent k-itemsets but may also contain some infrequent ones. The Apriori property (every subset of a frequent itemset must itself be frequent) can be used to prune Ck and reduce the amount of data: if any (k-1)-subset of a candidate k-itemset in Ck is not in Lk-1, the candidate cannot be frequent and can be deleted directly.

(4) Generate the frequent k-itemsets Lk: scan the transaction database D, count the support of each itemset in Ck, remove the itemsets that do not meet the minimum support, and obtain the frequent k-itemsets Lk.

(5) Repeat steps (2) to (4) until no new frequent itemsets can be generated, at which point the algorithm terminates.
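As a concrete illustration of steps (1) to (5), here is a minimal Python sketch. The function and variable names (`apriori`, `count_support`) are illustrative, not from any library, and the join is performed by taking unions of size k, which is equivalent to matching the first k-2 items of sorted (k-1)-itemsets:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch. transactions: list of item sets;
    min_support: absolute support count. Returns a dict mapping
    each frequent itemset (a frozenset) to its support count."""

    def count_support(candidates):
        # Step (4): scan the database, keep candidates meeting min_support
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate is a subset of the transaction
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # Step (1): frequent 1-itemsets L1
    L = count_support({frozenset([i]) for t in transactions for i in t})
    frequent, k = dict(L), 2
    while L:
        # Step (2), join: two (k-1)-itemsets sharing k-2 items give one candidate
        prev = list(L)
        candidates = {prev[i] | prev[j]
                      for i in range(len(prev))
                      for j in range(i + 1, len(prev))
                      if len(prev[i] | prev[j]) == k}
        # Step (3), prune: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Steps (4)-(5): count supports; stop once L becomes empty
        L = count_support(candidates)
        frequent.update(L)
        k += 1
    return frequent
```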

The Apriori algorithm is a breadth-first algorithm based on a horizontal data layout. Thanks to its level-wise search strategy and pruning technique, it is reasonably efficient at mining frequent patterns. However, it also has two fatal performance bottlenecks:

(1) Apriori is a multi-pass search algorithm: each pass scans the entire transaction database, so the I/O overhead is huge. Every candidate in Ck must have its support counted against the database before it can enter the frequent k-itemsets Lk; if the longest frequent itemset contains n items, the transaction database has to be scanned at least n times.

(2) It may generate huge candidate sets. Because Ck is produced by joining the frequent (k-1)-itemsets in Lk-1, the number of candidate k-itemsets can grow explosively. Such a large number of candidates poses a huge challenge to both computation time and storage space.

A worked example: consider the following transaction database.

| TID | Items |
| --- | --- |
| T100 | I1, I2, I5 |
| T200 | I2, I4 |
| T300 | I2, I3 |
| T400 | I1, I2, I4 |
| T500 | I1, I3 |
| T600 | I2, I3 |
| T700 | I1, I3 |
| T800 | I1, I2, I3, I5 |
| T900 | I1, I2, I3 |

Set the minimum support count min_sup = 2.

k = 1: calculate C1 and L1.

C1:

| Itemset | Support count |
| --- | --- |
| {I1} | 6 |
| {I2} | 7 |
| {I3} | 6 |
| {I4} | 2 |
| {I5} | 2 |

L1, obtained from C1 by deleting itemsets whose support count is below min_sup:

| Itemset | Support count |
| --- | --- |
| {I1} | 6 |
| {I2} | 7 |
| {I3} | 6 |
| {I4} | 2 |
| {I5} | 2 |

k = 2: calculate C2 and L2.

C2, generated by joining pairs from L1:

| Itemset | Support count |
| --- | --- |
| {I1, I2} | 4 |
| {I1, I3} | 4 |
| {I1, I4} | 1 |
| {I1, I5} | 2 |
| {I2, I3} | 4 |
| {I2, I4} | 2 |
| {I2, I5} | 2 |
| {I3, I4} | 0 |
| {I3, I5} | 1 |
| {I4, I5} | 0 |

L2, the itemsets in C2 that meet min_sup:

| Itemset | Support count |
| --- | --- |
| {I1, I2} | 4 |
| {I1, I3} | 4 |
| {I1, I5} | 2 |
| {I2, I3} | 4 |
| {I2, I4} | 2 |
| {I2, I5} | 2 |

k = 3: calculate C3 and L3.

C3, the candidate 3-itemsets produced by joining pairs from L2 (a pair whose union has more than three items is not joinable):

| Join | Result |
| --- | --- |
| {I1, I2} + {I1, I3} | {I1, I2, I3} |
| {I1, I2} + {I1, I5} | {I1, I2, I5} |
| {I1, I2} + {I2, I3} | {I1, I2, I3} |
| {I1, I2} + {I2, I4} | {I1, I2, I4} |
| {I1, I3} + {I1, I5} | {I1, I3, I5} |
| {I1, I3} + {I2, I3} | {I1, I2, I3} |
| {I1, I3} + {I2, I4} | more than three items |
| {I1, I3} + {I2, I5} | more than three items |
| {I1, I5} + {I2, I3} | more than three items |
| {I1, I5} + {I2, I4} | more than three items |
| {I1, I5} + {I2, I5} | {I1, I2, I5} |
| {I2, I3} + {I2, I4} | {I2, I3, I4} |
| {I2, I3} + {I2, I5} | {I2, I3, I5} |
| {I2, I4} + {I2, I5} | {I2, I4, I5} |

L3, obtained from C3 by pruning candidates that contain an infrequent 2-subset (for example {I2, I3, I4}, because {I3, I4} is not in L2) and then counting support against the database:

| Itemset | Support count |
| --- | --- |
| {I1, I2, I3} | 2 |
| {I1, I2, I5} | 2 |

k = 4: calculate C4 and L4.

C4, the candidate 4-itemsets produced by joining pairs from L3:

| Join | Result |
| --- | --- |
| {I1, I2, I3} + {I1, I2, I5} | {I1, I2, I3, I5} |

**Because its subset {I2, I3, I5} is not a frequent itemset, this candidate is pruned, so C4 = ∅ and the algorithm terminates.**
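Running the sketch from earlier on these nine transactions with min_sup = 2 should reproduce the tables above; this usage example assumes the `apriori` function defined in the sketch:

```python
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
for itemset, count in sorted(apriori(transactions, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# The largest itemsets printed are ['I1', 'I2', 'I3'] 2 and
# ['I1', 'I2', 'I5'] 2, matching L3; no 4-itemset survives, so C4 is empty.
```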

##### Advantages and disadvantages of the Apriori algorithm:

Advantages: the idea is simple; the computation proceeds level by level; it is easy to implement.

Disadvantages: the database is scanned repeatedly; candidate generation requires many join operations; the candidate sets occupy a lot of memory; the amount of computation is large.

#### 2. FP-Growth algorithm

The FP-Growth (Frequent Pattern Growth) algorithm uses a divide-and-conquer strategy: it compresses the frequent items of the database into a frequent pattern tree (FP-tree) while preserving the association information among itemsets. The compressed FP-tree is then divided into conditional subtrees, each corresponding to one frequent item; the frequent itemsets are obtained from these, and finally association rules are mined.

##### FP-Growth algorithm demonstration: constructing the FP-tree

Set up the transaction database (the same nine transactions as above):

| TID | Items |
| --- | --- |
| 1 | I1, I2, I5 |
| 2 | I2, I4 |
| 3 | I2, I3 |
| 4 | I1, I2, I4 |
| 5 | I1, I3 |
| 6 | I2, I3 |
| 7 | I1, I3 |
| 8 | I1, I2, I3, I5 |
| 9 | I1, I2, I3 |

Scan the transaction database and count the occurrences of each item to obtain the frequent item list F:

| Item | Support count |
| --- | --- |
| I1 | 6 |
| I2 | 7 |
| I3 | 6 |
| I4 | 2 |
| I5 | 2 |

Define minsup = 20%, i.e. a minimum support count of 2, and rearrange F in descending order of support count:

| Item | Support count |
| --- | --- |
| I2 | 7 |
| I1 | 6 |
| I3 | 6 |
| I4 | 2 |
| I5 | 2 |

Reorder the items within each transaction according to F (most frequent first):

| TID | Items (ordered) |
| --- | --- |
| 1 | I2, I1, I5 |
| 2 | I2, I4 |
| 3 | I2, I3 |
| 4 | I2, I1, I4 |
| 5 | I1, I3 |
| 6 | I2, I3 |
| 7 | I1, I3 |
| 8 | I2, I1, I3, I5 |
| 9 | I2, I1, I3 |

In the resulting FP-tree there are two paths from the root node to I5:

I2:7 -> I1:4 -> I5:1

I2:7 -> I1:4 -> I3:2 -> I5:1

Take the prefix paths I2:7 -> I1:4 and I2:7 -> I1:4 -> I3:2. Since the final node on each path must be I5, I5 itself is omitted; each prefix, carried with the count of I5 on that path, forms the conditional pattern base of I5:

Conditional pattern base: {I2, I1: 1}, {I2, I1, I3: 1}

Summing the counts over the conditional pattern base gives I2: 2, I1: 2, I3: 1. Because I3's count of 1 is less than the minimum support count of 2, I3 is dropped, and the conditional FP-tree of I5 is {I2: 2, I1: 2}.

From the conditional FP-tree, taking all combinations of its items yields the mined frequent patterns (the item itself, e.g. I5, is always included: every frequent pattern mined for an item must contain that item); see the sketch after the table below.

| Item | Conditional pattern base | Conditional FP-tree | Frequent patterns generated |
| --- | --- | --- | --- |
| I5 | {{I2, I1: 1}, {I2, I1, I3: 1}} | {I2: 2, I1: 2} | {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2} |
| I4 | {{I2, I1: 1}, {I2: 1}} | {I2: 2} | {I2, I4: 2} |
| I3 | {{I2, I1: 2}, {I2: 2}, {I1: 2}} | {I2: 4, I1: 2}, {I1: 2} | {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2} |
| I1 | {{I2: 4}} | {I2: 4} | {I2, I1: 4} |
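For the single-path rows in this table (I5 and I4), the combination step can be sketched directly. This is an illustrative helper, assuming the conditional FP-tree has been flattened into (item, count) pairs along one path; multi-branch conditional trees such as I3's require running FP-Growth recursively on the conditional tree, which this short sketch omits:

```python
from itertools import combinations

def patterns_from_single_path(path, suffix):
    """Frequent patterns from a single-path conditional FP-tree.
    path:   list of (item, count) pairs from the root down.
    suffix: the item whose conditional tree this is.
    Every non-empty combination of path items, plus the suffix, is
    frequent with count = min over the chosen items' counts."""
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            itemset = frozenset(i for i, _ in combo) | {suffix}
            patterns[itemset] = min(c for _, c in combo)
    return patterns

# I5's conditional FP-tree is the single path I2:2 -> I1:2
for itemset, count in patterns_from_single_path(
        [("I2", 2), ("I1", 2)], "I5").items():
    print(sorted(itemset), count)
# {I2, I5}: 2, {I1, I5}: 2, {I2, I1, I5}: 2 — matching the I5 row above
```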