
Task 2: Next Product Recommendation for Underrepresented Languages

Task 2 - Getting Started

Make your first submission on Task 2

dipam

Amazon KDD Cup 2023 - Task 2 - Next Product Recommendation

This notebook contains instructions and an example submission with random predictions.

Installations 🤖

  1. aicrowd-cli for downloading challenge data and making submissions
  2. pyarrow for saving the submission file in parquet format
In [1]:
!pip install aicrowd-cli pyarrow
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.15-py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 1.9 MB/s eta 0:00:00
Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)
Collecting python-slugify<6,>=5.0.0
  Downloading python_slugify-5.0.2-py2.py3-none-any.whl (6.7 kB)
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.10.1-py2.py3-none-any.whl (54 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.5/54.5 KB 2.9 MB/s eta 0:00:00
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: requests<3,>=2.25.1 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (2.27.1)
Collecting click<8,>=7.1.2
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82.8/82.8 KB 4.3 MB/s eta 0:00:00
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp39-cp39-manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 16.1 MB/s eta 0:00:00
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 214.4/214.4 KB 13.9 MB/s eta 0:00:00
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.1/170.1 KB 9.1 MB/s eta 0:00:00
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (4.65.0)
Collecting semver<3,>=2.13.0
  Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.10-py3-none-any.whl (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 KB 2.9 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.16.6 in /usr/local/lib/python3.9/dist-packages (from pyarrow) (1.22.4)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify<6,>=5.0.0->aicrowd-cli) (1.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2022.12.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.15)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (3.4)
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 2.2 MB/s eta 0:00:00
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.9/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Installing collected packages: commonmark, smmap, semver, pyzmq, python-slugify, colorama, click, rich, requests-toolbelt, gitdb, GitPython, aicrowd-cli
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 23.2.1
    Uninstalling pyzmq-23.2.1:
      Successfully uninstalled pyzmq-23.2.1
  Attempting uninstall: python-slugify
    Found existing installation: python-slugify 8.0.1
    Uninstalling python-slugify-8.0.1:
      Successfully uninstalled python-slugify-8.0.1
  Attempting uninstall: click
    Found existing installation: click 8.1.3
    Uninstalling click-8.1.3:
      Successfully uninstalled click-8.1.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 2.2.3 requires click>=8.0, but you have click 7.1.2 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.15 click-7.1.2 colorama-0.4.6 commonmark-0.9.1 gitdb-4.0.10 python-slugify-5.0.2 pyzmq-22.1.0 requests-toolbelt-0.10.1 rich-10.16.2 semver-2.13.0 smmap-5.0.0

Log in to AIcrowd and download the data 📚

In [2]:
!aicrowd login
Please login here: https://api.aicrowd.com/auth/7Mjyt9pDM_gRJTCDnXmLqgetODx-6EEkFFtqJkNHW1Q
/usr/bin/xdg-open: 869: www-browser: not found
/usr/bin/xdg-open: 869: links2: not found
/usr/bin/xdg-open: 869: elinks: not found
/usr/bin/xdg-open: 869: links: not found
/usr/bin/xdg-open: 869: lynx: not found
/usr/bin/xdg-open: 869: w3m: not found
xdg-open: no method available for opening 'https://api.aicrowd.com/auth/7Mjyt9pDM_gRJTCDnXmLqgetODx-6EEkFFtqJkNHW1Q'
API Key valid
Gitlab access token valid
Saved details successfully!
In [3]:
!aicrowd dataset download --challenge task-2-next-product-recommendation-for-underrepresented-languages
sessions_test_task1.csv: 100% 19.4M/19.4M [00:01<00:00, 10.7MB/s]
sessions_test_task2.csv: 100% 1.92M/1.92M [00:00<00:00, 3.57MB/s]
sessions_test_task3.csv: 100% 3.15M/3.15M [00:00<00:00, 5.49MB/s]
products_train.csv: 100% 589M/589M [01:16<00:00, 7.74MB/s]
sessions_train.csv: 100% 259M/259M [00:42<00:00, 6.08MB/s]

Set up data and task information

In [4]:
import os
import numpy as np
import pandas as pd
from functools import lru_cache
In [5]:
train_data_dir = '.'
test_data_dir = '.'
task = 'task2'
PREDS_PER_SESSION = 100
In [6]:
# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))
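Because each reader is wrapped in lru_cache, repeated calls return the same cached DataFrame instead of re-parsing the large CSV files. A quick sanity check of that behaviour (a minimal sketch, assuming the cells above have run):

In [ ]:
df_a = read_product_data()
df_b = read_product_data()
# lru_cache returns the identical object on repeated calls,
# so the CSV is only read from disk once per session
assert df_a is df_b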

Data Description

The Multilingual Shopping Session Dataset is a collection of anonymized customer sessions containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: user sessions and product attributes. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.


Each product has the following associated information:

locale: the locale code of the product (e.g., DE)

id: a unique identifier for the product, also known as the Amazon Standard Item Number (ASIN) (e.g., B07WSY3MG8)

title: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

price: price of the item in local currency (e.g., 24.99)

brand: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

color: color of the item (e.g., “Black”)

size: size of the item (e.g., “xxl”)

model: model of the item (e.g., “iphone 13”)

material: material of the item (e.g., “cotton”)

author: author of the item (e.g., “J. K. Rowling”)

desc: description of an item’s key features and benefits, called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)
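Many of these attributes are optional and often missing. A quick way to see how sparsely each column is populated (a small sketch, assuming the data was downloaded as above):

In [ ]:
# Fraction of missing values per product attribute, most-missing first
products = read_product_data()
print(products.isna().mean().sort_values(ascending=False))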

EDA 💽

In [7]:
def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

    # Note: prev_items is read from the CSV as a single string, so len()
    # here measures the character length of that string, not the number
    # of items in the session.
    train_l = sess_train['prev_items'].apply(lambda sess: len(sess))
    test_l = sess_test['prev_items'].apply(lambda sess: len(sess))

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
        )
    if len(sess_test) > 0:
        print(
             f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")
In [8]:
products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)
Locale: DE 
Number of products: 518327 
Number of train sessions: 1111416 
Train session lengths - Mean: 57.89 | Median 40.00 | Min: 27.00 | Max 2060.00 
Number of test sessions: 0
======================================================================== 

Locale: JP 
Number of products: 395009 
Number of train sessions: 979119 
Train session lengths - Mean: 59.61 | Median 40.00 | Min: 27.00 | Max 6257.00 
Number of test sessions: 0
======================================================================== 

Locale: UK 
Number of products: 500180 
Number of train sessions: 1182181 
Train session lengths - Mean: 54.85 | Median 40.00 | Min: 27.00 | Max 2654.00 
Number of test sessions: 0
======================================================================== 

Locale: ES 
Number of products: 42503 
Number of train sessions: 89047 
Train session lengths - Mean: 48.82 | Median 40.00 | Min: 27.00 | Max 792.00 
Number of test sessions: 8176
Test session lengths - Mean: 50.69 | Median 40.00 | Min: 27.00 | Max 383.00 

======================================================================== 

Locale: FR 
Number of products: 44577 
Number of train sessions: 117561 
Train session lengths - Mean: 47.25 | Median 40.00 | Min: 27.00 | Max 687.00 
Number of test sessions: 12520
Test session lengths - Mean: 51.18 | Median 40.00 | Min: 27.00 | Max 410.00 

======================================================================== 

Locale: IT 
Number of products: 50461 
Number of train sessions: 126925 
Train session lengths - Mean: 48.80 | Median 40.00 | Min: 27.00 | Max 621.00 
Number of test sessions: 13992
Test session lengths - Mean: 50.82 | Median 40.00 | Min: 27.00 | Max 555.00 

======================================================================== 

In [9]:
products.sample(5)
Out[9]:
id locale title price brand color size model material author desc
1163760 B09MVJHCTF UK Robo Alive Robo Fish Series 2 Robotic Swimming... 15.99 Robo Fish 2 Pack, Red and Blue NaN 7165G Plastic NaN HYPER REALISTIC: Robo Fish look and move just ...
1012222 B003J19E0Y UK Charlie Crow Lamb / Sheep costume for kids one... 17.50 Charlie Crow Cream 3-8 Years 80600 Polyester NaN 100% Acrylic. Machine washable.
270885 B08NP4T8VT DE deleyCON 1,5m Aux Kabel 3,5mm Verlängerung - A... 5.69 deleyCON Schwarz 1.5 M MK4650 NaN NaN 3,5mm Klinken Anschlüsse // Metallstecker // B...
1542162 B07NXG4NV9 IT Amazon Basics - Batterie ricaricabili AAA (con... 21.47 Amazon Basics NaN Confezione da 24 HFR-AAA800 NaN NaN Confezione di 24 batterie AAA ricaricabili da ...
888723 B09BFSW6DC JP 【装着感ゼロ・本体再現】CASEKOO iPhone 14 Plus / 13promax ... 2399.00 CASEKOO クリア 6.7インチ iphone13promax 強化ガラス NaN 【ガイドの中にフィルムを落とすだけ】 0.1㎜まで高精度のガイド枠があるおかげで、位置合わせ...
In [10]:
train_sessions = read_train_data()
train_sessions.sample(5)
Out[10]:
prev_items next_item locale
2763316 ['B0999HZ57Z' 'B08CZDL9DF' 'B08CZ67WCJ' 'B08CZ... B08FDJCJHZ UK
3065503 ['B000R9J1B8' 'B002KAL6NI' 'B08N693QN6' 'B071W... B083G29KWC UK
3123439 ['0007371462' '0007371462'] 0008438706 UK
1684574 ['B0B7WVWRR6' 'B00A16BT4E' 'B08G4KPZHQ' 'B08G4... B00Q3K32I8 JP
46429 ['B093BMFRZ6' 'B08M5MGHKD' 'B07Y4KZLBF' 'B09K3... B097GXF32S DE
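As the samples show, prev_items arrives from the CSV as a single string of ids rather than a Python list. If you need the individual product ids, you have to parse it yourself; one possible helper (a hedged sketch based on the format shown above, not part of the official starter code):

In [ ]:
import re

def parse_prev_items(prev_items_str):
    # Extract the quoted product ids from strings like
    # "['B0999HZ57Z' 'B08CZDL9DF' ...]"
    return re.findall(r"'([^']+)'", prev_items_str)

parse_prev_items("['B09M8JHWZC' 'B0BFHD93JZ']")  # -> ['B09M8JHWZC', 'B0BFHD93JZ']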
In [11]:
test_sessions = read_test_data(task)
test_sessions.sample(5)
Out[11]:
prev_items locale
1007 ['B09M8JHWZC' 'B0BFHD93JZ'] ES
24977 ['B07V4SJF94' 'B07V4SJF94'] IT
15307 ['B09BP3C1HG' 'B09BP1G2TR' 'B085FLLXWQ' 'B0B34... FR
34158 ['B00X5WBLHG' 'B0021IOTTW' 'B0B52M4M9S' 'B008Y... IT
25944 ['B08P4FV6W2' 'B08G81KZ1C' 'B09FJC8V7X' 'B09FJ... IT

Generate Submission 🏋️‍♀️

Submission format:

  1. The submission should be a parquet file with the sessions from all the locales.
  2. Predicted product ids for each locale must be valid product ids of that locale.
  3. Predictions should be added in a new column named "next_item_prediction".
  4. Each prediction should be a list of product id strings.
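For illustration, here is a minimal, hypothetical one-row frame with the expected structure (the ids below are invented placeholders, not real ASINs, and a real submission carries up to 100 ids per row):

In [ ]:
# Hypothetical example of the expected schema:
# one row per test session, with the locale and a list of predicted id strings
example = pd.DataFrame({
    'locale': ['ES'],
    'next_item_prediction': [['B000000001', 'B000000002']],
})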
In [12]:
def random_predictions(locale, sess_test_locale):
    random_state = np.random.RandomState(42)
    products = read_product_data().query(f'locale == "{locale}"')
    predictions = []
    for _ in range(len(sess_test_locale)):
        predictions.append(
            list(products['id'].sample(PREDS_PER_SESSION, replace=True, random_state=random_state))
        ) 
    sess_test_locale['next_item_prediction'] = predictions
    sess_test_locale.drop('prev_items', inplace=True, axis=1)
    return sess_test_locale
In [13]:
test_sessions = read_test_data(task)
predictions = []
test_locale_names = test_sessions['locale'].unique()
for locale in test_locale_names:
    sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
    predictions.append(
        random_predictions(locale, sess_test_locale)
    )
predictions = pd.concat(predictions).reset_index(drop=True)
predictions.sample(5)
Out[13]:
locale next_item_prediction
1515 ES [B09N7CS3H2, B07W5JKHFR, B09XHDVVT8, B08NV7ZFC...
5511 ES [B00E3RKC36, B08YKL2P5P, B08LQJ8W1F, B0B5HV9DR...
31303 IT [B08JM62H1H, B003M0NURK, B01LFPL3YA, B09NR1RHN...
31436 IT [B004LHKBEI, B01BDQC1L0, B00S6FMXII, B09QXJB82...
11644 FR [B09QGMTDCS, B09ZQ3H328, B07CX6F8LM, B00820RGR...

Validate predictions ✅

In [14]:
def check_predictions(predictions, check_products=False):
    """
    These tests need to pass as they will also be applied on the evaluator
    """
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale = predictions[predictions['locale'] == sess_test['locale'].iloc[0]]
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} don't match"

        if check_products:
            # This check is not done on the evaluator
            # but you can run it to verify there is no mixing of products between locales
            # Since the ground truth next item will always belong to the same locale
            # Warning - This can be slow to run
            products = read_product_data().query(f'locale == "{locale}"')
            predicted_products = np.unique(np.array(list(preds_locale["next_item_prediction"].values)))
            assert np.all(np.isin(predicted_products, products['id'])), f"Invalid products in {locale} predictions"
In [15]:
check_predictions(predictions)
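The slower product-validity check is skipped by default; to also verify that every predicted id belongs to the correct locale, you can additionally run (optional, and can be slow per the warning in the function above):

In [ ]:
check_predictions(predictions, check_products=True)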
In [16]:
# It's important that the parquet file you submit is saved with the pyarrow backend
predictions.to_parquet(f'submission_{task}.parquet', engine='pyarrow')
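Optionally, you can reload the saved file to confirm it round-trips before submitting (a small sanity check, not required by the evaluator):

In [ ]:
reloaded = pd.read_parquet(f'submission_{task}.parquet', engine='pyarrow')
# Each row should carry a list of PREDS_PER_SESSION predicted ids
assert reloaded['next_item_prediction'].apply(len).eq(PREDS_PER_SESSION).all()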

Submit to AIcrowd 🚀

In [ ]:
# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-2-next-product-recommendation-for-underrepresented-languages -f "submission_task2.parquet"
