Directory structure for machine learning projects

## Challenges in machine learning projects

- Training and inference must apply the same preprocessing.
- Once you start tuning preprocessing and features, it becomes hard to tell which version of the feature code produced the features a given model was trained on.
- The same code should run everywhere.
  - Locally, in a cloud-hosted notebook, on a cloud machine with GPUs, under distributed processing, and so on.
- There is no de facto standard, so code written by other people is hard to read.
- Writing freely may be fine while experimenting, but code that has to be maintained over the long term should also be written with readability and maintainability in mind.

## Directory structure

### cookiecutter-data-science

- Based on cookiecutter-data-science, with only a `sagemaker` directory added.

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── sagemaker          <- Scripts for SageMaker
│   ├── Dockerfile     <- Dockerfile for SageMaker custom container
│   └── entry.py       <- Entry script to train and deploy
│
├── setup.py           <- Make this project pip installable with `pip install -e .`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```

### 機械学習で泣かないためのコード設計 2018 (Code Design for Machine Learning Without Tears, 2018)

- Split the code into the following modules:
  - Dataset
  - Transformer
  - Trainer
  - Model
  - Storage
  - Model API

## References

- [https://drivendata.github.io/cookiecutter-data-science/#directory-structure](https://drivendata.github.io/cookiecutter-data-science/#directory-structure)
- [https://www.slideshare.net/takahirokubo7792/2018-97367311](https://www.slideshare.net/takahirokubo7792/2018-97367311)
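
## Sketch of the module split

As a rough illustration of the Dataset / Transformer / Trainer / Model / Storage / Model API split listed above, here is a minimal sketch using only the Python standard library. It is not the design from the referenced slides: every class body, the `feature_version` tag, the JSON artifact layout, and the toy one-feature linear model are assumptions made for illustration. The idea it tries to show is that the fitted preprocessing parameters are stored together with the model, so inference applies the same preprocessing as training, and the artifact records which feature code it was built with.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean, pstdev


class Dataset:
    """Raw inputs and targets; a real project would read them from data/raw."""

    def __init__(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)


class Transformer:
    """Standardizes one numeric feature. Used by both training and inference."""

    VERSION = "features-v1"  # hypothetical tag identifying the feature code

    def __init__(self, mean_=0.0, std_=1.0):
        self.mean_, self.std_ = mean_, std_

    def fit(self, xs):
        self.mean_ = mean(xs)
        self.std_ = pstdev(xs) or 1.0  # fall back to 1.0 to avoid dividing by zero
        return self

    def transform(self, xs):
        return [(x - self.mean_) / self.std_ for x in xs]


class Model:
    """Toy least-squares line y = a * x + b on a single feature."""

    def __init__(self, a=0.0, b=0.0):
        self.a, self.b = a, b

    def fit(self, xs, ys):
        mx, my = mean(xs), mean(ys)
        var = sum((x - mx) ** 2 for x in xs) or 1.0
        self.a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
        self.b = my - self.a * mx
        return self

    def predict(self, xs):
        return [self.a * x + self.b for x in xs]


class Trainer:
    """Fits the transformer and the model together and returns one artifact."""

    def train(self, dataset):
        transformer = Transformer().fit(dataset.xs)
        model = Model().fit(transformer.transform(dataset.xs), dataset.ys)
        return {
            "feature_version": Transformer.VERSION,
            "transformer": {"mean_": transformer.mean_, "std_": transformer.std_},
            "model": {"a": model.a, "b": model.b},
        }


class Storage:
    """Saves and loads artifacts as JSON files under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, name, artifact):
        (self.root / f"{name}.json").write_text(json.dumps(artifact))

    def load(self, name):
        return json.loads((self.root / f"{name}.json").read_text())


class ModelAPI:
    """Inference entry point that rebuilds the training-time preprocessing."""

    def __init__(self, artifact):
        # Refuse artifacts produced by a different version of the feature code.
        if artifact["feature_version"] != Transformer.VERSION:
            raise ValueError("artifact was trained with different feature code")
        self.feature_version = artifact["feature_version"]
        self.transformer = Transformer(**artifact["transformer"])
        self.model = Model(**artifact["model"])

    def predict(self, raw_xs):
        return self.model.predict(self.transformer.transform(raw_xs))


def demo():
    # Four points on y = 2x + 1, trained, saved, reloaded, then queried at x = 5.
    dataset = Dataset(xs=[1.0, 2.0, 3.0, 4.0], ys=[3.0, 5.0, 7.0, 9.0])
    with tempfile.TemporaryDirectory() as tmp:
        storage = Storage(tmp)
        storage.save("linear", Trainer().train(dataset))
        api = ModelAPI(storage.load("linear"))
        return api.predict([5.0]), api.feature_version
```

Calling `demo()` returns a prediction of approximately 11.0 together with the feature version tag. Because `ModelAPI` rebuilds the `Transformer` from the stored mean and standard deviation, the raw input is scaled exactly as it was during training. In the directory layout above, classes like these would plausibly live under `src/`, with `sagemaker/entry.py` calling into them, though that mapping is also an assumption.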