Contributing to the project

Prerequisites

Install python3-venv, curl, make and git before Poetry:

sudo apt-get install -y curl make git python3-venv

Poetry

curl -sSL https://install.python-poetry.org | python3 -

Add the following line to your .bashrc:

export PATH="$HOME/.local/bin:$PATH"

Tell Poetry which Python version to use: Python 3.10

If your Python version is newer than 3.10, you can use pyenv to set the Python version used locally in this directory:

pyenv local 3.10

poetry env use 3.10
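
Before pointing Poetry at an interpreter, it can help to confirm which version is actually active. The sketch below is only an illustration and not part of the project: the helper name matches_required is hypothetical, and its (3, 10) default simply mirrors the version mentioned above.

```python
import sys


def matches_required(version_info, required=(3, 10)):
    """Return True when the major.minor part of version_info equals `required`."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    # Report whether the running interpreter is the expected Python 3.10.
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if matches_required(sys.version_info) else "not 3.10"
    print(f"Python {current}: {status}")
```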

Installing the dependencies

poetry config virtualenvs.in-project true
poetry install

poetry config virtualenvs.in-project true creates the virtual environment as a subdirectory of the project rather than in your home directory. This is recommended so that VSCode can find the environment.

Developing the pipeline requires additional packages:

poetry install --extras "pipeline"

Debugging Poetry

To remove an environment (see https://python-poetry.org/docs/managing-environments/):

poetry env list
poetry env remove 3.7

To clean up everything:

rm poetry.lock
poetry env list
poetry env remove leximpact-prepare-data-0Rkp9wuO-py3.8
poetry cache clear --all pypi
poetry env use -vvv 3.8
poetry install

To display the dependency tree:

poetry show --tree

If the installation fails:

rm poetry.lock

How to develop

Secure link to the ERFS-FPR data

To run a local algorithm on the protected data hosted on the server:

sudo mkdir -p /mnt/data-in /mnt/data-out
sudo chown $USER:$USER /mnt/data-*
sshfs ysabell:/data/private-data/input /mnt/data-in
sshfs ysabell:/data/private-data/output /mnt/data-out

Run these commands as your local $USER. ysabell is a host alias defined in your local ~/.ssh/config.
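
A notebook that reads from these directories fails in confusing ways when sshfs is not mounted. As a minimal sketch (the helper name missing_mounts is hypothetical, and the two paths are the ones used in the commands above), the mounts can be checked from Python first:

```python
import os


def missing_mounts(paths):
    """Return the entries of `paths` that are not currently mount points."""
    return [path for path in paths if not os.path.ismount(path)]


if __name__ == "__main__":
    # Mount points taken from the sshfs commands above.
    not_mounted = missing_mounts(["/mnt/data-in", "/mnt/data-out"])
    if not_mounted:
        print("Not mounted:", ", ".join(not_mounted))
    else:
        print("Both data directories are mounted.")
```

os.path.ismount returns False for a path that does not exist, so the check also covers the case where the directories were never created.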

Create symlinks

!ln -s ../leximpact_prepare_data
!cd analyses && ln -s ../../leximpact_prepare_data
!cd extractions_base_des_impots && ln -s ../../leximpact_prepare_data
!cd retraitement_erfs-fpr && ln -s ../../leximpact_prepare_data
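
The same links can be created from Python, which avoids errors when a link already exists. This is a sketch under assumptions, not the project's tooling: the function name link_package is hypothetical, and it assumes the notebook directory sits next to the package directory, as the relative targets above imply.

```python
import os
from pathlib import Path


def link_package(notebook_root, package_name, subdirs):
    """Create relative symlinks to the package in notebook_root and each subdir.

    Mirrors `ln -s ../<package>` at the top level and `ln -s ../../<package>`
    one level down. Existing entries are left untouched.
    """
    root = Path(notebook_root)
    targets = [(root, Path("..") / package_name)]
    targets += [(root / sub, Path("..") / ".." / package_name) for sub in subdirs]

    created = []
    for folder, target in targets:
        link = folder / package_name
        # Skip anything already present, including dangling symlinks.
        if not link.is_symlink() and not link.exists():
            os.symlink(target, link, target_is_directory=True)
            created.append(link)
    return created
```

Called from the notebook directory, link_package(".", "leximpact_prepare_data", ["analyses", "extractions_base_des_impots", "retraitement_erfs-fpr"]) would reproduce the four commands above.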

Update packages to their latest versions

poetry update

Jupyter

The first time, and after adding a library:

poetry run python -m ipykernel install --name leximpact-prepare-data-kernel --user

Launch Jupyter

poetry run jupyter lab

Check style

make precommit

Update pre-commit

Run this from time to time to stay up to date:

poetry run pre-commit autoupdate

NBDev

Run pre-commit before converting the notebooks:

poetry run pre-commit run --all-files

Build the library from the notebooks:

poetry run nbdev_build_lib

Build the docs from the notebooks:

poetry run nbdev_build_docs

Re-run pre-commit:

poetry run pre-commit run --all-files

# To format the code automatically (see the precommit entry in the Makefile for details)
!make precommit

# Build docs from notebook
#!poetry run nbdev_build_docs
!cd .. && make docs

How we build the docs

The documentation is available at https://documentation.leximpact.dev/leximpact_prepare_data/

It is built with NBDev in the GitLab CI.

We do it like this:
- Use the Poetry env as the default environment.
- Use a separate venv to remove notebook outputs, because --clear-output does not work with nbconvert < 6, which other dependencies require. We strip the outputs to avoid publishing sensitive data. We still have to find a better way to publish outputs without sensitive data.
- Build the docs with poetry run nbdev_docs.
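
For illustration, notebook outputs can also be stripped with the standard library alone, without nbconvert. This is a minimal sketch and not what the CI actually runs: it assumes the nbformat 4 JSON layout (a top-level "cells" list), and the function names are hypothetical. It only clears cell outputs and execution counts, so anything sensitive stored in cell sources or metadata would remain.

```python
import json


def clear_outputs(notebook):
    """Remove outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(path):
    """Rewrite the notebook at `path` without any cell output."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1, ensure_ascii=False)
        handle.write("\n")
```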

Then we copy the docs via scp to our server and serve them with Nginx.

Since NBDev v2, the documentation is a purely static site.

After upgrading NBDev, do not forget to upgrade Quarto:

curl -LO https://www.quarto.org/download/latest/quarto-linux-amd64.deb && dpkg -i quarto-linux-amd64.deb

The CI also pushes the docs to a branch. This requires a token from https://git.leximpact.dev/admin/users/project_18_bot/impersonation_tokens, stored in the CI variable API_TOKEN.

Testing the docs locally

poetry run nbdev_preview

Anaconda on CASD

Building the package

docker run -i -t -v $PWD:/src continuumio/miniconda3 /bin/bash
cd /src
python3 gitlab-ci/src/get_pypi_info.py -p leximpact-prepare-data
conda install -y conda-build anaconda-client
conda config --set anaconda_upload yes
conda build -c conda-forge -c leximpact -c openfisca .conda

To upload the packages:

anaconda login
anaconda upload \
    /opt/conda/conda-bld/noarch/leximpact-prepare-data-0.0.8-py_0.tar.bz2 \
    /opt/conda/conda-bld/noarch/leximpact-prepare-data-casd-0.0.8-py_0.tar.bz2 \
    /opt/conda/conda-bld/noarch/leximpact-prepare-data-dev-0.0.8-py_0.tar.bz2
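
The version number is repeated in each path, so it is easy to forget one when bumping it. As a sketch only (the helper artifact_paths is hypothetical, and the py_0 build string and the build directory are copied from the command above), the three paths can be derived from a single version string:

```python
from pathlib import PurePosixPath

# Package names taken from the upload command above.
PACKAGES = (
    "leximpact-prepare-data",
    "leximpact-prepare-data-casd",
    "leximpact-prepare-data-dev",
)


def artifact_paths(version, build_dir="/opt/conda/conda-bld/noarch"):
    """Return the expected conda artifact paths for `version`."""
    return [
        str(PurePosixPath(build_dir) / f"{name}-{version}-py_0.tar.bz2")
        for name in PACKAGES
    ]


if __name__ == "__main__":
    # Print the paths on one line, ready to paste after `anaconda upload`.
    print(" ".join(artifact_paths("0.0.8")))
```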

Local test

Install the package in a clean environment:

mkdir -p casd-test
cd casd-test
git clone https://git.leximpact.dev/leximpact/simulateur-socio-fiscal/budget/leximpact-prepare-data.git
rm -rf ./conda-env
conda create --prefix ./conda-env python=3.8
conda activate ./conda-env
conda config --add channels conda-forge
conda config --set channel_priority strict
conda install -c conda-forge -c openfisca -c leximpact leximpact-prepare-data-casd
ipython kernel install --user --name=prepare-data-conda-env

To check that everything worked:

jupyter lab

Then open the file leximpact-prepare-data/notebook/extractions_base_des_impots/test_install.ipynb and run it.

To leave the environment:

conda deactivate

Initializing the ERFS-FPR database

We receive SAS files about households from INSEE.

However, our processing needs tax households (foyers fiscaux).

To convert households into tax households, we use OpenFisca France Data.

The continuous integration of OpenFisca France Data performs this conversion. Its output is on the server in /mnt/data-out/leximpact/erfs-fpr/, which gives us the file openfisca_erfs_fpr_2021.h5 that we use in the next step.

If you ever need to redo it by hand:

git clone git@git.leximpact.dev:benjello/openfisca-france-data.git
cd openfisca-france-data/
python3 -m venv .venv
source .venv/bin/activate
make install
cp /mnt/data-out/openfisca-france-data/openfisca_survey_manager_config-after-build-collection.ini ~/.config/openfisca-survey-manager/config.ini
cp /mnt/data-out/data_collections/bilal/erfs_fpr.json ./erfs_fpr.json
nano ~/.config/openfisca-survey-manager/config.ini
nano /home/jupyter-benoit/openfisca-france-data/erfs_fpr.json
cp /mnt/data-out/erfs_fpr_2021.h5 /home/jupyter-benoit/openfisca-france-data/erfs_fpr_2021.h5
build-erfs-fpr -y 2021

The build-erfs-fpr script runs the code at openfisca_france_data.erfs_fpr.input_data_builder:main.
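
The module:function notation is the usual way Python console scripts name their entry point: the part before the colon is a module to import, the part after it is the object to call. The sketch below shows how such a string resolves. The helper resolve_entry_point is hypothetical and is demonstrated on a standard-library function so that it runs without openfisca-france-data installed.

```python
import importlib


def resolve_entry_point(spec):
    """Import `module:attribute` and return the object it refers to."""
    module_name, _, attribute = spec.partition(":")
    target = importlib.import_module(module_name)
    # Walk dotted attributes; an empty attribute part returns the module itself.
    for part in filter(None, attribute.split(".")):
        target = getattr(target, part)
    return target


if __name__ == "__main__":
    # Standard-library stand-in for the real entry point string.
    join = resolve_entry_point("os.path:join")
    print(join("data", "erfs_fpr.json"))
```

With openfisca-france-data installed in the active environment, passing the string above should return the function that build-erfs-fpr calls.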