Discussion

"In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful" - Torvalds, Linus (2006-06-27)

In his 2007 Google talk, Torvalds described his poor experiences with existing version control systems (VCSs) and how he evaluated them (e.g. CVS, SVN, BitKeeper). He explained that every existing VCS failed the following criteria in some way:

  1. Not distributed, not worth using: distribution means "you do not have one central location that keeps track of your data"
    1. commit changes without disturbing others
    2. trust your data without trusting everyone who can push
    3. work without needing an internet connection
    4. release engineering
  2. Performs badly, not worth using
  3. Must guarantee that what you put in comes back out exactly the same, i.e. protect against:
    1. memory corruption
    2. disk corruption (https://youtu.be/4XpnKHJAok8)

W. Trevor King questioned dat in "Why not just use git?" (2014), calling for further evidence for the reasons listed in the dat docs that git is not sufficient. Namely, King asks why dat commits are lighter weight, or whether dat has commits at all. The reasons at issue:

  1. git status can take minutes or hours
  2. git only stores full history

What is data; isn't code also data? For our purposes, data is any dataset that is too large to version with git. I personally have around 10 deep learning projects that involve complicated discussions of how to source the data and reproduce it. My projects are tough to reproduce for two reasons: (1) the data is not easy to get, and (2) the steps of the process are not easy to reproduce.

Objective

  1. Reproducible data science (Shareability vs Reproducibility)
    1. commit my data files
    2. commit my code
    3. push code and data to remote(s)
    4. share repo with a friend
    5. they pull and reproduce
  2. Trustless collaboration
    1. branch dataset
    2. edit dataset
    3. resolve merge conflicts
    4. commit and push to remote
      1. preview changes (diff)
    5. submit dataset for review
      1. compare old to new
      2. are there restrictions / access controls / etc..?
    6. merge dataset

Should model data (e.g. weights) be combined with datasets (e.g. large training datasets, low-latency streaming observations)? What is the role of a model and model pipeline in data version control? I think the criteria used to accept a merge of a new model are so different from the criteria for accepting changes to a dataset that the two should be considered separately.

This leaves two separate topics:

  • dataset version control
  • model version control

Current tools are not doing data versioning; they are doing pipeline versioning. Both DVC and Pachyderm couple models with data and call the result a pipeline.

Design of delta compression for large datasets

The issue is that diffing at the row level takes non-trivial time, and new pushes can occur while a diff is being computed. Representing the files as chunks can speed this up and lets the compute be distributed across nodes. Under this design, dvc push could work roughly as follows (a code sketch of the chunking idea follows the list):

- make an exact copy on the remote(s)
- if a delta compression job is in progress, compare it to the last push
- if a delta compression job is not in progress, start one
- complete the diff calculation job and save the diffs
- delete the dataset copy
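
Below is a minimal Python sketch of the chunk-level idea (not how dvc push actually works): split a file into fixed-size chunks, hash each chunk, and transfer only the chunks the remote has not seen. The remote_manifest set and upload_chunk function are stand-ins for a real remote index and transfer layer.

import hashlib
from pathlib import Path

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB fixed-size chunks

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield the raw bytes of a file, one fixed-size chunk at a time."""
    with Path(path).open("rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            yield block

def upload_chunk(digest, block):
    """Stand-in transfer function: copy one chunk to object storage keyed by its hash."""
    print(f"uploading chunk {digest[:12]} ({len(block)} bytes)")

def delta_push(path, remote_manifest):
    """Push only the chunks whose hashes the remote manifest has not seen."""
    pushed = []
    for block in iter_chunks(path):
        digest = hashlib.md5(block).hexdigest()
        if digest not in remote_manifest:
            upload_chunk(digest, block)
            remote_manifest.add(digest)
            pushed.append(digest)
    return pushed

With content-defined chunking instead of fixed-size chunks, an insertion near the start of a file would not shift every subsequent chunk boundary, which matters for row-level edits to large files.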

Dan@Pach (10 Feb 2021)

Yeah this is one of the big problems with pushing and pulling real data sets to Github. It’s big and it won’t scale because as the data gets bigger and bigger transferring that data becomes quickly untenable.

Some background history on Pachyderm: We explored both the external metadata approach (recording data that changed but not enforcing immutability) and using Git/Github instead of building our own Git like system. We quickly discovered the limitations of scale. Without immutability you have nothing. You can record that I added or deleted 1000 JPEGs to a directory to an external server but as soon as that data changes to a state that you can’t replicate the metadata is worthless. Imagine I do 50 training runs on 1000 JPEGs and then I alter them by crunching them down from 1024x768 to 500x500. My metadata server still tells me I ran 50 training runs on the 1000 JPEGs at 1024x768 but now I don’t have those files so that data is useless. I can’t do those training runs again unless I can restore from backup.

The copy-on-write filesystem of Pachyderm is essential to doing data version control right. You need to record metadata and take a snapshot of changes at the same time to have real reproducibility.

Using Git is another failure state. It wasn’t designed for lots of data. It was designed for code, aka hand-built logic. Data is not code. Code is usually much smaller and allows people to push and pull and check things out and have local copies. For data, the reason Pachyderm chose to have a centralized data system is to deal with the data transfer problem. Pachyderm can cache files locally but usually doesn’t unless you specifically ask it to because you don’t want to be pushing and pulling massive data sets around, as you are discovering now. You want to work on datasets directly in the cloud but appear to be working locally. This is a big difference in terms of design approach. DVC’s approach is good for toy data sets but it won’t work in the real world with real datasets. You’re just working with 1.3 GB but now imagine you’re working with 100 GB of data and 5 other data scientists. Now you’re all keeping a local 100GB copy of data and pushing and pulling it around and trying to keep it in sync. Awful experience that won’t work no matter how hard any team tries.

I think if you are focusing on all the things we discussed that would be brilliant. What I would absolutely love to see is to show a use case where immutability is not optional. Run a training on a dataset with DVC, then overwrite the files and now try to recreate the model. I’d also like you to focus on the problem of scale and what it would be like to push and pull a big dataset locally, especially with limited laptop space and a distributed team.

On trustless collaboration of large datasets

  • Decibel: The Relational Dataset Branching System
    • uses git for large versioned db
  • Aditya Parameswaran (latest work on dataframes)
 

Code

%matplotlib inline

from pathlib import Path
import random
import traceback

from bs4 import BeautifulSoup
import numpy as np

from fastai.basics import *
from fastai.vision import models
from fastai.vision.all import *
from fastai.metrics import *
from fastai.data.all import *
from fastai.callback import *

import wandb
from fastai.callback.wandb import *
import ray
from prcvd.tabular import get_tags_in_order, get_attribute

def get_y_fn(fp):
    """Given a source image path, return the path to its integer-label mask (.tif)."""
    l_str = str(img_to_l[fp])  # img_to_l maps each source image to its label image
    out = l_str \
        .replace('labels', 'labels_int') \
        .replace('png', 'tif')
    return out

Data Versioning

Using the tool called DVC, we will version the relevant artifacts. These files are too large to version inside git because pushing them to the GitHub remote would be expensive. The input dataset is also too large to store on the root volume of the instance running the experiments, so the experiment instance has a separate volume configured for this purpose. DVC encapsulates the steps required to source the data, so the result of our data versioning implementation will be that a new data scientist can reproduce the last committed training environment by simply running:

git clone https://github.com/prcvd/blog.git
cd blog
# git checkout whatever-branch

dvc pull
dvc repro

That new data scientist can then use the same dvc commands dvc add, dvc commit and dvc push to ensure their changes to the dataset are recorded in the git repo for the project.

Key outcomes to explore

  1. share the project with simple steps, where the project now includes:

    1. dataset
    2. code
    3. evaluation
    4. pipeline files
  2. utilize branches to protect the master version of datasets (along with code)

The inputs we want to version are:

  • the labels (training.xml)
  • the pre-processed data
  • label definitions

The outputs we want to version are:

  • the trained model
!dvc add --external /ws/data/skin-tone/headsegmentation_dataset_ccncsa/training.xml
100% Add|██████████████████████████████████████████████|1/1 [00:00,  2.50file/s]

To track the changes with git, run:

	git add training.xml.dvc
!git add training.xml.dvc
!git commit -m 'first commit'
[dvc 4d2b78f] first commit
 1 file changed, 4 insertions(+)
 create mode 100644 _notebooks/training.xml.dvc
!dvc remote add -d aws s3://mlops-datavc/face-profile
Setting 'aws' as a default remote.
!dvc push
  0% Uploading|                                      |0/1 [00:00<?,     ?file/s]
100%|██████████|/ws/data/skin-tone/headseg10.5M/10.5M [00:19<00:00,     610kB/s]
1 file pushed                                                                   
!dvc add --external /ws/data/skin-tone/headsegmentation_dataset_ccncsa/labels
Adding...                                                                       
100%|█████████▉|Computing file/dir hashes 19.2k/19.2k [00:16<00:00,  1.36kmd5/s]
                                                                                
100% Add|██████████████████████████████████████████████|1/1 [00:40, 40.59s/file]

To track the changes with git, run:

	git add labels.dvc
!git add labels.dvc
!git commit -m 'first commit.'
[dvc 3b19978] first commit.
 1 file changed, 5 insertions(+)
 create mode 100644 _notebooks/labels.dvc

path = Path("/ws/data/skin-tone/headsegmentation_dataset_ccncsa")
xml_file = path/'training.xml'

test_name = "test"

tags_to_track = ['srcimg', 'labelimg']

tags = get_tags_in_order(xml_file=xml_file, tags_to_track=tags_to_track)

srcimg_name = [
    get_attribute(elem=srcimg, tag='srcimg', attrib='name') 
    for srcimg in tags['srcimg']
]
labelimg_name = [
    get_attribute(elem=labelimg, tag='labelimg', attrib='name') 
    for labelimg in tags['labelimg']
]

# pair each source image with its corresponding label image, preserving order
pairs = []
for i, srcimg in enumerate(srcimg_name):
    pairs.append({
        'srcimg': srcimg,
        'labelimg': labelimg_name[i]
    })

fnames = [path/pair['srcimg'] for pair in pairs]
lnames = [path/pair['labelimg'] for pair in pairs]
img_to_l = {fname: lnames[i] for i, fname in enumerate(fnames)}  # source image -> label image

header = ('R','G','B','L')
mutiny_labels = [
    (0,0,0,'Background/undefined'),
    (255,0,0,'Lips'),
    (0,255,0,'Eyes'),
    (0,0,255,'Nose'),
    (255,255,0,'Hair'),
    (0,255,255,'Ears'),
    (255,0,255,'Eyebrows'),
    (255,255,255,'Teeth'),
    (128,128,128,'General face'),
    (255,192,192,'Facial hair'),
    (0,128,128,'Specs/sunglasses'),
    (255, 128, 128, '')
]

mutiny_labels = pd.DataFrame(mutiny_labels, columns=header)
mutiny_labels['I'] = mutiny_labels.index
label_map = {
    (rec['R'], rec['G'], rec['B']): rec['I'] 
    for rec in mutiny_labels.to_dict('records')
}
int_to_label = {
    rec['I']: rec['L']
    for rec in mutiny_labels.to_dict('records')
}
codes = mutiny_labels.L.values
# codes = np.append(codes, ['Error'])

# map each label name to the integer code used in the segmentation masks
name2id = {v:k for k,v in enumerate(codes)}
!dvc add --external /ws/data/skin-tone/headsegmentation_dataset_ccncsa/labels
Adding...                                                                       
100% Add|██████████████████████████████████████████████|1/1 [00:19, 19.57s/file]

To track the changes with git, run:

	git add labels.dvc

Trying to recreate my actual steps, I decided that instead of having two separate directories for labels, I would just use a single directory and update the files to match the format required by the dataloader. That was accomplished by:

dvc add labels

cd /ws/data/skin-tone/headsegmentation_dataset_ccncsa/
cp -r labels labels-archive
rm -rf labels
cp -r labels_int labels

I noticed that dvc seemed pretty sluggish when I tried running dvc diff. It ran in two steps: (1) computing new hashes and (2) some unknown step that took minutes, during which my shell was just blank.

I gave up on the diff and just pushed. dvc push also seemed sluggish, maybe taking 2-3x longer than I would expect from past experience. Here's the timing data from the dvc push after adding the 1.3 GB of label files:

real    28m58.997s
user    5m29.589s
sys     0m26.556s

To check my memory, since I haven't pushed to S3 in a while, I timed a plain s3 cp with the AWS CLI:

time aws s3 cp labels s3://mlops-datavc/face-profile-labels-test --recursive

real    28m39.026s
user    3m27.958s
sys     0m21.944s

It also occurs to me that I am pushing from a 100 Mbps connection in Indiana to Virginia (us-east-1), so these times make sense.

Interestingly, the cache created by dvc, found in s3://mlops-datavc/face-profile, is very similar in makeup to the "raw upload" found in s3://mlops-datavc/face-profile-labels-test, but not exactly the same:

s3://mlops-datavc/face-profile (dvc cache)
  Total number of objects: 15,074
  Total size: 1.2 GB

s3://mlops-datavc/face-profile-labels-test (raw upload)
  Total number of objects: 15,260
  Total size: 1.2 GB

Technically, the dvc cache should contain one additional file (training.xml, which was also added to the tracked files). However, what we see is that the dvc cache is approximately the same size as the raw upload in S3 but contains 186 fewer files. We suspect this is because there are duplicate images in the label dataset, which the content-addressed cache stores only once.
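
A quick way to check the duplicate hypothesis (a sketch; it assumes the labels directory is still available locally at the path used above) is to count distinct md5 digests and compare against the number of files:

import hashlib
from collections import Counter
from pathlib import Path

labels_dir = Path("/ws/data/skin-tone/headsegmentation_dataset_ccncsa/labels")

# hash every file under the labels directory and count identical contents
digests = Counter(
    hashlib.md5(p.read_bytes()).hexdigest()
    for p in labels_dir.rglob("*") if p.is_file()
)
n_files = sum(digests.values())
n_unique = len(digests)
print(f"{n_files} files, {n_unique} unique contents, {n_files - n_unique} duplicates")

If the duplicate count comes out near 186, that would support the content-addressed dedup explanation.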

Q: What if I wanted to set up continuous benchmarks of the dvc push transfer rate for this pipeline?
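
One minimal way to start, sketched below: wrap the push in a timer and append each run to a log file. The file name push_benchmarks.jsonl is arbitrary, and a real benchmark would also record how much data was transferred.

import json
import subprocess
import time
from datetime import datetime, timezone

def benchmark_dvc_push(log_path="push_benchmarks.jsonl"):
    """Time one `dvc push` and append the result to a JSON-lines log."""
    start = time.monotonic()
    subprocess.run(["dvc", "push"], check=True)
    elapsed = time.monotonic() - start
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "elapsed_seconds": round(elapsed, 1),
        }) + "\n")
    return elapsed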

What is going on with .dvc/cache?

  1. How are the folders defined? (see the sketch after this list)
    • folders: 00, 01, ..., 09, 0a, 0b, ..., 0f, 10, ..., 1f, ..., 90, ..., 9f, f0, ..., ff (?)
    • files are named by md5 hash (see ETag vs key)
    1. How do files get allocated to folders?
  2. Does the local cache match the remote file structure?
    • yes, except dvc removes duplicate files and renames files by md5 hash
  3. Does dvc keep the correspondence between the filename and the md5 hash? Where is that kept?
    • SQLite (referred to as the "state db"). See issue 3366 -- waiting for a response.
  4. What if the local file structure changes and I do a dvc pull?
    • According to issue 2676, dvc's default is to trust the remote hash (replace? waiting for a dvc response), with an option to compute hashes at every turn ("paranoid" mode, the old default). Changing this default decreased pull latency by approximately a factor of 2.
  5. Compression
    1. Am I getting any compression natively? What format is it?
      • There is no compression applied automatically. Waiting for a response on whether it's configurable.
    2. Am I getting any delta compression on the data? To test:
      1. create a file; add it to the tracker
      2. compare the cache size after a single-value alteration vs. the cache size after a full file change
      3. if delta compression is applied, determine at which level (block, row, value)
  6. One cache per project -- is it possible to split off caches? Different caches for different pipelines?
    • encapsulating each pipeline so that dvc pull/push does not pull unnecessary data
    • enabling different settings on different caches; for example, one might want different conditions for determining whether a file is a duplicate (see issue 1676)
  7. What happens when I pull on top of local changes?
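
As a sketch of the folder question in item 1: my understanding from the cache-directory doc linked below is that the first two hex characters of a file's md5 become the directory name and the remaining characters become the file name. The function below is an illustration of that layout, not dvc's implementation.

import hashlib
from pathlib import Path

def cache_path_for(file_path, cache_root=Path(".dvc/cache")):
    """Where a file's contents would land in a content-addressed md5 cache."""
    digest = hashlib.md5(Path(file_path).read_bytes()).hexdigest()
    # e.g. digest 'ab12...': folder 'ab', file name '12...'
    return cache_root / digest[:2] / digest[2:]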

Waiting on responses for:

  • correspondence between md5 hash and filename: https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
  • remote compression: https://github.com/iterative/dvc/issues/1239

Transitioning from another approach to versioning datasets:

  • versioned buckets
  • pachyderm
  • versioned databases

Recommendations:

  1. Interoperate (at least on ingress)
    - make it super easy to ingest data from everywhere, and that includes history (not just the current state)
      - versioned s3 buckets
      - versioned sql dbs
      - versioned dvc repos
      - versioned everything!!
  2. Interoperate on egress? Would go a long way with the community.

Now I need to add the full training dataset to the version control. This will take a while...

The base images share a root with the label directories, so there are 2 choices:

  1. move the labels out into their own structure. This requires updating the training.xml file to reflect the new location of the label files (or baking it into the code)
  2. add files to the tracker individually instead of a full directory at a time. To do this, we can loop over the srcimg instances in the training.xml file.

Since we have a lot of experience adding whole directories, I want to see how adding individual files works at this scale.

We already have the list of srcimg entries in an object. One limitation of dvc is that it doesn't have a very good Python API, so I am going to write a little subprocess function to add a single file to the dvc tracker.
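A minimal sketch of that wrapper (the srcimgs list is assumed to hold the file paths parsed from training.xml earlier in the notebook):

import subprocess

def dvc_add(filepath):
    """Track a single file with dvc by shelling out to the CLI."""
    result = subprocess.run(
        ['dvc', 'add', str(filepath)],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f'dvc add failed for {filepath}: {result.stderr}')
    return result.stdout

# e.g. loop over the srcimg paths parsed from training.xml
# for src in srcimgs:
#     dvc_add(src)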

class VCRemote:
    def __init__(self, cloud, shortname, uri, masterbranch='master'):
        self.cloud = cloud
        self.shortname = shortname
        self.uri = uri
        self.masterbranch = masterbranch
        # TODO: check if it exists and can be reached
    
    
class VCPipeline:
    def __init__(self, name, logger, masterbranch='master'):
        """
        name: must be unique
        """
        self.name = name
        self.remotes = {}
        self.logger = logger
        self.branchname = None
        self.masterbranch = masterbranch  # fallback branch when deleting the current branch
    
    
    def _make_repo_url(self, ):
        pass
    
        
    def _initialize(self):
        """e.g. git configure; git init"""
        # set up credentials - identity
        # set up repo
        pass
    
    
    def _add(self, key):
        """e.g. git add <key>"""
        pass
    
    
    def _commit(self, msg):
        """e.g. git commit -m '<msg>'"""
        pass
    
    def _push(self,):
        """e.g. git push"""
        pass
    
    
    def _pull(self,):
        """e.g. git pull"""
        pass
    
    
    def _status(self,):
        """e.g. git status"""
        pass
    
    
    def _merge(self, mergebranch):
        """e.g. git merge <mergebranch>"""
        pass
    
    
    def _checkout(self, branchname):
        """git checkout branchname"""
        pass
    
    
    def _create_branch(self, branchname, push=False):
        """e.g. git checkout -b <branchname>"""
        # TODO: check if branch exists, if so, just check it out
        pass
    
    
    def _rm_local(self, deletebranch):
        """e.g. git branch -d <deletebranch>"""
        pass
    
    def _rm_remote(self, deletebranch):
        """e.g. git push origin --delete <deletebranch>"""
        pass
        
    
    def rm_branch(self, deletebranch, local=True, remote=False):
        """User-facing: delete a branch locally and/or on the remote."""
        if deletebranch == self.branchname:
            # can't delete the branch we're on; move to master first
            self.switch_branch(
                branchname=self.masterbranch, create_on_fail=False
            )
        
        if local:
            self._rm_local(deletebranch=deletebranch)
            
        if remote:
            self._rm_remote(deletebranch=deletebranch)
    
    
    def switch_branch(self, branchname, create_on_fail=True):
        """User-facing function to change branches."""
        try:
            self._checkout(branchname=branchname)
            
        except Exception as e:
            if create_on_fail:
                self._create_branch(branchname=branchname)
            
            else:
                raise e
        
        self.branchname = branchname
    
    
    def update_and_return(self, updatebranch):
        """Pull the latest on <updatebranch>, then switch back to the current branch."""
        currentbranch = self.branchname
        self.switch_branch(branchname=updatebranch)
        self._pull()
        self.switch_branch(branchname=currentbranch)
        
        
    def merge(self, mergebranch, update_all=True):
        """e.g. git merge <mergebranch>"""
        if update_all:
            self.update_and_return(updatebranch=mergebranch)
        
        self._merge(mergebranch=mergebranch)
        
    
    def add_remote(self, cloud, shortname, uri):
        if shortname in self.remotes:
            raise Exception(
                'Remote shortname, {} already exists'
                    .format(shortname)
            )
            
        self.remotes[shortname] = VCRemote(
            cloud=cloud, shortname=shortname, uri=uri
        )
        

class GithubPipeline(VCPipeline):
    def __init__(self, name, logger=None):
        """
        name: must be unique
        """
        import subprocess
        super().__init__(name=name, logger=logger)
        
        
    def _initialize(self, username, password):
        """e.g. git configure; git init"""
        # set up credentials - identity, auth
        # get or set up repo
        # check it works
        pass
    
    
    def _add(self, key):
        """e.g. git add <key>"""
        pass
    
    
    def _commit(self, msg):
        """e.g. git commit -m '<msg>'"""
        pass
    
    def _push(self,):
        """e.g. git push"""
        pass
    
    
    def _status(self,):
        """e.g. git status"""
        pass
    
    
    def _merge(self, mergebranch):
        """e.g. git merge <mergebranch>"""
        pass
    
    
    def _checkout(self, branchname):
        """git checkout branchname"""
        pass
    
    
    def _create_branch(self, branchname, push=False):
        """e.g. git checkout -b <branchname>"""
        # TODO: check if branch exists, if so, just check it out
        pass
    
    
    def _rm_local(self, deletebranch):
        """e.g. git branch -d <deletebranch>"""
        pass
    
    def _rm_remote(self, deletebranch):
        """e.g. git push origin --delete <deletebranch>"""
        pass
        
# PEP 374 discusses version control objects in Python development
# https://www.python.org/dev/peps/pep-0374/#id1
import os
from pathlib import Path
from git import Repo
import git
def get_or_create_git_repo(path: Path, create_new=False):
    try:
        repo = Repo(path)
        
    except git.InvalidGitRepositoryError:
        if create_new:
            repo = Repo.init(path)  # create a fresh repository at path
            
        else:
            return None
    
    return repo
repopath = Path(os.getcwd()).parent
repo = get_or_create_git_repo(path=repopath)
repo_reader = repo.config_reader()
repo_reader.get_value('user', 'email')  # e.g. inspect the configured git identity

defaults.use_cuda = True  # fastai default: train on the GPU
number_of_the_seed = 2020
random.seed(number_of_the_seed)
set_seed(number_of_the_seed)
torch.backends.cudnn.deterministic = True  # reproducible cuDNN kernels
torch.backends.cudnn.benchmark = False
monitor_training = "valid_loss"
comp_training = np.less
monitor_evaluating = "dice"
comp_evaluating = np.greater
patience = 2
size = 448         # images resized to size x size
bs = 12            # batch size (limited by GPU memory, see below)
valid_pct = 0.35   # fraction of the data held out for validation

dls = SegmentationDataLoaders.from_label_func(
    path, 
    bs=bs, 
    valid_pct=valid_pct,
    fnames=fnames,
    label_func=get_y_fn,
    codes=codes, 
    item_tfms=[Resize((size,size),)],
    batch_tfms=[Normalize.from_stats(*imagenet_stats)]
)

wandb.init()
learn = unet_learner(dls, resnet34, cbs=WandbCallback())
Failed to query for notebook name, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable
wandb: Currently logged in as: soellingeraj (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.10.17 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
Tracking run with wandb version 0.10.12
Syncing run quiet-darkness-2 to Weights & Biases (Documentation).
Project page: https://wandb.ai/soellingeraj/blog-_notebooks
Run page: https://wandb.ai/soellingeraj/blog-_notebooks/runs/hubg1q3a
Run data is saved locally in /ws/forks/blog/_notebooks/wandb/run-20210205_132907-hubg1q3a

Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /root/.cache/torch/hub/checkpoints/resnet34-333f7ec4.pth

learn.fine_tune(1)

learn.freeze() # Freezing the backbone

Epoch, Batch Size and Learning Rate

An epoch in model training is one full pass over the training set; within an epoch, the data is processed in batches, which are randomly chosen shuffles of the training set. For each batch, the forward-pass results are accounted for before a backward pass is completed. Additionally, FastAI loads all the images in a batch into GPU memory at once, so there is a batch size sufficiently large that it will cause the GPU to run out of memory. For us, that was 12 images. For segmentation models on R, G, B encoded images, the labels, i.e. the segmentation mask, can increase the GPU memory required per batch by about 1/3, because there is one additional channel per pixel: 4 instead of 3. That assumes an 8-bit integer is used to store the R, G and B channels as well as the label. Strictly, we don't need 8 bits for the label, since there are only about 10 different target values (Skin, Nose, etc.), but the smallest integer type provided by torch is 8-bit.13 Keeping GPU memory down is likely the reason the PILMask needs to be a single integer per pixel: if we loaded the mask as Mut1ny provides it, as its own R, G, B encoded image, it would add three 8-bit integers per pixel (the R, G and B components of the mask) instead of one, roughly doubling the memory required.
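As a rough back-of-the-envelope check of the raw tensor sizes, using the size and bs from the config cell above (activations and gradients, which dominate real GPU usage, are not counted here):

bs, h, w = 12, 448, 448       # batch size and resized image dimensions from above
img_channels, mask_channels = 3, 1
bytes_per_value = 1           # 8-bit integers

image_bytes = bs * h * w * img_channels * bytes_per_value
mask_bytes = bs * h * w * mask_channels * bytes_per_value
print(image_bytes / 1e6)  # ~7.2 MB of raw image data per batch
print(mask_bytes / 1e6)   # ~2.4 MB for the masks, i.e. +1/3 on top of the images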

The batch size is set by the person training the model and defines the number of images evaluated per forward/backward pass. In this process, we chose the largest batch size that did not run out of memory. The batch size chosen, 12 images, seldom ran out of memory once the images were resized to 256 px by 256 px. (There was some additional load on the GPU from other processes, which would spawn unpredictably.)

Note: A forward pass is the model prediction from input to output, and the backward pass is the gradient descent step, in which the model's weights and biases are updated as a function of the gradients of the loss with respect to those parameters and the learning rate. The most intuitive explanation of this process I have heard is Jeremy Howard's explanation of Stochastic Gradient Descent12
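For intuition, here is a minimal sketch of the two passes in plain PyTorch (a toy linear model standing in for the unet_learner above, not the actual training loop fastai runs):

import torch
from torch import nn

model = nn.Linear(10, 1)                         # toy model standing in for the UNet
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
x, y = torch.randn(12, 10), torch.randn(12, 1)   # one batch of 12 examples

pred = model(x)                                  # forward pass: input -> prediction
loss = nn.functional.mse_loss(pred, y)
loss.backward()                                  # backward pass: d(loss)/d(weights)
opt.step()                                       # update weights by -lr * gradient
opt.zero_grad()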

Increasing Loss(Learning Rate)

As I trained the model, I periodically evaluated the Loss vs. Learning Rate chart (that is, lr_find). I noticed with interest that the loss was always increasing with the learning rate, even after the first epochs. One open question is why there are no decreasing segments of the Loss vs. Learning Rate chart like we see in the FastAI tutorial6. There, Jeremy Howard instructs users to look for regions where the loss is decreasing as a function of learning rate, but on this chart such a region is hard to find.

Choosing a Learning Rate

Given each epoch took around 25 minutes to complete, I was only able to do limited experimentation with the learning rate.

learn.lr_find() # find learning rate
SuggestedLRs(lr_min=9.12010818865383e-08, lr_steep=4.365158383734524e-05)

learn.recorder.plot_loss() # plot learning rate graph

lrs = slice(10e-6, 10e-5)
learn.fit_one_cycle(12, lrs)
fit_one_cycle progress (interrupted partway through; 2 of the 12 epochs completed, with a third epoch about 76% done):

epoch  train_loss  valid_loss  time
0      0.034851    0.085248    25:39
1      0.032958    0.089029    25:40

learn.show_results()

Model Validation

Validation Data Set

After about 24 hours of compute spent iterating on the bounds for the variable learning rate, I started looking at the target data set provided by Schniter and Shields2. The data set contains 2 frontal photos per subject, showing the faces of participants wearing microphone and speaker headsets while seated at computer terminal cubicles in a laboratory setting; the photos were captured from video recordings of the participants.

The full-face photographs of 96 young adults aged 18 to 25 years old (51 men, 45 women) were captured from video recorded under standardized videographic conditions. All individuals video recorded were between # and # years old and were students at Chapman University in Orange, CA. Videos were taken in a computer terminal cubicle against a background of either a gray carpeted cubicle or a brown wall under standardized diffuse lighting conditions, and participants were instructed to sit upright and look at the camera mounted above the computer screen they were facing. Camera-to-head distance was controlled by the cubicle space and chair position, and camera settings were held constant. Video was taken using Logitech C920 1080P HD digital cameras. Photographs (640 x 480 pixel .jpg files, 24 Bit depth (R,G,B)) were captured from video frames using the VLC media player (3.0.11) “snapshot” tool.

The subjects were participants in an economic experiment, interacting anonymously with matched partners in two rounds of a repeated Prisoner’s Dilemma game. Participants knew that at no time would their image be transmitted to other participants during the original economic experiment. The researchers captured video of the participants with their permission for later reuse. The researchers were interested in whether information in the faces of participants (e.g. face proportions, coloration, expressions) was predictive of their game play (i.e., cooperate or cheat) and whether third parties could accurately guess their game play based on viewing their faces (from photos or thin-slice videos).

mod_dir = path/'Models'
mod_fp_chk = mod_dir/'checkpoint_20201007'
mod_fp_chk.parent.mkdir(parents=True, exist_ok=True)
# learn.save(mod_fp)
learn.export(mod_fp_chk)
erics_img = Path('/ws/data/skin-tone/ScreenshotFaceAfterStatement/erics_imgs')
ls = erics_img.ls()

Example Image

We can see from the example image and its predicted mask that the model generalized decently well, but there were some issues. Below, it struggles with the headphones and misclassifies "General face" down the subject's neck. For many tasks this will be a sufficient regional localization, because the Nose, Eyes and Mouth are labeled accurately enough that post-processing can eliminate the inaccuracies. For example, when measuring skin tone, we care about the region labeled pink in the mask ("General face") and the region labeled orange ("Nose"); those are the labels that carry skin tone. The main objective before shipping this model is to determine, from a metric standpoint (accuracy) and from an intuitive standpoint (looking at the images), whether the nose and face labels are accurate enough.

img = PILImage.create(ls[11])
pmask = learn.predict(item=img)
img.show(figsize=(5, 5), alpha=1)
pmask[0].show(figsize=(5, 5), alpha=1)

Filtering on a Confidence Threshold

I was curious whether we could apply a simple rule to the output of the UNet prediction that would improve effective accuracy without any more training epochs, since each epoch is expensive (about 25 minutes). If I could get this model training faster, with more GPUs, more GPU memory, or code enhancements, I would try many different things in earlier stages to make the network more accurate. Here is the code that implements the confidence threshold.

thresh = 0.99            # keep only pixels predicted with >= 99% confidence
low_conf = []            # pixels below the confidence threshold
errors = []              # pixels where my argmax disagrees with fastai's decoded mask
new_decision = pmask[1].clone()   # pmask[1]: decoded class index per pixel
for i in range(pmask[1].shape[0]):
    for j in range(pmask[1].shape[1]):
        decision = pmask[1][i, j]
        # pred_with_prob (helper defined elsewhere in the notebook) returns
        # the argmax class and its probability for one pixel
        decision_me, conf = pred_with_prob(
            pix_vals=pmask[2][:, i, j]    # pmask[2]: per-class probabilities
        )
        
        if not decision == decision_me[0, 0]:
            errors.append([i, j])
        if conf < thresh:
            low_conf.append([i, j])
            new_decision[i, j] = 0        # 0 = background / "no decision"
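The nested per-pixel loops are slow. A vectorized sketch of the same filter (assuming, as above, that pmask[2] is a classes x H x W probability tensor and pmask[1] is the per-pixel class index):

probs = pmask[2]                         # shape: (n_classes, H, W)
conf, argmax = probs.max(dim=0)          # per-pixel best class and its probability
new_decision = pmask[1].clone()
new_decision[conf < thresh] = 0          # zero out low-confidence pixels
errors = (argmax != pmask[1]).nonzero()  # pixels where argmax disagrees with the decoded mask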

Base Image

img.show(figsize=(5, 5), alpha=1)

Unfiltered Prediction

plt.imshow(pmask[0])

Filtered Prediction

plt.imshow(new_decision)

Conclusions

In Summation

  1. The UNet worked well out-of-the-box using the default process. Yay!
  2. The UNet trains slowly on my experimental deep learning box
  3. The regions of the face that are topologically interesting like the nose, mouth, eyes and hair are predicted well. These regions can be leveraged to improve the accuracy of a skin-tone predictor by further isolating regions such as cheeks and forehead.

Next Steps

  1. Improve Unet model accuracy
    a. Parameter Sweeps
    b. Learning Rate Selections
    c. Code Efficiency + More Hardware (Google Colab?) = More Epochs
    d. Try different backbones: resnet18, resnet50, other...
    e. Try different pre-trained weights: Try fine-tuning the pre-trained weights using faces

Citations

1. Mut1ny.com. (2020). "Mut1ny Facial/Headsegmentation dataset". https://www.mut1ny.com/face-headsegmentation-dataset

2. Schniter, E., Shields, T. (2020). "Participant Faces From a Repeated Prisoner’s Dilemma". Unpublished raw data.

5. Biewald, L. "The story of Fast.ai & why Python is not the future of ML with Jeremy Howard". (2020). https://www.wandb.com/podcast/jeremy-howard

6. Howard, J. "Lesson 3: Deep Learning 2019 - Data blocks; Multi-label classification; Segmentation". (2019). https://youtu.be/MpZxV6DVsmM?t=4176. 1:09:00

9. Ronneberger, O. "U-Net: Convolutional Networks for Biomedical Image Segmentation". (2015). https://arxiv.org/abs/1505.04597

10. He, K., Zhang, X., Ren, S. and Sun, J. "Deep Residual Learning for Image Recognition". (2015). https://arxiv.org/abs/1512.03385

11. Pytorch, Torchvision. "TORCHVISION.MODELS". (2020). https://pytorch.org/docs/stable/torchvision/models.html

12. Google Machine Learning Glossary. "Epoch". https://developers.google.com/machine-learning/glossary#epoch

12. Howard, J. "Lesson 3 - Deep Learning for Coders (2020)". (2020). https://youtu.be/5L3Ao5KuCC4?t=5988. 1:40:00

13. PyTorch. "TORCH.TENSOR". (2020). https://pytorch.org/docs/stable/tensors.html.

14. Yang, S., Luo, P., Loy C.C. "Faceness-Net: Face Detection through Deep Facial Part Responses". (2017). https://arxiv.org/pdf/1701.08393.pdf.
