# csvdiff

A blazingly fast diff tool for comparing CSV files.
## What is csvdiff?

csvdiff is a diff tool that computes the changes between two CSV files.

- It is not a traditional diff tool, and it is most suitable for comparing CSV files dumped from database tables. For plain line-by-line comparison, GNU diff is orders of magnitude faster.
- Supports specifying a group of columns as the primary key.
- Supports selective comparison of fields in a row.
- Compares CSVs with a million records in under 2 seconds. Comparisons and benchmarks here.
## Demo

## Usage

```bash
$ csvdiff base.csv delta.csv
# Additions: 1
# Modifications: 20
# Rows:
...
```
## Installation

macOS (binary tarball):

```bash
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_darwin_amd64.tar.gz | tar xfz -
```

RPM-based Linux:

```bash
yum install https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_linux_64-bit.rpm
```

Debian-based Linux:

```bash
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_linux_64-bit.deb -O
dpkg --install csvdiff_*_linux_64-bit.deb
```

Linux (binary tarball):

```bash
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_linux_amd64.tar.gz | tar xfz -
```

Using `go get`:

```bash
go get -u github.com/aswinkarthik93/csvdiff
```
## Use case

- Cases where you have a base database dump as a CSV. If you receive the changes as another database dump in CSV form, this tool can figure out the additions and modifications relative to the original dump. The additions can be used to create an `insert.sql`, and the modifications an `update.sql`, for a data migration.
- The delta file can either contain just the changes or the entire table dump along with the changes.
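The migration step described above can be sketched in Go. This is an illustration, not part of csvdiff: the table name `sites`, the file name `additions.csv`, and the quoting scheme are all assumptions; adapt them to your schema.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

// insertStatement builds one INSERT statement for a single CSV row.
// Single quotes inside fields are escaped by doubling them.
func insertStatement(table string, row []string) string {
	quoted := make([]string, len(row))
	for i, field := range row {
		quoted[i] = "'" + strings.ReplaceAll(field, "'", "''") + "'"
	}
	return fmt.Sprintf("INSERT INTO %s VALUES (%s);", table, strings.Join(quoted, ", "))
}

func main() {
	// "additions.csv" is the file produced from csvdiff's additions output.
	f, err := os.Open("additions.csv")
	if err != nil {
		// No file present: show a sample statement instead of failing.
		fmt.Println(insertStatement("sites", []string{"24564", "907", "completely-newsite.com"}))
		return
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		panic(err)
	}
	for _, row := range rows {
		fmt.Println(insertStatement("sites", row))
	}
}
```

The same pattern with `UPDATE ... WHERE <primary key>` would cover the modifications file.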
## Supported

- Additions
- Modifications

## Not supported

- Deletions
- Non-comma separators
- Use as a generic diff tool: csvdiff requires one or more columns of the CSV to act as a primary key.
## Miscellaneous features

- By default, it marks each row as ADDED or MODIFIED by appending a new column at the end.

```bash
% csvdiff examples/base-small.csv examples/delta-small.csv
Additions 1
Modifications 1
Rows:
24564,907,completely-newsite.com,com,19827,32902,completely-newsite.com,com,1621,909,19787,32822,ADDED
69,1048,aol.com,com,97543,225532,aol.com,com,70,49,97328,224491,MODIFIED
```
- The `--primary-key` flag takes an integer array. Specify comma-separated positions if the table has a compound key. Using this primary key, csvdiff can figure out modifications; if the primary key itself changes, the row is treated as an addition.

```bash
% csvdiff base.csv delta.csv --primary-key 0,1
```

- If you want to compare only a few columns of the CSV when computing the hash, use `--columns`:

```bash
% csvdiff base.csv delta.csv --primary-key 0,1 --columns 2
```
- Supports JSON output format for post-processing:

```bash
% csvdiff examples/base-small.csv examples/delta-small.csv --format json
{
  "Additions": [
    "24564,907,completely-newsite.com,com,19827,32902,completely-newsite.com,com,1621,909,19787,32822"
  ],
  "Modifications": [
    "69,1048,aol.com,com,97543,225532,aol.com,com,70,49,97328,224491"
  ]
}
```
## Build locally

```bash
$ git clone https://github.com/aswinkarthik93/csvdiff
$ go get ./...
$ go build

# To run tests
$ go get github.com/stretchr/testify/assert
$ go test -v ./...
```
## Algorithm

- Creates a map of `<uint64, uint64>` for both the base and the delta file:
  - the *key* is a hash of the primary-key values as CSV
  - the *value* is a hash of the entire row
- These two maps are the output of the initial processing.
- The delta map is then compared with the base map. As long as the primary key is unchanged, the row has the same *key*. An entry in the delta map is:
  - an addition, if the base map has no *value* for that key
  - a modification, if the base map's *value* is different
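The comparison step above can be sketched in Go. This is an illustration of the algorithm, not the project's actual code, and it assumes the two `map[uint64]uint64` maps have already been built.

```go
package main

import "fmt"

// diff compares the delta map against the base map. Keys are hashes of
// the primary-key columns; values are hashes of the whole row. A key
// absent from base is an addition; a key whose row hash differs is a
// modification.
func diff(base, delta map[uint64]uint64) (additions, modifications []uint64) {
	for key, rowHash := range delta {
		baseHash, ok := base[key]
		switch {
		case !ok:
			additions = append(additions, key)
		case baseHash != rowHash:
			modifications = append(modifications, key)
		}
	}
	return
}

func main() {
	base := map[uint64]uint64{1: 100, 2: 200}
	delta := map[uint64]uint64{1: 100, 2: 201, 3: 300}
	adds, mods := diff(base, delta)
	// Key 3 is new (addition); key 2 has a changed row hash (modification).
	fmt.Println(adds, mods)
}
```

Note that a single pass over the delta map suffices, which is why unchanged rows (same key, same row hash) cost almost nothing.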
## Credits

- Uses the 64-bit xxHash algorithm, an extremely fast non-cryptographic hash, for computing the hashes. Implementation from cespare.
- Uses the Majestic Million data for the demo.

Benchmark tests can be found here.