semantic-csv

0.1.0-alpha4


A Clojure library with higher-level CSV parsing functionality

dependencies

org.clojure/clojure
1.5.1
clojure-csv/clojure-csv
2.0.1




Higher-level CSV parsing/processing functionality

The two most popular CSV parsing libraries for Clojure at present, clojure/data.csv and clojure-csv, concern themselves only with the syntax of CSV: they take CSV text, transform it into a collection of vectors of string values, and that's it. Semantic CSV takes the next step by giving you tools for addressing the semantics of your data, helping you put it in a form that better reflects what it represents.

Features

  • Absorb header row as a vector of column names, and return remaining rows as maps of column-name -> row-val
  • Write from a collection of maps, given a header
  • Apply casting/formatting functions by column name, while reading or writing
  • Numeric casting function helpers
  • Remove commented-out lines (by default, those starting with #)
  • Compatible with any CSV parsing library returning/writing a sequence of row vectors
  • (SOON) A "sniffer" that reads in N lines, and uses them to guess column types

Structure

Semantic CSV is structured around a number of composable processing functions for transforming data as it comes out of or goes into a CSV file. This leaves room for you to use whatever parsing/formatting tools you like, reflecting a nice decoupling of grammar and semantics. However, a couple of convenience functions are also provided which wrap these individual steps in an opinionated but customizable manner, helping you move quickly while prototyping or working at the REPL.


Core API namespace

(ns semantic-csv.core
  (:require [clojure.java.io :as io]
            [clojure.string]
            [clojure-csv.core :as csv]
            [semantic-csv.impl.core :as impl :refer [?>>]]))

To start, require this namespace, clojure.java.io, and your favorite CSV parser (e.g., clojure-csv or clojure/data.csv; we'll mostly be using the former).

(require '[semantic-csv.core :refer :all]
         '[clojure-csv.core :as csv]
         '[clojure.java.io :as io])

Now let's take a tour through some of the processing functions we have available, starting with the input processing functions.


Input processing functions

Note that all of these processing functions leave the rows collection as the final argument. This is to make these functions interoperable with other standard collection processing functions (map, filter, take, etc.) in the context of the ->> threading macro.
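
For instance, they interleave cleanly with ordinary sequence functions (a quick sketch over literal rows, using mappify, which is introduced just below):

=> (->> [["this" "that"]
         ["1" "2"]
         ["3" "4"]]
        mappify
        (filter #(= "1" (:this %)))
        (map :that))
("2")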

Let's start with what may be the most basic and frequently needed function:


mappify

Takes a sequence of row vectors, as commonly produced by CSV parsing libraries, and returns a sequence of maps. By default, the first row vector will be interpreted as a header and used as the keys for the maps. However, this and other behaviour can be customized via an optional opts map with the following options:

  • :keyify - bool; specify whether header/column names should be turned into keywords (default: true).
  • :header - specify the header to use for map keys, preventing the first row of data from being consumed as a header.
  • :structs - bool; use structs instead of hash-maps or array-maps, for performance boost (default: false).
(defn mappify
  ([rows]
   (mappify {} rows))
  ([{:keys [keyify header structs] :or {keyify true} :as opts}
    rows]
   (let [consume-header (not header)
         header (if header
                  header
                  (first rows))
         header (if keyify (mapv keyword header) header)
         map-fn (if structs
                  (let [s (apply create-struct header)]
                    (partial apply struct s))
                  (partial impl/mappify-row header))]
     (map map-fn
          (if consume-header
            (rest rows)
            rows)))))

Here's an example to whet our whistle:

=> (with-open [in-file (io/reader "test/test.csv")]
     (->>
       (csv/parse-csv in-file)
       mappify
       doall))

({:this "# some comment lines..."}
 {:this "1", :that "2", :more "stuff"}
 {:this "2", :that "3", :more "other yeah"})


Note that "# some comment lines..." was really intended to be left out as a comment. We can address this with the following function:


remove-comments

Removes rows which start with a comment character (by default, #). Operates by checking whether the first item of each row matches a comment pattern; empty rows are passed through untouched. Options include:

  • :comment-re - Specify a custom regular expression for determining which lines are commented out.
  • :comment-char - Checks for lines starting with this character; takes precedence over :comment-re.

    Note: this function only works with rows that are vectors, and so should always be used before mappify.

(defn remove-comments
  ([rows]
   (remove-comments {:comment-re #"^\#"} rows))
  ([{:keys [comment-re comment-char]} rows]
   (let [commented? (if comment-char
                      #(= comment-char (first %))
                      (partial re-find comment-re))]
     (remove
       (fn [row]
         (let [x (first row)]
           (when x
             (commented? x))))
       rows))))

Let's see this in action with the same data we looked at in the last example:

=> (with-open [in-file (io/reader "test/test.csv")]
     (->>
       (csv/parse-csv in-file)
       remove-comments
       mappify
       doall))

({:this "1", :that "2", :more "stuff"}
 {:this "2", :that "3", :more "other yeah"})

Much better :-)
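
If your file uses a different comment character, :comment-char has you covered (a small sketch with hypothetical %-style comments):

=> (remove-comments {:comment-char \%}
                    [["% a comment"] ["1" "2"]])
(["1" "2"])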


Next, let's observe that :this and :that point to strings, while they should really be pointing to numeric values. This can be addressed with the following function:


cast-with

Casts the vals of each row according to cast-fns, which must be either a map of column-name -> casting-fn or a single casting function to be applied to all columns. Additionally, an opts map can be used to specify:

  • :except-first - Leaves the first row unaltered; useful for preserving header row.
  • :exception-handler - If a cast-fn raises an exception, this function will be called with args colname and value, and its result used as the cast value.
  • :only - Only cast the specified column(s); can be either a single column name, or a vector of them.
(defn cast-with
  ([cast-fns rows]
   (cast-with cast-fns {} rows))
  ([cast-fns {:keys [except-first exception-handler only] :as opts} rows]
   (->> rows
        (?>> except-first (drop 1))
        (map #(impl/cast-row cast-fns % :only only :exception-handler exception-handler))
        (?>> except-first (cons (first rows))))))

Let's try casting a numeric column using this function:

=> (with-open [in-file (io/reader "test/test.csv")]
     (->>
       (csv/parse-csv in-file)
       remove-comments
       mappify
       (cast-with {:this #(Integer/parseInt %)})
       doall))

({:this 1, :that "2", :more "stuff"}
 {:this 2, :that "3", :more "other yeah"})

Alternatively, if we want to cast multiple columns using a single function, we can do so by passing a single casting function as the first argument, restricting which columns it applies to with the :only option.

=> (with-open [in-file (io/reader "test/test.csv")]
     (->>
       (csv/parse-csv in-file)
       remove-comments
       mappify
       (cast-with #(Integer/parseInt %) {:only [:this :that]})
       doall))

({:this 1, :that 2, :more "stuff"}
 {:this 2, :that 3, :more "other yeah"})

Note that this function handles either map or vector rows. In particular, if you've imported data without consuming a header (by not using mappify, or by passing :mappify false to process or slurp-csv below), then the columns can be keyed by their zero-based index. For instance, (cast-with {0 #(Integer/parseInt %) 1 #(Double/parseDouble %)} rows) will parse the first column as integers and the second as doubles.
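
As a sketch of the :exception-handler option, here we default unparseable values to nil (the handler and rows are made up):

=> (cast-with {:this #(Integer/parseInt %)}
              {:exception-handler (fn [colname value] nil)}
              [{:this "1"} {:this "not a number"}])
({:this 1} {:this nil})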


It's worth pointing out that this function isn't strictly an input processing function; it could just as well be used for intermediate or output preparation processing. Similarly, the following macro provides functionality which could be useful at multiple points of a pipeline.


except-first

Takes any number of forms and a final data argument. Threads the data through the forms, as though with ->>, except that the first item in data remains unaltered. This is intended to operate within the context of an actual ->> threading macro for processing, where you might want to leave a header row unmodified by your processing functions.

(defmacro except-first
  [& forms-and-data]
  (let [data (last forms-and-data)
        forms (butlast forms-and-data)]
    `((fn [rows#]
        (let [first-row# (first rows#)
              rest-rows# (rest rows#)]
          (cons first-row# (->> rest-rows# ~@forms))))
      ~data)))

This macro generalizes the :except-first option of the cast-with function for arbitrary processing, operating on every row except for the first. For example:

=> (->> [["a" "b" "c"] [1 2 3] [4 5 6]]
        (except-first (cast-with inc)
                      (cast-with #(/ % 2))))
(["a" "b" "c"] [1 3/2 2] [5/2 3 7/2])

This could be useful if you know you want to do some processing on all non-header rows, but don't really need to know which columns are which to do this, and still want to keep the header row.


Hopefully you see the value of these functions being separate, composable units. However, sometimes this modularity isn't necessary, and you want something quick and dirty to get the job done, particularly at the REPL. To meet this need, the following convenience functions are provided:


process

This function wraps together the most frequently used input processing capabilities, controlled by an opts hash with opinionated defaults:

  • :mappify - bool; transform rows from vectors into maps using mappify.
  • :keyify - bool; specify whether header/column names should be turned into keywords (default: true).
  • :header - specify the header to be used in mappify; as with mappify, the first row will not be consumed as a header
  • :structs - bool; use structs instead of array-maps or hash-maps in mappify.
  • :remove-comments - bool; remove comment lines, as specified by :comment-re or :comment-char. Defaults to true.
  • :comment-re - specify a regular expression to use for commenting out lines.
  • :comment-char - specify a comment character to use for filtering out comments; overrides comment-re.
  • :cast-fns - optional map of colname | index -> cast-fn, or a single function to be applied to all values; row maps will have the values as output by the assigned cast-fn.
  • :cast-exception-handler - If a cast-fn raises an exception, this function will be called with args colname and value, and its result used as the cast value.
  • :cast-only - Only cast the specified column(s); can be either a single column name, or a vector of them.
(defn process
  ([{:keys [mappify keyify header remove-comments comment-re comment-char structs cast-fns cast-exception-handler cast-only]
     :or   {mappify         true
            keyify          true
            remove-comments true
            comment-re      #"^\#"}
     :as opts}
    rows]
   (->> rows
        (?>> remove-comments (semantic-csv.core/remove-comments {:comment-re comment-re :comment-char comment-char}))
        (?>> mappify (semantic-csv.core/mappify {:keyify keyify :header header :structs structs}))
        (?>> cast-fns (cast-with cast-fns {:exception-handler cast-exception-handler :only cast-only}))))
  ; Use all defaults
  ([rows]
   (process {} rows)))

Using this function, the code we've been building above is reduced to the following:

(with-open [in-file (io/reader "test/test.csv")]
  (doall
    (process {:cast-fns {:this #(Integer/parseInt %)}}
             (csv/parse-csv in-file))))


parse-and-process

This is a convenience function for reading a csv file using clojure-csv and passing it through process with the given set of options (specified last as keyword args, in contrast with our other processing functions). Note that :parser-opts can be specified, and will be passed along to clojure-csv/parse-csv.

(defn parse-and-process
  [csv-readable & {:keys [parser-opts]
                   :or   {parser-opts {}}
                   :as   opts}]
  (let [rest-options (dissoc opts :parser-opts)]
    (process
      rest-options
      (impl/apply-kwargs csv/parse-csv csv-readable parser-opts))))
For example:

(with-open [in-file (io/reader "test/test.csv")]
  (doall
    (parse-and-process in-file
                       :cast-fns {:this #(Integer/parseInt %)})))


slurp-csv

This convenience function lets you parse-and-process csv data given a csv filename. Note that it is not lazy; it must read in all the data so the file handle can be closed.

(defn slurp-csv
  [csv-filename & {:as opts}]
  (with-open [in-file (io/reader csv-filename)]
    (doall
      (impl/apply-kwargs parse-and-process in-file opts))))

For the ultimate in programmer laziness:

(slurp-csv "test/test.csv"
           :cast-fns {:this #(Integer/parseInt %)})


Some casting functions for your convenience

These functions can be imported and used in your :cast-fns specification. They focus on handling some of the mess of dealing with numeric casting.

->int

Translate to int from string or other numeric. If the string represents a non-integer value, it will be parsed as a double and truncated towards zero.

(defn ->int
  [v]
  (if (string? v)
    (-> v clojure.string/trim Double/parseDouble int)
    (int v)))

->long

Translate to long from string or other numeric. If the string represents a non-integer value, it will be parsed as a double and truncated towards zero.

(defn ->long
  [v]
  (if (string? v)
    (-> v clojure.string/trim Double/parseDouble long)
    (long v)))

->float

Translate to float from string or other numeric.

(defn ->float
  [v]
  (if (string? v)
    (-> v clojure.string/trim Float/parseFloat)
    (float v)))

->double

Translate to double from string or other numeric.

(defn ->double
  [v]
  (if (string? v)
    (-> v clojure.string/trim Double/parseDouble)
    (double v)))
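
To get a feel for these, here's a quick REPL sketch (the input values are made up):

=> (->int " 3.14 ")
3
=> (->int 7.9)
7
=> (->double "0.5")
0.5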
(slurp-csv "test/test.csv"
           :cast-fns {:this ->int})

Note these functions place a higher emphasis on flexibility and convenience than performance, as you can likely see from their implementations. If maximum performance is a concern for you, and your data is fairly regular, you may be able to get away with less robust functions, which shouldn't be hard to implement yourself. For most cases though, the performance of those provided here should be fine.


Output processing functions

As with the input processing functions, the output processing functions are designed to be small, composable pieces which help you push your data through to a third party writer. And as with the input processing functions, higher level, opinionated, but configurable functions are offered which automate some of this for you.

We've already looked at cast-with, which can be as useful in output processing as in input processing. Another important function we'll need is one that takes a sequence of maps and turns it into a sequence of vectors, since this is what most CSV writing/formatting libraries will want.


vectorize

Take a sequence of maps, and transform them into a sequence of vectors. Options:

  • :header - The header to be used. If not specified, this defaults to (-> rows first keys). Only values corresponding to the specified header will be included in the output, and will be included in the order corresponding to this argument.
  • :prepend-header - Defaults to true, and controls whether the :header vector should be prepended to the output sequence.
  • :format-header - If specified, this function will be called on each element of the :header vector, and the result prepended to the output sequence. The default behaviour is to leave strings alone but stringify keyword names such that the : is removed from their string representation. Passing a falsey value will leave the header unaltered in the output.
(defn vectorize
  ([rows]
   (vectorize {} rows))
  ([{:keys [header prepend-header format-header]
     :or {prepend-header true format-header impl/stringify-keyword}}
    rows]
   ;; Grab the specified header, or the keys from the first row. We'll
   ;; use these to `get` the appropriate values for each row.
   (let [header     (or header (-> rows first keys))
         ;; This will be the formatted version we prepend if desired
         out-header (if format-header (mapv format-header header) header)]
     (->> rows
          (map
            (fn [row] (mapv (partial get row) header)))
          (?>> prepend-header (cons out-header))))))

Let's see this in action:

=> (let [data [{:this "a" :that "b"}
               {:this "x" :that "y"}]]
     (vectorize data))
(["this" "that"]
 ["a" "b"]
 ["x" "y"])

With some options:

=> (let [data [{:this "a" :that "b"}
               {:this "x" :that "y"}]]
     (vectorize {:header [:that :this]
                 :prepend-header false}
                data))
(["b" "a"]
 ["y" "x"])


batch

Takes a sequence of items and returns a sequence of batches of items from the original sequence, each at most n long.

(defn batch
  [n rows]
  (partition n n [] rows))

This function can be useful when writing lazily with clojure-csv. The clojure-csv.core/write-csv function does not actually write to a file; it just formats the data you pass in as a CSV string. If you're working with a lot of data, calling it on the whole dataset would build a single massive string of the results, and likely crash. To write lazily, you have to take some number of rows, write them, and repeat till you're done. Our batch function helps by giving you a lazy sequence of batches of n rows at a time, letting you pass that through to something that writes off the chunks lazily.
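
For instance:

=> (batch 2 [["1" "2"] ["3" "4"] ["5" "6"] ["7" "8"] ["9" "10"]])
((["1" "2"] ["3" "4"]) (["5" "6"] ["7" "8"]) (["9" "10"]))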


And as promised, a function for doing all the dirty work for you:


spit-csv

Convenience function for spitting out CSV data to a file using clojure-csv.

  • file - Can be either a filename string, or a file handle.
  • opts - Optional hash of settings.
  • rows - Can be a sequence of either maps or vectors; if the former, vectorize will be called on the input with :header argument specifiable through opts.

    The options hash can have the following mappings:

  • :batch-size - The number of rows to format and write at a time.

  • :cast-fns - Formatter(s) to be run on row values. As with the cast-with function, this can be either a map of column-name -> cast-fn, or a single function to be applied to all values. Note that str is called on all values just before writing, regardless of :cast-fns.
  • :writer-opts - Options hash to be passed along to clojure-csv.core/write-csv.
  • :header - Header to be passed along to vectorize, if necessary.
  • :prepend-header - Should the header be prepended to the rows written if vectorize is called?
(defn spit-csv
  ([file rows]
   (spit-csv file {} rows))
  ([file
    {:keys [batch-size cast-fns writer-opts header prepend-header]
     :or   {batch-size 20 prepend-header true}
     :as   opts}
    rows]
   (if (string? file)
     (with-open [file-handle (io/writer file)]
       (spit-csv file-handle opts rows))
     ; Else assume we already have a file handle
     (->> rows
          (?>> cast-fns (cast-with cast-fns))
          (?>> (-> rows first map?)
               (vectorize {:header header
                           :prepend-header prepend-header}))
          ; For safe measure
          (cast-with str)
          (batch batch-size)
          (map #(impl/apply-kwargs csv/write-csv % writer-opts))
          (reduce
            (fn [w rowstr]
              (.write w rowstr)
              w)
            file)))))

Note that since we use clojure-csv here, we offer a :batch-size option that lets you format and write small batches of rows out at a time, to avoid constructing a massive string representation of all the data in the case of bigger data sets.
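
Putting it to use might look like this (a minimal sketch; the filename and rows are made up):

(spit-csv "test-out.csv"
          {:batch-size 100}
          [{:this 1 :that 2.0}
           {:this 4 :that 5.5}])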


One last example showing everything together

Let's see how Semantic CSV works in the context of a little data pipeline. We're going to thread data in, transform it into maps, run some computations on each row and assoc in the results, then write the modified data out to a file, all lazily.

First let's show this with clojure/data.csv, which I find a little easier to use for writing:

(require '[clojure.data.csv :as cd-csv])

(with-open [in-file (io/reader "test/test.csv")
            out-file (io/writer "test-out.csv")]
  (->>
    (cd-csv/read-csv in-file)
    remove-comments
    mappify
    (cast-with {:this ->int :that ->float})
    ;; Do your processing...
    (map
      (fn [row]
        (assoc row :jazz (* (:this row)
                            (:that row)))))
    vectorize
    (cd-csv/write-csv out-file)))

Now let's see what this looks like with clojure-csv:

As mentioned above, clojure-csv doesn't handle file writing for you, just formatting. So, to maintain laziness, you'll have to add a couple of steps to the end. Additionally, in contrast with clojure/data.csv, it doesn't accept rows containing anything other than strings, so we'll have to account for this as well.

(with-open [in-file (io/reader "test/test.csv")
            out-file (io/writer "test-out.csv")]
  (->>
    (csv/parse-csv in-file)
    ...
    (cast-with str)
    (batch 1)
    (map csv/write-csv)
    (reduce
      (fn [w row]
        (.write w row)
        w)
      out-file)))


And there you have it. A simple, composable, and easy to use library for taking you the extra mile with CSV processing in Clojure.


That's it for the core API

Hope you find this library useful. If you have questions or comments please either submit an issue or join us in the dedicated chat room.

 

This namespace consists of implementation details for the main API

(ns semantic-csv.impl.core)

Translates a single row of values into a map of colname -> val, given colnames in header.

(defn mappify-row
  [header row]
  (into {} (map vector header row)))
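
For instance:

=> (mappify-row [:this :that] ["1" "2"])
{:this "1", :that "2"}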

Utility that takes a function f, any number of regular args, and a final map argument whose key-value pairs will be splatted in as trailing keyword arguments.

(defn apply-kwargs
  [f & args]
  (apply
    (apply partial
           f
           (butlast args))
    (apply concat (last args))))
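
To illustrate, here's a sketch with a hypothetical greet function that takes keyword arguments:

(defn greet
  [name & {:keys [greeting] :or {greeting "Hello"}}]
  (str greeting ", " name))

=> (apply-kwargs greet "world" {:greeting "Hi"})
"Hi, world"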

Leaves strings alone. Turns keywords into the stringified version of the keyword, sans the initial : character. On anything else, calls str.

(defn stringify-keyword
  [x]
  (cond
    (string? x)   x
    (keyword? x)  (->> x str (drop 1) (apply str))
    :else         (str x)))
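
For instance:

=> (stringify-keyword :this)
"this"
=> (stringify-keyword "that")
"that"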

Returns a function that casts a single row value based on the specified casting function and, optionally, an exception handler.

(defn row-val-caster
  [cast-fns exception-handler]
  (fn [row col]
    (let [cast-fn (if (map? cast-fns) (cast-fns col) cast-fns)]
      (try
        (update-in row [col] cast-fn)
        (catch Exception e
          (update-in row [col] (partial exception-handler col)))))))

Format the values of row with the given function. This gives us some flexibility with respect to formatting both vectors and maps in a similar fashion.

(defn cast-row
  [cast-fns row & {:keys [only exception-handler]}]
  (let [cols (cond
               ; If only is specified, just use that
               only
                 (flatten [only])
               ; If cast-fns is a map, use those keys
               (map? cast-fns)
                 (keys cast-fns)
               ; Then assume cast-fns is single fn, and fork on row type
               (map? row)
                 (keys row)
               :else
                 (range (count row)))]
    (reduce (row-val-caster cast-fns exception-handler) row cols)))

The following is ripped off from prismatic/plumbing:

Conditional double-arrow operation (->> nums (?>> inc-all? (map inc)))

(defmacro ?>>
  [do-it? & args]
  `(if ~do-it?
     (->> ~(last args) ~@(butlast args))
     ~(last args)))

We include it here in lieu of depending on the full library, due to dependency conflicts with other libraries.
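
To illustrate (a quick REPL sketch):

=> (->> [1 2 3] (?>> true (map inc)))
(2 3 4)
=> (->> [1 2 3] (?>> false (map inc)))
[1 2 3]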