jksrcなmatrix scanツール: dnaMotifFind

% dnaMotifFind
dnaMotifFind - Locate preexisting motifs in DNA sequence
usage:
   dnaMotifFind motifFile sequence.fa output.tab
options:
   -markov=level  Level of Markov background model - 0 1 or 2
   -background=seq.fa  Sequence to use for background model
   -threshold=N  significance threshold (ln based, default 8.0)
   -rc Include reverse complement

例によって、ファイルはオレオレかつ変態的なわけで

% cat ~/tmp/kent/src/hg/lib/dnaMotif.as
table dnaMotif
"A gapless DNA motif"
    (
    string name;                        "Motif name."
    int columnCount;                    "Count of columns in motif."
    float[columnCount] aProb;           "Probability of A's in each column."
    float[columnCount] cProb;           "Probability of C's in each column."
    float[columnCount] gProb;           "Probability of G's in each column."
    float[columnCount] tProb;           "Probability of T's in each column."
    )

つまり、こんな感じでmatrixを扱う

% cat ~/tmp/kent/src/hg/geneBounds/dnaMotifFind/fkh1.motif 
FKH1    8       0.115000,0.037000,0.165000,0.017000,0.005000,0.019000,0.754000,0.051000,        0.085000,0.024000,0.017000,0.038000,0.012000,0.021000,0.068000,0.599000,        0.051000,0.025000,0.775000,0.021000,0.021000,0.042000,0.080000,0.035000,        0.750000,0.914000,0.042000,0.924000,0.962000,0.918000,0.098000,0.315000,

う〜ん、変態的。

で、そんなヒトとつきあうためのコード。

#!/usr/bin/env ruby 
def pfm2motif(infile)
  ### read
  buf = []
  open(infile) do |ifh|
    ifh.each{|line| buf << line.strip.split(/\s+/)}  
  end

  ### column check
  columnCount = buf.collect{|row| row.length}.uniq
  raise "Inconsistent column number" unless columnCount.length == 1

  ### format
  name = File.basename(infile,".pfm")
  buf = buf.collect{|row| row.collect{|cell| cell.to_f/100}.join(",")+","}
  return [name, columnCount.first, buf].flatten.join("\t")
end

infiles = ARGV
infiles.each do |infile|
  puts pfm2motif(infile)
end