Saturday, 23 June 2012

Extracting metadata using Apache Tika

I'm looking at using Apache Tika to pull metadata from photos and MP3 files. It ships as a single jar that can run as a command-line tool, a GUI or an HTTP server.

MP3


java -jar ~andy/tika-app-1.1.jar 12\ Time\ Of\ The\ Season.mp3
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpDM:audioCompressor" content="MP3"/>
<meta name="xmpDM:releaseDate" content="1968"/>
<meta name="Content-Length" content="5632909"/>
<meta name="xmpDM:album" content="Odessey &amp; Oracle"/>
<meta name="xmpDM:artist" content="Zombies"/>
<meta name="Author" content="Zombies"/>
<meta name="xmpDM:genre" content=""/>
<meta name="xmpDM:logComment" content=""/>
<meta name="Content-Type" content="audio/mpeg"/>
<meta name="resourceName" content="12 Time Of The Season.mp3"/>
<title>Time Of The Season</title>
</head>
<body><h1>Time Of The Season</h1>
<p>Zombies</p>
<p>Odessey &amp; Oracle, track 12</p>
<p>1968</p>
</body></html>
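
Since the output is XHTML, the meta tags are easy to pick up from a script. Here is a rough sketch in Python, not something from Tika itself: `run_tika` and `parse_tika_meta` are names I made up, the jar path is whatever you pass in, and it assumes `java` is on the PATH and that Tika prints a complete, well-formed document.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import defaultdict

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def run_tika(jar_path, file_path):
    """Run the Tika app jar on one file and return its XHTML output as bytes.

    Hypothetical helper: jar_path is an assumption, e.g. a local tika-app-1.1.jar.
    """
    result = subprocess.run(
        ["java", "-jar", jar_path, file_path],
        capture_output=True,
        check=True,
    )
    return result.stdout

def parse_tika_meta(xhtml):
    """Collect <meta name=... content=...> pairs into a dict of lists.

    Lists are used because a name can repeat in the output.
    """
    if isinstance(xhtml, str):
        xhtml = xhtml.encode("utf-8")
    root = ET.fromstring(xhtml)
    meta = defaultdict(list)
    for tag in root.iter(XHTML_NS + "meta"):
        meta[tag.get("name")].append(tag.get("content"))
    return dict(meta)

# A cut-down copy of the MP3 output above, used as sample input.
sample = """<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpDM:album" content="Odessey &amp; Oracle"/>
<meta name="xmpDM:artist" content="Zombies"/>
<meta name="Content-Type" content="audio/mpeg"/>
<title>Time Of The Season</title>
</head>
<body><h1>Time Of The Season</h1></body></html>"""

meta = parse_tika_meta(sample)
print(meta["xmpDM:artist"][0], "-", meta["xmpDM:album"][0])
```

With a real file it would be something like `parse_tika_meta(run_tika("tika-app-1.1.jar", "song.mp3"))`, where both arguments are placeholders.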

JPEG

java -jar ~andy/tika-app-1.1.jar skating.jpg
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Software" content="Picasa"/>
<meta name="GPS Altitude Ref" content="Sea level"/>
<meta name="subject" content="greenwich"/>
<meta name="subject" content="ice"/>
<meta name="subject" content="skate"/>
<meta name="Content-Length" content="96154"/>
<meta name="Exif Version" content="2.20"/>
<meta name="date" content="2012-06-23T20:10:18"/>
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 2 horiz/2 vert"/>
<meta name="tiff:ImageLength" content="768"/>
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="GPS Latitude" content="51&quot;28'58.75122"/>
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="GPS Latitude Ref" content="N"/>
<meta name="description" content="solange, ann and oliver ice skating collage"/>
<meta name="tiff:ImageWidth" content="1024"/>
<meta name="Image Width" content="1024 pixels"/>
<meta name="resourceName" content="skating.jpg"/>
<meta name="Keywords" content="greenwich"/>
<meta name="Keywords" content="ice"/>
<meta name="Keywords" content="skate"/>
<meta name="GPS Longitude Ref" content="W"/>
<meta name="GPS Longitude" content="0&quot;0'25.3368"/>
<meta name="Caption/Abstract" content="solange, ann and oliver ice skating collage"/>
<meta name="tiff:Software" content="Picasa"/>
<meta name="Number of Components" content="3"/>
<meta name="Image Height" content="768 pixels"/>
<meta name="Data Precision" content="8 bits"/>
<meta name="tiff:BitsPerSample" content="8"/>
<meta name="geo:lat" content="51.48299"/>
<meta name="Last-Modified" content="2012-06-23T20:10:18"/>
<meta name="Unknown tag (0xa420)" content="45925b38462f517f73a2bab9760ae5ac"/>
<meta name="Date/Time" content="2012:06:23 20:10:18"/>
<meta name="Directory Version" content="4"/>
<meta name="geo:long" content="-0.00704"/>
<meta name="GPS Version ID" content="2 2 0 0"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
Looking good.
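
The JPEG output carries the position twice: as degrees, minutes and seconds in GPS Latitude / GPS Longitude, and as decimal degrees in geo:lat / geo:long. As a quick sanity check that the two agree, here is a small Python sketch. `dms_to_decimal` is a made-up name, and the pattern assumes the exact `degrees"minutes'seconds` layout shown above (after XML unescaping), which may differ in other Tika versions.

```python
import re

def dms_to_decimal(dms, ref):
    """Convert a string like 51"28'58.75122 plus a N/S/E/W ref to decimal degrees."""
    match = re.fullmatch(r"(\d+)\"(\d+)'([\d.]+)", dms)
    if match is None:
        raise ValueError("unexpected coordinate format: " + repr(dms))
    degrees, minutes, seconds = (float(part) for part in match.groups())
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    return -value if ref in ("S", "W") else value

lat = dms_to_decimal("51\"28'58.75122", "N")
lon = dms_to_decimal("0\"0'25.3368", "W")
print(round(lat, 5), round(lon, 5))
```

Rounded to five places these come out as 51.48299 and -0.00704, matching the geo:lat and geo:long values in the output.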
