跳至主要内容
小龙虾小龙虾AI
🤖

Office To Md V2

Convert PDF, DOC, DOCX, and PPTX office documents to Markdown, supporting legacy .doc files with text extraction and basic formatting preservation.

下载617
星标2
版本0.1.0
效率工具
安全通过
⚙️脚本

技能说明

Office to Markdown Converter Skill (v2)

Description

Convert office documents (PDF, DOC, DOCX, PPTX) to Markdown format. This skill uses the word-extractor library for .doc support and provides full OpenClaw integration.

When to Use

  • When you need to extract text from office documents
  • When you want to convert documents to readable Markdown format
  • When analyzing document content in OpenClaw
  • Specifically when dealing with legacy .doc format files

Supported Formats

  • PDF (.pdf): Text extraction using pdf-parse
  • Word (.docx): Formatting preservation using mammoth + turndown
  • Legacy Word (.doc): Text extraction using word-extractor (supports Chinese encoding)
  • PowerPoint (.pptx): Basic text extraction using python-pptx

Dependencies

  • Node.js with npm packages: pdf-parse, mammoth, turndown, word-extractor
  • Python3 with python-pptx (for PPTX conversion, optional)
  • OpenClaw exec tool permission

Installation

1. Copy the skill to your workspace:

cp -r /root/.openclaw/workspace/office-to-md-v2/office-to-md /path/to/your/workspace/

2. Install dependencies:

cd /path/to/your/workspace/office-to-md
npm install

3. For PPTX support (optional):

pip3 install python-pptx

Usage in OpenClaw

Method 1: Direct exec call

// Convert any supported document
const result = await exec(
  'node /path/to/office-to-md/openclaw-skill.js /path/to/document.doc',
  { workdir: '/path/to/workspace', timeout: 60000 }
);

if (result.exitCode === 0) {
  console.log('✅ Document converted successfully');
  // Output file: /path/to/document.md
} else {
  console.error('❌ Conversion failed:', result.stderr);
}

Method 2: Using the wrapper function

// Import the converter
const { convertOfficeToMarkdown } = require('/path/to/office-to-md/openclaw-skill.js');

// Convert document
const conversionResult = await convertOfficeToMarkdown('/path/to/document.pdf');
if (conversionResult.success) {
  console.log(`Output: ${conversionResult.outputPath}`);
  console.log(`Preview: ${conversionResult.preview}`);
} else {
  console.error(`Error: ${conversionResult.error}`);
}

Method 3: Complete OpenClaw integration function

async function convertDocumentToMarkdown(filePath) {
  // Validate file exists
  try {
    await read(filePath);
  } catch (error) {
    return { success: false, error: `File not found: ${filePath}` };
  }
  
  // Check file extension
  const ext = filePath.toLowerCase().slice(-5);
  const supported = ['.pdf', '.doc', '.docx', '.pptx'];
  if (!supported.some(s => ext.endsWith(s))) {
    return { 
      success: false, 
      error: `Unsupported file type. Supported: ${supported.join(', ')}` 
    };
  }
  
  // Convert using the skill
  const cmd = `node /path/to/office-to-md/openclaw-skill.js "${filePath}"`;
  const result = await exec(cmd, { 
    workdir: '/path/to/workspace',
    timeout: 120000 // 2 minutes for large files
  });
  
  if (result.exitCode === 0) {
    const outputPath = filePath.replace(/\.[^/.]+$/, '.md');
    return {
      success: true,
      outputPath: outputPath,
      message: `Converted to: ${outputPath}`
    };
  } else {
    return {
      success: false,
      error: result.stderr || 'Conversion failed'
    };
  }
}

// Usage example
const result = await convertDocumentToMarkdown('/path/to/document.doc');
if (result.success) {
  const markdown = await read(result.outputPath);
  console.log(markdown.substring(0, 1000));
}

Examples

Example 1: Convert and analyze a document

// Convert a .doc file and analyze its content
const docPath = '/path/to/document.doc';
const convertResult = await exec(
  `node /path/to/office-to-md/openclaw-skill.js "${docPath}"`,
  { workdir: '/path/to/workspace' }
);

if (convertResult.exitCode === 0) {
  const mdPath = docPath.replace('.doc', '.md');
  const content = await read(mdPath);
  
  // Analyze the content
  const wordCount = content.split(/\s+/).length;
  const lines = content.split('\n').length;
  const hasChinese = /[\u4e00-\u9fff]/.test(content);
  
  console.log(`Document analysis:`);
  console.log(`- Word count: ${wordCount}`);
  console.log(`- Lines: ${lines}`);
  console.log(`- Contains Chinese: ${hasChinese}`);
  console.log(`- Preview: ${content.substring(0, 200)}...`);
}

Example 2: Batch conversion

// Convert multiple documents of different formats
const documents = [
  '/path/to/report.pdf',
  '/path/to/legacy.doc',
  '/path/to/modern.docx',
  '/path/to/presentation.pptx'
];

const results = [];
for (const doc of documents) {
  console.log(`Converting ${doc}...`);
  const result = await exec(
    `node /path/to/office-to-md/openclaw-skill.js "${doc}"`,
    { workdir: '/path/to/workspace', timeout: 90000 }
  );
  
  const success = result.exitCode === 0;
  results.push({
    file: doc,
    success: success,
    error: success ? null : result.stderr
  });
  
  console.log(success ? '✅ Success' : '❌ Failed');
}

// Summary
const successful = results.filter(r => r.success).length;
console.log(`\nConversion summary: ${successful}/${results.length} successful`);

API Reference

convertOfficeToMarkdown(filePath)

Returns a Promise that resolves to:

{
  success: boolean,
  outputPath?: string,
  markdown?: string,
  preview?: string,
  fileType?: string,
  message?: string,
  stats?: {
    lines: number,
    characters: number,
    words: number
  },
  error?: string,
  stack?: string
}

Configuration

Timeout Settings

  • Small files (<1MB): 30 seconds
  • Medium files (1-10MB): 60 seconds
  • Large files (>10MB): 120 seconds

Memory Limits

  • Default Node.js memory limit is sufficient for most documents
  • For very large files, you may need to increase memory:
    node --max-old-space-size=4096 openclaw-skill.js large-file.doc
    

Troubleshooting

Common Issues

  1. "File not found"

    • Check file path and permissions
    • Use absolute paths for reliability
  2. "Unsupported file type"

    • Ensure file has correct extension
    • Check if file is actually the claimed format
  3. Conversion errors with .doc files

    • The file may be corrupted or in an unusual format
    • Try opening in Word and saving as .docx first
  4. Chinese text appears as gibberish

    • word-extractor should handle Chinese encoding automatically
    • If issues persist, the file may use unusual encoding
  5. Timeout errors

    • Increase timeout for large files
    • Check system resources

Debug Mode

Enable debug logging by setting environment variable:

DEBUG=office-to-md node openclaw-skill.js document.doc

Performance

  • PDF: Fast, depends on file size
  • DOCX: Fast to medium, good formatting preservation
  • DOC: Medium, requires binary parsing
  • PPTX: Slow, requires Python and external library

Limitations

  • Images in documents are not extracted
  • Complex formatting may not be fully preserved
  • Tables may convert imperfectly to Markdown
  • Very old or corrupted .doc files may fail
  • Password-protected files are not supported

Changelog

v2.0.0 (2026-02-15)

  • Added full .doc support using word-extractor
  • Fixed ESM compatibility issues with pptConverter
  • Added comprehensive OpenClaw integration
  • Improved Chinese text extraction
  • Added structured output with statistics

v1.0.0 (Initial)

  • Basic PDF, DOCX, PPTX support
  • Simple conversion without .doc support

License

This skill is provided as-is. The underlying libraries have their own licenses:

  • pdf-parse: MIT
  • mammoth: BSD-2-Clause
  • turndown: MIT
  • word-extractor: MIT
  • python-pptx: MIT

如何使用「Office To Md V2」?

  1. 打开小龙虾AI(Web 或 iOS App)
  2. 点击上方「立即使用」按钮,或在对话框中输入任务描述
  3. 小龙虾AI 会自动匹配并调用「Office To Md V2」技能完成任务
  4. 结果即时呈现,支持继续对话优化

相关技能