
Office To Md V2

Convert PDF, DOC, DOCX, and PPTX office documents to Markdown, supporting legacy .doc files with text extraction and basic formatting preservation.

Version 0.1.0


Skill Description

Office to Markdown Converter Skill (v2)

Description

Convert office documents (PDF, DOC, DOCX, PPTX) to Markdown format. This skill uses the word-extractor library for .doc support and provides full OpenClaw integration.

When to Use

When you need to extract text from office documents
When you want to convert documents to readable Markdown format
When analyzing document content in OpenClaw
Especially when working with legacy .doc files

Supported Formats

PDF (.pdf): Text extraction using pdf-parse
Word (.docx): Formatting preservation using mammoth + turndown
Legacy Word (.doc): Text extraction using word-extractor (supports Chinese encoding)
PowerPoint (.pptx): Basic text extraction using python-pptx
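
The format list above amounts to a dispatch from file extension to extraction library. The following is a minimal sketch of that mapping, not the skill's actual internals: the `extractors` table and the `pickExtractor` helper are hypothetical names, and the `pdf-parse` call assumes the 1.x function-style API. Libraries are loaded lazily, so only the backend that is actually used needs to be installed. PPTX is left out because it goes through Python.

```javascript
const path = require('node:path');
const fs = require('node:fs/promises');

// Hypothetical dispatch table: extension -> async function returning text or Markdown.
const extractors = {
  '.pdf': async (file) => {
    const pdfParse = require('pdf-parse'); // assumption: pdf-parse 1.x call style
    const data = await pdfParse(await fs.readFile(file));
    return data.text;
  },
  '.docx': async (file) => {
    const mammoth = require('mammoth');
    const TurndownService = require('turndown');
    // mammoth produces HTML; turndown converts that HTML to Markdown.
    const { value: html } = await mammoth.convertToHtml({ path: file });
    return new TurndownService().turndown(html);
  },
  '.doc': async (file) => {
    const WordExtractor = require('word-extractor');
    const doc = await new WordExtractor().extract(file);
    return doc.getBody(); // plain text only, no formatting
  },
};

// Returns the extractor for a path, or null when the extension is not handled here.
function pickExtractor(filePath) {
  return extractors[path.extname(filePath).toLowerCase()] ?? null;
}
```

A caller would check for `null` before awaiting the returned function, for example `const extract = pickExtractor(file); if (extract) console.log(await extract(file));` inside an async function.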

Dependencies

Node.js with npm packages: pdf-parse, mammoth, turndown, word-extractor
Python3 with python-pptx (for PPTX conversion, optional)
OpenClaw exec tool permission
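
Before the first conversion it can help to confirm that the npm packages listed above are resolvable. This is a small sketch using Node's built-in `require.resolve`; the `missingPackages` helper is a hypothetical name and is not part of the skill.

```javascript
// Returns the subset of `names` that Node cannot resolve from this location.
function missingPackages(names) {
  return names.filter((name) => {
    try {
      require.resolve(name);
      return false;
    } catch (err) {
      return true; // any resolution error is treated as "not installed"
    }
  });
}

const required = ['pdf-parse', 'mammoth', 'turndown', 'word-extractor'];
const missing = missingPackages(required);
if (missing.length > 0) {
  console.warn(`Missing npm packages: ${missing.join(', ')} (run npm install)`);
}
```

The check only covers the Node.js side; python-pptx would need a separate check on the Python side.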

Installation

1. Copy the skill to your workspace:

cp -r /root/.openclaw/workspace/office-to-md-v2/office-to-md /path/to/your/workspace/

2. Install dependencies:

cd /path/to/your/workspace/office-to-md
npm install

3. For PPTX support (optional):

pip3 install python-pptx

Usage in OpenClaw

Method 1: Direct exec call

// Convert any supported document
const result = await exec(
  'node /path/to/office-to-md/openclaw-skill.js /path/to/document.doc',
  { workdir: '/path/to/workspace', timeout: 60000 }
);

if (result.exitCode === 0) {
  console.log('✅ Document converted successfully');
  // Output file: /path/to/document.md
} else {
  console.error('❌ Conversion failed:', result.stderr);
}

Method 2: Using the wrapper function

// Import the converter
const { convertOfficeToMarkdown } = require('/path/to/office-to-md/openclaw-skill.js');

// Convert document
const conversionResult = await convertOfficeToMarkdown('/path/to/document.pdf');
if (conversionResult.success) {
  console.log(`Output: ${conversionResult.outputPath}`);
  console.log(`Preview: ${conversionResult.preview}`);
} else {
  console.error(`Error: ${conversionResult.error}`);
}

Method 3: Complete OpenClaw integration function

async function convertDocumentToMarkdown(filePath) {
  // Validate file exists
  try {
    await read(filePath);
  } catch (error) {
    return { success: false, error: `File not found: ${filePath}` };
  }
  
  // Check file extension (case-insensitive)
  const lowerPath = filePath.toLowerCase();
  const supported = ['.pdf', '.doc', '.docx', '.pptx'];
  if (!supported.some(s => lowerPath.endsWith(s))) {
    return { 
      success: false, 
      error: `Unsupported file type. Supported: ${supported.join(', ')}` 
    };
  }
  
  // Convert using the skill
  const cmd = `node /path/to/office-to-md/openclaw-skill.js "${filePath}"`;
  const result = await exec(cmd, { 
    workdir: '/path/to/workspace',
    timeout: 120000 // 2 minutes for large files
  });
  
  if (result.exitCode === 0) {
    const outputPath = filePath.replace(/\.[^/.]+$/, '.md');
    return {
      success: true,
      outputPath: outputPath,
      message: `Converted to: ${outputPath}`
    };
  } else {
    return {
      success: false,
      error: result.stderr || 'Conversion failed'
    };
  }
}

// Usage example
const result = await convertDocumentToMarkdown('/path/to/document.doc');
if (result.success) {
  const markdown = await read(result.outputPath);
  console.log(markdown.substring(0, 1000));
}
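
The snippets above interpolate the file path into a shell command inside double quotes, which can break on paths containing `"`, `$`, or backticks. A possible hardening is sketched below, assuming the exec tool hands the command string to a POSIX shell (this is an assumption, not something the skill documents). `shellQuote` and `buildConvertCommand` are hypothetical helper names.

```javascript
// Wrap a string in single quotes for a POSIX shell, escaping embedded single quotes.
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// Build the conversion command with both paths quoted.
function buildConvertCommand(scriptPath, filePath) {
  return `node ${shellQuote(scriptPath)} ${shellQuote(filePath)}`;
}

console.log(
  buildConvertCommand('/path/to/office-to-md/openclaw-skill.js', "/docs/Q3 report's draft.doc")
);
```

Single quotes are used because a POSIX shell performs no expansion inside them, so only the quote character itself needs escaping.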

Examples

Example 1: Convert and analyze a document

// Convert a .doc file and analyze its content
const docPath = '/path/to/document.doc';
const convertResult = await exec(
  `node /path/to/office-to-md/openclaw-skill.js "${docPath}"`,
  { workdir: '/path/to/workspace' }
);

if (convertResult.exitCode === 0) {
  const mdPath = docPath.replace(/\.[^/.]+$/, '.md'); // swap only the final extension
  const content = await read(mdPath);
  
  // Analyze the content
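  // Whitespace-delimited tokens; text without spaces (e.g. Chinese) is undercounted
  const wordCount = content.trim().split(/\s+/).filter(Boolean).length;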
  const lines = content.split('\n').length;
  const hasChinese = /[\u4e00-\u9fff]/.test(content);
  
  console.log(`Document analysis:`);
  console.log(`- Word count: ${wordCount}`);
  console.log(`- Lines: ${lines}`);
  console.log(`- Contains Chinese: ${hasChinese}`);
  console.log(`- Preview: ${content.substring(0, 200)}...`);
}
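
Deriving the output path from the input path, as this example does, is easy to get subtly wrong when a directory name contains a dot or the input is a .docx file. A small sketch using Node's `path` module follows; it assumes the converter writes the .md file next to the input, as the comment in Method 1 suggests, and `toMarkdownPath` is a hypothetical helper name.

```javascript
const path = require('node:path');

// Swap only the final extension for .md, leaving directory names untouched.
function toMarkdownPath(filePath) {
  const { dir, name } = path.posix.parse(filePath);
  return path.posix.join(dir, `${name}.md`);
}

console.log(toMarkdownPath('/path/to/document.doc'));
```

`path.posix` is used so the result is the same regardless of the host platform; on Windows paths, `path.win32` would be the counterpart.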

Example 2: Batch conversion

// Convert multiple documents of different formats
const documents = [
  '/path/to/report.pdf',
  '/path/to/legacy.doc',
  '/path/to/modern.docx',
  '/path/to/presentation.pptx'
];

const results = [];
for (const doc of documents) {
  console.log(`Converting ${doc}...`);
  const result = await exec(
    `node /path/to/office-to-md/openclaw-skill.js "${doc}"`,
    { workdir: '/path/to/workspace', timeout: 90000 }
  );
  
  const success = result.exitCode === 0;
  results.push({
    file: doc,
    success: success,
    error: success ? null : result.stderr
  });
  
  console.log(success ? '✅ Success' : '❌ Failed');
}

// Summary
const successful = results.filter(r => r.success).length;
console.log(`\nConversion summary: ${successful}/${results.length} successful`);

API Reference

convertOfficeToMarkdown(filePath)

Returns a Promise that resolves to:

{
  success: boolean,
  outputPath?: string,
  markdown?: string,
  preview?: string,
  fileType?: string,
  message?: string,
  stats?: {
    lines: number,
    characters: number,
    words: number
  },
  error?: string,
  stack?: string
}
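
As an illustration of consuming this result shape, the sketch below turns a result object into a one-line summary. It relies only on the fields listed above; `summarizeResult` is a hypothetical helper, not part of the skill's API.

```javascript
// Turn a result object (shape as documented above) into a one-line summary.
function summarizeResult(result) {
  if (!result.success) {
    return `Failed: ${result.error ?? 'unknown error'}`;
  }
  const parts = [`Converted to ${result.outputPath ?? '(no output path)'}`];
  if (result.fileType) parts.push(`type=${result.fileType}`);
  if (result.stats) {
    const { lines, words, characters } = result.stats;
    parts.push(`${lines} lines, ${words} words, ${characters} characters`);
  }
  return parts.join(' | ');
}
```

Every field except `success` is optional in the documented shape, so the helper checks each one before using it.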

Configuration

Timeout Settings

Small files (<1MB): 30 seconds
Medium files (1-10MB): 60 seconds
Large files (>10MB): 120 seconds
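
The size tiers above can be turned into a timeout value for the exec call. This is a minimal sketch; `pickTimeoutMs` is a hypothetical helper, and treating exactly 10 MB as a medium file is an assumption, since the list does not say which tier the boundary belongs to.

```javascript
const MB = 1024 * 1024;

// Map a file size in bytes to the timeout tiers listed above (in milliseconds).
function pickTimeoutMs(sizeBytes) {
  if (sizeBytes < 1 * MB) return 30_000;    // small: under 1 MB
  if (sizeBytes <= 10 * MB) return 60_000;  // medium: 1 to 10 MB
  return 120_000;                           // large: over 10 MB
}
```

The size itself can be read with `fs.statSync(filePath).size` when the caller has direct file system access.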

Memory Limits

The default Node.js memory limit is sufficient for most documents.

For very large files, you may need to increase memory:

node --max-old-space-size=4096 openclaw-skill.js large-file.doc
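
To confirm that the `--max-old-space-size` flag took effect, the heap ceiling can be read from inside the process with Node's built-in `v8` module. A short sketch follows; `heapLimitMb` is a hypothetical helper name.

```javascript
const v8 = require('node:v8');

// Current V8 heap ceiling in megabytes; it rises when --max-old-space-size is set.
function heapLimitMb() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`Heap limit: ${heapLimitMb()} MB`);
```

The reported value is usually somewhat above the flag's value because it also covers heap spaces other than the old generation.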

Troubleshooting

Common Issues

"File not found"
- Check the file path and permissions
- Use absolute paths for reliability

"Unsupported file type"
- Ensure the file has the correct extension
- Check whether the file is actually in the claimed format

Conversion errors with .doc files
- The file may be corrupted or in an unusual format
- Try opening it in Word and saving it as .docx first

Chinese text appears as gibberish
- word-extractor should handle Chinese encoding automatically
- If issues persist, the file may use an unusual encoding

Timeout errors
- Increase the timeout for large files
- Check system resources
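
For the "is the file actually the claimed format" check, the leading bytes of a file are more reliable than its extension. The sketch below compares a header against three well-known signatures: `%PDF` for PDF, `PK\x03\x04` for ZIP containers (DOCX and PPTX), and the OLE2 compound file header used by legacy .doc. `sniffContainer` is a hypothetical helper and not something the skill provides.

```javascript
// File signatures (magic numbers) for the container formats behind the supported types.
const SIGNATURES = [
  { kind: 'pdf', bytes: Buffer.from('%PDF', 'latin1') },
  { kind: 'zip', bytes: Buffer.from([0x50, 0x4b, 0x03, 0x04]) },
  { kind: 'ole', bytes: Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]) },
];

// Returns 'pdf', 'zip', 'ole', or 'unknown' for the first bytes of a file.
function sniffContainer(header) {
  const match = SIGNATURES.find(({ bytes }) =>
    header.subarray(0, bytes.length).equals(bytes)
  );
  return match ? match.kind : 'unknown';
}
```

A .doc file that sniffs as `zip` is likely a renamed .docx. The result only narrows things down: `zip` and `ole` are shared by other Office formats such as .xlsx and .xls.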

Debug Mode

Enable debug logging by setting the DEBUG environment variable:

DEBUG=office-to-md node openclaw-skill.js document.doc
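
How the skill reads the `DEBUG` variable is not documented here. One common convention, sketched below as an assumption rather than the skill's actual behavior, is to treat `DEBUG` as a comma-separated list of namespaces. `isDebugEnabled` and `debugLog` are hypothetical names.

```javascript
// True when `namespace` (or the wildcard *) appears in the DEBUG list.
function isDebugEnabled(namespace, env = process.env) {
  const names = (env.DEBUG ?? '').split(/[\s,]+/).filter(Boolean);
  return names.includes(namespace) || names.includes('*');
}

// Log to stderr only when debugging is switched on for this skill.
function debugLog(...args) {
  if (isDebugEnabled('office-to-md')) {
    console.error('[office-to-md]', ...args);
  }
}

debugLog('starting conversion');
```

Writing debug output to stderr keeps it separate from anything the converter prints on stdout.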

Performance

PDF: Fast, depends on file size
DOCX: Fast to medium, good formatting preservation
DOC: Medium, requires binary parsing
PPTX: Slow, requires Python and external library

Limitations

Images in documents are not extracted
Complex formatting may not be fully preserved
Tables may convert imperfectly to Markdown
Very old or corrupted .doc files may fail
Password-protected files are not supported

Changelog

v2.0.0 (2026-02-15)

Added full .doc support using word-extractor
Fixed ESM compatibility issues with pptConverter
Added comprehensive OpenClaw integration
Improved Chinese text extraction
Added structured output with statistics

v1.0.0 (Initial)

Basic PDF, DOCX, PPTX support
Simple conversion without .doc support

License

This skill is provided as-is. The underlying libraries have their own licenses:

pdf-parse: MIT
mammoth: BSD-2-Clause
turndown: MIT
word-extractor: MIT
python-pptx: MIT
